Improved rates for Wasserstein deconvolution with ordinary smooth error in dimension one

This paper deals with the estimation of a probability measure on the real line from data observed with an additive noise. We are interested in rates of convergence for the Wasserstein metric of order $p\geq 1$. The distribution of the errors is assumed to be known and to belong to a class of supersmooth or ordinary smooth distributions. We obtain in the univariate situation an improved upper bound in the ordinary smooth case and less restrictive conditions for the existing bound in the supersmooth one. In the ordinary smooth case, a lower bound is also provided, and numerical experiments illustrating the rates of convergence are presented.


Introduction
Consider the following convolution model: we observe n real-valued random variables Y 1 , . . . , Y n such that where the X i 's are independent and identically distributed according to an unknown probability µ, which we want to estimate. The random variables ε i , i = 1, . . . , n, are independent and identically distributed according to a known probability measure µ ε , not necessarily symmetric. Moreover we assume that (X 1 , . . . , X n ) is independent of (ε 1 , . . . , ε n ). The purpose of the paper is to investigate rates of convergence for the estimation of the measure µ under Wasserstein metrics. For p ∈ [1, ∞), the Wasserstein distance W p between µ and ν is given by where Π(µ, ν) is the set of probability measures on R × R with marginal distributions µ and ν (see [24] or [26]). The distances W p are natural metrics for comparing measures. For instance they can compare two singular measures, which is of course impossible with the functional metrics commonly used in density estimation. Convergence of measure under Wasserstein distances is an active domain of research in probability and statistics. For instance, the rate of convergence of the empirical measure under these metrics has been obtained recently by both [14] and [19] in R d and also by [1] in the one-dimensional framework. Moreover, Wasserstein metrics are involved in many fields of mathematics and computer sciences. For instance, in the field of Topological Data Analysis (TDA) [5], Wasserstein distances recently appeared to be natural metrics for controlling the estimation of geometric and topological features of the sampling measure and its support. Indeed, in [7], a distance function to measures is introduced to solve geometric inference problems in a probabilistic setting: if a known measure ν is close enough with respect to W 2 to a measure µ concentrated on a given shape, then the topological properties of the shape can be recovered by using the distance to ν. More generally, the Wasserstein loss could be used as a guide for inferring the support (see the Cantor experiment in Section 6.4).
Other results in TDA with stability results involving the Wasserstein distances can be found in [20] and [8]. In practice, the data can be observed with noise, which motivates in this framework the study of the Wasserstein deconvolution problem [4], in particular if the deconvolved measure and the "true measure" are singular.
Rates of convergence in deconvolution have mostly been considered in density estimation, for pointwise or global convergence. Minimax rates can be found for instance in [17,2,3] and in the monograph of [23]. In this paper, however, we shall not assume that µ has a density with respect to the Lebesgue measure. In this context, rates of convergence for the W 2 Wasserstein distance have first been studied for several noise distributions by [4]. Recently, [10] have obtained optimal rates of convergence in the minimax sense for a class of supersmooth error distributions, in any dimension, under any Wasserstein metric W p . The result relies on the fact that lower bounds in any dimension can be deduced in this case from the lower bounds in dimension 1. Such a method cannot be used in the ordinary smooth case, where the rate of convergence depends on the dimension. As noticed by [17], establishing optimal rates of convergence in the ordinary smooth case is more difficult than in the supersmooth one, even for pointwise estimation.
A key fact in the univariate context is that Wasserstein metrics are linked to integrated risks between cumulative distribution functions (cdf), see the upper bound (5) below. In dimension 1, when estimating the density of µ, optimal rates of convergence for integrated risks can be found in [16,18]. When estimating the cdf F of µ, optimal rates for the pointwise and integrated quadratic risks are given in [21], where it is shown in particular that the rate √ n can be reached when the error distribution is ordinary smooth with a smoothness index less than 1/2. Concerning the pointwise estimation of F (x 0 ), optimal rates for the quadratic risk are also given in [9], when the density of µ belongs to a Sobolev class.
The case β = 0 in the upper bound (3.9) of [21] corresponds to the case where no assumption (except a moment assumption) is made on the measure µ (in particular µ is not assumed to be absolutely continuous with respect to the Lebesgue measure). This is precisely the case which we want to consider in the present paper. However the results by [21] cannot be applied to the Wasserstein deconvolution problems for two reasons: firstly, the integrated quadratic risk for estimating a cdf is not linked to Wasserstein distances, and secondly, the estimator of the cdf of µ proposed in [21] is the cdf of a signed measure, and is not well defined as an estimator of µ for the Wasserstein metric.
In the present contribution, we propose in the univariate situation an improved upper bound for deconvolving µ under W p , and a lower bound when the error is ordinary smooth. We recover the optimal rate of convergence in the supersmooth case with slightly weaker regularity conditions than in [10]. The estimator of the cdf F of µ is built in two steps: firstly, as in [21], we define a preliminary estimator through a classical kernel deconvolution method, and secondly we take an appropriate isotone approximation of this estimator. For controlling the random term, we use a moment inequality on the cdfs, which is due to [15]. To be complete, we show in Section 4 that for p > 1, the Wasserstein deconvolution problem is different from the cdf deconvolution problem with loss L p associated toÈbralidze's inequality (see (14) for the definition).
The paper is organized as follows. In Section 2, some facts about the case without error are recalled and discussed. The upper bounds for Wasserstein deconvolution with supersmooth or ordinary smooth errors are given in Section 3, and Section 4 is about lower bounds. The upper bound is proved in Section 5. Section 6 presents the implementation of the method and some experimental results. In particular, observed rates of convergence are compared with the theoretical bounds for the Wasserstein metrics W 1 and W 2 , and we study as an illustrative example the deconvolution of the uniform measure on the Cantor set.

On the case without error
We begin by considering the simple case when one observes directly X 1 , . . . , X n with values in R without error. Let us recall some results for the quantities W p (µ n , µ), where µ n is the empirical measure, given by Let F be the cdf of X 1 , F n the cdf of µ n , and let F −1 and F −1 n be their usual cadlag inverses. Recall that, for any p ≥ 1, and if p = 1: The case p = 1 has been well understood since the paper by [11]. The random variable √ nW 1 (µ n , µ) converges in distribution to |B(F (t))|dt, where B is a standard Brownian bridge, if and only if More recently, [1] have shown that the rate of EW 1 (µ n , µ) can be characterized by the quantities More precisely, the rate 1/ √ n is achieved if and only if (3) is satisfied. When this is not the case, EW 1 (µ n , µ) may decay at an arbitrary slow rate. See the Theorems 3.3 and 3.5 in their paper.
For p > 1, the situation is more complicated. Extra conditions are necessary to ensure that W p (µ n , µ) is of order 1/ √ n. If the random variables take their values in a compact interval [a, b] and if the cdf F is continuously differentiable on [a, b] with strictly positive derivative f , then n p/2 W p p (µ n , µ) converges in distribution to Lemma 3.9.23 in [25]). But in general, the rate can be much slower. The convergence in distribution for the case p = 2 has been studied in detail by [12]. Under additional conditions on F (see condition (2.7) in [12], which requires in particular that F is twice differentiable), the rate of convergence depends on the behavior of F −1 in a neighborhood of 0 and 1. For instance, if where α > 3, it follows from Theorem 4.7 in [12] that converges in distribution. The limiting distribution is explicitly given in [12]. The rates of decay of EW p (µ n , µ) and [EW p p (µ n , µ)] 1/p have been studied more recently in [1]. They show that these quantities decay at the standard rate 1/ √ n if and only if where f is the density of the absolutely continuous component of µ. In particular (see their Theorem 5.6), they show that However, this approach cannot be applied when the measure µ and the Lebesgue measure are singular. An alternative approach to obtain the rate of decay of EW p p (µ n , µ) is to use the following inequality, due to [15] (see also Sections 7.4 and 7.5 in [1]): for any p ≥ 1, where κ p = 2 p−1 p. Starting from (5), we get that where F n is the empirical cdf. Now, it is easy to see that this last integral is finite if and only if It follows that EW p p (µ n , µ) ≤ Cn −1/2 as soon as (6) is satisfied. For instance, taking p = 2, a tail satisfying P (|X| > x) = O( 1 x 4 log x 2+ε ) gives the rate 1/ √ n. Hence, we obtain the same rate as in (4) for α = 5, with a slightly stronger tail condition (due to the fact that we control the expectation), but without additional assumptions on the cdf F .
Since we want to estimate singular measures, we shall follow this approach in the sequel.

Construction of the estimator
Let us start with some notations. For µ a probability measure and ν another probability measure, with density g, we denote by µ ⋆ g the density of µ ⋆ ν, given by We further denote by µ * (respectively f * ) the Fourier transform of the probability measure µ (respectively of the integrable function f ), that is Finally, let F be the cumulative distribution function of µ.
The estimatorμ n of the measure µ is built in two steps: 1. A preliminary estimator of F . Let ⌈p⌉ be the least integer greater than or equal to p. We first introduce a symmetric nonnegative kernel k such that its Fourier transform k * is ⌈p⌉ times differentiable with Lipschitz ⌈p⌉-th derivative and is supported on [−1, 1]. An example of such a kernel is given by where C p is such that k(x)dx = 1.
We define now a preliminary estimatorF n of F : wherek Let us first give some conditions under which these quantities are well defined. Clearly,k h (x) is well defined as soon as µ * ε does not vanish, since in that case it is the Fourier transform of a continuous and compactly supported function (it can be easily checked thatk h (x) is a real function). In the sequel, we shall always assume that r ε = 1/µ * ε is at least two times continuously differentiable. In that case, the function w(u) = k * (u) µ * ε (−u/h) is two times differentiable with bounded and compactly supported derivatives. An integration by parts yields It follows thatk h is a continuous function such thatk h (x) = O(1/(1+x 2 )).
However, this estimatorF n , based on the standard deconvolution kernel density estimatork h first introduced by [6], is not a cumulative distribution function since it is not necessarily non-decreasing. 2. Isotone approximation. We need to define an estimatorF n of F which is a cumulative distribution function. We choose the estimatorF n as an approximate minimizer over all distribution functions of the quantity More precisely, given ρ > 0, letF n be such that, for every distribution function G, Here, ρ may be chosen equal to 0 (best isotone approximation) but the condition ρ = O(n −1/2 ) is the only condition required to get the rates of Section 3.3 below.
The estimatorμ n is then defined by: µ n is the probability measure with distribution functionF n .
Remark 3.1. The second step is different from that of [10], who chooseμ n as the (normalized) positive part of µ n . As we shall see, the isotone approximation allows to get better rates of convergence in the ordinary smooth case. The superiority of the isotone estimator will also be clearly highlighted through the simulations (see Section 6.2). However, this approach works only in the onedimensional case. One may argue that the estimatorF n is not explicit, and can be quite difficult to compute, because the minimization is done over an infinite dimensional set. In fact, this is not an issue, because powerful algorithms have been developed to deal with this situation. In Section 6, we shall use the function gpava from the R package isotonic [22] (see Section 6.1 for more details).

First upper bounds for
The non-random quantity W p p (µ ⋆ K h , µ) is the bias of the estimatorμ n . 2. Control of the bias. Let V h be a random variable with distribution K h and independent of X 1 , in such a way that the distribution of 3. Control of the random term. Note that is the cdf of µ ⋆ K h . ApplyingÈbralidze's inequality (5), we obtain that Now, by the triangle inequality and the definition ofF n , From (10), (11) and (12), to get explicit rates of convergence for E[W p p (μ n , µ)], it remains to control the term Remark 3.2. Another main difference between the present paper and [10] is the use ofÈbralidze's inequality (5) to control the random term. In [10] the term W p p (μ n , µ ⋆ K h ) (for another choice ofμ n ) is bounded by a term involving the variation norm betweenμ n and µ. In our case, this upper bound would give a worse rate of convergence.
Note that Inequality (5) is used here to control the random term only. A possible alternative approach is to use (5) directly, as in the case without error (see Section 2). This would give the upper bound In that case, the bias term would be However, without extra regularity assumptions on µ, this would give a bias term of order h, and then the same rate of convergence as in the case p = 1, that is n 1/(2β+1) under the assumptions of Theorem 3.1 (Item 2) of the next section. But this rate is always too slow for p > 1, see again Theorem 3.1. Moreover, there is no hope to obtain a better rate from (13) because n 1/(2β+1) is also the minimax rate to estimate F with the loss function This last assertion comes from the lower bound stated in Theorem 4.2 of Section 4.

Main results
Let r ε = 1/µ * ε , and let r (ℓ) ε be the ℓ-th derivative of r ε . Let m 0 denote the least integer strictly greater than p + 1 2 , and m 1 be the least integer strictly greater than p − 1 2 . Our first result is a general proposition which gives an upper bound for EW p p (μ n , µ) involving a tail condition on Y and the regularity of r ε .
Proposition 3.1. Let ρ ≤ n −1/2 , and letμ n be the estimator defined in (9). Assume that r ε is m 0 times continuously differentiable. For any h ≤ 1, we have For the sake of readability, the proof of Proposition 3.1 is postponed to Section 5.
We are now in a position to give the rates of convergence for the Wasserstein deconvolution, for a class of supersmooth error distributions, and for a class of ordinary smooth error distributions.
In the ordinary smooth case, when β < 1/2, any bandwidth h = O(n −1/2p ) leads to the rate n −1/2 . The fact that there are three different situations according as β > 1/2, β = 1/2 or β < 1/2 has already been pointed out in Theorem 3.2 of [21] and in Theorem 2.1 of [9] for the estimation of the cdf F . Note that the estimatorF n of [21] is exactly the estimator defined in (8) (with possibly a slightly different kernel). Hence it is not always non-decreasing and cannot be used directly to estimate µ with respect to Wasserstein metrics. For instance, for a Laplace error distribution, the estimatorF n of [21] is such that while the rate of convergence of our estimator for W 1 is In both cases, there are no assumptions on µ, except moment assumptions; in particular, µ needs not be absolutely continuous with respect to the Lebesgue measure. It is then a different context than that considered by [9] for the pointwise estimation of F (x 0 ). In this paper, the authors always assume that µ is absolutely continuous with respect to the Lebesgue measure, with a density f belonging to a Sobolev space of order α > −1/2. Note that the two rates described in this remark are minimax (see Section 4 for our estimator).
in Assumption (15) implies that H Y (x) = O(1/|x| 2p ). Hence |Y | has a weak moment of order 2p, which implies a strong moment of ordrer q for any q < 2p. Note that (19) is the same as the tail condition (6) obtained in Section 2 to get the rate EW p p (µ n , µ) ≤ Cn −1/2 in the case without noise. Recall that, in the case without noise when p = 1, this condition is necessary and sufficient for the weak convergence of √ nW 1 (µ n , µ). Note also that (19) holds iff (6) holds and The "if" part follows easily from the simple inequality P (|X +ε| > x) ≤ P (|X| > x/2) + P (|ε| > x/2). To prove the "only if" part, note that, since X and ε are independent, (19) can be written But this implies that for µ ε almost every y. Now if (21) holds for one y, then it holds for every y, proving that (6) holds (and the same is true for ε by interchanging X and ε in (20)). As we have seen, the tail condition on ε implies that |ε| has a moment of ordre k for any integer k strictly less than 2p, hence µ * ε is at least k times continuously differentiable.
Remark 3.5. The rate EW p p (μ n , µ) ≤ C(log n) −p/β in the supersmooth case has already been given in Theorem 4 of [10] and is valid in any dimension. However the condition on the regularity of r ε is more restrictive in the paper by [10], since it is assumed there that Condition (16) is true for ℓ ∈ {0, 1, . . . , ⌈p⌉ + 1}. Note that this rate is minimax, as stated in Theorem 2 of [10].
Remark 3.6. Applying Proposition 1 in [10], if Condition (17) is true for ℓ ∈ {0, 1, . . . , ⌈p⌉+1}, one can build an explicit estimatorμ n such that EW p p (μ n , µ) ≤ Cn −p/(2p+2β+1) , which is worse than (18). The estimatorμ n is the "naive" estimator defined in Section 6.1. However, the procedure given in [10] works also when the observations Y i are R d -valued, whereas the estimatorμ n defined in (9) is well defined for d = 1 only. Hence, a reasonable question is: can we improve on Proposition 1 of [10] in any dimension?

Lower bound
For some M > 0 and q ≥ 1, we denote by D(M, q) the set of measures µ on R such that |x| q dµ(x) ≤ M .
Theorem 4.1. Let M > 0 and q ≥ 1. Assume that there exist β > 0 and c > 0, such that for every ℓ ∈ {0, 1, 2} and every t ∈ R, Then, there exists a constant C > 0 such that, for any estimatorμ,  1) is an open question. From Section 2, it seems reasonable to think that better rates can be obtained when µ has an absolutely continuous component with respect to the Lebesgue measure which is strictly positive on the support of µ (and also that this should be a necessary condition condition to reach the lower bound when β > 1/2).
We also give a lower bound for the cdf deconvolution problem with loss L p defined in (14).
We give below the proof of Theorem 4.1 for the Wasserstein metric. The proof of Theorem 4.2 is similar, it can be easily adapted from the proofs of Theorem 4.1 and of Theorem 3 in [10].
Proof. Let M > 0 and q ≥ 1. The proof is similar to the proof of Theorem 3 in [10] and thus we only give here a sketch of the proof. We first define a finite family in D(M, q) using the densities with some r > (1 + q)/2. Next, let b n be the sequence where [·] is the integer part. For any θ ∈ {0, 1} bn , let where C is a positive constant and t s,n = (s − 1)/b n . The function H is a bounded function whose integral on the line is 0. Moreover, we may choose a function H such that (see for instance [17] or [18]): du is a primitive of H. Note that by replacing H by H/C in the following, we finally can take C = 1 in (25). Let µ θ be the measure of density f θ with respect to the Lebesgue measure. Then we can find some M large enough such that for all θ ∈ {0, 1} bn , µ θ ∈ D(M, q). Moreover, under these assumptions the first two derivatives of H * are continuous and bounded.
We also consider the densities h θ,s,u = f θ,s,u ⋆ µ ε for u = 0 or 1. Since W 1 is dominated by W p , and using Jensen's inequality, it follows that sup µ∈D(M,q) Using a standard randomization argument (see for the instance the proof of Theorem 3 in [10] for the multivariate case), it can be shown that there exists a constant C > 0 such that sup µ∈D(M,q) as soon as there exists a constant c > 0 such that, for any θ ∈ {0, 1} bn , where the χ 2 distance between two densities h 1 and h 2 on R is defined by If (28) is satisfied, we take b n as in (24) and the theorem is thus proved according to (26), (27) and (A1). It remains to prove (28). Using (A2), we can find a constant C > 0 such that for any t ∈ R and any s ∈ {1, . . . , b n }, The right side of (29) is typically the kind of χ 2 divergence that is upper bounded in the proofs of Theorems 4 and 5 in [17] for computing pointwise rates of convergence: under Assumption (22), it gives that there exists a constant C such that n and (28) is proved.

Proof of Proposition 3.1
Throughout, C will denote a positive constant depending on p which may change from line to line. We start from the basic inequality (10). Inequality (11) yields the bias term and it remains to control the term EW p p (μ n , µ ⋆ K h ). By (12), we have Now, let φ denote a symmetric function, ⌈p⌉+1 times continuously differentiable, equal to 1 on the interval [−1, 1] and to 0 outside [−2, 2]. Our preliminary estimatorF n may be written Here, From (30), we infer that where I = |x| p−1 Var(F 1,n )(x)dx and J = |x| p−1 Var(F 2,n )(x)dx.
To prove Proposition 3.1, we shall give some upper bounds for the terms I and J.
Control of I We first split the integral into two parts:

Now,
Then, letting z = uh and applying Cauchy-Schwarz's inequality we obtain, for any a ∈ ]0, 1[ , To control the term I − 1 , note that Here we shall use the following lemma.
Lemma 5.1. For any nonnegative integer k and any h ≤ 1 we have Proof. By definition ofk 1,h , Now, by Parseval-Plancherel's identity, It can be checked that, for h ≤ 1, which concludes the proof of the Lemma.
Applying Lemma 5.1 with k = 1, we obtain that We now control the term I − 2 . Let b ∈ ]0, 1[ . Applying Cauchy-Schwarz's inequality Consequently, by Fubini's Theorem Let m 0 be the least integer strictly greater than p + 1/2. Taking a and b close enough to 0, it follows that Applying Lemma 5.1 with k = m 0 , it follows that In the same way, we have Using the same arguments as for I − , we obtain, Consequently, gathering (32), (33) and (34) we obtain that Control of J Let a ∈ ]0, 1/2[ . By definition of the term J, and applying Cauchy-Schwarz's inequality, Let us write Using Parseval-Plancherel's identity, we get Consequently, Setting u = t/h and using the fact that |x| q ≤ 2 q−1 |x − Y | q + 2 q−1 |Y | q for any q ≥ 1, we obtain that Let m 1 be the least integer strictly greater than p − 1 2 . Taking a close enough to zero, it follows that By Parseval-Plancherel's identity, Finally, Starting from (10) and gathering the upper bounds (11), (31), (35) and (36), the proof of Proposition 3.1 is complete.

Numerical experiments
This section is devoted to the implementation of the deconvolution estimators. We continue the experiments of [4] about Wasserstein deconvolution in the ordinary smooth case. In particular, we study the W 1 and W 2 univariate deconvolution problems and we compare our numerical results with the upper and lower bounds given in the previous sections. We also apply our procedure to the deconvolution of the uniform measure on the Cantor set. The deconvolution method is implemented in R.

Implementation of the deconvolution estimators
For all the experiments we use the kernel which corresponds to the kernel given by (7) with p = 2 and a Fourier support over [−1/2, 1/2]. Computing the deconvolution estimators requires to evaluate many times the functionk which is the Fourier transform of The Fourier decomposition of ψ h is given by ψ h (u) = k∈Z a k,h e 2iπku where a k,h = In this section we consider symmetric distributions for µ ε . Thus k * and µ * ε are even functions, and the a k,h 's are real coefficients. Next,k For large N , the coefficient a k,h can be approximated by the k-th coefficient of a discrete Fourier transform taken at (ψ h (0), ψ h (1/N ), . . . , ψ h (1 − 1/N )), denoted a k,h,N in the sequel. Of course we use the Fast Fourier Transform algorithm to compute these quantities. For some large K, we evaluatek h at some point x bŷ For intensive simulation, it may be relevant to preliminary computek h on a grid of high resolution rather than calling this function each time.
We first define a discrete approximation of the function Let P = {t 1 < · · · < t q } be a finite regular grid of points in R with resolution η. A discrete approximationμ d n,h ofμ n,h is defined on P bŷ where δ x is the Dirac distribution at x. Sinceμ n,h (t j ) can be negative, the first method for estimating µ consists in taking the positive part ofμ d n,h : This first estimator is called the "naive" deconvolution estimator henceforth. Note that it was studied in [4] and [10]. For implementing the alternative estimatorμ n,h proposed in this paper, we first need to find some probability distributionF n,h on R such that In practice, this corresponds to finding a distribution function close to the step functionF We computeF isot,p n,h thanks to the function gpava from the R package isotonic [22]. The measure µ is finally estimated by the absolutely continuous measurê µ isot,p n,h whose distribution function isF isot,p n,h . We call this estimator the isotone deconvolution estimator for the metric W p .
The construction ofμ isot,p n,h depends on many parameters, for instance K, h, N and η. Tuning all these parameters is a tricky issue. For this paper we only tune these quantity by hand. The bandwidth choice is discussed in Section 6.5. Note that one crucial point is the length N of the vector we use for computing the the a k,h,N 's with the FFT. For ordinary smooth distributions, we observe thatk h decreases slowly for small β for the range of bandwidths h giving minimum Wasserstein risks. Consequently, a small β requires many terms in the expansion (37), and hence a large N . For β smaller than 0.5, it was necessary to take N ≈

Computation of Wasserstein risks for simulated experiments
For fixed distributions µ and µ ε , we simulate Y 1 , . . . , Y n according to the convolution model (1). For a given bandwidth h and p ≥ 1, we can compute W p p (μ naive n , µ) and W p p (μ isot,p n,h , µ) using the quantile functions of the measures, thanks to the relation (2). The Wasserstein risks R naive (n, h) := EW p p (μ naive n,h , µ) and R isot (n, h) := EW p p (μ isot,p n,h , µ) can be estimated by an elementary Monte Carlo method by repeating the simulation of the Y i 's and averaging the Wasserstein distances. Letr isot p (n, h) andr naive p (n, h) be the estimated risks obtained this way (see Figure 1 for an illustration of such curves for the Dirac experi- Table 1 Ordinary smooth distributions used for the error where H is a grid of bandwidth values.

Estimation of the rates of convergence
In this experiment we study the rates of convergence of the estimators for the deconvolution of three distributions: We take for µ ε the ordinary smooth distributions summarized in Table 1. Recall that the coefficient β of a symmetrized Gamma distribution is twice the shape parameter of the distribution. For each error distribution and for n chosen between 100 and 2000, we simulate 200 times a sample of length n from which we compute the estimated minimal risksr isot p, * (n) andr naive p, * (n). We study the Wasserstein risks W 1 and W 2 . We obtain some estimation of the exponent of the rate of convergence for each deconvolution problem by computing the linear regression of logr p, * (n) by log n. See Figure 2 for an illustration and Figures 7 and 8 at the end of the paper for the complete outputs of the Dirac case. A linear trend can be observed in all cases. As expected, the risks are smaller for the isotone estimators than for the naive ones.
The estimated exponents of the convergences rates are plotted in Figure 3 as functions of β. These estimated rates can be compared with the upper and lower bounds obtained in the paper. Of course the rates of convergence of the isotone estimator have no reason to match exactly the lower bounds. However it can be checked that the estimated rates we obtain are consistent with the theoretic bounds proved before. In particular we see that the parametric rate is reached for values of β close to 0, at least in the Dirac case. These results also suggest that the correct minimax rate for W 2 probably corresponds to the upper bound given in Theorem 3.1 (that is, when no further assumption is made on the unknown distribution µ).

Cantor set experiment
We now illustrate the deconvolution method with a more original experiment. We take for µ the uniform distribution on the Cantor set C. Remember that the Cantor set can be defined by repeatedly deleting the open middle thirds of a set of line segments: and F m+1 is obtained by cutting out the middle thirds of all the intervals of F m : F 1 = [0, 1 3 ] ∪ [ 2 3 , 1] and F 2 = [0, 1 9 ] ∪ [ 2 9 , 1 3 ] ∪ [ 2 3 , 7 9 ] ∪ [ 8 9 , 1], etc. . . The uniform measure µ C on C can be defined as the distribution of the random variable X := 2 k≥1 3 −k B k where (B k ) k≥1 is a sequence of independent random variables with Bernoulli distribution of parameter 1/2. Note that the Lebesgue measure of C is zero and thus the Lebesgue measure and µ C are singular. The deconvolution estimators being densities for the Lebesgue measure, the Wasserstein distances are relevant metrics for comparing these with µ C .
Let µ C,K be the distribution of the random variable defined by the partial sumX := 2 K k=1 3 −k B k where the B k 's are defined as before. The distribution µ C,K is an approximation of µ C which can be computed in practice. We simulate a sample of n = 10 4 observations from µ C,K with K = 100. These observations are contaminated by random variables with symmetrized Gamma distribution (the shape parameter is equal to 1/4 (so that β = 0.5) and the scale parameter is equal to 1/2).  In Figure 4, the isotone estimators for W 1 and W 2 and the naive estimator are plotted on the first four levels F m of the Cantor set. The bandwidths are chosen by minimizing the Wasserstein risks over a grid, as in Section 6.3. This requires to approximate the quantile functions for the isotone deconvolution estimator   and for the µ C . Regarding the quantile function of µ C , we simulate a large sample according to µ C,100 and we compute the corresponding empirical distribution function. This last cdf is an approximation of the so called "Devil's staircase" (see Figure 5). For the naive deconvolution estimator we find h = 0.  Figure 4, the W 1 -isotone deconvolution estimator is able to "see" the first three levels of the Cantor set and the three other deconvolution methods recover the first two levels. A kernel density estimator (with no deconvolution) only recovers the first level.

About the bandwidth choice
In practice, we need to choose a bandwidth h for the deconvolution estimators.
As was explained in [4] (see Remark 3 in this paper), it seems that the influence of the measure µ is weak. We now propose a simple experiment to check this principle. We choose for µ ε the symmetrized gamma distribution with a shape parameter equal to 0.375 (β = 0.75) and we simulate contaminated observations from the following various distributions:  Table 1: log-log plots of the estimated W 1 -risks for the naive method and the isotone method. We focus here on the study of the W 2 -isotone deconvolution estimator. Figure 6 compares the locations of the minimums of the five risk curves h →r isot 2,h by  Table 1: log-log plots of the estimated W 2 -risks for the naive method and the isotone method.
averaging over 200 samples of 1000 contaminated observations. For this experiment, the sensitivity of the minimum risk location to the distribution µ is not very large. On another hand, from Figure 3, it seems that the rates for the mixture Dirac Uniform are quite slow (in particular, they are close to the minimax rates for W 1 ).
From these remarks, it seems that the bandwidth minimizing the risk computed for the mixture Dirac Uniform should be a reasonable choice for deconvolving other distributions. Of course, this is in some sense a "minimax choice", and it will not give the appropriate rate for measures which are easier to estimate (for instance measures with smooth densities).
A bootstrap method in the spirit of [13] may give a more satisfactory answer to this problem. However, note that the use of the Wasserstein metric makes difficult the asymptotical analysis of the risk. This interesting problem is out of the scope of this paper, we intend to investigate it in a future work.