Spectral cut-off regularisation for density estimation under multiplicative measurement errors

We study the non-parametric estimation of an unknown density f with support on R+ based on an i.i.d. sample with multiplicative measurement errors. The proposed fully data-driven procedure consists of an estimation of the Mellin transform of the density f, a regularisation of the inverse of the Mellin transform by a spectral cut-off, and a data-driven model selection to deal with the arising bias-variance trade-off. We further introduce and discuss Mellin-Sobolev spaces, which characterize the regularity of the unknown density f through the decay of its Mellin transform. Additionally, we show minimax-optimality of the data-driven density estimator over Mellin-Sobolev spaces, and hence its adaptivity.


Introduction
In this paper we are interested in estimating the unknown density f : R+ → R+ of a positive random variable X given independent and identically distributed (i.i.d.) copies of Y = XU, where X and U are independent of each other and U has a known density g : R+ → R+. In this setting the density f_Y : R+ → R+ of Y is given by

f_Y(y) = [f ∗ g](y) := ∫_0^∞ f(x) g(y/x) x^{−1} dx for all y ∈ R+,

where ∗ denotes multiplicative convolution. The estimation of f using an i.i.d. sample Y_1, . . . , Y_n from f_Y is thus an inverse problem called multiplicative deconvolution. Vardi [1989] and Vardi and Zhang [1992] introduce and intensively study multiplicative censoring, which corresponds to the particular multiplicative deconvolution problem with multiplicative error U uniformly distributed on [0,1]. This model is often applied in survival analysis, as explained and motivated in Van Es et al. [2000]. The estimation of the cumulative distribution function of X is considered in Vardi and Zhang [1992] and Asgharian et al. [2012]. Series expansion methods treating the model as an inverse problem are studied in Andersen and Hansen [2001]. Density estimation in a multiplicative censoring model is considered in Brunel et al. [2016] using a kernel estimator and a convolution power kernel estimator. Assuming a uniform error distribution on an interval [1 − α, 1 + α] for α ∈ (0, 1), Comte and Dion [2016] analyze a projection density estimator with respect to the Laguerre basis. Belomestny et al. [2016] study a beta-distributed error U. In this work, covering all three of these variations of the multiplicative censoring model, we consider a density estimator using the Mellin transform and a spectral cut-off regularization of its inverse, which borrows ideas from Belomestny et al. [2020]. The key to the analysis of the multiplicative deconvolution problem is the multiplication theorem of the Mellin transform M, which roughly states M[f_Y] = M[f]M[g] for a density f_Y = f ∗ g.
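The factorization M[f_Y] = M[f]M[g] can be checked empirically: at the point c = 1 the Mellin transform of a density h is M_1[h](t) = E[R^{it}] for R ~ h, so independence of X and U gives E[Y^{it}] = E[X^{it}] E[U^{it}]. The following Python sketch verifies this by Monte Carlo; the Gamma choice for f, the sample size and the seed are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.gamma(shape=2.0, scale=1.0, size=n)   # some positive X with density f
U = rng.uniform(0.0, 1.0, size=n)             # multiplicative censoring error
Y = X * U                                     # observed sample

t = 1.5
# empirical Mellin transforms at c = 1: M_1[h](t) = E[R^{it}]
mX = np.mean(X ** (1j * t))
mU = np.mean(U ** (1j * t))
mY = np.mean(Y ** (1j * t))

# multiplication theorem: E[Y^{it}] = E[X^{it}] E[U^{it}]
err = abs(mY - mX * mU)
```

For U uniform on [0,1] the factor E[U^{it}] also matches the closed form 1/(1 + it), which is used again later for the deconvolution step.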
Exploiting the multiplication theorem, Belomestny et al. [2020] introduce a kernel density estimator of f, allowing more generally X and U to be real-valued. Moreover, they point out that the following widely used naive approach is a special case of their estimation strategy. Transforming the data by applying the logarithm, the model Y = XU becomes log(Y) = log(X) + log(U). In other words, multiplicative convolution becomes additive convolution for the log-transformed data. As a consequence, the density of log(X) is estimated employing usual strategies for non-parametric deconvolution problems (see for example Meister [2009]) and then transformed back to an estimator of f. However, it is difficult to interpret regularity conditions on the density of log(X). Furthermore, the analysis of a global risk of an estimator using this naive approach is challenging, as Comte and Dion [2016] point out. Our strategy differs in the following way. Making use of the multiplication theorem of the Mellin transform and applying an additional spectral cut-off on the inversion of the Mellin transform, we define a density estimator. We measure the accuracy of the estimator by introducing a global risk in terms of a weighted L2(R+)-norm. Exploiting properties of the Mellin transform, we characterize the underlying inverse problem and natural regularity conditions which borrow ideas from the inverse problems community (Engl et al. [2000]). The regularity conditions, expressed in the form of Mellin-Sobolev spaces, and their relation to analytical properties of the density f are discussed in more detail. The proposed estimator involves a tuning parameter, which is selected by a data-driven method. We establish an oracle inequality for the fully data-driven spectral cut-off estimator under fairly mild assumptions on the error density g. Moreover, we show that the proposed data-driven estimator is minimax-optimal uniformly over Mellin-Sobolev spaces.
Precisely, we state both an upper bound for the mean weighted integrated squared error of the fully data-driven spectral cut-off estimator and a general lower bound for estimating the density f based on i.i.d. copies from f_Y = f ∗ g.
The paper is organized as follows. In section 2 we collect properties of the Mellin transform. We explain our general estimation strategy by first introducing and analyzing an estimator based on direct observations X_1, . . . , X_n from f. The estimator relies on an inversion of the Mellin transform which we stabilize using a spectral cut-off. Exploiting the multiplication theorem of the Mellin transform, we then propose a fully data-driven estimator of f based on the sample Y_1, . . . , Y_n. We derive an oracle-type upper bound for its mean weighted integrated squared error. We finish the theoretical part by showing in section 3 that our fully data-driven estimator is minimax-optimal over Mellin-Sobolev spaces for a large class of error densities g. Finally, results of a simulation study are reported in section 4, which illustrate the reasonable finite-sample performance of our estimators. The proofs of sections 2 and 3 are postponed to the appendix.
2 Adaptive weighted L2(R+) estimation

In this section we introduce the Mellin transform and collect some of its properties. For a more detailed introduction we refer to Paris and Kaminski [2001] and Barucq et al. [2015].
Mellin transform Let L1,loc(R+) denote the set of locally integrable real-valued functions on R+. For h ∈ L1,loc(R+) the Mellin transform of h at the point c + it ∈ C is defined by

M_c[h](t) := ∫_0^∞ x^{c−1+it} h(x) dx, (2.1)

provided that the integral is absolutely convergent. If there exists a c ∈ R such that the mapping x ↦ x^{c−1}h(x) is integrable over R+, then the region Ξ_h ⊆ C of absolute convergence of the integral in (2.1) is either a vertical strip {s + it ∈ C : s ∈ (a, b), t ∈ R} for a < b with c ∈ (a, b) or a vertical line {c + it ∈ C : t ∈ R}. In the case that Ξ_h is a vertical strip, the function s + it ↦ M_s[h](t) is analytic on this strip. In the literature Ξ_h is therefore often called the strip of analyticity. In the following illustration we give some techniques to determine Ξ_h.
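As a numerical sanity check of definition (2.1): for the exponential density h(x) = e^{−x} the Mellin transform has the closed form M_c[h](t) = Γ(c + it). The sketch below (an illustration, not part of the paper's procedure) compares quadrature with this closed form; the substitution x = e^u is my own implementation choice to tame the oscillation of x^{it} near zero.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def mellin(h, c, t):
    """M_c[h](t) = ∫_0^∞ x^{c-1+it} h(x) dx, computed after the
    substitution x = e^u, which yields a smooth Fourier-type integral."""
    real = quad(lambda u: np.exp(c * u) * np.cos(t * u) * h(np.exp(u)), -40, 10, limit=200)[0]
    imag = quad(lambda u: np.exp(c * u) * np.sin(t * u) * h(np.exp(u)), -40, 10, limit=200)[0]
    return real + 1j * imag

c, t = 1.0, 2.0
val = mellin(lambda x: np.exp(-x), c, t)   # density of the Exp(1) distribution
exact = gamma(c + 1j * t)                  # closed form: Γ(c + it)
err = abs(val - exact)
```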
Illustration 2.1. Note that for any density h1 ∈ L1,loc(R+) the vertical line {1 + it : t ∈ R} belongs to Ξ_{h1}, and hence the Mellin transform M_1[h1] is well-defined; if additionally x ↦ x^{±ε} h1(x) is integrable for all ε > 0 small enough, then Ξ_{h1} even contains a vertical strip around this line. The latter property holds, for instance, for the family (g_k)_{k∈N} with g_k(x) := k(1 − x)^{k−1} 1_{(0,1)}(x), x ∈ R+.

For c ∈ Ξ_h the inversion formula of the Mellin transform is given by

h(x) = (2π)^{−1} ∫_R x^{−c−it} M_c[h](t) dt, x ∈ R+, (2.3)

whenever Ξ_h is not a vertical line (c.f. Paris and Kaminski [2001]) or, alternatively, if the function t ↦ M_c[h](t) is integrable (c.f. Barucq et al. [2015]). Considering Illustration 2.1 we see that the assumption of Ξ_h not being a vertical line is rather weak; it is fulfilled for almost all functions in the upcoming theory. In all the other cases we make use of the second assumption without further reference. It can be shown that

∫_0^∞ |h(x)|² x^{2c−1} dx = (2π)^{−1} ∫_R |M_c[h](t)|² dt.

This result combined with the Mellin inversion formula implies an isometry in the following way. For α ≥ 0 define the weight function ω_α(x) := x^{2α−1}, x ∈ R+, and the corresponding weighted norm by ‖h‖²_{ω_α} := ∫_0^∞ h²(x) ω_α(x) dx for a measurable function h. Denote by L2(R+, ω_α) the set of all measurable functions with finite ‖.‖_{ω_α}-norm and by ⟨h1, h2⟩_{ω_α} := ∫_0^∞ h1(x) h2(x) ω_α(x) dx for h1, h2 ∈ L2(R+, ω_α) the corresponding weighted scalar product. Similarly, define L2(R) := {h : R → C measurable : ‖h‖² := ∫_R |h(t)|² dt < ∞}. Then, following Barucq et al. [2015], both the operator M_α : L2(R+, ω_α) → L2(R) and its inverse are (up to a constant) isometries. We will first construct an estimator for f given an i.i.d. sample X_1, . . . , X_n, that is the direct case, and afterwards we construct an estimator based on the i.i.d. sample Y_1, . . . , Y_n, which constitutes the censored case.
Case of direct observation In this paragraph we define the estimator of f ∈ L2(R+, ω_α) given the direct observations X_1, . . . , X_n by using the Mellin transform and a spectral cut-off regularised inverse. Since f ∈ L2(R+, ω_α), the Mellin transform M_α[f] is well-defined, and a natural unbiased estimator of M_α[f](t) for each t ∈ R is given by M̂_α(t) := n^{−1} Σ_{j=1}^n X_j^{α−1+it}. It is worth pointing out that this estimator is bounded in t ∈ R by |M̂_α(t)| ≤ M̂_α(0), which is finite almost surely. Thus 1_{[−k,k]} M̂_α ∈ L2(R) for all k ∈ R+, which allows us to apply the operator in (2.3) to define

f̂_k(x) := (2π)^{−1} ∫_{−k}^{k} x^{−α−it} M̂_α(t) dt, x ∈ R+, (2.4)

as an unbiased estimator of f_k(x) := (2π)^{−1} ∫_{−k}^{k} x^{−α−it} M_α[f](t) dt, which as a random variable has a finite first moment if and only if the distribution induced by f has a finite 2(α − 1)-th moment, by an application of the Fubini-Tonelli theorem. We state the following proposition, which implies the consistency of the estimator for a suitable choice of the cut-off parameter k ∈ R+.
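To make (2.4) concrete, here is a minimal numerical sketch of the direct-case estimator with α = 1, discretising the cut-off integral by a Riemann sum. The Gamma(2,1) target density, the sample size and the cut-off k = 8 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5000, 8.0                       # sample size and spectral cut-off parameter
X = rng.gamma(2.0, 1.0, size=n)        # direct observations from f = Gamma(2,1)

t = np.linspace(-k, k, 801)
dt = t[1] - t[0]
# empirical Mellin transform for α = 1: M̂_1(t) = n^{-1} Σ_j X_j^{it}
M_hat = np.mean(X[None, :] ** (1j * t[:, None]), axis=1)

def f_hat(x):
    # spectral cut-off inverse: (2π)^{-1} ∫_{-k}^{k} x^{-1-it} M̂_1(t) dt
    return np.real(np.sum(x ** (-1.0 - 1j * t) * M_hat) * dt) / (2 * np.pi)

xs = np.array([0.5, 1.0, 2.0, 3.0])
est = np.array([f_hat(x) for x in xs])
true = xs * np.exp(-xs)                # Gamma(2,1) density x e^{-x}
```

With this cut-off the estimate tracks the true density up to the stochastic error of order (k/n)^{1/2}, matching the bias-variance picture of Proposition 2.2.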
By choosing k = k_n such that n^{−1} k_n → 0 and k_n → ∞, f̂_{k_n} is a consistent estimator of f.
The proof of Proposition 2.2 can be found in appendix B. Note that, if one would like to consider the case α = 1/2, that is ω_α ≡ 1, it is necessary to assume a finite first moment of X^{−1}. On the other hand, a less restrictive situation is the case α = 1, which needs no additional moment condition on f (respectively on g), since in this case σ² = 1. The corresponding weight function is ω(x) := ω_1(x) = x for x ∈ R+. For the estimation of densities without a compact support assumption, the intervals far away from zero are of special interest. A weighted L2(R+)-norm can model this and may allow us to capture more interesting characteristics of the density, like being heavy-tailed or compactly supported. From now on, we restrict ourselves to this special case, while we want to remark that the upcoming theory can be extended to different values of α ≥ 0 under additional assumptions. Before we define the estimator in the case of censored observations, let us briefly discuss the bias term ‖f − f_k‖²_{ω_α} appearing in Proposition 2.2. A natural condition which allows a more sophisticated study of the bias is to assume that the Mellin transform decays with a polynomial rate, that is, for s ≥ 0,

W_s := {h ∈ L2(R+, ω) : |h|²_s := ∫_R (1 + t²)^s |M_1[h](t)|² dt < ∞}. (2.5)

The definition of these spaces strongly resembles the frequently considered Sobolev spaces, defined in terms of the Fourier transform F as W^s := {H ∈ L2(R) : ∫_R (1 + t²)^s |F[H](t)|² dt < ∞}. To distinguish between them, we refer to W_s as Mellin-Sobolev spaces and to W^s as Fourier-Sobolev spaces. And not only the general motivation is similar, as we can easily see from the following relationship between the Fourier transform and the Mellin transform.
Therefore, it does not come as a surprise that it is possible to characterise the Mellin-Sobolev spaces via analytical properties, as done in the case of the Fourier-Sobolev spaces. This is stated in the following proposition, whose proof is postponed to the appendix.
If a positive random variable R has the density h, the real-valued random variable T = log(R) has the density f_T(y) = H(−y) = (ωh) ∘ ϕ(−y) for y ∈ R. Again, we observe the strong connection between the Mellin transform of h and the Fourier transform of H. This has the following interesting implication for the multiplicative measurement error model.

Remark 2.4. As already mentioned, the application of the logarithm to the random variable Y, that is Z := log(Y) = log(X) + log(U) =: V + ε, where V and ε are independent, can be used to transfer the model to a regular deconvolution setting. This technique was used for instance by Comte and Dion [2016]. Given an estimator of the density of V, transforming it back into an estimator of f makes the weight ω appear in the global risk, which illustrates the difficulties which arise when considering the global risk. Furthermore, in the deconvolution approach via a Fourier transform one usually assumes that the densities f_V, f_ε belong to L2(R), which again corresponds to the fact that f, g ∈ L2(R+, ω) by the considerations above.
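Under my reading of the conventions above (ω(x) = x and ϕ(x) = e^{−x}), the identity f_T(y) = H(−y) = (ωh) ∘ ϕ(−y) amounts to f_T(y) = e^y h(e^y). The following Monte Carlo sketch checks this for an Exp(1) random variable; the density choice, bin width and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
R = rng.exponential(1.0, size=200_000)   # positive R with density h(x) = e^{-x}
T = np.log(R)                            # log-transformed variable

# claimed density of T = log(R): f_T(y) = e^y h(e^y)
ys = np.array([-1.0, 0.0, 1.0])
f_T = np.exp(ys) * np.exp(-np.exp(ys))

# empirical density of T via small histogram bins around each y
width = 0.2
emp = np.array([np.mean(np.abs(T - y) < width / 2) / width for y in ys])
```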
In the next part we define an estimator of f based on the censored observations Y_1, . . . , Y_n. Since Y = XU with X and U independent, E[Y^{−1}] = E[X^{−1}] E[U^{−1}]. In the multiplicative measurement error model we would thus need to assume that both X and U have a finite −1-moment in order to consider the unweighted norm. Especially the latter requirement would exclude several interesting examples, for instance the multiplicative censoring model, where U is uniformly distributed on [0, 1].

Case of censored observation
The key property which makes the Mellin transform useful for multiplicative deconvolution is that for two functions h1, h2 ∈ L1,loc(R+),

M_c[h1 ∗ h2](t) = M_c[h1](t) M_c[h2](t) for all t ∈ R, (2.6)

whenever c + it ∈ Ξ_{h1} ∩ Ξ_{h2}. We will refer to (2.6) from now on as the multiplication theorem. In the context of deconvolution a similar property addressing the convolution and its Fourier transform is frequently used to construct deconvolution estimators. Since f and g are both densities, we have M_1[f_Y](t) = M_1[f](t) M_1[g](t) for all t ∈ R. Under the assumption that M_1[g](t) ≠ 0 for all t ∈ R, which we impose in the upcoming theory without further reference, we see that using (2.6) we can express the functions (f_k)_{k∈R+} in the following way,

f_k(x) = (2π)^{−1} ∫_{−k}^{k} x^{−1−it} M_1[f_Y](t)/M_1[g](t) dt for k, x ∈ R+.

Similar to the direct case, we define our estimator by replacing M_1[f_Y] with its empirical counterpart M̂(t) := n^{−1} Σ_{j=1}^n Y_j^{it}, that is,

f̂_k(x) := (2π)^{−1} ∫_{−k}^{k} x^{−1−it} M̂(t)/M_1[g](t) dt, x ∈ R+. (2.7)

The following mild assumption on the error density ensures that the estimator is well-defined, by definition of f̂_k, and hence the estimators defined in (2.7) and (2.4) coincide when setting M_1[g](t) = 1 for t ∈ R. The proof of the next proposition is very similar to the proof of Proposition 2.2 and is thus omitted.
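For the multiplicative censoring model U ~ U[0,1] one has the closed form M_1[g](t) = ∫_0^1 x^{it} dx = 1/(1 + it), so (2.7) can be implemented directly. The sketch below is an illustration under assumed choices (Gamma(2,1) for f, cut-off k = 6, fixed seed); it is not the paper's tuned procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10_000, 6.0
X = rng.gamma(2.0, 1.0, size=n)        # unobserved X with density f = Gamma(2,1)
U = rng.uniform(0.0, 1.0, size=n)      # multiplicative censoring error, g = U[0,1]
Y = X * U                              # observed sample

t = np.linspace(-k, k, 301)
dt = t[1] - t[0]
M_Y = np.mean(Y[None, :] ** (1j * t[:, None]), axis=1)   # M̂(t) = n^{-1} Σ Y_j^{it}
M_g = 1.0 / (1.0 + 1j * t)             # closed form M_1[g](t) for U ~ U[0,1]
M_f = M_Y / M_g                        # multiplication theorem: M_1[f] = M_1[f_Y]/M_1[g]

def f_hat(x):
    # spectral cut-off inverse as in (2.7), discretised by a Riemann sum
    return np.real(np.sum(x ** (-1.0 - 1j * t) * M_f) * dt) / (2 * np.pi)

est, true = f_hat(1.0), np.exp(-1.0)   # Gamma(2,1) density at x = 1 is e^{-1}
```

Note how dividing by M_g amplifies the stochastic error by a factor |1 + it|, the source of the ill-posedness quantified by the parameter γ below.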
Let us now have a closer look at the second summand in Proposition 2.5, which bounds the variance term of the estimator. In the following we use, for two functions f, g, the notation f ∼ g over a set A ⊂ R if the function f/g is bounded away both from zero and infinity over the set A. In analogy to the usual deconvolution setting and to the work of Belomestny et al. [2020], we say that the error density is smooth if there exists a parameter γ ∈ R+ such that

[G1] |M_1[g](t)| ∼ (1 + t²)^{−γ/2} over R.

Corollary 2.6. Assume that f ∈ L2(R+, ω) and that [G1] holds. Then for all k ∈ N we have E‖f − f̂_k‖²_ω ≤ ‖f − f_k‖²_ω + C_g (2πn)^{−1} k^{2γ+1}, where C_g is a constant depending only on g. Under the assumptions of Corollary 2.6 it is natural to restrict the set of suitable parameters k to K_n := {1, . . . , K_n} with K_n := n^{1/(2γ+1)} and to choose k_n :∈ arg min{‖f − f_k‖²_ω + C_g (2πn)^{−1} k^{2γ+1} : k ∈ K_n}. Unfortunately, this choice is not feasible since it depends on the unknown density f itself. We therefore estimate the bias term by its empirical counterpart and select the cut-off parameter fully data-driven as

k̂ :∈ arg min{−‖f̂_k‖²_ω + pen(k) : k ∈ K_n} with pen(k) := χ C_g (2πn)^{−1} k^{2γ+1} (2.8)

for χ > 0. The following theorem shows that this procedure is adaptive up to a negligible term.
The proof of Theorem 2.7 is postponed to appendix B. The assumption that ‖ωf_Y‖_∞ is finite is rather weak; for instance, for γ > 1/2 it is satisfied as soon as f, g ∈ L2(R+, ω), since then ‖ωf_Y‖_∞ ≤ ‖f‖_ω ‖g‖_ω (compare section 3). The last assertion establishes an oracle inequality assuming a smooth error density as in [G1]. For a super smooth error density with exponential decay of its Mellin transform (see Belomestny et al. [2020]) a result similar to Theorem 2.7 can be derived from Lemma B.4 in the appendix, provided the upper bound K_n and the penalty terms pen(k), k ∈ {1, . . . , K_n}, are chosen accordingly. We omit the details, however, since the minimax theory presented in the next section does not cover super smooth error densities.
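A minimal sketch of a penalised, data-driven cut-off choice of the kind described above: by the Plancherel-type isometry, ‖f̂_k‖²_ω can be computed from the estimated Mellin transform, and the penalty is taken proportional to the variance bound k^{2γ+1}/n (here γ = 1 for uniform U). The penalty constant χ = 2, the grid, the densities and the seed are illustrative assumptions, not the calibrated values of the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
X = rng.gamma(2.0, 1.0, size=n)
U = rng.uniform(0.0, 1.0, size=n)
Y = X * U
logY = np.log(Y)

chi = 2.0                                    # illustrative penalty constant χ
t_grid = np.linspace(-15.0, 15.0, 751)
dt = t_grid[1] - t_grid[0]
M_Y = np.array([np.mean(np.exp(1j * tt * logY)) for tt in t_grid])
M_f_hat = M_Y * (1.0 + 1j * t_grid)          # divide by M_1[g](t) = 1/(1+it)

def contrast(k):
    sel = np.abs(t_grid) <= k
    # ||f̂_k||²_ω via the isometry of the Mellin transform
    norm2 = np.sum(np.abs(M_f_hat[sel]) ** 2) * dt / (2 * np.pi)
    # penalty proportional to the variance bound (γ = 1 for uniform errors)
    pen = chi * np.sum(1.0 + t_grid[sel] ** 2) * dt / (2 * np.pi * n)
    return -norm2 + pen

ks = np.arange(1, 16)
k_hat = int(ks[np.argmin([contrast(k) for k in ks])])
```

The selected k_hat balances the gain in the explained norm against the growing penalty, mimicking the infeasible bias-variance trade-off without knowledge of f.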

Minimax theory
In this section we develop the minimax theory for the estimator proposed in section 2. Over the Mellin-Sobolev spaces we derive an upper and a lower bound for the mean weighted integrated squared error, which coincide up to a multiplicative constant, showing that our estimator is minimax-optimal over these spaces.
Regularity assumptions Let us define for s ≥ 0 and L > 0 the ellipsoids W_s(L) := {h ∈ W_s : |h|²_s ≤ L}, which correspond to the Mellin-Sobolev spaces defined in (2.5). We see that for any f ∈ W_s(L) we have ‖f − f_k‖²_ω = (2π)^{−1} ∫_{R\[−k,k]} |M_1[f](t)|² dt ≤ (2π)^{−1} L k^{−2s}. We denote the subset of densities by D_{s,L}(R+) := {h ∈ W_s(L) : h is a density}. Together with Corollary 2.6, whose constant C_g depends only on the error density g, these considerations imply the following theorem, whose proof is omitted.
Theorem 3.1. Assume that g satisfies assumption [G1]. Then for k_o := n^{1/(2s+2γ+1)} the estimator f̂_{k_o} satisfies sup_{f ∈ D_{s,L}(R+)} E‖f − f̂_{k_o}‖²_ω ≤ C(L, g) n^{−2s/(2s+2γ+1)} for a positive constant C(L, g). As mentioned before, for γ > 1/2 we have ‖ωf_Y‖_∞ ≤ ‖f‖_ω ‖g‖_ω. Thus, we can state the following corollary, which is a direct consequence of Theorem 2.7 and Theorem 3.1.
where χ is defined in (2.8). In fact, this covers the model considered by Belomestny et al. [2016] as a generalisation of multiplicative censoring, where the error densities are given by the family (g_k)_{k∈N}; the uniformly distributed error corresponds to k = 1. Again, we can include the case of direct observations, obtaining a rate of n^{−2s/(2s+1)} over the ellipsoid.
To prove that the rate of Theorem 3.1 is minimax-optimal over the ellipsoids under certain assumptions on g, we finish this section by stating a lower bound result. We want to emphasize that up to now we had no constraints on the support of g. To prove the lower bound we will need to assume that g has bounded support; for the sake of simplicity we assume that the support of g is contained in [0, 1]. Precisely, we assume that there exist parameters γ ∈ N and positive constants such that the decay condition [G1'], a support-constrained analogue of [G1], holds.

Theorem 3.4. Let s, γ ∈ N and assume that [G1'] holds. Then there exist constants C_g, L_{s,g}, n_{s,γ} > 0 such that for all L ≥ L_{s,g}, n ≥ n_{s,γ} and for any estimator f̂ of f based on an i.i.d. sample Y_1, . . . , Y_n from f_Y, the maximal risk sup_{f ∈ D_{s,L}(R+)} E‖f − f̂‖²_ω is bounded from below by a multiple of n^{−2s/(2s+2γ+1)}.

We want to emphasize that the error densities (g_k)_{k∈N} with g_k(x) = k(1 − x)^{k−1} 1_{(0,1)}(x), x ∈ R+, fulfill both assumptions [G1] and [G1']. Thus, in this situation our estimation strategy is minimax-optimal. The proof of the lower bound can be extended to the case of directly observed X_1, . . . , X_n or to different weight functions ω_α, α ≥ 0.
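The claim that the beta-type densities g_k satisfy the polynomial decay in [G1] with γ = k can be checked numerically: direct integration gives M_1[g_k](t) = k·B(1 + it, k) = Γ(1 + it)Γ(k + 1)/Γ(k + 1 + it), whose modulus decays like |t|^{−k}. The sketch below verifies that |M_1[g_k](t)| (1 + t²)^{k/2} stays bounded between positive constants for k = 3.

```python
import numpy as np
from scipy.special import gamma

def mellin_g(k, t):
    # M_1[g_k](t) for g_k(x) = k(1-x)^{k-1} 1_(0,1)(x):
    # ∫_0^1 x^{it} k(1-x)^{k-1} dx = k·B(1+it, k) = Γ(1+it)Γ(k+1)/Γ(k+1+it)
    return gamma(1.0 + 1j * t) * gamma(k + 1.0) / gamma(k + 1.0 + 1j * t)

k = 3
ts = np.linspace(0.0, 50.0, 501)
# if [G1] holds with γ = k, this ratio is bounded away from 0 and ∞
ratio = np.abs(mellin_g(k, ts)) * (1.0 + ts ** 2) ** (k / 2)
```

For k = 3 the ratio increases from 1 at t = 0 towards k! = 6, consistent with |M_1[g_k](t)| ∼ (1 + t²)^{−k/2}.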

Numerical study
Let us illustrate the performance of the estimator f̂_k̂ defined in (2.7) with the data-driven choice of the cut-off parameter. By minimising an integrated weighted squared error over a family of histogram densities with randomly drawn partitions and weights, we select χ = 1.2, χ = 0.8 and χ = 0.01 for the cases γ = 0, γ = 1 and γ = 2, respectively, where χ is the penalty constant, see (2.8).
In the direct case we compare the estimator f̂_k̂ with the data-driven density estimator f̃ from the work of Brenner Miguel and Johannes [2020], which is based on the adaptive aggregation of projection estimators with respect to the Laguerre basis.

Comment Since the estimator f̂_k̂ is built to minimize the weighted global risk, it seems natural that it behaves worse in the region close to zero than the estimator f̃, which is built to minimize the unweighted global risk. This effect is observable in fig. 1. Furthermore, the developed minimax theory suggests that the cases U ∼ U[0,1] and U ∼ U[0.5,1.5] are of similar complexity, which is reflected in the plots of fig. 2. For the case U ∼ β(1, 2), that is parameter γ = 2 in [G1], both theory and simulation imply that the recovery of the density f based on the noisy sample Y_1, . . . , Y_n leads to a more difficult inverse problem than in the other cases.
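The complexity comparison can be made concrete through the decay of |M_1[g]|: for U ∼ U[0,1] and U ∼ U[0.5,1.5] it decays like |t|^{−1} (γ = 1), while for U ∼ β(1, 2) it decays like |t|^{−2} (γ = 2). A small numerical check, using closed forms that follow from direct integration (my own computation, stated as an illustration):

```python
import numpy as np

ts = np.linspace(1.0, 200.0, 400)

# U ~ U[0,1]:      M_1[g](t) = 1/(1+it)
m_a = np.abs(1.0 / (1.0 + 1j * ts))
# U ~ U[0.5,1.5]:  M_1[g](t) = (1.5^{1+it} - 0.5^{1+it})/(1+it)
m_b = np.abs((1.5 ** (1 + 1j * ts) - 0.5 ** (1 + 1j * ts)) / (1.0 + 1j * ts))
# U ~ Beta(1,2), density 2(1-x): M_1[g](t) = 2/((1+it)(2+it))
m_c = np.abs(2.0 / ((1.0 + 1j * ts) * (2.0 + 1j * ts)))

scale = np.sqrt(1.0 + ts ** 2)
r_a = m_a * scale            # bounded: both uniforms give γ = 1
r_b = m_b * scale
r_c = m_c * scale ** 2       # bounded: Beta(1,2) gives γ = 2
```

Both uniform errors yield ratios bounded between constants after rescaling by (1 + t²)^{1/2} (same degree of ill-posedness), whereas the Beta(1,2) error needs the factor (1 + t²), reflecting the harder inverse problem.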

Appendix A Preliminaries
Properties of the Mellin transform Assume that h ∈ L1,loc(R+) is an at least b-times differentiable function, where h^{(b)} denotes its b-th derivative, b ∈ N, and that c − b ∈ Ξ_h and c + a ∈ Ξ_h, a ∈ N. Then the Mellin transform of h^{(b)} can be expressed in terms of the Mellin transform of h.

Lemma A.1. (Talagrand's inequality) Let X_1, . . . , X_n be independent Z-valued random variables and let ν̄_h := n^{−1} Σ_{i=1}^n (ν_h(X_i) − E[ν_h(X_i)]) for ν_h belonging to a countable class of measurable real-valued functions with sup_h n^{−1} Σ_{i=1}^n Var(ν_h(X_i)) ≤ τ.
Remark A.2. Keeping the bound (A.2) in mind, let us specify particular choices of the constant K. The next bound is then an immediate consequence. In the sequel we will make use of the slightly simplified bounds (A.3) rather than (A.2).

B Proofs of section 2
Proof of Proposition 2.2.
For any t ∈ R we have E(|M̂_α(t) − M_α[f](t)|²) = n^{−1} Var(X_1^{α−1+it}) ≤ σ² n^{−1} with σ² := E(X_1^{2(α−1)}). Now by using Fubini we get the assertion.

Proof of Proposition 2.3. The main strategy of this proof relies on the well-known fact that W^s = {H ∈ L2(R) : H weakly differentiable up to order s, H^{(i)} ∈ L2(R), i ∈ {0, . . . , s}} for s ∈ N, and on the already discussed connection between the Mellin transform and the Fourier transform. Further, we want to stress that for a function H ∈ L2(Ω), Ω ⊂ R open, being weakly differentiable corresponds to being locally absolutely continuous on Ω, that is, H is absolutely continuous on all compact intervals [a, b] ⊂ Ω, a < b ∈ Ω. Let us start by assuming f ∈ W_s. Then F := (ωf) ∘ ϕ ∈ W^s, which means that F is s-times weakly differentiable with F^{(i)} ∈ L2(R) for i ∈ {0, . . . , s}. Thus F is an (s − 1)-times continuously differentiable function; more precisely, there exists a representative of the equivalence class of F which is (s − 1)-times continuously differentiable. Now, since f = ω^{−1} F(ϕ^{−1}), we can deduce that f itself is (s − 1)-times continuously differentiable. Let us define the operator T mapping h to ωh′. From this it follows directly that ω^j f^{(j)} ∈ L2(R+, ω) for j ∈ {0, . . . , s}. As the next step we show that f^{(s−1)} is locally absolutely continuous. To do so, we first see that for j ∈ {0, . . . , s} we have T^j[f] ∈ W_1. Now, using the following lemma, this implies that ω^{s−1} f^{(s−1)} is absolutely continuous as a linear combination of {T^j[f] : j ∈ {0, . . . , s}}.
Lemma B.2. Let h ∈ W_1. Then h is a locally absolutely continuous function with derivative h′ : R+ → R and h′, ωh′ ∈ L2(R+, ω).

Let now δ denote the derivative of ω^{s−1} f^{(s−1)}. Then by Lemma B.2 we see that ωδ ∈ L2(R+, ω), using the integration by parts rule for absolutely continuous functions (see Cohn [2013]). Finally, we have that ω^s f^{(s)} = δω − (s − 1)ω^{s−1} f^{(s−1)} ∈ L2(R+, ω). Let us now show the other direction. Indeed, let us assume that f is (s − 1)-times continuously differentiable, that f^{(s−1)} is locally absolutely continuous with derivative f^{(s)}, and that ω^j f^{(j)} ∈ L2(R+, ω) for j ∈ {0, . . . , s}. We can conclude that T^j[f] ∈ W_1 for j ∈ {0, . . . , s − 1} by applying the following lemma.
Proof of Lemma B.2. Since h ∈ W_1, we have that H = (ωh) ∘ ϕ lies in the Sobolev space of order 1. Equivalently, H is locally absolutely continuous with derivative H′ and H, H′ ∈ L2(R). From this we can conclude that h is locally absolutely continuous, applying the integration by parts rule for absolutely continuous functions.
Proof of Theorem 2.7. Let us define the nested subspaces (S_k)_{k∈R+} by S_k := {h ∈ L2(R+, ω) : M_1[h](t) = 0 for all |t| ≥ k}. For any h ∈ S_k we consider the empirical contrast γ_n(h). By definition of k̂ we have γ_n(f̂_k̂) − pen(k̂) ≤ γ_n(f̂_k) − pen(k) ≤ γ_n(f_k) − pen(k) for any k ∈ K_n. Now using (B.1) we get the first bound. Note that S_{k1} ⊆ S_{k2} for k1 ≤ k2. Let us denote a ∨ b := max(a, b) and define for all k ∈ K_n the unit balls B_k := {h ∈ S_k : ‖h‖_ω ≤ 1}. Next we make use of the elementary inequality 2ab ≤ a² + b². Putting all this together, defining the function p accordingly, and assuming now that χ ≥ 12 C_g π^{−1}, we get that 4p(k̂ ∨ k) ≤ pen(k) + pen(k̂), and thus the claimed decomposition. We will use the following lemma, which will be proven afterwards.
Lemma B.4. Assume that ‖ωf_Y‖_∞ < ∞ and that for all k ∈ K_n the function G_k : R → R+ is bounded. Then the stated concentration bound holds, where ∆_g is defined in (B.2).

Now, under [G1], we have ∆_g(k) ≤ c_g k^{2γ+1}, and for all t ∈ R it holds that |G_k(t)| ≤ C_g k^{2γ}; thus the first summand is bounded accordingly. For the second summand we use that n^{−1} ∆_g(k) ≤ C_g n^{−1} k^{2γ+1} ≤ C_g for k ∈ K_n, and is thus bounded in n. Applying the lemma, we get the stated bound. Since this inequality holds for all k ∈ K_n, the claim follows.
Proof of Lemma B.4. We will use the Talagrand inequality (A.3) to show the claim. We want to emphasize that we are able to apply the Talagrand inequality to the sets B_k since B_k has a dense countable subset and due to continuity arguments. We start by determining the constant Ψ². We have for any h ∈ B_k a bound on ν̄²_h such that 6Ψ² = p(k). Next we consider ψ. Let y > 0 and h ∈ B_k; then using the Cauchy-Schwarz inequality we obtain a bound for |ν_h(y)|.

C Proofs of section 3
Proof of Theorem 3.4. First we outline the main steps of the proof. We will construct a family of functions in D_{s,L}(R+) by a perturbation of the density f_o : R+ → R+ with small bumps, such that their L2(R+, ω)-distance and the Kullback-Leibler divergence of their induced distributions can be bounded from below and above, respectively. The claim then follows by applying Theorem 2.5 in Tsybakov [2008]. We use the following construction, which we present first. Denote by C_c^∞(R) the set of all smooth functions with compact support in R, and let ψ ∈ C_c^∞(R) be a function with support in [0, 1] and ∫_0^1 ψ(x) dx = 0. For each K ∈ N (to be selected below) and k ∈ {0, . . . , K} we define the bump functions ψ_{k,K}(x) := ψ(xK − K − k), x ∈ R, and for j ∈ N_0 the finite constant C_{j,∞} := max(‖ψ^{(l)}‖_∞ : l ∈ {0, . . . , j}). Let us further define the primitive operator S, and set S^1 := S and S^n := S ∘ S^{n−1} for n ∈ N, n ≥ 2. Now, for j ∈ N, we define a function for x ∈ R+ and constants c_{i,j} ≥ 1, and let c_j := Σ_{i=1}^j c_{i,j}. For a bump amplitude δ > 0, γ ∈ N and a vector θ = (θ_1, . . . , θ_K) ∈ {0, 1}^K we define the perturbed functions f_θ. Until now we did not give a sufficient condition ensuring that the constructed functions {f_θ : θ ∈ {0, 1}^K} are in fact densities. This condition is given by the following lemma.
Further, one can show that these densities all lie inside the ellipsoids D_{s,L}(R+) for L big enough. This is captured in the following lemma.
Proof of Lemma C.3.