Density Deconvolution with Non-Standard Error Distributions: Rates of Convergence and Adaptive Estimation

A standard assumption in the density deconvolution problem is that the characteristic function of the measurement error distribution does not vanish on the real line. While this condition is assumed in the majority of existing works on the topic, there are many problem instances of interest where it is violated. In this paper we focus on non-standard settings where the characteristic function of the measurement errors has zeros, and study how the multiplicity of the zeros affects the estimation accuracy. For a prototypical problem of this type we demonstrate that the best achievable estimation accuracy is determined by the multiplicity of the zeros, the rate of decay of the error characteristic function, and the smoothness and tail behavior of the estimated density. We derive lower bounds on the minimax risk and develop estimators that are optimal in the minimax sense. In addition, we consider the problem of adaptive estimation and propose a data-driven estimator that automatically adapts to the unknown smoothness and tail behavior of the density to be estimated.


Introduction
Density deconvolution is the problem of estimating a probability density from observations contaminated by additive measurement errors. Specifically, assume that we observe a random sample Y_1, ..., Y_n generated by the model Y_i = X_i + ε_i, where the X_i's are i.i.d. random variables with unknown density f with respect to the Lebesgue measure on R, the ε_i's are i.i.d. measurement errors with distribution function G, and the X_i's are independent of the ε_i's. The objective is to estimate f on the basis of the sample Y^n := {Y_1, ..., Y_n}. Since Y_i is the sum of two independent random variables, X_i and ε_i, the density f_Y of Y_i is given by the convolution (1.1). An estimator of the value f(x_0) is a measurable function of Y^n, f̂(x_0) = f̂(x_0; Y^n), and the risk of f̂(x_0) is defined in terms of E_f, where E_f stands for the expectation with respect to the probability measure P_f generated by the observation Y^n when the unknown density of the X_i's is f. For a particular functional class F, the accuracy of f̂(x_0) is measured by the maximal risk over F, and an estimator f̂_*(x_0) is called rate optimal, or optimal in order, on F if its maximal risk is within a constant factor of the minimax risk R*_n[F]. Here the infimum in the definition of R*_n[F] is taken over all possible estimators of f(x_0). The objective in the density deconvolution problem is to construct an optimal-in-order estimator and to study the rate at which the minimax risk R*_n[F] converges to zero as n → ∞. In what follows we refer to the latter as the minimax rate of convergence.
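As a purely illustrative sketch (not part of the paper), the observation model can be simulated as follows; the choice of f as the standard normal density and of G as the uniform distribution on [−1, 1], as well as the sample size, are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Y_i = X_i + eps_i: the target density f is observed only through
# noisy sums; f and the error law below are hypothetical choices.
X = rng.standard_normal(n)         # X_i ~ f (unknown in practice)
eps = rng.uniform(-1.0, 1.0, n)    # eps_i ~ G, uniform on [-1, 1]
Y = X + eps                        # the observed sample Y^n

# By independence, Var(Y) = Var(X) + Var(eps) = 1 + 1/3.
print(Y.var())
```

Only the contaminated sample Y is available to the statistician; recovering f from it is the deconvolution problem.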
The outlined problem is the subject of a vast literature under various assumptions on the functional class F and the distribution of measurement errors G; see, e.g., Carroll and Hall [5], Stefanski and Carroll [21], Zhang [22], Fan [8], Butucea and Tsybakov [3,4], Meister [19], and Lounici and Nickl [17] for representative publications, where further references can be found. Typically F is a class of functions satisfying smoothness conditions (e.g., Hölder or Sobolev functional classes). The assumptions on the measurement error distribution are usually stated in terms of the characteristic function of G and read as follows.
Assumption (E0) is inarguably conventional and is presumed in nearly all works dealing with density deconvolution problems. Under Assumption (E0) the accuracy in estimating f is determined by the rate at which φ_g tends to zero and by the smoothness of f as characterized by the functional class F. Condition (E0-I) ensures that the statistical model is identifiable (it is well known that if φ_g vanishes on a set of non-zero Lebesgue measure then f is not identifiable). It underlies the applicability of the standard Fourier-transform-based techniques for constructing estimators of f. Note, however, that (E0-I) does not hold if φ_g has isolated zeros, which is the case in many interesting situations, e.g., for continuous distributions with compactly supported densities or for general discrete distributions. For example, if G is the uniform distribution on [−1, 1] then φ_g(iω) = sin(ω)/ω has zeros at ω = ±πk, k ∈ N, and (E0-I) is not fulfilled.
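The isolated zeros in this uniform-error example are easy to check numerically. The following snippet is an illustration only; `phi_uniform` is a hypothetical helper name.

```python
import numpy as np

# For G uniform on [-1, 1], the characteristic function is
# phi_g(i*omega) = sin(omega)/omega, which vanishes exactly at
# omega = pi*k, k = 1, 2, ..., so condition (E0-I) fails.
def phi_uniform(omega):
    # np.sinc(x) = sin(pi*x)/(pi*x), hence sin(w)/w = np.sinc(w/pi)
    return np.sinc(omega / np.pi)

zeros = np.pi * np.arange(1, 6)
print(phi_uniform(zeros))        # numerically zero at each isolated zero
print(phi_uniform(np.pi / 2))    # nonzero away from the zeros: 2/pi
```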
The settings in which the error characteristic function φ_g may have isolated zeros have been studied to a considerably lesser extent; the available results in this area are fragmentary and disparate. Devroye [7] pointed out that the density f can be estimated consistently in the L_1-norm when the characteristic function φ_g of the error distribution is non-zero almost everywhere. Although this is a quite general result, the convergence is not uniform, and the accuracy is not assessed under the minimax criterion. Several previous studies investigated the problem with the uniform error distribution. In particular, Groeneboom and Jongbloed [13] and Feuerverger et al. [9] demonstrate that zeros of the characteristic function φ_g do not influence the minimax rate of convergence: it remains the same as under condition (E0-I) when the estimated density f is supported on the positive real line [13], or has bounded second moment [9]. Considering a more general class of so-called Fourier-oscillating error distributions, Delaigle and Meister [6] derive a similar result for densities f having a finite left endpoint. In contrast to the aforementioned results, Hall and Meister [14] demonstrate that for the class of Fourier-oscillating error distributions zeros of the error characteristic function lead to a slower minimax convergence rate than the one under condition (E0-I). Hall and Meister [14] suggest a "ridge" modification of the kernel density deconvolution estimator in which the characteristic function of the error distribution is regularized to avoid singularities caused by the zeros. For another closely related work we also refer to Meister [18].
Recently a principled method for solving density deconvolution problems under general assumptions on the error characteristic function has been proposed in Belomestny and Goldenshluger [2]. This method uses the Laplace transform (the Fourier transform in the complex domain) in conjunction with the linear functional strategy for constructing rate-optimal kernel deconvolution estimators. The results show that zeros of the error characteristic function have no influence on the achievable estimation accuracy when, in addition to the usual smoothness conditions, the estimated density f has sufficiently light tails. On the other hand, if f is heavy tailed then the zeros of the error characteristic function affect the minimax rates of convergence, which become slower. Belomestny and Goldenshluger [2] provide an explicit condition on the tail behavior of f and the geometry of the zeros of φ_g under which the minimax rates of convergence are not influenced by the zeros of φ_g.
In this paper we focus on the setting where φ_g has zeros and f is heavy tailed relative to the multiplicity m of the zeros of φ_g on the imaginary axis. Prototypical settings of this type arise when the measurement error distribution is the binomial distribution Bin(m, 1/2) or the m-fold convolution of uniform distributions on [−θ, θ]. Utilizing the methodology proposed in [2], we develop rate-optimal estimators of f and investigate their properties. It is shown that, in contrast to the well known results under Assumption (E0), in the considered regime the minimax rate of convergence is determined not only by the smoothness of f and the rate at which φ_g tends to zero, but also by the tail behavior of f and the zero multiplicity of φ_g. The derived lower bounds on the minimax risk demonstrate that the dependence of the estimation accuracy on these factors is essential.
The construction of the proposed rate-optimal estimator of f depends on tuning parameters, and their specification requires prior information on the smoothness and tail behavior of f. In practice such information is rarely available. To overcome this difficulty we propose and study an adaptive estimator of f based on the methodology developed in Goldenshluger and Lepski [11,12]. An interesting feature of the proposed estimator is that it involves two tuning parameters, and the adaptation is not only with respect to the unknown smoothness, but also with respect to the unknown tail behavior of f. We derive an oracle inequality for the developed adaptive estimator and show that it achieves the minimax rate of convergence up to a logarithmic factor, which is an unavoidable price for adaptation in point-wise estimation.
The rest of the paper is organized as follows. In Section 2 we present the general idea of the estimator construction and introduce our estimator. Section 3 deals with minimax estimation of f(x_0) over appropriate functional classes. In Section 4 we introduce the corresponding adaptive procedure and investigate its properties. Lastly, Section 5 is reserved for discussion and concluding remarks. All proofs are deferred to the Appendix.

Idea of Construction
We start by presenting the key idea of the estimator construction in our density deconvolution problem. The construction uses the Laplace transform (the Fourier transform in the complex domain), which allows us to handle the situation where the first condition of Assumption (E0) is not satisfied. Our goal here is to convey the main idea of the construction; for further details we refer to Belomestny and Goldenshluger [2].
The following definitions will be used throughout the study. For a generic function w the bilateral Laplace transform of w is defined by φ_w(z) := ∫_{−∞}^{∞} e^{−zx} w(x) dx. The convergence region Σ_w of this integral (if it exists) is a vertical strip in the complex plane, Σ_w = {z ∈ C : Re(z) ∈ (σ_w^−, σ_w^+)} for some σ_w^−, σ_w^+ ∈ R, and φ_w(z) is analytic in Σ_w. The inverse Laplace transform is w(x) = (2πi)^{−1} ∫_{s−i∞}^{s+i∞} e^{zx} φ_w(z) dz, s ∈ (σ_w^−, σ_w^+). For the error distribution function G we write φ_g(z) := ∫_{−∞}^{∞} e^{−zx} dG(x), and note that the convergence region necessarily includes the imaginary axis {z ∈ C : Re(z) = 0}, with φ_g(iω) being the characteristic function of G. In what follows we assume that Σ_g is a vertical strip in the complex plane, Σ_g := {z ∈ C : Re(z) ∈ (σ_g^−, σ_g^+)} for some σ_g^− < 0 < σ_g^+. Our estimator uses a kernel whose construction relies upon the linear functional strategy for the solution of ill-posed problems (see, e.g., [10]). Let K ∈ C^∞(R) be a kernel supported on [−1, 1] satisfying the standard conditions ∫_{−1}^{1} K(x) dx = 1 and ∫_{−1}^{1} x^j K(x) dx = 0, j = 1, ..., k, for a fixed k ∈ Z_+. Note that φ_K(z) is an entire function, i.e., Σ_K = C. We would like to find a function L : R → R such that, for any given x_0, relation (2.3) holds, where we recall that f_Y and f are related to each other by the convolution integral (1.1). If a function L satisfying (2.3) is found, then a reasonable estimator of f(x_0) is given by the empirical estimator of the integral on the left-hand side of (2.3) based on the sample Y^n. In our deconvolution problem this strategy is realized as follows.
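As a numerical sanity check of the bilateral Laplace transform definition (an illustration, not part of the construction), one can verify the standard closed form φ_w(z) = 1/(1 − z²) on the strip |Re(z)| < 1 for the two-sided exponential density w(x) = e^{−|x|}/2:

```python
import numpy as np

# Discretize phi_w(z) = int e^{-z x} w(x) dx by the trapezoid rule on a
# wide grid; the integrand decays exponentially for |Re(z)| < 1.
x = np.linspace(-40.0, 40.0, 400_001)
dx = x[1] - x[0]
w = 0.5 * np.exp(-np.abs(x))       # Laplace (two-sided exponential) density

def phi_w(z):
    y = np.exp(-z * x) * w
    return float(np.sum(y[:-1] + y[1:]) * dx / 2)

print(phi_w(0.0))   # total mass: 1
print(phi_w(0.5))   # closed form: 1 / (1 - 0.25) = 4/3
```

The transform is real-analytic in Re(z) on the strip, and its value on the imaginary axis recovers the characteristic function, as used above for φ_g.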
In addition to the analyticity of φ_g in Σ_g we suppose that φ_g(z) does not vanish on the set {z : Re(z) ∈ (κ_g^−, κ_g^+) \ {0}} for some κ_g^−, κ_g^+ such that σ_g^− ≤ κ_g^− < 0 < κ_g^+ ≤ σ_g^+. Note that φ_g may have zeros on the imaginary axis {z : Re(z) = 0}, so that the conventional Fourier transform technique does not work in this situation. Let S_g := {z : Re(z) ∈ (−κ_g^+, −κ_g^−) \ {0}}; in fact, S_g is the union of two open vertical strips in the complex plane having the imaginary axis as their common boundary. Note that φ_g(−z) ≠ 0 on S_g, and for h > 0 define φ_L as displayed. Obviously, φ_L is analytic on S_g, and we define the kernel L_h^s as the inverse Laplace transform of φ_L in (2.4). Depending on the sign of the parameter s, formula (2.4) defines two different kernels, denoted in the sequel by L_h^+(·) for s > 0 and L_h^−(·) for s < 0, provided the integral on the right-hand side of (2.4) is absolutely convergent. While a general form of the kernel L_h^s is given in (2.4), it is useful to specialize it to particular error distributions. We do this in the next subsection for error characteristic functions φ_g having zeros on the imaginary axis.

Measurement Error Distributions
The following assumption on the characteristic function of the measurement errors was introduced in [2].
Assumption (E1). φ_g is analytic in Σ_g := {z : Re(z) ∈ (σ_g^−, σ_g^+)} with σ_g^− < 0 < σ_g^+ and admits the representation (2.5), where {a_k}_{k=1}^q and {b_k}_{k=1}^q are real numbers, a_k > 0 and b_k ∈ [0, 2π) for all k, {m_k}_{k=1}^q are nonnegative integers, and the pairs {(a_k, b_k)}_{k=1}^q are distinct. The function ψ(z) has a representation in which ψ_0(z) is analytic and has no zeros in a vertical strip Σ_ψ. Assumption (E1) postulates that the characteristic function φ_g(z) is analytic in a vertical strip and can be factorized into a product of two functions: the first has zeros on the imaginary axis, while the second does not vanish in the strip. Under (2.5), the zeros of φ_g(z) are z_{k,j} = i(b_k + 2πj)/a_k, j = 0, ±1, ±2, ..., z_{k,j} ≠ 0, and the multiplicity of z_{k,j} equals m_k for every j.
Assumption (E1) is rather general: it holds for a wide class of discrete and continuous distributions; for specific examples we refer to [2, Section 3.2]. Since the main focus of this study is the effect of the multiplicity of the zeros of φ_g(z) on the estimation accuracy, we concentrate on the following prototypical examples. (a) Let G be the m-fold convolution of the uniform distribution on [−θ, θ]; then φ_g(iω) = [sin(θω)/(θω)]^m, whose zeros ω = πk/θ, k ∈ Z \ {0}, all have multiplicity m. (b) Let G be the distribution function of the binomial random variable with parameters m and p = 1/2; then Assumption (E1) holds with q = 1, a_1 = 1, b_1 = π and ψ(z) = 2^m.
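The multiplicity structure in example (b) can be inspected numerically. The snippet below is illustrative only; it uses the Fourier sign convention φ(ω) = ((1 + e^{iω})/2)^m, whose zero locations on the imaginary axis coincide with those under the paper's Laplace convention φ_g(z) = E e^{−zε}.

```python
import numpy as np

# Characteristic function of Bin(m, 1/2): phi(omega) = ((1 + e^{i*omega})/2)^m.
# It has zeros of multiplicity m at odd multiples of pi, and the zeros are
# isolated (|phi(2*pi)| = 1 again).
m = 3

def phi_binomial(omega, m=m):
    return ((1 + np.exp(1j * omega)) / 2) ** m

print(abs(phi_binomial(np.pi)))        # a zero: exactly 0
print(abs(phi_binomial(2 * np.pi)))    # away from the zeros: back to 1

# Near omega = pi, |phi(pi + d)| ~ (d/2)^m, confirming multiplicity m:
d = 1e-3
print(abs(phi_binomial(np.pi + d)) / d**m)   # approaches (1/2)^m = 0.125
```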

Estimator and Zero Multiplicity
Under Assumption (E1) the kernel in (2.4) takes the following particular form, where C_{j,m} := binom(j + m − 1, m − 1) is the number of weak compositions of j into m parts (see, e.g., [20]). Note that the derived kernels L_h^± are not integrable, and, in general, condition (2.3) is fulfilled only if f has sufficiently light tails. For this reason, in the estimator construction we truncate the infinite series at a cut-off parameter N, arriving at the kernels (2.9) and (2.10) for examples (a) and (b), respectively.
The multiplicity of the zeros clearly manifests itself in the construction of the kernels L^±_{h,N}: in setting (a) the multiplicity m determines the degree of ill-posedness of the deconvolution problem, and in both settings the coefficients C_{j,m} in (2.9) and (2.10) grow with m, which affects the variance of the corresponding estimators in the case of heavy tailed densities f. Intuitively, the larger the multiplicity m, the flatter the characteristic function φ_g(z) in the vicinity of its zeros, and the harder the deconvolution problem.
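The combinatorial identity behind the coefficients C_{j,m} can be verified by brute force; this is an illustrative check, not part of the estimator, and the helper names are hypothetical.

```python
from math import comb
from itertools import product

# C_{j,m} = binom(j + m - 1, m - 1) counts the weak compositions of j into
# m parts: ordered m-tuples of nonnegative integers summing to j.
def C(j, m):
    return comb(j + m - 1, m - 1)

def count_weak_compositions(j, m):
    # brute force over all m-tuples with entries in {0, ..., j}
    return sum(1 for t in product(range(j + 1), repeat=m) if sum(t) == j)

for j in range(5):
    assert C(j, 3) == count_weak_compositions(j, 3)
print([C(j, 3) for j in range(5)])   # → [1, 3, 6, 10, 15]
```

The polynomial growth of C_{j,m} in j, of order j^{m−1}, is what inflates the variance of the truncated-series estimators when heavy tails of f force a large cut-off N.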
Based on the derived kernels we define the estimators f̂^±_{h,N}(x_0) of f(x_0), where h and N are two tuning parameters that should be specified.

Minimax Results
In this section we derive upper bounds on the risk of the estimators constructed in the previous section, and show that they are rate optimal over functional classes characterized by smoothness and tail conditions. The analysis of the risk for the two estimators in cases (a) and (b) coincides in almost every detail. Therefore in the sequel we concentrate on example (a); the corresponding results for the binomial error distribution are discussed in Section 5.

Functional Classes
The following assumption introduces the functional class over which the accuracy of f̂^±_{h,N}(x_0) will be assessed.
Assumption (F). Let A and B be positive real numbers.
(I) For α > 0, a probability density f belongs to the functional class H_α(A) if f is ⌊α⌋ := max{n ∈ N ∪ {0} : n < α} times continuously differentiable and the displayed Hölder condition holds. (II) Let q be a positive real number. We say that a probability density f belongs to the functional class N_q(B) if the displayed tail bound holds. Combining the two conditions in Assumption (F), we define the functional class W_{α,q}(A, B) := H_α(A) ∩ N_q(B). Remark. While the first condition defines the usual Hölder class H_α(A), the second imposes a uniform upper bound on the decay of the tails of the density to be estimated. Note that this tail condition is comparable to the moment condition in [2, Definition 3].

Rates of Convergence
Now we are in a position to establish upper bounds on the maximal risk of the estimators f̂^±_{h,N}(x_0), and define (3.5), where C_1 is a constant independent of A and B. Remark.
(a) The result of Theorem 1 shows how the tail behavior of f and the zero multiplicity m affect the estimation accuracy. If the tail of f is sufficiently light, i.e., q > 2m − 1, then the risk of f̂_{h_*,N_*}(x_0) converges to zero at the rate n^{−α/(2α+2m+1)}, which is the rate obtained in the ordinary smooth case with γ = m and non-vanishing characteristic function φ_g [see Assumption (E0)]. On the other hand, for heavy tailed densities f with q < 2m − 1 the maximal risk of f̂_{h_*,N_*}(x_0) converges at a slower rate, and the parameter r in (3.4) characterizes the deterioration of the convergence rate. (b) The existence of different regimes depending on the tail behavior of f and the zero multiplicity m was noticed in [2]; however, the case of heavy tailed densities was not studied there.
The next theorem provides a lower bound on the minimax risk of estimation over the functional class W_{α,q}(A, B).
where ν is defined in (3.4), and C 2 is a positive constant independent of A. Remark.
(a) Theorems 1 and 2 show that there are two regimes in the behavior of the minimax risk. These regimes are determined by the tail behavior of the estimated density f and the multiplicity of the zeros of the error characteristic function φ_g. In the light tail regime, q > 2m − 1, the zeros of φ_g have no influence on the minimax rate of convergence: it is fully determined by the tail behavior of φ_g. On the other hand, if q < 2m − 1 (the heavy tail regime) then the zeros of φ_g have a significant influence on the minimax rate: it becomes much slower than in the case of non-vanishing φ_g. (b) Theorems 1 and 2 demonstrate that the proposed estimator f̂_{h_*,N_*}(x_0) is rate optimal in both the light tail and heavy tail regimes. We note that on the boundary q = 2m − 1 between the two regimes there is a logarithmic gap between the upper and lower bounds of Theorems 1 and 2.
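To illustrate the light tail regime only (the heavy tail exponent involves the parameter r from (3.4), which we do not reproduce here), the rate exponent α/(2α + 2m + 1) can be tabulated for a few values of the multiplicity m:

```python
# Light-tail (q > 2m - 1) rate exponent: n^{-alpha/(2*alpha + 2*m + 1)},
# the same as for an ordinary-smooth, non-vanishing phi_g with gamma = m.
def light_tail_exponent(alpha, m):
    return alpha / (2 * alpha + 2 * m + 1)

for m in (1, 2, 4):
    print(m, light_tail_exponent(2.0, m))   # m=1 gives 2/7 ≈ 0.2857
```

Larger multiplicity m shrinks the exponent, i.e., makes the problem harder, consistent with the intuition that a flatter φ_g near its zeros carries less information about f.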
Thus far, the risk bounds have been established over the functional class W_{α,q}(A, B) defined in Assumption (F). Although these conditions are quite reasonable in the context of density deconvolution, they involve an extra assumption on the tail behavior of f, and it is natural to ask what happens when the tail condition does not hold. The next result provides an answer to this question.
Remark. In view of (3.6), the rate of convergence ψ_n on the functional class H_α(A) is significantly slower than the one achieved on H_α(A) in the setting with non-vanishing characteristic function φ_g. Note that the upper bound in (3.7) is achieved on a slightly smaller functional class. The assumption f ∈ N_1(B) is very mild and is fulfilled by virtually any probability density; however, it does not hold uniformly over all densities. We were not able to derive the upper bound (3.7) without this additional condition.

Adaptive Procedure
The minimax results of the previous section can be achieved only when information about the functional class is available in advance. This is evident from the fact that the optimal choice of the tuning parameters h_* and N_* requires knowledge of the functional class. However, in most applications it is extremely rare to have advance information about the functional class in which the target density f resides. Therefore, it is natural to ask whether one can construct an estimator with equivalent or comparable accuracy guarantees without knowing the functional class parameters.
In this section we develop an adaptive estimator of f(x_0) whose construction is based on data-driven selection from a family of estimators {f̂_{h,N}(x_0) : (h, N) ∈ H × N}, where f̂_{h,N}(x_0) is defined in the previous section, and H and N are fixed sets of bandwidths and cut-off parameters. Since the estimators f̂_{h,N}(x_0) depend on two tuning parameters, we adopt the general method of adaptive estimation proposed in [11].

Selection Rule
Let H and N be the discrete sets defined as follows: for real numbers 0 < h_min < h_max = θ and an integer N_max to be specified later, set H and N as displayed, and let T := H × N. The adaptive estimator is based on data-driven selection from the family F(T). For the sake of definiteness, in the sequel we assume that x_0 ≥ 0 and consider the estimators f̂^+_τ(x_0) only; the case x_0 < 0 and f̂^−_τ(x_0) is handled in exactly the same way. The selection rule uses auxiliary estimators that are constructed as follows. For τ, τ′ ∈ T let τ ∨∧ τ′ := (h ∨ h′, N ∧ N′) denote the operation of coordinate-wise maximum and minimum. With any pair τ, τ′ ∈ T we associate the estimator f̂^+_{τ∨∧τ′}(x_0) [cf. (2.11)]. Selection rules based on convolution-type auxiliary kernel estimators are developed in [11,12], while Lepski [16] uses auxiliary estimators based on the operation of point-wise maxima over multiple bandwidths. Our construction is close in spirit to the latter; it is dictated by the structure of the estimators f̂^±_{h,N}(x_0) in the deconvolution problem. An important ingredient in the construction of the proposed selection rule is a uniform upper bound on the stochastic error of the estimator f̂^+_τ(x_0), τ ∈ T. For τ ∈ T the stochastic error ξ_τ(x_0) is given by (4.1), with the quantities σ_τ and u_τ defined via (2.9). The proof of Theorem 1 shows that var_f{ξ_τ(x_0)} ≤ σ²_τ/n. For a real number κ > 0 that will be specified later, put Λ_τ(κ) as displayed. In Lemma 1 in the Appendix we demonstrate that Λ_τ(κ) is a uniform upper bound on |ξ_τ(x_0)| in the sense that all moments of the random variable sup_{τ∈T} [|ξ_τ(x_0)| − Λ_τ(κ)]_+ are suitably small as κ increases. Note, however, that Λ_τ(κ) cannot be used in the selection rule because it depends on the unknown density. In order to overcome this problem we consider a data-driven uniform upper bound on ξ_τ(x_0) that is constructed as follows. For τ ∈ T let σ̂²_τ be the empirical estimator of σ²_τ, and let Λ̂_τ(κ) := 7σ̂_τ (2κ/n)^{1/2} + 2u_τ κ/(3n). (4.5) With the introduced notation the selection rule is the following.
For any τ ∈ T define R̂_τ(x_0) as in (4.6). Then the adaptive estimator f̂_*(x_0) is defined by (4.7). Remark. The defined selection rule is fully data-driven; it only requires specification of the parameter κ in (4.5). This parameter provides uniform control of the stochastic errors of the family of estimators F(T) and bears no relation to the properties of the density to be estimated. In addition, the parameters h_min and N_max should be chosen; they determine the sets of admissible bandwidths H and cut-off parameters N.
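The structure of a selection rule of this type can be sketched on a toy family of estimators. The stand-in estimators and thresholds below are hypothetical and do not implement the paper's f̂_τ and Λ̂_τ; only the pattern is shown: penalize pairwise discrepancies by the stochastic-error bounds, then minimize the resulting surrogate risk.

```python
import random

# Toy family: one index t plays the role of tau = (h, N); est[t] is a
# stand-in estimator with variance decreasing in t and bias increasing
# in t, and Lam[t] is a stand-in stochastic-error bound.
random.seed(1)
true_value = 1.0
taus = [1, 2, 4, 8]
est = {t: true_value + random.gauss(0, 0.1) / t + 0.05 * t for t in taus}
Lam = {t: 0.3 / t for t in taus}

def surrogate_risk(t):
    # excess pairwise discrepancy over the competitor's error bound,
    # plus this index's own error bound (a simplified Lepski-type rule)
    excess = max(abs(est[t] - est[s]) - Lam[s] for s in taus)
    return max(excess, 0.0) + Lam[t]

t_hat = min(taus, key=surrogate_risk)
print(t_hat, est[t_hat])
```

The selected index mimics a bias-variance trade-off without knowing either term separately, which is the essence of the data-driven choice of (h, N).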
Remark. Explicit expressions for constants C 1 , C 2 and C 3 appear in the proofs of Theorem 3 and Lemma 2. Note that the oracle inequality holds for any probability density f , without any functional class assumptions.
The oracle inequality (4.11) allows us to derive the following result on the accuracy of the adaptive estimator f̂_*(x_0) on the class W_{α,q}(A, B), with the tuning sets specified by (4.13). Let f̂_*(x_0) be the estimator defined by selection rule (4.6)-(4.7) and associated with the parameter κ = κ_* := 5 log n; then the stated risk bound holds, where ϕ(·) is defined in (3.5), and C does not depend on A and B.
Remark. Note that the resulting rate is the same as the rate of convergence in Theorem 1 up to the extra log n factor. It is well known from Lepski [15] that this factor cannot be avoided in adaptive nonparametric estimation of a function at a single point.

Concluding Remarks
We close this paper with a few concluding remarks.
In this paper we concentrated on the setting where the error distribution is the m-fold convolution of the uniform distribution on [−θ, θ]. Here the error characteristic function has an infinite number of isolated zeros on the imaginary axis, each of the same multiplicity m. Note that the results of Theorems 1, 2, and Corollary 2 also hold for the binomial error distribution Bin(m, 1/2) with the following minor changes in notation: in (3.4) the parameter ν should be redefined as ν = 1/(2α + 1 + r), and in (3.5) and in the statement of Theorem 2 the expression A^{(2m+1)/α} should be replaced by A^{1/α}. The specific form of the error characteristic functions used in this paper facilitates the derivation of lower bounds on the minimax risk. In general, however, the proposed technique is applicable to other error distributions whose characteristic function has zeros on the imaginary axis.
We developed rate optimal estimators with respect to the point-wise risk. It is worth noting that there is a significant difference between the settings with point-wise and L_2-risks when the error characteristic function has zeros on the imaginary axis. This fact has already been noticed in [2]. Some results for density deconvolution with L_2-risk and non-standard error distributions appeared in [18] and [14]. In general, deconvolution problems under global losses with non-standard error distributions deserve a thorough study.

A.1. Proof of Theorem 1
Proof. In the subsequent proof c_1, c_2, ... stand for positive constants independent of A and B. Without loss of generality we assume that x_0 ≥ 0; the proof for the case x_0 < 0 is identical in every detail. We follow the ideas of the proof of Theorem 2 in [2].
(a). We begin by bounding the variance of f̂^+_{h,N}(x_0). It is shown in [2] that the variance of f̂^+_{h,N}(x_0) is bounded from above as follows, where we have used that C_{j,m} = binom(j + m − 1, m − 1) and θ > h for large n. The term Σ_{j=0}^{N} C²_{j,m} S_{2,j} is also bounded from above by the same expression as on the right-hand side of (A.2). Furthermore,

Combining (A.3), (A.2) and (A.1) we conclude that (A.4) holds. (b). Now we bound the bias of f̂^+_{h,N}(x_0). It is shown in [2] that the bias admits the displayed representation. Taking into account that f ∈ N_q(B), we obtain the displayed estimate for any j = 1, ..., m. This leads to the following upper bound on the bias of f̂_{h,N}(x_0): (A.5). (c). We complete the proof by combining the bounds in (A.4) and (A.5) in the cases q > 2m − 1, q = 2m − 1 and q < 2m − 1. Straightforward algebra shows that the following choice of h = h_* and N = N_* yields the result of the theorem, where the constants c_1, ..., c_6 do not depend on A and B.

A.2. Proof of Theorem 2
Proof. Without loss of generality we fix x_0 = 0. The proof is split into several steps: (i) defines two functions in W_{α,q}(A, B) and bounds their point-wise distance; (ii) bounds the χ²-divergence between the densities of the observations; (iii) specifies the proper tuning parameters and derives the rate of the lower bound; and (iv) deals with the derivation of the lower bound in the light tail regime.
(i). For s > 1/2 define where C(s) is a normalizing constant depending on s. Then, f 0 ∈ N q (B) for 1 < q ≤ 2s since f 0 (x) ≤ C(s)/x 2s ≤ B/x q for x > 1 with properly chosen B > 0. In addition, since f 0 is infinitely differentiable, f 0 ∈ H α (A) for any α with properly chosen A.
Given positive h with h < π/θ and N ∈ N, define φ_η as displayed. Note that φ_η is supported on the union of the sets A_k(h) in (A.10). Then define the function η through the inverse Fourier transform as follows: In the subsequent proof the parameters h and N are specified so that h → 0 and N → ∞ as n → ∞; thus we tacitly assume that N is large and h is small for large enough sample size n.
Given real numbers M > 0 and c_0 > 0, define f_1 as in (A.13). We demonstrate that, under an appropriate choice of c_0 and M, f_1 is a probability density from W_{α,q}(A, B) for any h and N. Observe that φ_η(0) = 0 implies ∫_{−∞}^{∞} η(x) dx = 0, so that f_1 integrates to one. Moreover, since φ_{η_0} is infinitely differentiable and compactly supported, η_0 is a rapidly decreasing function, i.e., |η_0^{(j)}(x) x^ℓ| ≤ c_{j,ℓ} for any j, ℓ = 0, 1, 2, .... In particular, for some constant c_1(s) depending on s only, one has |η_0(x)| ≤ c_1(s)|x|^{−2s} for all x ∈ R. It follows from (A.12) that |η(x)| ≤ c_2 h^{−2s+1} N |x|^{−2s} for x ∈ R. Therefore, choosing c_0 small enough, f_1 is non-negative, and hence it is a probability density. Moreover, f_1 ∈ N_q(B) for q ≤ 2s. If α is a positive integer then it follows from (A.12) that the derivatives of η obey the displayed bound; therefore we can ensure f_1 ∈ H_α(A) by selecting h and N so that (A.14) holds. Thus, under (A.14) we have f_0, f_1 ∈ W_{α,q}(A, B). In addition, the point-wise distance between f_0 and f_1 at the origin is as displayed. (ii). Now we derive an upper bound on the χ²-divergence between the densities of the observations, f_{Y,0} = g ⋆ f_0 and f_{Y,1} = g ⋆ f_1, corresponding to f_0 and f_1. Consider the denominator g ⋆ f_0 of the integrand; it is bounded from below using the elementary inequality 1 + |x − y|² ≤ 2(1 + |x|²)(1 + |y|²) for all x, y. Then the χ²-divergence can be bounded as in (A.16). Let us handle the second integral on the right-hand side. For any positive integer s, the derivatives of φ_g can be expanded by the Faà di Bruno formula for j ∈ N: if φ_{g_0}(ω) := sin(θω)/(θω) then φ_g(ω) = [φ_{g_0}(ω)]^m, and the expansion involves the Bell polynomials B_{j,l}. Recall that B_{j,l} is a homogeneous polynomial of degree l in j variables. Combining the above results and the fact that the sets A_k(h) in (A.10) are disjoint for k = N + 1, ..., 2N, we bound the integral in (A.17). In addition, the first integral on the right-hand side in (A.16) can be bounded using the case s = 0. Therefore, for positive integer s, the bound (A.19) holds. The same upper bound holds for any non-integer s ≥ 0; this follows from the interpolation inequality for Sobolev spaces, see, e.g., Aubin [1] for details.
(iii). Now, based on (A.14) and (A.19), we specify the parameters h = h_* and N = N_* as displayed. Under this choice (A.14) holds, and χ²(f_{Y,1}, f_{Y,0}) ≤ c_15/n. The lower bound on the minimax risk is then obtained by plugging these expressions in (A.15) and letting 2s = q > 1; this yields (A.20). (iv). To complete the proof of the theorem it remains to observe that in the considered problem the standard lower bound (A.21) on the minimax risk can also be established. For completeness, we provide a proof sketch. Let f_0 be given by (A.9), and let η be the function defined via its Fourier transform φ_η, where φ_{η_0} is a function with properties (a)-(c) above. The function f_1 is defined by (A.13), and the choice M = Ah^{α+1} together with the properties of the function η_0 guarantees that f_1 is a density from the class W_{α,q}(A, B) with q ≤ 2s. With this construction, |f_0(0) − f_1(0)| = c_0 M η(0) = c_16 Ah^α. The upper bound on the χ²-divergence between f_{Y,0} and f_{Y,1} is computed along the same lines as above, with the following modifications: we now apply (A.18) and use the properties of the function φ_η. The same upper bound holds for the integral ∫_{−∞}^{∞} |(g ⋆ η)(x)|² dx, which leads to the stated bound. Combining (A.20) and (A.21), and noting that the corresponding relation between the exponents holds for 1 < q < 2m − 1, we complete the proof.

A.3. Proof of Corollary 1
Proof. The upper bound (3.7) is obtained directly from Theorem 1 applied with q = 1. We need to establish (3.6) only. The proof goes along the lines of the proof of Theorem 2 with minor modifications indicated below. Define f_0 as displayed, where h > 0 is a parameter to be specified. Obviously, f_0 ∈ H_α(A) for small enough h. Therefore, for x_0 = 0, we have the displayed point-wise distance, and the bound on the χ²-divergence takes the corresponding form.

A.4. Proof of Theorem 3

Proof. (I). We start from the decomposition of the error of f̂_τ(x_0), where B_τ(x_0; f) is the bias term and ξ_τ(x_0) is the stochastic error given by (4.1). The bias term is expressed as follows (see the proof of Theorem 1): therefore, by the definitions of B̄_h(f) and B̄_N(x_0; f) [see (4.8), (4.9)], we have the displayed bound, where B̄_τ(x_0; f) is defined in (4.10).
(II). Now we demonstrate that the displayed bound holds. For this purpose, denote the quantities as displayed and write the corresponding decomposition. In view of (A.24), for any pair τ = (h, N), τ′ = (h′, N′) we obtain the displayed inequality. We then consider the three terms on the right-hand side of (A.25) separately, arriving at the displayed bounds, where B̄_h(f), B̄_N(x_0; f) and B̄_τ(x_0; f) are defined in (4.8), (4.9), and (4.10), respectively.
(III). Let τ̂ = (ĥ, N̂) be the parameter selected by the rule (4.6)-(4.7). For any τ ∈ T we have, by the triangle inequality, the displayed bound. We now bound the terms on the right-hand side separately.
We begin with the following simple observation: it follows from (4.6) that

Hence, by (A.29), R̂_τ(x_0) admits the displayed bound. Therefore, for any τ, τ′ ∈ T, we have the displayed inequality, where the last step follows from the definition of R̂_τ(x_0). This inequality together with (A.31) implies the following bound on the first term on the right-hand side of (A.30), where in the penultimate inequality we have used that R̂_τ̂(x_0) ≤ R̂_τ(x_0) for any τ ∈ T.

A.5. Proof of Corollary 2
Proof. Below c_1, c_2, ... stand for positive constants independent of n, A and B. The proof goes along the following lines: we select values of h and N from H × N and apply the oracle inequality of Theorem 3.
To complete the proof we note that This completes the proof.

A.6. Auxiliary Results
Denote Then