Deconvolution for an atomic distribution

Let $X_1,...,X_n$ be i.i.d. observations, where $X_i=Y_i+\sigma Z_i$ and $Y_i$ and $Z_i$ are independent. Assume that unobservable $Y$'s are distributed as a random variable $UV,$ where $U$ and $V$ are independent, $U$ has a Bernoulli distribution with probability of zero equal to $p$ and $V$ has a distribution function $F$ with density $f.$ Furthermore, let the random variables $Z_i$ have the standard normal distribution and let $\sigma>0.$ Based on a sample $X_1,..., X_n,$ we consider the problem of estimation of the density $f$ and the probability $p.$ We propose a kernel type deconvolution estimator for $f$ and derive its asymptotic normality at a fixed point. A consistent estimator for $p$ is given as well. Our results demonstrate that our estimator behaves very much like the kernel type deconvolution estimator in the classical deconvolution problem.


Introduction
Let $X_1,\dots,X_n$ be i.i.d. copies of a random variable $X=Y+\sigma Z,$ where $X_i=Y_i+\sigma Z_i,$ and $Y_i$ and $Z_i$ are independent and have the same distributions as $Y$ and $Z,$ respectively. Assume that the $Y$'s are unobservable and that $Y=UV,$ where $U$ and $V$ are independent, $U$ has a Bernoulli distribution with probability of zero equal to $p$ (we assume that $0\leq p<1$) and $V$ has a distribution function $F$ with density $f.$ Furthermore, let the random variable $Z$ have a standard normal distribution and let $\sigma$ be a known positive number. The random variable $X$ then has a density, which we denote by $q.$ The distribution of $Y$ is completely determined by $f$ and $p$ and has an atom at zero. Based on a sample $X_1,\dots,X_n,$ we consider the problem of (nonparametric) estimation of the density $f$ and the probability $p.$
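The observation scheme above is straightforward to simulate. The sketch below is purely illustrative: the helper name and the choice $V\sim N(3,9)$ (used again in the simulation section) are ours, not prescribed by the model.

```python
import numpy as np

def simulate_sample(n, p, sigma, rng, draw_v):
    """Draw X_i = U_i V_i + sigma Z_i, with P(U = 0) = p and Z standard normal."""
    u = (rng.random(n) >= p).astype(float)   # U = 0 with probability p, else 1
    v = draw_v(n, rng)                        # V has density f
    z = rng.standard_normal(n)                # Z ~ N(0, 1), independent of (U, V)
    return u * v + sigma * z

rng = np.random.default_rng(0)
# Illustrative choice: V ~ N(3, 9), as in the first simulation example
x = simulate_sample(100_000, p=0.1, sigma=1.0, rng=rng,
                    draw_v=lambda n, r: 3.0 + 3.0 * r.standard_normal(n))
# Since E[Z] = 0 and U, V are independent, E[X] = E[Y] = (1 - p) E[V] = 2.7
```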
Our estimation problem is closely related to the classical deconvolution problem, where the situation is as described above, except that in the classical case $p$ vanishes and $Y_i$ has a continuous distribution with density $f,$ which we want to estimate. The $Y_i$'s can for instance be interpreted as measurements of some characteristic of interest, contaminated by noise $\sigma Z_i.$ Some works on deconvolution include [3,4,6,7,9,10,11,13,14,16,19,20,21,22,23,28,30,32,35,38,39,42,43,45,46] and [50]. Practical problems related to deconvolution can be found e.g. in [31], which provides a general account of mixture models. The deconvolution problem is also related to empirical Bayes estimation of the prior distribution, see e.g. [2] and [33]. Yet another application field is nonparametric errors-in-variables regression, see [24].
Unlike the classical deconvolution problem, in our case $Y$ does not have a density, because the distribution of $Y$ has an atom at zero. Hence our results, apart from the direct applications below, also provide insight into the robustness of the deconvolution estimator when the assumption of absolute continuity is violated.
One situation where atomic deconvolution can arise is the following: one might think of the $X_i$'s as increments $X_i-X_{i-1}$ of a stochastic process $X_t=Y_t+\sigma Z_t,$ where $Y=(Y_t)_{t\geq 0}$ is a compound Poisson process with intensity $\lambda$ and jump size density $\rho,$ and $Z=(Z_t)_{t\geq 0}$ is a Brownian motion independent of $Y.$ The distribution of $Y_i-Y_{i-1}$ then has an atom at zero with probability equal to $e^{-\lambda},$ while $Z_i-Z_{i-1}$ has a standard normal distribution. Notice that $X=(X_t)_{t\geq 0}$ is a Lévy process, see Example 8.5 in [37]. An exponential of the process $X$ can be used to model the evolution of a stock price, see [34]. The law of $X$ is completely characterised by $f,$ $\lambda$ and $\sigma.$ Furthermore, estimation of $f$ in the atomic deconvolution context is closely related to estimation of the jump size density of a compound Poisson process $Y$ contaminated by noise coming from a Brownian motion, see [26].
Another practical situation arises in missing data problems. Suppose for instance that a measurement device is used to measure some quantity of interest and that it has a fixed probability $p$ of failing to detect this quantity, in which case it renders zero. Repeated measurements can be modelled by random variables $Y_i$ defined as above. Assume that our goal is to estimate the density $f$ and the probability $p.$ In practice measurements are often contaminated by an additive measurement error, and to account for this we add the noise $\sigma Z_i$ to our measurements ($\sigma$ quantifies the noise level). If we could use the measurements $Y_i$ directly, then the zero measurements could be discarded and we would have observations with density $f$ on which to base our estimator. However, due to the additional noise $\sigma Z_i,$ the zeroes cannot be distinguished from the nonzero $Y_i$'s. The use of deconvolution techniques is thus unavoidable. The same situation occurs for instance when the $Y_i$ are left-truncated at zero. In the error-free case, i.e. when $\sigma=0,$ estimation of the mean and variance of a positive random variable $V$ was considered in [1]. Our model appears to be more general.
In what follows, we first assume that $p$ is known and construct an estimator for $f.$ After this, in the model where $p$ is unknown, we will provide an estimator for $p$ and then propose a plug-in type estimator for $f.$ The estimator for $f$ will be constructed via methods similar to those used in the classical deconvolution problem. In particular we will use Fourier inversion and kernel smoothing. Let $\phi_X,$ $\phi_Y$ and $\phi_f$ denote the characteristic functions of the random variables $X,$ $Y$ and $V,$ respectively. Notice that the characteristic function of $Y$ is given by
$$\phi_Y(t)=p+(1-p)\phi_f(t). \quad (1.1)$$
Furthermore, since $\phi_X(t)=\phi_Y(t)e^{-\sigma^2t^2/2},$ the characteristic function of $V$ can be expressed as
$$\phi_f(t)=\frac{\phi_X(t)e^{\sigma^2t^2/2}-p}{1-p}.$$
Assuming that $\phi_f$ is integrable, by Fourier inversion we get
$$f(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,\frac{\phi_X(t)e^{\sigma^2t^2/2}-p}{1-p}\,dt. \quad (1.2)$$
An obvious way to construct an estimator of $f(x)$ from this relation is to estimate the characteristic function $\phi_X(t)$ by its empirical counterpart,
$$\phi_{emp}(t)=\frac{1}{n}\sum_{j=1}^{n}e^{itX_j},$$
see e.g. [25] for a discussion of its applications in statistics, and then obtain the estimator of $f$ by a plug-in device. Alternatively, one can estimate the density $q$ of $X$ by a kernel estimator
$$q_{nh}(x)=\frac{1}{nh}\sum_{j=1}^{n}w\left(\frac{x-X_j}{h}\right),$$
where $w$ denotes a kernel function and $h>0$ is a bandwidth. Denote by $\phi_w$ the Fourier transform of the kernel $w.$ The characteristic function of $q_{nh},$ which is equal to $\phi_{emp}(t)\phi_w(ht),$ will serve as an estimator of $\phi_q,$ the characteristic function of $q.$ A naive estimator of $f$ can then be obtained by a plug-in device, and would be
$$\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,\frac{\phi_{emp}(t)\phi_w(ht)e^{\sigma^2t^2/2}-p}{1-p}\,dt. \quad (1.3)$$
However, this procedure is not always meaningful, because the integrand in (1.3) is not integrable in general. Therefore, instead of (1.3), we define our estimator of $f$ as
$$f_{nh}(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,\phi_w(ht)\,\frac{\phi_{emp}(t)e^{\sigma^2t^2/2}-p}{1-p}\,dt, \quad (1.4)$$
where the integral is well-defined under the assumption that $\phi_w$ has a compact support on $[-1,1].$ Notice that
$$f_{nh}(x)=\frac{1}{1-p}\left(\tilde f_{nh}(x)-p\,w_h(x)\right), \quad (1.5)$$
where
$$\tilde f_{nh}(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,\phi_w(ht)\,\phi_{emp}(t)\,e^{\sigma^2t^2/2}\,dt \quad (1.6)$$
and $w_h(x)=(1/h)w(x/h).$ Hence $\tilde f_{nh}$ has the same form as an ordinary deconvolution kernel density estimator based on the sample $X_1,\dots,X_n,$ see e.g. pp. 231–232 in [49].
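The estimator (1.4) can be sketched numerically. The code below is a minimal illustration, not the authors' implementation: the choice $\phi_w(t)=(1-t^2)^3$ on $[-1,1]$ is an assumed kernel transform with compact support, the quadrature is a plain Riemann sum, and the sample parameters are hypothetical.

```python
import numpy as np

def f_nh(x, data, p, sigma, h, m=801):
    """Sketch of estimator (1.4) by direct numerical integration over t in
    [-1/h, 1/h]; phi_w(t) = (1 - t^2)^3 on [-1, 1] is an illustrative choice."""
    t = np.linspace(-1.0 / h, 1.0 / h, m)
    dt = t[1] - t[0]
    phi_w = (1.0 - (h * t) ** 2) ** 3
    phi_emp = np.exp(1j * np.outer(t, data)).mean(axis=1)   # empirical char. function
    integrand = (np.exp(-1j * t * x) * phi_w
                 * (phi_emp * np.exp(sigma**2 * t**2 / 2) - p) / (1 - p))
    # Plain Riemann sum; phi_w vanishes at the interval endpoints
    return float(np.sum(integrand).real * dt / (2 * np.pi))

rng = np.random.default_rng(1)
n, p, sigma = 2000, 0.1, 1.0
y = np.where(rng.random(n) < p, 0.0, 3.0 + 3.0 * rng.standard_normal(n))
data = y + sigma * rng.standard_normal(n)
est = f_nh(3.0, data, p, sigma, h=0.58)   # true f(3) = 1/(3 sqrt(2 pi)) ~ 0.133
```

Up to smoothing bias and sampling noise, the estimate should land near $f(3)\approx 0.133$, while at points far in the tail it should be close to zero.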
Under the assumption of integrability of $\phi_f$ and some additional restrictions on $w,$ the bias of the estimator (1.4) asymptotically vanishes as $h\to 0.$ Indeed,
$$E[f_{nh}(x)]-f(x)=\int_{-\infty}^{\infty} w(u)\{f(x-hu)-f(x)\}\,du\to 0. \quad (1.7)$$
The result follows via the dominated convergence theorem, once we know that $\phi_w$ is bounded and $\phi_w(0)=1.$ Observe that (1.7) coincides with the bias of an ordinary kernel density estimator based on a sample from $f.$ In case we know that $f$ belongs to a specific Hölder class, it is possible to derive an order bound for (1.7) in terms of some power of $h,$ see Proposition 1.2 in [40]. Further properties of kernel density estimators can be found in [15,17,36,40,48] and [49].

Estimation of $p$ is not as easy as it might appear at first sight. Indeed, due to the convolution structure $X=Y+\sigma Z,$ the random variable $X$ has a density, and the atom in the distribution of $Y$ is not inherited by the distribution of $X.$ On the other hand, $p$ is identifiable, since
$$\lim_{t\to\infty}\phi_X(t)e^{\sigma^2t^2/2}=\lim_{t\to\infty}\{p+(1-p)\phi_f(t)\}=p,$$
because $\phi_f(t)\to 0$ as $t\to\infty$ by the Riemann–Lebesgue theorem. However, this relation cannot be used as a hint for the construction of a meaningful estimator of $p,$ because of the oscillating behaviour of $\phi_{emp}(t),$ the obvious estimator of $\phi_X(t),$ as $t\to\infty.$
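The identity behind (1.7) — the bias of $f_{nh}$ equals the ordinary kernel smoothing bias $\int w(u)\{f(x-hu)-f(x)\}\,du$ — can be checked numerically. A Gaussian $w$ and the normal target from the simulation section are used purely for illustration (the paper's kernels have compactly supported Fourier transforms):

```python
import numpy as np

def kernel_bias(x, h, f, w, half=50.0, m=20001):
    # Numerically evaluate (1.7): integral of w(u) {f(x - h u) - f(x)} du
    u = np.linspace(-half, half, m)
    du = u[1] - u[0]
    return float(np.sum(w(u) * (f(x - h * u) - f(x))) * du)

f = lambda x: np.exp(-(x - 3.0) ** 2 / 18.0) / np.sqrt(18.0 * np.pi)  # N(3, 9)
w = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)                # illustrative kernel
# The absolute bias at x = 1 shrinks as the bandwidth h decreases
biases = [abs(kernel_bias(1.0, h, f, w)) for h in (1.0, 0.5, 0.25)]
```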
As an estimator of $p$ we propose $p_{ng},$ defined in (1.8), where the number $g>0$ denotes a bandwidth and $\phi_k$ denotes the Fourier transform of a kernel $k.$ We assume that $\phi_k$ has support $[-1,1].$ The definition of $p_{ng}$ is motivated by the fact that a suitably normalised limit, as $g\to 0,$ recovers $p$; assuming the integrability of $\phi_f,$ the last equality in this limit relation follows from the limit behaviour of $\phi_X(t)e^{\sigma^2t^2/2}$ established above. Finally, let us consider the general case when both $p$ and $f$ are unknown. Plugging an estimator of $p$ into (1.4) leads to the following definition of an estimator of $f$:
$$f^*_{nhg}(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,\phi_w(ht)\,\frac{\phi_{emp}(t)e^{\sigma^2t^2/2}-\tilde p_{ng}}{1-\tilde p_{ng}}\,dt, \quad (1.9)$$
where
$$\tilde p_{ng}=\min(p_{ng},\,1-\epsilon_n). \quad (1.10)$$
Here $0<\epsilon_n<1$ and $\epsilon_n\downarrow 0$ at a suitable rate, which will be specified in Condition 1.5. The truncation of $p_{ng}$ in (1.10) is introduced for technical reasons, see formula (5.18), where we need the random variable $1-\tilde p_{ng}$ to be bounded away from zero.
In practice it may also happen that the error variance $\sigma^2$ is unknown and hence has to be estimated. This is a difficult problem in classical deconvolution density estimation if only the observations $X_1,\dots,X_n$ are available, as the convergence rate for estimation of $\sigma$ is not the usual $\sqrt{n}$ rate, see e.g. [35]. Moreover, the convergence rate of an estimator of $\sigma$ would dominate the asymptotics. If additional measurements are available, then, as suggested for instance in [9], $\sigma^2$ can be estimated e.g. via the empirical variance of the difference of replicated observations, or by the method of moments via instrumental variables. A recent paper on this subject is [14]. We do not pursue this question any further and assume that $\sigma$ is known.
Concluding this section, we introduce some technical conditions on the density $f,$ the kernels $w$ and $k,$ the bandwidths $h$ and $g,$ and the sequence $\epsilon_n.$ These are needed in the proof of Theorem 2.5, the main theorem of the paper, and in subsequent results. Weaker forms of these conditions are sufficient to prove the other results of Section 2 and will be given directly in the corresponding statements.
This condition is similar to the one used in [30] and [46] in the classical deconvolution problem. An example of a kernel that satisfies this condition is given by (1.12); its Fourier transform is given by (1.13). In this case $\alpha=2$ and $A=4.$ The kernel (1.12) and its Fourier transform are plotted in Figures 1 and 2.
as $t\downarrow 0.$ Here $B$ and $C$ are some constants, and $\gamma$ and $\alpha$ are the same as above.
An example of such a kernel is given by (1.16); its Fourier transform is given by (1.17). In the definition of the bandwidths in Condition 1.4, $\eta_n$ and $\delta_n$ are sequences such that $\eta_n\downarrow 0,$ $\delta_n\downarrow 0,$ $\eta_n-\delta_n>0,$ and the displayed requirements hold; an example of admissible $\eta_n$ and $\delta_n$ is given above. The conditions on the bandwidths $h_n$ and $g_n$ in Condition 1.4 are not the only possible ones, and other restrictions could be imposed as well. However, the logarithmic decay of $h_n$ and $g_n$ is unavoidable. Following the default convention in kernel density estimation, and to keep the notation compact, we suppress the index $n$ in $h_n$ and $g_n$ and simply write $h$ and $g,$ since no ambiguity will arise.
Condition 1.5. Let $\epsilon_n\downarrow 0$ be such that the displayed requirement holds. An example of such $\epsilon_n,$ for the $\eta_n$ and $\delta_n$ given above, is $(\log\log\log n)^{-1}.$
The remainder of the paper is organised as follows: in Section 2 we derive the theorem establishing the asymptotic normality of $f_{nh}(x),$ show that the estimator $p_{ng}$ is weakly consistent, and finally show that the estimator $f^*_{nhg}(x)$ is asymptotically normal. Section 3 contains simulation examples. Section 4 discusses a method for the practical implementation of the estimator. All the proofs are collected in Section 5.

Main results
We will first study the estimation of $f$ when $p$ is known, and then proceed to the general case with unknown $p.$ The reason for this is twofold. Firstly, it is interesting to compare the behaviour of the estimator of $f$ under the assumptions of known and unknown $p.$ Secondly, the proofs of the results for the latter case rely heavily on the proofs for the former case.
The first result in this section deals with the nonrobustness of the estimator $\tilde f_{nh}.$ In ordinary kernel deconvolution, when it is assumed that $Y$ is absolutely continuous, the estimator of its density is defined as the estimator $\tilde f_{nh}$ appearing in (1.5). Now suppose that the assumption of absolute continuity of $Y$ is violated. What happens if we still use the estimator $\tilde f_{nh}(x)$? The following result addresses this question.
From this theorem it follows that $E[\tilde f_{nh}(0)]$ diverges to infinity as $h\to 0,$ because so does $h^{-1}w(0),$ provided $w(0)\neq 0$ (the latter is the case for the majority of conventional kernels). In practice this also results in equally undesirable behaviour of $E[\tilde f_{nh}(x)]$ in a neighbourhood of zero. When $x\neq 0,$ with a proper selection of the kernel $w,$ one can achieve that the first term in (2.2) asymptotically vanishes as $h\to 0.$ Indeed, it is sufficient to assume that $w$ is such that $\lim_{u\to\pm\infty}uw(u)=0.$ The second term in (2.2) converges to $(1-p)f(x)$ as $h\to 0,$ provided that $\phi_f$ is integrable, $\phi_w$ is bounded and $\phi_w(0)=1.$ These facts address the issue of the nonrobustness of $\tilde f_{nh}$: under a misspecified model, i.e. under the assumption that the distribution of $Y$ is absolutely continuous while in fact it has an atom at zero, the classical deconvolution estimator exhibits unsatisfactory behaviour near zero. This happens despite the fact that $\tilde f_{nh}(x)$ is asymptotically normal when centred at its expectation and suitably normalised, see Corollary 5.1 in Section 5. The asymptotic normality follows from Lemmas 5.2 and 5.3 of Section 5, where only absolute continuity of the distribution of $X$ is required.
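The divergence at zero can be seen directly from the two-term expectation in (2.2), which by (1.1) reads $E[\tilde f_{nh}(x)]=p\,w_h(x)+(1-p)\int w(u)f(x-hu)\,du.$ The sketch below evaluates this expectation (not the estimator itself) at $x=0$; the Gaussian kernel and the normal target are illustrative stand-ins with $w(0)\neq 0$:

```python
import numpy as np

def mean_ftilde(x, h, p, f, w, half=50.0, m=20001):
    # E[f~_nh(x)] = p w_h(x) + (1 - p) * integral of w(u) f(x - h u) du, cf. (2.2)
    u = np.linspace(-half, half, m)
    du = u[1] - u[0]
    return p * w(x / h) / h + (1 - p) * float(np.sum(w(u) * f(x - h * u)) * du)

f = lambda y: np.exp(-(y - 3.0) ** 2 / 18.0) / np.sqrt(18.0 * np.pi)  # N(3, 9)
w = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)                # w(0) != 0
# E[f~_nh(0)] grows like p w(0) / h as the bandwidth shrinks
vals = [mean_ftilde(0.0, h, 0.1, f, w) for h in (0.4, 0.2, 0.1)]
```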
Our next goal is to establish the asymptotic normality of the estimator $f_{nh}(x).$ We formulate the corresponding theorem below.
Theorem 2.2. Assume that $\phi_f$ is integrable. Let $E[X^2]<\infty,$ and suppose that Condition 1.2 holds. Let $f_{nh}$ be defined as in (1.4). Then, as $n\to\infty$ and $h\to 0,$ the asymptotic normality (2.3) holds, where $\Gamma$ denotes the gamma function.

Note that Theorem 2.2 establishes asymptotic normality of $f_{nh}$ under an atomic distribution, which constitutes a generalisation of a result in [46] (see also [45]) for the classical deconvolution problem. The generalisation is possible because the proof uses only the continuity of the density of $X,$ which still holds when $Y$ has a distribution with an atom. Furthermore, notice that, in order to get a consistent estimator, it follows from this theorem that $\sqrt{n}\,h^{-(1+2\alpha)}e^{-\sigma^2/(2h^2)}$ has to diverge to infinity. Therefore the bandwidth $h$ has to be at least of order $(\log n)^{-1/2},$ as is indeed required in Condition 1.4. In practice this implies that the bandwidth $h$ has to be selected fairly large, even for large sample sizes. This is the case in the classical deconvolution problem as well, when the error distribution is supersmooth, cf. [46].
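To make the bandwidth requirement explicit, one can take $h=c(\log n)^{-1/2}$ for a constant $c>0$ (an illustrative parametrisation). Then $e^{-\sigma^2/(2h^2)}=n^{-\sigma^2/(2c^2)},$ so that

```latex
\sqrt{n}\,h^{-(1+2\alpha)}e^{-\sigma^2/(2h^2)}
  = c^{-(1+2\alpha)}\,(\log n)^{(1+2\alpha)/2}\; n^{1/2-\sigma^2/(2c^2)},
```

which diverges to infinity precisely when $\sigma^2/(2c^2)<1/2,$ i.e. when $c>\sigma$; the logarithmic factor is immaterial. Hence no bandwidth decaying faster than a multiple of $(\log n)^{-1/2}$ can yield a consistent estimator.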
Observe that the asymptotic variance in (2.3) depends neither on the target density $f$ nor on the point $x.$ This phenomenon is quite peculiar, but is already known in classical deconvolution kernel density estimation, see for instance equation (6) in [3]. There, provided that $h$ is small enough, the asymptotic variance of the deconvolution kernel density estimator (or, strictly speaking, an upper bound for it) also depends neither on the target density $f$ nor on the point $x.$ In this respect see also [46]. Such results do not contradict the asymptotic normality result in [21] (see Theorem 2.2 in that paper), as there the asymptotic variance of the deconvolution kernel density estimator is not evaluated. Now we state a theorem concerning the consistency of $p_{ng},$ the estimator of $p.$
Theorem 2.3. Let $E[X^2]<\infty,$ and let the kernel $k$ have a Fourier transform $\phi_k$ that is bounded by one and integrates to two. Let $p_{ng}$ be defined as in (1.8). If $g$ is such that $g^{4+4\alpha}e^{\sigma^2/g^2}n^{-1}\to 0,$ then $p_{ng}$ is a consistent estimator of $p,$ i.e.
Here $\epsilon$ is an arbitrary positive number. Furthermore, under Conditions 1.1 and 1.3, the displayed rate holds. One can also show that $p_{ng}$ is asymptotically normal, when centred and suitably normalised. We formulate the corresponding theorem below.
Theorem 2.4. Assume that the conditions of Theorem 2.3 hold. Let $p_{ng}$ be defined as in (1.8) and let (1.15) hold. Then the displayed asymptotic normality holds as $n\to\infty$ and $g\to 0.$
Finally, we consider the case when both $p$ and $f$ are unknown. We state the main theorem of the paper.
Notice that the asymptotic variance is the same as in (2.3), which justifies the plug-in approach to the construction of an estimator of $f$ when $p$ is unknown.
A natural question is what happens when we centre $f^*_{nhg}(x)$ not at its expectation, but at $f(x).$ This has practical importance as well, e.g. for the construction of (asymptotic) confidence intervals. Writing $f^*_{nhg}(x)-f(x)$ as the sum of the stochastic term $f^*_{nhg}(x)-E[f^*_{nhg}(x)]$ and the bias $E[f^*_{nhg}(x)]-f(x),$ we see that we have to study the second term, i.e. to compare the behaviour of the bias of $f^*_{nhg}(x)$ to the normalising factor $\sqrt{n}\,h^{-(1+2\alpha)}e^{-\sigma^2/(2h^2)}.$ We will study the bias of $f^*_{nhg}(x)$ in two steps: first we show that it asymptotically vanishes, which is of independent interest in itself. After this we provide conditions under which it still vanishes asymptotically when multiplied by $\sqrt{n}\,h^{-(1+2\alpha)}e^{-\sigma^2/(2h^2)}.$

Recall the definition of a Hölder class of functions $H(\beta,L).$

Definition 2.1. A function $f$ is said to belong to the Hölder class $H(\beta,L)$ if its derivatives up to order $l=[\beta]$ exist and satisfy the condition
$$|f^{(l)}(x)-f^{(l)}(y)|\leq L|x-y|^{\beta-l}.$$
Such a smoothness condition on a target density $f$ is standard in kernel density estimation, see e.g. p. 5 of [40]. Often one assumes that $\beta=2.$ If $l=0,$ then set $f^{(l)}=f.$

We also need the definition of a kernel of order $l.$ In particular, we will use the version given in Definition 1.3 of [40].

Definition 2.2. A kernel $w$ is said to be a kernel of order $l,$ for $l\geq 1,$ if the functions $x\mapsto x^jw(x)$ are integrable for $j=0,\dots,l$ and if $\int x^jw(x)\,dx=0$ for $j=1,\dots,l.$
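Definition 2.2 can be checked numerically for a given kernel. Below, the moments $\int x^jw(x)\,dx$ are computed for a Gaussian $w$ (an illustrative kernel, not one used in the paper): the first moment vanishes by symmetry while the second does not, so this $w$ is a kernel of order 1.

```python
import numpy as np

def kernel_moment(w, j, half=40.0, m=80001):
    # Numerically approximate the j-th moment: integral of x^j w(x) dx
    x = np.linspace(-half, half, m)
    return float(np.sum(x ** j * w(x)) * (x[1] - x[0]))

w = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)  # illustrative symmetric kernel
m0, m1, m2 = (kernel_moment(w, j) for j in (0, 1, 2))
# m0 ~ 1, m1 ~ 0 (symmetry), m2 ~ 1 != 0: w is a kernel of order 1
```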
Theorem 2.6. Let $f^*_{nhg}(x)$ be defined by (1.9) and assume the conditions of Theorem 2.5. Then, as $n\to\infty,$ the displayed convergence holds. Combining this theorem with Theorem 2.5 leads to the following result.
Theorem 2.7. Assume that the conditions of Theorem 2.6 hold. Then, as $n\to\infty,$ the displayed convergence holds. One should keep in mind that these results are only asymptotic. In the next section we study several simulation examples, which provide some insight into the finite-sample properties of the estimator.

Simulation examples
In this section we consider a number of simulation examples. We do not pretend to provide an exhaustive simulation study; rather, we give an illustration, whose conclusions require further verification.
Assume that $\sigma=1,$ $p=0.1$ and that $f$ is normal with mean 3 and variance 9. This results in a nontrivial deconvolution problem, because the ratio of 'noise' to 'signal' is reasonably high: $NSR=\mathrm{Var}[\sigma Z]/\mathrm{Var}[Y]\cdot 100\%\approx 11\%.$ We simulated a sample of size $n=1000.$ As the kernels $w$ and $k$ we selected the kernels (1.12) and (1.16), respectively. The bandwidths $h=0.58$ and $g=0.5$ were selected by hand. A possible method of computing the estimate is given in Section 4. The estimator $p_{ng}$ produced a value equal to 0.11. The estimate of $f$ (bold dotted line), resulting from the procedure described above, together with the target density $f$ (dashed line), is plotted in Figure 5. For comparison purposes, we also plotted the estimate $f_{nh}(x)$ (it can be obtained using (1.5) and the true value of the parameter $p$), see Figure 6. As can be seen from a comparison of these two figures, the estimates $f^*_{nhg}$ and $f_{nh}$ look rather similar.
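The quoted noise-to-signal ratio follows from a short computation: with $Y=UV$ and $U^2=U,$ we have $E[Y]=(1-p)E[V]$ and $E[Y^2]=(1-p)E[V^2].$ A quick check (plain arithmetic, no assumptions beyond the stated parameters):

```python
# NSR for the first simulation example: sigma = 1, p = 0.1, V ~ N(3, 9)
p, mu_v, var_v, sigma = 0.1, 3.0, 9.0, 1.0
ey = (1 - p) * mu_v                    # E[Y] = E[U] E[V] = 2.7
ey2 = (1 - p) * (var_v + mu_v ** 2)    # E[Y^2] = E[U] E[V^2] = 16.2
var_y = ey2 - ey ** 2                  # Var[Y] = 8.91
nsr = sigma ** 2 / var_y               # about 0.112, i.e. roughly 11%
```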
As the second example we consider the case when $f$ is a gamma density with parameters $\alpha=8$ and $\beta=1,$ i.e.
$$f(x)=\frac{x^7e^{-x}}{\Gamma(8)},\quad x>0,$$
and $p=0.25.$ We simulated a sample of size $n=1000.$ The kernels were chosen as above and the bandwidths $g=0.6$ and $h=0.6$ were selected by hand. The estimate $p_{ng}$ took a value approximately equal to 0.23. The resulting estimate $f^*_{nhg}$ is plotted in Figure 7. As above, we also plotted the estimate $\tilde f_{nh},$ see Figure 8 (notice that the estimate takes on negative values in a neighbourhood of zero). Again both figures look similar.
Examination of these figures leads to two questions: how well does $p_{ng}$ estimate $p$ for moderate sample sizes? And how sensitive is $f^*_{nhg}$ to under- or overestimation of $p$? To get at least a partial answer to the first question, we considered the same model as in the first example of this section (i.e. deconvolution of the normal density) and repeatedly, i.e. 1000 times, estimated $p$ with the bandwidth $g=0.5$ and the sample size $n=1000$ for each simulation run. The same procedure was then repeated for the bandwidths $g=0.55,$ $0.6$ and $0.65.$ The resulting histograms are plotted in Figures 9–12. They look quite satisfactory. The sample means and sample standard deviations (SD) of the estimates of $p$ for the different choices of the bandwidth $g,$ together with the theoretical standard deviations, are summarised in Table 1. One notices that the sample means in Table 1 are close to the true value 0.1 of the parameter $p.$ The theoretical standard deviations in the same table were computed using Theorem 2.4 (recall that in our case $\alpha=2$).

From Table 1 one sees that there is a large discrepancy between the sample standard deviations and the standard deviations predicted by the theory. The explanation of this discrepancy lies in the fact that the proof of the asymptotic normality of $p_{ng}$ relies heavily on the asymptotic equivalence (3.2), see Lemma 5.1 and the proof of Lemma 5.2 in Section 5 below. However, by direct evaluation of the integral on the left-hand side of (3.2) for different values of $g,$ it can be seen that this relation does not provide an accurate approximation when the bandwidth is relatively large, as it is in our case. It then follows that the asymptotic standard deviation will not approximate the sample standard deviation well unless the bandwidth is very small, which in turn would require an extremely large sample size. We can correct for this poor approximation of the integral in (3.2) by using the integral itself as a normalising factor instead of the right-hand side of (3.2). The results of this correction are presented in the last line of Table 1: the theoretical standard deviation and the sample standard deviation are now much closer to each other.

Since the kernel $k$ was selected more or less arbitrarily, one might be tempted to believe that the inaccurate approximation in (3.2) is due to the kernel. This might be the case; however, to a certain degree this behaviour seems to be characteristic of all popular kernels employed in kernel deconvolution. Consider for instance the kernel plotted, together with its Fourier transform, in Figures 13 and 14. This kernel was used for simulations in [22] and [47], and it was shown in [13] that it performs well in a deconvolution setting. Notice that this kernel cannot be used to estimate $p$ if we want to plug the resulting estimator $p_{ng}$ into $f^*_{nhg}.$ However, it satisfies Condition 1.2 and can be used to estimate $f.$ Nevertheless, the ratio of the left- and right-hand sides of (3.2) for $h=0.5$ equals 0.4299, which is still far from 1. This issue is further discussed in [42]. Another issue is that often the error variance $\sigma^2$ is quite small, and it is then sensible to treat $\sigma$ as depending on the sample size $n$ (with $\sigma\to 0$ as $n\to\infty$), see [9]. However, this is a different model and the question is not addressed here. Notice also that a perfect match between the sample standard deviation and the theoretical standard deviation is impossible to obtain, because we neglect a remainder term when computing the latter. How large the contribution of the remainder term can be in general would require a separate simulation study.
We also considered the case where the error variance and the sample size are smaller (the target density $f$ was again the standard normal density, while $p$ was set to 0.1). In particular, we took $\sigma=0.3$ and $n=500.$ The corresponding histograms are given in Figures 15–18, while the sample and theoretical characteristics for the four choices of the bandwidth $g=0.5,$ $0.55,$ $0.6$ and $0.65$ are summarised in Table 2. Notice the particularly bad match between the asymptotic standard deviation and its empirical counterpart. The other conclusions are similar to those in the previous example.

To test the robustness of the estimator $f^*_{nhg}$ with respect to the estimated value of $p,$ we again turned to the model considered in the first example of this section. Instead of $\tilde p_{ng},$ three different values $p=0.05,$ $p=0.1$ and $p=0.15$ were plugged into (4.2). The resulting estimates $f^*_{nhg}$ are plotted in Figure 19 (the true density is represented by the dashed line). As one can see from Figure 19, under- or overestimation of $p$ in the given range does not have a significant impact on the resulting estimate (of course, one should keep in mind that $p$ is relatively small in this case). On the other hand, if the value of $p$ were larger, e.g. $p=0.2,$ that would have a noticeable effect: it could, for instance, suggest bimodality where the density is actually unimodal, see Figure 20. At the same time, the simulation examples concerning the estimates $p_{ng}$ considered above suggest that such unsatisfactory estimates of $p$ are not too frequent, because most of the observed values of $p_{ng}$ are concentrated in the interval $[0.05, 0.15].$ We also considered the case where $f=0.5\phi_{-2,1}+0.5\phi_{2,1},$ where $\phi_{x,y}$ denotes the normal density with mean $x$ and variance $y.$ In this case $f$ is a mixture of two normal densities and is bimodal. The match is visually slightly worse for $p=0.2,$ but it is still acceptable.

The simulation examples considered in this section suggest that, despite the slow (logarithmic) rate of convergence, the estimator $f^*_{nhg}$ works in practice (given that $p$ is estimated accurately). This is comparable to the classical deconvolution problem, where finite-sample calculations in [47] showed that, for lower levels of noise, kernel estimators perform well for reasonable sample sizes, in spite of the slow rates of convergence for the supersmooth deconvolution problem obtained e.g. in [21] and [22]. However, Condition 1.4 tells us that the bandwidths $h$ and $g$ have to be of order $(\log n)^{-1/2}.$ In practice this implies that, to obtain reasonable estimates, the bandwidths have to be selected fairly large, even for large samples.
One more practical issue concerning the implementation of the estimator $f^*_{nhg}$ (or $p_{ng}$) is the method of bandwidth selection, which is not addressed in this paper. We expect that techniques similar to those used in the classical deconvolution problem will produce comparable results for our problem; this requires a separate investigation of the behaviour of the mean integrated squared error of $f^*_{nhg}.$ Papers that consider data-dependent bandwidth selection in the classical deconvolution problem are [10,11,18,28] and [39]. Yet another issue is the choice of the kernels $w$ and $k.$ For the classical deconvolution problem we refer to [13]. In general, in kernel density estimation the choice of the kernel is thought to be of less importance for the performance of an estimator than the choice of the bandwidth, see e.g. p. 31 in [48], or p. 132 in [49].

Computational issues
To compute the estimator $f^*_{nhg}$ in Section 3, a method similar to the one used in [44] (in turn motivated by [5]) can be employed. Namely, one starts from the Fourier inversion formula for the estimator. Using the trapezoid rule and setting $v_j=\eta(j-1),$ we discretise the integral, where $N$ is some power of 2 and $\eta$ is the discretisation step. The Fast Fourier Transform is used to compute the values of $f^{(1)}_{nh}$ at $N$ different points (concerning the application of the Fast Fourier Transform in kernel deconvolution see [12]). We employ a regular spacing $\delta,$ so that the values of $x$ are $x_u$ with $u=1,\dots,N.$ In order to apply the Fast Fourier Transform, note that we must take $\eta\delta=2\pi/N.$ It follows that a small $\eta,$ which is needed to achieve greater accuracy in integration, results in values of $x$ that are relatively far from each other. Therefore, to improve the integration precision, we apply Simpson's rule, i.e.
where $\delta_j$ denotes the Kronecker symbol (recall that $\delta_j$ equals 1 if $j=0$ and 0 otherwise). The same reasoning applies to $f^{(2)}_{nh}(x).$ The estimate $f^*_{nhg}$ can then be computed from $f^{(1)}_{nh}$ and $f^{(2)}_{nh}.$ One should keep in mind that, even though $w_h$ can be evaluated directly, it is preferable to use the Fast Fourier Transform for its computation, thus avoiding possible numerical issues, see [12]. Also notice that the direct computation of $\phi_{emp}$ is rather time-consuming for large samples. One way to avoid this problem is to use WARPing, cf. [27]. However, for the purposes of the present study we restricted ourselves to the direct evaluation of $\phi_{emp}.$
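The FFT step can be illustrated on a toy target. The sketch below evaluates $(1/2\pi)\int e^{-itx}\psi(t)\,dt$ on a grid obeying $\eta\delta=2\pi/N$ (with a plain Riemann sum rather than Simpson's rule, and a symmetric rather than one-sided $t$-grid, so the details differ from the scheme above); it is checked against $\psi(t)=e^{-t^2/2},$ whose inverse transform is the standard normal density.

```python
import numpy as np

def inverse_fourier_fft(psi, N=256, eta=0.1):
    """Evaluate I(x) = (1/2 pi) * integral of e^{-itx} psi(t) dt on a grid via the FFT.
    The grid spacings obey eta * delta = 2 pi / N, as required in the text."""
    delta = 2 * np.pi / (N * eta)
    j = np.arange(N)
    t = (j - N // 2) * eta                  # symmetric frequency grid
    x = (j - N // 2) * delta                # corresponding x grid
    # Riemann sum rearranged into a DFT (valid for N with N/2 even)
    vals = (eta / (2 * np.pi)) * ((-1.0) ** j) * np.fft.fft(((-1.0) ** j) * psi(t))
    return x, vals.real

x, approx = inverse_fourier_fft(lambda t: np.exp(-t ** 2 / 2))
exact = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # known inverse transform of psi
```

The alternating-sign trick absorbs the phase factors produced by centring both grids at zero; for rapidly decaying smooth $\psi$ the Riemann sum is already very accurate.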

Proofs
Proof of Theorem 2.1. The proof is elementary and is based on the definition of $\tilde f_{nh}(x).$ The claim follows by Fubini's theorem, together with the two auxiliary identities displayed above. This concludes the proof.
The proof of Theorem 2.2 is based on the following three lemmas, all of which are reformulations of results from [46].

Lemma 5.1. Assume Condition 1.2. For $h\to 0$ and fixed $\delta\geq 0$ we have (5.2).

Proof. We follow the same line of thought as in [46]. Using the substitution $s=1-h^2v$ and the dominated convergence theorem in the next-to-last step, we obtain the result. The lemma is proved.
For the remainder we have, by (5.4) and Lemma 5.1, the displayed bound, which follows from Chebyshev's inequality. Finally, we obtain the stated expansion, and this completes the proof of the lemma.
The next lemma establishes the asymptotic normality.
Lemma 5.3. Assume the conditions of Lemma 5.2 and let, for fixed $x,$ $U_{nh}(x)$ be defined by (5.3). Then, as $n\to\infty$ and $h\to 0,$ the stated convergence holds.

For $0\leq y<2\pi$ we have the displayed relation. Since $h\to 0,$ the last equivalence follows from a Riemann sum approximation of the integral and the continuity of the density $q$ of $X.$ Consequently, as $h\to 0,$ we have $Y_j\xrightarrow{D}U,$ where $U$ is uniformly distributed on the interval $[0,2\pi].$ Since the cosine is bounded and continuous, it then follows by the dominated convergence theorem that the corresponding expectations converge. To prove the asymptotic normality of $U_{nh}(x),$ first note that it is a normalised sum of i.i.d. random variables. We verify that the conditions for asymptotic normality in the triangular array scheme of Theorem 7.1.2 in [8] hold (Lyapunov's condition). In our case this reduces to the verification of the displayed fact. The following corollary immediately follows from Lemmas 5.2 and 5.3.
Corollary 5.1. Under the conditions of Lemma 5.2, the displayed asymptotic normality holds. We now prove Theorem 2.2.
Proof of Theorem 2.2. From (1.4) we have the displayed representation, and hence the result follows from Corollary 5.1.
The following lemma gives the order of the variance of $f_{nh}(x).$

Notice that $\mathrm{Var}[f_{nh}(x)]$ admits the displayed bound. Recalling Lemma 5.1, we conclude that the stated order holds. Next we deal with the consistency of $p_{ng}$ and prove Theorem 2.3.
Proof of Theorem 2.3. To prove that $p_{ng}-p$ converges to zero in probability, it is sufficient to prove that $\mathrm{Var}[p_{ng}]\to 0$ and $E[p_{ng}]-p\to 0$ as $n\to\infty$ and $g\to 0.$ Here it is understood that replacing the subindex $h$ by $g$ entails replacement of the smoothing characteristic function $\phi_w$ by $\phi_k.$ By Lemma 5.4 we obtain the bound (5.5), which converges to zero due to the condition on $g.$ Furthermore, in the decomposition of the bias, the first term is zero, since $\phi_k$ integrates to 2, while the second term converges to zero; this can be seen upon noticing that $\phi_k$ is bounded, $\phi_f$ is integrable, and the term in question is bounded by an expression that converges to zero as $g\to 0.$ The last part of the theorem follows from the displayed identity.

We want to prove that the first term in this identity is asymptotically normal, while the second term converges to zero in probability. An application of Slutsky's lemma, see Lemma 2.8 in [41], then implies that the whole expression is asymptotically normal.
First we deal with the second term; its normalising factor involves $\sqrt{n},$ $g^{2+2\alpha}$ and $e^{\sigma^2/(2g^2)}.$ Next we prove that the expression in the display is asymptotically normal; then (5.8) converges to zero in probability, since convergence in distribution to a constant is equivalent to convergence in probability to the same constant, and because $w$ is bounded. The displayed representation can be seen as follows: due to Theorem 2.4, its first term yields the asymptotic normality, and we will prove that the second term converges to zero in probability. To this end it is sufficient to prove the displayed convergence. It follows from the definition of $\tilde p_{ng}$ and Lemma 5.1 that the bound (5.12) holds, where $K$ is some constant. This and (5.11) imply that we have to prove (5.13). The relevant quantity is bounded by a constant $K,$ say, times $g^{2+2\alpha}e^{\sigma^2/(2g^2)}n^{-1}.$ Hoeffding's inequality, see [29], then yields (5.16), so it is enough to prove that the term on its right-hand side converges to zero. Taking the logarithm shows that it diverges to minus infinity, because the last term dominates $\log n$; the latter fact can be seen by taking the logarithm of the left-hand side and using (1.18). This in turn proves that (5.10) is asymptotically normal. Since the derivative $(y/(1-y))'\neq 0,$ a minor variation of the $\delta$-method then implies that (5.9) is also asymptotically normal (see Theorem 3.8 in [41] for the $\delta$-method). Consequently, the second term in (5.7) converges to zero in probability.
We now consider the first term in (5.7) and want to prove that it is asymptotically normal. Rewrite this term as

Thanks to Corollary 5.1 the first summand here is asymptotically normal. We will prove that the second term vanishes in probability. Due to Chebyshev's inequality, it is sufficient to study the behaviour of

By the Cauchy–Schwarz inequality, after taking squares, we can instead consider

It is easy to see that this expression is of order $h^{-2}$. Indeed, due to Lemma 5.4 the first term in this expression is of order $n^{-1} h^{2(1+2\alpha)} e^{\sigma^2/h^2}$. The fact that this in turn is of lower order than $h^{-2}$ can be seen in the same way as for (5.6). For the second term we have

and this is of order $h^{-2}$, because

and because $w$ is bounded. Consequently, taking into account (5.17), we have to study

Hence we have to prove that

(5.20)

The first fact essentially follows from the arguments concerning (5.11), since the presence of an additional factor $\epsilon_n^{-2}$, given Condition 1.5, does not affect the arguments used. Indeed, (5.19) will hold true if we prove that

$$\frac{1}{\epsilon_n^2}\, n\, h^{4+4\alpha} e^{\sigma^2/h^2}\, g^{4+4\alpha} e^{\sigma^2/g^2}\, P(p_{ng} > 1 - \epsilon_n) \to 0.$$
The first term here converges to zero by (5.5) and Conditions 1.4 and 1.5. Now we turn to the second term. Taking into account (5.6), we have to study the behaviour of

$$\frac{\sqrt{n}}{\epsilon_n}\, h^{2+2\alpha} e^{\sigma^2/(2h^2)} (1 - p)$$

This can be rewritten as

$$\frac{\sqrt{n}}{\epsilon_n}\, h^{2+2\alpha} e^{\sigma^2/(2h^2)}\, g^{2+2\alpha} e^{\sigma^2/(2g^2)}$$

Conditions 1.1, 1.3, 1.4 and 1.5 imply that this expression converges to zero, because the integral converges to a constant by the dominated convergence theorem, while

$$\sqrt{n}\, g^{1+2\alpha} e^{-\sigma^2/(2g^2)}\, g^{\gamma} \to 0,$$

which can be seen by taking the logarithm and noticing that it diverges to minus infinity. We obtain

The fact that this term converges to zero follows from (5.17) and the subsequent arguments in the proof of Theorem 2.5. Now we have to study the second summand in (5.23). By the Cauchy–Schwarz inequality and the fact that $(1 - p_{ng})^{-2} \le \epsilon_n^{-2}$, it suffices to consider

instead. The fact that this term converges to zero follows from the arguments concerning (5.18), which were given in the proof of Theorem 2.5. Indeed, the expression above can be rewritten as

Now use the arguments concerning (5.19) and (5.20) and the facts that $w$ is a bounded function and that under Condition 1.4 we have $h^{2(1+2\alpha)} e^{\sigma^2/h^2} n^{-1} \to 0$. This concludes the proof of the first part of the theorem.

Now we prove the second part, an order expansion of the bias $\mathrm{E}[f^*_{nhg}(x)] - f(x)$ under the additional assumptions given in the statement of the theorem. The proof follows the same steps as that of the first part. Notice that under the condition $f \in H(\beta, L)$, the second summand in (5.22) is of order $h^{\beta}$, see Proposition 1.2 in [40]. We then have to show that $h^{\beta}$ times $\sqrt{n}\, h^{-1-2\alpha} e^{-\sigma^2/(2h^2)}$ converges to zero. To this end it is sufficient to show that

$$\log\bigl(h^{\beta-1-2\alpha} \sqrt{n}\, e^{-\sigma^2/(2h^2)}\bigr) \to -\infty.$$
This essentially follows from the same argument as for (5.21), with $\gamma$ replaced by $\beta$. Now consider (5.23). Its first term is bounded by

(5.24)

and we have to show that this term, multiplied by $\sqrt{n}\, h^{-1-2\alpha} e^{-\sigma^2/(2h^2)}$, tends to zero. The arguments from the proof of Theorem 2.6 lead us to (5.18) and hence to the desired result.
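Although the proofs above are purely analytic, the estimator $f_{nh}$ they concern can be illustrated numerically. The sketch below is not the authors' code: for concreteness it takes the kernel characteristic function to be the indicator of $[-1,1]$ (real, symmetric, supported on $[-1,1]$ and integrating to 2, as Condition 1.3 requires of $\phi_k$), and the choices $V \sim N(2,1)$, $\sigma = 0.3$, $h = 0.4$ are arbitrary placeholders.

```python
import numpy as np

def simulate_sample(n, p, sigma, rng):
    """Draw X_i = U_i V_i + sigma Z_i with P(U = 0) = p, V ~ N(2, 1), Z ~ N(0, 1).
    The N(2, 1) distribution for V is an illustrative choice only."""
    u = (rng.random(n) >= p).astype(float)
    v = rng.normal(2.0, 1.0, n)
    z = rng.normal(size=n)
    return u * v + sigma * z

def f_nh(x, data, h, sigma, m=2001):
    """Kernel deconvolution density estimator at the point x:
    (1/2pi) * integral over |t| <= 1/h of exp(-itx) * phi_emp(t) / exp(-sigma^2 t^2 / 2) dt,
    i.e. the empirical characteristic function divided by the N(0, sigma^2) one,
    with the kernel characteristic function equal to the indicator of [-1, 1]."""
    t = np.linspace(-1.0 / h, 1.0 / h, m)
    phi_emp = np.exp(1j * np.outer(t, data)).mean(axis=1)
    integrand = np.exp(-1j * t * x) * phi_emp * np.exp(sigma**2 * t**2 / 2.0)
    dt = t[1] - t[0]
    return float(np.real(integrand.sum() * dt) / (2.0 * np.pi))

rng = np.random.default_rng(1)
X = simulate_sample(2000, 0.0, 0.3, rng)  # p = 0: the classical deconvolution case
est_mode = f_nh(2.0, X, 0.4, 0.3)         # true f(2) = 1/sqrt(2*pi), about 0.399
est_tail = f_nh(6.0, X, 0.4, 0.3)         # true f(6) is essentially zero
print(est_mode, est_tail)
```

The exponential factor $e^{\sigma^2 t^2/2}$ inflating the empirical characteristic function is exactly the source of the $e^{\sigma^2/(2h^2)}$ terms in the proofs: it is why the bandwidth $h$ may tend to zero only logarithmically in $n$.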

Fig 5. The normal density $f$ (dashed line) and the estimate $f^*_{nhg}$ (solid line). The sample size $n = 1000$.

Fig 6. The normal density $f$ (dashed line) and the estimate $f_{nh}$ (solid line). The sample size $n = 1000$.

Fig 11. The histogram of estimates of $p$ for $g = 0.6$ and the sample size $n = 1000$.

Fig 19. The normal density $f$ and estimates $f^*_{nhg}$ evaluated for $p = 0.05$, $p = 0.1$, $p = 0.15$ and the sample size $n = 1000$.

Fig 20. The normal density $f$ and the estimate $f^*_{nhg}$ evaluated for $p = 0.2$ and the sample size $n = 1000$.
Condition 1.3. Let $\phi_k$ be real valued, symmetric and have support $[-1, 1]$. Let $\phi_k$ integrate to 2 and let

Let the bandwidths $h$ and $g$ depend on $n$, $h = h_n$ and $g = g_n$, and let

Table 1
Sample and theoretical means and standard deviations (SD) of estimates of $p$ for different choices of bandwidth $g$. The sample size $n = 1000$.

Table 2
Sample and theoretical means and standard deviations (SD) of estimates of $p$ for different choices of bandwidth $g$. The sample size $n = 500$.

Next we prove the asymptotic normality of $p_{ng}$.

Proof of Theorem 2.4. The result follows from the definition of $p_{ng}$ and Corollary 5.1, because $p_{ng} = g\pi f_{ng}(0)$ is essentially a rescaled version of $f_{ng}(0)$.
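The defining relation $p_{ng} = g\pi f_{ng}(0)$ lends itself to a direct numerical check: the atom of the distribution of $Y$ at zero contributes a spike of height of order $p/(\pi g)$ to the deconvolution estimate at zero, which the rescaling by $g\pi$ isolates. The sketch below is illustrative only; it again takes $\phi_k$ to be the indicator of $[-1,1]$, and the choices $V \sim N(2,1)$, $\sigma = 0.2$, $g = 0.3$ and the replication count are hypothetical, not those of the simulation study summarised in the tables.

```python
import numpy as np

def p_ng(X, g, sigma, m=2001):
    """Estimate p as pi * g * f_ng(0), where f_ng is the deconvolution kernel
    density estimator with kernel characteristic function = indicator of [-1, 1]."""
    t = np.linspace(-1.0 / g, 1.0 / g, m)
    phi_emp = np.exp(1j * np.outer(t, X)).mean(axis=1)   # empirical char. function
    integrand = phi_emp * np.exp(sigma**2 * t**2 / 2.0)  # divide out N(0, sigma^2) noise
    f0 = np.real(integrand.sum() * (t[1] - t[0])) / (2.0 * np.pi)
    return np.pi * g * f0

rng = np.random.default_rng(7)
n, p_true, sigma, g = 1000, 0.3, 0.2, 0.3
estimates = []
for _ in range(20):
    u = (rng.random(n) >= p_true).astype(float)          # P(U = 0) = p_true
    X = u * rng.normal(2.0, 1.0, n) + sigma * rng.normal(size=n)
    estimates.append(p_ng(X, g, sigma))

# Mean lands near p_true, up to a deterministic bias of order pi * g * f(0)
print(np.mean(estimates), np.std(estimates, ddof=1))
```

Replicating such runs over a grid of bandwidths $g$ and recording the sample mean and SD of the estimates is one way to produce summaries of the kind reported in Tables 1 and 2.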
(5.14)

Denote $t_n \equiv 1 - \epsilon_n - \mathrm{E}[p_{ng}]$ and select $n_0$ so large that for $n \ge n_0$ we have $t_n > 0$. Notice that $t_n \to 1 - p$, which follows from (5.6). The probability in (5.14) is bounded by $P(|p_{ng} - \mathrm{E}[p_{ng}]| > t_n)$. Note that