
In this paper we study the problem of estimating a distribution from data that contain small measurement errors. The only assumption on these errors is that the average absolute measurement error converges to zero with probability one as the sample size tends to infinity. In particular, we do not assume that the measurement errors are independent or have expectation zero. Throughout the paper we assume that the distribution to be estimated has a density with respect to the Lebesgue-Borel measure. We show that the empirical measure based on the data with measurement errors leads to a uniformly consistent estimate of the distribution function. Furthermore, we show that in general no estimate is consistent in the total variation sense for all distributions under the above assumptions. However, if the average measurement error converges to zero faster than a properly chosen sequence of bandwidths, the total variation error of the distribution estimate corresponding to a kernel density estimate converges to zero for all distributions. In case of a general additive error model we show that this result even holds if only the average measurement error converges to zero. The results are applied in the context of estimating the density of residuals in a random design regression model, where the residual error is not independent of the predictor.


Introduction
Let X be a real-valued random variable with distribution µ and let B be the sigma field of all Borel sets on the real line. One of the main problems in statistics is to estimate µ from a sample X_1, ..., X_n of X. The well-known theorem of Glivenko-Cantelli implies that in case X, X_1, X_2, ... are independent and identically distributed, we have

sup_{x∈R} |µ_n((−∞, x]) − µ((−∞, x])| → 0 a.s., (1.1)

where

µ_n(B) = (1/n) · Σ_{i=1}^n 1_B(X_i) (B ∈ B)

denotes the empirical distribution of X_1, ..., X_n (cf., e.g., Theorem 12.4 in [11]), and where Z_n → Z a.s. is the abbreviation for almost sure convergence, i.e., for Z_n → Z almost surely as n → ∞. So with this estimate we get consistent estimates of the probabilities of all intervals. However, if we are interested in estimation of the probabilities of general sets, we can consider the total variation error

sup_{B∈B} |µ̂_n(B) − µ(B)| (1.2)

and try to construct estimates µ̂_n such that this total variation error converges to zero almost surely. Unfortunately, as was shown in [10], no estimate exists with the property

sup_{B∈B} |µ̂_n(B) − µ(B)| → 0 a.s. (1.3)

for all distributions. But if we assume that a density f of X exists, i.e., if µ is given by

µ(B) = ∫_B f(x) dx (B ∈ B),

then we can construct estimates which satisfy (1.3) for all such distributions via properly defined density estimates. More precisely, let f_n(·) = f_n(·, X_1, ..., X_n) be an estimate of f by a density f_n satisfying

∫ |f_n(x) − f(x)| dx → 0 a.s. (1.4)

for all densities f. E.g., the kernel density estimate (cf., e.g., [30, 29])

f_n(x) = (1/(n · h_n)) · Σ_{i=1}^n K((x − X_i)/h_n),

which depends on a density K : R → R (the so-called kernel) and a sequence of bandwidths h_n > 0, has this property if h_n satisfies

h_n → 0 (n → ∞) and n · h_n → ∞ (n → ∞) (1.5)

(cf., e.g., [27] and [5]; general results in density estimation can also be found in the books [9, 6] and [12]). In this case, Scheffé's lemma (cf., e.g., [9]) implies that the estimate

µ̂_n(B) = ∫_B f_n(x) dx (B ∈ B)

satisfies (1.3) for all distributions µ which have a density.
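As a concrete illustration of the kernel density estimate and its L_1 error, the following pure-Python sketch estimates a standard normal density and approximates ∫ |f_n(x) − f(x)| dx by a Riemann sum. The Gaussian kernel, the sample size and the bandwidth exponent are illustrative choices, not prescribed by the paper:

```python
import math
import random

def kernel_density_estimate(data, x, h):
    """Kernel density estimate f_n(x) = (1/(n*h)) * sum_i K((x - X_i)/h),
    here with the Gaussian kernel K (an illustrative choice)."""
    gauss_k = lambda u: math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)
    return sum(gauss_k((x - xi) / h) for xi in data) / (len(data) * h)

def l1_error(f_hat, f, grid):
    """Riemann-sum approximation of the L1 distance on an equispaced grid."""
    dx = grid[1] - grid[0]
    return sum(abs(f_hat(x) - f(x)) for x in grid) * dx

random.seed(0)
n = 2000
sample = [random.gauss(0.0, 1.0) for _ in range(n)]
h_n = n ** (-1.0 / 5.0)  # h_n -> 0 and n * h_n -> infinity, i.e. (1.5) holds
true_f = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
grid = [-5.0 + 0.05 * k for k in range(201)]
err = l1_error(lambda x: kernel_density_estimate(sample, x, h_n), true_f, grid)
```

With this choice of h_n both conditions in (1.5) hold, and the computed L_1 error should be small for moderate n.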
In this paper we assume that instead of the sample X_1, ..., X_n of X we have available only data X̃_{1,n}, ..., X̃_{n,n} such that the average absolute error between X_i and X̃_{i,n} converges to zero almost surely, i.e., we assume that

(1/n) · Σ_{i=1}^n |X̃_{i,n} − X_i| → 0 a.s. (1.6)

Here we do not assume anything about the measurement errors X̃_{i,n} − X_i (i = 1, ..., n). In general, these errors need not be random, and, in case they are random, they need not be independent or identically distributed and they need not have expectation zero. So estimates for convolution problems, where independent and identically distributed noise is added to the data (see, e.g., [26] and the literature cited therein), are not applicable in the context of this paper. Note also that our set-up is triangular. Since we do not assume anything about the nature of the measurement errors besides that they are asymptotically negligible in the sense that (1.6) holds, it seems to be a natural idea to ignore them completely and to try to use the same estimates as in the case that an independent and identically distributed sample is given. In this paper we investigate whether the above mentioned distribution estimates are still consistent in this situation. As main results we first show that the corresponding empirical distribution satisfies (1.1) for all distributions µ which have a density with respect to the Lebesgue-Borel measure. Secondly, we show that the kernel density estimate satisfies (1.4) whenever (1.5) and

(1/h_n) · (1/n) · Σ_{i=1}^n |X̃_{i,n} − X_i| → 0 a.s.

hold. But if we just assume (1.6), then our third result implies that there does not exist any estimate satisfying (1.4) for all distributions and all data with measurement errors satisfying (1.6). Thus, (1.6) is in general not a strong enough condition to guarantee total variation convergence. There is a large literature on the recovery of densities from noisy data if the noise is fixed. If the noise distribution is fixed and known, and if the noise is independent, then by deconvolution it is possible to
consistently estimate the density (see, e.g., [26] and the literature cited therein). However, if the noise distribution is fixed and unknown, and if the noise is independent, then it is clearly impossible to recover the density. The situation for independent but variable unknown noise is a bit better: our fourth result shows that, in a general additive error model, (1.6) together with weak assumptions on the kernel is all that is needed for the above kernel density estimate to satisfy (1.4).
Finally, we apply our results in the context of estimating the density of residuals in a random design regression model. Recent results in this setting include [7, 19] and [20]. In the first one a consistency result was proven under the assumption that the residual error is independent of the predictor. The latter papers make the weaker assumption that a conditional density of Y given X = x exists and derive consistency and rate of convergence results. In this paper we consider an assumption which is weaker than both kinds of assumptions, and derive a consistency result.
The outline of the paper is as follows. The main results are formulated in Section 2 and proven in Section 4. In Section 3 we describe the application of our main results to the problem of estimating the density of residual errors in a regression model.

Main results
The empirical distribution function is possibly the simplest way to estimate a distribution function. Even if there is no sample X_1, ..., X_n of X available, we obtain a Glivenko-Cantelli result under adequate assumptions on the available data X̃_{1,n}, ..., X̃_{n,n}, provided that the distribution of X_1 has a density with respect to the Lebesgue-Borel measure.
Theorem 1. Let X_1, X_2, ... be independent and identically distributed real-valued random variables with density f (with respect to the Lebesgue-Borel measure), and let X̃_{1,n}, ..., X̃_{n,n} be random variables which satisfy

(1/n) · Σ_{i=1}^n |X̃_{i,n} − X_i| → 0 a.s. (2.1)

Then the empirical distribution µ̂_n of X̃_{1,n}, ..., X̃_{n,n} satisfies

sup_{x∈R} |µ̂_n((−∞, x]) − µ((−∞, x])| → 0 a.s.

Whenever µ has a density with respect to the Lebesgue-Borel measure, the total variation error of the above estimate does not converge to zero, because in this case we have µ({X̃_{1,n}, ..., X̃_{n,n}}) = 0 while, by definition of µ̂_n, we have µ̂_n({X̃_{1,n}, ..., X̃_{n,n}}) = 1. However, our next theorem shows that if we choose a proper sequence (h_n)_n of bandwidths satisfying

h_n → 0 (n → ∞) and n · h_n → ∞ (n → ∞), (2.2)

then we can construct a density estimate which is universally consistent in the L_1 sense, and hence for which, by Scheffé's lemma, the total variation error of the corresponding distribution estimate converges to zero regardless of the density f. To do this, we again ignore the measurement errors completely for estimation, and define a standard kernel density estimate applied to the data with measurement errors via

f_n(x) = (1/(n · h_n)) · Σ_{i=1}^n K((x − X̃_{i,n})/h_n).

Theorem 2. Let K be any density on R, let h_n > 0 and let f_n be defined as above. Assume that (2.2) and

(1/h_n) · (1/n) · Σ_{i=1}^n |X̃_{i,n} − X_i| → 0 a.s. (2.3)

hold. Then

∫ |f_n(x) − f(x)| dx → 0 a.s.

As shown in [8] (cf., proof of Theorem 2 in [8]), Theorem 2 is no longer valid if we replace (2.3) by

(1/n) · Σ_{i=1}^n |X̃_{i,n} − X_i| → 0 a.s. (2.4)

But whenever (2.4) holds we can always find bandwidths h_n such that (2.2) and (2.3) hold, and consequently the resulting estimator f_n is strongly universally L_1-consistent. However, this estimator depends on the non-observable X_1, ..., X_n. Surprisingly, as our next theorem shows, it is in general not possible to construct an estimate which is consistent for all densities and all samples satisfying (2.4), even if our sample with measurement errors does not change completely each time the sample size changes, i.e., if we have given data X̃_1, ..., X̃_n instead of X̃_{1,1}, ..., X̃_{n,n}. From this result we can also conclude that in general a data-dependent choice of a more or less optimal bandwidth in Theorem 2 is not possible.
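The statement of Theorem 1 can be checked in a small simulation. In the sketch below (the normal sample and the deterministic error sequence are assumed purely for illustration), the data are perturbed by errors that are neither independent nor centred, yet whose average absolute value tends to zero, and the empirical distribution function of the perturbed data is compared with the true distribution function on a grid:

```python
import math
import random

def empirical_cdf(data, x):
    """Empirical distribution function of the (possibly perturbed) data."""
    return sum(1 for xi in data if xi <= x) / len(data)

random.seed(1)
n = 5000
sample = [random.gauss(0.0, 1.0) for _ in range(n)]
# Measurement errors need not be independent, centred or even random:
# here a deterministic positive shift whose average absolute value is
# of order 1/sqrt(n), so condition (2.1) holds.
noisy = [xi + 0.1 / math.sqrt(i + 1.0) for i, xi in enumerate(sample)]

Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # true N(0,1) cdf
grid = [-4.0 + 0.1 * k for k in range(81)]
sup_err = max(abs(empirical_cdf(noisy, x) - Phi(x)) for x in grid)
```

Despite the systematic (non-centred) errors, the supremum distance should be of the same small order as for error-free data.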
Theorem 3. There does not exist a sequence (f_n)_n of density estimates satisfying

∫ |f_n(x) − f(x)| dx → 0 a.s.

for all densities f and all random variables X̃_1, X̃_2, ... satisfying

(1/n) · Σ_{i=1}^n |X̃_i − X_i| → 0 a.s. (2.5)

for some independent and identically distributed X_1, X_2, ... with density f.

Remark 1. Assume that X̃_{1,n}, ..., X̃_{n,n} changes with every n ∈ N such that

max_{i=1,...,n} |X̃_{i,n} − X_i| → 0 a.s. (2.6)

Then there does not exist a sequence (f_n)_n of density estimates satisfying

∫ |f_n(x) − f(x)| dx → 0 a.s.

for all densities f and all random variables X̃_{1,n}, ..., X̃_{n,n} which satisfy condition (2.6). This can be proven as Theorem 3 above, if we set X̃_{i,n} = X̃_i.

In the sequel we show that under a particular noise model, where independent noise is added to the true data such that the average noise is small, we can obtain weak consistency of our kernel estimate under an even weaker assumption than (2.5). More precisely, assume that the given data X̃_{1,n}, ..., X̃_{n,n} is of the form

X̃_{i,n} = X_i + Y_{i,n},

where the additive noise Y_{i,n} is independent of X_1, ..., X_n and where (X_i, Y_{i,n}), 1 ≤ i ≤ n, are independent. Additionally, we assume that Y_{1,n}, ..., Y_{n,n} have probability distributions P_{Y_{1,n}}, ..., P_{Y_{n,n}} on the Borel sets of the real line. We do not need to impose any structural conditions on these distributions.
The sequence (Y_{i,n})_i of random variables is called diminishing additive noise if

(1/n) · Σ_{i=1}^n P_{Y_{i,n}} → δ_0 weakly,

where δ_0 denotes the probability measure with all of its mass at zero. Here, (1/n) · Σ_{i=1}^n P_{Y_{i,n}} is the probability measure which assigns to a Borel set B the probability (1/n) · Σ_{i=1}^n P_{Y_{i,n}}(B), and a sequence of measures µ_n defined on B converges weakly to a measure µ if

∫ f dµ_n → ∫ f dµ (n → ∞)

for all continuous and bounded functions f : R → R.
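The diminishing additive noise condition can be probed numerically by testing the defining property against a single bounded continuous function. In the following sketch the noise laws P_{Y_{i,n}} are taken, purely for illustration, to be centred normal with heterogeneous scales (i/n)/√n; the Monte Carlo average of f(Y_{i,n}) should approach f(0) = 1 as n grows:

```python
import math
import random

def avg_noise_functional(n, f, trials=200, seed=2):
    """Monte Carlo estimate of the integral of f with respect to the
    mixture (1/n) * sum_i P_{Y_{i,n}}, for the hypothetical noise
    Y_{i,n} ~ N(0, (i/n)/sqrt(n)): not identically distributed, but
    with all scales shrinking as n grows."""
    rng = random.Random(seed)
    total = 0.0
    for i in range(1, n + 1):
        s = (i / n) / math.sqrt(n)
        total += sum(f(rng.gauss(0.0, s)) for _ in range(trials)) / trials
    return total / n

f = lambda y: math.cos(y)  # bounded continuous test function with f(0) = 1
vals = [avg_noise_functional(n, f) for n in (10, 100, 1000)]
```

Weak convergence to δ_0 would require this for every continuous bounded f; the sketch checks only one such function, which is enough to make the definition concrete.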
For the kernel estimate we obtain the following result.
Theorem 4. Let K be a square integrable function which integrates to one, assume that

h_n → 0 (n → ∞) and n · h_n → ∞ (n → ∞),

and define f_n as above. If the data satisfies the above diminishing additive noise condition, then

∫ |f_n(x) − f(x)| dx → 0 in probability.

If we drop the adjective "additive", and assume merely that the pairs (X_i, Y_{i,n}), n ≥ 1, i ≤ n, are independent [but Y_{i,n} is not independent of X_i] and that the noise is diminishing, then, as shown previously, the density f cannot be consistently estimated by any estimator. If we keep the additivity but drop the diminishing noise condition, then f can also not be estimated, although we will not show that in this paper.
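A simulation consistent with Theorem 4 computes the L_1 error of the kernel estimate from additively perturbed data for increasing n. The Epanechnikov kernel (square integrable, integrates to one, as the theorem requires), the normal data and the 1/√n noise scale are illustrative assumptions:

```python
import math
import random

def kde(data, x, h):
    """Kernel density estimate with the Epanechnikov kernel."""
    K = lambda u: 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0
    return sum(K((x - xi) / h) for xi in data) / (len(data) * h)

def l1_error_with_noise(n, seed=4):
    """L1 error of the kernel estimate computed from additively perturbed
    data X_i + Y_{i,n}, where the independent noise scale shrinks with n
    (a particular instance of diminishing additive noise)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noisy = [x + rng.gauss(0.0, 1.0) / math.sqrt(n) for x in xs]
    h = n ** (-0.2)                      # h_n -> 0 and n * h_n -> infinity
    f = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    grid = [-5.0 + 0.1 * k for k in range(101)]
    return sum(abs(kde(noisy, x, h) - f(x)) for x in grid) * 0.1

errs = [l1_error_with_noise(n) for n in (100, 1000, 5000)]
```

The errors should decrease with n even though the data are never observed without noise.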

Estimation of the density of residuals
Let (X, Y), (X_1, Y_1), ... be independent and identically distributed R^d × R-valued random variables with EY² < ∞, let m(x) = E{Y | X = x} denote the corresponding regression function, and assume that a density f of the residual ǫ = Y − m(X) exists. Here we do not assume that ǫ and X are independent. Given (X_1, Y_1), ..., (X_n, Y_n) we are interested in an estimate of f. Estimating the density of the error distribution in nonparametric regression models has been dealt with by several researchers. Ahmad showed in [1] that under a Lipschitz condition on the kernel function, the kernel density estimator converges in probability at every continuity point to the true density of the residuals. In case of a continuous error density, the same estimator is pointwise and uniformly consistent (see [4]), and, in addition, the histogram error density estimator is uniformly and L_1 consistent (see [3]). In [15] Efromovich investigated, in a homoscedastic regression model, estimates which are as good as estimates using an oracle that knows the underlying regression errors. In the heteroscedastic nonparametric regression model, where the Y_i's have different variances, Efromovich generalized his optimal estimation for a twice differentiable error density with finite support (see [16]). Estimators of the residual distribution function include that of Akritas and Van Keilegom (see [2]), who extended the results of Durbin (see [14]) and Loynes (see [23]) to a weak convergence result for a distribution function estimator in a nonparametric heteroscedastic regression model. The empirical distribution function of residuals was used as an estimator in a heteroscedastic model with multivariate covariates by Neumeyer and Van Keilegom (see [28]).
The L_1 error of estimates of the density of residual errors was considered in the papers [7, 19] and [20]. In the first one it is assumed that the residual error is independent of the predictor, while the latter papers make the weaker assumption that a conditional density of Y given X = x exists. In our setting neither of these assumptions needs to be satisfied.
In the sequel we estimate f from (X_1, Y_1), ..., (X_n, Y_n) by the following procedure: In a first step we compute a regression estimate m_n(·) = m_n(·, (X_1, Y_1), ..., (X_{⌊n/2⌋}, Y_{⌊n/2⌋})) using the first half of the data. Then we compute the residuals

ǫ̂_i = Y_i − m_n(X_i) (i = ⌊n/2⌋ + 1, ..., n)

and estimate f by the kernel density estimate

f_n(x) = (1/((n − ⌊n/2⌋) · h_n)) · Σ_{i=⌊n/2⌋+1}^n K((x − ǫ̂_i)/h_n).

From Theorem 2 we can conclude the following result.
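The two-step procedure above can be sketched in pure Python as follows. The Nadaraya-Watson estimate with the naive kernel, the regression function sin(2πx), the noise level and all bandwidth values are hypothetical choices made only for this illustration:

```python
import math
import random

def nw_regression(train, x, h):
    """Nadaraya-Watson estimate m_n(x) with the naive kernel (an assumed
    choice; the procedure allows any consistent regression estimate)."""
    num = sum(yi for xi, yi in train if abs(xi - x) <= h)
    den = sum(1 for xi, yi in train if abs(xi - x) <= h)
    return num / den if den > 0 else 0.0

def residual_density(data, h_reg, h_den, x):
    """Split-sample estimate: fit m_n on the first half of the data, form
    residuals eps_i = Y_i - m_n(X_i) on the second half, then apply a
    Gaussian kernel density estimate to the residuals."""
    n = len(data)
    train, test = data[: n // 2], data[n // 2 :]
    residuals = [yi - nw_regression(train, xi, h_reg) for xi, yi in test]
    m = len(residuals)
    K = lambda u: math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)
    return sum(K((x - e) / h_den) for e in residuals) / (m * h_den)

random.seed(3)
n = 2000
# illustrative model: Y = sin(2*pi*X) + N(0, 0.3), X uniform on [0, 1]
data = [(x, math.sin(2.0 * math.pi * x) + random.gauss(0.0, 0.3))
        for x in (random.random() for _ in range(n))]
f0 = residual_density(data, h_reg=0.05, h_den=0.1, x=0.0)
```

In this model the true residual density at 0 is the N(0, 0.3) density value, roughly 1.33, so f0 should land in that vicinity.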
Corollary 1. Let K be any density on R, let h_n > 0 and let f_n be defined as above. Assume that

h_n → 0 (n → ∞), (n − ⌊n/2⌋) · h_n → ∞ (n → ∞) (3.1)

and

(1/h_n) · (1/(n − ⌊n/2⌋)) · Σ_{i=⌊n/2⌋+1}^n |m_n(X_i) − m(X_i)| → 0 a.s. (3.2)

hold. Then

∫ |f_n(x) − f(x)| dx → 0 a.s.

Proof. Set ǫ_i = Y_i − m(X_i). Then ǫ_{⌊n/2⌋+1}, ..., ǫ_n are independent and identically distributed with density f, and

|ǫ̂_i − ǫ_i| = |m(X_i) − m_n(X_i)| (i = ⌊n/2⌋ + 1, ..., n).

Since our data is independent and identically distributed, we know that, whenever we compute an expectation, we can permute (X_1, Y_1), ..., (X_n, Y_n) arbitrarily. Hence, by combining (3.2) and the observation above, the assumptions of Theorem 2 are satisfied for the sample ǫ_{⌊n/2⌋+1}, ..., ǫ_n and the data ǫ̂_{⌊n/2⌋+1}, ..., ǫ̂_n. Thus, the assertion follows from Theorem 2.
It is well known in the literature that there exist weakly universally consistent nonparametric regression estimates, i.e., estimates m_n with the property

E ∫ |m_n(x) − m(x)|² P_X(dx) → 0 (n → ∞)

for all distributions of (X, Y) satisfying EY² < ∞. This was first shown in [31] for nearest neighbor regression estimates, and later also proven for many other nonparametric regression estimates, cf., e.g., [13] for corresponding results for kernel estimates, [17] for partitioning estimates, [24] for least squares estimates, and [22] for penalized least squares estimates.
If we use such an estimate, the Cauchy-Schwarz inequality implies that for every distribution of (X, Y) with EY² < ∞ we can find a sequence (h_n)_n of bandwidths satisfying h_n → 0 (n → ∞) and

(1/h_n) · E{ (1/(n − ⌊n/2⌋)) · Σ_{i=⌊n/2⌋+1}^n |m_n(X_i) − m(X_i)| } → 0 (n → ∞).

This together with Corollary 1 implies
Corollary 2. Let K be any density on R, and let f_n be defined as above, where m_n is one of the above mentioned weakly universally consistent regression estimates. Then for any distribution of (X, Y) with EY² < ∞ there exists a sequence of bandwidths (h_n)_n such that (3.1) holds and the estimate f_n corresponding to that sequence of bandwidths satisfies

∫ |f_n(x) − f(x)| dx → 0 in probability.

Remark 2. The above estimate depends on the distribution of (X, Y) and hence is not applicable in practice.
It is an open problem whether there exists a weakly universally consistent regression estimate such that we can construct a data-dependent choice of the bandwidth for which the assertion of Corollary 2 still holds. If we impose regularity conditions on (X, Y), in particular smoothness assumptions on m, we can derive rate of convergence results for the expected L_2 error of the regression estimate, and choose a fixed sequence of bandwidths satisfying (3.1) and (3.2). In this way we can prove results like

Corollary 3. Let K be any density on R, and let f_n be defined as above, where m_n is the kernel regression estimate

m_n(x) = Σ_{j=1}^{⌊n/2⌋} Y_j · K̄((x − X_j)/h̄_n) / Σ_{j=1}^{⌊n/2⌋} K̄((x − X_j)/h̄_n)

with the naive kernel K̄ and bandwidth h̄_n = n^{−1/(2+d)}. Set h_n = n^{−1/(2·(d+2))}. Then

∫ |f_n(x) − f(x)| dx → 0 in probability (3.3)

for all distributions of (X, Y) with the properties that m is Lipschitz continuous, X has compact support supp(X) and sup_{x∈supp(X)} E{Y²|X = x} < ∞.
Proof. Assume that (X, Y) satisfies the assumptions given at the end of Corollary 3. By Theorem 5.2 in [18] we have

E ∫ |m_n(x) − m(x)|² P_X(dx) ≤ c · n^{−2/(d+2)}

for some constant c ∈ R. Together with the Cauchy-Schwarz inequality and the choice h_n = n^{−1/(2·(d+2))} this yields (3.2) with convergence in probability, and Corollary 1 implies the assertion (3.3).

Proof of Theorem 1
Let µ_n be the empirical distribution of X_1, ..., X_n, i.e., set

µ_n(B) = (1/n) · Σ_{i=1}^n 1_B(X_i) (B ∈ B),

and let µ̂_n be the empirical distribution of X̃_{1,n}, ..., X̃_{n,n}. For ǫ > 0 we split the expression µ̂_n((−∞, x]) − µ((−∞, x]) in two different ways:

µ̂_n((−∞, x]) − µ((−∞, x]) = [µ̂_n((−∞, x]) − µ_n((−∞, x + ǫ])] + [µ_n((−∞, x + ǫ]) − µ((−∞, x + ǫ])] + [µ((−∞, x + ǫ]) − µ((−∞, x])]

and

µ̂_n((−∞, x]) − µ((−∞, x]) = [µ̂_n((−∞, x]) − µ_n((−∞, x − ǫ])] + [µ_n((−∞, x − ǫ]) − µ((−∞, x − ǫ])] + [µ((−∞, x − ǫ]) − µ((−∞, x])].

First we consider

µ̂_n((−∞, x]) − µ_n((−∞, x + ǫ]) = (1/n) · Σ_{i=1}^n (1_{(−∞,x]}(X̃_{i,n}) − 1_{(−∞,x+ǫ]}(X_i)).

The i-th summand becomes one if X̃_{i,n} ≤ x and X_i > x + ǫ. In this case we have |X̃_{i,n} − X_i| > ǫ. If the i-th summand is not equal to one, it is less than or equal to zero. Hence

µ̂_n((−∞, x]) − µ_n((−∞, x + ǫ]) ≤ (1/n) · Σ_{i=1}^n 1_{{|X̃_{i,n} − X_i| > ǫ}} ≤ (1/(n · ǫ)) · Σ_{i=1}^n |X̃_{i,n} − X_i|.

Analogously, we can conclude

µ_n((−∞, x − ǫ]) − µ̂_n((−∞, x]) ≤ (1/(n · ǫ)) · Σ_{i=1}^n |X̃_{i,n} − X_i|.

By the Glivenko-Cantelli lemma and condition (2.1) it follows that

lim sup_{n→∞} sup_{x∈R} (µ̂_n((−∞, x]) − µ((−∞, x])) ≤ sup_{x∈R} µ((x, x + ǫ]) a.s.

Similarly, we obtain

lim sup_{n→∞} sup_{x∈R} (µ((−∞, x]) − µ̂_n((−∞, x])) ≤ sup_{x∈R} µ((x − ǫ, x]) a.s.

Since µ has a density with respect to the Lebesgue-Borel measure, µ is absolutely continuous with respect to the Lebesgue measure λ, for which we know sup_{x∈R} λ((x, x + ǫ]) ≤ ǫ. By this absolute continuity it follows for ǫ → 0 that

sup_{x∈R} µ((x, x + ǫ]) → 0.

This completes the proof.

Proof of Theorem 2
Set

f̄_n(x) = (1/(n · h_n)) · Σ_{i=1}^n K((x − X_i)/h_n).

By [9], Theorem 1 in Chapter 3, we know

∫ |f̄_n(x) − f(x)| dx → 0 a.s.

Hence it suffices to show

∫ |f_n(x) − f̄_n(x)| dx → 0

in expected value or almost surely, respectively. Now, writing K_h(x) = (1/h) · K(x/h) and setting u = (x − X_i)/h_n, we get

∫ |f_n(x) − f̄_n(x)| dx ≤ (1/n) · Σ_{i=1}^n ∫ |K_{h_n}(x − X̃_{i,n}) − K_{h_n}(x − X_i)| dx = (1/n) · Σ_{i=1}^n ∫ |K(u + (X_i − X̃_{i,n})/h_n) − K(u)| du.

For ǫ > 0, we may find a δ > 0 so small that

∫ |K(u + t) − K(u)| du < ǫ whenever |t| ≤ δ.

(In case that K is continuous and has compact support, this follows by an application of the dominated convergence theorem; otherwise we can approximate K arbitrarily well by such a function.) Then, by Markov's inequality,

(1/n) · Σ_{i=1}^n ∫ |K(u + (X_i − X̃_{i,n})/h_n) − K(u)| du ≤ ǫ + 2 · (1/n) · Σ_{i=1}^n 1_{{|X̃_{i,n} − X_i| > δ · h_n}} ≤ ǫ + (2/(δ · h_n)) · (1/n) · Σ_{i=1}^n |X̃_{i,n} − X_i|,

which is almost surely smaller than 2ǫ for all n large enough by (2.3) in case that this condition holds almost surely. Otherwise the expectation of the right-hand side above is smaller than 2ǫ for n large enough. This completes the proof.
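The key analytic fact used here, continuity of translation in L_1, i.e., ∫ |K(u + t) − K(u)| du → 0 as t → 0, can be checked numerically; the Gaussian kernel and the integration grid below are illustrative choices:

```python
import math

def l1_translation_distance(t, grid_step=0.01, half_width=8.0):
    """Riemann-sum approximation of the integral of |K(u + t) - K(u)|
    for the Gaussian kernel K on [-half_width, half_width]."""
    K = lambda u: math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)
    n = int(2.0 * half_width / grid_step)
    return sum(abs(K(-half_width + k * grid_step + t)
                   - K(-half_width + k * grid_step)) * grid_step
               for k in range(n))

vals = [l1_translation_distance(t) for t in (1.0, 0.1, 0.01)]
```

The distances shrink toward zero with t, which is exactly what lets the proof pick δ for a given ǫ.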

Proof of Theorem 3
Assume to the contrary that there exists a sequence (f_n)_n of density estimates satisfying

∫ |f_n(x) − f(x)| dx → 0 a.s. (4.1)

for all densities f and all random variables X̃_1, X̃_2, ... such that, for some independent and identically distributed X_1, X_2, ... with density f, we have

(1/n) · Σ_{i=1}^n |X̃_i − X_i| → 0 a.s. (4.2)

We choose n_1 sufficiently large, which is possible because of (4.1). Given n_1, ..., n_{k−1}, we choose n_k > n_{k−1} sufficiently large, which is possible because of (4.2). But if we define n_1, n_2, ... in such a way, we obtain corresponding bounds for all k ∈ N, and by the triangle inequality we can conclude for any k ∈ N a contradiction to (4.1). This completes the proof of Theorem 3.

Proof of Theorem 4
Throughout the proof we use the abbreviation K_h(x) := (1/h) · K(x/h) for x ∈ R. Furthermore, we introduce the probability measures

ν_n := (1/n) · Σ_{i=1}^n P_{Y_{i,n}}.

The diminishing noise condition implies that a random variable Z_n drawn from ν_n tends to 0 in distribution and hence also in probability (cf., e.g., Theorem 18.3 in [21]). We use the notation * for the convolution operation. In general, for a function f and a measure µ, we write

(f * µ)(x) = ∫ f(x − y) µ(dy),

and similarly, for two functions f and g,

(f * g)(x) = ∫ f(x − y) · g(y) dy.

The first result we require is the following: for an arbitrary ǫ > 0, we can find a uniformly continuous density g of compact support such that

∫ |f(x) − g(x)| dx < ǫ.

Then, by the triangle inequality, the definition of the modulus of continuity, the theorem of Fubini, the fact that g is a density, the uniform continuity of g and the diminishing noise condition, we conclude that the bias part of the L_1 error becomes arbitrarily small. By Jensen's inequality and independence, the remaining term tends to zero if n · h_n → ∞. The proof is complete.

Next we consider the second integral on the right-hand side of (4.4). First, since Z_n → 0 in probability, we can find a_n ↓ 0 such that P{|Z_n| ≥ a_n} → 0. Let ω be the uniform modulus of continuity of g : R → R, i.e.,

ω(δ) = sup_{x,z∈R, |x−z|<δ} |g(x) − g(z)|,

and assume that g vanishes off [−b, b].