Nonparametric deconvolution problem for dependent sequences

We consider nonparametric estimation of the density function of weakly and strongly dependent processes observed with noise. We show that in the ordinary smooth case the optimal bandwidth choice can be influenced by long range dependence, as opposed to the standard case, when no noise is present. In particular, if the dependence is moderate, then the bandwidth, the rates of mean-square convergence and the central limit theorem are the same as in the i.i.d. case. If the dependence is strong enough, then the bandwidth choice is influenced by the strength of dependence, which differs from the non-noisy case. The central limit theorem is also influenced by the strength of dependence. On the other hand, if the density is supersmooth, then long range dependence has no effect at all on the optimal bandwidth choice.


Introduction
R. Kulik/Deconvolution and dependence

The nonparametric estimation of the density function for dependent sequences has attracted many researchers in the past. We do not claim to provide a full overview of this topic; however, the results can be summarized as follows. In the case of weak dependence, the results on the (mean square error) optimal bandwidth choice, the optimal rates of convergence for the mean square error and the central limit theorems for the Parzen-Rosenblatt kernel estimator are exactly the same as in the i.i.d. case (see e.g. [3] or [28, Theorem 1]). The situation is a bit more complicated for long-range dependent sequences. Although dependence has no influence on the optimal bandwidth choice, the rates of mean-square convergence may differ between very strong and moderate dependence. In the latter case they are the same as in the i.i.d. situation. We refer to [5,8,13,14] and [23]. Similarly, if the bandwidth is "small", then the central limit theorem for the kernel density estimates is the same as in the i.i.d. case. On the other hand, if the bandwidth is "big" enough, then the long range dependence effect dominates (see [6], [28, Theorem 2]). Similar phenomena occur in random-design regression problems. The reader is referred to [22] and [24] for up-to-date results and references for kernel and local linear estimation, respectively.
As for smooth estimators of the distribution function, neither short- nor long-range dependence has any influence on the optimal bandwidth choice: the optimal bandwidth is the same as in the i.i.d. case. However, the optimal rates of convergence for the mean square error are always affected by long range dependence (see [8] for more details).
In the present paper we consider the deconvolution problem for dependent sequences. Suppose that we have n observations Y_1, . . . , Y_n available. We want to estimate the unknown density f = f_X of a random variable X, where Y = X + ε, with a measurement error ε of a known distribution F_ε and density f_ε. It is assumed that X and ε are independent and that {ε, ε_j, j ≥ 1} is an i.i.d. sequence.
We will estimate f(x_0) using the classical deconvolution kernel estimator (cf. [4,9])

f̂_n(x_0) = (1/(2π)) ∫_ℝ e^{−itx_0} (φ_K(h_n t)/φ_ε(t)) (1/n) ∑_{j=1}^n e^{itY_j} dt.

Above, φ_ε is the characteristic function corresponding to the density f_ε and φ_K(t) = ∫_ℝ exp(itx)K(x)dx. The mean square error is defined as

MSE(f, h_n) = E[f̂_n(x_0) − f(x_0)]².

We also study the behavior of the estimator of the distribution function F. In the i.i.d. case deconvolution problems were studied in [4,9,10] and [25], among others. In the latter paper, Fan provided the optimal rates of convergence for MSE(f, h_n) in both the ordinary smooth and the supersmooth case. As for the weakly dependent case, previous results have been obtained under various mixing conditions (see [18,19,20]) and under association (see [21]). The principal message of the latter papers is that the results (optimal bandwidth, optimal rates, central limit theorem) for weakly dependent sequences are the same as in the i.i.d. case. As for the distribution function, the problem was studied in [9] in the i.i.d. case and in [17] in the dependent case.
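For orientation, the classical deconvolution estimator is straightforward to implement numerically. The sketch below is purely illustrative and makes assumptions not taken from the paper: X is standard normal, the error ε is Laplace(0, b) (an ordinary smooth error), and the kernel has φ_K(t) = 1 for |t| ≤ 1 (the sinc kernel); the integral is approximated by a Riemann sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
b = 0.3                                   # scale of the assumed Laplace error

# Illustrative data: X standard normal, eps Laplace(0, b); Y = X + eps.
X = rng.normal(0.0, 1.0, n)
eps = rng.laplace(0.0, b, n)
Y = X + eps

def phi_eps(t):
    """Characteristic function of the Laplace(0, b) error: 1 / (1 + b^2 t^2)."""
    return 1.0 / (1.0 + (b * t) ** 2)

def deconv_density(x0, Y, h, m=1024):
    """Deconvolution kernel estimator with phi_K(t) = 1{|t| <= 1} (sinc kernel)."""
    t = np.linspace(-1.0 / h, 1.0 / h, m)
    dt = t[1] - t[0]
    # Empirical characteristic function of Y on the frequency grid.
    phi_hat_Y = np.exp(1j * t[:, None] * Y[None, :]).mean(axis=1)
    integrand = np.exp(-1j * t * x0) * phi_hat_Y / phi_eps(t)
    return float(np.real(integrand.sum() * dt) / (2.0 * np.pi))

est = deconv_density(0.0, Y, h=0.3)
print(est)  # should be close to the N(0,1) density at 0 (about 0.399)
```

Dividing the empirical characteristic function by φ_ε is exactly what distinguishes this estimator from the ordinary Parzen-Rosenblatt estimator applied to the noisy Y_j.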
However, mixing is rather hard to verify and requires additional assumptions. In particular, let {Z, Z_i, i ∈ ℤ} be a centered sequence of i.i.d. random variables. Consider the class of stationary linear processes

X_j = ∑_{k=0}^∞ c_k Z_{j−k}, j ≥ 1.   (1)

To obtain strong mixing for linear processes, both regularity of the density of Z_1 and some constraints on the c_k's are required (see e.g. [7]). On the other hand, association requires that all c_k are positive. To overcome such problems, a martingale-based method has been proposed, and it works surprisingly well in a variety of problems, not necessarily connected with nonparametric estimation (see [15,16,26,27,28]). Thus, from a technical point of view, assuming that c_0 = 1 and that the sequence c_k, k ≥ 0, is summable (referred to later as short range dependence, SRD), we will extend Masry's results to moving averages, without referring to mixing or association at all. However, the more interesting problem is the influence of long range dependence on the deconvolution estimator. To deal with it, we assume that c_0 = 1 and that c_k is regularly varying with index −γ, γ ∈ (1/2, 1). This means that c_k ∼ k^{−γ} L_0(k) as k → ∞, where L_0 is slowly varying at infinity. We shall refer to all such models as long range dependent (LRD) linear processes. In particular, if the variance exists, then the covariances ρ_k := EX_0X_k decay at a hyperbolic rate, ρ_k = L(k)k^{−(2γ−1)}, where lim_{k→∞} L(k)/L_0²(k) = B(2γ−1, 1−γ) and B(·,·) is the beta function. Consequently, the covariances are not summable.
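The hyperbolic covariance decay can be checked numerically. The sketch below makes the illustrative choice c_0 = 1 and c_k = k^{−γ} (so L_0 ≡ 1) with γ = 0.75, computes ρ_k = ∑_i c_i c_{i+k} for unit-variance Z (truncating the infinite sum at an arbitrary level M), and compares k^{2γ−1} ρ_k with the limit constant B(2γ−1, 1−γ).

```python
import numpy as np
from math import gamma as Gamma

g = 0.75                      # dependence index gamma in (1/2, 1)
M = 2_000_000                 # truncation level for the infinite sum (sketch choice)

# Coefficients c_0 = 1, c_k = k^{-gamma}; the slowly varying part L_0 is 1.
c = np.empty(M)
c[0] = 1.0
c[1:] = np.arange(1, M, dtype=float) ** (-g)

def rho(k):
    """Covariance rho_k = sum_i c_i c_{i+k} for unit-variance i.i.d. Z."""
    return float(np.dot(c[: M - k], c[k:]))

# Limit constant B(2*gamma - 1, 1 - gamma) via the Gamma function.
B = Gamma(2 * g - 1) * Gamma(1 - g) / Gamma(g)

k = 1000
ratio = rho(k) * k ** (2 * g - 1) / B
print(ratio)  # approaches 1 as k grows: rho_k ~ B(2g-1, 1-g) k^{-(2g-1)}
```

The slow approach of the ratio to 1 reflects the second-order terms hidden in the slowly varying factor L(k).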
We will show below that in the ordinary smooth case the optimal bandwidth choice for the density problem is influenced by the dependence parameter γ, as opposed to the optimal bandwidth in standard (non-noisy) kernel density estimation. In particular, if the dependence is moderate, then the optimal bandwidth and the optimal rates for the density estimation are the same as in the i.i.d. case. If the dependence is very strong, the optimal bandwidth depends on γ itself. See Proposition 2.2 and Corollary 2.3. In the case of the distribution function, the dependence parameter is always present in the optimal bandwidth and in the optimal rates of convergence, as opposed to the non-noisy case (Proposition 2.5).
As for the central limit theorem for the density estimator, we have results mimicking the CLT for standard kernel density estimators (cf. [28, Theorem 2]): if h_n is small, then the CLT is the same as in the i.i.d. case; if h_n is "big", the LRD effect starts to dominate. Note that the change from "i.i.d." behavior to LRD behavior occurs in the same way as in standard kernel estimation, according to whether h_n = o(n/σ²_{n,1}) or n/σ²_{n,1} = o(h_n), where σ²_{n,1} := Var(∑_{j=1}^n X_j).
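To see that both bandwidth regimes around the boundary h_n ∼ n/σ²_{n,1} (noted again in Remark 2.8 below) are nonempty, one can carry out a back-of-the-envelope calculation for the partial-sum variance under the covariance decay ρ_k = L(k)k^{−(2γ−1)}; this is standard and only tracks orders of magnitude.

```latex
% Order of the partial-sum variance under \rho_k = L(k)\,k^{-(2\gamma-1)}:
\sigma^2_{n,1} = \operatorname{Var}\Big(\sum_{j=1}^n X_j\Big)
  \asymp n \sum_{k=1}^{n} \rho_k \asymp n^{3-2\gamma} L(n),
  \qquad \tfrac12 < \gamma < 1 .

% Hence the boundary sequence tends to zero,
\frac{n}{\sigma^2_{n,1}} \asymp \frac{n^{2\gamma-2}}{L(n)} \longrightarrow 0,

% so both regimes can occur for bandwidths h_n \to 0:
% "small" bandwidths, h_n = o\big(n/\sigma^2_{n,1}\big): i.i.d.-type CLT;
% "big" bandwidths, n/\sigma^2_{n,1} = o(h_n): the LRD effect dominates.
```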
In the distribution case, we do not have such dichotomous behavior and long range dependence always influences the limiting behavior. We note in passing that "small" and "big" bandwidths may have different meanings for different estimation problems. For example, "small" bandwidths are different when estimating a function and its derivative (see [22] for a complete analysis in the regression setting). In the present context of density estimation for errors-in-variables models, "small" and "big" bandwidths are the same as in the non-noisy case. Of course, this dichotomous behavior is well known; however, the crucial difference between the noisy and the non-noisy problem is the optimal bandwidth choice. Note that in the non-noisy setting the optimal bandwidth, under appropriate conditions, is not influenced by γ, regardless of whether estimation of the function (as mentioned above) or of its derivatives is considered. Thus, in errors-in-variables models we have different phenomena than those described in [22].
Another phenomenon is that for supersmooth densities the optimal bandwidth choice and the rates for MSE(f, h_n) are always the same as in the i.i.d. case, irrespective of the dependence being moderate or very strong. At first sight this message seems optimistic; however, it means that the rate of convergence is so slow that even very strong dependence cannot worsen it.

Results
Recall that by the SRD assumption we mean that c_0 = 1 and ∑_{k=0}^∞ |c_k| < ∞. We assume that f = f_X is twice differentiable with continuous and bounded second order derivatives and that K is of second order, i.e. ∫uK(u)du = 0 and 0 < ∫u²K(u)du < ∞. Furthermore, we assume that φ_ε and φ_K are twice differentiable with continuous and bounded derivatives. These assumptions are standard in the i.i.d. situation for both the ordinary smooth and the supersmooth case. The proofs are based on the decomposition f̂_n(x_0) − E f̂_n(x_0) = m_n(x_0) + l_n(x_0). Note that {m_n(x_0), F_n, n ≥ 1} is a martingale. We call l_n(x_0) the differentiable part. A similar decomposition is also valid in the distribution case.
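The second-order kernel conditions above are easy to verify numerically. The following sketch uses the standard Gaussian kernel purely as an illustration of the moment conditions (in deconvolution one typically prefers kernels whose Fourier transform φ_K is compactly supported; that aspect is not checked here).

```python
import numpy as np

# Standard Gaussian kernel K(u), taken only to illustrate second-order moments.
u = np.linspace(-10.0, 10.0, 200_001)
K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
du = u[1] - u[0]

m0 = (K * du).sum()            # total mass: should be 1
m1 = (u * K * du).sum()        # first moment: should vanish (second-order kernel)
m2 = (u ** 2 * K * du).sum()   # second moment: finite and nonzero

print(m0, m1, m2)  # approximately 1, 0, 1
```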

Ordinary smooth densities
Throughout this section, we consider the ordinary smooth case, i.e. we assume that |φ_ε(t)||t|^β → c as |t| → ∞ for some constants c > 0 and β > 0.
Furthermore, we assume two additional regularity conditions. To deal with the LRD case we will impose a stronger condition than (5), with β > 1. We will also consider some technical assumptions on the densities: (A) f_Z and f_ε, the densities of Z and ε, respectively, are uniformly bounded and Lipschitz continuous.
Note that in this case f_{ε+Z} (the density of ε + Z), f = f_X and f_Y (the density of Y) are also uniformly bounded and Lipschitz continuous. These conditions are required to handle the SRD case.
(B) … dv < ∞ and f_Z, f_{ε+Z} are twice differentiable with continuous and bounded derivatives.
These conditions are required for the LRD case (see the Appendix for more discussion). First, we provide the asymptotic expansion of the mean square error.
Remark 2.4. The result of Proposition 2.1 extends the previous ones for ρ- and α-mixing (see [19, Lemma 2.1b]) or associated sequences ([21]). In principle, it says that the optimal bandwidth and the rate of convergence of Var f̂_n(x_0) (and, consequently, of MSE(f, h_n)) for weakly dependent sequences are the same as in the i.i.d. case. This is also true for LRD sequences with moderate dependence (γ close to 1). On the other hand, if the dependence is very strong, then the bandwidth and the rate of convergence may depend on γ.
As for the distribution estimator we have the following result.
We can see that the optimal bandwidth is h_n ∼ C(σ²_{n,1}/n²)^{1/(2(β+2))} and the optimal mean square error is of the order (σ²_{n,1}/n²)^{2/(β+2)}. Under weak dependence the optimal bandwidth and the optimal mean square error are proportional to n^{−1/(2(β+2))} and n^{−2/(β+2)}, respectively. Consequently, in the case of the distribution function the optimal bandwidth and the rates change as soon as we cross the boundary between short- and long-range dependence.
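These rates can be recovered from a standard bias-variance balance; the variance order used below, (σ²_{n,1}/n²) h_n^{−2β}, is the one consistent with the stated rates, and the display tracks orders of magnitude only.

```latex
% Balancing squared bias against variance (orders only, constants dropped):
\operatorname{MSE}(F,h_n) \asymp
   \underbrace{h_n^{4}}_{\text{bias}^2}
 + \underbrace{\frac{\sigma^2_{n,1}}{n^2}\, h_n^{-2\beta}}_{\text{variance}} .

% The two terms are of the same order when
h_n^{4+2\beta} \asymp \frac{\sigma^2_{n,1}}{n^2}
\quad\Longleftrightarrow\quad
h_n \asymp \Big(\frac{\sigma^2_{n,1}}{n^2}\Big)^{1/(2(\beta+2))},

% which yields
\operatorname{MSE}(F,h_n) \asymp \Big(\frac{\sigma^2_{n,1}}{n^2}\Big)^{2/(\beta+2)} .
```

Under SRD, σ²_{n,1} ≍ n, and the two expressions reduce to the weakly dependent rates n^{−1/(2(β+2))} and n^{−2/(β+2)}.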
As for the CLT, we have the following results.
Theorem 2.7. Suppose that nh_n → ∞ and let σ²(x_0) = D_1 f_Y(x_0). Under the conditions of Proposition 2.2 we have

Remark 2.8. Theorem 2.6 extends the results of [10] and [20]. The result of Theorem 2.7 should be compared with Theorem 2 in [28]. Note that the change from SRD to LRD behavior occurs in the same way as in the standard kernel density case, i.e. by crossing the boundary h_n ∼ n/σ²_{n,1}. Theorem 2.7 describes the dichotomous behavior of f̂_n(x_0). If f′_Y(x_0) = 0, then we may establish trichotomous behavior along the lines of Theorem 3 in [28].
Remark 2.9. We shall comment on the assumption EZ⁴_1 < ∞. It is necessary for us to use Wu's [26] result for empirical processes (see Lemma C below). Instead, we could use the Giraitis and Surgailis [12] assumption E|Z_1|^{2+δ} < ∞ together with an additional condition on f_Z (see also Section 2.2). However, this does not completely solve the problem in the case where only EZ²_1 < ∞ holds.

Remark 2.10. It would be desirable to extend the results of, especially, Proposition 2.2 and Theorem 2.7 to the multivariate setting. However, this does not seem to be feasible when using the martingale approximation approach of the current paper.
Remark 2.11. We do not provide a CLT for F̂_n(x_0) in the weakly dependent case. The martingale method we use here is based on the fact that in the density case the differentiable part is negligible compared to the martingale part, provided that the SRD conditions hold (compare (18) with (20)). However, in the distribution case, if the SRD assumptions are fulfilled, then the martingale part and the differentiable part are of the same order and the method does not apply. We also note that the problem is symmetric in X and ε: instead of assuming that the X_j are dependent and the ε_j are i.i.d., we may assume that the X_j are i.i.d. and the ε_j are dependent. What matters in our results is the dependence structure of the Y_j's. In [17] it is assumed that X_j is mixing and it is claimed that Var F̂_n(x_0) behaves differently according to whether the ε_j are dependent or i.i.d. Note, however, that their proof of Lemma 3.2(i) is invalid.
To obtain a confidence interval for f̂_n(x_0) we choose the bandwidth so that the variance of the estimator dominates the bias term. In particular, in the LRD case this reads as follows.
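Schematically, this undersmoothing recipe leads to the usual normal interval; here sd-hat denotes any consistent estimate of the standard deviation of f̂_n(x_0) (its form depends on which CLT above applies), and the display is only a sketch of the construction.

```latex
% Undersmoothing: pick h_n with
\operatorname{bias}\big(\hat f_n(x_0)\big)
  = o\big(\operatorname{sd}(\hat f_n(x_0))\big),
% so that the CLT yields the asymptotic level-(1-\alpha) confidence interval
\hat f_n(x_0) \;\pm\; z_{1-\alpha/2}\,
  \widehat{\operatorname{sd}}\big(\hat f_n(x_0)\big),
% where z_{1-\alpha/2} is the standard normal quantile.
```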

Supersmooth densities
In the supersmooth case we consider the usual assumptions (cf. [20]), among them: (iv) the real (imaginary) part of φ_ε is negligible as t → ∞ with respect to the imaginary (real) part.
To deal with LRD linear processes (1), recall that ρ_k = L(k)k^{−(2γ−1)} as k → ∞. We assume additionally that condition (10) holds for all x, y, where f_k is the joint density of (Y_0, Y_k), δ > 0, and h is an integrable and continuous function. The result of Proposition 2.15 means that in the supersmooth case long range dependence has no influence on the optimal bandwidth choice and the optimal rates for MSE(f, h_n). They are the same as in the i.i.d. and weakly dependent situations (cf. [9,19]).
Remark 2.17. Note that the martingale approximation method used in the ordinary smooth case requires precise information about ||g_n||_1, in particular its finiteness. This is not feasible in the supersmooth case. Instead, we additionally assume (10). We could have worked with this assumption in the ordinary smooth case and obtained the results for MSE(f, h_n). However, using the linear structure and the martingale approximation method we obtain at the same time both MSE(f, h_n) and the central limit theorem.

Proofs
Since f_X is twice differentiable with continuous and bounded second order derivatives and K is of second order, we obtain (see [18])

E f̂_n(x_0) − f(x_0) = (1/2) f″(x_0) ∫u²K(u)du · h_n² + o(h_n²).
Proof of Proposition 2.5. We sketch it briefly, since it is similar to the previous one.
Further, as in (19), we obtain (22). Taking a Taylor expansion of R_n(ξ) we obtain (23), as in the proof of Proposition 2.2. Comparing (22) with (23) we see that the martingale part is of smaller order; consequently, the differentiable part determines the rate. Since bias(F̂_n(x_0)) = (1/2) f′(x_0) ∫u²K(u)du · h_n² + o(h_n²), we conclude the result.
In order to prove CLT, we will use the martingale central limit theorem.
Lemma 3.1. Assume that nh_n → ∞. Then

Proof. The proof is similar to that of Lemma 2 in [28]. Since m_n is a martingale, it suffices to verify the Lindeberg condition and the convergence of the conditional variances.
Let ζ̄_j = (nh_n^{1−2β} … . Note that for sufficiently large n we have h_n^β|g_n(v)| ≤ C, and the bound depends neither on v nor on n. As for the Lindeberg condition, note that the set {h_n^β|g_n(v)| > (nh_n)^{1/2}ε/2} becomes empty for sufficiently large n. Consequently, as in (17), it suffices to prove the convergence of the conditional variances. By (15) and the ergodic theorem, the second part is o_P(1). By the Lipschitz continuity of f_{ε+Z} and f_Y, the first part is suitably bounded. Consequently, the result follows by Lemma C, since the martingale part is negligible.

Supersmooth case
Proof of Proposition 2.15. Let Z_{n,j} = (1/h_n) g_n(…). From [20] we know that l_n^{(1)} := C… Further, as in (25), … To ensure that I_1 → 0 as n → ∞ we choose h_n = d…, with κ > 0, as long as 0 < 2 − 2γ < θ < 1. Consequently, via (13), the bias term dominates and the mean-square rate of convergence is of the order (ln n)^{−2/β}.
Proof. Integrate by parts three times to obtain an expression of the form ∫ … dt. Consequently, if we show that the right-hand side is bounded by Ch_n^{−2}, we will prove that |ug_n(u)| = O(|u|^{−2}) (the bound depends on h_n) and hence that |ug_n(u)| is integrable.