Uniform-in-bandwidth consistency for kernel-type estimators of Shannon's entropy

We establish uniform-in-bandwidth consistency for kernel-type estimators of the differential entropy. Two kernel-type estimators of Shannon's entropy are considered. As a consequence, an asymptotic 100% confidence interval for the entropy is provided.


Introduction and Main Results
Let X_1, ..., X_n be independent random copies of a random vector X ∈ R^d with distribution function F(x) = P(X ≤ x) for x ∈ R^d, with d ≥ 1. Here, for X = (X^{(1)}, ..., X^{(d)}) and x = (x_1, ..., x_d), we write X ≤ x whenever X^{(i)} ≤ x_i for all i = 1, ..., d. We assume that the distribution function F(·) has a density f(·) with respect to Lebesgue measure on R^d. The differential entropy of f(·) is then given by

H(f) := -\int_{\mathbb{R}^d} f(x) \log f(x) \, dx,    (1)

whenever this integral is meaningful, and where dx denotes Lebesgue measure on R^d. The notion of differential entropy was essentially introduced by Shannon (1948). We refer to Cover and Thomas (2006, Chapter 8), and the references therein, for details. Because of its numerous applications, the problem of estimating H(f) has been the subject of considerable interest over the last decades (refer to Beirlant et al. (1997) and the references therein). The main purpose of the present article is to establish consistency and to provide asymptotic confidence intervals for the entropy functional H(f), based on kernel-type functional estimators.
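As a concrete instance of (1), the integral can be evaluated in closed form for a one-dimensional Gaussian density (a standard fact, recalled here only for illustration):

```latex
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}
\quad\Longrightarrow\quad
H(f) = -\int_{\mathbb{R}} f(x)\log f(x)\,dx
     = \tfrac{1}{2}\log\!\big(2\pi e\,\sigma^2\big),
```

since $-\log f(x) = \tfrac{1}{2}\log(2\pi\sigma^2) + (x-\mu)^2/(2\sigma^2)$, and taking expectations under $f$ gives $\tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2}$.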
Below, we will work under the following assumptions on f (•).
(F.1) H(f) is well defined by the integral (1), in the sense that

\int_{\mathbb{R}^d} f(x) \, |\log f(x)| \, dx < \infty.    (2)

(F.2) f(·) is bounded and strictly positive on R^d.
E-mail addresses: salim.bouzebda@upmc.fr, issam.elhattab@upmc.fr

We refer to Györfi and van der Meulen (1991) for conditions characterizing (2) in terms of f(·). To define our entropy estimator we define, in a first step, a kernel density estimator. Towards this aim, we introduce a measurable function K(·) fulfilling the conditions (K.1)-(K.4), which in particular require that K(·) be bounded, with

\int_{\mathbb{R}^d} K(t) \, dt = 1.

We then make use of an Akaike-Parzen-Rosenblatt (refer to Akaike (1954), Parzen (1962) and Rosenblatt (1956)) kernel estimator of f(·), defined as follows. Given a bandwidth sequence 0 < h_n ≤ 1 with h_n → 0, set, for x ∈ R^d,

f_{n,h_n}(x) := \frac{1}{n h_n^d} \sum_{i=1}^{n} K\Big( \frac{x - X_i}{h_n} \Big).    (3)

In a second step, given f_{n,h_n}(·), we estimate H(f) by setting

H_{n,h_n,\beta}(f) := -\int_{A_{n,\beta}} f_{n,h_n}(x) \log f_{n,h_n}(x) \, dx,    (4)

where A_{n,β} := {x : f_{n,h_n}(x) ≥ (log_+ n)^{-β}} and β ∈ (0, 1/4) is a specified constant. Here, we set log_+ u := log(u ∨ e).

The limiting behavior of f_{n,h_n}(·), for appropriate choices of the bandwidth h_n, has been extensively investigated in the literature (refer to Bosq and Lecoutre (1987), Devroye and Györfi (1985) and Devroye and Lugosi (2001)). In particular, under our assumptions, the condition that h_n → 0 together with nh_n^d → ∞ is necessary and sufficient for the convergence in probability of f_{n,h_n}(x) to f(x). Deheuvels (2000), Einmahl and Mason (2000), Deheuvels and Mason (2004), Einmahl and Mason (2005) and Dony and Einmahl (2006) established uniform consistency results for such estimators, where h_n varies within suitably chosen intervals indexed by n. In the present paper we will use their methods to establish convergence results for H_{n,h_n,β}(f). Our main result is as follows.
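To fix ideas, the following is a minimal numerical sketch of the estimator (3)-(4) in dimension d = 1. It is our own illustration, not the authors' code: the Gaussian kernel, the function names, and the Riemann-sum integration are all choices made here for concreteness. Note that for moderate n the threshold (log_+ n)^{-β} is not small, so the sample is drawn from a sharply peaked density to keep A_{n,β} nonempty.

```python
import numpy as np

def kde(x_grid, sample, h):
    """Akaike-Parzen-Rosenblatt estimator f_{n,h} with a Gaussian kernel (d = 1)."""
    u = (x_grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def entropy_estimate(sample, h, beta=0.2, grid_size=2000):
    """Plug-in estimator H_{n,h,beta}(f): integrate -f_{n,h} log f_{n,h}
    over the truncation set A_{n,beta} = {x : f_{n,h}(x) >= (log_+ n)^(-beta)}."""
    n = len(sample)
    x = np.linspace(sample.min() - 5 * h, sample.max() + 5 * h, grid_size)
    f_hat = kde(x, sample, h)
    threshold = np.log(max(n, np.e)) ** (-beta)   # (log_+ n)^(-beta)
    on_A = f_hat >= threshold                     # indicator of A_{n,beta}
    # Avoid log(0) off A_{n,beta}; those points contribute nothing anyway.
    integrand = np.where(on_A, -f_hat * np.log(np.where(on_A, f_hat, 1.0)), 0.0)
    return float(integrand.sum() * (x[1] - x[0]))  # Riemann sum on a uniform grid

rng = np.random.default_rng(42)
sample = 0.2 * rng.standard_normal(2000)  # N(0, 0.04); true entropy ~ -0.19
H_hat = entropy_estimate(sample, h=0.05)
print(H_hat)
```

Because the truncation discards the region where f_{n,h} falls below (log_+ n)^{-β}, the printed value underestimates H(f) at this sample size; the truncation level tends to 0 only as n → ∞, which is why the result is purely asymptotic.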
Theorem 1.1 Let K(·) satisfy (K.1)-(K.4), and let f(·) fulfill (F.1)-(F.2). Then, for each β ∈ (0, 1/4) and for any sequences of constants 0 < a_n < b_n ≤ 1 with b_n → 0 and cn^{-1} log n ≤ a_n for some c > 0, we have, with probability 1,

\lim_{n \to \infty} \sup_{a_n \le h \le b_n} \big| H_{n,h,\beta}(f) - H(f) \big| = 0.    (5)

An application of Theorem 1.1 shows that, with probability 1, for any bandwidth sequence {h_n} with a_n ≤ h_n ≤ b_n,

H_{n,h_n,\beta}(f) \to H(f) \quad \text{as } n \to \infty.    (6)

We note that the main problem in using entropy estimates such as (4) is the proper choice of h_n. The uniform-in-bandwidth consistency result given in (6) shows that any choice of h between a_n and b_n ensures the consistency of H_{n,h,β}(f).
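The practical content of the theorem is that any data-driven bandwidth eventually falling in [a_n, b_n] inherits consistency. The sketch below (our illustration; the constant c = 2, the choice of b_n, and the use of Silverman's rule of thumb are all assumptions made here, not taken from the paper) checks that a standard rule-of-thumb bandwidth sits inside such a range:

```python
import numpy as np

rng = np.random.default_rng(1)
for n in (10**3, 10**4, 10**5):
    sample = rng.standard_normal(n)
    # Silverman's rule-of-thumb bandwidth for d = 1
    h_rot = 1.06 * sample.std() * n ** (-0.2)
    a_n = 2.0 * np.log(n) / n        # lower edge c * log(n) / n, with c = 2
    b_n = np.log(n) ** (-0.5)        # one admissible sequence decreasing to 0
    print(n, a_n < h_rot < b_n)
```

Since h_rot is of order n^{-1/5}, it falls between a_n = O(n^{-1} log n) and any b_n decreasing to 0 slowly enough, so Theorem 1.1 covers it without requiring any fine tuning.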
Recalling that A_{n,β} := {x : f_{n,h_n}(x) ≥ (log_+ n)^{-β}}, we readily obtain from these relations that the difference H_{n,h_n,β}(f) − H(f) splits into a contribution from A_{n,β} and one from its complement. We can therefore write, for any n ≥ 1, the inequalities

\big| H_{n,h_n,\beta}(f) - H(f) \big| \le \Delta_{1,n,h_n,\beta} + \Delta_{2,n,h_n,\beta}.    (7)

By combining (8) with Theorem 9.1, page 79, in Devroye and Lugosi (2001), we obtain an almost sure bound for the first term. Since h_n ≤ b_n and b_n ↓ 0 as n → ∞, there exists a positive constant C_1 such that this bound holds for all n sufficiently large. We next evaluate the second term, Δ_{2,n,h_n,β}, in the right-hand side of (7). Since |log z| ≤ 1/z + z for all z > 0, we see that

\big| \log f_{n,h_n}(x) \big| \le \frac{1}{f_{n,h_n}(x)} + f_{n,h_n}(x).
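The elementary inequality invoked above follows directly from the bound log u ≤ u − 1, valid for all u > 0:

```latex
z \ge 1:\quad |\log z| = \log z \le z - 1 < z + \tfrac{1}{z},
\qquad
0 < z < 1:\quad |\log z| = \log\tfrac{1}{z} \le \tfrac{1}{z} - 1 < z + \tfrac{1}{z}.
```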
Similarly as above, we get, for any x ∈ A_{n,β},

\frac{1}{f_{n,h_n}(x)} + f_{n,h_n}(x) \le (\log_+ n)^{\beta} + f_{n,h_n}(x),

since f_{n,h_n}(x) ≥ (log_+ n)^{-β} on A_{n,β}. We can therefore write, for any n ≥ 1, a corresponding bound for Δ_{2,n,h_n,β}. By combining (10) with Theorem 3.1, page 30, in Devroye (1987), we conclude that, almost surely, this bound tends to 0. Since cn^{-1} log n ≤ h_n ≤ b_n and b_n ↓ 0 as n → ∞, there exists a positive constant C_2 such that the bound holds almost surely for all n sufficiently large.

We now impose slightly more general assumptions on the kernel K(·) than those of Theorem 1.1. Consider the class of functions

\mathcal{K} := \Big\{ K\big( (x - \cdot)/h \big) : h > 0, \; x \in \mathbb{R}^d \Big\}.

For ε > 0, set N(ε, \mathcal{K}) := \sup_Q N(κε, \mathcal{K}, d_Q), where κ := ‖K‖_∞ and the supremum is taken over all probability measures Q on (R^d, \mathcal{B}). Here, d_Q denotes the L_2(Q)-metric and N(ε, \mathcal{K}, d_Q) is the minimal number of balls {g : d_Q(g, g′) < ε} of d_Q-radius ε needed to cover \mathcal{K}. We assume that \mathcal{K} satisfies the following uniform entropy condition.
(K.5) For some constants C > 0 and ν > 0,

N(\varepsilon, \mathcal{K}) \le C \varepsilon^{-\nu}, \qquad 0 < \varepsilon < 1.

Finally, to avoid the use of outer probability measures in our statements, we impose the following measurability assumption.
(K.6) \mathcal{K} is a pointwise measurable class, that is, there exists a countable subclass \mathcal{K}_0 of \mathcal{K} such that we can find, for any function g ∈ \mathcal{K}, a sequence of functions {g_m} in \mathcal{K}_0 for which g_m(z) → g(z) for all z ∈ R^d.

Remark that Condition (K.5) is satisfied whenever K(·) is of bounded variation, and Condition (K.6) is satisfied whenever K(·) is right continuous (refer to Deheuvels and Mason (2004) and Einmahl and Mason (2005)).
Proof of Corollary 1.2. Recall that A_{n,β} = {x : f_{n,h_n}(x) ≥ (log_+ n)^{-β}}, and let A^c_{n,β} denote the complement of A_{n,β} in R^d, i.e., A^c_{n,β} = {x : f_{n,h_n}(x) < (log_+ n)^{-β}}. We repeat the arguments above with H_{n,h_n,β}(f) formally replaced by H(f). We show that there exist positive constants D_1 and D_2 such that the corresponding bounds hold for all n sufficiently large. We know (see, e.g., Einmahl and Mason (2005)) that when the density f(·) is uniformly Lipschitz continuous, we have, for each a_n ≤ h ≤ b_n, as n → ∞,

\sup_{x \in \mathbb{R}^d} \big| f_{n,h}(x) - f(x) \big| \to 0 \quad \text{almost surely}.

This, when combined with (16), entails (17) as n → ∞. Using (17) in connection with (5) implies (6). □