Entropy Estimate for $k$-Monotone Functions via Small Ball Probability of Integrated Brownian Motions

Metric entropy of the class of probability distribution functions on $[0,1]$ with a $k$-monotone density is studied through its connection with the small ball probability of $k$-times integrated Brownian motions.


Introduction
In statistical consultation, one is often confronted with the problem that a client shows a graph of a certain observed frequency distribution and asks, "What theoretical probability distribution would fit this observed distribution?" This question becomes mathematically meaningful once one specifies the family of densities to consider and the distance used to measure the deviation between the real density and the estimator of the density (Groeneboom (1985)). To answer such a question, non-parametric estimators, such as the Maximum Likelihood Estimator, are often used. It is well known that the rate of convergence of an estimator depends on the richness of the function class. In particular, it depends on the metric entropy of the function class. Thus, it is of interest to study the metric entropy of various shape-constrained function classes that have statistical significance. For a bounded set $T$ in a metric space equipped with distance $d$, the metric $\varepsilon$-entropy of $T$ is defined as the logarithm of the minimum covering number, i.e., $\log N(\varepsilon, T, d)$, where $N(\varepsilon, T, d) := \min\bigl\{m : \text{there exist } t_1, t_2, \ldots, t_m \text{ such that } T \subset \bigcup_{i=1}^{m} \{t \in T : d(t, t_i) < \varepsilon\}\bigr\}$.
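As a simple illustration of this definition (an example added for orientation, not one of the results of this paper), take $T = [0,1]$ with $d(s,t) = |s-t|$. Each open ball is an interval of length at most $2\varepsilon$, and $m$ of them can cover the compact interval $[0,1]$ only if $2\varepsilon m > 1$. Conversely, equally spaced centers show that this many balls suffice:

```latex
% Centers t_j = (2j-1)/(2m), j = 1, ..., m, with m > 1/(2*eps):
% the ball around t_j contains [(j-1)/m, j/m], so the m balls cover [0,1].
\[
  N(\varepsilon, [0,1], |\cdot|)
  = \left\lfloor \frac{1}{2\varepsilon} \right\rfloor + 1,
  \qquad\text{hence}\qquad
  \log N(\varepsilon, [0,1], |\cdot|) \asymp \log\frac{1}{\varepsilon}
  \quad (\varepsilon \to 0).
\]
```

A finite-dimensional set thus has entropy of logarithmic order, whereas infinite-dimensional function classes typically have entropy growing like a power of $1/\varepsilon$.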
Metric entropy was first introduced by A. N. Kolmogorov and has been extensively studied and applied in approximation theory, geometric functional analysis, probability theory, complexity theory, and other areas; see, e.g., the books by Kolmogorov and Tihomirov (1961), Lorentz (1966), Carl and Stephani (1990), and Edmunds and Triebel (1996). Among many beautiful results are the duality theorem (Tomczak-Jaegermann (1987), Artstein et al. (2004)) and the small ball probability connection (Kuelbs and Li (1993), Li and Linde (1999)), both of which will be used in this paper. Nevertheless, estimating the metric entropy of specific function classes remains difficult, especially the lower bound, which often requires the construction of a well-separated subset. In this paper, we study the metric entropy of a class of shape-constrained functions called $k$-monotone functions. These functions have been studied since at least the 1950s; for example, Williamson (1956) gave a characterization of $k$-monotone functions on $(0, \infty)$. In recent years, there has been considerable interest in this class of functions in statistics. We refer to the recent paper by Balabdaoui and Wellner (2004) and the references therein for recent results and their statistical applications. A function $f$ on a bounded interval, say $[0, 1]$, is said to be $m$-monotone if $(-1)^k f^{(k)}(x)$ is non-negative, non-increasing, and convex for $0 \le k \le m - 2$ when $m \ge 2$, and if $f(x)$ is non-negative and non-increasing when $m = 1$. Let us note that in dealing with the metric entropy of this function class under $L_p$ norms, $1 \le p < \infty$, we can always assume that the functions are infinitely differentiable, by using the following basic lemma.
Proof. The idea of the proof is simple. Basically, we can approximate a continuous function $f$ by $f * K$ for some $C^\infty$ kernel $K$ without changing the $m$-monotonicity. However, this requires an extension of $f$ to a larger interval containing $[0, 1]$ while maintaining the $m$-monotonicity, which is not immediately clear for $m \ge 2$. Thus, we give a detailed proof for the case $m \ge 2$. If $f$ is $m$-monotone on $[0, 1]$ for $m \ge 2$, then, by definition, $(-1)^{m-2} f^{(m-2)}$ is non-negative increasing and convex. For any $\varepsilon > 0$, we can find a piecewise linear non-negative increasing convex function $g_{m-2}$ such that $\|(-1)^{m-2} f^{(m-2)} - g_{m-2}\|_p < \varepsilon$. Extend $g_{m-2}$ to $\mathbb{R}$ so that $g_{m-2}$ is supported on $[-1, 1]$ and, once restricted to $[-1, 1]$, $g_{m-2}$ is a continuous non-negative increasing convex function. Let $K_\varepsilon$ be a $C^\infty(-\infty, \infty)$ kernel supported on $[0, 1]$ such that is a non-negative, increasing and convex function of $t$, and we see that $h_{m-2}$ is also a non-negative, increasing and convex function on $[0, 1]$. Now, define Repeating this process, we can obtain an $m$-monotone $C^\infty$ function $h_0$ such that $\|f - h_0\|_p \le 2\varepsilon$.
In view of Lemma 1.1, we will simply say for convenience that a function is m-monotone if Our main result is the following This is a generalization of a result due to Van

A Characterization
First, we need a characterization of the function class $M_m$. Recall that Williamson (1956) proved that a function $g$ is $k$-monotone on $(0, \infty)$ if and only if there exists a non-decreasing function $\gamma$ bounded at 0 such that where $a_1, a_2, \ldots, a_m \ge 0$, $\mu$ is a non-negative measure on $[0, L]$, and Proof. Suppose $F$ is a probability distribution function on $[0, L]$ with an $m$-monotone density. Then

Repeatedly using integration by parts gives
. Then $a_k \ge 0$, and we have It remains to prove that $a_1 L + a_2 L^2 + \cdots + a_m L^m + \|\mu\| = 1$. Note that by repeatedly using integration by parts, we also have This proves that $F$ can be expressed as (1). The other direction is trivial.
The proof of Theorem 2.1 also gives the following where $a_1, a_2, \ldots, a_m \ge 0$, and $\mu$ is a non-negative measure. Furthermore, where $\mu$ is a finite measure on $(0, \infty)$.

Proof of the Main Result
We denote by $Q_m$ the class of functions on $[0, 1]$ of the form where $\mu(t)$ is a non-negative measure with total variation bounded. We also denote by $P_m$ the class of polynomials of the form $a_1 (1 - x) + \cdots + a_m (1 - x)^m$, with $a_1, a_2, \ldots, a_m \ge 0$ and $a_1 + \cdots + a_m \le 1$. Then Theorem 2.1 implies that $Q_m \subset M_m \subset Q_m + P_m$. Thus, On the other hand, it is easy to see that Indeed, the set forms an $m/N$-net of $P_m$, and there are only $N^m$ elements in this set. By choosing $N = \lceil m/\varepsilon \rceil$, inequality (3) follows. Substituting (3) into (2), we obtain provided that we show $\log N(\varepsilon, Q_m, \|\cdot\|_2) \asymp \varepsilon^{-\alpha}$ for some $\alpha > 0$.
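The net argument for $P_m$ can be sketched as follows. The display defining the net is not reproduced above, so the sketch assumes that the net consists of the polynomials in $P_m$ whose coefficients lie on the grid $\{0, 1/N, \ldots, (N-1)/N\}$; the symbols $b_i$ are introduced here for illustration only.

```latex
% Assumption: b_i is a_i rounded down to the grid {0, 1/N, ..., (N-1)/N},
% so that 0 <= a_i - b_i <= 1/N and b_1 + ... + b_m <= a_1 + ... + a_m <= 1.
\[
  \Bigl\| \sum_{i=1}^{m} a_i (1-x)^i - \sum_{i=1}^{m} b_i (1-x)^i \Bigr\|_2
  \le \sum_{i=1}^{m} |a_i - b_i| \, \bigl\| (1-x)^i \bigr\|_2
  < \frac{m}{N},
\]
% because \|(1-x)^i\|_2 = (2i+1)^{-1/2} < 1 on [0,1].
% There are N^m grid points (b_1, ..., b_m), so with N = \lceil m/\varepsilon \rceil:
\[
  \frac{m}{N} \le \varepsilon
  \qquad\text{and}\qquad
  \log\bigl(N^m\bigr) = m \log \lceil m/\varepsilon \rceil .
\]
```

Under this reading, the polynomial part contributes only a term of logarithmic order in $1/\varepsilon$, which is negligible compared with any entropy of order $\varepsilon^{-\alpha}$.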
To estimate the covering number $N(\varepsilon, Q_m, \|\cdot\|_2)$, we introduce an auxiliary function class $\widetilde{Q}_m$ that consists of all the functions on $[0, 1]$ that can be expressed as 1 − 1 where $\nu$ is a signed measure on $[0, 1]$ with total variation bounded by 1. The benefit of using this auxiliary function class is that $\widetilde{Q}_m$ has a certain useful symmetry, which will become clear later in the proof. It is clear that $Q_m \subset \widetilde{Q}_m$. So, $N(\varepsilon, Q_m, \|\cdot\|_2) \le N(\varepsilon, \widetilde{Q}_m, \|\cdot\|_2)$. On the other hand, if $F \in \widetilde{Q}_m$, then there exists a signed measure $\nu$ with total variation bounded by 1, such that Let $\mu_1 := \nu^+$ and $\mu_2 := \nu^-$. We have This means that for any $F \in \widetilde{Q}_m$, there exist $F_1, F_2 \in Q_m$ such that $F(x) = F_1(x) - F_2(x) + 1$ for all $x \in [0, 1]$, or $\widetilde{Q}_m \subset Q_m - Q_m + 1$. This immediately implies that provided that $\log N(\varepsilon, \widetilde{Q}_m, \|\cdot\|_2)$ is of the order $\varepsilon^{-\alpha}$ for some $\alpha > 0$, which will be proved later.
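The decomposition of a function in the auxiliary class can be checked in one line. In the sketch below, $\widetilde{Q}_m$ denotes the auxiliary class defined with signed measures, and $k_m(x,t)$ is a placeholder (not the paper's notation) for the kernel in the definition of $Q_m$, whose display is not reproduced above. The sketch assumes that $Q_m$ consists of the functions $1 - \int_0^1 k_m(x,t)\,d\mu(t)$ with $\mu \ge 0$ of total variation at most 1, and uses the Jordan decomposition $\nu = \nu^+ - \nu^- = \mu_1 - \mu_2$.

```latex
% k_m(x,t): hypothetical name for the kernel defining Q_m.
\[
  F(x)
  = 1 - \int_0^1 k_m(x,t)\, d\nu(t)
  = \Bigl( 1 - \int_0^1 k_m(x,t)\, d\mu_1(t) \Bigr)
  - \Bigl( 1 - \int_0^1 k_m(x,t)\, d\mu_2(t) \Bigr) + 1
  = F_1(x) - F_2(x) + 1 .
\]
% If {G_i} is an (eps/2)-net of Q_m, then {G_i - G_j + 1} is an eps-net of
% the auxiliary class, so (up to the usual constants in eps)
\[
  \log N(\varepsilon, \widetilde{Q}_m, \|\cdot\|_2)
  \le 2 \log N(\varepsilon/2, Q_m, \|\cdot\|_2).
\]
```

Since $\mu_1$ and $\mu_2$ are non-negative with total variation at most that of $\nu$, both $F_1$ and $F_2$ belong to $Q_m$. When the entropy is of polynomial order $\varepsilon^{-\alpha}$, the factor 2 and the change from $\varepsilon$ to $\varepsilon/2$ affect only constants, so the two classes have entropy of the same order.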
Of course, $N(\varepsilon, S, \|\cdot\|_{\ell_2})$ is the same as $N(\varepsilon, T, \|\cdot\|_{\ell_2})$, where $T = \{(a_1, a_2, \ldots) : a_n = \langle F, \varphi_n \rangle,\ n \in \mathbb{N},$ Note that $T$ is a symmetric convex subset of $\ell_2$. (The purpose of introducing the auxiliary function class is to create this symmetry.) By the duality theorem of metric entropy (Tomczak-Jaegermann (1987), Artstein et al. (2004)), provided that either side of the relation below is of the order $\varepsilon^{-\alpha}$ for some $\alpha > 0$, where $D_2$ is the unit ball of $\ell_2$ and $\|\cdot\|_{T^\circ}$ is a norm induced by the set Now, let us take a closer look at the set $T$. Note that by changing the order of integration, we can write Thus, $T$ is the absolute convex hull of the set $\{(a_1(t), a_2(t), \ldots) : a_n(t) =$ Next, we relate $T$ to an $m$-times integrated Brownian motion. Let $W(t)$, $t \in [0, 1]$, be a Brownian motion on $[0, 1]$. Writing $W(t)$ in a canonical expansion, we have where $\xi_n$ are independent $N(0, 1)$ random variables. Let $B_m$ be an $m$-times integrated Brownian motion, i.e., By using the canonical expansion (5) and changing the order of integration, we have Note that the right-hand side is exactly the inner product of the vector $(\xi_1, \xi_2, \ldots)$ and a vector in the set $T$. Thus, We will use the connection between metric entropy and small ball probability to estimate $\log N(\varepsilon, D_2, \|\cdot\|_{T^\circ})$. By a general connection between small ball probability and metric entropy discovered by Kuelbs and Li (1993) and completed in Li and Linde (1999), the covering number $N(\varepsilon, D_2, \|\cdot\|_{T^\circ})$ is connected with the Gaussian measure of $\{x \in \mathbb{R}^\infty : \|x\|_{T^\circ} \le \varepsilon\}$, that is, the small ball probability $P(\sup_{t \in [0,1]} |B_m(t)/t^m| \le \varepsilon)$. The precise connection is as follows: Therefore, it remains to estimate $\log P(\sup_{t \in [0,1]} |B_m(t)/t^m| \le \varepsilon)$.
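The canonical expansion of $W$ used in this argument can be sanity-checked as follows. Since display (5) is not reproduced above, the sketch assumes the standard form $W(t) = \sum_{n} \xi_n \int_0^t \varphi_n(s)\,ds$, where $\{\varphi_n\}$ is an orthonormal basis of $L_2[0,1]$; both the form and the role of $\varphi_n$ are assumptions here. Parseval's identity then recovers the Brownian covariance:

```latex
% Uses \int_0^t \varphi_n(s) ds = <1_{[0,t]}, \varphi_n> and E[\xi_n \xi_k] = \delta_{nk}.
\[
  \mathbb{E}\, W(s) W(t)
  = \sum_{n=1}^{\infty}
    \langle \mathbf{1}_{[0,s]}, \varphi_n \rangle \,
    \langle \mathbf{1}_{[0,t]}, \varphi_n \rangle
  = \langle \mathbf{1}_{[0,s]}, \mathbf{1}_{[0,t]} \rangle
  = \min(s, t).
\]
```

The same mechanism, with an integral kernel in place of the indicator function, is what allows a Gaussian series to be read as the inner product of $(\xi_1, \xi_2, \ldots)$ with a vector of basis coefficients.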
It is clear that On the other hand, by the Weak Gaussian Correlation Inequality (Li (1999)), By choosing $\delta = \frac{m}{4m+2}$ and $\lambda = 2\sqrt{\delta}$, we have By iteration, we have $\log P(\sup$ for some constants $C > 0$ and $0 < c < 1$, which, together with (7), implies $\log P(\sup$ provided that the right-hand side is of the order $-\varepsilon^{-\beta}$ for some $\beta > 0$. However, it was proved in Chen and Li (1999) that Putting (4), (6), (8) and (9) together, we conclude that

Some Remarks
In statistical applications, one may also want to consider the metric entropy of $m$-monotone densities, that is, $m$-monotone functions $g$ on $[0, 1]$ satisfying $\|g\|_1 = 1$, with $m \ge 1$. If we denote this class of functions by $D_m$, then a similar argument gives Also note that for the class of $m$-monotone functions on $[0, 1]$, even if we only consider the functions with continuous $f^{(m)}$, we generally cannot assume $f^{(m)}$ to be bounded. One might think that by restricting $f^{(m)}$ to be bounded by a certain number, one would obtain a smaller metric entropy. However, through a similar argument one can show that the metric entropy of the subclass of $m$-monotone functions on $[0, 1]$ with $|f^{(k)}| \le 1$ for all $k \le m$ has order $\varepsilon^{-1/m}$ as well.
Let us also remark that instead of requiring $(-1)^k f^{(k)} \ge 0$ for all $1 \le k \le m$, one can require . We call such a class of functions a general $m$-monotone class. We note that not only does the same result as Theorem 1.2 hold for that class, but the same argument also works. Indeed, if in the definition of $m$-times integrated Brownian motion $W(x_n)\,dx_n \cdots dx_3\,dx_2\,dx_1$, we replace some of the integral limits "from 0 to $x_i$" by "from $x_i$ to 1", we obtain a general $m$-times integrated Brownian motion $B_m$, which was introduced in Gao et al. (2003). By interchanging the order of integration, a general $m$-times integrated Brownian motion can then be expressed as $\int_0^1 K(t, s)\,dW(s)$ for some kernel $K(t, s)$. By properly choosing the integral limits (either from 0 to $x_i$, or from $x_i$ to 1) in the definition of $B_m$, we can make $(-1)^{\varepsilon_k} \frac{\partial^k}{\partial t^k} K(t, s) \le 0$.
Denoting $Q(t) = \int_0^s K(t, s)\,ds$, one can characterize the class of probability distribution functions on $[0, 1]$ with a general $m$-monotone density as in Theorem 2.1, and argue that the problem of estimating the metric entropy of the function class under the $L_2$ norm becomes the problem of estimating the small ball probability of $B_m(t)/Q(t)$ under the supremum norm, which eventually leads to the problem of estimating the small ball probability of general $m$-times integrated Brownian motion. However, it was recently proved by Gao and Li (2006