Some extensions of an inequality of Vapnik and Chervonenkis

The inequality of Vapnik and Chervonenkis controls the expectation of a function by its sample average, uniformly over a VC-major class of functions and taking into account the size of the expectation. Using Talagrand's kernel method we prove a similar result for classes of functions for which Dudley's uniform entropy integral or the bracketing entropy integral is finite.


Introduction and main results.
Let $\Omega$ be a measurable space with a probability measure $P$ and let $\Omega^n$ be a product space with the product measure $P^n$. Consider a family of measurable functions $\mathcal{F} = \{f : \Omega \to [0, 1]\}$. Denote

$$Pf = \int f\,dP, \qquad \bar{f} = \frac{1}{n}\sum_{i \le n} f(x_i), \qquad x = (x_1, \ldots, x_n) \in \Omega^n.$$

The main purpose of this paper is to provide probabilistic bounds for $Pf$ in terms of $\bar{f}$ and the complexity assumptions on the class $\mathcal{F}$. We are trying to extend the following result of Vapnik and Chervonenkis ([17]). Let $\mathcal{C}$ be a class of sets in $\Omega$. Let

$$S(n) = \max_{x \in \Omega^n} \Bigl|\bigl\{\{x_1, \ldots, x_n\} \cap C : C \in \mathcal{C}\bigr\}\Bigr|.$$

* The work was done during a summer internship at AT&T Research Labs at the Laboratory of Speech and Image Processing.
The VC dimension $d$ of the class $\mathcal{C}$ is defined as $d = \inf\{j \ge 1 : S(j) < 2^j\}$.
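As a small illustration of the definitions of $S(n)$ and $d$ (not part of the original argument), the following Python sketch computes both by brute force on a finite ground set. The class of integer intervals and the helper names `shatter_coefficient` and `vc_dimension` are assumptions made only for this example. Only samples of distinct points are enumerated, since repeating a point cannot increase the number of traces.

```python
from itertools import combinations

def shatter_coefficient(points, sets, n):
    """Largest number of distinct traces {x_1,...,x_n} ∩ C over n-point samples."""
    best = 0
    for sample in combinations(points, n):
        traces = {frozenset(sample) & c for c in sets}
        best = max(best, len(traces))
    return best

def vc_dimension(points, sets):
    """d = inf{j >= 1 : S(j) < 2^j}, the convention used in the text."""
    for j in range(1, len(points) + 1):
        if shatter_coefficient(points, sets, j) < 2 ** j:
            return j
    return None  # every sample size available on this ground set is shattered

# Hypothetical example: all intervals {a, ..., b-1} of {0, ..., 5}, including the empty set.
points = range(6)
intervals = [frozenset(range(a, b)) for a in range(7) for b in range(a, 7)]

for n in (1, 2, 3, 4):
    print(n, shatter_coefficient(points, intervals, n), 2 ** n)
print("d =", vc_dimension(points, intervals))
```

With this convention the intervals have $d = 3$, because two points can be shattered but three cannot; the more common convention (size of the largest shattered set) would report 2.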
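$\mathcal{C}$ is called VC if $d < \infty$. The class of functions $\mathcal{F}$ is called VC-major if the class of sets

$$\mathcal{C} = \bigl\{\{x \in \Omega : f(x) > t\} : f \in \mathcal{F},\ t \in \mathbb{R}\bigr\}$$

is a VC class of sets in $\Omega$, and the VC dimension of $\mathcal{F}$ is defined as the VC dimension of $\mathcal{C}$. The inequality of Vapnik and Chervonenkis states that (see Theorem 5.3 in [18]) if $\mathcal{F}$ is a VC-major class of $[0, 1]$-valued functions with dimension $d$, then for all $\delta > 0$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$

$$\frac{1}{n}\sum_{i=1}^{n} \frac{Pf - f(x_i)}{\sqrt{Pf}} \le 2\Bigl(\frac{1}{n}\log S(2n) + \frac{1}{n}\log\frac{4}{\delta}\Bigr)^{1/2}, \tag{1.1}$$

where for $n \ge d$, $S(n)$ can be bounded by $S(n) \le (en/d)^d$ (see [16]) to give

$$\frac{1}{n}\sum_{i=1}^{n} \frac{Pf - f(x_i)}{\sqrt{Pf}} \le 2\Bigl(\frac{d}{n}\log\frac{2en}{d} + \frac{1}{n}\log\frac{4}{\delta}\Bigr)^{1/2}. \tag{1.2}$$

The factor $(Pf)^{-1/2}$ allows interpolation between the $n^{-1}$ rate for $Pf$ in the optimistic zero-error case $\bar{f} = 0$ and the $n^{-1/2}$ rate in the pessimistic case when $\bar{f}$ is "large". In this paper we will prove a bound of a similar nature under different assumptions on the complexity of the class $\mathcal{F}$. Using Talagrand's abstract concentration inequality in product spaces and the related kernel method for empirical processes [14], we will first prove a general result that interpolates between the optimistic and pessimistic cases. Then we will give examples of application of this general result in two situations, when it is assumed that either Dudley's uniform entropy integral is finite or the bracketing entropy integral is finite.

Let us formulate Talagrand's concentration inequality that is used in the proof of our main Theorem 2 below. Consider a probability measure $\nu$ on $\Omega^n$ and $x \in \Omega^n$. We will denote by $x_i$ the $i$-th coordinate of $x$. If $C_i = \{y \in \Omega^n : y_i \ne x_i\}$, we consider the image of the restriction of $\nu$ to $C_i$ by the map $y \to y_i$, and its Radon-Nikodym derivative $d_i$ with respect to $P$. As in [14] we assume that $\Omega$ is finite and each point is measurable with a positive measure. Let $m$ be the number of atoms in $\Omega$ and let $p_1, \ldots, p_m$ be their probabilities. By the definition of $d_i$ we have, for the $j$-th atom $\omega_j$,

$$d_i(\omega_j) = \frac{\nu(\{y \in C_i : y_i = \omega_j\})}{p_j}.$$

For $\alpha > 0$ we define a function $\psi_\alpha(x)$ by

$$\psi_\alpha(x) = \begin{cases} x^2/(4\alpha), & x \le 2\alpha, \\ x - \alpha, & x \ge 2\alpha. \end{cases}$$

We set

$$m_\alpha(\nu, x) = \sum_{i \le n} \int \psi_\alpha(d_i)\,dP, \qquad m_\alpha(A, x) = \inf\{m_\alpha(\nu, x) : \nu(A) = 1\}.$$

For each $\alpha > 0$ let $L_\alpha$ be any positive number satisfying the following inequality:

$$\frac{2L_\alpha\bigl(e^{1/L_\alpha} - 1\bigr)}{2L_\alpha + 1} \le \alpha. \tag{1.3}$$

The following theorem holds (see [9]).

Theorem 1 Let $\alpha > 0$ and let $L_\alpha$ satisfy (1.3). Then for any $n$ and $A \subseteq \Omega^n$ we have

$$\int \exp\Bigl(\frac{1}{L_\alpha}\, m_\alpha(A, x)\Bigr)\,dP^n(x) \le \frac{1}{P^n(A)}. \tag{1.4}$$

Below we will only use this theorem for $\alpha = 1$ and $L_1 \approx 1.12$. Let us introduce the normalized empirical process as (1.5)

The factor $\varphi(f)$ will play the same role as $(nPf)^{1/2}$ plays in (1.1). The following theorem holds.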
Theorem 2 Let $L \approx 1.12$. If (1.5) holds then for any $u > 0$,

Proof. The proof of the theorem repeats the proof of Theorem 2 in [9] with some minor modifications, but we give it here for completeness. Let us consider the set $A = \{Z(x) \le M\}$. Clearly, $P^n(A) \ge 1/2$. Let us fix a point $x \in \Omega^n$ and then choose $f \in \mathcal{F}$. For any point

Therefore, for any probability measure $\nu$ such that $\nu(A) = 1$ we will have

It is easy to observe that for $v \ge 0$ and $-1 \le u \le 1$,

Therefore, for any $\delta > 1$

Taking the infimum over $\nu$ we obtain that for any $\delta > 1$

Let us denote the random variable $\xi = f(y_1)$, by $F_\xi(t)$ the distribution function of $\xi$, and

One can check that $h(c)$ is decreasing and convex, with $h(0) = Pf^2$ and $h(1) = 0$. Therefore,

Hence, we showed that

Theorem 1 then implies, via the application of Chebyshev's inequality, that with probability

For $u \le nPf/L$ the infimum over $\delta > 1$ equals $2\sqrt{LnuPf}$. On the other hand, for $u \ge nPf/L$ this infimum is greater than $2nPf$, whereas the left-hand side is always less than $nPf$.
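The last step of the proof is a one-dimensional optimization over $\delta > 1$. The sketch below checks the two stated cases numerically under the assumption that the quantity being minimized has the form $nPf/\delta + L\delta u$. This form is inferred from the two values quoted in the proof ($2\sqrt{LnuPf}$ for small $u$, and more than $2nPf$ for large $u$) and is not taken verbatim from the text; the numbers `n_pf`, `u_small` and `u_large` are hypothetical.

```python
import math

L = 1.12  # the value of L_1 quoted in the text

def objective(delta, n_pf, u, L=L):
    """Assumed form of the quantity minimized over delta > 1."""
    return n_pf / delta + L * delta * u

def grid_infimum(n_pf, u, L=L, delta_max=50.0, steps=100_000):
    """Approximate the infimum over delta in (1, delta_max] on a uniform grid."""
    h = (delta_max - 1.0) / steps
    return min(objective(1.0 + k * h, n_pf, u, L) for k in range(1, steps + 1))

n_pf = 40.0                   # hypothetical value of n * Pf
u_small, u_large = 5.0, 50.0  # one value below and one above n_pf / L

closed_form = 2.0 * math.sqrt(L * n_pf * u_small)
approx_small = grid_infimum(n_pf, u_small)
approx_large = grid_infimum(n_pf, u_large)

print("u <= nPf/L:", approx_small, "vs closed form", closed_form)
print("u >= nPf/L:", approx_large, "vs 2 nPf =", 2.0 * n_pf)
```

For $u \le nPf/L$ the minimizer $\delta^* = \sqrt{nPf/(Lu)}$ lies in the admissible range $\delta \ge 1$, which gives the closed form; for larger $u$ the assumed objective is increasing on $\delta > 1$, so its infimum is approached at $\delta = 1$ and equals $nPf + Lu \ge 2nPf$.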
We will now give two examples of normalization ϕ(f ) where we can prove that (1.5) holds.

Uniform entropy conditions.
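Given a probability distribution $Q$ on $\Omega$ we denote

$$d_{Q,2}(f, g) = \bigl(Q(f - g)^2\bigr)^{1/2}.$$

Let the packing number $D(\mathcal{F}, u, L_2(Q))$ be the maximal cardinality of any $u$-separated set, that is, a subset of $\mathcal{F}$ whose elements are pairwise more than $u$ apart in $d_{Q,2}$. We will say that $\mathcal{F}$ satisfies the uniform entropy condition if

$$\int_0^\infty \sqrt{\log D(\mathcal{F}, u)}\,du < \infty, \qquad D(\mathcal{F}, u) = \sup_Q D(\mathcal{F}, u, L_2(Q)),$$

and the supremum is taken over all discrete probability measures. It is well known (see, for example, [3]) that if one considers the subset $\mathcal{F}_p = \{f \in \mathcal{F} : Pf \le p\}$, then the expectation of $\sup_{\mathcal{F}_p} \sum_{i \le n} (Pf - f(x_i))$ can be estimated (in some sense, since the symmetrization argument is required) by

$$\varphi(p) = \sqrt{n} \int_0^{\sqrt{p}} \sqrt{\log D(\mathcal{F}, u)}\,du.$$

We will prove that it holds for all $p > 0$ simultaneously.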
Proof. The proof is based on standard symmetrization and chaining techniques. We will first prove that

where $y = (y_1, \ldots, y_n)$ lives on an independent copy of $(\Omega^n, P^n)$. We will show that the inequalities

If we define by $P_y$ the probability measure on the space of $y$, it would mean that

and taking the expectation of both sides with respect to $x$ would prove (1.10). To show the remaining implication we consider two cases, $nPf \le \sum_{i \le n} f(y_i)$ and $nPf \ge \sum_{i \le n} f(y_i)$. First assume that $nPf \le \sum_{i \le n} f(y_i)$. Since, as is easily checked, both $\varphi(p)$ and $p/\varphi(p)$ are increasing, we get
In the case $nPf \ge \sum_{i \le n} f(y_i)$ we have

The assumption $D(\mathcal{F}, \sqrt{Pf}) \ge D(\mathcal{F}, 1) \ge 2$ guarantees that $\varphi(Pf) \ge \sqrt{nPf \log 2}$ and, finally,

which completes the proof of (1.10). We have

where $(\varepsilon_i)$ is a sequence of Rademacher random variables. We will show that there exists $u$ independent of $n$ such that for any $x, y \in \Omega^n$

Clearly, this will prove the statement of the theorem. For fixed $x, y \in \Omega^n$ let

The packing number of $\mathcal{F}$ with respect to $d$ can be bounded by $D(\mathcal{F}, u, d) \le D(\mathcal{F}, u)$.
Consider an increasing sequence of sets

such that for any $g \ne h \in \mathcal{F}_j$, $d(g, h) > 2^{-j}$, and for all $f \in \mathcal{F}$ there exists $g \in \mathcal{F}_j$ such that $d(f, g) \le 2^{-j}$. The cardinality of $\mathcal{F}_j$ can be bounded by

For simplicity of notation we will write $D(u) := D(\mathcal{F}, u)$. If $D(2^{-j}) = D(2^{-j-1})$ then in the construction of the sequence $(\mathcal{F}_j)$ we will set $\mathcal{F}_j$ equal to $\mathcal{F}_{j+1}$. We will now define the sequence of projections $\pi_j : \mathcal{F} \to \mathcal{F}_j$, $j \ge 0$, in the following way. If $f \in \mathcal{F}$ is such that

In the case when $\mathcal{F}_k = \mathcal{F}_{k+1}$ we will choose $\pi_k(f) = \pi_{k+1}(f)$. This construction implies that $d(\pi_{k-1}(f), \pi_k(f)) \le 2^{-k+2}$. Let us introduce a sequence of sets

The cardinality of $\Delta_j$ does not exceed

By construction any $f \in \mathcal{F}$ can be represented as a sum of elements from $\Delta_j$

and define the event

On the complement $A^c$ of the event $A$ we have for any

It remains to prove that for some absolute constant $u$, $P(A) < 1/2$. Indeed,

The fact that $D(u)$ is decreasing implies

and, therefore,

for $\alpha = u^2/2^8 - 2$ big enough.
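The nested sets $\mathcal{F}_j$ used in the chaining argument can be illustrated on a finite class. The following Python sketch is only an illustration under stated assumptions: the class of threshold indicators on 16 atoms, the uniform weights `q`, and the helper names are hypothetical, and the greedy rule is one simple way to obtain sets that are $2^{-j}$-separated and $2^{-j}$-dense, not necessarily the construction intended in the proof. In particular, the special handling of the case $D(2^{-j}) = D(2^{-j-1})$ is not reproduced.

```python
import math

def l2_dist(f, g, q):
    """Distance (Q (f - g)^2)^{1/2} for a discrete measure Q with weights q."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(f, g, q)))

def extend_separated_set(base, functions, q, u):
    """Greedily extend `base` to a maximal u-separated subset of `functions`."""
    chosen = list(base)
    for f in functions:
        if all(l2_dist(f, g, q) > u for g in chosen):
            chosen.append(f)
    return chosen

def nested_nets(functions, q, levels):
    """Increasing sets F_0, F_1, ...; F_j is 2^-j separated and 2^-j dense."""
    nets, current = [], []
    for j in range(levels):
        current = extend_separated_set(current, functions, q, 2.0 ** (-j))
        nets.append(current)
    return nets

# Hypothetical class: indicators f_t(x) = 1{x >= t} on m atoms with uniform weights.
m = 16
q = [1.0 / m] * m
thresholds = [tuple(1.0 if x >= t else 0.0 for x in range(m)) for t in range(m + 1)]

nets = nested_nets(thresholds, q, levels=3)
for j, net in enumerate(nets):
    print(j, len(net))
```

Because a set that is $2^{-(j-1)}$-separated is also $2^{-j}$-separated, extending the previous level keeps the sequence increasing, and maximality of each greedy extension gives the covering property $d(f, g) \le 2^{-j}$.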
Combining Theorem 2 and Theorem 3 we get

Bracketing entropy conditions.
Given two functions $g, h : \Omega \to [0, 1]$ such that $g \le h$ and $(P(h - g)^2)^{1/2} \le u$, we will call the set of all functions $f$ such that $g \le f \le h$ a $u$-bracket with respect to $L_2(P)$. The $u$-bracketing

where $K(\mathcal{F})$ does not depend on $n$.
We omit the proof of this theorem since it is a modification of a standard bracketing entropy bound (see Theorems 2.5.6 and 2.14.2 in [15]), similar to what Theorem 3 is to the standard uniform entropy bound. The argument is more subtle, as it involves a truncation argument required by the application of Bernstein's inequality, but otherwise it repeats Theorem 3. Combining Theorem 2 and Theorem 4 we get

Corollary 2 If (1.11) holds then there exists an absolute constant $K > 0$ such that for any $u > 0$ with probability at least

Examples of application.
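Example 1 (VC-subgraph classes of functions). A class of functions $\mathcal{F}$ is called VC-subgraph if the class of sets

$$\mathcal{C} = \bigl\{\{(x, t) \in \Omega \times \mathbb{R} : t < f(x)\} : f \in \mathcal{F}\bigr\}$$

is a VC class of sets in $\Omega \times \mathbb{R}$. The VC dimension of $\mathcal{F}$ is equal to the VC dimension $d$ of $\mathcal{C}$.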
One can use Corollary 3 in [4] to show that Corollary 1 implies in this case that for any $\delta > 0$, with probability at least $1 - \delta$, for all

where $K > 0$ is an absolute constant. Instead of the $\log n$ on the right-hand side of (2.1) one could also write $\log(1/Pf)$, but we simplify the bound to eliminate this dependence on $Pf$. Note that the bound is similar to the bound (1.2) for VC classes of sets and VC-major classes. Unfortunately, our proof does not allow us to recover the same small value of $K = 2$ as for VC classes of sets.
The bound (2.1) improves the main result in [7], where it was shown that for any fixed $\nu > 0$ and any $\delta > 0$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$

It is easy to see that, in a sense, one would get (2.1) from (2.2) only after optimizing over $\nu$.
Indeed, for $Pf \lesssim \nu$, (2.2) gives

which, compared to (2.1), contains an additional factor of $(Pf/\nu)^{1/2}$. In the situation when $\nu$ is small (this is the only interesting case) this factor introduces an unnecessary penalty for any function $f$ such that $Pf \gg \nu$. Hence, for a fixed $\nu$, (2.2) improves the bound for $Pf \le \nu$ at the cost of those $f$ with $Pf \ge \nu$.
One can find alternative extensions of (2.2) in [5]. For some other applications of Corollary 1 see [10].
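The interpolation between the $n^{-1}$ and $n^{-1/2}$ regimes provided by the $(Pf)^{-1/2}$ normalization can be made concrete by solving a bound of the form $(Pf - \bar{f})/\sqrt{Pf} \le \varepsilon$ for $Pf$, which gives $Pf \le \bar{f} + \varepsilon^2/2 + \varepsilon\sqrt{\bar{f} + \varepsilon^2/4}$. The Python sketch below evaluates this under the assumption that $\varepsilon = c/\sqrt{n}$, where the constant `c` stands in for the complexity and confidence terms and is hypothetical, as are the function names.

```python
import math

def pf_upper_bound(f_bar, eps):
    """Largest p >= 0 satisfying (p - f_bar) / sqrt(p) <= eps."""
    return f_bar + eps ** 2 / 2.0 + eps * math.sqrt(f_bar + eps ** 2 / 4.0)

def excess(f_bar, n, c=1.0):
    """Resulting bound on Pf - f_bar when eps = c / sqrt(n) (c is hypothetical)."""
    return pf_upper_bound(f_bar, c / math.sqrt(n)) - f_bar

for n in (100, 10_000, 1_000_000):
    # First column: zero-error case; second column: sample average 0.25.
    print(n, excess(0.0, n), excess(0.25, n))
```

With $\bar{f} = 0$ the excess equals $\varepsilon^2$, which is of order $n^{-1}$, while for a fixed positive $\bar{f}$ the dominant term is $\varepsilon\sqrt{\bar{f}}$, of order $n^{-1/2}$.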
Then Corollary 1 or Corollary 2 imply that for $u > 0$, with probability at least $1 - 2e^{-u}$, for all $f \in \mathcal{F}$

$$Pf - \bar{f} \le \frac{c_\gamma}{\sqrt{n}}\,(Pf)\,\ldots$$

If $\bar{f} = 0$ then it is easy to see that for $u \le n^{\gamma/(2+\gamma)}$ we have $Pf \le K_\gamma\, n^{-2/(2+\gamma)}$.
As an example, if $\mathcal{F}$ is a class of indicator functions of sets with $\alpha$-smooth boundary in $[0, 1]^l$ and $P$ is absolutely continuous with respect to Lebesgue measure with bounded density, then well-known bounds on the bracketing entropy due to Dudley (see [3]) imply that $\gamma = 2(l - 1)/\alpha$ and $Pf \le K_\alpha\, n^{-\alpha/(l - 1 + \alpha)}$. Even though $\gamma = 2(l - 1)/\alpha$ may be greater than 2, so that Corollary 2 is not immediately applicable, one can generalize Theorem 4 to different choices of $\varphi(x)$, using the standard truncation in the chaining argument, to obtain the above rates even for $\gamma \ge 2$.
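The two rate exponents quoted above agree: substituting $\gamma = 2(l - 1)/\alpha$ into $2/(2 + \gamma)$ gives $\alpha/(l - 1 + \alpha)$. The following Python sketch checks this identity in exact arithmetic for a few hypothetical values of $l$ and $\alpha$; the function names are chosen only for this illustration.

```python
from fractions import Fraction

def gamma_smooth_boundary(l, alpha):
    """Entropy exponent gamma = 2 (l - 1) / alpha quoted in the text."""
    return Fraction(2 * (l - 1)) / Fraction(alpha)

def rate_exponent(gamma):
    """Exponent r in the zero-error rate n^{-r}, with r = 2 / (2 + gamma)."""
    return Fraction(2) / (2 + Fraction(gamma))

# Hypothetical (dimension, smoothness) pairs; the last column flags gamma >= 2.
cases = [(2, 1), (3, 2), (3, 1), (5, Fraction(1, 2))]
for l, alpha in cases:
    gamma = gamma_smooth_boundary(l, alpha)
    direct = Fraction(alpha) / (l - 1 + Fraction(alpha))
    print(l, alpha, gamma, rate_exponent(gamma), direct, gamma >= 2)
```

The last column marks the cases with $\gamma \ge 2$, which are the ones where the text notes that Corollary 2 does not apply directly.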