On the Concentration of the Missing Mass

A random variable is sampled from a discrete distribution. The missing mass is the probability of the set of points not observed in the sample. We sharpen and simplify McAllester and Ortiz's results (JMLR, 2003) bounding the probability of large deviations of the missing mass. Along the way, we refine and rigorously prove a fundamental inequality of Kearns and Saul (UAI, 1998).


Introduction
Hoeffding's classic inequality [3] states that if X is an [a, b]-valued random variable with EX = 0, then

E e^{tX} ≤ e^{t²(b−a)²/8},  t ∈ R.  (1)

A standard proof of (1) defines f(t) = E e^{tX} and establishes that

log f(t) ≤ t²(b−a)²/8,  (2)

where the last inequality follows by noticing that log f(0) = [log f(t)]′|_{t=0} = 0 and that [log f(t)]″ ≤ (b−a)²/4. Although (1) is tight, it is a "worst-case" bound over all distributions with the given support. Refinements of (1) include the Bernstein and Bennett inequalities [5], which take the variance into account, but these are also too crude for some purposes.
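As an illustrative sanity check (not part of the original argument), the following sketch verifies (1) numerically for a centered Bernoulli variable, whose support {−p, 1 − p} has length b − a = 1; the helper names are ours.

```python
import math

def bernoulli_mgf(p, t):
    """MGF of the centered Bernoulli: X = -p w.p. 1-p, X = 1-p w.p. p."""
    return (1 - p) * math.exp(-t * p) + p * math.exp(t * (1 - p))

def hoeffding_bound(t, b_minus_a=1.0):
    """Hoeffding's bound exp(t^2 (b-a)^2 / 8) from (1)."""
    return math.exp(t * t * b_minus_a ** 2 / 8)

# Spot-check the bound on a grid of (p, t) values.
for p in [0.1, 0.25, 0.5, 0.75, 0.9]:
    for t in [-3.0, -1.0, -0.1, 0.1, 1.0, 3.0]:
        assert bernoulli_mgf(p, t) <= hoeffding_bound(t) + 1e-12
```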
In 1998, Kearns and Saul [4] put forth an exquisitely delicate inequality for (generalized) Bernoulli random variables, which is sensitive to the underlying distribution: for p ∈ (0, 1), p ≠ 1/2, and t ∈ R,

(1 − p)e^{−tp} + p e^{t(1−p)} ≤ exp( (1 − 2p)t² / (4 log((1 − p)/p)) ).  (3)

One easily verifies that (3) is superior to (1), except for p = 1/2 (interpreted as a limit), where the two coincide. In fact, (3) is optimal in the sense that, for every p, there is a t for which equality is achieved. The Kearns-Saul inequality allows one to analyze various inference algorithms in neural networks, and the influential paper [4] has inspired a fruitful line of research [1, 7, 9, 10].
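The superiority of (3) over (1), and the coincidence at p = 1/2, can be checked numerically; the sketch below compares the exponent coefficient of (3) against Hoeffding's 1/8 (function names are ours).

```python
import math

def ks_coeff(p):
    """Exponent coefficient (1 - 2p) / (4 log((1 - p)/p)) from (3), p != 1/2."""
    return (1 - 2 * p) / (4 * math.log((1 - p) / p))

def centered_mgf(p, t):
    """E exp(tX) for the centered Bernoulli taking 1-p w.p. p and -p w.p. 1-p."""
    return (1 - p) * math.exp(-t * p) + p * math.exp(t * (1 - p))

for p in [0.01, 0.1, 0.3, 0.49, 0.51, 0.7, 0.9, 0.99]:
    # (3) improves on Hoeffding's coefficient 1/8 ...
    assert ks_coeff(p) <= 1 / 8
    # ... while still dominating the MGF.
    for t in [-4.0, -1.0, 0.5, 2.0]:
        assert centered_mgf(p, t) <= math.exp(ks_coeff(p) * t * t) + 1e-12

# The two coefficients coincide in the limit p -> 1/2.
assert abs(ks_coeff(0.5 + 1e-9) - 1 / 8) < 1e-6
```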
One specific application of the Kearns-Saul inequality involves the concentration of the missing mass. Let p = (p_1, p_2, ...) be a distribution over N and suppose that X_1, X_2, ..., X_n are sampled iid according to p. Define the indicator variable ξ_j to be 0 if j occurs in the sample and 1 otherwise:

ξ_j = 1{X_i ≠ j for all 1 ≤ i ≤ n}.

The missing mass is the random variable

U_n = Σ_{j∈N} p_j ξ_j.  (4)

McAllester and Schapire [8] first established subgaussian concentration for the missing mass via a somewhat intricate argument. Later, McAllester and Ortiz [7] showed how the standard inequalities of Hoeffding, Angluin-Valiant, Bernstein, and Bennett are inadequate for obtaining exponential bounds of the correct order in n, and developed a thermodynamic approach for systematically handling this problem.¹ We were led to the Kearns-Saul inequality (3) in an attempt to understand and simplify the missing mass concentration results of McAllester and Ortiz [7], some of which rely on (3). However, we were unable to complete the proof of (3) sketched in [4], and a literature search likewise came up empty. The proof we give here follows an alternate path, and may be of independent interest. As an application, we simplify and sharpen some of the missing mass concentration results given in [8, 7].
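The definition (4) can be made concrete by simulation; the sketch below draws an iid sample, computes the missing mass, and checks it against the exact mean EU_n = Σ_j p_j (1 − p_j)^n, on a small distribution of our own choosing.

```python
import random

def missing_mass(p, n, rng):
    """Sample n iid draws from the distribution p (a list of probabilities)
    and return the total probability of the unseen symbols."""
    seen = set(rng.choices(range(len(p)), weights=p, k=n))
    return sum(pj for j, pj in enumerate(p) if j not in seen)

rng = random.Random(0)
p = [0.5, 0.25, 0.125, 0.0625, 0.0625]
n = 10
# Since E xi_j = (1 - p_j)^n, the expected missing mass is sum_j p_j (1 - p_j)^n.
exact = sum(pj * (1 - pj) ** n for pj in p)
est = sum(missing_mass(p, n, rng) for _ in range(20000)) / 20000
assert abs(est - exact) < 0.01
```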

Main results
In [4, Lemma 1], Kearns and Saul define the function

g(t) = log( (1 − p)e^{−tp} + p e^{t(1−p)} ) / t².

A natural attempt to find the maximum of g leads one to the transcendental equation g′(t) = 0. In an inspired tour de force, Kearns and Saul were able to find that g′(t*) = 0 for

t* = 2 log((1 − p)/p).

This observation naturally suggests (i) arguing that t* is the unique zero of g′ and (ii) supplying (perhaps via second-order information) an argument for t* being a local maximum. In fact, all evidence points to g′(t) having the following properties: (*) t* is the unique zero of g′; (**) g′(t*) = 0; (***) g″(t*) < 0.

¹ The latter has, in turn, inspired a general thermodynamic approach to concentration [6].
Unfortunately, besides straightforwardly verifying (**), we were not able to formally establish (*) or (***), and we leave this as an intriguing open problem. Instead, in Theorem 4 we prove the Kearns-Saul inequality (3) via a rather different approach. Moreover, for p ≥ 1/2 and t ≥ 0, the right-hand side of (3) may be improved to exp[p(1 − p)t²/2]. This refinement, proved in Lemma 5, may be of independent interest. As an application, we recover the upper tail estimate on the missing mass in [7, Theorem 16]:

Theorem 1. For all n ∈ N and ε > 0,

P(U_n − EU_n ≥ ε) ≤ e^{−nε²}.

We also obtain the following lower tail estimate:

Theorem 2. For all n ∈ N and ε > 0,

P(EU_n − U_n ≥ ε) ≤ e^{−C_0 nε²/4},

where C_0 is the numerical constant defined in the lemma below. Since C_0/4 ≈ 1.92, Theorem 2 sharpens the estimate in [7, Theorem 10], where the constant in the exponent was e/2 ≈ 1.36. Our bounds are arguably simpler than those in [7] as they bypass the thermodynamic approach.
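The upper tail bound of Theorem 1 can be spot-checked by simulation; the four-point distribution and parameter choices below are our own illustrative assumptions.

```python
import math
import random

def missing_mass_sample(p, n, rng):
    """One realization of the missing mass U_n for the distribution p."""
    seen = set(rng.choices(range(len(p)), weights=p, k=n))
    return sum(pj for j, pj in enumerate(p) if j not in seen)

rng = random.Random(1)
p = [0.4, 0.3, 0.2, 0.1]
n, eps, trials = 20, 0.15, 20000
mean = sum(pj * (1 - pj) ** n for pj in p)        # E U_n
hits = sum(missing_mass_sample(p, n, rng) - mean >= eps for _ in range(trials))
# The empirical upper-tail frequency should respect the bound exp(-n eps^2).
assert hits / trials <= math.exp(-n * eps * eps)
```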

Proofs
The following well-known estimate is an immediate consequence of (2):

Lemma 3. For all p ∈ [0, 1] and t ∈ R,

(1 − p)e^{−tp} + p e^{t(1−p)} ≤ e^{t²/8}.

We proceed with a proof of the Kearns-Saul inequality.
Theorem 4. For all p ∈ [0, 1] and t ∈ R,

(1 − p)e^{−tp} + p e^{t(1−p)} ≤ exp( (1 − 2p)t² / (4 log((1 − p)/p)) ).  (6)

Proof. The cases p = 0, 1 are trivial. Since

lim_{p→1/2} (1 − 2p) / (4 log((1 − p)/p)) = 1/8,

for p = 1/2 the claim follows from Lemma 3. For p ≠ 1/2, we multiply both sides of (6) by e^{tp}, take logarithms, and put t = 2s log((1 − p)/p) to obtain the equivalent inequality

(2sp + s²(1 − 2p)) log((1 − p)/p) − log( 1 − p + p((1 − p)/p)^{2s} ) ≥ 0.  (7)

For s ∈ R, denote the left-hand side of (7) by h_s(p). A routine calculation, conveniently expressed in terms of µ = ((1 − p)/p)^{2s}, yields

h_s(1/2) = h′_s(1/2) = 0  (8)

and h″_s ≥ 0. As h″_s ≥ 0, we have that h_s is convex, and from (8) it follows that h_s(p) ≥ 0 for all s, p.
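The convexity argument can be probed numerically. The explicit formula for h_s used below is our own reconstruction of the substitution t = 2s log((1 − p)/p) and should be read as an assumption; note the equality cases at p = 1/2 and at s = 1 (i.e., t = t*).

```python
import math

def h(s, p):
    """Assumed explicit form of h_s(p): the nonnegative gap in inequality (7)."""
    L = math.log((1 - p) / p)
    return (2 * s * p + s * s * (1 - 2 * p)) * L \
        - math.log(1 - p + p * ((1 - p) / p) ** (2 * s))

# h_s should be nonnegative on a grid of (s, p) values ...
for s in [-2.0, -0.5, 0.5, 1.0, 2.0]:
    for p in [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]:
        assert h(s, p) >= -1e-12

# ... and vanish identically at s = 1, the equality case of (6).
assert abs(h(1.0, 0.3)) < 1e-9
```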
We will also need a refinement of (3):

Lemma 5. For all p ∈ [1/2, 1] and t ≥ 0,

(1 − p)e^{−tp} + p e^{t(1−p)} ≤ e^{p(1−p)t²/2}.  (9)

Remark: Since the right-hand side of (6) majorizes the right-hand side of (9) uniformly over [1/2, 1], the latter estimate is tighter.
Proof. The claim is equivalent to an inequality between the logarithms of the two sides of (9). For t ≥ 4, the relevant expression is obviously non-positive. Now the inequality clearly holds at p = 1 (as equality), and the p = 1/2 case is implied by Lemma 3. The claim now follows by convexity.
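A numerical spot-check of Lemma 5, and of the remark that (9) improves on (6) over [1/2, 1], can be sketched as follows (function names are ours).

```python
import math

def lhs(p, t):
    """Common left-hand side of (6) and (9)."""
    return (1 - p) * math.exp(-t * p) + p * math.exp(t * (1 - p))

def ks_rhs(p, t):
    """Right-hand side of the Kearns-Saul bound (6), p != 1/2."""
    return math.exp((1 - 2 * p) * t * t / (4 * math.log((1 - p) / p)))

def refined_rhs(p, t):
    """Right-hand side of the refinement (9)."""
    return math.exp(p * (1 - p) * t * t / 2)

for p in [0.55, 0.7, 0.9, 0.99]:
    for t in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]:
        assert lhs(p, t) <= refined_rhs(p, t) + 1e-12       # Lemma 5
        assert refined_rhs(p, t) <= ks_rhs(p, t) + 1e-12    # tighter than (6)
```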
Our numerical constants are defined in the following lemma, stated for x ∈ (0, 1/2); its elementary proof is omitted.
Proof. (a) We invoke Theorem 4 with p = q and t = λp. Collecting the p and q terms on opposite sides, it remains to prove an inequality of the form L ≤ 1 ≤ R. The second inequality is obvious from the Taylor expansion. To prove that L ≤ 1, we note first that L(q) ≥ L(1 − q) for q ∈ (0, 1/2); hence, it suffices to consider q ∈ (0, 1/2).

(b) The inequality is equivalent to the analogous estimate in which L is obtained from the left-hand side of (6) after replacing p by 1 − q and t by λp.
Our proof of Theorems 1 and 2 is facilitated by the following observation, also made in [7]. Although the random variables ξ_j whose weighted sum comprises the missing mass (4) are not independent, they are negatively associated [2]. A basic fact about negative association is that it is "at least as good as independence" as far as exponential concentration is concerned [7, Lemmas 5-8]:

Lemma 8. Let ξ′_j be independent random variables, where ξ′_j is distributed identically to ξ_j for all j ∈ N. Define also the "independent analogue" of U_n:

U′_n = Σ_{j∈N} p_j ξ′_j.

Then for all n ∈ N and ε > 0,

P(±(U_n − EU_n) ≥ ε) ≤ P(±(U′_n − EU′_n) ≥ ε).

Proof of Theorems 1 and 2. Observe that the random variables ξ′_j defined in Lemma 8 have a Bernoulli distribution with P(ξ′_j = 1) = q_j = (1 − p_j)^n, and put X_j = ξ′_j − Eξ′_j. Using standard exponential bounding with Markov's inequality, for λ > 0,

P(U′_n − EU′_n ≥ ε) ≤ e^{−λε} ∏_{j∈N} [ q_j e^{λ(p_j − p_j q_j)} + (1 − q_j)e^{−λp_j q_j} ] ≤ e^{−λε} ∏_{j∈N} exp(p_j λ²/4n) = exp(λ²/4n − λε),

where the second inequality invoked Lemma 7(a) and the final equality used Σ_j p_j = 1. Choosing λ = 2nε (the minimizer of λ²/4n − λε) yields exp(−nε²), and Theorem 1 follows via Lemma 8.
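The key per-coordinate MGF bound can be spot-checked numerically. The precise statement of Lemma 7(a) is not reproduced above; the inequality checked below is inferred from its use in the displayed chain, and the grid of parameters is our own choice.

```python
import math

def mgf_term(p, n, lam):
    """E exp(lam * p * (xi' - q)) for xi' ~ Bernoulli(q), q = (1 - p)^n."""
    q = (1 - p) ** n
    return q * math.exp(lam * p * (1 - q)) + (1 - q) * math.exp(-lam * p * q)

# Lemma 7(a), as used above, claims this is at most exp(p * lam^2 / (4 n)).
n = 50
for p in [0.001, 0.01, 0.05, 0.2, 0.5]:
    for lam in [0.5, 1.0, 5.0, 10.0]:   # includes lam = 2*n*eps with eps = 0.1
        assert mgf_term(p, n, lam) <= math.exp(p * lam * lam / (4 * n)) + 1e-12
```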
The proof of Theorem 2 is almost identical, except that X_j is replaced by −X_j and Lemma 7(b) is invoked instead of Lemma 7(a).