A Good-Turing estimator for feature allocation models

Feature allocation models generalize species sampling models by allowing every observation to belong to more than one species, now called features. Under the popular Bernoulli product model for feature allocation, given $n$ samples, we study the problem of estimating the missing mass $M_{n}$, namely the expected number hitherto unseen features that would be observed if one additional individual was sampled. This is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources allocated, and further samples can be only motivated by the possibility of recording new unobserved features. We introduce a simple, robust and theoretically sound nonparametric estimator $\hat{M}_{n}$ of $M_{n}$. $\hat{M}_{n}$ turns out to have the same analytic form of the popular Good-Turing estimator of the missing mass in species sampling models, with the difference that the two estimators have different ranges. We show that $\hat{M}_{n}$ admits a natural interpretation both as a jackknife estimator and as a nonparametric empirical Bayes estimator, we give provable guarantees for the performance of $\hat{M}_{n}$ in terms of minimax rate optimality, and we provide with an interesting connection between $\hat{M}_{n}$ and the Good-Turing estimator for species sampling. Finally, we derive non-asymptotic confidence intervals for $\hat{M}_{n}$, which are easily computable and do not rely on any asymptotic approximation. Our approach is illustrated with synthetic data and SNP data from the ENCODE sequencing genome project.


Introduction
Feature allocation models generalize classical species sampling models by allowing every observation to belong to more than one species, now called features.In particular, every observation is endowed with a (unknown) finite set of features selected from a (possibly infinite) collection of features (F j ) j≥1 .We conveniently represent each observation with a binary sequence, with entries indicating the presence (1) or absence (0) of each feature.Every feature F j is associated to an unknown probability p j , and each observation displays feature F j with probability p j .The Bernoulli product model, or binary independence model, is arguably the most popular feature allocation model.It models the i-th observation as a sequence Y i = (Y i,j ) j≥1 of independent Bernoulli random variables with unknown success probabilities (p j ) j≥1 , and that Y r is independent of Y s for any r = s.Bernoulli product models have found applications in several scientific disciplines.They have been applied extensively in ecology when animals are captured using traps and each observation is an incidence vector collecting the presence or absence of each species in the traps (e.g., Colwell et al. (2012) and Chao et al. (2014)).Besides ecology, Bernoulli product models found applications in the broad area of biosciences (e.g., Ionita-Laza et al. (2009), Zou et al. (2016)); in the analysis of choice behaviour arising from psychology and marketing (Görür et al. (2006)); in binary matrix factorization for dyadic data (Meeds et al. (2007)); in graphical models (e.g., Wood et al. (2006) and Wood & Griffiths (2007)); in the analysis of similarity judgement matrices (Navarro & Griffiths (2007)); in network data analysis (Miller et al. (2006)).
Let Y n = (Y 1 , . . ., Y n ) be n random samples collected under the Bernoulli product model with unknown feature probabilities (p j ) j≥1 , and let X n,j = 1≤i≤n Y i,j be the number of times that feature F j has been observed in Y n .That is, X n,j is a Binomial random variable with parameters (n, p j ) for any j ≥ 1.In this paper we consider the problem of estimating the conditional expected number, given the sample Y n , of hitherto unseen features that would be observed if one additional sample Y n+1 was collected, i.e.
where I denotes the indicator function.For easiness of notation, in the rest of the paper we will not highlight the dependence on Y n and (p j ) j≥1 , and simply write M n instead of M n (Y n , (p j ) j≥1 ).The statistic M n is typically referred to as the missing mass, namely the sum of the probability masses of unobserved features in the first initial n samples.
Interest in estimating the missing mass is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources, and further draws can be only motivated by the possibility of recording new unobserved features.In genetics, for instance, the ambitious prospect of growing databases to encompass hundreds of thousands of human genomes, makes important to quantify the power of large sequencing projects to discover new genetic variants (Auton et al. (2015)).An accurate estimate of M n will enable better study design and quantitative evaluation of the potential and limitations of these datasets.Indeed one can fix a threshold such that the sequencing procedure takes place until the estimate of M n becomes for the first time smaller than the threshold.This introduces a criterion for evaluating the effectiveness of further sampling, providing a roadmap for large-scale sequencing projects.A Bayesian parametric estimator of (1) has been introduced in Ionita-Laza et al. (2009), and it relies on a Beta prior distribution for the unknown probabilities p j 's.A limitation of this approach is that it requires parametric forms for the distribution of variant frequencies, which requires some model of demography and selection.For instance, the Beta prior is a reasonable assumption for neutrally evolving variants but may not be appropriate for deleterious mutations.To overcome this drawback, a nonparametric approach to estimate (1) has been proposed in Zou et al. (2016).This is a purely algorithmic type approach, whose core is a linear programming based algorithm.
In this work, we introduce a simple, robust and theoretically sound estimator Mn of the missing mass.As in Zou et al. (2016), our approach is nonparametric in the sense that it does not rely on any distributional assumption on the unknown p j 's.The proposed estimator Mn turns out to have the same analytic form of the popular Good-Turing estimator of the missing mass under the multinomial model for species sampling (e.g., Good (1953) and Robbins (1968)), with the difference that the two estimators have different ranges (supports).For this reason we refer to the estimator Mn as the Good-Turing estimator for feature allocation models.We show that Mn admits a natural interpretation both as a jackknife (resampling) estimator (Quenouille (1956) and Tukey (1958)) and as a nonparametric empirical Bayes estimator in the sense of Efron and Morris (1973).Theoretical properties of the estimator Mn are investigated.Specifically, we first provide a lower bound for the minimax risk under a squared loss function and then we show that the mean squared error of the proposed estimator achieves the optimal minimax rate for the estimation of M n .This asymptotic analysis provides with an interesting connection between the Good-Turing estimators for species sampling models and for feature allocation models in terms of the limiting Poisson regime of the Binomial distribution.Finally, we derive a non-asymptotic confidence interval for Mn by means of novel Bernstein type concentration inequalities which are of separate interest; the confidence interval is easy to compute and it does not rely on any asymptotic assumption or any parametric constraint on the unknown p j 's.We illustrate the proposed methodology by the analysis of various synthetic data and SNP datasets from the ENCODE (http://www.hapmap.org/downloads/encode1.html.en)sequencing genome project.
The paper is structured as follows.In Section 2 we introduce the Good-Turing estimator Mn for the missing mass M n , and we give provable guarantees for its performance.In Subsection 3 we derive a non-asymptotic confidence interval for Mn , in Subsection 4 we discuss the use of Mn for designing cost-effective feature inventories, and in Section 5 we present a numerical illustration of Mn with synthetic and real data.Some concluding remarks on future works are discussed in Section 6. Proofs are deferred to Appendix A.

A Good-Turing estimator for M n
Let Y n = (Y 1 , . . ., Y n ) be n random samples under the Bernoulli product model with unknown feature probabilities (p j ) j≥1 such that j≥1 p j < +∞.That is, Bernoulli random variables, with p j being the success probability of Y i,j for any i = 1, . . ., n, and Y r is independent of Y s for any r = s.The assumption j≥1 p j < +∞ implies that each Y i displays finitely many features almost surely; indeed, by monotone convergence, we have j≥1 p j < +∞ if and only if E( j≥1 Y i,j ) < +∞, which in turns implies that j≥1 Y i,j < +∞ almost surely.If X n,j denotes the number of times that feature F j has been observed in the sample Y n , then is the number of features with frequency r in Y n .We denote by K n the total number of features in the random sample Y n , i.e.K n = r≥1 K n,r .For any two sequences (a n ) n≥1 and (b n ) n≥1 , write a n ≃ b n to mean that a n /b n → 1 as n → +∞.An intuitive estimator of M n can be deduced from a comparison between expectations of M n and K n,1 .Specifically, since X n,j is a Binomial random variable with parameter (n, p j ), we can write and set Mn = K n,1 n as an estimator of M n , for any n ≥ 1.The estimator Mn is nonparametric, in the sense that it does not rely on any distributional assumptions on the feature probabilities (p j ) j≥1 .
The estimator Mn turns out to have the same analytic form of the popular Good-Turing estimator of the missing mass for species sampling models.The Good-Turing estimator first appeared in Good (1953) as a nonparametric empirical Bayes estimator under the classical multinomial model for species sampling, i.e., (Y 1 , . . ., Y n ) are n random samples from a population of individuals belonging to a (possibly infinite) collection of species (S j ) j≥1 with unknown proportions (p j ) j≥1 such that j≥1 p j = 1.Under the multinomial model every observation is endowed with one species selected from (S j ) j≥1 , and hence K n,1 ∈ {0, 1, . . ., n}.Therefore, although the estimator Mn has the same analytic form of the Good-Turing estimator, the two estimators have different ranges (supports): while the Good-Turing estimator for species sampling models takes values in [0, 1], the Good-Turing estimator for feature allocation models takes positive (finite) values.Hereafter we show that: i) Mn admits interpretations both as a jackknife (resampling) estimator and as a nonparametric empirical Bayes estimator in the sense of Efron and Morris (1973); ii) the mean squared error of Mn converges to zero at the best possible rate for the estimation of M n ; iii) Mn is linked to the Good-Turing estimator for species sampling models in terms of the limiting Poisson regime of the Binomial distribution.

Interpretations of Mn
The estimator Mn admits a natural interpretation as a jackknife estimator in the sense of Quenouille (1956) and Tukey (1958).See also the monograph by Efron (1987) and references therein for a comprehensive account on jackknife (resampling) estimators.Let K n (j) be the number of distinct features observed in the sample Y n after the removal of the j-th sample Y j .It is easy to show that the missing mass M n−1 equals the posterior expected value of K n − K n−1 (j), given all the samples but Y j , i.e., with an unbiased estimator of M n−1 .The corresponding jackknife estimator is then 1≤j≤n (K n − K n−1 (j))/n which equals Mn−1 = K n,1 /n; indeed, for any fixed j, K n − K n−1 (j) coincides with the number of features displayed only by the j-th sample Y j .
The estimator Mn also has an interpretation as a nonparametric empirical Bayes estimator in the sense of Efron and Morris (1973).A Bayesian nonparametric approach to estimate M n relies on the specification of a prior distribution for the unknown feature probabilities (p j ) j≥1 .We consider the three parameters Beta process prior (Teh and Görür (2009) and James (2017)).This is a generalization of the Beta process prior of Hjort (1990), and it is defined as the distribution of a completely random measure (Daley and Vere-Jones (2008)) with intensity measure ν(ds, df ) = ρ(s)dsI(0, 1)(f )df where , for α ∈ (0, 1), β > −α and ϑ > 0, with Γ(•) being the Gamma function.Under the three parameters Beta process prior, it is shown that an estimator of M n with respect to a squared loss function is We refer to the Appendix A.1 for further details on the three parameters Beta process prior, and on the derivation of (2 is a consistent estimator of ϑ for any α ∈ (0, 1) and β > −α, and plugging the estimator θn into (2) we obtain Mn ( θn , α, β) = K n,1 n = Mn .
In other terms, the proposed estimator Mn coincides with the nonparametric empirical Bayes estimator Mn ( θn , α, β) of the missing mass.Mn ( θn , α, β) is obtained by assigning the three parameters Beta process prior to (p j ) j≥1 and then estimating its mass parameter ϑ with an appropriately chosen consistent estimator.

Optimality of Mn
We start by showing that the mean squared error (L 2 -risk) of the estimator Mn goes to zero at the best possible rate.This legitimates the use of Mn as an estimator of M n .For any 0 < W < +∞ let P W = {(p j ) j≥1 : p j ∈ (0, 1) and j≥1 p j ≤ W } be a class of features probabilities.Also, let Tn denote a whichever estimator of the missing mass M n based on Y n , that is a map on the n-fold and taking values in R + .For a specific collection of feature probabilities (p j ) j≥1 in the class P W , the L 2 -risk of the estimator Tn is defined as follows and the minimax risk is i.e. the infimum, with respect to the set of possible estimators Tn of M n , of the worst-case risk over the class P W .The next theorem provides with an upper bound for R n ( Mn ; (p j ) j≥1 ) and a lower bound for R * n .
Theorem 2.1 Let (p j ) j≥1 be feature probabilities in the class P W .Then, i) for any n ≥ 1 sup showing that R * W/n.In other terms the mean squared error of Mn goes to zero at the best possible rate.Theorem 2.1 then shows that the estimator Mn is essentially rate optimal due to the matching minimax lower bound in the class P W of admissible probabilities' masses up to a constant factor.
In Theorem 2.1 we have considered the class P W of feature probabilities having total mass bounded by W .Here we briefly remark the crucial role played by this assumption in order to be able to provide asymptotic results.First, let us notice from Theorem 2.1 that, for a fixed value of the sample size n, the minimax rate increases linearly in W .This implies that, for a fixed n, the estimation problem can be made as hard as desired if no bounds are imposed on the sum of the possible vectors of probabilities (p j ) j≥1 .Since at first glance this result can seem slightly counter intuitive, let us consider an illustrative example which should clarify why the estimation difficulty of the problem is proportional to W .
To simplify the argument, let us suppose that W is a strictly positive integer and consider W frequency vectors (p j ≤ 1 for all r ∈ {0, . . ., W − 1}.Now, let us construct another vector of frequencies P = (p j ) j≥1 ∈ P W obtained by setting p qW +r = p (r) q for q ≥ 0 and 0 ≤ r ≤ W − 1.The resulting vector P is the concatenation of (p such that in the first W entries of P we put the first elements (p ) and so on.Furthermore, any observation (Y i,j ) j≥1 sampled from the Bernoulli product model with frequencies given by this P can be rewritten, following a similar construction, as the concatenation of W observations Y where n ).From this construction, we can see that by trying to estimate the missing mass on the left hand side, we are basically trying to estimate a sum of W unrelated quantities, which explains why the error is linear in W .
A final remark is that in order for practitioners to use Theorem 2.1 and evaluate the performances of the Good-Turing estimator compared to the minimax rate, we would need to know an upper bound W , i.e. an upper bound on the total mass of the unknown vector of probabilities (p j ) j≥1 .In real life applications, W is unlikely to be known.However, we can easily estimate it.Specifically, since the total mass W is the expected number of features displayed per observation, we can use as a consistent estimator for W the estimator j≥1 X n,j /n.

Connection to the Good Turing estimator for species sampling models
We relate the Good-Turing estimator for feature allocation models with the classical Good-Turing estimator for species sampling, by relying on the limiting Poisson approximation of Binomial random variables.In particular, Theorem 2.1 states that, in the feature allocation models, the Good-Turing estimator achieves a risk of order R n ≍ W n , while it is known from Rajaraman et al. ( 2017) that the risk Rn of the Good Turing estimator in the species sampling case asymptotically behaves as Rn ≍ 1/n.In order to compare the two models, we will consider the limiting scenario when W → 0.
Let us consider a vector of feature frequencies (p j ) j≥1 , such that j p j = W = λ n for some positive value λ and denote pj = pj j≥1 pj the normalized probability vector.Applying the large n Poisson approximation of the binomial distribution, it follows that each X n,j is now approximately distributed according to a Poisson distribution with mean λp j .Therefore, the approximated model for large n boils down to sample first an effective size ñ = j≥1 X n,j , distributed according to a Poisson with parameter λ, and, conditionally on it, sample ñ independent and identically distributed observations ( Ỹ1 , . . ., Ỹñ ) from probability vector (p j ) j≥1 .Hence it is equivalent to a species sampling model where the sample size ñ is random.Denote Mñ the missing mass in the corresponding species sampling model.Now, by noticing that . We conclude by remarking that this construction also provides a justification for the Good-Turing in the feature allocation models.Indeed, in feature allocation models, we want to estimate M n = W Mñ , where both W and Mñ are unknown and need to be estimated.In order to estimate Mñ , we can use the Good-Turing estimator for species models, which here turns to be .
However, we also need to estimate W = j p j in order to use M GT ñ to estimate M n .As pointed out at the end of the previous subsection, a consistent estimator of W is Ŵ = j Xn,j n .Using this estimator for W and M GT ñ for Mñ , we obtain the following estimator Mn of the missing mass M n , which turns out to be exactly the Good-Turing estimator for feature allocation models.

A confidence interval for Mn
In this section, we consider the problem of uncertainty quantification of the proposed estimator of the missing mass.We exploit tools from concentration inequalities for sum of independent random variables (Boucheron et al. (2013)) to introduce a non-asymptotic level-δ confidence interval for Mn , for any δ ∈ (0, 1).
Theorem 3.1 Let (p j ) j≥1 be any sequence of feature probabilities s.t.j≥1 p j < +∞, and set c δ (x) = ( log(1/δ)/2 + 7 log(1/δ)/6 + x) 2 for any nonnegative integer x and δ ∈ (0, 1).Then, with where See Appendix A.3 for the proof of Theorem 3.1.For any δ ∈ (0, 1), Theorem 3.1 provides with a level 1 − δ confidence interval for Mn that holds for every sample size n and for every possible values of the feature probabilities (p j ) j≥1 .Indeed, this is derived by applying finite sample concentration inequalities, without using any asymptotic approximation, and it does not rely on any distributional assumption on the feature probabilities (p j ) j≥1 .Note that the proposed confidence interval can be easily evaluated without resorting to any simulation based strategy.It is enough to count the total number of features and number of features with frequency one and two in the observable sample, i.e.K n , K n,1 and K n,2 , and plug these quantities into U δ (K n ) and L δ (K n,1 , K n,2 ) to be added and subtracted to Mn .

A Stopping Rule for the Discovery Process
As we recalled in the Introduction, interest in estimating missing mass M n is motivated by the design of feature inventories that are cost-effective in terms of the number of features discovered and the amount of resources allocated in the discovery process.This is a fundamental issue in numerous scientific disciplines where the sampling procedure is expensive, in terms of time and/or financial resources.Feature inventories must then be designed in such a way to redirect the search of new features to more productive sites, methods or time periods whenever the sampling effort becomes unprofitable.In such a context, the estimator Mn is the key ingredient for defining an adaptive sequential approach in terms a stopping rule for the discovery process.Specifically, let h be an utility function, defined on the integers, such that h(k) is the gain of observing k features; assume that h in non-decreasing and concave.If c is the cost associated with each sampling step, and K n is the number of features in the sample Y n , then the stopping rule may be defined as This brief discussion highlights how to exploit the estimator Mn within the context of designing costeffective feature inventories.In particular, it gives rises to the challenging problem of finding the time n * at which it is optimal to conclude the discovery process, i.e. the time that maximizes the expected payoff E(h(K n ) − cn).

Numerical Illustration
We illustrate the experimental performance of the estimator Mn by the analysis of various synthetic data, and by the analysis SNP datasets from a genome project.Let N be the total number of possible features in the whole population of individuals.With regards to synthetic data, we present a numerical illustration by setting p j = 1/j s for j ≤ N and p j = 0 for j > N , with s being a nonnegative parameter.Note that the feature probability masses p j 's correspond to the unnormalized masses of the ubiquitous Zipf distribution.The parameter s controls how the total mass is spread among the features: the lager s, the smaller is the number of features with high probability.In other terms the larger s, the larger is the number of features with small frequencies (rare features).Hereafter we set N = 10 5 , and we consider the following values for the discount parameter s: 0.6, 0.8, 1.0, 1.2, 1.4, 1.6.For each of these values of the parameter s, we take 100 samples of size n = 50, 250, 1000 from the population (p j ) 1≤j≤N .Table 1 displays the true value of the missing mass M n , the estimated value Mn and the corresponding 95% confidence interval (CI).All the experiments are averaged over the 100 samples.Results in Table 1 show that Mn provides good estimates of true value of M n in all the scenarios that we considered.In addition, confidence intervals are quite narrow around the estimator and contain always the true value of the missing mass.
We conclude with an application to SNP datasets from the ENCODE sequencing genome project (http://www.hapmap.org/downloads/encode1.html.en).The same project was analyzed in Ionita-Laza et al.  2009), showing that the African population is the most diverse.We also consider increasing the sample size to n = 40 of the sequenced individuals for each dataset.Table 2 shows how the missing mass decreases with respect to the case n = 20.This highlights the saturation effect in discovering new variants.The discovery process is very efficient in the beginning, but after many individuals are sequenced, each additional individual contributes less and less to the pool of the newly discovered variants.

Concluding Remarks
We introduced a nonparametric estimator Mn of the missing mass M n for feature allocation, we obtained provable guarantees for its performance, and we derived corresponding non-asymptotic confidence intervals.In particular, we proved that the mean squared error of Mn goes to zero at the best possible rate as the sample size n goes to infinity.Our approach is simple, intuitive and easily implementable from practitioners.It distinguishes from the approaches of of Ionita-Laza et al. (2009) and Zou et al. (2016) for being the first nonparametric statistical approach for estimating the missing mass in feature allocation models.In particular, differently from Ionita-Laza et al. ( 2009) and Zou et al. (2016), we associated to the proposed estimator an exact (non-asymptotic) quantification of its uncertainty.The present paper paves the way to new directions in feature allocation problems.
In particular, three promising directions are: i) estimating the conditional expected number, given an observable sample Y n = (Y 1 , . . ., Y n ), of features with frequency r > 0 in Y n that would be observed in one additional sample; ii) solving the optimal stopping problem discussed in Section 4; iii) estimating the conditional expected number, given an observable sample Y n , of new features that would be observed in m > 1 additional (unobservable) samples.Work on this is ongoing.

A Appendix -Proofs
A.1 Nonparametric empirical Bayes In the present section we derive Mn as a nonparametric empirical Bayes estimator in the sense of Efron and Morris (1973).Recall that the label of feature j is denoted by F j , in the sequel we further assume that F j ∈ (0, 1).Remind also that we have associated the i-th individual with a sequence Y i = (Y i,j ) j≥1 of independent Bernoulli random variables with parameter p j , where Y i,j = 1 (resp.Y i,j = 0) indicates the presence (resp.absence) of feature j in the i-th individual.The Bayesian nonparametric approach to estimate the missing mass M n requires a prior specification for the p j 's: we resort to the three-parameter Beta process introduced by Teh and Görür (2009).Such a prior distribution is defined as the distribution of a Completely Random Measure (CRM) μ (see e.g.Daley and Vere-Jones ( 2008)), with Lévy intensity ν(ds, df ) = ρ(s)dsI(0, 1)(f )df , where ρ(s Let Y n be a random sample of size n and F * 1 , . . ., F * Kn be the K n distinct features observed in it, the posterior estimate of M n , under a square loss function, equals We can characterize the posterior distribution of μ in (6) resorting to (James, 2017, Theorem 3.1 (ii)), which gives where the J ℓ 's are jumps having a density on the positive real line and μn is a CRM having updated Lévy intensity given by ν n (ds df ) = (1 − s) n ρ(s)dsI(0, 1)(f )df .Hence the Bayesian estimator in ( 6) boils down to where the second equality follows from the fact that the base measure of μn is diffuse and the last equality follows from the availability of the Laplace functional and from standard calculations.We are now going to show that θn = (K n,1 we will conclude that the empirical Bayes estimator Mn ( θn , α, β) coincides with Mn .The consistency of θn can be established resorting to the regular variation theory by Karlin (1967); Gnedin et al. (2007).More specifically we are able to show that the tail integral of ρ(s) is a regularly varying function, having regular variation index α, indeed Hence we just proved that ρ(x) ≃ x −α ϑ/(Γ(1 − α)α) as n → +∞, and therefore an application of (Gnedin et al. , 2007, Corollary 21) gives the asymptotic relation Γ(β+α+n)n → ϑ and θn is a consistent estimator for ϑ.Plugging θn into the posterior estimate Mn (ϑ, α, β), we obtain Mn ( θn , α, β) = Kn,1 n = Mn .

A.2 Proof of Theorem 2.1
We prove part i) (upper bound) and ii) (lower bound) of Theorem 2.1 separately.
We first focus on the upper bound for the L 2 -risk of Mn .In the sequel we denote by M n,1 the total mass of features with frequency 1 in a sample of size n, formally M n,1 = j≥1 p j I{X n,j = 1}, we observe that the expectation of M n,1 equals and obviously E[M n,1 ] ≤ W , for any sequence (p j ) j≥1 belonging to the class P W .
In order to bound the L 2 -risk of the estimator Mn from above for any (p j ) j≥1 ∈ P W , we remind that R n ( Mn ; (p j ) j≥1 ) = E( Mn − M n ) 2 may be seen as the sum of the squared bias and the variance.The upper bound for the bias can be easily proved as follows As for the variance bound, we note that where we have exploited the independence of the random variables X n,j .We now observe that the events {X n,j = 1} and {X n,j = 0} are incompatible, hence we get Putting together the bound on the bias and the variance we immediately obtain (4).
We now focus on the proof of the lower bound of the minimax risk R * n , which has been defined in (3).Let us first notice that, since (p j ) j≥1 belongs to P W , then M n ≤ W almost surely, and hence (M n − Tn ) 2 ≥ (M n − min( Tn , W )) 2 .Therefore, we can restrict the minimum in (3) over all the possible estimators less than W , hence the problem boils down to determine an estimate from below for inf Tn≤W sup (pj )j ∈PW E[(M n − Tn ) 2 ].The proof of this estimate relies on a Bayesian parametric approach based on the randomization of the probability masses p j 's.More specifically we consider the randomized vector P = (p 1 , • • • , p m , 0, 0, . ..)where m is a Poisson random variable with mean λ = nW and p j are independent beta random variables with parameters a = 1 and b = 2n − 1.We denote by P F := m j=1 p j the total mass of the random sequence (p j ) j≥1 ; we further define X := (X n,j ) j≥1 and it is worth noticing that, conditionally on P, the X n,j 's are independent, having a Binomial distribution with parameters (n, p j ).For generic random variables U, V we will write E V (resp.E V |U ) to mean that the expectation is made with respect to V (resp.to V given U ).It is easily shown that inf Tn≤W sup We bound separately the two terms on the r.h.s. of ( 7).We start by deriving an upper bound for the term on the right.Using Fubini's Theorem, it comes that which leads to the more convenient form In order to provide an upper bound for the probability of the event {P F > t}, we use a Markov inequality with third centralized moment.We first evaluate the mean of the total mass, Now, denoting µ 3 (Y ) the centralized third moment of the r.v.Y , the law of total cumulant implies that m is Poisson distributed, so its variance and third centralized moment are equal to its mean λ = nW .
Further noticing that 2n−1 2n+1 < 1, we upper bound the third centralized moment by E[µ 3 (P F |m)] + W 2n 2 .Conditioning on m, the total mass P F is a sum of independent and identically distributed random variables so µ 3 (P F |m) = mµ 3 (p 1 ).Since p 1 is Beta distributed with parameters a, and b, its third centralized moment is given by 4n 3 (2n + 1)(n + 1) .
Using the fact that (2n−1)(n−1) (2n+1)(n+1) ≤ 1 ≤ 2, we successively find that E[µ 3 (P F |m)] ≤ W 2n 2 and finally µ 3 (P F ) ≤ W n 2 , obtaining the inequality we need to bound P(P F > t).Indeed, Markov's inequality implies that Now we can use the last inequality in (8) and find We now provide a lower bound for the first term on the r.h.s. of ( 7) inf Let m ∈ N, with the convention that the risk is 0 when m = 0. We will now lower bound inside the first expectation in the previous expression, using the fact that given X and m, we know the posterior of P F . inf where we used the fact that p j |X n,j = x is a beta random variable Beta(x + a, n − where we observed that (1 − p j ) ∼ Beta(b, a) to obtain (10).Therefore we can derive the bound inf Together with (9), the previous inequality implies that inf Plugging in (7) the previous results, we obtain which concludes the proof.Notice that here we are only interested in showing that the minimax rate is or order W n .We could have obtained sharper bounds (with a better constant) by for instance taking

A.3 Proof of Theorem 3.1 and related results
In this section we focus on Theorem 3.1.First of all we introduce some preliminary results which are useful to prove the theorem and are interesting in their own right.In the next proposition we derive concentration inequalities for both K n,r and K n .
Proposition A.1 For any n ≥ 1, r ≥ 1 and δ ∈ (0, 1) we have and Proof.Let us focus on the derivation of ( 11).First of all we will determine a variance bound for the random variable K n,r , which will be employed to derive a corresponding bound on the log-Laplace transform of (K n,r − E(K n,r )).Then the result will follow by a suitable application of the Bernstein inequality.
Thanks to the independence of the random variables X n,j 's, the variance Var(K n,r ) may be bounded as follows We now establish the bound on the log-Laplace.Since K n,r is the sum of independent random variables, for any t ∈ R we can write: ≤ φ(|t|)E(K n,r ) where we have implicitly defined the function φ(t) = e t − 1 − t and we have applied the Bennett inequality.The validity of the previous bound on the log-Laplace implies that for any x ≥ 0 thanks to the Bernstein inequality.If one defines s := x 2 /(2(E(K n,r ) + x/3)) the previous inequality boils down to the following one The next Remark is a simple consequence of Proposition A.1 and will be used in the derivation function of n.We are now ready to prove that Mn − M n is sub-Gamma on its tails.
Finally we prove Theorem 3.1.
Proof.[Proof of Theorem 3.1:] The result is a consequence of the two bounds that we are now going to prove separately: P M n ≥ Mn − mn − 2v + n log(1/δ) − log(1/δ)/n ≥ 1 − 4δ ( 18) where mn , v− n and v+ n are suitable quantities that will be defined in the proof.First we discuss how to determine (18), we remind that D n = Mn − M n is a sub-Gamma random variable, as shown in Proposition A.
) See Appendix A.2 for the proof of Theorem 2.1.For any two sequences (a n ) n≥1 and (b n ) n≥1 , write a n b n to mean that there exists a constant C > 0 such that a n ≤ Cb n for all n and a n ≍ b n to mean that we have both a n b n and b n a n .Part i) of Theorem 2.1 shows that sup (pj ) j≥1 ∈PW R n ( Mn ; (p j ) j≥1 ) W/n, namely as n goes to infinity the mean squared error of Mn goes to zero at rate W/n.Part ii) of Theorem 2.1 provides with a lower bound for the minimax risk, ) j≥1 , r ∈ {0, . . ., W − 1}, each of them sampled from a binomial product model with the corresponding frequencies (p (r) j ) j≥1 .Hence, the missing mass for a sample of observations Y n sampled from P can be related to the missing masses of each of its W subcomponents, by which in turn implies the validity of (11), as a consequence of the change of variable 2e −s = δ and the elementary inequality √ a + b ≤ √ a + √ b, valid for any positive real numbers a, b.The proof of (12) goes along similar lines, having observed that Var(

Table 1 :
Synthetic data: missing mass estimation with p j = 1/j s for j ≤ 10 5 and p j = 0 for j > 10 5 , and different values of s.

Table 2 :
Real data: missing mass estimation for the ENCODE sequencing genome project.
(2009).Ten 500-kb regions of the genome were sequenced in 209 unrelated DNA samples: 60 Yoruba (YRI), 60 CEPH European (CEPH), 45 Han Chinese (CHB), and 44 Japanese (JPT).These regions were chosen to be representative of the genome in general, including various chromosomes, recombination rates, gene density, and values of nontranscribed conservation with mouse.To make results comparable across the 4 populations (YRI, CEPH, CHB, and JPT), we consider only n = 20 of the sequenced individuals for each dataset.Table2displays the estimated values and 95% confidence interval (CI).For samples of 20 individuals, the YRI population displays the highest estimate of the missing mass.This agrees with results in Ionita-Laza et al. (