Bayesian nonparametric disclosure risk assessment

Abstract: Any decision about the release of microdata for public use is supported by the estimation of measures of disclosure risk, the most popular being the number τ1 of sample uniques that are also population uniques. In this context, parametric and nonparametric partition-based models have been shown to have: i) the strength of leading to estimators of τ1 with desirable features, including ease of implementation, computational efficiency and scalability to massive data; ii) the weakness of producing underestimates of τ1 in realistic scenarios, with the underestimation getting worse as the tail behaviour of the empirical distribution of microdata gets heavier. To fix this underestimation phenomenon, we propose a Bayesian nonparametric partition-based model that can be tuned to the tail behaviour of the empirical distribution of microdata. Our model relies on the Pitman-Yor process prior, and it leads to a novel estimator of τ1 with all the desirable features of partition-based estimators that, in addition, makes it possible to reduce underestimation by tuning a "discount" parameter. We show the effectiveness of our estimator through applications to synthetic data and real data.


Introduction
Releasing microdata for public use requires a careful assessment of the risk of disclosure (Willenborg and Waal [26]). Consider a microdata sample (X_1, ..., X_n) of units (individuals) from a finite population of size N ≥ n, such that each X_i is a record containing identifying and sensitive information for the i-th unit. Identifying information consists of categorical variables which might match known units of the population. A threat of disclosure results from the possibility that an intruder, who could have personal or public information about the population (e.g. knowing who is included in the sample or using other available datasets), might succeed in identifying an individual through such a match, and hence be able to disclose sensitive information. To quantify disclosure risk, microdata units are partitioned according to a categorical variable that is defined by cross-classifying all identifying variables. That is, the units X_i are partitioned into non-empty cells, with each cell containing individuals with the same combination of values of identifying variables. A risk of disclosure arises from cells in which both sample and population frequencies are small, since the rarer the category the more likely the match is correct. Of special interest are cells with frequency 1 (uniques) since, assuming no errors in matching processes or data sources, for these cells the match is guaranteed to be correct (Bethlehem et al. [2], Skinner et al. [24]). This has motivated inferences on measures of disclosure risk that are functionals of the number of uniques, the most popular being the number τ1 of sample uniques that are also population uniques. Once an estimate τ̂1 of τ1 is obtained, a criterion to understand whether the data would incur an excessive risk in being published is to set a relative risk threshold C and check that the proportion of τ̂1 with respect to the sample size does not exceed it, i.e. τ̂1/n ≤ C (Bethlehem et al. [2]).
If this is not the case, more care must be used before releasing data, possibly applying other privacy preserving methods.
Over the past three decades, a wide range of parametric and nonparametric approaches, both classical (frequentist) and Bayesian, have been proposed to estimate τ1. One may identify two main streams in the disclosure risk literature: i) modeling the sole microdata partition by parametric and nonparametric partition-based models (Bethlehem et al. [2], Skinner et al. [24], Fienberg and Makov [11], Samuels [21], Skinner and Elliot [23], Camerlenghi et al. [6]); ii) modeling both the microdata partition and associations among identifying variables by parametric and semiparametric latent class models (Reiter [19], Skinner and Shlomo [25], Manrique-Vallier and Reiter [13,14], Carota et al. [4,5]). All these approaches have been applied to synthetic data and real data, showing the effectiveness of τ1 as a sensible global measure for assessing the risk of disclosure. Partition-based models lead to estimators that are simple, linear in the sampling information, computationally efficient and scalable to massive data sets, though they typically show underestimation when the sampling fraction n/N becomes smaller than a certain threshold (Camerlenghi et al. [6]). Latent class models typically have a better empirical performance than partition-based models, especially for small sampling fractions, though this is achieved at the cost of an increased computational effort, owing to the need for Markov chain Monte Carlo methods for posterior approximation (Reiter [19], Manrique-Vallier and Reiter [13]).
In this paper, we contribute to the partition-based literature from a Bayesian nonparametric perspective. Bayesian nonparametric ideas for estimating τ1 date back to the seminal work of Samuels [21], where the Dirichlet process (Ferguson [10]) was applied as a prior model for the microdata partition. This approach leads to an estimator of τ1 which is easy to implement, computationally efficient, and scalable to massive data. Despite these desirable features, empirical analyses in Samuels [21] show that such an approach underestimates τ1 in many realistic scenarios, the issue being related to the tail behaviour of the empirical distribution of microdata. That is, the heavier the tail the worse the underestimation of τ1. As heavy-tail scenarios occur when the number of sample uniques is large with respect to the population size, this phenomenon is a critical concern in disclosure risk assessment. A simulation study in Figure 1 shows analogous estimation issues for the most common partition-based estimators of τ1 in such a heavy-tailed setting. Our experiments use synthetic microdata from a power-law distribution of exponent σ > 1, with samples being 10% of a population of size 10^6, and results averaged over 1000 iterations. It emerges that the smaller σ, namely the heavier the tail, the worse the underestimation of Bayesian parametric estimators (Bethlehem et al. [2], Skinner et al. [24]), and the worse the overestimation of a nonparametric empirical Bayes estimator (Camerlenghi et al. [6]).
Figure 1: Empirical performance, with respect to the true τ1, of estimators τ̂1: nonparametric Bayes (nb) of Samuels [21], nonparametric empirical Bayes (neb) of Camerlenghi et al. [6], parametric Bayes (pb-1) of Bethlehem et al. [2], parametric Bayes (pb-2) of Skinner et al. [24].

To overcome the underestimation phenomenon of Samuels' approach, we propose a Bayesian nonparametric partition-based model that can be tuned to the tail behaviour of the empirical distribution of microdata. In particular, as a prior model for the microdata partition, we assume the Pitman-Yor process (Perman et al. [15], Pitman [16], Pitman and Yor [18]). The Pitman-Yor process prior generalizes the Dirichlet process prior by means of an additional "discount" parameter that controls the tail behaviour of the prior, ranging from geometric tails to heavy power-law tails (Pitman and Yor [18]). Under the Pitman-Yor process prior, we present a simple characterization of the posterior distribution of τ1, given the observed microdata, and we propose the posterior mean as a Bayesian nonparametric estimator of τ1. Such an estimator has the same desirable features as Samuels' estimator and, in addition, makes it possible to reduce the underestimation of τ1 by tuning the "discount" parameter with respect to observable microdata. Our approach stands out as the first partition-based approach to provide a closed-form posterior distribution of τ1, which makes it straightforward to quantify the uncertainty of our Bayesian procedure through credible intervals. We investigate the empirical performance of our approach on synthetic data and real data from the 2018 American Community Survey, showing its effectiveness in reducing the underestimation phenomenon of Samuels' approach.
The paper is structured as follows. In Section 2 we introduce the Pitman-Yor process prior and its sampling structure, and present our Bayesian nonparametric approach to infer τ 1 . Section 3 contains an illustration of the proposed approach through synthetic data and real data. In Section 4 we conclude by discussing our results and directions for future work. Proofs are deferred to the Appendix.

Bayesian nonparametric inference for τ 1
We consider a super-population of units belonging to an (ideally) infinite number of distinct symbols (z_j)_{j≥1}, taking values in a measurable space Z, with unknown proportions (p_j)_{j≥1} such that \sum_{j\ge 1} p_j = 1. The partition of microdata into non-empty cells, both at the sample and the population level, is modeled as a random partition induced by sampling from the unknown discrete distribution P = \sum_{j\ge 1} p_j \delta_{z_j}, where each symbol z_j ∈ Z takes the interpretation of a distinct combination of values of identifying variables. That is, a population of N ≥ 1 microdata units is assumed to be a random sample (X_1, ..., X_N) from P, of which the first n < N elements (X_1, ..., X_n) are observable. These samples induce a random partition at the population level consisting of K_N cells with frequencies (N_{1,N}, ..., N_{K_N,N}), and a random partition at the sample level consisting of K_n cells with frequencies (N_{1,n}, ..., N_{K_n,n}). If I(·) denotes the indicator function, then

τ_1 = \sum_{1\le i\le K_n} I(N_{i,n} = 1)\, I(N_{i,N} = 1),

where N_{i,N} denotes the population frequency of the i-th sample cell, namely the number of sample uniques that are also population uniques (Bethlehem et al. [2], Skinner et al. [24]). Bayesian nonparametric inference for τ1 relies on the specification of a (nonparametric) prior distribution on the discrete distribution P, which in turn leads to a prior model for the microdata partition.

The Pitman-Yor process prior
We assume the Pitman-Yor process as a prior model for the unknown discrete distribution P. A simple and intuitive definition of the Pitman-Yor process follows from its stick-breaking construction (Pitman [16]). For α ∈ [0, 1) and θ > −α let: i) (V_i)_{i≥1} be independent random variables such that V_i is distributed as a Beta distribution with parameters (1 − α, θ + iα); ii) (Z_j)_{j≥1} be random variables, independent of the V_i's, and independent and identically distributed as a non-atomic distribution ν on Z. If we set p_1 = V_1 and p_j = V_j \prod_{1\le i\le j-1}(1 − V_i) for j ≥ 2, which ensures that \sum_{j\ge 1} p_j = 1 almost surely, then P_{α,θ} = \sum_{j\ge 1} p_j \delta_{Z_j} is a Pitman-Yor process on Z with "discount" parameter α and scale parameter θ. The Dirichlet process arises as a special case by letting α = 0. The Pitman-Yor process generalizes the Dirichlet process by means of the "discount" α, which controls the tail behaviour of P_{α,θ}, ranging from geometric tails to heavy power-law tails. In particular, for α ∈ (0, 1), let (p_{(j)})_{j≥1} be the random probabilities p_j of P_{α,θ} in decreasing order. Then, as j → +∞ the p_{(j)}'s follow a power-law distribution of exponent σ = α^{-1} (Pitman and Yor [18]). This shows that α ∈ (0, 1) tunes the power-law tail behaviour of P_{α,θ} through the small probabilities p_{(j)}: the larger α the heavier the tail of P_{α,θ}, whereas a geometric tail arises as α → 0.
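The stick-breaking construction above is easy to simulate. The following sketch (assuming NumPy; the truncation level `n_atoms` is an approximation, since the true process has infinitely many atoms) draws the random probabilities p_j of a Pitman-Yor process:

```python
import numpy as np

def pitman_yor_weights(alpha, theta, n_atoms, rng):
    """Truncated stick-breaking: p_1 = V_1 and p_j = V_j * prod_{i<j}(1 - V_i),
    with V_i ~ Beta(1 - alpha, theta + i*alpha)."""
    i = np.arange(1, n_atoms + 1)
    v = rng.beta(1.0 - alpha, theta + i * alpha)
    # prepend 1 so that entry j of the cumulative product equals prod_{i<j}(1 - V_i)
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * stick_left

rng = np.random.default_rng(0)
p = pitman_yor_weights(alpha=0.5, theta=1.0, n_atoms=10_000, rng=rng)
print(p.sum())  # close to 1 for a deep enough truncation
```

For larger α the sorted weights decay like a power law, while for α = 0 (Dirichlet process) they decay geometrically.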
According to de Finetti's representation theorem, a random sample from P_{α,θ} is part of an exchangeable sequence of Z-valued random variables (X_i)_{i≥1} whose directing measure Π is the law of P_{α,θ}. Let (X_1, ..., X_n) be a random sample from P_{α,θ}, i.e.

X_i \mid P_{α,θ} \overset{iid}{\sim} P_{α,θ}, \quad i = 1, \ldots, n, \qquad P_{α,θ} \sim Π.  (1)
Because of the discreteness of P_{α,θ}, the sample (X_1, ..., X_n) induces a random partition of {1, ..., n} into K_n ≤ n blocks, labelled by distinct symbols {Z*_1, ..., Z*_{K_n}}, with frequencies (N_{1,n}, ..., N_{K_n,n}) = (n_1, ..., n_k) such that N_{i,n} ≥ 1 for i = 1, ..., K_n and \sum_{1\le i\le K_n} N_{i,n} = n; see Pitman [17, Chapter 3] for a detailed account. A generative model for the X_i's, and hence for the induced random partition, is provided by the predictive distribution of the Pitman-Yor process, namely

\Pr(X_{n+1} \in \cdot \mid X_1, \ldots, X_n) = \frac{θ + K_n α}{θ + n}\, ν(\cdot) + \sum_{i=1}^{K_n} \frac{N_{i,n} − α}{θ + n}\, \delta_{Z^*_i}(\cdot)  (2)

for n ≥ 1. That is, X_{n+1} is a new symbol (block), namely a symbol not already observed in the sample, with probability (θ + K_n α)/(θ + n), and it coincides with the observed symbol Z*_i with probability (N_{i,n} − α)/(θ + n); see Pitman [17] for a detailed account of the predictive distribution (2).
The predictive distribution of the Pitman-Yor process highlights the role of the "discount" parameter α in the sampling process: it drives a combined effect in terms of a reinforcement mechanism and an increase in the rate of generating new symbols. In particular, a new symbol z* entering the sample produces two effects: i) a mass proportional to (1 − α) is assigned to the z* empirical component of (2); ii) a mass proportional to α is assigned to the probability of generating new symbols in (2). That is, the probability mass assigned to the symbol z* is less than proportional to 1, and the remaining probability mass is assigned to the probability of generating new symbols. The first effect gives rise to a reinforcement mechanism: the sampling procedure allocates more mass to symbols with higher frequencies. The second effect implies that the probability of generating new symbols, which overall still decreases as a function of n, is increased by α/(θ + n + 1). The larger α the stronger the reinforcement mechanism and the higher the probability of new symbols. For α = 0, that is under the Dirichlet process prior, masses are proportional to symbols' frequencies, which do not alter the probability of discovering new symbols. We refer to Bacallado et al. [1] for a detailed account of the predictive distribution (2), as well as generalizations thereof, and for characterizations of (2) with respect to the use of the sampling information, i.e. the "sufficientness postulate", and of Pólya-like urn schemes.
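The predictive rule also yields a simple sequential sampler for the induced random partition. A minimal sketch (assuming NumPy): after m observations spread over k cells, a new cell is drawn with probability (θ + kα)/(θ + m), and existing cell i with probability (n_i − α)/(θ + m):

```python
import numpy as np

def sample_py_partition(alpha, theta, n, rng):
    """Sequentially sample cell assignments from the Pitman-Yor predictive rule:
    a new cell with probability (theta + k*alpha)/(theta + m), existing cell i
    with probability (n_i - alpha)/(theta + m)."""
    counts = []  # counts[i] = current frequency of cell i
    for m in range(n):
        k = len(counts)
        probs = np.array([c - alpha for c in counts] + [theta + k * alpha])
        probs /= theta + m
        j = rng.choice(k + 1, p=probs)
        if j == k:
            counts.append(1)   # a new cell is created
        else:
            counts[j] += 1     # reinforcement of an existing cell
    return np.array(counts)

rng = np.random.default_rng(1)
counts = sample_py_partition(alpha=0.5, theta=1.0, n=1000, rng=rng)
print(len(counts), counts.sum())  # number of cells K_n; frequencies sum to n
```

The reinforcement mechanism is visible in the `counts[j] += 1` branch, while the discount α simultaneously shrinks the mass of each occupied cell and boosts the new-cell probability.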
Remark 1. The power-law tail behaviour of the Pitman-Yor process emerges from the large-n asymptotic behaviour of the number K_n of distinct symbols and the number M_{r,n} of distinct symbols with frequency r ≥ 1 in n random samples from P_{α,θ}. From Pitman [17, Theorem 3.8], K_n behaves as n^α for large n; this is the behaviour of the number of distinct symbols in n random samples from a power-law distribution of exponent σ = α^{-1}. Moreover, from Pitman [17, Lemma 3.11] it holds that the proportion M_{r,n}/K_n of distinct symbols with frequency r behaves as r^{−α−1} for large n and large r; this is, up to a constant of proportionality, the distribution of the number of distinct symbols with frequency r in n random samples from a power-law distribution of exponent σ = α^{-1}.
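The growth rate in Remark 1 can be checked empirically. Since the total mass of the occupied cells under the predictive rule is (m − kα)/(θ + m), the number of distinct symbols K_n can be simulated on its own, without tracking individual frequencies; the sketch below (assuming NumPy) verifies that K_n/n^α is roughly stable in n for α = 1/2:

```python
import numpy as np

def sample_kn(alpha, theta, n, rng):
    # A new cell appears at step m+1 with probability (theta + k*alpha)/(theta + m),
    # so the sequence K_1, K_2, ... is a Markov chain that can be simulated directly.
    k = 0
    for m in range(n):
        if rng.random() < (theta + k * alpha) / (theta + m):
            k += 1
    return k

rng = np.random.default_rng(2)
ratios = []
for n in (1_000, 10_000):
    ks = [sample_kn(0.5, 1.0, n, rng) for _ in range(20)]
    ratios.append(np.mean(ks) / n ** 0.5)
print(ratios)  # roughly stable across n, as K_n grows like n^alpha
```

The two printed ratios concentrate around the same value, consistent with K_n ≍ n^α (Pitman [17, Theorem 3.8]).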

Posterior inference for τ 1
We consider microdata units modeled under the Bayesian nonparametric framework (1). That is, a population of N ≥ 1 microdata units is assumed to be a random sample (X_1, ..., X_N) from a Pitman-Yor process, of which the first n < N elements (X_1, ..., X_n) are observable. We characterize the posterior distribution of τ1, given (X_1, ..., X_n). To introduce our main result, it is useful to recall the generalized factorial distribution (Charalambides [7, Chapter 2]). For a real a and r ∈ ℕ let (a)_{(r)} be the rising factorial, that is (a)_{(0)} = 1 and (a)_{(r)} = \prod_{0\le i\le r-1}(a + i) for r ∈ ℕ \ {0}, and for a > 0 and r, s ∈ ℕ with r ≤ s let C(r, s; a) be the generalized factorial coefficient (Charalambides [7]), defined in Appendix A.1. The next theorem provides the posterior distribution of τ1, given (X_1, ..., X_n), as a mixture of a (general) hypergeometric distribution; see Appendix A for the proof of Theorem 1. Theorem 1 is the first result in the literature to provide a closed-form posterior distribution of τ1. This is critical to quantify, by means of Monte Carlo sampling, the uncertainty of our Bayesian procedure through credible intervals; see Section 2.3 below. According to (4), for any fixed (α, θ), the number M_{1,n} = m_1 of sample uniques is sufficient for estimating τ1. The estimator (5) is easy to implement, computationally efficient, and scalable to massive datasets. Moreover, it has a simple interpretation as a proportion of the number m_1 of sample uniques. The estimator (5) is somewhat reminiscent of the "naive" nonparametric estimator (Bethlehem et al. [2], Skinner and Elliot [23]) of τ1, namely τ̌_1 = (n/N) m_1. In particular, τ̂_1 is a smoothed version of τ̌_1, where the smoothing acts by replacing the purely empirical proportion n/N with the parametric proportion w_{n,N}(α, θ). For any fixed θ, n and N, the proportion w_{n,N}(α, θ) increases in α, meaning that the larger α the higher τ̂_1.
This behaviour, which agrees with the role of α discussed in Section 2.1, shows the effectiveness of the "discount" α in tuning the inference to the tail behaviour of the empirical distribution of microdata.
By assuming both the sample and the population to be large, two facts emerge: i) the critical influence of the "discount" α in estimating τ1, relative to the scale θ; ii) a crucial limitation of the estimator proposed in Samuels [21]. In particular, let f ≈ g mean f/g → 1. As n, N → +∞ with n < N, for any fixed θ,

w_{n,N}(α, θ) ≈ (n/N)^{1−α},  (7)

and hence

τ̂_1 ≈ m_1 (n/N)^{1−α}.  (8)

That is, for large n and N with n < N, the posterior distribution (4) admits a first order (local) approximation in terms of a Binomial distribution with parameters {m_1, (n/N)^{1−α}}. See Appendix B for the proof of (7). This result shows that, in realistic scenarios, the "discount" α is the sole tuning parameter of our Bayesian nonparametric model. In particular, for α = 0, namely under the Dirichlet process prior, the approximated estimator (8) reduces to the "naive" estimator τ̌_1. Equivalently, for large n and N, the "naive" estimator τ̌_1 approximates the estimator of Samuels [21]. Therefore, in realistic scenarios, Samuels' estimator is a purely empirical estimator, meaning that no tuning parameters are available.
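The large-sample approximation of the estimator is a one-line computation. The sketch below illustrates, for hypothetical values of m_1, n and N, how m_1 (n/N)^{1−α} grows with the "discount" α, recovering the "naive" estimate m_1 n/N at α = 0:

```python
def tau1_approx(m1, n, N, alpha):
    """Large-sample approximation of the estimator: m1 * (n/N) ** (1 - alpha);
    alpha = 0 recovers the 'naive' estimator m1 * n / N."""
    return m1 * (n / N) ** (1 - alpha)

# hypothetical numbers of sample uniques, sample size and population size
m1, n, N = 1000, 100_000, 1_000_000
for alpha in (0.0, 0.25, 0.5, 0.75):
    print(alpha, tau1_approx(m1, n, N, alpha))  # increases with alpha
```

This monotonicity in α is precisely what allows the Pitman-Yor model to correct the underestimation of the Dirichlet process in heavy-tailed scenarios.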

Computations
For any fixed α ∈ (0, 1) and θ > −α, the estimator (5) is readily computed. To implement Theorem 1 we must specify the prior's parameters (α, θ), whose choice is critical for a correct estimation of τ1. Two common approaches for estimating (α, θ) are: i) the hierarchical Bayes approach, which relies on Bayesian estimates obtained from the posterior distribution of (α, θ) with respect to a suitable prior specification; ii) the empirical Bayes approach, which relies on estimates obtained by maximizing, with respect to (α, θ), the marginal likelihood of the observable sample. Here, we adopt the empirical Bayes approach. Let (X_1, ..., X_n) feature K_n = k distinct symbols with frequencies (N_{1,n}, ..., N_{K_n,n}) = (n_1, ..., n_k). Pitman [16, Proposition 9] provides the likelihood function of (X_1, ..., X_n), and the empirical Bayes approach reduces to solving

(α̂, θ̂) = \arg\max_{(α,θ)} \frac{\prod_{i=1}^{k-1}(θ + iα)}{(θ + 1)_{(n-1)}} \prod_{j=1}^{k} (1 − α)_{(n_j - 1)}.  (9)

The optimization problem (9) can be solved numerically and efficiently even for large values of n, by means of routines available in standard software. We refer to Favaro and Naulet [9] for provable guarantees of the estimator α̂. Alternatively, one could specify a prior distribution on (α, θ). However, we found no relevant differences between the fully Bayes and the empirical Bayes approach, given that the posterior distribution of (α, θ) is highly concentrated when n is large.

Table 1: Estimates of τ1 for synthetic data. The parameters are σ (Zipf data) and π (Geometric data). pb-1 is the parametric Bayes of Bethlehem et al. [2]; pb-2 is the parametric Bayes of Skinner et al. [24]; neb is the nonparametric empirical Bayes of Camerlenghi et al. [6].
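The empirical Bayes step can be sketched numerically. The code below (assuming NumPy and SciPy) maximizes the log marginal likelihood of the observed frequencies (n_1, ..., n_k) under the Pitman-Yor process, \sum_{i=1}^{k-1} \log(θ + iα) − \log(θ + 1)_{(n-1)} + \sum_j \log(1 − α)_{(n_j-1)}, with rising factorials evaluated via log-Gamma functions; the toy frequencies are illustrative only, and θ is restricted to be positive for simplicity (the model only requires θ > −α):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def py_neg_loglik(params, freqs):
    """Negative log marginal likelihood (EPPF) of the Pitman-Yor process for
    observed cell frequencies freqs = (n_1, ..., n_k)."""
    alpha, theta = params
    n, k = int(np.sum(freqs)), len(freqs)
    ll = np.sum(np.log(theta + alpha * np.arange(1, k)))         # prod_{i=1}^{k-1} (theta + i*alpha)
    ll -= gammaln(theta + n) - gammaln(theta + 1)                # (theta + 1)_{(n-1)}
    ll += np.sum(gammaln(freqs - alpha) - gammaln(1.0 - alpha))  # prod_j (1 - alpha)_{(n_j - 1)}
    return -ll

# illustrative toy frequencies: three large cells plus thirty singletons
freqs = np.array([50, 20, 10] + [1] * 30, dtype=float)
res = minimize(py_neg_loglik, x0=[0.3, 1.0], args=(freqs,),
               bounds=[(1e-4, 1 - 1e-4), (1e-4, None)])
alpha_hat, theta_hat = res.x
print(alpha_hat, theta_hat)
```

In practice `freqs` would be the microdata cell counts; the many singletons in the toy data pull the fitted discount α̂ away from zero.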

Simulated data
We consider synthetic data from two super-populations P. For the first super-population, we let the "true" probability masses (p_j)_{j≥1} be those of a Zipf distribution with index σ > 1, so that data are generated from the discrete distribution P = \sum_{j\ge 1} ζ(σ)^{-1} j^{−σ} \delta_{z_j}, with ζ(σ) = \sum_{j\ge 1} j^{−σ}. As discussed in Section 2, this is the scenario in which the Pitman-Yor specification is recommended. We considered values σ = 1.25, 1.50, 1.75, 2, and different combinations of n and N. The prior's parameters (α, θ) are estimated through maximum likelihood; see Section 2.3. Table 1 reports estimates of τ1, together with 99% credible intervals (within brackets), and the "true" value of τ1. Credible intervals are obtained via Monte Carlo sampling from the posterior distribution (4), by means of the scheme described in Section 2.3. Table 2 reports the corresponding estimates of (α, θ) for the Pitman-Yor model. In all these scenarios, the Bayesian nonparametric estimator (5) is much closer to the "true" value of τ1 than its partition-based competitors. In particular, the approaches of Bethlehem et al. [2], Skinner et al. [24] and Samuels [21] underestimate the "true" τ1, whereas the approach of Camerlenghi et al. [6] tends to overestimate it.
For the second super-population, we let P = \sum_{j\ge 1} π(1 − π)^{j-1} \delta_{z_j}, corresponding to a geometric distribution with parameter π ∈ (0, 1). We consider two values π = 10^{−3}, 10^{−4} and the same sample sizes n and population sizes N as before. As discussed in Section 2, this is the ideal setting for the Dirichlet process, and this is indeed confirmed by Table 1. Moreover, the Pitman-Yor estimator reduces to the Dirichlet process estimator, since we obtain α̂ = 0, as reported in Table 2.
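As an illustration of how such synthetic experiments can be set up, the following sketch (assuming NumPy) draws a Zipf-like population, truncated at J atoms as an approximation of the infinite-support model, and computes the "true" τ1, the number m_1 of sample uniques, and the "naive" estimate m_1 n/N; all sizes here are illustrative and smaller than those used in Table 1:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)

# Zipf(sigma) super-population truncated at J atoms; sizes are illustrative
sigma, J, N, n = 1.5, 200_000, 100_000, 10_000
p = np.arange(1, J + 1, dtype=float) ** -sigma
p /= p.sum()
population = rng.choice(J, size=N, p=p)  # N i.i.d. draws from P
sample = population[:n]                  # first n units are the observable microdata

pop_freq = Counter(population)
samp_freq = Counter(sample)
tau1 = sum(1 for c, f in samp_freq.items() if f == 1 and pop_freq[c] == 1)
m1 = sum(1 for f in samp_freq.values() if f == 1)
print(tau1, m1, m1 * n / N)  # true tau_1, sample uniques, 'naive' estimate
```

With a heavy tail (small σ), m_1 is large and the "naive" estimate typically falls well below the true τ1, which is the regime the "discount" parameter is designed to correct.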

The 2018 American Community Survey
We consider real data from the 2018 American Community Survey (Manrique-Vallier and Reiter [13], Carota et al. [4]). This dataset is a random sample of the American population (usa.ipums.org/usa). We regard the 2018 American Community Survey data as a "population" of size N = 2,432,323, and we consider observable samples which are the 5% and 10% fractions of the population, obtained by sampling at random n = 121,616 and n = 243,232 individuals, respectively. We restricted the population to individuals older than 20, and we cross-classified the records according to the following variables: census region (9 levels), race (139 levels), and primary occupation (531 levels), obtaining K_N = 60,215 non-empty cells.
As detailed in Section 2, the Pitman-Yor specification should be employed whenever the data follow a power-law behaviour. However, in real data problems such an assumption must be empirically validated. A simple approach is comparing the observed number m_r of distinct types with frequency r = 1, ..., n against the model-based expected frequencies under a Pitman-Yor specification, namely

E[M_{r,n}] = \binom{n}{r} \frac{(1 − α)_{(r-1)} (θ + α)_{(n-r)}}{(θ + 1)_{(n-1)}},

where the parameters in the above formula are replaced by their maximum likelihood estimates; see also Favaro et al. [8] for further details. A poor in-sample fit strongly suggests that the corresponding disclosure risk assessment will be unreliable.
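Expected frequencies of this type are computed stably on the log scale. The sketch below (assuming NumPy and SciPy) evaluates E[M_{r,n}] = C(n,r) (1 − α)_{(r-1)} (θ + α)_{(n-r)} / (θ + 1)_{(n-1)}, a standard moment formula for the Pitman-Yor process, via log-Gamma functions, and verifies the sanity check \sum_r r E[M_{r,n}] = n, since every observation falls in exactly one cell of some size:

```python
import numpy as np
from scipy.special import gammaln

def expected_mr(r, n, alpha, theta):
    """E[M_{r,n}] = C(n,r) (1-alpha)_{(r-1)} (theta+alpha)_{(n-r)} / (theta+1)_{(n-1)},
    evaluated on the log scale for numerical stability."""
    log_binom = gammaln(n + 1) - gammaln(r + 1) - gammaln(n - r + 1)
    log_num = (gammaln(r - alpha) - gammaln(1.0 - alpha)
               + gammaln(theta + alpha + n - r) - gammaln(theta + alpha))
    log_den = gammaln(theta + n) - gammaln(theta + 1)
    return np.exp(log_binom + log_num - log_den)

n, alpha, theta = 50, 0.5, 1.0
em = [expected_mr(r, n, alpha, theta) for r in range(1, n + 1)]
# every observation lies in exactly one cell of some size, so sum_r r*E[M_{r,n}] = n
print(sum(r * m for r, m in zip(range(1, n + 1), em)))
```

Plotting the observed m_r against these model-based curves, with (α, θ) replaced by their estimates, is exactly the goodness-of-fit check described above.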
The observed values m_r for r = 1, ..., n and their model-based estimates for the 5% and 10% fractions of the data from the American Community Survey presented in Section 2.2 are reported in Figure 2, both under a Pitman-Yor and a Dirichlet process specification. These results confirm a very good in-sample fit for the Pitman-Yor process. Conversely, the Dirichlet process seems unsuitable for this specific dataset. The prior's parameters α and θ are estimated through maximum likelihood; see Section 2.3. Results in Table 3 confirm what we observed for synthetic data; in particular, they confirm the superior empirical performance of our estimators with respect to partition-based competitors. The approaches of Bethlehem et al. [2], Skinner et al. [24] and Samuels [21] underestimate the true τ1, whereas the approach of Camerlenghi et al. [6] overestimates it.

Table 3: Estimates of τ1 for real data from the 2018 American Community Survey. pb-1 is the parametric Bayes of Bethlehem et al. [2], pb-2 is the parametric Bayes of Skinner et al. [24], and neb is the nonparametric empirical Bayes of Camerlenghi et al. [6].

Discussion
In this paper, we considered the problem of Bayesian nonparametric estimation of τ1, which is arguably the most popular measure of disclosure risk. Our study is motivated by an early work of Samuels [21], where empirical analyses showed that the use of Dirichlet process priors leads to underestimation of τ1 in many realistic scenarios, with the underestimation getting worse as the tail behaviour of the empirical distribution of microdata gets heavier. Here, to overcome such an underestimation phenomenon, we proposed the use of the Pitman-Yor process prior, which generalizes the Dirichlet process prior through an additional "discount" parameter that controls the tail behaviour of the prior, ranging from geometric tails to heavy power-law tails. Under the Pitman-Yor process prior, we obtained a simple characterization of the posterior distribution of τ1, in terms of a compound (general) hypergeometric distribution, and made use of the posterior mean as an estimator of τ1. Such a novel estimator has the same desirable features as Samuels' estimator, including ease of implementation, computational efficiency and scalability to massive data, and, in addition, it makes it possible to reduce the underestimation of τ1 by tuning the "discount" parameter with respect to observable microdata. We presented an empirical analysis of our Bayesian nonparametric approach through synthetic data and real data, showing its effectiveness in reducing the underestimation phenomenon of Samuels' approach.
While τ1 is known to be the most popular measure of disclosure risk (Bethlehem et al. [2] and Skinner et al. [24]), one might consider alternative measures by broadening the definition of "uniqueness". For instance, Fienberg and Makov [11] considered a generalization of τ1 which is defined in terms of the number of cells with frequency less than or equal to 2. In general, one may consider the measure

τ_{p,q} = \sum_{1\le i\le K_n} I(N_{i,n} \le p)\, I(N_{i,N} \le p + q),

namely the number of cells with sample frequency less than or equal to p which have population frequency less than or equal to p + q. In particular, τ1 corresponds to τ_{1,0}. We refer to Appendix D for Bayesian nonparametric inference for τ_{1,q}, which is arguably the most natural generalization of τ1. It remains an open problem to adapt our Bayesian nonparametric approach to deal with structurally empty cells, i.e. structural zeros (Manrique-Vallier and Reiter [14]). In such a context, it may be useful to consider spike-and-slab generalizations of the Pitman-Yor process prior (Scarpa and Dunson [22], Canale et al. [3]). These consist in replacing the non-atomic distribution ν of the Pitman-Yor process prior with a distribution ν̄(ζ) = ζδ_0 + (1 − ζ)ν, with ζ ∈ [0, 1] and ν a non-atomic distribution. The parameter ζ may then be used to include information on structural zeros, being interpretable as the proportion of structural zeros in the population.
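Given full knowledge of sample and population cells, the measure τ_{p,q} defined above is straightforward to compute; a minimal sketch with illustrative toy data:

```python
from collections import Counter

def tau_pq(sample_cells, population_cells, p, q):
    """Number of cells with sample frequency <= p whose population frequency
    is <= p + q; tau_pq(..., 1, 0) recovers tau_1."""
    samp, pop = Counter(sample_cells), Counter(population_cells)
    return sum(1 for c, f in samp.items() if f <= p and pop[c] <= p + q)

# toy data: population frequencies a:3, b:1, c:2, d:1; each sample cell has frequency 1
pop = ["a"] * 3 + ["b"] + ["c"] * 2 + ["d"]
smp = ["a", "b", "c", "d"]
print(tau_pq(smp, pop, 1, 0))  # b and d are sample and population uniques -> 2
print(tau_pq(smp, pop, 1, 1))  # also counts c, whose population frequency is 2 -> 3
```

Of course, in the disclosure risk problem the population frequencies are unknown, which is precisely why τ_{p,q} must be inferred from the sample, as in Appendix D.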

A.1. Generalized factorial coefficients
For t ∈ ℝ, a > 0 and n ∈ ℕ_0, let (at)_{(n)} be the rising factorial of at of order n, i.e. (at)_{(n)} = \prod_{0\le i\le n-1}(at + i). The (n, k)-th generalized factorial coefficient, denoted by C(n, k; a), is the k-th coefficient in the expansion of (at)_{(n)} into rising factorials, i.e.

(at)_{(n)} = \sum_{k=0}^{n} C(n, k; a)\, (t)_{(k)},
with C(0, 0; a) = 1, C(n, 0; a) = 0 for n > 0, and C(n, i; a) = 0 for i > n. For b > 0, let us consider the k-th coefficient in the expansion of (at − b)_{(n)} into rising factorials, so that

(at − b)_{(n)} = \sum_{k=0}^{n} C(n, k; a, b)\, (t)_{(k)},

with C(0, 0; a, b) = 1, C(n, 0; a, b) = (−b)_{(n)} for n > 0, and C(n, i; a, b) = 0 for i > n. The coefficient C(n, k; a, b) is referred to as the non-centered generalized factorial coefficient (Charalambides [7]). Here, it is useful to recall the following convolution property: for any b_1, b_2 > 0 and r_1, r_2 > 0,

\binom{r_1 + r_2}{r_1} C(n, r_1 + r_2; a, b_1 + b_2) = \sum_{s=0}^{n} \binom{n}{s} C(s, r_1; a, b_1)\, C(n − s, r_2; a, b_2).  (12)

The convolutional identity (12) can be found in Charalambides [7, Chapter 2] and plays a critical role in the proof of Theorem 1.
Distributional properties, and moments, of the general hypergeometric distribution can be easily obtained from (14). We refer to Charalambides [7] and Johnson et al. [12] for a comprehensive account of the generalized factorial distribution and the (general) hypergeometric distribution.

Appendix B: Proofs of Equations (7) and (8)
Recall that, by means of Stirling's formula, Γ(n + i)/Γ(n) ≈ n^i as n → +∞ is a first order approximation of the Gamma function. By applying it to (23), we obtain (24) as n → +∞ and N → +∞.
Equation (24) is the moment of order z of a Binomial random variable with parameter (m 1 , (n/N ) 1−α ), with m 1 being the number of trials and (n/N ) 1−α being the probability of success in a trial. This completes the proof of Equation (7) and Equation (8).
Appendix C: On the distribution of U_{1−α, θ+n}

Appendix D: Bayesian nonparametric inference for τ 1,q
Under the Pitman-Yor process prior, we characterize the posterior distribution of τ_{1,q} through its moments; this leads to a Bayesian nonparametric estimator of τ_{1,q} in terms of the posterior mean. The proof is along lines similar to the proof of Theorem 1. Let (X_1, ..., X_n) be a random sample from the Pitman-Yor process P_{α,θ}, and let (X_1, ..., X_n) feature K_n = k distinct symbols, labelled by {Z*_1, ..., Z*_{K_n}}, with frequencies N_n = (N_{1,n}, ..., N_{K_n,n}) = n = (n_1, ..., n_k) such that N_{i,n} > 0 and \sum_{1\le i\le K_n} N_{i,n} = n. Moreover, for any N > n let (X_{n+1}, ..., X_N) be an additional random sample from P_{α,θ}, and let N_{j,N−n} ≥ 0 be the number of records X_{n+i}, i = 1, ..., N − n, that coincide with the label Z*_j, j = 1, ..., K_n. Moreover, let V_{N−n} = N − n − \sum_{1\le i\le K_n} N_{i,N−n} be the number of X_{n+i}, i = 1, ..., N − n, that do not coincide with any of the Z*_j's. To compute the posterior distribution of τ_{1,q}, we first determine its moment of order z ≥ 1, i.e.

E{(τ_{1,q})^z | X_1, ..., X_n} = E{(τ_{1,q})^z | N_n = n, K_n = k}.  (26)

The computation of (26) involves the conditional probabilities Pr(V_{N−n} = v | N_n = n, K_n = k) and Pr(N_{c_1,N−n} = h_1, ..., N_{c_x,N−n} = h_x | N_n = n, K_n = k)