Differentially Private Confidence Intervals for Proportions under Stratified Random Sampling

Confidence intervals are a fundamental tool for quantifying the uncertainty of parameters of interest. With the increase of data privacy awareness, developing a private version of confidence intervals has gained growing attention from both statisticians and computer scientists. Differential privacy is a state-of-the-art framework for analyzing privacy loss when releasing statistics computed from sensitive data. Recent work has been done around differentially private confidence intervals, yet to the best of our knowledge, rigorous methodologies on differentially private confidence intervals in the context of survey sampling have not been studied. In this paper, we propose three differentially private algorithms for constructing confidence intervals for proportions under stratified random sampling. We articulate two variants of differential privacy that make sense for data from stratified sampling designs, analyzing each of our algorithms within one of these two variants. We establish analytical privacy guarantees and asymptotic properties of the estimators. In addition, we conduct simulation studies to evaluate the proposed private confidence intervals, and two applications to the 1940 Census data are provided.


Introduction
With the increase of privacy awareness in the modern information era, establishing privacy-preserving methodologies for statistics and machine learning has become an active research area. Differential privacy, a state-of-the-art privacy protection technique [14], is considered a gold standard for rigorous privacy guarantees. Not only has it drawn significant attention in academia [15,16], but it has also been deployed by governments, firms, and other data agencies, such as the U.S. Census Bureau [1], Google [20], Microsoft [8], and Apple [38].
Recently, the U.S. Census Bureau released a new demonstration of its differentially private Disclosure Avoidance System (DAS) for the 2020 Census [5,24]. At the intersection of differential privacy and statistics, both statisticians and computer scientists are working on developing private versions of statistical inference procedures. Early work discussing differential privacy in the context of statistics includes [17,13,42,37]. More recent work has explored statistical inference and estimation under the constraint of differential privacy [6,29,31].
As one of the most fundamental tools for statistical inference, confidence intervals are ubiquitous in quantifying the uncertainty of parameters of interest. In this paper, we propose three differentially private algorithms for constructing confidence intervals for the population proportion under stratified random sampling. To the best of our knowledge, our work is the first to establish rigorous methodologies on differentially private confidence intervals in the context of survey sampling. Survey sampling is an important area in statistics that involves selecting a sample of individuals from a target population to conduct a survey. It provides timely and cost-efficient estimates of population characteristics of interest and is widely used in broad-scale data gathering, such as the American Community Survey (ACS), the Survey of Income and Program Participation (SIPP), and the Current Population Survey (CPS).
This paper provides the first study of differentially private confidence intervals for data from stratified sampling designs. Specifically:
• We articulate two specific variants of differential privacy that are appropriate for data from stratified sampling designs. In addition to the standard notion of differential privacy, we also consider settings in which the stratum sample sizes are fixed and public. This latter setting allows for simpler algorithms and tighter confidence intervals.
• We give methods to propagate the uncertainty due to the application of differentially private mechanisms (adding random noise) into the construction of confidence intervals. A necessary bias correction is made to achieve (asymptotically) unbiased variance estimates. Central limit theorem (CLT)-type statements are provided to guarantee the confidence level asymptotically.
• We assess the performance of the proposed algorithms both in theory and through simulations. The theoretical analysis comparing the non-private and private methods gives practitioners a sense of how the algorithms would work prior to applying them to real data.
• To support the theoretical analysis of one of the algorithms, we study the behavior of a reciprocal normal variable in depth. A general form of the Taylor expansion (for conditional moments) is obtained to solve the problem of the non-existence of moments due to its heavy-tailed nature.
The paper is organized as follows. We briefly discuss the existing work on differentially private confidence intervals in Section 1.1. Section 2 provides preliminaries on confidence intervals of population proportions and differential privacy. In Section 3, we discuss the methodology of three differentially private algorithms. Section 4 provides theorems on both privacy and asymptotic coverage guarantees. Numerical experiments, including simulation studies and two applications to the 1940 Census data, are conducted in Section 5. Section 6 discusses the implications of our methods and general research directions on differentially private confidence intervals.

Related Work
Differentially private confidence intervals have recently been studied for other settings. Several works studied differentially private confidence intervals for the population mean of normally distributed data [27,11,23]. Other tasks on confidence intervals have also been explored. Drechsler et al. designed and evaluated several strategies to obtain differentially private confidence intervals for the median [10]. Wang et al. provided confidence intervals for differentially private models trained with objective or output perturbation algorithms [41].
In addition, bootstrapping is a popular technique for constructing more general differentially private confidence intervals. Ferrando et al. proposed a general-purpose approach to construct confidence intervals for a population parameter [21]. A numerical confidence interval for the difference of means was provided in [9]. The nonparametric bootstrap was considered in [3]. Covington et al. described a method to induce distributions of mean and covariance estimates via the bag of little bootstraps (BLB), which can further produce private confidence intervals [7].
Our work is the first to study design-based approaches to sampling. In a design-based setting, the values of interest are viewed as fixed but unknown constants. Randomness only comes from the sampling design. The selection probabilities introduced with the design will be used for estimation. In contrast, in a model-based setting, a parametric model is postulated. In many cases, especially with natural populations where no accurate prior information about the population distribution is available, design-based sampling methods can be more reassuring. More discussion of design-based versus model-based approaches in sampling can be found in [39].

Preliminaries
In this section, we provide some preliminaries on population proportion estimation and differential privacy. We first review the classic Wald confidence interval for the population proportion under stratified random sampling. Then we define a notion of differential privacy specifically for stratified data. Some properties of differential privacy are revisited in preparation for the theoretical analysis in Section 4.

Confidence Intervals for the Population Proportion
In stratified random sampling, a population of $N$ individuals is partitioned into $H$ strata, where stratum $h$ has $N_h$ individuals, and simple random sampling of $n_h$ individuals is conducted within each stratum. When the objective is to estimate the proportion of individuals having some attribute in the population, one can estimate it by the sample proportion. Let $y_{hi}$ be the corresponding indicator variable: $y_{hi} = 1$ when individual $i$ in stratum $h$ has the attribute and $y_{hi} = 0$ otherwise. One can estimate the population proportion $p$ by

$$\hat{p} = \sum_{h=1}^{H} w_h \hat{p}_h, \quad \text{where } w_h = \frac{N_h}{N} \text{ and } \hat{p}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi}.$$

An unbiased estimator for $\mathrm{Var}(\hat{p}_h)$ is given by the sample variance in the stratum,

$$\widehat{\mathrm{Var}}(\hat{p}_h) = \frac{N_h - n_h}{N_h} \cdot \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1}. \tag{1}$$

Then an unbiased estimator for $\mathrm{Var}(\hat{p})$ is given by $\widehat{\mathrm{Var}}(\hat{p}) = \sum_{h=1}^{H} w_h^2 \widehat{\mathrm{Var}}(\hat{p}_h)$. An approximate $100(1-\alpha)\%$ confidence interval for $p$ based on a normal distribution can be constructed:

$$\hat{p} \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{p})}, \tag{2}$$

where $z_{1-\alpha/2}$ denotes the $1 - \alpha/2$ quantile of the standard normal distribution. The normal approximation is useful when all the sample sizes are moderate to large. Otherwise, the t distribution with appropriate degrees of freedom is typically used to replace the standard normal distribution. For small sample sizes, various specialized confidence intervals have been developed [22].
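As a minimal sketch of the non-private Wald interval described above, the following Python function computes the stratified sample proportion, the usual unbiased variance estimate with the finite-population correction, and the normal-approximation interval. The function name `stratified_wald_ci` and its argument names are illustrative choices, not notation from the paper.

```python
import math
from statistics import NormalDist


def stratified_wald_ci(counts, sample_sizes, stratum_sizes, alpha=0.10):
    """Non-private Wald interval for a proportion under stratified sampling.

    counts[h]        -- number of sampled units in stratum h with the attribute
    sample_sizes[h]  -- n_h, the sample size in stratum h
    stratum_sizes[h] -- N_h, the population size of stratum h
    """
    N = sum(stratum_sizes)
    p_hat, var_hat = 0.0, 0.0
    for c_h, n_h, N_h in zip(counts, sample_sizes, stratum_sizes):
        w_h = N_h / N                      # stratum weight
        p_h = c_h / n_h                    # stratum sample proportion
        # Unbiased estimate of Var(p_h) with finite-population correction.
        var_h = (N_h - n_h) / N_h * p_h * (1.0 - p_h) / (n_h - 1)
        p_hat += w_h * p_h
        var_hat += w_h ** 2 * var_h
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(var_hat)
    return p_hat, var_hat, (p_hat - half_width, p_hat + half_width)


p_hat, var_hat, ci = stratified_wald_ci([30, 50], [100, 100], [1000, 1000])
```

With two equally sized strata and sample proportions 0.3 and 0.5, the point estimate is the weighted average 0.4.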

Differential Privacy
Differential privacy ensures that the output of a data analysis or query does not differ much when the data set is changed by one record, so that one cannot infer the presence or absence of any individual. If two data sets $x, x'$ differ by one record, we say that $x, x'$ are adjacent or neighboring, written as $x \sim x'$. The definition of differential privacy depends on how we define adjacency. For the partitioned data under stratified sampling, there are two ways to change a record: (1) one way is to substitute one record within a stratum, with all the stratum sample sizes fixed. We refer to this adjacency relation as the "substitute-one relation within a stratum" and denote it by $\sim_{ss}$. This relation corresponds to the case where the sample sizes are public and fixed; (2) another way to obtain an adjacent data set is to remove or add one record from one stratum; we refer to the corresponding relation as the "remove/add-one relation", denoted by $\sim_r$. In this case, one of the stratum sample sizes will change by one, as will the overall sample size. This relation corresponds to the case where the sample sizes are private.
Under either adjacency relation, we can define zero-concentrated differential privacy (ρ-zCDP) as in [4]:

Definition 1 (ρ-zCDP). Let $\mathcal{X}^*$ denote the space of the input data with an arbitrary finite dimension. Under the adjacency relation $\sim$, a randomized algorithm $M : \mathcal{X}^* \to \mathcal{Y}$ is ρ-zero-concentrated differentially private (ρ-zCDP) if, for every pair of adjacent data sets $x \sim x' \in \mathcal{X}^*$ and all $\alpha \in (1, \infty)$,

$$D_\alpha(M(x) \,\|\, M(x')) \le \rho \alpha,$$

where $D_\alpha(M(x) \,\|\, M(x'))$ is the $\alpha$-Rényi divergence [40] between the distribution of $M(x)$ and the distribution of $M(x')$.
The parameter ρ indicates the privacy level. A smaller ρ means a more restrictive distance control between $M(x)$ and $M(x')$. As a result, the outputs on two adjacent data sets are harder to tell apart and the algorithm achieves higher privacy. We call ρ the privacy budget when we deliberately design an algorithm to satisfy ρ-zCDP.
Depending on the adjacency notion, there are two types of differential privacy: bounded and unbounded differential privacy [28]. Definition 1 under the "remove/add-one relation" corresponds to the standard unbounded differential privacy. The sample size of the data set changes when one record is added or removed to obtain an adjacent data set. With the "substitute-one within a stratum" relation $\sim_{ss}$, the resulting notion corresponds to the bounded version of differential privacy, where the sizes of two adjacent data sets are the same. But it is somewhat different from the standard notion of bounded differential privacy in that, for the latter, substitutions can happen across strata. That is, we can change both the record and the stratum it is part of.
In the literature on differential privacy, (ϵ, δ)-DP ([15] Definition 2.4) is considered the classic notion. We consider ρ-zCDP because (1) ρ-zCDP implies (ϵ, δ)-DP ([4] Proposition 1.3), (2) the application of the Gaussian mechanism to achieve zCDP facilitates the theoretical analyses, and (3) the composition of ρ-zCDP is straightforward. The Gaussian mechanism is a prototypical example of a mechanism satisfying zCDP, which perturbs the true values by adding Gaussian noise. We provide the Gaussian mechanism and the composition and post-processing properties of ρ-zCDP in the following propositions. All propositions can be found in [4] and will be used in the analyses of privacy guarantees in Section 4.
A smaller budget leads to larger noise added to the query on average. Consequently, the output is more private.
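The relation between budget and noise can be made concrete with a short sketch of the Gaussian mechanism as stated in [4]: for a query with sensitivity Δ, adding noise drawn from N(0, Δ²/(2ρ)) satisfies ρ-zCDP, and ρ-zCDP implies (ρ + 2√(ρ log(1/δ)), δ)-DP for any δ > 0. The helper names below are illustrative and do not come from the paper.

```python
import math
import random


def gaussian_mechanism(value, sensitivity, rho, rng=random):
    """Release value + N(0, sensitivity^2 / (2 * rho)); this is rho-zCDP [4]."""
    sigma = sensitivity / math.sqrt(2.0 * rho)
    return value + rng.gauss(0.0, sigma), sigma ** 2


def zcdp_to_approx_dp(rho, delta):
    """Epsilon such that rho-zCDP implies (epsilon, delta)-DP [4]."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


# Halving the budget doubles the noise variance.
_, var_full = gaussian_mechanism(0.3, sensitivity=0.01, rho=0.5)
_, var_half = gaussian_mechanism(0.3, sensitivity=0.01, rho=0.25)
```

Because budgets add under composition, two releases with budgets ρ1 and ρ2 together satisfy (ρ1 + ρ2)-zCDP, which is why the algorithms below split a total budget ρ between the quantities they perturb.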

Methodology
Our goal is to release a ρ-zCDP confidence interval for the population proportion $p$ under stratified random sampling. To construct a confidence interval as in (2), we need to estimate both $p$ and the variance of the estimator privately. Recall that the non-private estimator of the population proportion is given by the sample proportion $\hat{p} = \sum_{h=1}^{H} w_h \hat{p}_h$.
We assume the stratum sizes $N_h$ are all public and fixed, and thus so are the weights $w_h$. To get a private estimator for $p$, denoted by $\tilde{p}$, we can add noise at the level of either the (non-private) estimator $\hat{p}$ or the stratum estimators $\hat{p}_h$. With $\tilde{p}$, we further devise a private estimator for $\mathrm{Var}(\tilde{p})$. Based on this idea, two algorithms for the case of public sample sizes are designed by adding noise at the stratum or population level in Section 3.1. In Section 3.2, we additionally propose adding noise at the stratum level when sample sizes are private. Throughout the paper, the accents $\hat{\cdot}$ and $\tilde{\cdot}$ are used to represent non-private and private estimators, respectively.

Estimating with Public Sample Sizes
When sample sizes $n_h$ are fixed, there are two natural strategies for perturbing $\hat{p}$: add Gaussian noise to (1) the stratum-level statistics $\hat{p}_h$, or (2) the overall statistic $\hat{p}$. Adding noise to the $\hat{p}_h$'s has the advantage of producing private estimators for stratum-level proportions simultaneously.

Adding Noise at the Stratum Level
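We apply the Gaussian mechanism to each stratum to derive a private estimator $\tilde{p}_h \overset{\mathrm{def}}{=} \hat{p}_h + e_h$, where $e_h$ is the Gaussian noise. Then the private estimator for the population proportion is

$$\tilde{p} = \sum_{h=1}^{H} w_h \tilde{p}_h.$$

As a result, the variance of $\tilde{p}$ consists of both the intrinsic variances of estimating the $p_h$'s by the $\hat{p}_h$'s and the additional variability from added noise:

$$\mathrm{Var}(\tilde{p}) = \sum_{h=1}^{H} w_h^2 \left( \mathrm{Var}(\hat{p}_h) + \mathrm{Var}(e_h) \right), \tag{3}$$

where $\mathrm{Var}(e_h)$, $h = 1, \ldots, H$, are public since they do not depend on the data.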
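To obtain a private confidence interval for $p$, we will need to privately estimate $\mathrm{Var}(\hat{p}_h)$. Note that the added noise biases the term $\hat{p}_h(1 - \hat{p}_h)$ in the non-private estimate of $\mathrm{Var}(\hat{p}_h)$ in (1). More specifically,

$$E_e\left[ \tilde{p}_h (1 - \tilde{p}_h) \right] = \hat{p}_h (1 - \hat{p}_h) - \mathrm{Var}(e_h),$$

where $E_e$ denotes the expectation taken over the randomness of the added noise. Then a private unbiased estimator of $\mathrm{Var}(\hat{p}_h)$ on the right-hand side of (3) is given by

$$\widetilde{\mathrm{Var}}(\hat{p}_h) = \frac{N_h - n_h}{N_h} \cdot \frac{\tilde{p}_h (1 - \tilde{p}_h) + \mathrm{Var}(e_h)}{n_h - 1}.$$

The resulting algorithm, StrNz-PubSz, is presented in Algorithm 1. It outputs the private confidence interval $\tilde{p} \pm z_{1-\alpha/2} \sqrt{\tilde{V}}$ with $\tilde{V} = \sum_{h=1}^{H} w_h^2 \big( \widetilde{\mathrm{Var}}(\hat{p}_h) + \mathrm{Var}(e_h) \big)$, where $z_{1-\alpha/2}$ is the $(1 - \alpha/2)$-quantile of the standard normal distribution.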
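The bias correction can be illustrated with a short sketch of a stratum-level noisy Wald interval. It is not the paper's Algorithm 1 verbatim: the noise variance 1/(2ρ n_h²) is an assumption (Gaussian mechanism with sensitivity 1/n_h and budget ρ in each stratum), and the function name `stratum_noise_ci` is hypothetical.

```python
import math
import random
from statistics import NormalDist


def stratum_noise_ci(counts, sample_sizes, stratum_sizes, rho, alpha=0.10, rng=random):
    """Sketch: noisy stratum proportions with a bias-corrected variance estimate."""
    N = sum(stratum_sizes)
    p_tilde, v_tilde = 0.0, 0.0
    for c_h, n_h, N_h in zip(counts, sample_sizes, stratum_sizes):
        w_h = N_h / N
        # ASSUMPTION: sensitivity 1/n_h and budget rho per stratum.
        noise_var = 1.0 / (2.0 * rho * n_h ** 2)
        p_h = c_h / n_h + rng.gauss(0.0, math.sqrt(noise_var))
        # The noise lowers E[p_h * (1 - p_h)] by noise_var, so add it back.
        var_h = (N_h - n_h) / N_h * (p_h * (1.0 - p_h) + noise_var) / (n_h - 1)
        p_tilde += w_h * p_h
        # Sampling variance plus the (public) variance of the added noise.
        v_tilde += w_h ** 2 * (var_h + noise_var)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(max(v_tilde, 0.0))  # guard against a negative estimate
    return p_tilde, v_tilde, (p_tilde - half_width, p_tilde + half_width)


p_tilde, v_tilde, ci = stratum_noise_ci([30, 50], [100, 100], [1000, 1000], rho=0.01)
```

As ρ grows, the noise variance vanishes and the output approaches the non-private Wald interval, which gives a simple sanity check for an implementation.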
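Recall that $\widehat{\mathrm{Var}}(\hat{p}) = \sum_{h=1}^{H} w_h^2 \widehat{\mathrm{Var}}(\hat{p}_h)$ is an unbiased estimator for $\mathrm{Var}(\hat{p})$. To get a private estimator for $\mathrm{Var}(\hat{p})$, we again apply the Gaussian mechanism, this time to $\widehat{\mathrm{Var}}(\hat{p})$, with noise calibrated to the sensitivity of $\widehat{\mathrm{Var}}(\hat{p})$. Since we apply the Gaussian mechanism twice, the total privacy budget should be divided into two parts, $\rho = \rho_1 + \rho_2$, to spend on adding noise to $\hat{p}$ and $\widehat{\mathrm{Var}}(\hat{p})$, respectively. The composition property (Proposition 2) ensures the total privacy level is ρ. The resulting algorithm, PopNz-PubSz, is presented in Algorithm 2.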

Remark 1.
When there are multiple strata with similar sampling rates, Algorithm 1 yields a wider confidence interval for $p$ than Algorithm 2 does, given the same privacy budget. However, Algorithm 1 additionally produces private confidence intervals for the $p_h$'s, which may be of interest for release. In Section 4.2.1, we compare the two algorithms quantitatively.

Algorithm 2 (PopNz-PubSz): Adding noise at the population level with public sample sizes. Step 2: generate noise $e \sim N\big(0, (\Delta_{\hat{p}})^2 / (2\rho_1)\big)$, where $\Delta_{\hat{p}} = \max_h \frac{w_h}{n_h}$, and let $\tilde{p} \leftarrow \hat{p} + e$.

Remark 2. Proportions are always between 0 and 1. One can post-process proportion estimates ($\tilde{p}_h$ in Algorithm 1 and $\tilde{p}$ in Algorithm 2) by clipping them onto the interval [0,1] without undermining privacy. When the privacy budget is very small, the noisy proportion estimates are likely to lie outside [0,1]. Thus, clipping moves the confidence interval toward the truth and a higher coverage rate will be observed. With a moderate or large budget, clipping does not make a noticeable difference. Lastly, one can always clip the output confidence intervals onto [0,1] without privacy loss.

Estimating with Private Sample Sizes
When sample sizes are public information, keeping the proportions private is essentially protecting only the numerator, i.e., the counts of individuals with $y = 1$. In some cases where subpopulation proportions also need to be estimated, Algorithms 1 and 2 with public sample sizes can lead to privacy leakage since the counts become the denominator. For example, one may ask the following queries: (1) what is the proportion of females in the US; and (2) what is the proportion of unemployed among females in the US. The number of females is the numerator in query (1) but becomes the denominator in query (2). Employing Algorithm 1 or 2 protects the number of females in query (1) but reveals it in query (2). Therefore, a method of constructing confidence intervals for proportions that keeps both the counts and sample sizes private is necessary for subpopulation analysis. We protect the sample sizes by adding noise to them. As a result, sample sizes are not fixed and therefore we need the unbounded notion of differential privacy with the adjacency relation $\sim_r$.
In the following, we extend Algorithm 1 to serve the needs of privacy protection of sample sizes by adding noise at the stratum level. (It is not obvious how to extend Algorithm 2, which adds noise at the population level. It requires more sophisticated mechanisms; we briefly discuss this in Section 6.) To begin, we first consider the setting of simple random sampling. The idea is to add independent Gaussian noise to both the numerator and denominator for each stratum. For ease of notation, we first consider a single stratum with count $c = \sum_{i=1}^{n} x_i$. We know

$$c \sim \mathrm{Hypergeometric}(N, K, n),$$

where $K$ is the total number of individuals with the attribute of interest. The count $c$ has mean $n \frac{K}{N} = np$ and variance $n p (1 - p) \frac{N - n}{N - 1}$. By applying the Gaussian mechanism to $c$ and $n$ with privacy budgets $\rho_1$ and $\rho_2$, respectively, we have the private count $\tilde{c}$ and sample size $\tilde{n}$:

$$\tilde{c} = c + e^{(1)}, \quad e^{(1)} \sim N\Big(0, \frac{1}{2\rho_1}\Big); \qquad \tilde{n} = n + e^{(2)}, \quad e^{(2)} \sim N\Big(0, \frac{1}{2\rho_2}\Big).$$
The unconditional mean and variance for $\tilde{c}$ are

$$E[\tilde{c}] = np, \qquad \mathrm{Var}(\tilde{c}) = n p (1 - p) \frac{N - n}{N - 1} + \frac{1}{2\rho_1}.$$

By the composition property of zCDP, we get a private estimator for the proportion $p$, denoted by $\tilde{p} = \tilde{c} / \tilde{n}$, with privacy level $\rho = \rho_1 + \rho_2$. Since $\tilde{c}$ and $\tilde{n}$ are independent variables, in principle,

$$E[\tilde{p}] = E[\tilde{c}] \, E\Big[\frac{1}{\tilde{n}}\Big]$$

and

$$\mathrm{Var}(\tilde{p}) = E[\tilde{c}^2] \, E\Big[\frac{1}{\tilde{n}^2}\Big] - \Big( E[\tilde{c}] \, E\Big[\frac{1}{\tilde{n}}\Big] \Big)^2.$$

However, the moments of $1/\tilde{n}$ do not exist, and thus neither do those of $\tilde{p}$. Generally speaking, the ratio of two independent normal random variables has a heavy-tailed distribution with no moments [34,18]. The shape of the distribution could be unimodal, bimodal, symmetric, or asymmetric. It is primarily determined by the coefficient of variation of the denominator variable, CV. When CV is sufficiently small, a normal distribution approximation can be effective. It has been shown theoretically that a normal distribution can be arbitrarily close to the ratio variable in an interval centered at the ratio of means of two normal random variables [18]. Experiments have provided guidelines for when the normal approximation can be used. For example, a simple rule is that the approximation is reasonable whenever CV is less than 0.1 [30]. Other practical rules are mentioned in [26,34].
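A rough sketch of the noisy ratio and the CV rule of thumb follows. It assumes that both the count and the sample size have sensitivity 1 under the remove/add-one relation, so the noise variances are 1/(2ρ1) and 1/(2ρ2); the function name `private_ratio` and the denominator floor of 1 are hypothetical choices for illustration only.

```python
import math
import random


def private_ratio(count, n, rho1, rho2, floor=1.0, rng=random):
    """Noisy proportion c~/n~ plus the coefficient of variation of the denominator."""
    # ASSUMPTION: sensitivity 1 for both the count and the sample size.
    sd_count = 1.0 / math.sqrt(2.0 * rho1)
    sd_size = 1.0 / math.sqrt(2.0 * rho2)
    c_tilde = count + rng.gauss(0.0, sd_count)
    # Hypothetical floor that keeps the denominator away from zero.
    n_tilde = max(n + rng.gauss(0.0, sd_size), floor)
    cv = sd_size / n  # CV of the noisy denominator: sd / mean
    return c_tilde / n_tilde, cv, cv < 0.1


p_tilde, cv, normal_ok = private_ratio(count=50, n=100, rho1=0.5, rho2=0.5)
```

Here the denominator noise has standard deviation 1 against a mean of 100, so CV = 0.01 and the rule of thumb from [30] suggests that the normal approximation is reasonable.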
We take advantage of the normal approximation to construct a ρ-zCDP confidence interval for the proportion. We present the following estimation strategy in Algorithm 3, StrNz-PrivSz.

Algorithm 3 (StrNz-PrivSz): Adding noise at the stratum level with private sample sizes. For each stratum $h$, generate independent Gaussian noise $e_h^{(1)}$ and $e_h^{(2)}$, perturb the count and the sample size to obtain $\tilde{c}_h$ and $\tilde{n}_h$, and let $\tilde{p}_h = \tilde{c}_h / \tilde{n}_h$ with $\tilde{n}_h$ clipped as in (9).

In the algorithm, we clip $\tilde{n}_h$ in (9) to ensure the denominator is not too small. Otherwise, the ratio can be arbitrarily large. Such a post-processing step preserves the same privacy guarantee. For the theoretical analysis, we do not clip $\tilde{n}_h$; instead, we consider the ratio variable $\tilde{c}_h / \tilde{n}_h$ given the event $S_h = \{1 \le \tilde{n}_h \le 2 n_h - 1\}$ (a symmetric region around the mean of $\tilde{n}_h$), which is more convenient for the analysis. The asymptotic behaviors of $\tilde{p}_h$ in the algorithm and of $\tilde{c}_h / \tilde{n}_h \mid S_h$ are essentially the same since $\Pr(\tilde{n}_h \ge 2) \to 1$ and $\Pr(S_h) \to 1$ as $n \to \infty$. We will see that the private estimator of the variance of $\tilde{p}_h$ derived from the analysis of $\tilde{c}_h / \tilde{n}_h \mid S_h$ works well and that the algorithm does achieve the desired coverage level.

We consider the ratio of two independent normal variables. By independence, what remains unclear is the behavior of the reciprocal of a normal distribution. (We should mention that the Inverse Gaussian distribution is different from the reciprocal distribution we discuss here.) In Theorem 3.1, we provide a general form of the Taylor series of the conditional mean and variance of a reciprocal normal distribution. To the best of our knowledge, this is the first complete result on the Taylor series, with the remainder term specified. We prove the theorem in the Proofs section. We use $k = 2$ to derive an estimator for the variance of $\tilde{p}$ in Algorithm 3, which leads to (11).

Theorem 3.1 (Conditional mean and variance of a reciprocal normal distribution). For a random variable $X \sim N(\mu, \sigma^2)$ where $\mu > 1$ and $\sigma^2 > 0$, given the event $S = \{1 \le X \le 2\mu - 1\}$, for any integer $k > 0$, the first two moments of $\frac{1}{X} \mid S$ have Taylor expansions of order $k$ with explicitly specified remainder terms.

Theoretical Results
In this section, we present the theoretical results on both privacy and asymptotic coverage guarantees. In addition, comparisons of the three algorithms in terms of variance and width ratios are discussed.

Privacy and Coverage Guarantees
Our theoretical results are two-fold. First, the proposed algorithms satisfy the desired privacy level under the corresponding adjacency relation, which is presented in Theorem 4.1. Proofs are presented in the Proofs section. Second, for the confidence intervals to be useful, we provide theorems that guarantee the asymptotic coverage for each algorithm. The central limit theorem (CLT) asserts (essentially) that the sample mean is asymptotically normally distributed regardless of the original distribution. Therefore, the sample mean can be used to construct a confidence interval for the population mean. In the finite-population sampling designs we are considering, variants of CLTs can be found in [19,25,32] and others. We restate a general form of the finite-population CLT for simple random sampling in Theorem A.2 and provide asymptotic coverage guarantees in the following theorems.

Theorem 4.2 (Algorithm 1). For a population of size $N$, let $p$ be the proportion in the population with the attribute of interest. Under stratified random sampling with sample sizes $n_h$ within the stratum of size $N_h$ [...] for $\rho > 0$ as described in Algorithm 1. If $\rho = \omega(1/n_h)$ for all $h$, then as $N_h - n_h$ and $n_h$ both tend to infinity for every stratum, [...], where $\widehat{\mathrm{Var}}(\hat{p}_h)$ is the non-private estimator for $\mathrm{Var}(\hat{p}_h)$; [...]

Theorem 4.3 (Algorithm 2). For a population of size $N$, let $p$ be the proportion in the population with the attribute of interest. Under stratified random sampling with sample sizes $n_h$ within the stratum of size $N_h$ [...], where $e_V \sim N(0, [...])$ [...] for all $h$, then as $N_h - n_h$ and $n_h$ both tend to infinity for every stratum, [...], where $\widehat{\mathrm{Var}}(\hat{p})$ is the non-private estimator for $\mathrm{Var}(\hat{p})$; [...]

Proofs of the above theorems use the finite-population CLT and are provided in the Proofs section.
For Algorithm 3, the asymptotic behavior of $\tilde{p}$ is grounded on the normal approximation to a ratio variable in addition to the CLT. We revisit the result on the normal approximation by [18] in Theorem A.4. Based on the approximation, we show the consistency of $\tilde{p}$ in the case of simple random sampling.
With the foundation of the above consistency, we establish the asymptotic properties:

Theorem 4.5 (Algorithm 3). For a population of size $N$, let $p$ be the proportion in the population with the attribute of interest. Under stratified random sampling with sample sizes $n_h$ within the stratum of size $N_h$ [...] for $\rho_1, \rho_2 > 0$ as described in Algorithm 3. If [...] for all $h$, then as $N_h - n_h$ and $n_h$ both tend to infinity for every stratum, (i) [...], where $S$ is an event with $\Pr(S) \to 1$; more specifically, for all $h$, [...], where $S_h$ is an event with $\Pr(S_h) \to 1$; (ii) for $0 < \alpha < 1$, [...]

The event $S_h$ is discussed in Section 3.2 and $S$ can be set to $\cap_h S_h$. As the variances of both $\tilde{p}$ and $\tilde{p}_h$ do not exist, we resort to the conditional variance under high-probability events. To prove Theorem 4.5, we start with a single stratum. We use a normal distribution (denoted by $p_h^*$) to approximate that of the proportion estimator $\tilde{p}_h$, with the distance between the two distributions vanishing in an interval. Then, for multiple strata, we show that the linear combination of the normal variables (denoted by $p^*$) is an accurate approximation to $\tilde{p}$. Finally, due to the consistency stated in Theorem 4.4, the noisy estimator $\tilde{V}$ is a consistent estimator for the variance of $p^*$. Then, a Wald confidence interval can be constructed using $\tilde{p}$ and $\tilde{V}$. Details are presented in the Proofs section.
Note that, in addition to consistency for our estimates of the variance, the results above provide convergence rates. Compared to the estimation of variance in non-private settings, the additional biases are minor given the conditions on $\rho$, $\rho_1$ and $\rho_2$. In fact, we impose these conditions to ensure that the introduced noise does not dominate when estimating the variance. In principle, these rates may be used in practice to adjust the length of confidence intervals accordingly, although we do not explore that direction here.

Comparisons of Variances
The theorems presented in Section 4.1 ensure that, under proper conditions, the desired coverage is achieved asymptotically. Therefore, to compare the performance of the different proposed confidence intervals, we compare their widths, which are determined by their variance estimates. In this section, we analyze our variance estimates and compare the resulting widths to that of the non-private confidence interval.

Extrinsic Variances
To investigate how much additional uncertainty is caused by adding noise, we decompose the variances of the private estimators into two parts: (1) the inherent component coming from the estimation from the sampling data, i.e., $\mathrm{Var}(\hat{p})$, and (2) the extrinsic component introduced by the added noise, written as $V_{ex}$. Table 1 provides the (approximate) variances of $\tilde{p}$ for the three algorithms, where $w_h = \frac{N_h}{N}$ are the stratum weights. The variances are derived in the proofs of Theorems 4.2, 4.3, and 4.5. The additional variance terms, $V_{ex}$, can be rewritten in terms of $u_h \overset{\mathrm{def}}{=} \frac{N_h}{n_h}$ instead of $w_h$, as shown in the second row of the table. In fact, the $u_h$ are called sampling weights in the literature on survey sampling. A sampling weight is defined as the number of individuals in the population that each respondent in the sample represents. It is the reciprocal of the sampling rate $\frac{n_h}{N_h}$ and plays an important role in statistical inference for survey data [35,12]. Understanding the relation between sampling weights and the variance of the noisy estimators helps practitioners design surveys and choose among the algorithms.
With a fixed population size $N$ and a chosen privacy level ρ, the extra variances $V_{ex}$ induced by the added noise are primarily dictated by $u_h$. In PopNz-PubSz, where we add noise at the population level, $V_{ex}$ is solely determined by the largest sampling weight among all strata. If noise is injected into each stratum, then the sampling weights in all strata collectively affect $V_{ex}$. In particular, for StrNz-PrivSz, $V_{ex}$ is additionally affected by $p_h$. For all three algorithms, smaller sampling weights lead to lower extrinsic variance.
For comparison, we look at the ratio of $V_{ex}$ with the budgeting $\rho_1 = \rho_2 = \rho/2$ for PopNz-PubSz and StrNz-PrivSz. The ratio of $V_{ex}$ for StrNz-PubSz to PopNz-PubSz is given in (20). Roughly speaking, when there are many strata, adding noise at the population level gives a smaller variance. Comparing StrNz-PrivSz with StrNz-PubSz, the ratio of $V_{ex}$ is always greater than 2 (due to the cost of protecting sample sizes in StrNz-PrivSz) and at most 4.

Comparing with Non-Private CI: One Stratum Case
To assess the width in theory, we also look at the confidence interval width ratios by comparing them to the non-private one. Since the parameters $N_h$, $n_h$, $p_h$, $\rho_h$ come into play together in the stratification setting, it is more practical to analyze the special case with one stratum.
Let the theoretical width ratio (TWR) be

$$\mathrm{TWR} = \sqrt{\frac{\mathrm{Var}(\tilde{p})}{\mathrm{Var}(\hat{p})}}.$$
In the implementation, the real width ratio (WR), defined as $\sqrt{\tilde{V} / \widehat{\mathrm{Var}}(\hat{p})}$, will be very close to TWR because $\tilde{V}$ is a consistent estimator for $\mathrm{Var}(\tilde{p})$. Table 2 displays some relevant quantities. Note that $\frac{N-1}{N-n}$ is never less than 1 but tends to 1 when the population size is far larger than the sample size.
We can obtain a lower bound for TWR by dropping the factor $\frac{N-1}{N-n}$ and minimizing over $p$. We can see that the width ratio mainly depends on $p$ and the relative magnitude between $n$ and ρ. If $p$ is extreme (tends to 0 or 1), TWR is drastically large; when $p$ is around 0.5, TWR is close to the lower bound. Also, the added noise induces a term involving ρ. For example, under the regime $\rho = 1/n$, the three algorithms result in an interval at least $\sqrt{3} \approx 1.73$, $\sqrt{5} \approx 2.24$, and $\sqrt{3 + 2\sqrt{2}} \approx 2.41$ times as wide as the non-private one, respectively. With one stratum, StrNz-PubSz clearly produces a tighter confidence interval than PopNz-PubSz does, since the ratio of $V_{ex}$ in (20) is 1/2. However, PopNz-PubSz will outperform StrNz-PubSz once there are enough strata such that (20) is greater than 1.
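The quoted constants can be checked numerically with a small sketch for the one-stratum case. The extrinsic variances used here, 1/(2ρn²), 1/(ρn²), and (1 + p²)/(ρn²), are assumptions: they correspond to Gaussian noise with sensitivity 1/n (or 1 for counts and sample sizes) and the budgeting ρ1 = ρ2 = ρ/2, and they are adopted because they reproduce the three constants above. The finite-population factor is dropped, as in the lower bound.

```python
import math


def twr(p, n, v_ex):
    """Width ratio sqrt(1 + V_ex / Var(p_hat)) with Var(p_hat) ~ p(1 - p) / n."""
    return math.sqrt(1.0 + v_ex / (p * (1.0 - p) / n))


n = 1000
rho = 1.0 / n
grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p in (0, 1)

# ASSUMED extrinsic variances for the three algorithms (one stratum).
str_pub = min(twr(p, n, 1.0 / (2.0 * rho * n ** 2)) for p in grid)
pop_pub = min(twr(p, n, 1.0 / (rho * n ** 2)) for p in grid)
str_priv = min(twr(p, n, (1.0 + p * p) / (rho * n ** 2)) for p in grid)
```

Under these assumptions the first two minima are attained at p = 0.5, while the third is attained near p = √2 − 1 ≈ 0.414, because the extrinsic variance of the ratio estimator itself grows with p.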

Numerical Results
In this section, we conduct both simulation studies and applications to assess and illustrate the numerical performance of the proposed algorithms. The budgeting $\rho_1 = \rho_2 = \rho/2$ is used for PopNz-PubSz and StrNz-PrivSz. We clip the proportions $\tilde{p}_h$ onto [0, 1] as mentioned in Remark 2.

Simulations
We set up a set of experiments to (1) check the normality of noisy estimators, and (2) evaluate the performance of the proposed confidence intervals by varying the number of strata $H$, the true population proportion $p$, and the privacy level ρ. To generate the data, we need to specify the strata sizes $N_h$ and the sampling rates $r_h$. The setup of these parameters is presented in Table 3. We generate a proportion for each stratum to create heterogeneity across strata. The true population proportion is then calculated and reported in each experiment.

Normality Check
We first check whether the distributions of $\tilde{p}$ in the three algorithms are reasonably close to the theoretical normal distributions with the corresponding means and variances. Figure 1 displays the Q-Q plots of the theoretical distribution of $\tilde{p}$ versus its sample distribution for StrNz-PubSz, PopNz-PubSz, and StrNz-PrivSz. Note that $\tilde{p}$ in StrNz-PubSz and PopNz-PubSz is unbiased for $p$, while $\tilde{p}$ in StrNz-PrivSz is not. Nevertheless, under the condition that $\rho_2 = \omega(1/n_h)$ in Theorem 4.5, the bias term is negligible and thus we do not make a bias correction in Algorithm 3. We observe close alignment between the theoretical and experimental distributions, indicating that the private estimators in all three algorithms are indeed very close to being normally distributed.

Varying Key Parameters
Assured by the results of the normality check, we experiment with a wide range of the privacy budget, different numbers of strata, and true population proportions.
We examine the impact of ρ on the performance of the three private estimators. The simulation is run with 10,000 repetitions and therefore an empirical coverage falling into 90% ± 0.006 (a departure of two standard deviations) is considered appropriate. In Figure 2a, the empirical coverage is reasonable except that StrNz-PrivSz gives unnecessarily high coverage when ρ is smaller than around 0.005. This is because the budget is so small for the method that, with clipping, it covers the truth more often than needed. In this case, the confidence intervals are too wide to be useful, as shown in Figure 2b. For all three methods, the width grows as ρ becomes smaller. However, the rates of width growth differ: in the multiple strata case we simulate, the width of PopNz-PubSz grows the slowest, StrNz-PrivSz grows the fastest, and StrNz-PubSz is in the middle. Thus, the privacy level should be chosen by taking into account the method, width, and coverage. For instance, if we want a 90% confidence level and a width under 0.1, one can choose the value for ρ as small as (1) 0.001 for PopNz-PubSz, (2) 0.003 for StrNz-PubSz, and (3) 0.01 for StrNz-PrivSz. In addition, Table 4 shows the numerical results of three experiments with different combinations of the numbers of strata and the true population proportions. The simulation in the middle panel shares the same setting as the experiment shown in Figure 2 but has a fixed privacy level: $1/\max(n_h)$. For multiple strata, this regime is analogous to $\rho = 1/n$ for simple random sampling. In the literature on differential privacy, the regime $\rho = 1/n$ for a simple random sample is often considered to understand how small ρ can be as the sample size increases. Recall that a smaller ρ means a higher privacy level.
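The ± 0.006 tolerance follows from the binomial standard error of an empirical coverage rate over 10,000 independent repetitions, as the following small sketch shows (the function name `coverage_tolerance` is illustrative).

```python
import math


def coverage_tolerance(nominal, repetitions, num_sd=2.0):
    """Monte Carlo tolerance: num_sd standard errors of an empirical coverage rate."""
    standard_error = math.sqrt(nominal * (1.0 - nominal) / repetitions)
    return num_sd * standard_error


tolerance = coverage_tolerance(0.90, 10_000)  # 2 * sqrt(0.9 * 0.1 / 10000) = 0.006
```

Empirical coverage outside [0.894, 0.906] therefore points to a real departure from the nominal level rather than simulation noise.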
As argued above, clipping $\tilde{p}_h$ (or $\tilde{p}$) onto [0, 1] will yield better results in some cases. The conclusions coincide with the analyses in Section 4. The empirical coverage of the three private methods in all simulations achieves the nominal level of 90%, as guaranteed by Theorems 4.2, 4.3, and 4.5. The case where StrNz-PrivSz gives a 91.9% confidence level in the bottom panel is due to clipping. (When the stratum proportions are close to the extremes, clipping is more noticeable.) The average width and width ratio (WR) vary. With one single stratum, WRs are near the lower bounds of the theoretical width ratios (TWR) given in Section 4.2.2, which suggests that the lower bounds are almost tight. StrNz-PubSz gives a narrower CI than PopNz-PubSz with one stratum. But with more strata, PopNz-PubSz outperforms StrNz-PubSz in terms of WR. Having more strata means splitting the total privacy budget into smaller portions, which leads to adding more noise on the whole. The CI needs to be wider to achieve the same confidence level. StrNz-PrivSz, however, always yields the widest CI due to the additional price it pays to protect sample sizes simultaneously.

The strata sizes and sampling rates are drawn as described in Table 3. For the multiple strata case, the resulting sample sizes $n_h$ range from 72 to 152, and $\rho$ is set to be $1/152 \approx 6.58 \times 10^{-3}$. For the one-stratum case, we set the sample size to 152 so that we have the same level of privacy.

On the other hand, with the same number of strata (20 here), we see that more extreme $p_h$ leads to a larger WR than $p_h$ in the middle range. This is because the factor $p_h(1 - p_h)$ comes into play, as $p(1 - p)$ does in the TWR in Table 2 for the one-stratum case.
We also provide the sample standard deviation of the widths (width SD). In general, the non-private method results in a smaller standard deviation than the private ones. In some cases, clipping helps reduce the width SD for the private algorithms. With the same privacy level, there is more fluctuation in width for PopNz-PubSz compared to StrNz-PubSz. This is because, in PopNz-PubSz, we use one-half of the privacy budget and directly add noise to the variance estimate. As expected, StrNz-PrivSz has the largest width SD since the magnitude of the width is the largest and the ratio variable is heavy-tailed by design. Nevertheless, compared to the width, the width SD for all methods is so small that it does not compromise the effectiveness of the confidence interval.

Applications
In this section, we apply the proposed methods to the 1940 Census full enumeration from IPUMS USA [36] and evaluate the performance of the three differentially private confidence intervals. To conduct stratified random sampling on the data set, the state-level geographical variable "STATEICP" (49 categories, constituting the then-48 states and Washington, D.C.) is used for stratification. Under stratified random sampling with $H = 49$ strata, we estimate the national unemployment rate for the first application. In the second application, we are interested in studying the discrepancy in income levels between black and white men.

Confidence Intervals for the Unemployment Rate
As an important indicator of the status of the national economy, the unemployment rate is the percentage of unemployed workers in the total labor force, which consists of both the employed and the unemployed. Thus, we consider all the individuals who are either employed or unemployed as the whole population. In the 1940 Census data set, the binary characteristic "EMPSTAT" represents employment status. The full enumeration is considered the truth, and the true population proportion is $p = 9.346\%$. To carry out stratified random sampling, sample sizes or sampling rates are selected for all 49 strata. For modern relevance, we simulate in a manner intended to mimic the canonical design implemented in the current American Community Survey (ACS), by choosing a typical range of sampling rates used in the ACS, which is [0.5%, 15%]. See Table 5 for details.

Table 5. Sampling rates by stratum size.

To apply and assess the proposed algorithms, we experiment with a wide range of small privacy budgets: $\rho \in [10^{-6}, 10^{-3}]$. Each method is repeated 10,000 times and the empirical coverage, the average CI width, and the average CI width ratio (WR) are computed. As shown in Figure 3a, the empirical coverage stays close to the nominal level of 90% over the whole range of privacy levels. In Figure 3b, the CI width and the CI width ratio, with the non-private CI as the benchmark, share the same shape. Even when the CI given by StrNz-PrivSz is 8 times the non-private CI width, the CI width is only 0.01 due to the large sample size. Both CI width and width ratio should be taken into account when choosing a privacy level.
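The three summary metrics reported for each method can be computed from the repeated intervals as in the following sketch. It assumes that WR is the average of per-repetition ratios of private to non-private width (the text does not say whether it is instead the ratio of average widths), and the function name and toy inputs are purely illustrative.

```python
def summarize_intervals(intervals, truth, baseline_widths):
    """Return (empirical coverage, average width, average width ratio).

    intervals:       list of (lower, upper) private CIs, one per repetition
    truth:           the true population proportion
    baseline_widths: widths of the non-private CIs from the same repetitions
    """
    covered = [lower <= truth <= upper for lower, upper in intervals]
    widths = [upper - lower for lower, upper in intervals]
    # Assumption: WR is averaged per repetition against the non-private width.
    ratios = [w / b for w, b in zip(widths, baseline_widths)]
    n = len(intervals)
    return sum(covered) / n, sum(widths) / n, sum(ratios) / n


# Toy inputs (not results from the paper).
coverage, avg_width, avg_wr = summarize_intervals(
    [(0.08, 0.10), (0.09, 0.11), (0.095, 0.115)],
    truth=0.0935,
    baseline_widths=[0.01, 0.01, 0.01],
)
```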

Confidence Intervals for the Difference in Income Level
In the second application, we want to investigate whether there was a discrepancy between the income levels of white males (population 1) and those of black males (population 2). Note that only those who had valid income numbers in the 1940 Census are considered. Since the poverty thresholds were not developed until the 1960s and thus are not available for the 1940 data, the national income average is used as a threshold instead. We are interested in examining the difference in the subpopulation proportions of those whose income levels passed this threshold.
The geographic feature "STATEICP" is used for stratification, yielding 49 strata, with stratum size ranges of (41838, 4621442) for the population of white males and (50, 309214) for the population of black males. Sampling rates are adaptively chosen based on stratum sizes. For the population of white males, the range of sampling rates is again [0.5%, 15%], whereas the range of sampling rates is [0.5%, 30%] for the population of black males given its small stratum sizes. Additionally, to allow solid approximations based on the asymptotic results, we set the sample size to 50 whenever the sampling rate would give a sample size smaller than 50. See Table 6 for details.

Table 6
Sampling rates for two populations. Stratum sizes $N_h \in (4.1 \times 10^4, 4.7 \times 10^6)$ for the population of white males and stratum sizes $N_h \in (50, 3.1 \times 10^5)$ for the population of black males. *The sample size will be adjusted to be 50 if the above sampling rate results in a size smaller than 50.

Let $p_1$ and $p_2$ denote the proportions of eligible individuals who earned more than the national average income level of \$442.12. The true values of the proportions are $p_1 = 49.0223\%$ and $p_2 = 29.5152\%$. Let $p_{\mathrm{diff}} = p_1 - p_2$; then the true difference in these two proportions is $p_{\mathrm{diff}} = 19.5071\%$. By the additivity of two independent normal distributions, we use the following differentially private CI:
$$\tilde{p}_{\mathrm{diff}} \pm z_{1-\alpha/2} \sqrt{\tilde{V}(\tilde{p}_{\mathrm{diff}})},$$
where $\tilde{V}(\cdot)$ denotes a private estimator of variance, $\tilde{p}_{\mathrm{diff}}$ is defined as $\tilde{p}_1 - \tilde{p}_2$, and $\tilde{V}(\tilde{p}_{\mathrm{diff}}) = \tilde{V}(\tilde{p}_1) + \tilde{V}(\tilde{p}_2)$.
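As a rough illustration of how the interval for the difference is assembled from two independently privatized estimates, consider the sketch below. The function name, the inputs, and the guard against a negative noisy variance are assumptions made for illustration; the private point and variance estimates themselves would come from whichever of the three algorithms is applied to each population.

```python
import math
from statistics import NormalDist


def diff_proportion_ci(p1_priv, v1_priv, p2_priv, v2_priv, alpha=0.10):
    """Wald-type CI for p1 - p2 from two independent private estimates.

    p*_priv: private point estimates of the two proportions
    v*_priv: private variance estimates of those point estimates
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    estimate = p1_priv - p2_priv
    # Variances add because the two populations are handled independently.
    # Assumption: floor at zero in case injected noise makes the sum negative.
    variance = max(v1_priv + v2_priv, 0.0)
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Hypothetical inputs, not values from the paper.
lower, upper = diff_proportion_ci(0.49, 1e-4, 0.295, 1e-4, alpha=0.10)
```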

In Figure 4, patterns similar to those in the first application are observed. All CIs have empirical coverage around or above the nominal confidence level, as in the simulation study in Section 5.1.2. The phenomenon of higher coverage is due to small $\rho$ and effective clipping. When the range of stratum sizes is large (it is (50, 309214) in this application), that is, when the stratum sizes are very different, a larger privacy budget $\rho$ should be chosen, because a small $\rho$ harms the estimates for small strata. We advise choosing the smallest $\rho$ that meets the tolerance of uncertainty in terms of width and/or width ratio. For example, if the accuracy requirement is that the width should be under 0.05 or the WR under 5, then the best choices of $\rho$ among the experiments in Figure 4b are (1) 0.0001 for PopNz-PubSz, (2) 0.0018 for StrNz-PubSz, and (3) 0.0056 for StrNz-PrivSz.

Discussion
We have designed three algorithms to construct confidence intervals for the population proportion under stratified random sampling with zero-concentrated differential privacy guarantees. We consider both the case where the sample sizes are public and the case where they are private information. Theoretical results including privacy guarantees and asymptotic properties are established. With proper conditions on the relation between the privacy budget and sample sizes, as stated in the theorems, the resulting confidence intervals achieve the desired coverage asymptotically, and the width tends to that of a non-private confidence interval as the sample sizes go to infinity.
In the simulation studies and two applications, we have experimented with a wide range of privacy budgets under a variety of parameter setups. The three algorithms always perform well in terms of empirical coverage. The width and width ratio are in a reasonable range even under the strict regime where $\rho = 1/\max_h n_h$. Typically in practice, a constant between 0.001 and 10 is chosen to be the privacy budget. According to our experiments, with the choice of the smallest budget in this range, 0.001, the three algorithms still give fairly good results even when the smallest stratum has a size of only 50 (as demonstrated in the second application).
The comparative analysis of the three algorithms in Section 4.2 gives actionable guidance to practitioners. When releasing the population proportion is the only goal and there are enough strata (such that Eq. (20) regarding sample weights is greater than 1), PopNz-PubSz is the better option. However, if stratum proportions should also be released or there are just a few strata, StrNz-PubSz is preferable. On the other hand, when the population proportion and sample sizes must be protected simultaneously, StrNz-PrivSz is the only algorithm presented in this paper. StrNz-PrivSz, compared to the case with public sample sizes, needs a larger budget to meet the same width requirement on account of the additional cost of protecting sample sizes.
There are a few open questions worth considering for future research. In this paper, we discuss the classic case where the number of strata is fixed and the sample sizes tend to infinity. In principle, asymptotic normality is also valid in other settings with finite sample sizes. For example, it has been shown in the non-private setting that as the total sample size $N \to \infty$ with many small samples or a few large samples, or some combination thereof, central limit theorems hold under certain (complex) conditions [2]. Under the constraints of differential privacy, we have shown that the trade-off between the privacy and accuracy (a.k.a. utility) of DP-CIs depends on the smallest sample; recall the condition $\rho = \omega(1/n_h)$ for all $h$ in Section 4.1. The overall privacy loss is determined by the largest privacy loss among all strata. However, when the strata sizes $N_h$ remain finite, the weights ($w_h = N_h/N$) of these strata tend to 0 as $N \to \infty$. Therefore, the noise injected into these small strata should not harm the overall accuracy of the intervals if $N$ is sufficiently large.
More interestingly, we do not provide 'PopNz-PrivSz', an analogous algorithm to PopNz-PubSz for the private sample sizes case. To protect both the population proportion and the sample sizes, the direct addition of noise to the non-private aggregated estimator is not plausible. One should consider more sophisticated mechanisms than directly adding noise to the statistics. If 'PopNz-PrivSz' were proposed, we would expect it to yield a narrower confidence interval, since we would only need to publish the private population proportion, at the cost of not being able to provide private confidence intervals for stratum proportions at the same time.
Another direction for future research would be optimal budget allocation. We do not discuss how to best divide the total budget for PopNz-PubSz or StrNz-PrivSz. Budgeting for the composed application of the algorithms may also be of interest, as in Section 5.2.2, where we apply the algorithms twice for two independent populations. Lastly, one broad direction is to develop differentially private versions of alternatives to the basic Wald interval, such as the Wilson interval, the Jeffreys interval, etc. (see [22] for a comparative summary of seven such types of confidence intervals for proportions). Many of these alternatives are specifically designed for the case of small sample sizes, which we do not consider here and for which we expect fundamentally different approaches to differential privacy are likely to be necessary.

A.1. Proof of Theorem 3.1
Lemma A.1. Let $X \sim N(\mu, \sigma^2)$ and $S = \{\mu - a \le X \le \mu + a\}$. For any $a > 0$ and an integer $k \ge 1$, the conditional even moments where the big-O hides a constant depending on $\sigma$ and $k$.
Proof. Without loss of generality, we assume $\mu = 0$. We prove the lemma by induction. Set $k = 1$, integrate by parts, Integrate by substitution, the integral in the second term becomes then integrate by parts for the $k + 1$ case, Plug in (24), we obtain So far we have proved (24). Note that and that the image of $\mathrm{erf}(z)$ is between $(-1, 1)$. Therefore,

Proof of (12) in Theorem 3.1. Consider the Taylor series of $1/x$ at $x = \mu$:
$$\frac{1}{x} = \sum_{k=0}^{\infty} \frac{(-1)^k (x - \mu)^k}{\mu^{k+1}}.$$
Let $y_m$ be the partial sum of the above series, i.e., $y_m(x) = \sum_{k=0}^{m} \frac{(-1)^k (x - \mu)^k}{\mu^{k+1}}$. Here $d\nu = f(x)\,dx$ is induced by $N(\mu, \sigma^2)$ conditional on event $S$. Note also that $|y_m(x)| \le g(x)$ for any natural $m$ and $x \in [1, 2\mu - 1]$. By the dominated convergence theorem, the operations of limit and integral are exchangeable for $y_m(x)$. Then, (26) The second equality is because the odd moments are zero due to symmetry.

Note that given event S, |
It follows that Applying Lemma A.1, by the choice of $a = \mu - 1$, (26) becomes

Proof of (13) in Theorem 3.1. We conduct a similar procedure for the second moment of $X \mid S$. Based on the Taylor expansion Due to (27), it follows that where the term is a sum of an arithmetic-geometric sequence.

A.2. Proof of Theorem 4.1
Proof for Algorithm 1. Under neighboring relation $\sim_{ss}$, only one record changes within one stratum and sample sizes remain the same. Applying the Gaussian mechanism to each stratum at the level of $\rho$ gives a $\rho$-zCDP guarantee. By post-processing, the confidence interval is also $\rho$-zCDP.
Proof for Algorithm 2. The sensitivities of $\hat{p}$ and $\widehat{\mathrm{Var}}(\hat{p})$ are $\Delta_p$ and $\Delta_V$, respectively. Applying the Gaussian mechanism, it follows that $\tilde{p}$ is $\rho_1$-zCDP and $\tilde{V}$ is $\rho_2$-zCDP. By basic composition and post-processing, the confidence interval $\tilde{p} \pm z_{1-\alpha/2}\sqrt{\tilde{V}}$ is $(\rho_1 + \rho_2)$-zCDP.
Proof for Algorithm 3. By the Gaussian mechanism and the basic composition property of zCDP, we know that $\tilde{p}_h$ is $\rho$-zCDP. Under neighboring relation $\sim_r$, only one record changes within one stratum. Then, by post-processing, the confidence interval is $\rho$-zCDP.
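The two ingredients used in these proofs, the Gaussian mechanism and basic composition, can be sketched as follows. The sketch relies on the standard fact that adding $N(0, \Delta^2/(2\rho))$ noise to a statistic with sensitivity $\Delta$ satisfies $\rho$-zCDP; the function names and the example numbers are illustrative assumptions, and this is not a reproduction of Algorithms 1 to 3.

```python
import math
import random


def gaussian_sigma(sensitivity, rho):
    """Noise standard deviation giving rho-zCDP for the given sensitivity."""
    return sensitivity / math.sqrt(2.0 * rho)


def gaussian_mechanism(value, sensitivity, rho, rng):
    """Release `value` with Gaussian noise calibrated to rho-zCDP."""
    return value + rng.gauss(0.0, gaussian_sigma(sensitivity, rho))


def compose(*rhos):
    """Basic composition: zCDP budgets of sequential releases add up."""
    return sum(rhos)


# Hypothetical example: a stratum proportion from n_h = 100 records.
# Changing one record moves the proportion by at most 1/n_h.
rng = random.Random(0)
n_h, p_hat_h, rho = 100, 0.37, 0.01
p_tilde_h = gaussian_mechanism(p_hat_h, 1.0 / n_h, rho, rng)
total_budget = compose(0.005, 0.005)  # e.g. one release for p, one for V
```

Anything computed only from the noisy outputs, such as the endpoints of a confidence interval, inherits the same guarantee by post-processing.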

A.3. Proof of Theorem 4.2
Before proving the theorem, we revisit the finite-population CLT first:

Theorem A.2 (Theorem 1, [33]). Consider a finite population $\Pi = \{X_1, ..., X_N\}$ of size $N$. Let $\mu$ be the population mean and $\bar{X}_n$ be the mean of a simple random sample of size $n$ from $\Pi$, and let $\mathrm{Var}(\bar{X}_n)$ be the variance of $\bar{X}_n$. The finite population variance of $\Pi$ is denoted by we have

The variance of $\bar{X}_n$ is determined by the population variance $v$, which is unknown. Nevertheless, the sample variance $\widehat{\mathrm{Var}}(\bar{X}_n)$ can be used to estimate $v$. To make sure the CLT still holds when substituting $\mathrm{Var}(\bar{X}_n)$ by $\widehat{\mathrm{Var}}(\bar{X}_n)$, the consistency of $\widehat{\mathrm{Var}}(\bar{X}_n)$ is crucial, as stated in the following lemma.

Since ph = ph + e h where e h ∼ N (0,  In Algorithm 2, we set where e V ∼ N (0, Then, the confidence interval given by $\tilde{p} \pm z_{1-\alpha/2}\sqrt{\tilde{V}}$ has the asymptotic coverage level $1 - \alpha$.

A.5. Proof of Theorem 4.4
Proof. For $\tilde{n} \sim N(n, \frac{1}{2\rho_2})$, by Proposition 3.1, we derive the $k$th-order Taylor series of the conditional expectation of $\tilde{p}$ given $S = \{1 \le \tilde{n} \le 2n - 1\}$: For example, when $k = 2$, To obtain a Taylor expansion for the conditional variance, we plug

Theorem 4.4. Under simple random sampling, let $c$ be the count of individuals having the attribute of interest and $n$ be the sample size. The true population proportion is denoted by $p$. Let $\tilde{p} = \tilde{c}/\tilde{n}$ where $\tilde{c} \sim N(c, \frac{1}{2\rho_1})$ and ñ

Fig 2: Setup: 20 strata and $p = 0.505$ ($p_h \sim \mathrm{Uniform}(0.4, 0.6)$) with 10,000 repetitions. Figure (a) is the empirical coverage with the black solid line indicating the nominal confidence level of 90%. Error bars of one standard deviation are shown for coverage. The average width and width ratio are displayed in (b) with the non-private CI as the benchmark. Error bars of width are not visible in the plots and therefore not shown.

Fig 3: The empirical coverage with error bars, average width and width ratio of DP-CIs of the unemployment rate.

Fig 4: The empirical coverage with error bars, average width and width ratio of DP-CIs of the difference of the above-national-income-level proportions between black and white males with valid income values.

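Lemma A.3. Let $\widehat{\mathrm{Var}}(\bar{X}_n)$ be the sample variance. $\widehat{\mathrm{Var}}(\bar{X}_n)$ is an unbiased estimator for $\mathrm{Var}(\bar{X}_n)$. Moreover, under the condition in Theorem A.2, as $N \to \infty$, $\widehat{\mathrm{Var}}(\bar{X}_n)/\mathrm{Var}(\bar{X}_n) \xrightarrow{p} 1$.

Now we prove Theorem 4.2.

Proof of Theorem 4.2. It suffices to show $\frac{\tilde{p}_h - p_h}{\sqrt{\tilde{V}_h}} \xrightarrow{d} N(0, 1)$ for all $h$. By the finite-population CLT in Theorem A.2, we know $\frac{\hat{p}_h - p_h}{\sqrt{\mathrm{Var}(\hat{p}_h)}} \xrightarrow{d} N(0, 1)$.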

Table 3
Parameter setup.The resulting sample sizes are between 60 and 160.
h ) or specified in the axis of the plot ), we have e h = O P ( 1 √ ρn h ).Then, the second term of (34) is O P ( 1 √ ρn 2 h ), and thus, V h − Var(p h ) = Var(p h ) − Var(p h ) + O P Note that Var(p h ) is of order 1 n h , and that by Lemma A.3, Var(p h ) p → Var(p h ).
So far, we have shown that as $n$ goes to infinity, under the conditions $\frac{1}{\rho_1 n} = o(1)$ and $\frac{1}{\rho_2 n} = o(1)$, for $z \in I_h$,
$$|F_Z(z) - F_{Z^*}(z)| \to 0. \quad (48)$$

Lemma A.6. Let $Z_1, ..., Z_H$ and $Z_1^*, ..., Z_H^*$ be independent continuous random variables which depend on $n$. Let $F$ denote the distribution function. As $n \to \infty$, if $|F_{Z_h}(z) - F_{Z_h^*}(z)| \to 0$ holds for any $h = 1, ..., H$ and $z$ in an interval $(a_h, b_h)$, and $\Pr(Z_h \in (a_h, b_h)) \to 1$ and $\Pr(Z_h^* \in (a_h, b_h)) \to 1$, then
$$\left| F_{\sum_{h=1}^{H} c_h Z_h}(z) - F_{\sum_{h=1}^{H} c_h Z_h^*}(z) \right| \to 0$$
for any $z \in \left( \sum_{h=1}^{H} c_h a_h, \sum_{h=1}^{H} c_h b_h \right)$, where the $c_h$'s are constants.