Bayesian Bootstraps for Massive Data

In this article, we present data-subsetting algorithms that allow for the approximate and scalable implementation of the Bayesian bootstrap. They are analogous to two existing algorithms in the frequentist literature: the bag of little bootstraps (Kleiner et al., 2014) and the subsampled double bootstrap (SDB; Sengupta et al., 2016). Our algorithms have appealing theoretical and computational properties that are comparable to those of their frequentist counterparts. Additionally, we provide a strategy for performing lossless inference for a class of functionals of the Bayesian bootstrap, and briefly introduce extensions to the Dirichlet Process.


Introduction
Massive datasets are increasingly common in statistical applications, mainly because current computational technologies can efficiently record and store large amounts of data. As a consequence, there is an increasing demand for statistical methods that can analyze large volumes of data. Quite often, a single computer cannot store a big dataset in its internal memory, and statistical analyses can only be performed on smaller subsets of the original data. In such cases, we must use algorithms that combine statistical outputs from subsets to approximate the results we would obtain if we analyzed the full dataset at once.
In the frequentist literature, two scalable adaptations of the bootstrap have been proposed: the bag of little bootstraps (BLB; Kleiner et al., 2014) and the subsampled double bootstrap (SDB; Sengupta et al., 2016). These adaptations are based on data-subsetting. The BLB proceeds by splitting the data into subsets, bootstrapping within each subset, and averaging the summaries of the "little" bootstraps to get a global assessment of an estimator. To resemble the "big" bootstrap (i.e., the bootstrap based on the full dataset), the little bootstraps need to be rescaled. The rescaling is achieved by modifying the parameters of the corresponding multinomial distribution to avoid extra computational cost. The SDB is an alternative to the BLB. The SDB proceeds by drawing random subsets from the entire dataset, running a rescaled bootstrap of size one within each subset, computing the root function for each rescaled bootstrap, and finally computing a summary of the bootstrapped values to obtain a global assessment of the estimator. As shown in Kleiner et al. (2014) and Sengupta et al. (2016), the BLB and SDB are computationally efficient and provide adequate assessments of uncertainty.
In this paper, we develop data-subsetting methods for the Bayesian bootstrap (BB) that are analogous to the BLB and the SDB. We will refer to these adaptations as the bag of little Bayesian bootstraps (BLBB) and the subsampled double Bayesian bootstrap (SDBB), respectively. The BB is a nonparametric model for probability measures proposed by Rubin (1981) as a Bayesian analogue to the bootstrap (Efron, 1979). As discussed in Lyddon et al. (2019, page 1), the BB represents a useful modeling technique to perform general Bayes updating, which "is a way of updating prior belief distributions that does not need the construction of a global probability model." For massive datasets, the BB is an appealing procedure for two main reasons: (1) it can accommodate complex features that are inherent to big data and (2) its implementation does not rely on recursive sampling algorithms, which are usually computationally expensive (e.g., Markov Chain Monte Carlo methods). In addition to the BLBB and SDBB, we present a strategy for performing lossless inference for a class of functionals of the BB. As a natural extension, we generalize our data-subsetting strategies to Dirichlet Processes (DP).
Our methods complement the growing literature on scalable Bayesian methods. Taddy et al. (2015, 2016) advocate for the use of the BB in massive datasets and approximate the distribution of certain functionals through Taylor series expansions and empirical Bayes procedures. Unlike these approaches, our proposal is based on data-subsetting. Most data-subsetting procedures consist of three steps: (1) the dataset is divided into subsets; (2) for each subset, a rescaled posterior distribution (subposterior) resembling the full posterior (i.e., the posterior distribution obtained by conditioning on the whole dataset) is obtained; and (3) the subposteriors are combined to either approximate the full posterior or a posterior summary of interest (e.g., a posterior variance). Methods based on subposteriors can be classified into two categories: methods that rescale the prior (see e.g., Wang and Dunson, 2013; Neiswanger et al., 2014; Wang et al., 2015; Scott et al., 2016) and methods that rescale the likelihood (see e.g., Minsker et al., 2017; Srivastava et al., 2015, 2018; Li et al., 2017). In particular, there are some similarities between the method proposed in Li et al. (2017) and the BLBB: both methods focus on summaries of the posterior (not on the full posterior itself) and use the same rule to combine them. The methodology proposed in Li et al. (2017) applies to a large class of models; however, its theoretical guarantees are proved under assumptions (e.g., their Assumption 3) that are not verifiable for the BB.
The article is organized as follows. In Section 2, we introduce and define the BB. Then, we describe the BLBB, the SDBB, and a strategy to perform lossless inference. Section 2 ends with a discussion of extensions for the DP. In Section 3, we illustrate the performance of our methods in simulation studies. In Section 4.1, we apply the BLBB and SDBB to model U.S. federal employees' wages from a subset of the Office of Personnel Management's datafile. In Section 4.2, we use our methods to model whether or not households in the American Community Survey are paying for fire, hazard, or flood insurance. The paper concludes with a discussion and directions for future work.
Data-subsetting strategies for the Bayesian Bootstrap

Throughout, we assume that the observations X_n = {X_1, X_2, ..., X_n}, X_i ∈ R^p, are independent and identically distributed given a probability measure P, which represents the data-generating mechanism. The BB is a probability model for P given the data X_n which admits the stochastic representation

P_n^{BB} = \sum_{i=1}^{n} W_i \delta_{X_i},    (1)

where (W_1, ..., W_n) ~ Dirichlet(1, ..., 1) and \delta_x denotes a point mass at x. The BB defines a distribution on probability measures that bypasses the traditional prior-to-posterior update, and it is nonparametric in the sense that no assumptions are made on the data-generating mechanism P beyond conditional independence of the data given P (Taddy et al., 2016). Additionally, the stochastic representation above allows us to obtain draws from P_n^{BB} directly, without resorting to Markov chain Monte Carlo methods. However, when n is massive and the full dataset cannot be loaded into memory, the BB cannot be easily implemented, and approximations such as the ones presented in this article are needed.
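The stochastic representation above is straightforward to simulate. As a minimal illustration (a Python sketch of our own, not the implementation used in the paper, whose experiments were coded in R), the following draws from the BB posterior of the mean functional by reweighting the observed data with Dirichlet(1, ..., 1) weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def bb_draws(x, n_draws=1000, functional=np.dot):
    """Draw from the Bayesian bootstrap posterior of a functional.

    Each draw reweights the observed data with Dirichlet(1, ..., 1)
    weights, as in the stochastic representation of Expression (1).
    """
    n = len(x)
    out = np.empty(n_draws)
    for k in range(n_draws):
        w = rng.dirichlet(np.ones(n))   # (W_1, ..., W_n) ~ Dirichlet(1, ..., 1)
        out[k] = functional(w, x)       # default: the mean functional sum_i W_i X_i
    return out

x = rng.standard_normal(500)
draws = bb_draws(x)
print(draws.mean(), draws.std())
```

No recursive sampling is needed: each posterior draw costs one Dirichlet vector and one evaluation of the functional.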
Several papers investigate the theoretical properties and methodological uses of the BB. From a theoretical point of view, various authors have studied its first- and second-order asymptotic properties (Lo, 1987; Weng, 1989), proposed extensions and variations (Lo, 1991; Kim and Lee, 2003; Ishwaran et al., 2009), and provided distributional characterizations of functionals of P_n^{BB} (Gasparini, 1995; Choudhuri, 1998; Cifarelli and Melilli, 2000). The connections between the BB and other processes have also been an object of study. In relation to Dirichlet processes, the BB is a limiting case of the posterior distribution of a DP (as the concentration parameter goes to zero, the DP posterior converges weakly to the BB), so we can interpret the BB as a "noninformative" limit of the DP (Lo, 1987; Muliere and Secchi, 1996; Gasparini, 1995).
In addition to the above-mentioned favorable properties, the BB also has some potentially unappealing features. For example, P BB n is almost surely discrete, and its support is limited to the observed data X n . At this point, we would like to stress that the goal of the article is not to provide an extensive discussion of the advantages and shortcomings of the BB, but rather to provide methodological tools to implement the model when n is massive.
Let π_P(·|X_n) be the posterior distribution of P given X_n, which has the stochastic representation in Expression (1). We assume that the goal is to make posterior inferences about a functional of P denoted by φ. We use the notation π_φ(·|X_n) for the posterior distribution of φ(P). In the following subsections, we introduce methods for approximating ξ{π_φ(·|X_n)}, where ξ is a summary of interest (e.g., mean, variance, length of credible intervals).

Bag of little Bayesian bootstraps
The BLBB is an adaptation of the bag of little bootstraps proposed by Kleiner et al. (2014). This procedure provides an approximation of ξ{π φ (·|X n )} when n is large, and it comprises three steps: divide, rescale, and combine. In the first step, we randomly split the dataset into subsets of size b such that each subset can be stored in the internal memory of the computer. We define these subsets by generating a random partition I 1,b,n , . . . , I n/b,b,n of the set {1, . . . , n}, where |I j,b,n | = b. For ease of exposition, we assume that n is a multiple of b. In the second step, we compute a rescaled version of the posterior distribution of φ(P ) for each dataset X j,b,n = {X i } i∈I j,b,n , j = 1, . . . , n/b. In the third step, we combine the summaries found with the rescaled posteriors to obtain an approximation of ξ{π φ (·|X n )}.
The purpose of rescaling is to define a posterior distribution that resembles the one we would obtain with the full dataset. Without rescaling, the next step after partitioning the dataset would be to compute ξ{π_φ(·|X_{j,b,n})}, j = 1, ..., n/b, and then combine these summaries using, for example, the average; that is, to approximate ξ{π_φ(·|X_n)} by (b/n) \sum_{j=1}^{n/b} ξ{π_φ(·|X_{j,b,n})}. This strategy could work if the ξ{π_φ(·|X_{j,b,n})} provided a reasonable approximation of ξ{π_φ(·|X_n)}. Unfortunately, this is often not the case. For example, if ξ is the variance of π_φ(·|X_n), then we expect ξ{π_φ(·|X_n)} < ξ{π_φ(·|X_{j,b,n})}, which in turn implies that ξ{π_φ(·|X_n)} < (b/n) \sum_{j=1}^{n/b} ξ{π_φ(·|X_{j,b,n})}. For this reason, we rescale the posterior distributions within each subset by following a standard strategy employed in several data-splitting based procedures (e.g., Kleiner et al., 2014; Minsker et al., 2017; Srivastava et al., 2015). Each subset X_{j,b,n} is replicated n/b times such that the replicated dataset contains n instead of b data points. Then, we obtain the posterior distribution associated with the replicated dataset, which we refer to as the "rescaled posterior distribution." For a functional φ(P), the rescaled posterior distribution of φ(P) given X_{j,b,n} is defined as π^{BLBB}_φ(·|X_{j,b,n}) = π_φ(·|X^*_{j,b,n}), where X^*_{j,b,n} denotes the dataset X_{j,b,n} replicated n/b times, so we have a new sample of size n. Note that the rescaling of π^{BLBB}_φ(·|X_{j,b,n}) can be achieved by rescaling the posterior distribution of P given X_{j,b,n}. The rescaled posterior distribution of P given X_{j,b,n} is denoted π^{BLBB}_P(·|X_{j,b,n}) = π_P(·|X^*_{j,b,n}). In this case, the distribution π^{BLBB}_P(·|X_{j,b,n}) also has a stochastic representation of the form

P^{BLBB}_{j,b,n} = \sum_{i \in I^*_{j,b,n}} W_{i,j} \delta_{X_i} \overset{d}{=} \sum_{i \in I_{j,b,n}} W^*_{i,j} \delta_{X_i},    (2)

where (W_{i,j})_{i \in I^*_{j,b,n}} ~ Dirichlet(1, ..., 1), (W^*_{i,j})_{i \in I_{j,b,n}} ~ Dirichlet(n/b, ..., n/b), \overset{d}{=} denotes equality in distribution, and I^*_{j,b,n} denotes the index set I_{j,b,n} replicated n/b times.
The process in Expression (2) belongs to the class of BB clones proposed in Lo (1991). With this stochastic representation, we have φ(P^{BLBB}_{j,b,n}) | X_{j,b,n} ~ π^{BLBB}_φ(·|X_{j,b,n}). Although we compute the rescaled posterior using a replicated dataset of size n, the computational cost of drawing from π^{BLBB}_φ(·|X_{j,b,n}) is the same as that of a BB with sample size b. Thus, we propose approximating ξ{π_φ(·|X_n)} by (b/n) \sum_{j=1}^{n/b} ξ{π^{BLBB}_φ(·|X_{j,b,n})}. We provide theoretical guarantees for approximating summaries of φ_T(P^{BB}_n, P_n) = \sqrt{n}{T(P^{BB}_n) − T(P_n)} by summaries of φ_T(P^{BLBB}_{j,b,n}, P_{j,b,n}) = \sqrt{n}{T(P^{BLBB}_{j,b,n}) − T(P_{j,b,n})}, where T is a functional, and P_n and P_{j,b,n} are the empirical measures associated with X_n and X_{j,b,n}, respectively. Since E[P^{BB}_n | X_n] = P_n and E[P^{BLBB}_{j,b,n} | X_{j,b,n}] = P_{j,b,n}, we can think of T(P_n) and T(P_{j,b,n}) as measures of central tendency for the distributions of T(P^{BB}_n) | X_n and T(P^{BLBB}_{j,b,n}) | X_{j,b,n}, respectively. The functional φ_T can be used to quantify the uncertainty (e.g., to estimate the length of intervals and measures of dispersion) associated with the functional T. The study of the asymptotic properties of φ_T(P^{BB}_n, P_n) and φ_T(P^{BLBB}_{j,b,n}, P_{j,b,n}) requires understanding the asymptotic behavior of the processes \sqrt{n}(P^{BB}_n − P_n) and \sqrt{n}(P^{BLBB}_{j,b,n} − P_{j,b,n}). These processes belong to the class of weighted bootstrap empirical processes studied in Section 3.6.2 of van der Vaart and Wellner (1996). The connection with weighted bootstrap empirical processes and an assumption on the differentiability of T allow us to prove that (b/n) \sum_{j=1}^{n/b} ξ{π^{BLBB}_{φ_T}(·|X_{j,b,n})} suitably approximates ξ{π_{φ_T}(·|X_n)}, where π^{BLBB}_{φ_T}(·|X_{j,b,n}) and π_{φ_T}(·|X_n) denote the distributions of φ_T(P^{BLBB}_{j,b,n}, P_{j,b,n}) | X_{j,b,n} and φ_T(P^{BB}_n, P_n) | X_n, respectively. The proof of this result is similar to the proof of Theorem 1 in Kleiner et al. (2014), and it also relies on the assumption that T is Hadamard differentiable at P_0 (the data-generating mechanism) and on the existence of a P_0-Donsker class. The specific statement of our result and its proof can be found in Theorem 1 in the supplementary material. We also show that, even if we use s instead of n/b subsets (s < n/b), the average s^{-1} \sum_{j=1}^{s} ξ{π^{BLBB}_{φ_T}(·|X_{j,b,n})} can provide a reasonable approximation of ξ{π_{φ_T}(·|X_n)}. Figure 1 describes the resulting Monte Carlo algorithm for the BLBB.
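The BLBB steps above (divide, rescale, combine) can be sketched as follows, using the fact that the rescaled posterior amounts to drawing Dirichlet(n/b, ..., n/b) weights over the b subset points. This is an illustrative Python sketch of our own for the mean functional, with ξ taken to be a posterior standard deviation; it is not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def blbb_summary(x, b, n_mc=100, summary=np.std):
    """Bag of little Bayesian bootstraps for the mean functional.

    Approximates a posterior summary (here the posterior standard
    deviation of the BB mean) by averaging rescaled-subset summaries.
    """
    n = len(x)
    idx = rng.permutation(n)                     # random partition of the data
    summaries = []
    for j in range(n // b):
        xj = x[idx[j * b:(j + 1) * b]]
        # Rescaled weights: Dirichlet(n/b, ..., n/b) on the b subset points
        # is equivalent to Dirichlet(1, ..., 1) on the subset replicated n/b times.
        w = rng.dirichlet(np.full(b, n / b), size=n_mc)
        draws = w @ xj                           # n_mc draws of the BB mean
        summaries.append(summary(draws))
    return np.mean(summaries)                    # combine by averaging

x = rng.standard_normal(10_000)
approx = blbb_summary(x, b=int(10_000 ** 0.7))
full = np.std(rng.dirichlet(np.ones(len(x)), size=100) @ x)  # full-data BB reference
print(approx, full)
```

Only one subset's weights are in play at a time, so the per-subset cost matches a BB of size b, as noted above.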

Subsampled double Bayesian bootstrap
The SDBB is the Bayesian analogue to the subsampled double bootstrap for massive data proposed by Sengupta et al. (2016), and it also provides an approximation of ξ{π_φ(·|X_n)}. Sengupta et al. (2016) claim that the SDB outperforms the BLB in some scenarios with a limited time budget, especially when it is only possible to run s < n/b little bootstraps. Therefore, we would expect the same phenomenon to occur with the BLBB and SDBB.
Theorem 1 in the supplementary material shows that s^{-1} \sum_{j=1}^{s} ξ{π^{BLBB}_{φ_T}(·|X_{j,b,n})} can provide a reasonable approximation of ξ{π_{φ_T}(·|X_n)}. However, the BLBB could be outperformed by the SDBB because the little Bayesian bootstraps only consider a single partition of the dataset and, if the computational budget is limited, only a fraction of the dataset contributes to the analysis. We refer to this fraction as the sample coverage, a term we borrow from Sengupta et al. (2016).
The SDBB is a procedure that ensures a higher sample coverage than the BLBB and does not require a partition of the dataset. Instead, it uses random subsets of X_n. Let X_{b,n} = {X_{R_1}, ..., X_{R_b}} denote a random subset, where b ∈ N is such that X_{b,n} can be stored in the internal memory of the computer, R = (R_1, ..., R_b) ~ U(P^n_b), and U(P^n_b) stands for the uniform distribution on the permutations of size b of the elements {1, ..., n}. The SDBB runs a very little Bayesian bootstrap of size one for each drawn subset, so it has a higher sample coverage than the BLBB. The use of subsets of X_n also requires a rescaling strategy, and we use the same one as for the BLBB. We approximate the posterior distribution of P given X_n by π^{SDBB}_P(·|X_n), the distribution induced by the SDBB process. We define the SDBB process as

P^{SDBB}_{b,n} = \sum_{i=1}^{b} W^*_i \delta_{X_{R_i}}, \quad (W^*_1, ..., W^*_b) ~ Dirichlet(n/b, ..., n/b),

where the randomness comes from both the random subset R and the weights. Theorem 2 in the supplementary material provides theoretical guarantees for the SDBB; the technical conditions assumed for T and P_0 in Theorem 2 are similar to those assumed for the BLBB. Theorem 2 is the counterpart of Theorem 1 in Sengupta et al. (2016). Figure 1 describes the Monte Carlo algorithm for running the SDBB.
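A minimal Python sketch of the SDBB for the mean functional follows (illustrative only; the names are ours). Each iteration draws a fresh random subset, one rescaled weight vector, and one draw of the root φ_T, here computed without the √n factor so that the spread of the draws directly approximates the posterior spread of the mean:

```python
import numpy as np

rng = np.random.default_rng(2)

def sdbb_draws(x, b, n_draws=500):
    """Subsampled double Bayesian bootstrap draws for the mean functional.

    Each iteration takes a fresh random subset of size b and runs a
    'very little' Bayesian bootstrap of size one with rescaled weights.
    """
    n = len(x)
    draws = np.empty(n_draws)
    for k in range(n_draws):
        sub = rng.choice(n, size=b, replace=False)   # random subset of X_n
        xs = x[sub]
        w = rng.dirichlet(np.full(b, n / b))          # rescaled Dirichlet(n/b, ..., n/b)
        # Root T(P^SDBB) - T(P_sub), without the sqrt(n) factor:
        draws[k] = w @ xs - xs.mean()
    return draws

x = rng.standard_normal(10_000)
draws = sdbb_draws(x, b=int(10_000 ** 0.7))
print(draws.std())   # approximates the BB posterior spread of the mean
```

Centering each draw at the subset estimate is what removes the subset-to-subset variability, mirroring the root-function step of the SDB; it is also why each SDBB iteration evaluates the functional twice.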

Lossless inference for the Bayesian bootstrap
For a certain class of functionals, we can perform exact (lossless) inference for the BB after splitting the data. This strategy is based on decomposing the Dirichlet weights into independent Gamma random variables, and it is similar in spirit to the strategy devised for bagging in Lee and Clyde (2004). The functionals for which lossless inference can be performed are of the form φ(P) = g(∫ ρ_1 dP, ..., ∫ ρ_k dP), where g is a function of the integrals of (ρ_1, ..., ρ_k), and each ρ_l is a function defined on the sample space. This class of functionals clearly contains moments and expectations of transformations, but it also contains other functionals such as weighted least squares estimators (as we show in the example below) and the instrumental variables estimator presented in Section 2 of Chamberlain and Imbens (2003) (see the supplementary material for further details).
Example 1. (Weighted least squares) Let Y_i ∈ R be the outcome and U_i ∈ R^{p+1} be covariates, and assume that we want to model E[Y_i | U_i] using a linear combination of the predictors. If the pairs satisfy (Y_i, U_i) | P ~iid P and the posterior of P given the data is the BB, then we can define the least squares functional (which, in this case, has a posterior distribution)

β^{BB}_{lm,n} = \left( \sum_{i=1}^{n} W_i U_i U_i' \right)^{-1} \sum_{i=1}^{n} W_i U_i Y_i.

This functional has been used, for instance, in Clyde and Lee (2001) and Taddy et al. (2016). We can rewrite β^{BB}_{lm,n} as g(∫ ρ_1 dP^{BB}_n, ..., ∫ ρ_k dP^{BB}_n), taking the ρ_l to be the entries of the maps (y, v) → v v' and (y, v) → y v, where v is a (p + 1)-dimensional vector. Thus, β^{BB}_{lm,n} is included in the class of functionals for which lossless inference can be performed.
The algorithm requires processing all the n/b subsets, so it can be significantly slower than the BLBB or SDBB. A figure that summarizes the Monte Carlo algorithm for performing lossless inference can be found in the supplementary material.
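To illustrate the Gamma decomposition behind the lossless strategy, consider Example 1: writing the Dirichlet(1, ..., 1) weights as independent Gamma(1) variables divided by their sum, the normalizing sum cancels in the weighted least squares functional, so each subset can be processed separately and only low-dimensional partial sums need to be combined. A hedged Python sketch of our own (with toy data) follows:

```python
import numpy as np

rng = np.random.default_rng(3)

def lossless_wls_draw(subsets):
    """One exact (lossless) draw of the BB weighted least squares functional.

    Dirichlet(1, ..., 1) weights decompose as independent Gamma(1)
    variables divided by their total; the total cancels in the WLS
    functional, so each subset is processed separately and only the
    partial sums A and c need to be combined.
    """
    p = subsets[0][1].shape[1]
    A = np.zeros((p, p))              # running sum of G_i * U_i U_i'
    c = np.zeros(p)                   # running sum of G_i * U_i Y_i
    for y, u in subsets:              # one subset in memory at a time
        g = rng.gamma(1.0, size=len(y))   # Gamma(1) decomposition of the Dirichlet
        A += (u * g[:, None]).T @ u
        c += u.T @ (g * y)
    return np.linalg.solve(A, c)      # (sum G_i U_i U_i')^{-1} sum G_i U_i Y_i

# Toy data split into 4 subsets of size 250
n, p = 1000, 3
U = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta0 = np.ones(p)
Y = U @ beta0 + rng.standard_normal(n)
subsets = [(Y[i::4], U[i::4]) for i in range(4)]
draws = np.array([lossless_wls_draw(subsets) for _ in range(200)])
print(draws.mean(axis=0))
```

The draws are exact BB draws, not approximations, because the same global Dirichlet vector is recovered when the per-subset Gamma variables are pooled; the price, as noted above, is that every subset must be processed for every draw.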

Extension to the Dirichlet Process
In this subsection, we extend the results of the previous subsections to the Dirichlet process (DP). We write X_i | P ~iid P, i = 1, ..., n, with P ~ DP(α, H). The hyperparameters of the DP are the base measure H and the concentration parameter α > 0. The base measure H is the prior expectation of P, whereas α controls how concentrated P is around H. The DP has an explicit stick-breaking representation (Sethuraman, 1994), and it is a conjugate model:

P | X_n ~ DP( α + n, (α H + n P_n) / (α + n) ),

where P_n is the empirical probability measure. Pitman (1996) shows that the posterior above can be represented as

P | X_n \overset{d}{=} R_n P^{DP} + (1 − R_n) P^{BB}_n, with P^{DP} ~ DP(α, H) and R_n ~ Beta(α, n).    (3)

The random weight R_n, the prior measure P^{DP} of the DP, and the BB P^{BB}_n are independent given the data. For large n, the posterior measure of the DP is very close to the BB: if the same random P^{BB}_n is used for an analysis with the BB and the DP, the distance between the two posterior draws is bounded by R_n, since

sup_A | R_n P^{DP}(A) + (1 − R_n) P^{BB}_n(A) − P^{BB}_n(A) | ≤ R_n.

For instance, if 0 < ε < 1 and 0 < α ≤ 1, the inequality above, combined with Markov's inequality and E[R_n] = α/(α + n), implies that P(R_n > ε) ≤ 1/(εn), so the two draws agree up to ε with high probability. From a more theoretical perspective, the representation in Equation (3) allows us to define analogues of the BLBB and SDBB,

P^{BLDP}_{j,b,n} = R_n P^{DP} + (1 − R_n) P^{BLBB}_{j,b,n} and P^{SDDP}_{b,n} = R_n P^{DP} + (1 − R_n) P^{SDBB}_{b,n},

where BLDP and SDDP stand for bag of little Dirichlet processes and subsampled double Dirichlet process, respectively, and R_n ~ Beta(α, n). The Bernstein-von Mises results for the Dirichlet process available in the literature (see e.g., Lo, 1983; James, 2008; Varron, 2014; Castillo and Nickl, 2014) allow us to show that the asymptotic behavior of these processes parallels the results found in the previous subsections (a proof can be found in the supplementary material). In order to obtain the same theoretical guarantees for a functional φ_T(P^{BB}_n, P_n) (as described in the previous subsections), the next step would be to invoke the functional delta method.
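To make the posterior representation in Equation (3) concrete, the following Python sketch (ours, not from the paper) draws the mean functional under the DP posterior by combining a Beta(α, n) weight, a truncated stick-breaking draw of the prior measure, and a BB draw:

```python
import numpy as np

rng = np.random.default_rng(4)

def dp_posterior_mean_draw(x, alpha=1.0, h_sampler=rng.standard_normal, trunc=200):
    """One draw of the mean functional under the DP posterior, via
    P | X_n = R_n * P^DP + (1 - R_n) * P^BB_n, with R_n ~ Beta(alpha, n)
    and P^DP ~ DP(alpha, H) approximated by truncated stick-breaking.
    """
    n = len(x)
    r = rng.beta(alpha, n)                      # R_n ~ Beta(alpha, n)
    w_bb = rng.dirichlet(np.ones(n))            # BB component
    # Truncated stick-breaking draw of the prior measure P^DP ~ DP(alpha, H)
    v = rng.beta(1.0, alpha, size=trunc)
    sticks = v * np.concatenate([[1.0], np.cumprod(1 - v[:-1])])
    atoms = h_sampler(trunc)                    # atoms drawn from H (here N(0, 1))
    q_mean = sticks @ atoms / sticks.sum()      # renormalize the truncation
    return r * q_mean + (1 - r) * (w_bb @ x)

x = rng.standard_normal(2_000) + 5.0
draws = np.array([dp_posterior_mean_draw(x) for _ in range(300)])
print(draws.mean())
```

Since E[R_n] = α/(α + n), the prior component contributes almost nothing for large n, which is the "noninformative limit" behavior described above.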
We conclude this subsection by noting that, for the class of functionals defined in Subsection 2.3, one can perform lossless inference for the Dirichlet-Multinomial process (Kingman, 1975; Pitman, 1995), which is an approximation to the Dirichlet process (Ishwaran and Zarepour, 2002; Muliere and Secchi, 1996). The details can be found in the supplementary material.

Numerical experiments
We compare the performance of the BLBB and SDBB in approximating posterior means and standard deviations, as well as the lengths of 95% credible intervals, of functionals of the BB in linear, logistic, and mixed-effects regression. The sample sizes of the datasets are always equal to 10,000. We simulate data from the following models:

Linear: Y_i = U_i' β_0 + ε_i,
Logistic: Y_i | U_i ~ Bernoulli(p_i), with logit(p_i) = U_i' β_0,
Mixed-effects: Y_i = U_i' β_0 + α_{j(i)} + ε_i,

where i = 1, ..., n, Y_i is the outcome, U_i denotes a (p + 1)-dimensional vector containing the predictors, β_0 = (β_{0,0}, ..., β_{p,0}) = (1, 1, ..., 1), and j(i) indexes the group of observation i in the mixed-effects model. The first component of U_i is equal to 1 (acting as an intercept) and the remaining elements are independent and identically distributed as a Student-t with 3 degrees of freedom. In the linear model, the errors ε_i are simulated from a Skew Normal distribution with location parameter −0.71, scale 1, and slant 2, which has mean 0 and is asymmetric. In the mixed model, we draw the random effect α_j from a Skew Normal distribution with location parameter −0.71, scale 1, and slant 2, and the error ε_i from a Student-t with 3 degrees of freedom. Finally, in linear regression, p is taken to be 100, whereas in logistic and mixed-effects regression p is equal to 25. Throughout, we model the data as (Y_i, U_i) | P ~iid P and model P given the data using P^{BB}_n. We focus on the following three functionals that induce posterior distributions for the regression coefficients:

- Linear regression: the weighted least squares estimator β^{BB}_{lm,n} (see, for example, Clyde and Lee, 2001 and Taddy et al., 2016);
- Logistic regression: the robust estimator β^{BB}_{lg,n} (see, for instance, Carroll and Pederson, 1993);
- Mixed-effects regression: the weighted estimator β^{BB}_{mx,n}, where V̂ is the MLE of the covariance matrix derived from the marginal likelihood of a mixed-effects model with a random intercept and under Gaussian assumptions.

Welsh and Richardson (1997) and Jacqmin-Gadda et al. (2007) discuss and assess the robustness of these types of functionals.
Now, we explain how the lengths of credible intervals, standard deviations, and means for β^{BB}_{lm,n} = T(P^{BB}_n) are computed, using the notation introduced in Section 2 (the summaries for β^{BB}_{lg,n} and β^{BB}_{mx,n} are computed using the same procedure). In the simulation studies, we work with the root φ_T(P^{BB}_n, P_n) = \sqrt{n}{T(P^{BB}_n) − T(P_n)}, where T(P^{BB}_n) = β^{BB}_{lm,n} and T(P_n) = β̂_{lm,n}. We use the notation ξ_{l,1}, ξ_{l,2}, ξ_{l,3}, and ξ_{l,4} for the 2.5th percentile, the 97.5th percentile, the standard deviation, and the mean of the l-th marginal distribution of a (p + 1)-dimensional distribution. We define ξ^{BLBB}_{l,k} and ξ^{SDBB}_{l,k} as the approximations of ξ_{l,k}{π_{φ_T}(·|X_n)} computed using the algorithms displayed in Figure 1. With these summaries, we can compute the average relative errors of the lengths of 95% credible intervals and of the posterior standard deviations. For example, for the BLBB, the average relative error of the interval lengths is

(p + 1)^{-1} \sum_{l=1}^{p+1} | (ξ^{BLBB}_{l,2} − ξ^{BLBB}_{l,1}) − (ξ^{BB}_{l,2} − ξ^{BB}_{l,1}) | / (ξ^{BB}_{l,2} − ξ^{BB}_{l,1}),    (7)

and the same computation can be carried out for the SDBB by substituting ξ^{BLBB}_{l,1} and ξ^{BLBB}_{l,2} by ξ^{SDBB}_{l,1} and ξ^{SDBB}_{l,2}. For the posterior mean, we define the average absolute errors

(p + 1)^{-1} \sum_{l=1}^{p+1} | ξ^{BLBB}_{l,4} − ξ^{BB}_{l,4} | and (p + 1)^{-1} \sum_{l=1}^{p+1} | ξ^{SDBB}_{l,4} − ξ^{BB}_{l,4} |,    (8)

for the BLBB and SDBB, respectively. We use the errors in (7) and (8) to assess the performance of the methods. We compare against the results found after 1,000 samples from the BB (run on the full dataset), and the results shown are averages over 100 simulated datasets. In the case of the BLBB, the number of bootstrap samples within each subgroup is equal to 100. The SDBB, on the other hand, is run until 1,000 samples are drawn. For both methods, we set b = n^γ, with γ = 0.6, 0.7, and 0.8.
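As a point of reference for the simulation design, a minimal Python sketch of the linear-regression data-generating process is given below (the study itself used R; we use a smaller p for illustration and the standard |Z|-based construction of the Skew Normal):

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_linear(n=10_000, p=10):
    """Simulate data following the linear-regression design of the study
    (smaller p for illustration): an intercept plus Student-t(3) predictors,
    beta_0 = (1, ..., 1), and Skew Normal errors with location -0.71,
    scale 1, and slant 2 (mean approximately 0, asymmetric).
    """
    U = np.column_stack([np.ones(n), rng.standard_t(3, size=(n, p))])
    beta0 = np.ones(p + 1)
    # Skew Normal via |Z|: slant 2 gives delta = 2 / sqrt(1 + 2^2)
    delta = 2 / np.sqrt(1 + 2 ** 2)
    z1, z2 = rng.standard_normal((2, n))
    eps = -0.71 + delta * np.abs(z1) + np.sqrt(1 - delta ** 2) * z2
    Y = U @ beta0 + eps
    return Y, U, beta0

Y, U, beta0 = simulate_linear()
print(Y.mean(), U.shape)
```

Datasets of this form are what the BLBB and SDBB summaries in (7) and (8) are evaluated on, with the functional T computed on reweighted versions of (Y, U).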
Given that the BLBB and SDBB have asymptotic guarantees (proved in Theorems 1 and 2 in the supplementary material), we compare the performance of the BLBB and SDBB with two procedures based on the asymptotic distributions of T(P_n) and T(P_{j,b,n}). The first procedure uses estimates of the mean and variance of the asymptotic normal distribution of T(P_n) to compute the summaries of interest. We refer to this method, based on asymptotic normality, as AN. The second one relies on data-subsetting and, for each j = 1, ..., n/b, uses estimates of the mean and variance (rescaled by a factor of b/n) of the asymptotic normal distribution of T(P_{j,b,n}) to compute an aggregated estimate of the corresponding summaries. We aggregate estimates using the average over j = 1, ..., n/b. We refer to this method as ANS. The summaries computed under AN and ANS are compared to ξ^{BB}_{l,k}, k = 1, 2, 3, 4, using (7) and (8). The estimated means and variances of these asymptotic distributions, as well as the regression functionals and the variance V̂ used in the mixed-effects functional, are computed using the statistical software R (R Core Team, 2015). For linear and logistic regression, we used the functions lm and glm in the stats package; for mixed-effects regression, we used the function lme in the library nlme (Pinheiro et al., 2018). The simulations were performed on a computer with an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz processor and 16 GB of RAM, running R version 3.3.2. We acknowledge that the results are contingent on our computing infrastructure and coding abilities. Section 3.1 below explains the results of the simulation studies, and Section 3.2 contains a discussion of the computational overhead of the methods.

Monte Carlo results
In this subsection, we discuss and compare the Monte Carlo results obtained with the BB to those obtained with the BLBB, SDBB, ANS, and AN. Table 1 displays the average relative and absolute errors associated with the summaries.
In our simulations, ANS is outperformed by the BLBB and SDBB, but AN outperforms both of our methods. The good performance of AN is in agreement with our theoretical results, which assert that, under appropriate regularity conditions, the posterior distribution of functionals of P^{BB}_n is approximately normal when the sample size is large enough. However, AN requires storing the full dataset in internal memory, which is infeasible in realistic applications with massive data. In other words, we propose the BLBB and SDBB for settings where the data cannot be stored in the internal memory of a single computer, so AN is not a direct competitor.
As expected, Table 1 shows that the performance of the proposed methods depends on the functional of interest. In linear and mixed-effects regression the errors are rather small, whereas in logistic regression the errors tend to be higher. For the functionals considered in this simulation study, we observe that the performance of the BLBB and SDBB suffers in scenarios where the functional of interest does not have a closed-form expression, such as logistic regression. Nonetheless, regardless of the functional, the BLBB and SDBB approximate the summaries of the BB better as the size of the subsets b = n γ increases, so we encourage users to use subsets as large as possible. Another component that can affect the quality of the approximation is the number of bootstrap samples used to compute the summaries. We recommend setting this number as large as possible.
While the bias of the SDBB is smaller than the bias of the BLBB in approximating posterior means, the BLBB is better than the SDBB in approximating credible interval lengths and posterior standard deviations. The bias of the BLBB can be attributed to the fact that it only considers a single partition of the dataset, which can be avoided by averaging over several random subsets (so the SDBB avoids this issue). However, the use of several random subsets adds an additional source of randomness that leads to wider credible intervals and larger standard deviations. Fortunately, as pointed out above, these differences are less worrisome as the size of the subsets increases.

Computational considerations
We compare the relative error of our methods with respect to 1,000 posterior draws from the BB (run on the full dataset), and we average our results over 100 simulated datasets. In the case of the BLBB, the number of bootstrap samples within each subgroup equals 100, and the algorithm is run for 20 seconds (if all the subgroups are processed before 20 seconds, the algorithm stops). The SDBB also runs for 20 seconds (unless 1,000 samples are drawn before 20 seconds, in which case the algorithm stops). For both methods, the subsets are processed sequentially using a single core. Figure 2 displays the results for linear and logistic regression and credible interval lengths using γ = 0.8. Due to the similarity of the conclusions, we do not present the results for mixed-effects regression and other values of γ.

Figure 2 reports average relative errors and processing times associated with the lengths of credible intervals for β^{BB}_{lm,n} and β^{BB}_{lg,n}; the vertical green dashed line indicates when the SDBB has generated its 100th draw, and the black line represents how fast the BB approximates the interval lengths based on 1,000 draws. The figure shows that the SDBB provides outputs faster than the BLBB if we do not wait until 100 draws are available before computing summaries of the posterior distribution. If we do wait, the BLBB is faster than the SDBB. This phenomenon occurs because each iteration of the SDBB requires computing the functional twice, whereas the BLBB computes the functional only once; under the waiting rule, the SDBB produces its first output just after the BLBB has processed its second subset. We recommend waiting for some minimum number of samples because, otherwise, the estimates might not be reliable (in the sense that they will have high variance). Indeed, the figure shows that while the SDBB can produce outputs very quickly, inferences reported with very few samples can have high relative error.
The computational cost of loading subsets of a massive dataset into memory is not always negligible and can play a crucial role in the performance of the proposed methods, particularly for the SDBB. If b is large enough, each subset takes a considerable amount of time to be loaded into memory, which can make the SDBB slower than the BLBB. More precisely, if we consider the same settings used in the simulation studies, the BLBB only has to load n/b subsets, while the SDBB needs to load 1,000 subsets into memory. Hence, the use of the SDBB is limited to scenarios where, according to the available computational resources, the value of b is not too large or the data transfer rate is not too low.
Although the BLBB and SDBB do not start reporting results at the same time (which depends on the size of b and whether or not we impose a condition of having at least 100 draws from the SDBB), both provide results that stabilize rather quickly: in Figure 2, the relative errors tend to stabilize before the n/b subsets for the BLBB and the 1,000 subsets for the SDBB have been processed. This is appealing in scenarios with limited access to computational resources. Even if the user can load the full dataset into memory, our methods will stabilize faster and might output better results than the BB. ANS is a valid alternative in this context, but users have to be willing to accept higher relative errors.
We conclude this section by discussing some computational aspects of the lossless procedure presented in Section 2.3. While it is possible to perform lossless inference for β^{BB}_{lm,n}, it is significantly slower than the BLBB and SDBB. In our simulations, the BLBB and SDBB can produce excellent approximations within 20 seconds. Table 2 shows computing times for the lossless method for different subgroup sizes and numbers of bootstrap samples drawn at a time from the subgroups (referred to as "batch" in the table). We show the average time spent sampling from the subsets (processing) and the time it takes to combine the results once all the samples are drawn (combining). The last column is an estimate of how long the lossless procedure would take to generate 1,000 samples. We observe that smaller subgroups and bigger batches are preferable, but in any case the time it takes to generate 1,000 samples is significantly larger than the time it takes to produce them using the BLBB and SDBB (and recall that, in this context, the BLBB and SDBB provide very good approximations, so the benefit of using a lossless method is minimal).

Applications with real-life datasets
In this section, we apply the BLBB and SDBB to two different datasets. The first is the Office of Personnel Management's Central Personnel Data File (https://www.opm.gov/), which we refer to as the OPM dataset from now on. The OPM dataset is confidential and housed by the Protected Research Data Network at Duke University (https://oit.duke.edu/what-we-do/services/protected-network). We consider two subsets of the data: i) a subset comprising employees who worked during 2011, which we refer to as OPM-2011, and ii) a subset comprising employees who worked without any interruption during the ten years starting in 2002, which we refer to as OPM-10Y. The second dataset consists of public microdata files from the American Community Survey 2012 (ACS-2012), which can be downloaded from the United States Bureau of the Census (http://www2.census.gov/acs2012_1yr/pums/).
We study the performance of the BLBB and SDBB in approximating functionals of the posterior distribution of regression coefficients. We compare their 95% credible intervals, posterior standard deviations, and posterior means (as we did in Section 3). We approximate 95% credible intervals using ξ^{BLBB}_{l,4} ± (ξ^{BLBB}_{l,2} − ξ^{BLBB}_{l,1})/2 and ξ^{SDBB}_{l,4} ± (ξ^{SDBB}_{l,2} − ξ^{SDBB}_{l,1})/2. We also compare the BLBB and SDBB to the asymptotic methods ANS and AN, which were defined in Section 3. For the OPM-2011 dataset, we focus on estimating coefficients from linear regression, i.e., β^{BB}_{lm,n} (an analogous comparison for the τ-quantile regression estimator proposed in Chamberlain and Imbens (2003) can be found in the supplementary material). For the OPM-10Y dataset, we use mixed-effects regression and aim to approximate summaries of the distribution of β^{BB}_{mx,n}. For the ACS-2012 dataset, we estimate coefficients of a logistic regression model, which we denote β^{BB}_{lg,n}.
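The BB functional for linear regression and the quantile-based interval above can be sketched in a few lines of numpy. The sketch below uses synthetic data, and our reading of ξ_{l,1} and ξ_{l,2} as the 2.5% and 97.5% quantiles of the draws and ξ_{l,4} as their mean is an assumption for illustration, not a definition taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data standing in for one of the regression datasets.
n, p = 2000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Bayesian bootstrap of the weighted least squares functional:
# each draw re-solves the normal equations under Dirichlet(1, ..., 1)
# weights on the n observations.
draws = np.empty((1000, p))
for s in range(1000):
    w = rng.dirichlet(np.ones(n))
    WX = X * w[:, None]
    draws[s] = np.linalg.solve(WX.T @ X, WX.T @ y)

# Interval in the spirit of xi_{l,4} +/- (xi_{l,2} - xi_{l,1})/2, with
# xi_{l,1}, xi_{l,2} taken as the 2.5% and 97.5% quantiles of the draws
# and xi_{l,4} as their mean (our assumption for this illustration).
l = 1
q1, q2 = np.quantile(draws[:, l], [0.025, 0.975])
center, half = draws[:, l].mean(), (q2 - q1) / 2
print(f"95% interval for beta_{l}: ({center - half:.3f}, {center + half:.3f})")
```

The BLBB and SDBB replace the full-data Dirichlet draw with rescaled draws inside subsets; only the weight-generation step changes.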

Office of Personnel Management
The OPM dataset comprises about 3.5 million employees and 29 variables recorded over 24 years. This dataset contains personnel records from employees who served in the U.S. federal government, including demographic variables (e.g., race, gender, and age) and relevant information related to their wages and careers (such as educational level). Our response variable is the natural logarithm of the wages, and the predictors correspond to the variables whose effect is of interest (e.g., gender or race) along with other variables that are used to control for potential confounding factors (e.g., age and education level). In order to assess wage gaps between the levels of a feature of interest (e.g., between men and women or between races), researchers have run regression models and interpreted their coefficients (see Bolton and de Figueiredo, 2016b,a; Barrientos et al., 2018). When those inferences cannot rely on parametric assumptions, uncertainty about the coefficients can be measured using nonparametric methods such as the BB.
For illustrative purposes, we use two random samples of 50,000 and 200,000 full-time employees from the OPM-2011 dataset, and two random samples of sizes 10,000 and 40,000 from the OPM-10Y dataset. We include gender, race, educational level, and age as predictors. The levels for gender are male and female, whereas the levels for race are white, American Indian/Alaskan Native, Asian or Pacific Islander, black, and Hispanic. Age and educational level are treated as numerical variables, and we include both linear and quadratic effects for the age of the employees. The datasets contain 22 different educational categories. For ease of interpretation, we collapse the categories into a single continuous measure of the years of education attained by an individual past high school. Race is treated as a categorical variable, and the baseline cannot be disclosed because the dataset is confidential. For the OPM-10Y dataset, we add a predictor representing the year in which the observation was collected. The regression coefficient β^{BB}_{mx,n} is computed using Expression (6), where V̂ is computed using a mixed-effects model with a random intercept and slope for year, assuming that each employee represents a level of the grouping factor.
We compare the results obtained with the BLBB, SDBB, ANS, and AN with those obtained after drawing 1,000 samples from the BB run on the full dataset. The analysis was performed on a computer with an Intel(R) Xeon(R) CPU E5-2699 v8 @ 2.20GHz processor and 236 GB of RAM (running R version 3.3.3). We consider b = n^γ with γ = 0.6, 0.7, and 0.8.

OPM-2011
A table containing the relative and absolute errors associated with β^{BB}_{lm,n} can be found in the supplementary material. The relative errors for interval lengths and standard deviations are rather small (less than 0.05) for all methods. The biases (absolute errors) are also small (less than 0.015). In practice, the BLBB and ANS can give biased results, depending on the partition. In almost all cases, the BLBB and SDBB produced errors that are smaller than or equal to the errors of ANS and AN.
In this application, the BLBB and SDBB provide satisfactory results for the values of n and γ we considered, and the effect of increasing n and γ is not particularly noticeable. On the other hand, the asymptotic methods improve their performance as n and γ increase. Figure 3 displays credible intervals for coefficients associated with race. All the methods provide satisfactory approximations.

OPM-10Y
Table 3 contains the relative and absolute errors associated with β^{BB}_{mx,n}. The errors are rather large for n = 10,000 and γ = 0.6. However, as n or γ increases, the BLBB and SDBB approximate the BB better. For n = 40,000 and γ = 0.8, the errors are fairly small. For this specific dataset, none of the methods outperforms the others, but we observe some patterns. For example, for γ = 0.6, the BLBB assesses uncertainty better than the SDBB, and for n = 10,000, the bias associated with the SDBB is smaller than the bias of the BLBB.
The relative errors for interval lengths and standard deviations are quite large with the asymptotic methods if we use the asymptotic estimator of the variance proposed in Jacqmin-Gadda et al. (2007). We also consider the sandwich estimator proposed in Liang and Zeger (1986), which is implemented in the R function vcovCR.lme of the library clubSandwich (Pustejovsky, 2018). The results with the sandwich estimator are labeled ANS-Sand and AN-Sand. Table 3 also contains the relative and absolute errors associated with these asymptotic methods. The errors associated with ANS-Sand are similar to the errors found with the BLBB. Figure 3 displays credible intervals for coefficients associated with different levels of race. Again, we observe that increasing n or γ improves the quality of the approximation. For n = 40,000, we do not observe big discrepancies between the credible intervals based on the BB and those based on the BLBB, SDBB, or ANS-Sand; however, the intervals with ANS tend to be too narrow. Additionally, in Figure 3 we observe that the most problematic coefficient is the first one (represented as β_1 in the figure), which is associated with a category with a small observed frequency (2%). If a variable has levels that are highly infrequent, approximating their coefficients via subsetting can be problematic: some of the subsets might have extremely small frequencies for some levels, and some levels might not be observed at all.
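A quick sketch makes the rare-level problem concrete. Under an iid-sampling approximation (an assumption for illustration), a level with population frequency f = 0.02, as for the 2% category above, is expected to appear only a handful of times in a subset of size b = n^γ for n = 10,000:

```python
# Illustrative only: expected count and absence probability of a level
# with frequency f in a random subset of size b, assuming iid sampling.
f = 0.02
for b in (251, 630, 1584):               # b = int(10_000**g) for g = 0.6, 0.7, 0.8
    expected = f * b                     # expected count of the level in the subset
    p_absent = (1 - f) ** b              # chance the level never appears
    print(f"b = {b}: expected count {expected:.1f}, "
          f"P(level absent) = {p_absent:.4f}")
```

Even when the level is almost surely present, an expected count of five observations leaves its coefficient poorly determined within a subset.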
In this specific context (OPM dataset, linear and mixed-effects regressions), our general recommendation to take subset sizes as large as possible leads to reasonable performance of the BLBB and the SDBB.

American Community Survey
The ACS-2012 dataset contains records of 1,477,091 households in the United States collected in 2012. The data were collected with the goal of making inferences about population demographics and housing characteristics. In this application, we only use the information at the household level and not at the individual level, and we model a binary outcome (response variable) that indicates whether or not the household pays for fire/hazard/flood insurance. We regress our binary outcome on a set of numerical and categorical predictors. As numerical predictors, we consider the number of people living in the household, number of bedrooms, number of rooms, number of vehicles, and household income (in the past 12 months). The categorical predictors are lot size, yearly food stamp/Supplemental Nutrition Assistance Program (SNAP) recipiency, house heating fuel, presence and age of children, and a discretized variable that indicates how long the families have lived in the household. This set of predictors leads to a regression with 21 coefficients. We use two subsets of the ACS-2012 dataset of n = 50,000 and 200,000 complete-case records corresponding to households that are located in the Northeast of the United States and are not rentals. We estimate the posterior mean and standard deviation, quantiles (2.5% and 97.5%), and lengths of the resulting 95% credible intervals for β^{BB}_{lg,n}. We compare the results obtained with the BLBB, SDBB, and the asymptotic methods ANS and AN with those obtained from drawing 1,000 samples from the BB run on the full dataset.

Table 4 contains the relative and absolute errors associated with β^{BB}_{lg,n}. The errors are moderately small (<0.09) even for n = 50,000 and γ = 0.6. We also observe that, as either n or γ increases, the BLBB and SDBB provide better results.
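The BB functional β^{BB}_{lg,n} refits a weighted logistic regression under Dirichlet weights, exactly as in the linear case but with a weighted likelihood in place of the normal equations. A minimal numpy-only sketch on synthetic data (the Newton-Raphson fitter and all numbers below are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def wfit_logistic(X, y, w, iters=30):
    """Weighted logistic regression fit via Newton-Raphson (numpy only)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                   # weighted score
        H = (X * (w * p * (1 - p))[:, None]).T @ X   # weighted information
        beta += np.linalg.solve(H, grad)
    return beta

# Synthetic binary outcome standing in for the insurance indicator.
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, 0.7])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

# BB draws of the logistic coefficients: refit under Dirichlet(1, ..., 1)
# weights on the n observations.
draws = np.array([wfit_logistic(X, y, rng.dirichlet(np.ones(n)))
                  for _ in range(200)])
print("posterior means:", draws.mean(axis=0).round(2))
```

Because each draw requires a full refit, the logistic case is where the subsetting of the BLBB and SDBB pays off most: the refits run on subsets of size b rather than on all n records.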
In this application, we cannot conclude that either the BLBB or the SDBB uniformly outperforms the other; for γ = 0.6, the BLBB is better at assessing uncertainty than the SDBB, whereas for n = 50,000, the bias associated with the SDBB is smaller than the bias of the BLBB. These patterns were also observed in the OPM-10Y dataset, where we used a mixed-effects model (see Section 4.1.2). The relative errors for estimating interval lengths and standard deviations with ANS and AN are large. The errors decrease as n increases, but they are never smaller than the errors of the BLBB and SDBB. The bias (absolute error) associated with AN is very small, whereas the bias associated with ANS is similar to the bias of the BLBB. This is not surprising because we use the same partition for both methods (BLBB and ANS). Figure 3 displays credible intervals for the coefficients of the variable that indicates how long the families have lived in the household. We choose to show these intervals because the observed frequencies of some of the levels are low, so they are particularly hard to estimate with subsetting methods. ANS is more sensitive to this specific issue (which is even worse when b is small) than the BLBB and SDBB. In general, we observe that ANS tends to output intervals that are too wide.

Discussion
We have presented the BLBB and SDBB, two data-subsetting procedures to approximate the BB. The BLBB and SDBB are analogous to the BLB (Kleiner et al., 2014) and SDB (Sengupta et al., 2016), and have theoretical and computational properties that are comparable to those of their frequentist counterparts. The performance of the methods has been illustrated and compared in simulation studies and on real datasets.

Although both the BLBB and SDBB are computationally efficient, the BLBB is preferable in scenarios where the computational cost of loading the subsets into memory is high. A similar conclusion can be drawn if the BLB and SDB are compared in an analogous setting. We observe that the BLBB approximates the uncertainty of the BB better than the SDBB, whereas the SDBB provides better approximations of point estimates than the BLBB. If the subsets can be loaded into memory reasonably fast and the functional of interest can be computed quickly, we recommend running rescaled bootstraps for a few subsets to check whether the posterior distributions of the functionals are similar. If they are, the SDBB will approximate the uncertainty as well as the BLBB does but with less bias; if they are not, the SDBB will overestimate the uncertainty and we recommend using the BLBB.

The performance of the methods depends on the size of the subsets and on the functional of interest. In general, we observe that increasing the subset sizes improves the approximation. This relationship between the quality of the approximation and subset size is not particular to our procedures; in fact, it is a common feature of data-subsetting methods.

In addition to the BLBB and SDBB, we provide a strategy for performing lossless inference for functionals that can be expressed as functions of expectations with respect to the probability measure of the BB.
This class is larger than one would expect at first glance: it includes, for instance, the weighted least squares estimator used in Clyde and Lee (2001) and Taddy et al. (2016), as well as the instrumental variables estimator introduced in Section 2 of Chamberlain and Imbens (2003).
Future work can extend our contribution. For example, it would be useful to determine which functionals are best estimated by the BLBB or SDBB (beyond empirical investigations), so that we can select and combine the methods as needed depending on the functionals we want to estimate. It would also be interesting to find strategies for determining when the sample size is big enough so that asymptotic methods (such as ANS, as defined in Section 3) can be used to our advantage. Another interesting direction for further research would be designing data-subsetting strategies for datasets that have categorical variables with levels of low observed frequency, which is an important practical issue, as we argue in Section 4.1.

Supplementary Material
The supplementary material has six sections: the first provides theoretical results for the processes proposed in Sections 2.1, 2.2, and 2.3; the second has a figure detailing the Monte Carlo algorithm for performing lossless inference for the class of functionals described in Section 2.3; the third contains a scheme for lossless simulation for the example in Section 2.3 of Chamberlain and Imbens (2003); the fourth explains how to perform lossless inference for the Dirichlet-Multinomial process; the fifth includes a table with relative and absolute errors related to the linear regression coefficients estimated from the OPM-2011 dataset in Section 4.1; finally, the sixth assesses the performance of the BLBB, SDBB, ANS, and AN in approximating coefficients of a quantile regression fitted to the OPM-2011 dataset.