Fast methods for posterior inference of two-group normal-normal models

We describe a class of algorithms for evaluating posterior moments of certain Bayesian linear regression models with a normal likelihood and a normal prior on the regression coefficients. The proposed methods can be used for hierarchical mixed effects models with partial pooling over one group of predictors, as well as random effects models with partial pooling over two groups of predictors. We demonstrate the performance of the methods on two applications, one involving U.S. opinion polls and one involving the modeling of COVID-19 outbreaks in Israel using survey data. The algorithms involve analytical marginalization of regression coefficients followed by numerical integration of the remaining low-dimensional density. The dominant cost of the algorithms is an eigendecomposition computed once for each value of the outside parameter of integration. Our approach drastically reduces run times compared to state-of-the-art Markov chain Monte Carlo (MCMC) algorithms. The latter, in addition to being computationally expensive, can also be difficult to tune when applied to hierarchical models.


Introduction
Advances over the last decade in statistical methods and their implementation in open-source, user-friendly software have drastically simplified statistical modeling for applied researchers.For example, with probabilistic programming languages such as Stan [Carpenter et al., 2017] a user can specify and sample from a very general choice of posterior density with flexible language and an easy-to-use interface.For its primary tool of inference, Stan (as well as other probabilistic programming languages) samples from the posterior distribution via dynamic Hamiltonian Monte Carlo sampler (HMC) [Betancourt, 2018, Hoffman andGelman, 2014].HMC is a grandient-based sampling method that has become ubiquitous in statistics over the last decade due to its being flexible, reliable, and general.
Despite its widespread use, HMC, as well as other Markov chain Monte Carlo (MCMC) methods, have a substantial drawback in statistical problems with large amounts of data -they can be prohibitively slow (and difficult to tune [e.g.Betancourt et al., 2015]).For example, in the case of a linear regression with n observations and k predictors, evaluation of the posterior density requires O(nk) operations with straightforward implementation.To make matters worse, MCMC methods require large numbers of evaluations of the posterior density, and in the case of HMC, the posterior's gradient.
Alternative methods for inference have been proposed for problems where MCMC is impractical.These approaches typically involve a suitable approximation of the posterior density with a function with desirable properties.Laplace approximation methods [e.g.Margossian et al., 2020] and variational inference [Blei et al., 2017] are two examples.More generally, there is extensive literature on efficient computational tools and analysis of posterior densities, and there are various software packages devoted to their implementation [see, e.g.Rue et al., 2017, Kristensen et al., 2016].
While these packages, and indeed most of the literature, are devoted to general tools for a wide range of posterior densities, in this paper we introduce an efficient algorithm for computing posterior expectations for two particular classes of Bayesian regression modelstwo-group normal-normal models and mixed-effects models.These classes of models find a broad range of applications in, for example, social sciences, epidemiology, biochemistry, and environmental sciences [Gelman et al., 2013, Gelman and Hill, 2006, Greenland, 2000, Merlo et al., 2005, Bardini et al., 2017].Furthermore, in the broader context of model development, these regression models can serve as template models [Gelman et al., 2020].
Using general MCMC methods for sampling from these posteriors can be exceedingly slow for problems with large amounts of data.By specializing to this particular family of models, we leverage their structure to create customized algorithms for fast and accurate inference.
The two Bayesian linear regression models we consider are: 1. Two group normal-normal: We define the two-group normal-normal model by where X 1 is a n×k 1 matrix, β 1 ∈ R k 1 is a vector of regression coefficients, X 2 is a n×k 2 matrix, and β 2 ∈ R k 2 is a vector of regression coefficients.For Bayesian inference, we assume priors on the scale parameters σ 1 , σ 2 , σ 3 .The performance of the algorithm is largely independent to the choice of these priors.In the models that we use in this paper, we assign independent weakly informative normal + (0, 1) priors on the variance parameters σ 1 , σ 2 , σ 3 (assuming y and the columns of X have been normalized to have standard deviation 1).

Mixed effects:
The mixed-effects model differs slightly from the two-group normalnormal model.Instead of modeling the scale parameter σ 2 , fixed scale parameters are assigned to the normal priors on β 2 .The mixed-effects model is defined by where σ 2,i is the fixed scale parameter prior on each regression coefficient β 2,i for i = 1, ..., k 2 where β 2 ∈ R k 2 .We will assume priors on the scale parameters σ 1 , σ 3 .
The models we discuss in this paper are standard models of Bayesian statistics and appear when seeking to model an outcome, y, as a linear combination of two (or more) distinct groups of predictors.The Gaussian prior on the predictors enable various strategies commonly used in statistical modeling and machine learning; notably regularization and partial pooling between various sources of data.We demonstrate these models on three applications.
1. COVID-19: Due to a lack of reliable, fast, and widespread testing, an online survey initiative was created in Israel [Rossman et al., 2020] for tracking and predicting COVID-19 outbreaks.We constructed a mixed-effects model for estimating geographic and age effects on the spread of the virus.With tens of thousands of responses, straightforward implementation of MCMC methods takes hours.Using the methods of this paper, we obtain accurate posterior inference in seconds.

Rat growth:
We demonstrate the efficiency of our two-group algorithm on the classical two-group model for rat growth [Gelfand et al., 1990], which estimates the growth rates of a population of rats over the first few weeks of life.

Public opinion on abortion:
We use 2018 results of the annual Cooperative Congressional Election Study (CCES) to estimate geographic and demographic effects on attitudes towards abortion.The CCES contains nearly 100, 000 responses, and performing inference via MCMC sampling can be prohibitively slow.We use the mixedeffects algorithm introduced in this paper to perform posterior inference in seconds.
The computational methods we introduce for the two-group normal-normal model and the mixed-effects models are closely related.In fact, the mixed-effects model is a special case of the two-group model.We organize this paper by first describing our algorithm for the two-group normal-normal model in detail, and then outline the minor modifications that allow for efficient evaluation of mixed-effects models.
The unnormalized density corresponding to the two-group model is given by where For convenience, we will be denoting by σ the vector of scale parameters (σ 1 , σ 2 , σ 3 ) ∈ R 3 .
In the methods of this paper, we compute posterior moments of q by analytically reducing the calculation of moments from integrals over k + 3 dimensions to 3-dimensional integrals.We then integrate the remaining 3-dimensional integrals with a tensor product of Gaussian nodes.For example, we evaluate the normalizing constant C and posterior means for the regression coefficients, β, via where σ i ∈ R 3 and w i ∈ R are three-dimensional Gaussian nodes and weights [Trefethen, 2020] and Integrals ( 4) and ( 5) can be evaluated analytically via well-known equations [Lindley and Smith, 1972], but a straightforward implementation of those equations results in a computational cost of O(m 3 k 3 ) operations where m is the number of discretization nodes needed in each dimension.In the methods of this paper, we improve the computational cost of those integrals to O(mk 3 + m 2 k 2 + m 3 ) operations after a change of variables, allowing for the rapid evaluation of marginals.The tools used in the algorithm of this paper are a generalization of the approach proposed by Greengard et al. [2021] and generalize to higher-dimensional multilevel and higherdimensional multigroup posterior distributions.Since we integrate the marginal density using a tensor product of Gaussian nodes, the cost of the integration scales like O(m d ) where m is the number of discretization nodes in each direction and d is the dimension of the marginalized integral (where d = 3 in the models of this paper).As a result, higher dimensional problems require evaluation of marginal integrals via sampling-based algorithms and cannot rely solely on Gaussian quadrature.We leave the analysis and description of numerical tools for such models to a subsequent publication.
The structure of this paper is as follows.In the following section we describe the change of variables of (3) and provide mathematical analysis that will be used in the algorithm of this paper.Section 3 includes formulas that will allow for the evaluation of the normalizing constant of (3).In Section 4 we describe analysis that will be used in computing expectations and in Section 5 we describe formulas for second moments.In Section 6 we discuss details of the implementation of the algorithm.Section 7 contains an algorithm for a special case of the two-group normal-normal model in which one group is much smaller than the other.The mixed effects algorithm, or rather the modification of the two-group normal-normal algorithm, is contained in Section 8.In Section 10, Section 9 and Section 11 we apply the algorithms of this paper to applications.Conclusions and generalizations of the algorithm of this paper are presented in Section 12.

Mathematical apparatus
Over the next several sections, we describe a numerical algorithm for computing expectations and second moments of the density q : R k+3 → R + defined by , and y ∈ R n .We begin by introducing notation that will be used throughout the numerical sections of this paper.Let y = X β + d, where X t d = 0, so that β is the least-squares solution to the linear system Xβ = y and d is the residual.Let I 1 be the diagonal k × k matrix with ones in the first k 1 places on the diagonal, and zeroes in the remaining k 2 places.Similarly, let I 2 be the diagonal k × k matrix with zeroes in the first k 1 places on the diagonal, and one in the remaining k 2 places.
We perform a change of variables in σ 1 , σ 2 , and σ 3 , defining ρ, θ, and φ implicitly by This corresponds to changing to spherical coordinates in the σ variables.With these substitutions, and with some minor abuse of notation, we obtain The differentials transform as follows: dσ 1 dσ 2 dσ 3 = ρ 2 sin φ dρ dθ dφ.

Normalizing constant
In this section we describe the computation of the integral of f over its domain.If we denote this quantity by I 0 , then f /I 0 is a probability density on R + × R + × R + × R k .We begin by making a change of variables in β, setting β = (I 1 cos θ + I 2 sin θ)z.Similarly, we define z implicitly by β = (I 1 cos θ + I 2 sin θ)z.Then, where X θ = X (I 1 cos θ + I 2 sin θ).The differentials transform as follows To diagonalize the quadratic form appearing in the exponent, we perform an eigendecomposition of on X t θ X θ obtaining where D θ ∈ R k×k is diagonal and positive, and V θ ∈ R k×k is a unitary matrix.We have assumed here that n ≥ k.If the converse is true then a small modification is required.In the following, for notational convenience we denote the diagonal entries of D θ by λ i (θ).
Next we set z = V θ w and z = V θ w.In terms of the original variables, β = (I 1 cos θ + I 2 sin θ)V θ w and β = (I 1 cos θ + I 2 sin θ)V θ w.In particular, After making these substitutions, we obtain Thus, the integrals over the β variables have been reduced to the product of k one-dimensional Gaussian integrals.Using the identity we find that Next, we define the functions α ,θ) .
For a fixed θ, the vector w and the eigenvalues values λ i are independent of ρ and φ, and hence need to be recomputed only when θ is changed.Moreover, for fixed θ and φ, the above integral can be computed in O(1) floating operations for each new value of ρ.In particular, the normalization constant can be computed efficiently via the formula 4 Expectations of β In this section we describe how to compute moments in β of the distribution f .We begin by observing that where Using the results of the previous section we can perform the integrals to obtain We remark that the second factor in the above expression is the same as the one arising in the computation of I 0 .The first factor depends on j, φ, and θ but not on ρ.For ease of exposition, let us define Mj by Then In particular, if 1 is defined via the formula then and hence all k moments can be computed simultaneously with an integrand requiring O(k 2 ) operations to compute (assuming the number of quadrature nodes required to achieve a fixed precision is more or less independent of k).

Covariance of β
In this section, we describe formulas for computing the posterior covariance matrix of β.We use the identity to compute second moments of w j .That is, letting P j (ρ, φ, θ) be defined by we use ( 12) to obtain and defining 2 via the formula we observe that where C is the k × k matrix defined by We then compute the posterior covariance of β via where E[β] is obtained via (11).

Numerical implementation
We now describe a numerical approach for computing the normalizing constant I 0 , see (7), and moments of q (see ( 3)).The quadrature rules used provide arbitrary user-specified precision for both the normalizing constant as well as moments.Our integration scheme is a tensor product of Gaussian nodes in the θ, φ, and ρ directions.
In the remainder of this section we describe how to adaptively determine integration bounds in the φ and ρ directions as well as the number of nodes to use in θ.

Integrals with respect to ρ
We first describe an approach for integrating the inner integral, We do this by finding upper and lower integration bounds and then performing Gaussian quadrature with 80 nodes.The maximum of the integrand of ( 15), which we denote ρ max , is achieved at See Appendix A for details.Furthermore, for all ρ > ρ max , the integrand is monotonically decreasing and for all ρ < ρ max , the integrand is monotonically increasing.We choose our upper integration bound, ρ 1 , to be the value ρ 1 > ρ such that the integrand of ( 15) is smaller than its maximum by a factor of 10 20 .We evaluate ρ 1 with bisection with initial bounds of where σ is the second derivative of the integrand of ( 15) with respect to ρ evaluated at ρ max (see Appendix A).We also find the lower bound of integration using bisection where our starting bounds for the bisection are 0 and ρ max .After finding the bounds of integration, we evaluate (15) using Gaussian quadrature.Using 80 nodes was sufficient for full double precision accuracy in our examples.

Integral with respect to φ
We evaluate the integral I (j) 1 (θ) (see ( 10)) using Gaussian quadrature where we determine the upper bound of integration adaptively.That is, we determine the upper bound of integration, φ 1 , by sequentially evaluating the integrand of (10) at order 100 Gaussian nodes until the integrand is smaller than the maximum observed value of the integrand by a factor of 10 20 .We then declare the first such point, φ 1 , to be the upper integration bound and evaluate (10) with 80 Gaussian nodes on the interval (0, φ 1 ).

Integral with respect to θ
In this section, we describe a technique for evaluating the outer integral of (11).For each evaluation of the integrand of (11), we perform eigendecomposition (6).As a result, each evaluation of the integrand requires O(k 3 ) operations and the total computational time for evaluating moments of posterior (3) is roughly linear in the number of evaluations of the integrand of (11).
We compute integral (11) using Gaussian quadrature.We found that for a general set of problems, the integrand of (11) was sufficiently smooth, that using around 10 nodes was sufficient for several digits of accuracy.To check accuracy of an m-point Gaussian quadrature we compute the same integral with 2m nodes and check the difference, which is a proxy for the accuracy of the m-point Gaussian quadrature.
For problems in which the number of evaluations of the integrand of (11) needs to be reduced, we can use the following approach.Approximate the integral using two Chebyshev nodes (see, e.g., Trefethen [2020]) and four Chebyshev nodes and compute the difference between the two values.That difference provides an estimate for the error of the approximation using two nodes.If the difference between the two approximations is larger than desired, then double the number of nodes and again compute the difference.Continue to double the number of nodes until the desired accuracy is achieved.
By using practical Chebyshev nodes in θ the order n nodes are a subset of the order 2n nodes.This saves n evaluations of the integrand when approximating the integral with 2n nodes.
Algorithm 1: Two-group normal-normal models: Evaluation of posterior expectations.In this section we consider the special case where the posterior density f corresponds to an intercept model and one of the two groups is much smaller than the other.That is, k 1 is small relative to k 2 , and each row of X contains two non-zero entries-one in the first k 1 columns and another in the next k 2 columns.This model appears in applications in which we seek to model an outcome, y, as a combination of two unrelated factors.
For the corresponding posterior, we introduce a fast algorithm that analytically marginalizes the normal-normal parameters β by evaluating a determinant and solving a linear system.
Then R of (18) satisfies For ease of exposition, let β be the function defined by The following lemma will be used in Theorem 1 and will provide formulas for computing posterior moments of f .
Lemma 1 (moments of a Gaussian).Let C be a symmetric positive definite k × k matrix.Then, for any vector Proof.The proof of the first two is immediate.For the second, let z = √ C(β − β).Then, the integral becomes The following theorem follows immediately from the previous lemma and provides formulas that will be used to compute posterior moments of f .Theorem 1.Let f be the unnormalized probability density defined above.Then 1. 2. 3.

Numerical apparatus
In this section, we describe a numerical method for exploiting the structure of X t X and R to obtain a computationally efficient strategy for computing moments of f.
We use the following lemma to compute the determinant found in each integral of Theorem 1.The lemma follows immediately from the Schur complement formula [Trefethen and Bau, 1997].
Lemma 2. Let X t X be the matrix given by and R be the matrix given by In the following lemma, which follows immediately from elementary linear algebra operations, we show that X t X + R can be written as D + U U t where U is a k × 2k 1 matrix and D is diagonal.This will be used to solve the linear system (18) via the Sherman-Morrison-Woodbury formula.
Lemma 3. Suppose X t X and R are the same matrices as defined previously.Then The following theorem allows us to efficiently solve the linear system (18) and follows immediately from the combination of the previous lemma with the Sherman-Morrison-Woodbury formula [Trefethen and Bau, 1997].
Theorem 2. Let X t X and R be as defined above.Let W (ν 3 ) be the k 1 × k 1 matrix defined by where Using the preceding theorems and the quadrature rule described in Section 6, we compute posterior moments of the special case of f described in this section.

Mixed effects
From a computational standpoint, the mixed effects model is nearly identical to the twogroup normal-normal model, however they differ in one key respect.In the two-group normalnormal model (see (1)), the scale parameters σ 1 and σ 2 are treated as unknowns that are fit to the data and assigned priors-they're treated as modeled coefficients, also called "random effects."In the mixed effects model, σ 1 is still a random effect, however instead of treating σ 2 as a random effect, each regression coefficient in the second group of predictors, β 2,i is given a normal prior with fixed scale parameter.The corresponding Bayesian model is where σ 2,i is the fixed scale parameter prior on each regression coefficient β 2,i for i = 1, ..., k 2 where β 2 ∈ R k 2 .For the purposes of demonstrating the algorithm in this paper, we assign the priors The choice of priors on σ 1 and σ 3 is somewhat arbitrary.The algorithm described allows for general choices for these priors.The unnormalized posterior density for the mixed effects model, f : R k+2 → R, is given by where σ 3 ∈ R k 2 is a vector of fixed scale parameter priors for the regression coefficients β 2 ∈ R k 2 .Now our density is over k+2 dimensions-the k regression coefficients, β = (β 1 , β 2 ), and the two scale parameters, σ 1 and σ 2 .With a change of variables we convert the posterior density, f , to the posterior density of a two-group normal-normal model where one scale parameter is fixed.That is, we now convert f (β, σ 1 , σ 2 ) to q(β, σ 1 , σ 2 , 1) where q is the posterior density of a two-group normal-normal model (see ( 3)).
We first scale the last k 2 columns of X by σ 2 3 and define X to be resulting matrix: Xi,j = X i,k 1 +j σ 2 3,j for i = 1, ..., n and for j = 1, ..., k 2 .We define β2 to be the vector and define f by the unnormalized density It follows that posterior means and standard deviations of and all other posterior first and second moments are unchanged under density f .We've now reduced the problem of finding moments of f to finding moments of f , which is equal to q( β, σ 1 , σ 2 , 1) where q is defined in (3).
At this point, we rely on the analysis and numerical tools of the two-group normalnormal model for evaluating posterior moments.The only difference between evaluation of moments of the two-group model is that in the two group model the marginal density is a 3-dimensional density over (σ 1 , σ 2 , σ 3 ) whereas in the mixed effects model, the marginal density is over two dimensions, (σ 1 , σ 2 ) and σ 3 = 1 is fixed.As a result, we perform the change of variables from (σ 1 , σ 2 , 1) to polar coordinates .
The differentials become We can now evaluate the posterior of f with Algorithm 1, where the integral with respect to ρ is replaced with ρ = σ 2 1 + σ 2 2 + 1.

A simple example: Hierarchical linear model
We demonstrate the algorithm on a hierarchical linear model describing the growth of a group of young rats over a period of several weeks; this is a small example that has been used in the statistical literature [Gelfand et al., 1990].In the experiment, the weight of each rat is measured at regular time intervals.Regression coefficients are computed for each rat; that is, for the j th rat, we estimate an intercept α j and a linear coefficient β j .We assign a normal prior on both parameters and estimate the prior scale.The full model is as follows: where X 1 is an indicator matrix indicating to which rat each observation (weighing) corresponds; X 2 is that same indicator matrix multiplied by w − w, where w is the observation week and w the mean observation week.In other words, we have an intercept and a slope parameter for each rat.The data is centered at 0 and the priors on the scale parameters are weakly informative.
We demonstrate the efficiency of the two-group normal-normal Algorithm 1 on evaluating posterior means and standard deviations of the rats model on randomly generated data.We assume an experiment with 100 rats and 20 weighing times and randomly generated data for each weighing.As a result, matrix X 1 and X 2 of model ( 20) are 2000 × 100 matrices.
Because the data size is relatively small and the data matrices have a friendly structure, running MCMC with Stan (4 chains in parallel, each with 1,000 warmup iterations and 1,000 sampling iterations) only takes 13.9s.Our algorithm takes 1.6s and achieves significantly smaller errors than MCMC estimates.Figure 1 shows the error of the MCMC estimates as a function of time and the accuracy achieved by our algorithm.For problems with larger data, the difference in time scale becomes important.

Application: COVID-19 symptom survey
As of the writing of this article, the coronavirus pandemic is still raging in many countries and stressing healthcare systems around the world.A challenge at the start of the pandemic was tracking its spread, especially in locations where reliable testing was not widely available.Having accurate estimates of infection rates across geographical regions can be extremely helpful.For example, reliable estimates allow hospital systems to allocate resources efficiently, they can alert residents of the need to take extra precaution in their daily rou-tines, and they can facilitate better policy from local governments.In order to get improved estimates of infection rates in the absence of widespread testing, initiatives were deployed in early 2020 in several countries that allowed individuals to report symptoms via publicly available surveys [e.g., Segal et al., 2020].
One country where these surveys provided valuable information was Israel [Rossman et al., 2020], where demographic and health data was provided by tens of thousands of respondents across the country.The large amount of data collected from survey respondents provided data scientists and policy-makers with a great resource, however at the same time, large amounts of data turns the computational aspect of statistical modeling into a substantial challenge.
In this section, we present an exploratory model used to analyze data from the COVID-19 survey conducted in Israel [Rossman et al., 2020].Using straightforward MCMC with Stan [Carpenter et al., 2017] was inconvenient; using the full data set resulted in runtimes of several hours.Using the algorithms of this paper, we were able to evaluate posterior moments in seconds without loss of accuracy.

Multilevel regression and poststratification procedure
The respondents are anonymous, but several of their features are recorded, including their age and the city in which they live.We can use the data to identify regions in which the average symptom score seems unusually high.
A first exploratory model uses an intercept, age group, and population density in the respondent's city, as covariates, X, and an indicator matrix Z for city: with a hierarchical prior on the city parameters, and weakly informative priors on the other coefficients, This unit prior is weakly informative if the outcome y i has been standardized and the continuous predictors (in this case, population density) has also been standardized to be on unit scale.
In addition, we put weakly informative half-normal priors (standard normal distributions restricted to the non-negative reals) on the hyperparameters σ 1 and σ 2 : This corresponds to a two-group normal-normal model with an additional covariate.In cities where u cannot be well estimated due to a low response rate, we can rely on the rest of the model, that is a regression model based on age and population density.
Only a fraction of the population responds to the survey, which raises questions about biases.This is notably a concern because different age groups behave differently: not only do their chances of contracting and spreading the disease vary, their susceptibility to the disease also changes.In multilevel regression and poststratification (MRP), we adjust for these biases by using estimates of the proportion of people in each city that belong to each age group.For this model, the proportions are estimated using census data.This leads to a corrected estimate for the expected symptom score of an individual in city i: where β 0 is the intercept, β density,i is the regression coefficient of the population density covariate, d i denotes the density of city i, a i j is the proportion of individuals in the j th age group in the i th city, and β age,j is the regression coefficient of age group j.
Using the means and covariances of u, β 0 , β density , and β age we compute the posterior mean and variance for ũ, per the following formulas.Given a linear combination of random variables, Y = i δ i Z i , we have Moreover, variance and thence standard deviations of ũ can be computed, provided we also evaluate the relevant posterior covariances.

Comparison of our algorithm to MCMC
We analyze the data collected over the two weeks between April 15 th and 30 th 2020, across 351 cities.These are cities for which we know, through census data, the population density and the age distribution.The total number of responses is 135,501.
Our proposed algorithm returns the posterior mean and standard deviation for all variables of interest and takes ∼7s to run.
We next fit the model in Stan using the default dynamic HMC sampler.After warming up the sampler for 500 iterations, we compute another 500 draws, using 4 chains computed in parallel, for a total of 2,000 sampling iterations.The wall time for this procedure is ∼12,000s (>3 hours).For each city, we computed the Monte Carlo mean. Figure 2 plots the posterior mean and standard deviation of ũ for all cities, computed by both methods.Figure 3 shows the difference between our algorithm and the Monte Carlo estimate, as a function of computation time.While it takes on the order of hours to get accurate results with MCMC, our algorithm achieves better results within seconds.

Limitations of the model and our numerical method
We believe the presented model offers an improvement on the analysis conducted on the survey data [Rossman et al., 2020], because (i) it uses full Bayesian inference to quantify   uncertainty and (ii) it corrects sampling biases using a poststratification step.A more careful quantification of uncertainty would use posterior intervals, rather than posterior variance.Such an interval can be estimated using MCMC draws.Extending our numerical scheme into a sampling scheme to estimate such intervals is a direction we are actively pursuing.

Regression MCMC Accuracy
For the model of this paper we only used a fraction of the available covariates, that is, the data collected in survey responses.As a result, the model can be extended to include more than two groups.Estimates tend to be noisy because the studied covariates can be strongly correlated with the outcome.For example, age is correlated with intensity of symptoms.The marginal correlation however is weak.This, and other considerations, suggest that it might be beneficial from a modeling standpoint to build a more sophisticated model, which might be outside of the scope of application of the methods of this paper.Nevertheless, the model considered here is an important step in the development of a better model.

Application: Public opinion on abortion policies
We next apply our method to a hierarchical linear regression used to model attitudes on abortion policies as they vary across states, ethnicity, age groups, and education levels.Modeling this heterogeneity requires partitioning an initially large data set into small groups.Furthermore, we must address biases that can arise in our survey and correct them using more comprehensive surveys, such as census data.As in Section 10, we use MRP to do inference for small slices of big data and correct biases in our survey.
We analyze data from the 2018 Cooperative Congressional Election Study (CCES) using, as in the case study of [Lopez-Martin et al., 2020], a random subset of 5,000 respondents.Respondents express support or opposition on six abortion policies, for example "Ban abortion after the 20 th week of pregnancy" or "Allow employers to decline coverage of abortion in insurance plan."These policies are intended to restrict access to abortion.Each respondent is given a support score, y, ranging from 0 to 6, indicating the number of supported policies.
We use a normal likelihood with the following covariates, recorded for each respondent: state, ethnicity, age group, education level, and sex.We use the proportion of votes for the Republican party in the state in 2016 as an additional predictor, denoted as repvote.The model also admits an intercept term.The statistical formulation of the model is the following: The difficult parameters to estimate here are the state coefficients, to which we give normal(0, σ 2 ) priors.Because the model includes repvote, the partial pooling is done toward the prediction of the state based on its previous vote, not toward the national mean.
Table 2 summarizes the performance of the algorithm on this model.The posterior mean and standard deviation of the MRP estimates for each state can be computed as in Section 10 and are plotted in Figure 4.
Figure 4 shows that the expected support score increases with the level of support for the Republican party, bearing some fluctuations.The large posterior standard deviations n k 1 k 2 max error total time (s) 5000 50 19 1.2 × 10 −8 0.05 Table 2: Computation time and accuracy of Algorithm 1 applied to a model of support/opposition for abortion policies.The column "max error" shows the maximum error of posterior means and standard deviations of regression coefficients and scale parameters.indicate there is quite a bit of heterogeneity within each state.For further insight, we may examine how groups other than states, e.g.ethnic groups, age groups, etc. behave.The present model has certain limitations.First, one could consider interaction terms.This seems sensible since, for instance, white males with no college education likely behave differently than white males with a college degree.The numerical method presented in this paper can handle interaction terms.Computing the posterior standard deviation of the MRP estimate however requires some data wrangling.We plan to create an R package with routines that seamlessly implement these MRP calculations, making it straightforward for modelers to experiment with different covariates and interaction terms.
There is also interest in nonlinear models with non-normal likelihoods.Lopez-Martin et al. [2020] consider an item-response or ideal-point logistic regression.This sort of model can better capture certain characteristics of the data, such as dependence among different survey responses.For such models, we cannot use the proposed integration scheme.This presents us with a tradeoff: the proposed algorithm takes a fraction of a second to run, while fitting the ideal point model with Stan's MCMC takes hundreds of seconds.The difference is more severe if, rather than fitting a subset of 5,000 respondents, we use all 60,000 respondents in the survey.The modeler then needs to assess how useful it is to use a non-normal likelihood.Even then, the normal likelihood model can be a fast way to do model exploration, by for example examining various covariates and interaction terms.

Conclusions and generalizations
In this paper we describe a class of fast algorithms for evaluating the posterior moments of two Bayesian linear regression models: 1. Two-group normal-normal: The two-group normal-normal model is used to model a continuous outcome with two groups of parameters: where X 1 is a n × k 1 matrix of predictors, β 1 ∈ R k 1 is a vector of regression coefficients, X 2 is a n × k 2 matrix of predictors, and β 2 ∈ R k 2 is a vector of regression coefficients.
2. Mixed effects model: The mixed-effects model is a slight variant of the two-group normal-normal model.In the mixed-effects model we model the scale parameter on one group of coefficients and assign fixed scale parameters to the priors on all other coefficients: y ∼ normal(X 1 β 1 + X 2 β 2 , σ 3 ) β 1 ∼ normal(0, σ 1 ) β 2,i ∼ normal(0, σ 2,i ) where σ 2,i is the fixed scale parameter prior on each regression coefficient β 2,i for i = 1, ..., k 2 where β 2 ∈ R k 2 .
The algorithms of this paper allow for assigning a general choice of priors on the scale parameters.We demonstrated the performance of our algorithm for posterior inference on two applications.In Section 10 we used COVID-19 symptom survey data to model geographic and age effects.We also used the mixed-effects model with public opinion survey data to estimate geographic and demographic impacts on attitudes towards abortion.These are both existing applications that have been fit with MCMC; by allowing these models to be fit much faster, our algorithm can facilitate a workflow in which users can fit and explore many more models in real time.
The algorithms of this paper provide substantial improvements over standard MCMC methods in both computation time and accuracy in approximating posterior moments.These improvements rely on analytically integrating the regression coefficients, which make up the bulk of the posterior dimensions, and then numerically integrating the remaining lowdimensional density with Gaussian quadrature.
Many of the techniques and analysis used in this paper generalize to multilevel and multigroup models with more than two-groups.For an m group model, the numerical integration of our algorithm is computed over a m + 1 dimensional density.For models with large m (large number of groups) the analytic marginalization of this paper can still be applied, however, integration via a tensor product of Gaussian nodes will not be feasible.On the other hand, using MCMC or other integration schemes can be used on the m + 1 dimensional marginal density.
Bayesian models with more than two groups and non-Gaussian likelihoods are directions of future research.(31)

Figure 1 :
Figure 1: Error of MCMC estimates via Stan as a function of run time.The horizontal orange line is the error of Algorithm 1.

Figure 2 :
Figure 2: Posterior mean and standard deviation for ũ computed using Algorithm 1 and MCMC.The points represent the estimated mean and the "error bars" span two standard deviations.

Figure 3 :
Figure3: Error of MCMC estimates via Stan as a function of run time.As a benchmark, we use the estimate returned by Algorithm 1.The horizontal orange line is the error of our method, which took 7 seconds to run.MCMC in Stan required over 6, 000 seconds of warmup before error can be measured.Total run time for Stan was ∼ 12, 000 seconds.

Figure 4 :
Figure 4: MRP estimate of the expected level of support for anti-abortion policies in each state.The point represents the posterior mean, and the bars span two posterior standard deviations.The states are ordered based on Republican vote share in the 2016 presidential election.

Table 1 :
Accuracy of the approximation of posterior means for several regression coefficients with both MCMC and Algorithm 1.For MCMC, the total time for the approximation was ∼12,000s.Total time for Algorithm 1 was 7s.