Fast and Accurate Estimation of Non-Nested Binomial Hierarchical Models Using Variational Inference

Non-linear hierarchical models are commonly used in many disciplines. However, inference in the presence of non-nested effects and on large datasets is challenging and computationally burdensome. This paper provides two contributions to scalable and accurate inference. First, I derive a new mean-field variational algorithm for estimating binomial logistic hierarchical models with an arbitrary number of non-nested random effects. Second, I propose"marginally augmented variational Bayes"(MAVB) that further improves the initial approximation through a step of Bayesian post-processing. I prove that MAVB provides a guaranteed improvement in the approximation quality at low computational cost and induces dependencies that were assumed away by the initial factorization assumptions. I apply these techniques to a study of voter behavior using a high-dimensional application of the popular approach of multilevel regression and post-stratification (MRP). Existing estimation took hours whereas the algorithms proposed run in minutes. The posterior means are well-recovered even under strong factorization assumptions. Applying MAVB further improves the approximation by partially correcting the under-estimated variance. The proposed methodology is implemented in an open source software package.


Introduction and Motivating Example
Hierarchical models, often known as multilevel, mixed, or random effects models, are ubiquitous in the social sciences (Gelman and Hill 2006;Rabe-Hesketh and Skrondal 2008).In political science alone, these models are used for addressing unobserved heterogeneity, explicitly modeling dependence between observations, allowing effects to vary across space or time, and many other applications (e.g.Clark and Linzer 2015;Bell and Jones 2015;Steenbergen and Jones 2002;Stegmueller 2013).They are also popular in other fields such as educational research and psychology.
The benefits and challenges of these models can be illustrated by an increasingly popular application for survey research in social science: Multilevel Regression and Post-Stratification (MRP; Gelman and Little 1997;Park et al. 2004;Gao et al. 2020).Described in more detail in Section 5, the core purpose of this method is to extrapolate outcomes from nationally representative surveys to small geographic areas with limited Second, I outline a generic procedure for improving an initial variational approximation when a parameter expansion of the underlying Bayesian model exists.I do this by drawing a connection to "marginal augmentation" from the Markov Chain Monte Carlo literature (e.g.Liu and Wu 1999;Van Dyk and Meng 2001) and showing that this parameter expansion often permits a nearly costless improvement of the initial approximation.The method ("marginally augmented variational Bayes"-MAVB) transforms the initial approximation by sampling the expansion parameter and re-transforming the original samples while maintaining the stationarity of the target posterior.This induces dependencies between the parameters that were assumed away in estimating the initial procedure and provides a provable guaranteed improvement upon the original approximation.
Methodologically, this pushes forward the literature on variational inference for hierarchical models by extending work in the case of a single random effect (Hall et al. 2011;Ormerod and Wand 2012;Tan and Nott 2013;Hall et al. 2019) or two non-nested random effects (Jeon et al. 2017;Menictas et al. 2019) to the general case.The proposed method requires no integration, unlike many existing methods for binary outcomes (Ormerod and Wand 2012;Tan and Nott 2013;Jeon et al. 2017).It further provides a link to existing work that seeks to combine Markov Chain Monte Carlo and variational inference by stochastic optimization (e.g.Salimans et al. 2015;Ruiz and Titsias 2019;Yin and Zhou 2018).Instead of optimizing the transformed density, MAVB transforms the samples from the initial approximation with a partial step of MCMC using marginal augmentation that, in practice, appears as performing a stochastic location/scale transformation of the sampled parameters.This leverages a sampler that is known to mix well in the case of fully Bayesian MCMC and lacks internal tuning parameters as its primary goal is to find a computationally inexpensive way to improve an initial approximation.While it bears some similarities to work on re-parameterization in hierarchical models for variational algorithms (e.g.Tan and Nott 2013;Tan 2021), it does not fix the reparameterization in advance of estimation.It differs from other approaches that seek to improve an initial approximation (e.g.linear response variational Bayes; Giordano et al. 2015) in that it has a guarantee on improving the approximation quality.Future work could examine how such methods work alongside Polya-Gamma data augmentation.
The remainder of the paper proceeds as follows.Section 2 states multiple factorization assumptions under which Polya-Gamma augmentation can be used to estimate a variational approximation for a binomial logistic hierarchical model.Section 3 links parameter expansion to variational Bayes and explains MAVB formally.
Sections 4 and 5 conduct simulations and examine performance on the empirical example (Ghitza and Gelman 2013).The latter shows dramatic gains in speed: Even after applying MAVB and drawing 4,000 samples, the fastest variational algorithm is nearly 60 times faster than Laplace approximation and nearly 350 times faster than Hamiltonian Monte Carlo for the most complex models.This reduces the run time from hours to minutes.All variational methods well-recover the posterior means.While the strongest factorization assumptions have poor performance in terms of estimating the posterior variance, applying MAVB corrects a large amount of the problem.
Section 5 then uses this algorithm to engage in model comparison that was computationally infeasible in Ghitza and Gelman (2013).I perform 10-fold cross-validation across nine models ranging from having four to 18 random effects and thousands of parameters.The process takes around 30 minutes compared to the hours needed to fit even a single model once using existing approaches.The results provide some evidence of over-fitting in the original specification suggesting that the most complex model does not outperform models of intermediate complexity.I use this to draw out some guidance for practitioners of MRP in other substantive domains.

Mean-Field Variational Inference for Binomial Hierarchical Models
I focus on the following generative model that is broader than MRP but also captures the majority of applications.For each observation i ∈ {1, • • • , N}, the researcher observes y i "successes" out of n i trials (e.g.how many individuals in a population of size n i turn out to vote).I model this using a binomial distribution with probability of success p i defined via a linear predictor (ψ i ) put through a logistic link.Equation (1) expresses this model using a "general design" notation (Zhao et al. 2006).Appendix A (Goplerud, 2021) shows the model using Gelman and Hill (2006)'s notation and a plate diagram.
As is standard in hierarchical models, the linear predictor consists of p "fixed" effects: x i ∈ R p .The hierarchical component contains J random effects indexed from j ∈ {1, • • • , J}.For each random effect j, there is a d j dimensional covariate vector indexed by z b i,j where z b i,j = 1 represents the ubiquitous "random intercept."Each random effect has g j groups and each observation i is assigned to exactly one group for each random effect; define its membership for random effect j as a one-hot vector m i,j ∈ {0, 1} gj .
The notation in equation ( 1) stacks together the hierarchical components as follows; first, for each random effect j, z i,j represents a d j × g j length vector (mostly sparse) by the Kronecker product (⊗) of the group membership vector m i,j and the base covariate.This repeats z b i,j once in the position corresponding to the group of which i is a member for random effect j.This allows us to model the distribution of the entire parameter vector for random effect j (α j ∈ R gj •dj ) as a multivariate normal with a block diagonal matrix where each block is given an identical Inverse Wishart prior as noted in equation ( 1)b (Σ j ∈ R dj ×dj ; Φ j ∈ R dj ×dj ).Using such priors is standard in the literature on variational inference for hierarchical models (e.g.Tan and Nott 2013), although extensions to more weakly informative priors are possible (e.g.Huang and Wand 2013).The compact notation in equation (1)a stacks together all random effects j into a single vector z i ∈ R J j=1 gj •dj that is highly sparse.It thus accommodates designs with arbitrary patterns of crossing (non-nesting) amongst the J random effects.
A key distinguishing feature of this model as applied to MRP is that J can be large (e.g.greater than ten) and g j ranges widely from a handful up to over a thousand (e.g.g j = 4 for ethnicity and g j = 1, 020 for state-ethnicity-age combinations in Ghitza and Gelman 2013).In most applications for MRP, d j = 1 and z b i,j = 1 (random intercept) but sometimes d j = 2 in the case of a random slope and intercept (Gelman and Hill 2006).Regarding the other parameters, for most applications of MRP, N is often relatively modest given post-stratification requirements (see Section 5) and that surveys can be collapsed into units with identical state-demographic covariates by allowing varying n i .Thus, in many studies, N can be made smaller than 10,000 (e.g.below 5,000 in Park et al. 2004;Ghitza and Gelman 2013).The size of β (p) is also usually modest and below ten.
By using Polya-Gamma augmentation, the model in equation ( 1) can be rendered conditionally conjugate, enabling the straightforward application of numerous standard algorithms for Bayesian inference (Polson et al. 2013).Specifically, equation (2) from Polson et al. (2013) states that for any a, b > 0 the following identity holds, where f P G (ω|b, c) denotes the Polya-Gamma density with parameters b and c.The definition of a Polya-Gamma variable as a weighted infinite convolution of Gamma random variables is also shown.
Thus, the complete data likelihood can be expressed as follows where Ω denotes the N × N diagonal matrix of the corresponding ω i , X, Z stack the data for each observation into N × p and N × J j=1 d j g j design matrices, and s is a N × 1 vector with [s] Noting the result from Polson et al. (2013) that the full conditional of , it immediately follows that a Gibbs Sampler exists to sample all of the parameters in the model where the full conditionals on β and α are normal and Σ j is Inverse Wishart.

Variational Inference
The first contribution of this paper is to use the Polya-Gamma representation above to find a tractable variational algorithm to approximate the joint posterior of p(β, α, {Σ j }, Ω|y) and thus the joint posterior on the parameters excluding Ω. Blei et al. (2017) provides a recent review of these methods.Equation (4) formulates the problem where X denotes some (restricted) set of distributions to optimize over.It can be equivalently expressed as finding the closest distribution in X to the true posterior in terms of KL-divergence.For notational simplicity, denote θ = {β, α, {Σ j } J j=1 , Ω}. q * (θ) = arg max q(θ)∈X ELBO q(θ) where ELBO q(θ) = E q(θ) [ln p(y, θ)] − E q(θ) [ln q(θ)] (4) A common method for solving this problem is known as "coordinate ascent variational inference" (CAVI; Blei et al. 2017).It maximizes or increases the target ELBO with respect to some sub-block of θ.By cycling through θ repeatedly, a local optimum can be obtained.The choice of restriction X is crucial to the accuracy of the approximation method; an extremely popular choice is a "mean-field" factorization assumption where blocks of parameters are assumed to be independent.
Leveraging the existence of a Gibbs Sampler, Result 1 states that the augmented posterior on q(θ) can be approximated using a number of mean-field assumptions with no further restrictions on distributional form, all updates having closed analytical forms, and for arbitrary J, d j , g j .Appendix A provides the full derivations as well as noting how to back out the corresponding Gibbs Sampler.
Algorithm 1 explicitly outlines the updates for Scheme I. Experiments showed that convergence could be improved at little computational cost by jointly updating the mean parameters of q(β) and q(α); see Appendix C for discussion.All models estimated in the paper use this acceleration technique.This improves upon existing mean-field schemes for logistic hierarchical models in a number of ways.First, for any of the factorization assumptions, no further distributional assumptions are required (cf.Ormerod and Wand 2012; Tan and Nott 2013 assuming normality).Second, most existing algorithms for binomial outcomes require the repeated evaluation of (low) dimensional integrals at each iteration whose number scales with g j (cf.Ormerod and Wand 2012;Tan and Nott 2013;Jeon et al. 2017).Extending these algorithms to J > 2 would likely incur significant computational costs as the number of those integrals increases.None of the schemes in Result 1 require integration at any step as the Polya-Gamma augmentation turns inference into iteratively performing weighted ridge regression.In the models considered in this paper, the major bottleneck as one moves from Scheme I to Scheme III is in calculating the variance term of q(β, α); even relying on a (sparse) Cholesky decomposition, this involves inverting an increasingly dense lower triangular matrix as weaker independence assumptions are imposed.Appendix E disaggregates the run-time of Algorithm 1 by stage and scheme.
Most importantly, the ability to choose between Schemes I, II, and III allows the researcher to smoothly trade-off computational cost and accuracy as in Menictas et al. (2019)'s work on J = 2 for linear mixed effects models.Scheme I with its strong implied factorization assumptions is immediately scalable to huge datasets with large J or g j .However, the downside is that the strong factorization assumptions will likely degrade performance.Scheme III provides the ability to avoid these strong assumptions at a somewhat increased computational cost.The ability to avoid such factorization assumptions for arbitrary J > 1 and binomial outcomes appears to be a new result.The expectation is that it will have the best performance.Scheme II is a compromise between the two extremes, and other hybrid approaches are possible such as applying re-parameterizations to the augmented posterior (e.g.Tan and Nott 2013;Tan 2021).

Marginally Augmented Variational Bayes
The second major contribution of the paper is demonstrating that there is a computationally cheap way of improving the initial approximation resulting from Schemes I, II, or III.The key intuition, formalized below, is that once an initial approximation q(θ) is found, one can draw samples from this approximation, perform a single step of Markov Chain Monte Carlo through (some of) the parameters, and thereby "improve" the sample.Some existing work in computer science (e.g.Salimans et al. 2015;Ruiz and Titsias 2019) has leveraged this point to attempt to optimize over the intractable improved density which can be computationally expensive.
By contrast, this paper explores the idea that if one can find a transition kernel with good mixing, then simply doing a single partial step can provide considerable gains at limited computational cost.While many samplers can be employed for this purpose, initial experiments suggested that the key problem was the independence assumptions in q(β, α) and thus I chose to focus on marginal augmentation and parameter expansion as it is inexpensive to use in fully Bayesian MCMC to improve a Gibbs Sampler, has demonstrated strong performance in hierarchical models, lacks internal tuning parameters, and was explicitly designed to link the fixed and random effects together (Liu and Wu 1999;Van Dyk and Meng 2001;Gelman et al. 2008).I focus on logistic hierarchical models although the procedure is itself much more general; Section 6 discusses some broader implications and Appendix B formulates the results in a more general fashion.
The key idea behind parameter expansion is to create an "over-parameterized" model where certain additional parameters (ξ) are introduced such that they (i) maintain the observed data model but (ii) are not identifiable from the observed data itself.A careful choice of parameter expansion allows the construction of algorithms that have either faster mixing for MCMC (Liu and Wu 1999;Van Dyk and Meng 2001) or faster convergence for deterministic algorithms such as EM (Liu et al. 1998).The intuition behind its effectiveness is that it allows "moves" (either via sampling steps in MCMC or parameter updates in EM) in the un-identified space that can break or escape the strong associations between parameter blocks (e.g.β and α) that slow down mixing (Liu and Wu 1999) or lead to the algorithm getting "stuck" for many iterations near boundary conditions (e.g. a small sampled Σ j shrinking α j,g leading to a small Σ j , etc.; Gelman et al. 2008).Liu et al. (1998) provide a useful explanation of parameter expansion in the context of EM as a "covariance adjustment" to the estimated parameters.
In the case of hierarchical models, the most popular parameter expansion appears as a location and/or scale transformation of the random effects (e.g.Van Dyk and Meng 2001).The location transformation, for example, allows the random effects to have a non-zero mean: α j,g ∼ N (μ j , Σ j ).Note that it is not possible to estimate μ j from the observed data but that it could be estimated if α j,g were known.Implementation is simple and appears as a location or scale transformation of the sampled parameters that leads to very large gains in performance (e.g.Van Dyk and Meng 2001;Gelman et al. 2008).
Definition 1 generalizes this parameter expansion to the arbitrary J case, where the M j notation is bookkeeping to note which element of x i (and β) corresponds to each element of z b i,j (and α j,g ).
Definition 1 (Expansions for Hierarchical Models).Define a set of expansion parameters ξ that consists, for each j, of a mean shift μ j ∈ R dj and a scale shift R j ∈ R dj ×dj such that R j is invertible.I use superscript X to denote the "expanded" parameters.
The mapping between θ X and θ for a fixed ξ is denoted as t ξ (θ X ) and listed below.
All other elements of M j are zero.For simplicity, assume that each element of z i corresponds to some variable in x i , i.e. that each column of M j has exactly one non-zero element.
The augmented model is listed below for an important special case treated in detail ("Mean Expansion") in the empirical analysis.The full expansion ("Translation Expansion") is also listed.
• Translation Expansion: Given such an expanded version of the hierarchical model, there are two ways to improve the algorithms in this paper.First, drawing on Jaakkola and Qi (2007), it is possible to accelerate convergence of Algorithm 1 using "parameter expanded variational Bayes" (PX-VB).Appendix C derives a new application of PX-VB to the models in Result 1 and shows it can often improve the algorithm's convergence by decreasing the number of iterations required at effectively no computational cost as, functionally, it involves centering the random effects to be mean zero and adjusting the mean of q(β) correspondingly.
The main use of parameter expansions in this paper, however, is to improve the quality of the approximation by "improving" q(θ) by performing one step of marginal augmentation where the expansion parameters ξ are sampled and then the components of θ are re-sampled.Definition 2 outlines the procedure in a general case.The notation and procedure mirrors that in Liu and Wu (1999).

Create θ
3. Sample a new ξ 1 as follows where J ξ (θ X ) is the Jacobian of t ξ with respect to θ X and p(θ|y) denotes the true posterior distribution.
Theorem 1 states a key result for MAVB.
Theorem 1 (Guaranteed Improvement with MAVB).For any (proper) choice of prior p 0 (ξ), the MAVB approximation q(θ) has a better ELBO than the initial approximation: The proof is in Appendix B and uses two lemmas from existing results.First, Theorem 1 in Liu and Wu (1999) demonstrates the transformation to generate MAVB maintains the stationarity of the posterior.Second, a data processing inequality noted by various authors (e.g.Ruiz and Titsias 2019) showing that this transformation which keeps the true posterior invariant results in a better approximating distribution.
It is known from the data augmentation literature that an increasingly diffuse prior on the expansion parameters ("working prior") allows for the parameters themselves to "decide" the best expansion parameter ξ rather than being weighed down by the prior (e.g.Liu andWu 1999, p. 1268), and I conjecture that a similar intuition applies for MAVB.Thus, in all applications, I use an improper prior (i.e.p 0 (ξ) ∝ 1); Appendix B discusses the validity of this prior using existing theory (Liu and Wu 1999;Van Dyk and Meng 2001), provides the result for a proper prior on ξ, and notes Algorithm 2 can be found as the limit of a proper working prior p 0 (μ j ) ∼ N (0, τ2 I) as τ → ∞.Algorithm 2 shows how MAVB is implemented using the mean expansion noted in Definition 1. 2   Thus, for this model and relying only on a location transformation, MAVB has a simple form that, as shown later, can result in considerable improvements in the performance of Scheme I.As noted in the earlier discussion, the presentation of MAVB in Algorithm 2 illustrates the close relationship to the location transformation noted Algorithm 2 Applying MAVB to Non-Linear Hierarchical Models.
Set the Number of Samples Desired: M Estimate q(θ) using CAVI (e.g.Algorithm 1) M j μj earlier: It can be thought of a "stochastic" location transformation given that mean of the expansion parameter is the mean of the sampled α j,g .Some additional remarks are in order: First, if MAVB is applied to an approximation resulting from Scheme I (i.e. with independence assumed between β and α), the resulting approximation will not imply such an assumption.Consider the correlation between α j,g and β in Algorithm 2. Before applying MAVB, the two parameters are independent by assumption.After applying MAVB, they have a non-zero posterior correlation because of the shared dependence on μj .While not sufficient to restore all missing dependencies (e.g.components of β that are not included in any random effect), this can at least address some of the shortcomings of Scheme I. MAVB can be applied to the outputs of Scheme II and III, although the expectation is that the improvement for these schemes should be less pronounced given that more of those dependencies are estimated directly.
Second, the cost of implementing MAVB is quite modest, unlike existing approaches that attempt to optimize over the improved density (e.g.Ruiz and Titsias 2019).After drawing a sample from q(θ), all that is needed to perform MAVB is drawing j d j univariate Gaussians (22 in the largest model considered in this paper [Model 9]), some summation of the sampled random effects and then subtracting off the sampled expansion parameter μj .Note that this MAVB procedures do not require sampling the Polya-Gammas as they are left un-transformed by the algorithm nor does the cost of MAVB depend on the size of the data (N ) directly; even if g j is large, MAVB will still be fast.
Third, while MAVB is guaranteed to increase performance, the quality of MAVB is difficult to ascertain analytically in most complex models.However, insights come from simpler cases: In a stylized hierarchical model, Liu and Wu (1999) show that marginal augmentation results in perfect sampling.In the more realistic case where Σ j is not fixed and J = 1, studies show that certain forms of marginal augmentation result vastly improved mixing of MCMC samplers (Van Dyk and Meng 2001;Gelman et al. 2008).Thus, there is reason to be optimistic about the ability of MAVB to improve initial approximations as the scale/location transformations in fully Bayesian marginal augmentation seem to provide quite considerable benefits over simple Gibbs samplers.
Overall, while MAVB is likely to be helpful in improving the variational schemes in this paper, it is not a panacea.Its major benefit appears to be in "connecting" blocks of parameters that were assumed to be independent in a way that is guaranteed to improve the approximation quality at a very limited computational cost.The key limitation is that its speed and scalability depends on it not returning to the observed data (y).Interestingly, this suggests a "stronger" version of MAVB that could be performed by implementing one full sweep of the Gibbs Sampler, i.e. sampling Polya-Gammas and cycling through all full conditionals, and then performing marginal augmentation.If this were to be performed many times, the samples would converge to the true posterior by standard properties of MCMC.While this might raise its own computational concerns, exploring this is an interesting area of future research.

Simulation Study
I perform a simulation study to assess the accuracy of the proposed methods.I compare my variational algorithms against two gold standards (Laplace approximation using blme - Bates et al. 2015;Chung et al. 2015; HMC in STAN using brms; Bürkner 2017) and Automatic Differentiation Variational Inference (ADVI; Kucukelbir et al. 2017). 3The latter is a useful comparison as it is easily implemented in STAN and is a generic approach to approximate complex models.I show results using its mean-field approximation (MF) and full rank (FR).To begin, I conducted a simulation where the linear predictor ψ i was generated using the following scheme (J = 2).

Draw the fixed effects
For each group g ∈ {1, • • • , 10}, draw the random intercept α 2,g ∼ N (0, 1) For each observation i ∈ {1, • • • , 1000}, assign at random to groups g, g .Draw its fixed effect x i ∼ N (0, Σ) where Σ j,j = 0.5 |j−j | ; j, j ∈ {1, • • • , 10}.Draw y i such that: ) 3 Using blme allows for an identical Inverse Wishart prior to be added to the Laplace approximation; models are fit using optimx's nlminb algorithm (Nash and Varadhan 2011) that returned noticeably better performance.brms generates a model that can be manually adapted to place an Inverse Wishart prior as this is not permitted in the default options in pre-written STAN models at the time of writing (e.g.rstanarm or brms).
All models are fit with a standard Inverse Wishart prior of IW(d j + 1; I dj ) on Σ j .I run each variational algorithm until the change in the ELBO is less than 10 −8 or the largest parameter changes by less than 10 −5 .For the HMC and MAVB methods, I draw 4,000 samples from the (approximate) posterior.
Table 1 reports four measures of performance; the first two measures compare the point estimates (posterior mean) against HMC.The third measure compares the full posterior using a measure of "accuracy" that modifies the integrated absolute error (e.g.Faes et al. 2011).Formally, this is defined as 1 − 1 2 ∞ −∞ |q k (θ) − q HMC (θ)|dθ.I use kernel density estimation with a range over the shared support of the samples (bkde, KernSmooth; Wand and Ripley 2020) and then approximate the integral.Finally, to understand how the estimates of uncertainty fare against the unknown truth, I examine the "frequentist coverage": Does an interval of ± 1.96 times the standard deviation of the parameter contains the truth?A value of around 0.95 would indicate correct coverage at the expected frequentist level.Note: This reports the bias (Bias), root mean squared error (RMSE) of the estimated posterior means against those estimated from HMC.The distance between the distributions (Accuracy) and frequentist coverage (Coverage) are reported; see the main text for an explanation of these measures.The statistics are disaggregated by fixed (FE) and random effects (RE).All results are created using all relevant parameters in each simulation and then averaged across one hundred simulations.ADVI (MF) uses the mean-field approximation; ADVI (FR) uses the full rank approximation in Kucukelbir et al. (2017).

Bias
Table 1: Results from Simulations.
The results are promising; looking at the bias and RMSE, the variational methods perform well; they have very small bias against the means estimated from HMC and an RMSE that is quite small, comparable to the Laplace approximation, and out performs both ADVI implementations.
Examining accuracy and frequentist coverage shows more separation across the methods.The accuracy and coverage of Scheme I are noticeably lower than the Laplace approximation.However, applying MAVB results in noticeable improvements in accuracy (around 6% for fixed effects; and nearly 25% for random effects) and increases coverage by similar amounts to near nominal levels.After this improvement, Scheme I is comparable to the best approximate method (Laplace approximation) having slightly lower accuracy for the fixed effects but noticeably better accuracy and coverage for the random effects.Scheme II has somewhat better initial performance but is also boosted considerably by applying MAVB.
Scheme III-the factorization that does not assume independence between q(α) and q(β)-performs nearly as well as the Laplace approximation (and better in terms of the random effects) before applying MAVB.Applying MAVB results in only slight improvements (e.g. a 1-2% boost in accuracy and coverage for the random effects).
Appendix D conducts additional simulations.First, I vary the magnitude of the true coefficients by changing the variance of the fixed and random effects.After applying MAVB, the coverage of the variational methods is near nominal (i.e.above 0.90) in all cases except when the variance of the true distribution of the fixed effects is larger where MAVB is insufficient to obtain nominal coverage on the fixed effects (0.80-0.85) although the coverage on the random effects remains good.While this is worthy of future exploration, I conjecture this occurs because of the large magnitudes of the linear predictors (with 5-95% interval of around −5.9 to 5.5 vs. −2.7 and 2.3 in the simulations in Table 1) and the highly bimodal distribution of p i .It may be that a pass over the observed data and one full sweep of MCMC (discussed in Section 3 as a "stronger" MAVB) could result in more significant improvements in coverage.
Second, to examine simulations in a more realistic case, I fit a simple MRP model on the data from Ghitza and Gelman (2013) with random effects for age, income, ethnicity and state (Model 1 from Table 2, below) and take the parameter estimates from the Laplace approximation as "ground truth" to create simulated outcomes.It shows a similar pattern although with weaker performance across the board-Scheme I after applying MAVB outperforms ADVI (Mean Field) across all measures with noticeable improvements in accuracy (10%) and is comparable to ADVI (Full Rank).Scheme III performs the best of all approximate methods, including beating the Laplace and ADVI (Full Rank).The values of the linear predictor are relatively modest in this case (90% of the HMC posterior means are between −0.67 and 1.99) and more comparable to those in the simulations in Table 1.This provides further evidence that for reasonably sized linear predictors, the variational approximations perform well and are improved by MAVB.
Finally, I examine the sensitivity of the algorithm to initial values to see if there is evidence of arriving at different local optima.I find little evidence of this for the models considered in this paper given reasonable random initializations, although researchers should check for this in their own applications.

Application: Estimation for Complex MRP
This section re-analyses the results in Ghitza and Gelman (2013) where I compare my results against Hamiltonian Monte Carlo (HMC).I then conduct 10-fold cross-validation using Scheme I to examine which model seems to be most appropriate to use for the final predictive task.I find that, contrary to the decision in Ghitza and Gelman (2013), a model with intermediate complexity is preferred.

Brief Explanation of MRP
Before proceeding, I provide a brief explanation of MRP (see, e.g. Park et al. 2004;Lax and Phillips 2009b;Ghitza and Gelman 2013 for more detailed explanations).The key problem is that while it is easy to gather a representative survey at the national level, it is very expensive to gather a sufficiently large and representative survey at subnational units (e.g.states) or sub-types of respondents (e.g. by race, education, income, their interactions, etc.).Further, the number of observations in any sub-group may be very small, rendering a direct analysis of their values unreliable (Lax and Phillips 2009b;Warshaw and Rodden 2012;Buttice and Highton 2013).However, the most substantively important questions exactly rely on drawing inferences about those sub-groups.MRP provides a model-based procedure to attempt to reliably estimate these sub-group effects by providing a principled way to extrapolate the nationally representative survey.
MRP is a two-step procedure.First, the researcher estimates a hierarchical model ("multilevel regression") on the initial survey including covariates such as demographic characteristics and indicators for the relevant geographic unit (e.g.state) to get estimates for various "types" of respondents (e.g.age-income-ethnicity by state).The hierarchical model usually has a binomial or binary outcome.The second step calculates the expected response for each demographic-state profile.These can be examined directly or aggregated to get a measure of opinion at the desired geographic level (e.g.state).The aggregation or "post-stratification" occurs by taking a weighted average of those sub-group predictions from the known joint distribution in the population from some ground truth such as the Census.This paper has focused on the first step ("multilevel regression").Ghitza and Gelman (2013) apply this method to explore the decision to turn out to vote and party choice by age-race-income-state sub-groups in the 2004 and 2008 American presidential elections.They note that traditional MRP includes the random effects linearly and thus may be failing to capture important complexities or interactions between demography and geography.They thus fit a highly complex model with eighteen random effects and nearly 4,000 parameters on a dataset with around 4,000 observations.After doing so, they draw a variety of subtle and nuanced conclusions about the behavior of particular demographic sub-groups.For example, they qualify the conventional wisdom to show that turnout increases were concentrated amongst non-white younger voters instead of younger white voters (Ghitza and Gelman 2013, p. 771-772).

Estimating Complex Hierarchical Models
I begin by performing a direct comparison of Schemes I, II, and III against the gold standard approaches applied to Ghitza and Gelman (2013).To illustrate the many specifications available to the researcher, Table 2 shows nine possible specifications ranging from a simple MRP model with no interactions to the preferred model in Ghitza and Gelman (2013) (Model 9).I round y i and n i to the nearest integer to facilitate interpretation as a standard binomial regression.The intermediate models represent varying complexities that allow for some, but not all, interactions.Note: This table summarizes nine possible models to predict voter turnout.All models include six fixed effects: an intercept, (standardized) individual income, state-level income, state-level Republican vote share and the interaction between individual income and the latter two variables.Ghitza and Gelman (2013) use Model 9.The first panel indicates which random effects are included; a hollow diamond ( ) indicates that only a random intercept is used.A solid circle (•) indicates that a random intercept and a random slope allowing for the effect of (standardized) individual income to vary by group are included.The number of parameters is the number of fixed effects, random effects, and variance components for the random effects.The run times are for a Laplace approximation using blme (Bates et al. 2015;Chung et al. 2015) and HMC in STAN (via brms; Bürkner 2017).All models were run on an instance with 16 GB of memory and 4 cores.HMC was estimated using four chains distributed in parallel.

Model
Table 2: Nine Possible Models for Predicting Turnout via MRP.
It clearly shows the scale of the difficulty for applied researchers: Fitting the published model (Model 9) takes hours using either specification on a machine similar to that available for many applied researchers (a Microsoft Azure instance; Ubuntu, 4 cores, 16 GB of RAM).Methods that require fitting the model repeatedly to facilitate common tasks as bootstrapping, model comparison via cross-validation, or ensemble analysis (Van der Laan et al. 2007, see Ornstein 2020 for an application to MRP) are clearly prohibitively expensive for all except the simplest models using the Laplace approximation.
Figure 1 illustrates the improvement after applying the variational algorithms to each model in Table 2 and performing MAVB.All reported times include estimation of the variational algorithm and drawing 4,000 samples using MAVB.Appendix E shows the time for estimation and MAVB separately; it takes around thirty seconds for Scheme I on the most complex model.2. All models are fit on a computer with 16 GB of RAM and 4 cores.
As shown on a linear scale, the time to estimate either the Laplace approximation or Hamiltonian Monte Carlo dwarfs that of any of the variational schemes.The right panel shows the results on a log-scale to allow for clearer comparisons; it shows that Scheme I remains remarkably fast estimating even Model 9 in around one minute versus hour(s) for either gold standard method.The performance of Schemes II and III degrade somewhat-taking around fifteen minutes to fit.This is still very reasonable, but may still be onerous if repeated fitting is required as in cross-validation.
The quality of the approximation is also crucial to assess.As the truth is unknown, I do this by comparing all methods against HMC as this seeks most directly to sample the posterior.Consider first the Laplace approximation; it nearly exactly recovers the point estimates-its solid points and shading lie very near to the 45-degree line.For the variational methods, Scheme I is highly correlated with the posterior (ρ = 0.996 for ᾱj ; ρ = 0.964 for the raw |α j,g |) although less so than the Laplace approximation.Schemes II and III show tight coupling with the estimates from HMC and are effectively equally accurate to the Laplace approximation.This matches the conventional wisdom that variational methods typically well-recover the posterior means.Again consider first the Laplace approximation; the standard deviation of its point estimates are often tightly clustered near the 45-degree line but there are a number of random effects that are noticeably smaller (above the 45-degree line).
The performance for the variational algorithms is rather mixed, by comparison.Looking at Scheme I, almost all points show a too small standard deviation-with many random effects being considerably too small.Scheme III, however, improves the situation markedly.While slightly smaller-especially for points with large standard deviations-it tracks the 45-degree line closely and has better performance than the Laplace approximation.As expected, Scheme II is somewhat of an intermediate case; improving some parameters but still having significant problems.
Overall, therefore, Schemes I and II fall into the usual problem of understating posterior variance.By contrast, Scheme III appears to do rather well and lacks the obvious problems of lack of posterior variability versus a fully Bayesian baseline.This corroborates results from Menictas et al. (2019) that estimating q(β, α) jointly performs well for (linear) hierarchical models with J > 1.
Finally, I show how these estimates change when using MAVB.I focus on the effect on posterior variability as the means are not materially affected by MAVB; Appendix E shows the analogous figure.Figure 4 presents the distribution of the gap between the variability between the HMC estimates and the other methods where negative values indicates a smaller standard deviation for the competitor methods.Any point below the dotted line indicates that that percentile of effects has a smaller standard deviation than the HMC estimates.To make results interpretable, I report the percentage gap, e.g.sd Laplace • 100 for all parameters k in (β, α).To ensure that random effects with small g j are counted, it presents the averaged statistic across g as in Figures 2 and 3.
The results provide clear evidence for the important role of MAVB.Considering first random effects in the left panel, it is first worth noting that the Laplace approximationcommonly used by researchers-has poor performance for a number of parameter blocks (e.g. the lower percentiles).Scheme I shows a clear lack of variability in the posterior estimates with all estimates being estimated at least 20% too precisely and around half of all estimates having less than 75% of the variability estimated in the fully Bayesian setting.After applying MAVB, the (red) solid line shows a considerable improvement although still markedly below the HMC estimates and performing worse than the Laplace approximation.Large improvements are seen for the fixed effects (β) where the estimates of the variability go from extremely poor to being much closer to the Laplace approximation which, itself, is markedly below the HMC coverage.
Scheme III is worth also considering in detail; even before applying MAVB, it has stronger performance than the Laplace approximation in that its curve has a much less poor "tail" (i.e. its worst blocks are around 25% too small vs around 60% for the Laplace approximation).MAVB provides some additional gains ensuring that most parameter blocks are only around 10% too small in terms of their variability.Scheme II is again somewhat intermediate; after applying MAVB, it is broadly comparable to the Laplace approximation.
To provide another interpretation of the role of MAVB, consider the accuracy measure in Section 4 that measures similarity between two distributions, averaged within and then across parameter blocks: The Laplace approximation performs relatively well (90%).Scheme I performs poorly (43%) because it clearly fails to capture the posterior variance.MAVB increases this considerably (68%) although it still falls below the Laplace approximation.Scheme III, however, out-performs the Laplace approximation (95%) with a slight improvement from MAVB (97%).
Appendix E provides some additional results.First, it breaks apart Figure 3 by the type of random effect; the main implication is that the initial lack of variability from Scheme I is most pronounced for the fixed effects and random effects with small g j (age, ethnicity, income).The improvements for MAVB for those random effects are large and resolve the much of the negative gap.
Second, it examines the linear predictor (i.e.x T i β + z T i α).It shows that MAVB has little effect, although all schemes perform well.In addition to closely estimating the posterior mean (Scheme I has a bias of −0.002 vs HMC), the standard deviation is also fairly close (bias of −0.013 or about −2%), especially compared to the gaps seen in Figure 4.A conjecture would be that MAVB as implemented here has little impact on the linear predictor as it is more about building correlations between parameter blocks, but the "stronger" MAVB noted above might address such limitations.

Choosing an Optimal Model
Finally, I return to the substantive analysis in Ghitza and Gelman (2013).A key question when performing MRP is the complexity of the accompanying model.Even with the regularization implied by the hierarchical effects, it is still possible to over-fit to the survey sample (Goplerud et al. 2018).The reported analysis relies on Model 9 without exploring this possibility.The computational burden needed to estimate multiple models and thereby engage in model testing and checking is often onerous for the applied researcher.I thus use the ability to rapidly fit variational approximations to deploy a standard model comparison technique (cross-validation) and examine whether a model of intermediate complexity should be preferred.Table 3 reports a number of statistics on model fit.
The first two rows (LOO and WAIC) are popular tools for deciding between nonnested Bayesian models (Vehtari et al. 2017).Details on their exact calculation can be found in the relevant articles (Gelman et al. 2014;Vehtari et al. 2017), but both are designed to be approximations to cross-validation that do not require fitting the Bayesian model repeatedly.
Fortunately, both have diagnostics to assess whether the underlying approximations are reliable; unfortunately, the diagnostics tests fail in this setting.Almost all models report unacceptable violations of the underlying assumptions for both the LOO and WAIC, and the associated software explicitly encourages the user to resort to K-fold cross-validation.On the other hand, variational inference provides a fast approximate method.The final row of the table (VI-CV) reports the average deviance (twice the negative log-likelihood) of the held-out predictions after conducting 10-fold cross-validation where observations are allocated to each fold with equal probability using Scheme I. Formally, if observation i has a prediction pi , the individual deviance  The results are interesting and push against the decision to use Model 9; it finds that while Model 1 performs noticeably worse than all other models, it is not necessarily best to use the most complex model.Indeed, an intermediate model-Model 4-performs the best although the differences in the error are quantitatively small between Models 4 and 9.
Appendix E provides a more detailed exploration against a "Bayesian gold standard."Using a new set of folds, it fits Models 1, 4, and 9 using ten-fold cross-validation and performing HMC and the Laplace approximation on each fold.This is extremely time intensive-taking around ten days to complete the whole process.It confirms that cross-validated HMC, Laplace approximation, and Scheme I all select Model 4. The out-of-sample predictions between Scheme I and HMC are highly correlated (0.998).This gives some confidence that the results of the variational method can be used in lieu of prohibitively expensive classical cross-validation.When it is too expensive to conduct such an analysis, relying on methods such as simulation-based calibration (e.g.Yao et al. 2018) may be a feasible way to assess whether the variational approximation "successfully" approximated the posterior.
Returning to Table 2, the major feature that distinguishes Model 4 from less complex models is interactions between the core random effects (age, ethnicity, income) and state.This matches a reasonable expectation from political science that demographics are likely to vary across state but the complex higher-order interactions between region and three-way-interactions do not seem to add much predictive power.
These results are useful to practitioners of MRP in three ways; first, complex hierarchical models can now be compared against other state-of-the-art machine learning methods versus relying on a very simple model (analogous to Model 1) due to computational costs (Bisbee 2019;Ornstein 2020).Thus, it is an interesting and open question whether methods such as BART are actually superior for MRP tasks (Bisbee 2019) or whether properly specified complex hierarchical models can be competitive.Second, it suggests that interactions between demographics and state characteristics are important to include although the evidence for going extremely "deep" and adding many higherorder interactions appears more limited.Finally, even if one prefers to fit a Bayesian model for the final regression, the ability to quickly search between models allows the researcher to narrow down a set of plausible candidate models for final exploration and model testing.

Conclusion
This paper provided a new set of variational algorithms that, leveraging Polya-Gamma data augmentation (Polson et al. 2013), require only a mean-field assumption to estimate a logistic hierarchical regression with an arbitrary number and size of random effects.It provided multiple factorization assumptions; Scheme I required the independence of the fixed effects and each block of random effects whereas Scheme III relaxed that assumption at the expense of increased computational cost.All methods seemed to quite accurately capture the posterior means in even complex models.As expected in both simulations and real data, Scheme I performed worse-especially in terms of understating posterior variance for many random effects.
The paper also provided a generic way to improve the performance of Scheme I, and Schemes II and III to a lesser extent.By leveraging the existence of a parameter expansion of the underlying model either by allowing the means of the random effects to be non-zero or by imposing some translation, one can use a marginal augmentation sampler to improve the posterior approximation.This procedure ("marginally augmented variational Bayes"; MAVB) showed promising performance when applied to Scheme I: It increased the variance of the estimated approximations to be closer to the samples drawn using a fully Bayesian procedure, although still remaining too small on real data.However, given its speed even on complex models, MAVB provides a cheap way to make Scheme I a more viable approximation to the true posterior.It is also worth noting that Scheme III performed very well-often beating the very popular Laplace approximation on both real and simulated data.
Future work could proceed in at least two directions.First, the algorithms here can be naturally extended to count and multinomial outcomes providing a more unified approach to variational estimation of non-linear hierarchical models.Extending the model to include a weakly informative prior such as Huang and Wand (2013) is also an important extension.
Second, the usefulness of MAVB should be explored both theoretically and in the context of other models.As noted earlier, there is nothing about using MAVB that is specific to logistic hierarchical models per se.Indeed, this idea of "improving" an approximation by pushing it through a Markov transition kernel can be generalized to a wide variety of MCMC samplers and models.It thus opens a question of which Markov transition density to use for other models that do not admit marginal augmentation.A reasonable conjecture is that as the mixing of the sampler improves, the transformed sample will be closer to the true posterior.

Figure 1 :
Figure 1: Speed of Estimation.Each figure plots the run-time of each of the five methods (Laplace approximation, Hamiltonian Monte Carlo [HMC], Schemes I-III with drawing 4,000 samples using MAVB).The reported times are averaged across the 2004 and 2008 elections.The left figure shows the time in minutes on a linear scale; the right figure reports the same information on a log-scale.Models 1-9 are described in Table2.All models are fit on a computer with 16 GB of RAM and 4 cores. 4 Figure2begins by comparing the point estimates pooling across the 18 models.As there are thousands of parameters to plot, I simplify the picture in the following way; I plot the absolute magnitude of the estimates averaged across j: ᾱj = 1 gj gj g=1 |α j,g | in solid circles and shade the background of the plot based on the density of the individual |α j,g |.This prevents the domination of the j with smaller numbers of groups (e.g.age, income, etc.) in the visualization.I also separately mark the fixed and random effects.

Figure 2 :
Figure 2: Comparing Posterior Means.This figure plots the absolute value of estimated mean value from Schemes I-III and the Laplace approximation on the horizontal axis against the absolute value of the posterior mean from Hamiltonian Monte Carlo [HMC] on the vertical axis.Each parameter is plotted as a thin grey point; the average of the values inside each random effect are shown as larger points.The axes are on a square-root scale.

Figure 3
Figure 3 presents an analogous figure for the posterior variability, plotting the standard deviation of each parameter.It smooths across random effects in the same way as Figure 2. In interpreting this figure, note that points in the upper left quadrant (above the 45-degree line) indicate worrying performance as the posterior variability is below that coming from the HMC estimates.

Figure 3 :
Figure 3: Comparing Posterior Variability.This figure plots the estimated standard deviation from Schemes I-III and the Laplace approximation on the horizontal axis against the standard deviation of the posterior distribution from Hamiltonian Monte Carlo [HMC] on the vertical axis.Each parameter is plotted as a thin grey point; the average of the values inside each random effect are shown as larger points.

Figure 4 :
Figure 4: Improvements from MAVB.This figure plots the percentile of the gap between the standard deviations estimated via Hamiltonian Monte Carlo [HMC] and the approximate methods.The percentage gap, i.e. sd Laplace k

, α); {ν j
(Gelman et al. 2014;Vehtari et al. 2017)fit.The first two rows for each election report fit statistics on the model estimated via Hamiltonian Monte Carlo that approximate cross-validation; the "LOO" information criterion and the WAIC information criterion(Gelman et al. 2014;Vehtari et al. 2017).The third row reports the average out-of-sample deviance from a model fit using Scheme I.For all statistics, smaller is better and the best value is bolded.The time in minutes for each row to be estimated is shown in the final column; this includes estimation time and the time needed to estimate the relevant fit statistic. Note:

Table 3 :
Cross-Validation to Choose Optimal Model.
is −2 [y i ln(p i ) + (n i − y i ) ln(1 − pi )].Observations with n i = 0 are excluded from the reported average.Model 4 is also selected if Schemes II or III are used.