Normal Approximation for Bayesian Mixed Effects Binomial Regression Models

Bayesian inference for generalized linear mixed models implemented with Markov chain Monte Carlo (MCMC) sampling methods has been widely used. In this paper, we propose to substitute a large sample normal approximation for the intractable full conditional distribution of the latent effects (of size k) in order to simplify the computation. In addition, we develop a second approximation involving what we term a sufficient reduction (SR). We show that the full conditional distributions for the model parameters depend only on a small, say r ≪ k, dimensional function of the latent effects, and also that this reduction is asymptotically normal under mild conditions. Thus we substitute the sampling of an r-dimensional multivariate normal for sampling the k-dimensional full conditional for the latent effects. Applications to oncology physician data, to cow abortion data, and simulation studies confirm the reasonable performance of the proposed approximation method in terms of estimation accuracy and computational speed.


Introduction
Generalized linear mixed models (GLMMs) have been widely used for modeling longitudinal and clustered data in many scientific applications (McCulloch, 1996; Diggle et al., 2002). Statistical inference for GLMMs has received a lot of interest over the past three decades. In the frequentist literature, it is common to consider a likelihood-based approach (McCulloch, 1997; Breslow and Clayton, 1993), which often requires painful high-dimensional numerical integration and the derivation of large-sample asymptotic results for inference. Consequently, Bayesian methods have become more attractive, as they provide a useful alternative to numerical integration by adopting posterior sampling techniques such as Markov chain Monte Carlo (MCMC) methods. MCMC methods for GLMMs are usually carried out using Gibbs sampling (Zeger and Karim, 1991) and Metropolis-Hastings methods (Gamerman, 1997), and these methods can now be conveniently implemented using standard software platforms such as BUGS (Spiegelhalter et al., 2003) and JAGS (Plummer, 2012).
Despite the popularity of MCMC methods for GLMMs, posterior sampling can still be challenging in some cases due to computational issues such as the convergence of Markov chains and the lack of efficient sampling methods, particularly with a large number of latent effects (Hadfield, 2010). The main focus of this paper is to provide a simple yet useful alternative: a large sample approximation method with the potential to improve the computational efficiency of MCMC sampling for GLMMs. Large sample approximation is a classical approach that has been well explored in many areas of statistics, including Bayesian statistics and GLMMs. For example, asymptotic normality results have been established for posterior distributions under different conditions (Walker, 1969; Chen, 1985). The Laplace approximation (Tierney and Kadane, 1986) and the integrated nested Laplace approximation (INLA) (Rue et al., 2009) provide a computationally convenient first-order approximation to a posterior density function by a normal density centered at the posterior mode. For GLMMs, Breslow and Clayton (1993) proposed the penalized quasi-likelihood method by approximating the marginal quasi-likelihood function.
More recently, Yee et al. (2002) obtained an asymptotic joint normal approximation for the posterior distributions of two blocks of variables, i.e., one block of GLMM model parameters and another block of random effects. They first obtained the conditional asymptotic distributions for normalized block variables, and then, due to compatibility of the limiting distributions, they obtained the appropriate joint asymptotic distribution. This result was later generalized to a model with multiple blocks of variables by Su and Johnson (2006). Posterior normality was also obtained for stochastic processes by Weng and Tsai (2008). Fong et al. (2010) gave a comprehensive review of the implementation of INLA for GLMMs. Baghishani and Mohammadzadeh (2012) also obtained asymptotic normality results for the joint posterior distribution of the model parameters and random effects in GLMMs by using Stein's Identity.
Unlike the aforementioned methods, which focus on establishing joint asymptotic normality results for general model forms, our interest here is a large sample approximation for the block that is more difficult to sample, namely the one for the latent effects. Using our normal approximation for the latent effects, the normal prior for the regression coefficients is conditionally conjugate, as is the gamma prior for the precision.
We initially develop the concept for a particular GLMM, the mixed effects binomial regression model

y_i | u_i ~ Binomial(n_i, expit(u_i)),  u_i | β, τ ~ N(x_iβ, τ^{-1}),  independently for i = 1, . . . , k,   (1)

where x_i is a 1 × p vector with a one in the first slot and with p − 1 predictor values for unit i in the remaining slots, and β is a vector of regression coefficients. The main reason for considering this model is that the type of data we envision involves Bernoulli success/failure observations for units that are clustered. For example, suppose we were to count the number of patients in each of k hospitals who acquired nosocomial infections while under treatment during a specified amount of time. Since the number of treated patients at hospital i, n_i, would be known, the corresponding number of infections, y_i, might initially be regarded as a binomial count. But when considering multiple counts, we might realize that the individual Bernoullis within a hospital would be correlated, since the same environment, including hospital staff, physical space, management, etc., is shared by all who are being treated. So extra-binomial variation is to be expected, and this model accounts for it. A main goal for the ensuing analysis of such data would be to estimate regression effects, β, corresponding odds ratios, e^β, and probabilities of infection, expit(x_iβ). We consider a more complex model later in the paper after exploring this one.

Denote the observed data as y = {y_1, . . . , y_k}. The MCMC approach for making numerical approximations to the joint posterior distribution, p(β, τ, u | y), involves sampling from the full conditional distributions p(β | τ, u, y), p(τ | β, u, y), and p(u | β, τ, y). With conditionally conjugate priors for β and τ, the first two distributions are easy to sample. However, sampling from the third generally involves adaptive rejection sampling (Gilks and Wild, 1992). We consider a large-sample normal approximation to replace the nonstandard conditional density kernels for the u_i's.
This allows us to bypass the usual adaptive rejection step within the Gibbs sampler and hence makes computation more convenient. In particular, we explore two types of normal approximation. One is to approximate the conditional for each u_i separately with a normal distribution; the other is to construct a low-dimensional statistic, T, in place of the potentially high-dimensional collection of all k u_i's and then approximate the conditional distribution of T with a normal distribution. In our current illustration we are able to sample a two-dimensional T(u), instead of the k-dimensional u, in order to make full inferences about all regression parameters, probabilities of success, and odds ratios.

The normal approximation and dimension reduction ideas that we develop in this paper are in general applicable to other mixed effects models, although we focus on binomial regression for demonstration. Our goal is to introduce these ideas and methods and to show their fitness, rather than to claim any superiority over existing computational methods. We will see clearly that JAGS is much faster than our methods, due to its C++ based platform being superior to our R code. Moreover, its optimization of the Gilks and Wild (1992) adaptive rejection scheme is surely superior to ours. We also acknowledge that the Pólya-gamma data augmentation scheme (Polson et al., 2013) for binomial regression with logit link, which manages to perform Gibbs sampling exactly with an efficient augmentation scheme, is an attractive alternative to other possibilities, including ours. The main goal of this paper is to present a neat idea and show its promise as a valuable alternative for analyzing Bayesian mixed effects models, especially when the number of latent effects is large. Normal approximation is a classical idea that has been widely studied in all areas of statistics.
It is our hope that the work presented in this paper may pave the way for future investigation of more complicated GLMMs (e.g., with more levels of random effects and general link functions) and development of efficient computational toolkits.
We give a more detailed description of the proposed methodology in Section 2. We evaluate the empirical performance of the proposed method through an oncology physician data analysis and additional simulation studies in Sections 3 and 4. We further extend our method to a two-level logistic mixed effects model in Section 5, and we provide an elaborate sketch of the theoretical details that support its use in this more complex model. Several future working directions are discussed in Section 6.

Following Gelfand et al. (1995), we consider a mixed effects binomial regression model that involves centering the random effects, u_i, on x_iβ. This parametrization also facilitates our normal approximation. The augmented data likelihood is

L(β, τ; y, u) ∝ ∏_{i=1}^{k} expit(u_i)^{y_i} {1 − expit(u_i)}^{n_i − y_i} τ^{1/2} exp{−(τ/2)(u_i − x_iβ)^2}.
We assign independent prior distributions on β and τ: β ∼ N_p(B_0, C) and τ ∼ Ga(a/2, b/2), where B_0, C, a, b are hyperparameters determined using a priori knowledge. Consequently, the joint conditional for (β, u, τ) given the data is proportional to the product of the augmented data likelihood and these priors. To approximate inferences using MCMC methods, we obtain the full conditional distributions

β | τ, u, y ∼ N_p( (C^{-1} + τX'X)^{-1}(C^{-1}B_0 + τX'Xβ̂), (C^{-1} + τX'X)^{-1} ),
τ | β, u, y ∼ Ga( (a + k)/2, {b + (u − Xβ)'(u − Xβ)}/2 ),
p(u_i | β, τ, y) ∝ exp{ y_i u_i − n_i log(1 + e^{u_i}) − (τ/2)(u_i − x_iβ)^2 },  i = 1, . . . , k,   (4)

where π̂_i = y_i/n_i for all i and β̂ = (X'X)^{-1}X'u. Here we assume π̂_i ∈ (0, 1), since a Poisson approximation will work better than a normal approximation if π̂_i = 0 or 1.
Since the u i 's are mutually independent, we only need to sample the full conditional distributions of the individual u i 's, which is usually accomplished with adaptive-rejection sampling or slice sampling.
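To make concrete the kind of per-u_i sampler that our approximation will replace, the following Python sketch (purely illustrative; the paper's implementations are in R) draws from the full conditional kernel p(u_i | β, τ, y) ∝ exp{y_i u_i − n_i log(1 + e^{u_i}) − (τ/2)(u_i − x_iβ)^2} using slice sampling with stepping out:

```python
import math
import random

def log1pexp(x):
    # numerically stable log(1 + e^x)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def log_kernel(u, y, n, m, tau):
    # log full-conditional kernel for a single latent effect u_i:
    # binomial(logit) log-likelihood plus the N(m, 1/tau) term, m = x_i beta
    return y * u - n * log1pexp(u) - 0.5 * tau * (u - m) ** 2

def slice_update(u, y, n, m, tau, w=1.0, max_steps=50):
    """One slice-sampling update of u_i (stepping out, then shrinkage)."""
    height = log_kernel(u, y, n, m, tau) + math.log(random.random())
    left = u - w * random.random()
    right = left + w
    for _ in range(max_steps):                # step out to the left
        if log_kernel(left, y, n, m, tau) <= height:
            break
        left -= w
    for _ in range(max_steps):                # step out to the right
        if log_kernel(right, y, n, m, tau) <= height:
            break
        right += w
    while True:                               # shrink until a point is accepted
        prop = random.uniform(left, right)
        if log_kernel(prop, y, n, m, tau) >= height:
            return prop
        if prop < u:
            left = prop
        else:
            right = prop
```

Each Gibbs scan would apply one such update to every u_i; adaptive rejection sampling plays the same role but exploits the log-concavity of this kernel.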
We propose to approximate the conditional distribution for u | β, τ, y, X in (4) using a multivariate normal distribution with diagonal covariance structure due to the aforementioned conditional independence. This is accomplished by expanding the terms in the first exponent of (4) in a particular second order Taylor expansion and recognizing that the first term in the expansion does not involve u, the second term is zero, and the third term is a quadratic form in u. Moreover, since (u − Xβ) (u − Xβ) is itself a quadratic form in u, it is possible to complete the square to obtain a normal kernel for the conditional pdf for u. The precise result is given in Proposition 1 below and is proven in the Supplementary File (Berman et al., 2022), Section S1.
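To make the construction concrete, the following Python sketch (illustrative only; the precise statement is Proposition 1) carries out the approximation for a single u_i. Expanding the binomial log-likelihood about the empirical logit u_0 = logit(π̂_i) makes the linear term vanish (the gradient is zero there) and leaves the curvature n_iπ̂_i(1 − π̂_i), which combines with the N(x_iβ, 1/τ) prior term by the usual precision weighting:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def normal_approx_ui(y, n, mean, tau):
    """Normal approximation to p(u_i | beta, tau, y): second-order
    Taylor expansion of the binomial log-likelihood about the
    empirical logit u0 = logit(y/n), combined with the N(mean, 1/tau)
    term by completing the square.  Requires 0 < y < n."""
    pihat = y / n
    u0 = logit(pihat)
    w = n * pihat * (1.0 - pihat)       # likelihood curvature at u0
    prec = w + tau                      # combined precision
    mu = (w * u0 + tau * mean) / prec   # precision-weighted mean
    return mu, 1.0 / prec               # (mean, variance)
```

Each u_i can then be drawn directly from its approximating normal, which is the step that replaces adaptive rejection sampling.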

Normal approximation with sufficient reduction
In most statistical applications, a large sample size is an advantageous feature, as more information is usually helpful for understanding scientific phenomena. In our case, however, a large k can make computation difficult: our normal approximation method relies on Gibbs sampling, and since we sample the full conditional for u | β, τ, y, X as a k-dimensional multivariate normal distribution, the computational cost increases with k. In this section we propose a solution that we call "sufficient reduction" (SR), where the name comes from the concept of sufficient statistics, even though our method does not involve a sufficient statistic in the classical sense; it is a form of conditional sufficiency. Our proposal is motivated by examining the conditional distributions for β | τ, u, y, X and τ | β, u, y, X.
Recall β̂ = (X'X)^{-1}X'u. If we let T_1(u) = β̂, then the conditional distribution of β depends on u only through T_1(u). The rate parameter of the conditional distribution of τ depends on u only through T_1(u) and the residual sum of squares, since (u − Xβ)'(u − Xβ) = (u − Xβ̂)'(u − Xβ̂) + (β̂ − β)'X'X(β̂ − β). To summarize, we consider the following "conditionally sufficient reduction" for u:

T(u) = (T_1(u), T_2(u)),  T_1(u) = β̂,  T_2(u) = (u − Xβ̂)'(u − Xβ̂).

Thus if we sample from the full conditional for T, we no longer need to sample the full conditional for u, or its approximation, in order to make inferences about (β, τ) or functions of it.
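A small numerical sketch (Python with NumPy; illustrative only) computes a reduction of this form, taking T_2 to be the residual sum of squares (u − Xβ̂)'(u − Xβ̂), the natural choice given the gamma full conditional for τ, and verifies the decomposition (u − Xβ)'(u − Xβ) = T_2(u) + (β̂ − β)'X'X(β̂ − β) that makes (T_1, T_2) conditionally sufficient:

```python
import numpy as np

def sufficient_reduction(u, X):
    """T1 = betahat = (X'X)^{-1} X'u ; T2 = residual sum of squares."""
    T1 = np.linalg.solve(X.T @ X, X.T @ u)
    resid = u - X @ T1
    return T1, float(resid @ resid)

# Check the decomposition for an arbitrary beta on synthetic inputs.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.standard_normal(50)])
u = rng.standard_normal(50)
T1, T2 = sufficient_reduction(u, X)
beta = np.array([0.3, -1.2])
lhs = float((u - X @ beta) @ (u - X @ beta))
rhs = T2 + float((T1 - beta) @ (X.T @ X) @ (T1 - beta))
assert abs(lhs - rhs) < 1e-8
```

Note that the quadratic form in u never needs to be recomputed once (T_1, T_2) are in hand, which is the source of the dimension reduction.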
The conditional distributions for β and τ are now expressed in terms of T(u). Since the conditional distribution of u is approximately multivariate normal, the conditional distribution for T_1(u) is also approximately multivariate normal (exactly normal, of course, if u is exactly normal). The conditional distribution for T_2(u) involves a quadratic function of u, and it too can be approximated by a normal distribution for large k. More specifically, we consider a multivariate normal approximation for the joint conditional distribution of (T_1(u), T_2(u)). Before asserting formal asymptotic normality, it is straightforward to obtain the conditional moments for T(u), given in Equations (9). With these moments, posterior samples can be obtained by the usual Gibbs sampling of β, τ and (T_1, T_2).
For the formal result, we treat the result in Proposition 1 as if it were exact. We then standardize T_1 and T_2 and proceed to show that the joint distribution of the standardized pair converges to a (p + 1)-dimensional standard multivariate normal. From a practical standpoint, when we do computations we can simply use the normal distribution with the moments given in Equations (9) when we sample the full conditional for T.
Technical assumptions (A1)-(A4) are given in the Supplementary File, Section S1. Assumption (A1) is easily satisfied since (I_k − sBD_0) should be very close to I_k when s is sufficiently small. Assumptions (A2) and (A3) require the existence of the asymptotic variance for the standardized T_2 and its covariance with the standardized T_1. Assumption (A4) is needed for handling the remainder terms in the Taylor expansion of the mgf in the proof. We now state Proposition 2, which establishes the joint asymptotic distribution for the standardized (T_1, T_2). The proof is in Supplementary File Section S1; there, we also give a simple illustration of the assumptions for a concrete example.

Data description
We demonstrate the use of our proposed approximation by analyzing a hospital, doctor, and patient (HDP) dataset simulated by UCLA's Institute of Digital Research and Education (IDRE), available at https://stats.idre.ucla.edu/r/codefragments/mesimulation/.
Despite the fact that the data were simulated, they provide a very nice conceptual situation with a moderately large k in order for us to illustrate our methods.
There are 308 oncology physicians in the dataset, who treated 6745 individuals for lung cancer, i.e., on average 21.9 patients per physician. We are mainly interested in modeling a physician's probability of successfully facilitating lung cancer remission using physician-level covariates, including the physician's years of experience (mean = 17.96, sd = 4.08), the number of malpractice lawsuits involving the physician (mean = 1.97, sd = 1.53), and whether or not the physician attended a top-ranked medical school (22.4% did). These covariates are believed to be helpful in quantifying a physician's ability to successfully facilitate cancerous tumor remission. For example, the more experience a physician has, ideally, the better the treatment. The number of malpractice lawsuits may have a negative effect on this probability, as it may be viewed as a sign of carelessness.
Let y_i be the number of patients that physician i has treated successfully (the cancer went into remission) and n_i the total number of patients they treated. Also let x_i = (1, sExp_i, sLS_i, TMS_i), where sExp and sLS are standardized experience and lawsuit scores (mean zero and variance one), respectively, and where TMS is one if the physician went to a top medical school and zero otherwise. We consider a physician with about 18 years of experience, about two lawsuits, and a non-top medical school to be a baseline physician. This concept will be useful when we specify priors.
Moreover, patients of the same physician, say i, would naturally share a common treatment environment and experience. Thus it is reasonable to assume that they share the same random effect u_i, whose distribution is centered on x_iβ for that physician. We therefore regard the n_i patients as a cluster and propose the model in Equation (1) for these data. Accordingly, we do not expect all patients of the same physician to have exactly the same probability of successful treatment. In fact, the median success rate for physicians with covariates x_i is p̃_i = expit(x_iβ), and the corresponding 90th percentile of success probabilities is p̃_{i,90} = expit(x_iβ + 1.28σ), where σ = τ^{-1/2}.

We specify β ∼ N_4(0, I_4) and τ ∼ Ga(1/2, 1/2) as our prior for the model parameters. On the logistic scale, a N(0, 1) prior induces a prior on a probability that is not overly concentrated near zero or one. While priors with large variances have been used in attempts to be "noninformative", the induced prior on the probability scale attaches considerable mass to 0 and to 1, which can have a big effect on the posterior (Christensen et al., 2010, Chapter 8).
Part of the particular specification involves thinking about the success probability for baseline physicians, whose median probability of success is expit(β_1). Our best guess for this probability is 0.5, and we specify 95% certainty that it is between 0.1 and 0.9. Using this information, we find the N(0, b) distribution for β_1 that induces this specification; b = 1.1 does the job. Additional considerations are made for the full specification and are given in Supplementary File Section S2.
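This kind of elicitation is easy to check numerically. The short Python snippet below (ours, for illustration) computes the prior probability that expit(β_1) falls in (0.1, 0.9) under β_1 ∼ N(0, b), confirming that b near 1.1 puts roughly 95% of the induced prior mass in that interval, and also illustrating the earlier point about large "noninformative" variances:

```python
import math

def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit(p):
    return math.log(p / (1.0 - p))

def prob_in_interval(b, lo=0.1, hi=0.9):
    """P(lo < expit(beta1) < hi) when beta1 ~ N(0, b); equivalently
    P(logit(lo) < beta1 < logit(hi))."""
    s = math.sqrt(b)
    return phi(logit(hi) / s) - phi(logit(lo) / s)

# b = 1.1 gives probability close to the targeted 95% certainty,
# while a "noninformative" N(0, 100) prior leaves well under a fifth
# of its induced mass in (0.1, 0.9), pushing the rest toward 0 and 1.
p_elicited = prob_in_interval(1.1)
p_diffuse = prob_in_interval(100.0)
```

One could also invert this calculation (solve for b given a target probability) to automate the elicitation.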

Results
We implemented the proposed normal approximation within the Gibbs sampler and compared the results to the method used in JAGS. Both methods used the same prior as described in the previous section, the same number of iterations (5000), and the same burn-in (500). In Figure 1 we plot posterior pdfs for five quantities: expit(β_1), expit(β_1 + 1.28σ), exp(β_2), exp(β_3), and exp(β_4). The quantity expit(β_1) represents the probability that a "baseline" physician (with average covariate values, i.e., approximately 18 years of experience, approximately 2 lawsuits, and no top-ranked medical school attendance) facilitates remission of a patient's cancerous tumor. Then expit(β_1 + 1.28σ) represents the 90th percentile of cancer remission probabilities for baseline physicians. The other objects of interest, exp(β_2), exp(β_3), and exp(β_4), are odds ratios for the three predictors: experience, lawsuits, and top-ranked medical school, respectively.
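These five quantities are simple functionals of (β, σ), so their posterior pdfs come directly from transforming the MCMC draws. A Python sketch of the mapping (the dictionary keys are our own labels):

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def derived_quantities(beta_draws, tau_draws):
    """Map joint posterior draws of (beta, tau) to draws of the
    reported quantities; beta_draws is a sequence of 4-vectors and
    tau_draws the matching precision draws."""
    out = {"p_base": [], "p_90": [], "or_exp": [], "or_law": [], "or_tms": []}
    for b, tau in zip(beta_draws, tau_draws):
        sigma = 1.0 / math.sqrt(tau)
        out["p_base"].append(expit(b[0]))               # expit(beta_1)
        out["p_90"].append(expit(b[0] + 1.28 * sigma))  # 90th percentile
        out["or_exp"].append(math.exp(b[1]))            # odds ratios
        out["or_law"].append(math.exp(b[2]))
        out["or_tms"].append(math.exp(b[3]))
    return out
```

Posterior medians and probability intervals for each quantity are then just quantiles of the transformed draws.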
We present numerical summaries for these quantities in Table 1. We find that our normal approximation method (we are not implementing the SR method here) works quite well, given the minor discrepancy between the results from the normal approximation and JAGS. For example, the posterior median for the odds ratio exp(β_2) is 1.31 with a 95% PI of (1.12, 1.53) under the normal approximation, and it is 1.34 with a 95% PI of (1.13, 1.58) using JAGS. Thus if there are two physicians with equal education level and number of lawsuits, the type of physician with four more years of experience has odds of tumor remission that are 31% higher than for the less experienced type of physician. Similar interpretations can be obtained for the other two predictors.
In Figure 1, we see greater discrepancies for expit(β_1) and for expit(β_1 + 1.28σ) compared with the other parameters. Aside from the different graph scales, we believe this is due to using the normal kernel for approximating the binomial likelihood when the expected number of successes np (or failures n(1 − p)) is too small (Cochran, 1952). To further explore this phenomenon, we selected the two physicians with the most extreme sample proportions of cancer remission, i.e., smallest (Physician #4) and largest (Physician #218). We also randomly selected two additional physicians (#116 and #171) as benchmarks. We present inferences for their probabilities of cancer remission in Figure 2 and Table 2. It is clear that the proposed normal approximation works almost perfectly for the benchmark Physicians #116 and #171, but not as well for Physicians #4 and #218, especially in the tails. It is worth noting that Physician #4 treated only one patient out of 34 successfully, while Physician #218 successfully treated 36 out of 40 patients.

In the lower right panel of Figure 1, we also plot the empirical CDF of the posterior medians of {n_i min(p̃_i, 1 − p̃_i) : i = 1, . . . , 308}. It is clear that about half (150 out of 308) of the physicians have an estimated np value less than 5, which explains why the normal approximation does not work perfectly when p is close to the boundary of (0, 1). We also present results after restricting to the physicians with n min(p, 1 − p) > 5 in Figure 3; results from the normal approximation method and JAGS align quite well.
We also did an analysis of these data restricted to the physicians with n min(p, 1 − p) ≥ 10, and the analogous figure to Figure 3 shows perfect alignment of estimated pdfs based on JAGS and the normal approximation.
Finally, we conclude by briefly analyzing the data. We use the normal approximation to make inferences, but inferences would be practically identical if we used the JAGS output. The probability of patient remission for a baseline physician is estimated to be 0.34 (0.30, 0.38). Observe from Figure 1 that the corresponding 90th percentile of remission probabilities for baseline doctors would be estimated to be considerably larger than the median (50th percentile). An additional four years of experience is estimated to improve the odds of remission by about 30%, with a 95% probability interval for that improvement of 12 to 53 percent. This effect is statistically important and plausibly practically important as well. The estimated effect of two additional lawsuits against a physician is a reduction in estimated odds by about 14%. However, this is not statistically important, since both reductions and increases in these odds are well inside the corresponding 95% interval. Finally, the effect of attending a top medical school does not appear to have any practical or statistical importance. It is clear from Figure 2 that there is a considerable range of estimated probabilities of success across the types of physicians in the data. We remind the reader that the data were simulated.

Simulations
In this section we use simulations to further evaluate the empirical performance of the proposed normal approximation method. We generate the data from a realistic setting that is similar to what we observed in the HDP data example. More specifically, we consider three covariates: the intercept, a continuous covariate that follows a standard normal distribution, and a binary covariate following a Bernoulli distribution with success probability 22% (the same percentage as the medical school covariate from the HDP example). The true coefficient values are set to the posterior medians obtained from the HDP data analysis using JAGS with the prior β ∼ N_3(0, I) and τ ∼ Ga(1/2, 1/2). We vary the number of observational units, k (i.e., the number of physicians in the HDP data example), over the set {100, 300, 500}, and generate n_i (e.g., the number of patients treated by physician i) from a Poisson distribution with parameter λ, where λ takes values in {25, 50, 100}.
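The data-generating scheme can be sketched as follows (Python, stdlib only, for illustration; the true β values and random number generators are placeholders for the simulation's actual choices, and the Poisson sampler is the simple Knuth method):

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def rpoisson(rng, lam):
    # Knuth's method; adequate for the moderate lam values used here
    limit, p, n = math.exp(-lam), 1.0, -1
    while p > limit:
        p *= rng.random()
        n += 1
    return n

def simulate(k, lam, beta, tau, seed=0):
    """One synthetic data set: for each unit i, covariates
    x_i = (1, N(0,1) draw, Bernoulli(0.22) draw), cluster size
    n_i ~ Poisson(lam), latent u_i ~ N(x_i beta, 1/tau), and
    y_i ~ Binomial(n_i, expit(u_i))."""
    rng = random.Random(seed)
    data = []
    for _ in range(k):
        x = (1.0, rng.gauss(0.0, 1.0), 1.0 if rng.random() < 0.22 else 0.0)
        n = max(rpoisson(rng, lam), 1)   # avoid empty clusters
        mean = sum(b * xj for b, xj in zip(beta, x))
        u = rng.gauss(mean, 1.0 / math.sqrt(tau))
        y = sum(rng.random() < expit(u) for _ in range(n))
        data.append((x, n, y))
    return data
```

Each Monte Carlo replication draws one such data set and runs all four samplers on it.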
We compare the performance of four methods: (i) the method implemented by JAGS, (ii) the normal approximation method proposed in Section 2.1, (iii) the sufficient reduction method of Section 2.2, and (iv) an adaptive-rejection sampling (ARS) method that we programmed. All four methods generate MCMC samples using Gibbs sampling, and all update the blocks for β and τ by sampling the exact full conditional distributions. The exact full conditional distribution for u is sampled using adaptive rejection sampling (Gilks and Wild, 1992) in methods (i) and (iv), using different code of course. Finally, the full conditional for u is sampled approximately using our normal approximation in method (ii), and the full conditional for T is sampled approximately with a normal approximation in method (iii). The main reason we include the ARS method (iv) is to allow a fair comparison of computational time, because methods (ii)-(iv) are all implemented in R (5000 MCMC iterations, with the first 500 as burn-in), while the core function in JAGS has been optimized for mass use and is written in C++, which is much faster than R.
We summarize the mean absolute deviation (MAD) and the bias of the posterior medians for each parameter of interest, based upon 100 Monte Carlo replications, in Tables 3 and 4. We find that both the MAD and bias decrease as either k or λ increases in most cases. The estimation accuracy of the four methods is comparable in most cases. Both the normal approximation and sufficient reduction seem to work quite well, especially for larger values of k and λ. When λ is small, e.g., in the first row of Table 3, the normal approximation works better than ARS and SR, and this advantage becomes less noticeable as λ increases. This is expected because λ is the mean parameter for n_i, which controls the accuracy of the normal approximation to a binomial likelihood with n_i trials. In other words, a smaller λ value leads to a worse normal approximation, and SR performs even worse because of the additional loss of information. Similar findings were observed by Tan (2021) and Goplerud (2021) when normal approximations are used in variational inference.
We also use wall time, the elapsed time the computer used to execute the program, to quantify the computational efficiency of the three R-based methods. The mean wall time for each is summarized in Table 5. We find that the sufficient reduction method is faster than the normal approximation, and both are faster than the adaptive-rejection method we implemented. Wall time for the SR method improves over the normal approximation as k grows, as expected.
Next we relate the average running times in Table 5 to the computational complexity of the normal approximation and SR methods. We assume p is fixed. For the normal approximation method in Section 2.1, the computational complexity is in theory O(k^3 + λk), where the k^3 term comes from the inversion of k × k matrices in Proposition 1, and the λk term comes from matrix multiplications involving the n_i's, which are generated from a Poisson distribution with mean λ in this simulation. For the SR method, the computational complexity is O(k^2 + λk), where the k^2 term comes from the matrix multiplication in Equation (9). It is hence clear that the computational gain of the proposed SR method is mainly driven by avoiding the inversion of k × k matrices, which becomes substantial as k becomes large. These results are reflected in Table 5; e.g., the time saving of SR over the normal approximation improves as k becomes larger, because both k^2 and k^3 dominate the λk term. When k is small (e.g., 100), the computational time nearly doubles as λ changes from 25 to 50, as expected. On the other hand, when k is large (e.g., 500), increasing λ from 25 to 100 barely changes the computational time. For other entries in the table (e.g., k = 300), the results are less interpretable, probably because the terms associated with k^3 and λk are of comparable order.

Model setup
In this section, we further demonstrate the proposed normal approximation and sufficient reduction ideas with a two-level logistic regression model, where Y_ij is a binary response variable for the j th subject within cluster i, with j = 1, 2, . . . , n_i and i = 1, 2, . . . , k; X_ij is a 1 × p vector of subject-level covariates; and u_i is a normal random effect for the i th cluster with precision parameter τ. We center u_i on S_iγ, where S_i is a 1 × q vector of unique covariate information for cluster i, γ is the associated coefficient parameter, and β is the p × 1 vector of regression coefficients at the observational-unit level.
We use independent normal priors for the regression parameters β and γ, and a gamma prior for τ. The resulting joint posterior is proportional to the expression in (12). To further simplify the algebra, we introduce some matrix notation. Let u = (u_1, . . . , u_k)' so that u ∼ N_k(Sγ, I_kτ^{-1}), where S = (S_1', . . . , S_k')'. Let N = Σ_{i=1}^{k} n_i be the total number of observations, let X = (X_11', X_12', . . . , X_{kn_k}')' be an N × p matrix, and let y = (y_11, y_12, . . . , y_{kn_k})' be the N × 1 vector of responses. Finally, define Z_ij as a 1 × k vector of zeros except for a one in the i th entry, and let Z = (Z_11', Z_12', . . . , Z_{kn_k}')' be the corresponding N × k matrix. The joint posterior kernel in (12) can then be rewritten in this matrix notation. From it we easily obtain the conditional distributions of τ and γ given the other parameters, shown in (14). The conditional distribution for (β, u), given in (15), is not a recognizable distribution. In the literature, it is common to use Metropolis sampling or adaptive rejection sampling to sample from this type of distribution; our goal in the next section is to explore the possibility of a normal approximation.
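The incidence matrix Z is the only nonstandard object in this notation; a minimal Python/NumPy sketch of its construction (Zu then broadcasts each cluster's random effect to that cluster's observations):

```python
import numpy as np

def make_Z(cluster_ids, k):
    """N x k matrix whose row for observation (i, j) has a single 1
    in column i, the observation's cluster."""
    ids = np.asarray(cluster_ids)
    Z = np.zeros((ids.size, k))
    Z[np.arange(ids.size), ids] = 1.0
    return Z

# Six observations in three clusters of sizes 2, 1, 3.
Z = make_Z([0, 0, 1, 2, 2, 2], k=3)
u = np.array([0.5, -1.0, 2.0])
# Z @ u repeats each cluster effect once per observation in that cluster.
assert (Z @ u == np.array([0.5, 0.5, -1.0, 2.0, 2.0, 2.0])).all()
```

In practice Z is never stored densely; indexing u by cluster id gives the same product in O(N) time, but the matrix form keeps the algebra transparent.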

Normal approximation and sufficient reduction
We first consider the normal approximation for (15). By some calculation (details provided in Supplementary File Section S3), we obtain normal approximations for the conditional distributions of β and u, expressed in terms of the diagonal matrix D_{π̃(1−π̃)} = diag{π̃_ij(1 − π̃_ij)} with π̃_ij = expit(X_ijβ̃ + Z_ijũ) for j = 1, 2, . . . , n_i and i = 1, 2, . . . , k, where the definitions of β̃ and ũ are given in Supplementary File Section S3. Based on this result, we can easily design a block Gibbs sampling algorithm that directly samples (β, u), together with γ and τ using (14).
Next we consider the sufficient reduction, letting n_i → ∞ for each i = 1, . . . , k and letting k → ∞. In this setting we expect the dimension of u to be large, so that directly sampling from its distribution is expensive. In analogy with the definition of (T_1, T_2) in Section 2.2, we define a new set of statistics, T = (T_1, T_2, T_3), which can be viewed as a "sufficient reduction" for u; the conditional distributions for τ, γ, and β depend on u only through T. We also need the conditional distribution of T given everything else to form a Gibbs sampler for our SR method. As with Proposition 2, we consider a normal approximation for this joint conditional distribution. The approximation makes intuitive sense because T_1 and T_3 are affine transformations of a multivariate normal vector, while T_2 is a weighted sum of squares of normally distributed random variables. We present the means and covariances of (T_1, T_2, T_3) in Supplementary File Section S3.2.
In terms of computational complexity, the normal approximation is O(N^2 + k^3) and SR is O(N^2 + k^2) = O(N^2), since N = Σ_{i=1}^{k} n_i ≥ k. We therefore expect a substantial computational saving when k is large and the n_i's are fixed.

Cow abortion data analysis
We use the developed methodology to analyze cow abortion data. This data set consists of 13145 cows across 9 herds, where the herds vary in size from 116 to 2711 cows, with a mean (median) herd size of 1460 (1490) cows and a standard deviation of approximately 719 cows. The data were previously analyzed by Hanson et al. (2003), and the main interest here is to model the probability of a spontaneous abortion in dairy cows given two covariates, gravidity (GR) and days open (DO). Gravidity refers to the number of times the cow was pregnant before the current pregnancy (mean 3.2 pregnancies, median 3, standard deviation 1.5), and days open is the number of days between the most recent birth and conception (mean 95.6 days, median 79, standard deviation 48.4). We consider the two-level logistic model of Section 5, where Y_ij is a binary response with 1 indicating a recorded abortion for the j th cow (j = 1, 2, . . . , n_i) in the i th herd (i = 1, 2, . . . , 9). Since both covariates are quantitative, we standardize them prior to analysis.

We implemented four methods to analyze these data: (i) the adaptive rejection sampling used in JAGS, which serves as the benchmark for comparison; (ii) the proposed normal approximation within the Gibbs sampler; (iii) the sufficient reduction method; and (iv) a Metropolis-Hastings algorithm that uses the normal approximation as a proposal distribution. Method (iv) is new to the presentation. For each of these methods, we ran three Markov chains, each with 10,000 iterations, the first 3,000 discarded as burn-in.
We focus on four quantities of interest: (a) expit(γ), the probability of abortion for an average cow (one taking average values of days open and gravidity); (b) expit(γ + 1.65σ), the 95th percentile of the abortion probability for the average cow; (c) exp(β_1), the odds ratio for days open; and (d) exp(β_2), the odds ratio for gravidity. We summarize the posterior densities of these four quantities in Figure 4. The posterior densities obtained from all four methods are nearly identical, confirming the approximation accuracy of our three proposed methods. The good normal approximation performance is likely due to the large values of n_i (the number of cows in each herd); even the smallest herd has 116 cows. We also present summary statistics for the quantities of interest in Table 6. Since we standardized the covariates, the odds ratios are interpreted per one standard deviation increase in the covariate. For example, exp(β_1) represents the change in the odds of spontaneous abortion between two cows with identical gravidity values but with days open differing by 48.4 days. Similarly, exp(β_2) is the change in odds of spontaneous abortion between two cows with identical days open but differing by one standard deviation in gravidity, approximately 1.5 additional previous pregnancies.
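Given posterior draws of (γ, σ, β_1, β_2) from any of the four samplers, the quantities (a)–(d) are computed draw by draw and then summarized. The draws below are synthetic stand-ins, purely to show the transformations; in practice they come from the MCMC output.

```python
import numpy as np

rng = np.random.default_rng(2)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in posterior draws (synthetic, for illustration only).
gamma = rng.normal(-3.0, 0.05, size=1000)
sigma = np.abs(rng.normal(0.5, 0.05, size=1000))
beta1 = rng.normal(0.2, 0.05, size=1000)
beta2 = rng.normal(0.1, 0.05, size=1000)

q_a = expit(gamma)                  # (a) abortion probability, average cow
q_b = expit(gamma + 1.65 * sigma)   # (b) 95th percentile of that probability
q_c = np.exp(beta1)                 # (c) odds ratio per 1 SD (48.4 days) of DO
q_d = np.exp(beta2)                 # (d) odds ratio per 1 SD (~1.5) of GR

for name, q in [("(a)", q_a), ("(b)", q_b), ("(c)", q_c), ("(d)", q_d)]:
    print(name, q.mean().round(3), np.percentile(q, [2.5, 97.5]).round(3))
```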
The computational times (in seconds) are 725 for the normal approximation, 873 for the sufficient reduction, and 948 for the Metropolis-Hastings algorithm using the normal approximation as a proposal. Sufficient reduction is slower than the normal approximation in this example because k = 9 is quite small. The M-H algorithm takes the longest because it requires computing the normal approximation as an intermediate step.

Discussion
In this paper, we explored the idea of large-sample approximation for enhanced MCMC sampling in two GLMMs. This was implemented by replacing the usual Metropolis-Hastings and adaptive-rejection samplers with direct sampling from a normal distribution within the Gibbs sampler. In the future, it will be of interest to extend the ideas of normal approximation and sufficient reduction to more complex models, e.g., general forms of GLMMs and models with additional hierarchies. Computationally, as complexity grows, it will be of interest to implement the proposed methods in a more efficient platform such as C++ and to develop faster algorithms for precision matrix calculation. Missing from our presentation so far is an investigation of the effect of "big data" values of k on the various methods, including those not considered here. We plan to investigate potential improvements for data sets with values of k that are orders of magnitude larger than those considered here.

Supplementary Material
Web-based Supplementary File for "Normal approximation for Bayesian mixed effects binomial regression model" (DOI: 10.1214/00-BA1312SUPP; .pdf). The supplementary material contains technical details and proofs that the full conditionals for u and for T (u) are asymptotically normal, as well as details for the more complex model.