Improving the INLA approach for approximate Bayesian inference for latent Gaussian models

We introduce a new copula-based correction for generalized linear mixed models (GLMMs) within the integrated nested Laplace approximation (INLA) approach for approximate Bayesian inference for latent Gaussian models. While INLA is usually very accurate, some (rather extreme) cases of GLMMs with, e.g., binomial or Poisson data have been seen to be problematic. Inaccuracies can occur when there is a very low degree of smoothing or "borrowing strength" within the model, and we have therefore developed a correction aiming to push the boundaries of the applicability of INLA. Our new correction has been implemented as part of the R-INLA package, and adds only negligible computational cost. Empirical evaluations on both real and simulated data indicate that the method works well.


Introduction
Integrated Nested Laplace Approximations (INLA) were introduced by Rue et al. (2009) as a tool to do approximate Bayesian inference in latent Gaussian models (LGMs). The class of LGMs covers a large part of the models used today, and the INLA approach has been shown to be very accurate and extremely fast in most cases. Software is provided through the R-INLA package; see http://www.r-inla.org.
An important subclass of LGMs is the rich family of generalized linear mixed models (GLMMs) with Gaussian priors on fixed and random effects (Breslow & Clayton, 1993; McCulloch et al., 2008). The use of INLA for Bayesian inference for GLMMs was investigated by Fong et al. (2010), who reanalyzed all of the examples from Breslow & Clayton (1993). Fong et al. (2010) found that INLA is somewhat inaccurate for binary data with few or no replications. In this paper, we introduce a new correction term for INLA, significantly improving accuracy while adding negligibly to the overall computational cost.
To set the scene, we consider a minimal simulated example illustrating the problem (postponing more thorough empirical evaluations until Section 3). Consider the following model: For i = 1, 2, . . ., n, let Prob(y_i = 0) = 1 − p_i and Prob(y_i = 1) = p_i, with logit(p_i) = β + u_i, where u_i ∼ N(0, σ²), iid; we refer to this as model (1). Let the precision σ⁻² have a Gamma(1, 1) prior, while the prior for β is N(0, 1). We simulated data from this model, setting n = 100, σ² = 1 and β = 2. Figure 1 shows the resulting posterior distributions for the intercept β and for the log precision, log(σ⁻²), where the histograms show results from long MCMC runs using JAGS (Plummer, 2013), the black curves show posteriors from INLA without any correction, and the red curves show results using the new correction defined in Section 2. While some of our later examples show more dramatic differences between INLA and long MCMC runs, these results exemplify quite well our general experience with using INLA for "difficult" binary response GLMMs: variances of both random and fixed effects tend to be underestimated, while the means of the fixed effects are reasonably well estimated. One part of the problem is that the usual assumptions ensuring asymptotic validity of the Laplace approximation do not hold here (for details on asymptotic results, see the discussion in Section 4 of Rue et al.
(2009)). The independence of the random effects makes the effective number of parameters (Spiegelhalter et al., 2002) on the order of the number of data points. In more realistic models, there is often some amount of smoothing or replication that alleviates the problem, but it may still occur. Except in the case of spline smoothing models (Kauermann et al., 2009), there is a lack of strong asymptotic results for random effects models with a large effective number of parameters. In the simulation from model (1), the data provide little information about the parameters, with the shape of the likelihood function adding to the problem. Figure 2 illustrates the general problem. Here, the top panel shows the log-likelihood of a single Bernoulli observation Y as a function of the linear predictor η, i.e. log(Prob(Y = 1)) = log(p), where logit(p) = η. The bottom panel shows the corresponding derivative. We see that the log-likelihood gets very flat (and the derivative near zero) for high values of η, so inference becomes difficult.
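The flattening of the Bernoulli log-likelihood described above is easy to verify numerically. The short sketch below (ours, not from the paper) evaluates log Prob(Y = 1) = −log(1 + exp(−η)) and its derivative with respect to η, which equals 1 − p under the logit link:

```python
import numpy as np

def bernoulli_loglik_and_grad(eta):
    """Log-likelihood log Prob(Y = 1) = -log(1 + exp(-eta)) for a single
    Bernoulli observation y = 1 under a logit link, and its derivative
    d/d(eta) = 1 - p, which tends to zero as eta grows."""
    loglik = -np.log1p(np.exp(-eta))
    p = 1.0 / (1.0 + np.exp(-eta))
    return loglik, 1.0 - p

for eta in [0.0, 2.0, 5.0, 10.0]:
    ll, g = bernoulli_loglik_and_grad(eta)
    print(f"eta = {eta:5.1f}  loglik = {ll:9.5f}  derivative = {g:.5f}")
```

Already at η = 10 both the log-likelihood and its derivative are within about 5 × 10⁻⁵ of their limits, so the data contribute almost no curvature to the posterior in that region.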
Bayesian and frequentist estimation for GLMMs with binary outcomes has received some attention in the recent literature (Capanu et al., 2013; Grilli et al., 2014; Sauter & Held, 2015), but a computationally efficient Bayesian solution appropriate for the INLA approach has been lacking. An alternative to our new approach would be to consider higher-order Laplace approximations (Shun & McCullagh, 1995; Raudenbush et al., 2000; Evangelou et al., 2011), other modifications to the Laplace approximation (Ruli et al., 2015; Ogden, 2015), or expectation propagation-type solutions (Cseke & Heskes, 2010), but we view these as too costly for general use in INLA. The motivation for using INLA is speed, so we see it as a design requirement that any correction should add minimally to the running time of the algorithm. We proceed as follows. In Section 2, we present a derivation of our new correction method. Section 3 presents empirical studies, on both real and simulated data, showing that the method works well in practice. Finally, Section 4 gives a brief discussion and some concluding remarks.
Before we describe the copula construction, we need to define some notation. First, for i = 1, . . ., n, let μ̃_i(θ) and σ̃²_i(θ) denote the mean and variance of each marginal π_SN(x_i|θ, y), and let F_i be the cumulative distribution function corresponding to π_SN(x_i|θ, y). Second, let x̃_i ∼ F_i and let F̃_i be the distribution of the standardized variable z̃_i = (x̃_i − μ̃_i(θ))/σ̃_i(θ). As usual, Φ denotes the cumulative standard Gaussian distribution function. Furthermore, let μ_i(θ) and σ²_i(θ) denote the marginal means and variances of the Gaussian approximation π_G(x|θ, y), let Q(θ) be the precision matrix of π_G(x|θ, y), let x = (x_1, . . ., x_n) where x ∼ π_G(x|θ, y), and define z_i = (x_i − μ_i(θ))/σ_i(θ). We will now show how to construct a joint distribution having marginals F_i and the dependence structure from π_G(x|θ, y), using a Gaussian copula (see e.g. Nelsen (2007) for a general introduction). First, note that Φ(z_i) ∼ U[0, 1] by the probability integral transform (PIT). Let z̃_i = F̃_i⁻¹(Φ(z_i)). Applying the inverse of the PIT then yields that z̃_i ∼ F̃_i, from which it follows that x̃_i = σ̃_i(θ)z̃_i + μ̃_i(θ) is distributed as x̃_i ∼ F_i, which is the marginal distribution we want. Since we have only done marginal transformations, the dependence structure of the original x = (x_1, . . ., x_n) is still intact. Thus, to construct the new approximation to the joint distribution π(x|θ, y), we define the transformed value x̃_i as follows: x̃_i = σ̃_i(θ) F̃_i⁻¹(Φ((x_i − μ_i(θ))/σ_i(θ))) + μ̃_i(θ). (4)
We may simplify the construction above by replacing the F̃_i in (4) by Φ and assuming σ̃_i(θ) = σ_i(θ). This means that we do not correct for skewness, but we take advantage of the improved mean μ̃_i(θ) from π_SN(x_i|θ, y). We denote this the "mean only" correction. (We will later discuss the possibility of retaining F̃_i as a skew normal, while still assuming σ̃_i(θ) = σ_i(θ); this we denote the "mean and skewness" correction.) In the simple "mean only" case, the transformation reduces to a shift in mean: the Jacobian is equal to one, and the transformed joint density is a multivariate normal with mean μ̃(θ) and precision matrix Q(θ).
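The marginal copula transform can be sketched numerically. The snippet below is our own illustration, not R-INLA code: it takes draws from one Gaussian-approximation marginal, applies the PIT through Φ, and pulls the result back through the inverse CDF of a standardized skew normal (scipy's `skewnorm` standing in for F̃_i); the stand-in moments and skewness parameter are invented for the example.

```python
import numpy as np
from scipy.stats import norm, skewnorm

rng = np.random.default_rng(0)

# Illustrative stand-in values (not taken from R-INLA internals):
mu_G, sd_G = 0.0, 1.0      # marginal mean/sd of the Gaussian approximation
mu_SN, sd_SN = 0.3, 1.0    # improved mean from the skew-normal marginal
alpha = 2.0                # skewness parameter of the stand-in F~_i

# Standardize the skew normal so F~_i has mean 0 and variance 1.
delta = alpha / np.sqrt(1.0 + alpha**2)
m = delta * np.sqrt(2.0 / np.pi)        # mean of skewnorm(alpha)
s = np.sqrt(1.0 - m**2)                 # sd of skewnorm(alpha)

# Draws x from a marginal of the Gaussian approximation.
x = rng.normal(mu_G, sd_G, size=50_000)

# Copula transform: standardize, apply the PIT via Phi, pull back through
# the inverse CDF of the standardized skew normal, then rescale.
z = (x - mu_G) / sd_G
z_tilde = (skewnorm.ppf(norm.cdf(z), alpha) - m) / s
x_tilde = sd_SN * z_tilde + mu_SN
```

Because the map from x to x̃ is monotone and acts on each marginal separately, the draws keep their ordering (and hence the Gaussian dependence structure in the multivariate case) while the marginal mean shifts to μ̃_i.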
For the INLA implementation of the copula correction, we have found that it is sufficient to include only the fixed effects (including any random effects of length one) in the calculation of C(θ). This gives better numerical stability and also seems to provide a more accurate approximation. Conceptually, this step involves finding Σ(θ) = Q(θ)⁻¹, and then the submatrix Σ_{J,J}(θ), where J is the index set of the fixed effects, so it might seem computationally costly. However, it can be done cheaply by using the linear combination feature described in Section 4.4 of Martins et al. (2013): if n_f is the number of fixed effects, only the (parallel) solution of an n_f-dimensional linear system is needed.
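The point that the fixed-effects block of Σ(θ) can be obtained without a full inversion can be illustrated as follows. This is a toy sketch (a dense random precision matrix, not an R-INLA object): one solves n_f linear systems with unit right-hand sides instead of computing Q(θ)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy symmetric positive-definite precision matrix Q for a latent field.
n = 8
A = rng.normal(size=(n, n))
Q = A @ A.T + n * np.eye(n)

J = [0, 1]                  # index set of the "fixed effects" (n_f = 2)
E = np.eye(n)[:, J]         # one unit vector per fixed effect

# n_f solves instead of a full inversion: columns of S are Q^{-1} e_j.
S = np.linalg.solve(Q, E)
Sigma_JJ = S[J, :]          # the (J, J) block of Sigma = Q^{-1}

# Sanity check against the (expensive) full inverse.
assert np.allclose(Sigma_JJ, np.linalg.inv(Q)[np.ix_(J, J)])
```

With a sparse Q(θ), as in INLA, each solve is cheap, so the cost of the correction stays negligible relative to the rest of the algorithm.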
Additionally, to guard against over-correction, we perform a soft thresholding of C(θ), as follows: First we define a sigmoidal function f(t) which is increasing, has derivative equal to one at the point t = 0, and satisfies f(t) → 1 as t → ∞.
Then we replace C(θ) by C_t(θ) = u f(C(θ)/u), where u = n_f ξ, with the "correction factor" parameter ξ > 0 determining the degree of shrinkage (more shrinkage for smaller values of ξ). Since the function f(t) is approximately linear with unit slope around zero, C_t(θ) will be close to C(θ) for small and moderate values of C(θ), while larger values will be increasingly shrunk. Note that since f(t) < 1 for all t, C_t(θ) < u. We have found that ξ = 10 is a good choice, letting the correction do its job while guarding against too large changes, and we have used this value for all of the examples.
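The soft thresholding can be sketched as follows. The paper's exact sigmoid is not reproduced here; `tanh` is used as a stand-in since it satisfies the stated properties (increasing, unit slope at zero, limit one), and the form C_t(θ) = u·f(C(θ)/u) is assumed from those properties.

```python
import numpy as np

def soft_threshold(C, n_f, xi=10.0):
    """Shrink the correction C(theta) toward zero for large values.

    f is any increasing sigmoid with f'(0) = 1 and f(t) -> 1; tanh is a
    stand-in here, as the paper's exact choice is not reproduced.  With
    u = n_f * xi, C_t = u * f(C / u) is close to C for small |C| and is
    bounded above by u."""
    u = n_f * xi
    return u * np.tanh(C / u)

print(soft_threshold(1.0, n_f=3))    # small corrections pass through almost unchanged
print(soft_threshold(1e4, n_f=3))    # large corrections are capped near u = 30
```

The bound u = n_f ξ scales with the number of fixed effects, so the allowed correction grows with the size of the block being corrected rather than being a fixed constant.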
As mentioned, we have also investigated a more general case of the copula construction, where we retain F̃_i as a skew normal distribution, i.e. corresponding to the CDF of π_SN(x_i|θ, y) (and σ̃_i(θ) = σ_i(θ) as before). This results in a more complicated correction term C_skew(θ), derived in Appendix A. We have not found any appreciable differences in accuracy compared to the simpler case without skewness, so we have concluded that the non-skew version is preferable due to its simplicity. We will show both the skew and the non-skew correction for the toenail data discussed in Section 3.2, but otherwise we show only results from the simpler non-skew version. We have tried both corrections on many data sets (both real and simulated) and never seen a significant difference in the results.

Have we solved the problems detected by Fong et al. (2010)?
As mentioned in the Introduction, Fong et al. (2010) studied the use of INLA for binary valued GLMMs, and they showed that the approximations were inaccurate in some cases. We have redone the full simulation experiment described on pages 10-14 of the Supplementary Material of Fong et al. (2010), both for INLA without any correction and for INLA with the "mean only" correction described in Section 2.
We only show results from the m = 1 (i.e. binary data) version of model (8), as this is the most difficult case with the largest errors in the approximation. The correction also works well for m > 1 and for the more complex model (9), but these cases are easier for INLA to deal with. This is seen empirically, and is also as expected from the asymptotic properties of the Laplace approximation: for m > 1 and model (9) there is more "borrowing of strength"/replication, so the original approximation should work better.
Results from the simulation study are shown in Table 1. Note that the aim here is to be as close as possible to the MCMC results, not to the true values. The upper part of the table shows averages of the posterior means over the 100 simulations. We see that corrected INLA gets much closer to the MCMC results for all parameters except β_3, which is in any case reasonably close to the MCMC value. The improvement is particularly large for the variance parameter. This is also seen in the middle panel, which shows (E(θ_INLA|y) − E(θ_MCMC|y))/sd(θ_MCMC|y) for each parameter θ (averaged over the 100 simulations), i.e. the difference between the INLA and MCMC estimates scaled by the MCMC standard deviation. Here, the random effects variance and the fixed effects except β_3 are also more accurately estimated. Finally, the lower panel shows the ratios Var(θ_INLA|y)/Var(θ_MCMC|y); here the correction improves the estimation of the variance for all parameters. For σ²_0, σ_0 and log σ⁻²_0 we get very close to a ratio of one, and there are also major improvements for the fixed effects variances.

What happens when the random effects variance increases?
We start by discussing the toenail data, a classical data set with a binary response and repeated measures (De Backer et al., 1998; Lesaffre & Spiessens, 2001). The data are from a clinical trial comparing two competing treatments for toenail infection (dermatophyte onychomycosis). The 294 patients were randomized to receive either itraconazole or terbinafine, and the outcome (either "not severe" infection, coded as 0, or "severe" infection, coded as 1) was recorded at seven follow-up visits. Not all patients attended all the follow-up visits, and the patients did not always appear at the scheduled time. The exact time of the visits (in months since baseline) was recorded. For individual i, visit j, with outcome y_ij, treatment Trt_i and time Time_ij, our model is then logit(Prob(y_ij = 1)) = α_0 + α_1 Trt_i + α_2 Time_ij + α_3 Trt_i · Time_ij + u_i, with u_i ∼ N(0, σ²). Notice that this is the same model as model (8), except that the time variable here varies over individuals. INLA underestimates σ² quite severely. The top panel of Figure 3 shows the different estimates of the posterior distribution of the log precision, log σ⁻². The histogram shows the results from a long MCMC run using JAGS, the black curve shows the posterior from INLA without the correction, the red curve shows the simple (non-skew) version of the INLA correction, while the green curve shows the INLA correction accounting for skewness (as discussed in Appendix A). The bottom panel of Figure 3 shows the additive correction to the log posterior density, as a function of the hyperparameter (log precision). We see that there is very little difference between the two corrections.
For the toenail data, the estimated random effects standard deviation is approximately σ = 4, which is very high. To investigate how the copula correction behaves as σ increases, we studied simulated data sets from the model above, where we set σ to different values and fixed the α parameters at the values from a long MCMC run on the real toenail data (i.e., we simulate only the outcome, keeping the covariates unchanged). Results are shown in Figure 4 for values of σ ranging from σ = 1 to σ = 16. We get very accurate corrected posteriors for σ < 4. For σ ≥ 4, we gradually see a tendency toward under-correction.
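The simulation design above can be sketched as follows. This is our own toy illustration of a random-intercept logistic model of the toenail type: the design (20 subjects, 7 equally spaced visits) and the α values are invented for the example, whereas the paper keeps the real covariates and uses MCMC-estimated coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_toenail_like(trt, time, alpha, sigma, rng):
    """Simulate binary outcomes from a random-intercept logistic model:
    logit(p_ij) = a0 + a1*Trt_i + a2*Time_ij + a3*Trt_i*Time_ij + u_i,
    with u_i ~ N(0, sigma^2).  Covariates and coefficients here are
    illustrative, not the real toenail data."""
    u = rng.normal(0.0, sigma, size=trt.shape[0])
    eta = (alpha[0] + alpha[1] * trt[:, None] + alpha[2] * time
           + alpha[3] * trt[:, None] * time + u[:, None])
    p = 1.0 / (1.0 + np.exp(-eta))
    return rng.binomial(1, p)

# Toy design: 20 subjects, 7 visits at months 0..6; sigma varied as in the study.
trt = rng.integers(0, 2, size=20)
time = np.tile(np.arange(7.0), (20, 1))
for sigma in [1.0, 2.0, 4.0, 8.0, 16.0]:
    y = simulate_toenail_like(trt, time, alpha=(-0.5, -0.2, -0.3, -0.1),
                              sigma=sigma, rng=rng)
```

As σ grows, each subject's linear predictor is pushed toward ±∞, so most subjects produce all-zero or all-one response vectors, which is exactly the regime where the likelihood carries little information about σ.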

Simulated example with Poisson likelihood
As our final example, we study the case where the data are Poisson distributed. We consider a simple simulated (extreme) example in order to investigate how well the correction works in the Poisson case. For i = 1, . . ., n we generated iid y_i ∼ Poisson(μ_i), where log(μ_i) = β + u_i, with u_i ∼ N(0, σ²). We chose n = 300 and σ² = 1, and a Gamma(1, 1) prior for the precision σ⁻². Figure 5 shows the results for different values of the intercept β. Each histogram is based on ten parallel MCMC runs using JAGS, each with 200,000 iterations after a burn-in of 100,000. Here, reducing β makes estimation more difficult, since a negative β of large absolute value implies that the counts y_i are very low, with many zeros, so the data are uninformative. We see that uncorrected INLA tends to get less accurate as β moves toward more extreme values, while the correction seems to work well.
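The sparsity mechanism described above is easy to demonstrate. The sketch below (ours, not the paper's JAGS/INLA code) simulates from the stated model and reports the fraction of zero counts as the intercept β decreases:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_poisson_glmm(n=300, beta=0.0, sigma=1.0, rng=rng):
    """Simulate y_i ~ Poisson(mu_i) with log(mu_i) = beta + u_i,
    u_i ~ N(0, sigma^2), following the simulation design in the text."""
    u = rng.normal(0.0, sigma, size=n)
    return rng.poisson(np.exp(beta + u))

# More negative intercepts give sparser counts (more zeros), i.e. less
# informative data and a harder problem for the Laplace approximation.
for beta in [0.0, -2.0, -4.0]:
    y = simulate_poisson_glmm(beta=beta)
    print(f"beta = {beta:5.1f}  zero fraction = {np.mean(y == 0):.2f}")
```

At β = −4 nearly all counts are zero, which mimics (in stylized form) the sparse-data regime of log-Gaussian Cox processes discussed in Section 4.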

Discussion
The binary (and, more generally, binomial) GLMMs discussed in Sections 3.1 and 3.2 are important in many applications, particularly for biomedical data. Poisson GLMMs are also of great interest, and among the difficult cases here are point processes such as log-Gaussian Cox processes (Illian et al., 2012; Simpson et al., 2013), where data are typically extremely sparse: essentially there are ones at the observed points and zeros everywhere else. Our example in Section 3.3 shows a stylized, extreme case of this type. Studying the correction for the full log-Gaussian Cox process case could be a topic for future work. Even though the point process case may be difficult, there will often be some degree of smoothing and/or replication making inference easier, so real data sets should be less extreme than the simulated example in Section 3.3.
From the results in this paper, it appears that the copula correction is robust and works well. There is no general theory guaranteeing that the method will always work under all circumstances, but we feel that the intuition underlying the method is quite strong. Since INLA for LGMs is quite accurate in most cases, the correction is not needed in general, only for problematic cases such as those discussed in this paper. Using the copula correction method, we can stretch the limits of applicability of INLA while maintaining its computational speed.

From the copula construction we have Φ(z_i) = F̃_i((x̃_i − μ̃_i(θ))/σ̃_i(θ)). Letting f̃_i and φ denote the density functions corresponding to F̃_i and Φ, differentiating with respect to x̃_i then gives φ(z_i) ∂z_i/∂x̃_i = f̃_i((x̃_i − μ̃_i(θ))/σ̃_i(θ)) / σ̃_i(θ).

Figure 1: Minimal example defined in Equation (1). The histograms show posterior distributions from a long MCMC run (ten chains of one million iterations each), the black curves show the posteriors from INLA, while the red curves show the posteriors using our new correction to INLA.

Figure 2: Log-likelihood (top panel) and derivative of the log-likelihood (bottom panel) for a single Bernoulli observation as a function of the linear predictor in a logistic model.

Figure 3: The top panel shows the posterior density of the hyperparameter (log precision) for the toenail data. The histogram is from one million MCMC samples from JAGS, the blue curve is from uncorrected INLA, the red curve from INLA with the "mean only" correction, and the green curve from INLA with the "mean and skewness" correction. The bottom panel shows the additive corrections to the log posterior for the toenail data for the "mean only" and the "mean and skewness" corrections.

Figure 4: Results for simulated toenail-type data for different values of σ.

Figure 5: Results for the simulated Poisson example for different values of the intercept β.

Table 1: Results from the simulation study.