Max-and-Smooth: A Two-Step Approach for Approximate Bayesian Inference in Latent Gaussian Models

With modern high-dimensional data, complex statistical models are necessary, requiring computationally feasible inference schemes. We introduce Max-and-Smooth, an approximate Bayesian inference scheme for a flexible class of latent Gaussian models (LGMs) where one or more of the likelihood parameters are modeled by latent additive Gaussian processes. Our proposed inference scheme is a two-step approach. In the first step (Max), the likelihood function is approximated by a Gaussian density with mean and covariance equal to either (a) the maximum likelihood estimate and the inverse observed information, respectively, or (b) the mean and covariance of the normalized likelihood function. In the second step (Smooth), the latent parameters and hyperparameters are inferred and smoothed with the approximated likelihood function. The proposed method ensures that the uncertainty from the first step is correctly propagated to the second step. Because the prior density for the latent parameters is assumed to be Gaussian and the approximated likelihood function is Gaussian, the approximate posterior density of the latent parameters (conditional on the hyperparameters) is also Gaussian, thus facilitating efficient posterior inference in high dimensions. Furthermore, the approximate marginal posterior distribution of the hyperparameters is tractable, and as a result, the hyperparameters can be sampled independently of the latent parameters. We show that the computational cost of Max-and-Smooth is close to being insensitive to the number of independent data replicates, and that it scales well with increased dimension of the latent parameter vector provided that its Gaussian prior density is specified with a sparse precision matrix. In the case of a large number of independent data replicates, sparse precision matrices, and high-dimensional latent vectors, the speedup is substantial in comparison to an MCMC scheme that infers the posterior density from the exact likelihood function. The accuracy of the Gaussian approximation to the likelihood function increases with the number of data replicates per latent model parameter. The proposed inference scheme is demonstrated on one spatially referenced real dataset and on simulated data mimicking spatial, temporal, and spatio-temporal inference problems. Our results show that Max-and-Smooth is accurate and fast.


Introduction
Data are being generated today at an unprecedented rate. Many datasets are large and exhibit complex marginal behaviors and dependence structures. In particular, data that are indexed in space and time may display non-linear time trends and spatial patterns, and may be driven by complex space-time interactions. Statistical modeling and inference for high-dimensional spatio-temporal datasets becomes increasingly computationally demanding as the spatial and/or temporal dimension increases (see, e.g., Heaton et al., 2019). The same is true for, e.g., multinomial data (see, e.g., Gustafson et al., 2008) and survival data (see, e.g., Lee et al., 2011), i.e., the computational cost increases as the number of units or individuals increases.
In this paper we focus on latent Gaussian models (LGMs), which form a general and very flexible class of models that has proven to be useful in a wide range of concrete applications (see, e.g., Gelfand et al., 2007; Cooley et al., 2007; Rue et al., 2009; Margeirsson et al., 2010; Sigurdarson and Hrafnkelsson, 2016; Zinszer et al., 2017; Opitz et al., 2018; Lombardo et al., 2018, 2019). We here introduce Max-and-Smooth, a novel approximate Bayesian inference procedure for LGMs with independent data replicates that is both accurate and fast, providing significant speedups in high dimensions. Our approach has superficial similarities with the recent contribution by Risser et al. (2019), who propose a frequentist two-step inference approach and focus on the GEV distribution. In contrast, our proposed inference scheme is more general and designed for fully Bayesian inference.
LGMs are Bayesian hierarchical models (Cressie and Wikle, 2011;Banerjee et al., 2014) that consist of three levels: the data level, the latent level and the hyperparameter level. Each level is specified with a probability distribution, the one at the latent level being Gaussian. For computational reasons, it is common to assume that the data are conditionally independent, given the latent process, and we also assume this here. The role of the latent process is to capture the underlying dynamics of the data (such as space-time dependence interactions). Our focus in this paper is mainly on three types of spatio-temporal LGMs that are useful in different settings, although our proposed method applies more generally, e.g., with (replicated) multinomial or survival data. The first type of LGMs that we consider assumes that the spatio-temporal dynamics of the data is described by latent parameters that vary spatially but are constant in time, and that (a potentially different number of) data time replicates are available at each spatial location. This type of LGMs focuses on capturing the data's spatial behavior, although datasets with temporal covariates or slowly-varying temporal trends may also be modeled in the same framework. The second type of LGMs assumes that the latent parameters vary in time, and that several spatial replicates of the data are available at each time point. In this setting the latent parameters are usually constant in space, although they may also refer to the main effects of distinct regions that vary over time. The focus is therefore on capturing the data's temporal behavior. Finally, the third type of LGMs assumes that the latent process varies in both space and time, and that several replicates for each spatio-temporal observation are available.
Several strategies have been proposed to fit LGMs. Simulation-based Markov chain Monte Carlo (MCMC) methods can be used (see Cressie and Wikle, 2011; Banerjee et al., 2014), although their application in high-dimensional settings (i.e., situations with either space-rich and/or time-rich data, or with many parameters involved at the latent level) may be limited by the computational complexity. In order to make MCMC sampling more efficient, Knorr-Held and Rue (2002) proposed a single block updating strategy for LGMs characterized by a univariate link function. Their strategy reduces the cross-correlation between the hyperparameters and the latent parameters within the posterior samples. A detailed comparison of several sampling strategies for LGMs in Filippone et al. (2013) showed that the single block updating strategy of Knorr-Held and Rue (2002) yields a larger effective sample size than sufficient augmentation, ancillary augmentation, the ancillarity-sufficiency interweaving strategy (Yu and Meng, 2011), and the surrogate method (Murray and Adams, 2010). Another approach to infer LGMs was proposed by Filippone and Girolami (2014), who suggested using a pseudo-marginal sampling procedure for the marginal posterior density of the hyperparameters, which relies on the Metropolis-Hastings algorithm and importance sampling. Essentially, samples from the marginal posterior density of the hyperparameters are obtained first, and the latent parameters are then sampled from their conditional posterior density given the hyperparameters. Filippone and Girolami (2014) compared their pseudo-marginal approach to ancillary augmentation (Yu and Meng, 2011) and the surrogate method (Murray and Adams, 2010), and found that the effective sample size of the hyperparameters was much lower for ancillary augmentation and the surrogate method than for the pseudo-marginal approach. Filippone and Girolami (2014) concluded that this was due to ancillary augmentation and the surrogate method not being fully capable of breaking down the correlation between the hyperparameters and the latent parameters. The findings of Filippone et al. (2013) and Filippone and Girolami (2014) underline that sampling from the marginal posterior density of the hyperparameters in an LGM leads to more effective sampling schemes.
Alternatively, the integrated nested Laplace approximation (INLA) has proven to be very fast and accurate for approximate Bayesian inference in LGMs (Rue et al., 2009). The INLA methodology essentially bypasses MCMC sampling by performing a numerical approximation of the posterior density. Due to its computational efficiency and its convenient implementation in the package INLA for the R statistical computing environment, the INLA method has found widespread interest, and has been applied in numerous settings; see the review papers by Rue et al. (2017) and Bakka et al. (2018), and references therein. However, the current implementation in the INLA package only supports LGMs characterized by a univariate link function (i.e., with one single Gaussian linear predictor at the latent level), and with a small number (typically less than 20) of hyperparameters. In Section 5, we discuss a linear regression model for spatio-temporal meteorological data, where the intercept, the covariate effect, and the residual variance vary spatially, thus requiring a trivariate link function in the likelihood. Generally speaking, it is common to assume that spatio-temporal data are described by LGMs of type (i), (ii) or (iii) above with multiple parameters (e.g., intercept, multiple covariate effects, scale and shape parameters, etc.) that vary spatially and/or temporally. These types of LGMs usually require multivariate link functions, and we will hereafter refer to models of this type as extended LGMs in line with Geirsson et al. (2020). Although it might be possible in principle to extend the INLA software to LGMs with multivariate link functions, this has not been implemented yet.
Posterior inference for extended LGMs in moderate or high dimensions is known to be challenging. Geirsson et al. (2020) developed an efficient block Gibbs sampling scheme, referred to as the LGM sampler, which was shown to significantly reduce the autocorrelation in posterior samples. In this paper, we propose using Max-and-Smooth, a novel two-step approximate Bayesian inference approach for extended LGMs, which borrows ideas from the INLA method and the LGM sampler, in order to be both fast and accurate in high dimensions. Essentially, our approach approximates the likelihood function by a Gaussian likelihood, similar to the Laplace approximation used in the INLA method. This allows us to perform fast inference with a correct propagation of the uncertainty. The two steps of the inference scheme are as follows: (i) In the first step (Max), we compute the maximum likelihood (ML) estimates of the latent parameters at each spatial, temporal, or spatio-temporal point (depending on the type of LGM considered), and we approximate the covariance of the Gaussian approximation using the inverse observed information evaluated at the ML estimate. We also consider an alternative Gaussian approximation that uses the mean and covariance of the normalized likelihood function. (ii) In the second step (Smooth), we treat the ML estimate (or the mean of the normalized likelihood) as the observed data of the latent parameters, with a Gaussian likelihood (thus taking their estimated variance into account). We then fit the latent Gaussian model by taking advantage of the conjugacy properties of the approximate Gaussian-Gaussian model, which is hereafter referred to as the pseudo model. In other words, we essentially consider that the parameter estimates from the first step are noisy measurements of the latent field (with known noise variance) and we smooth them jointly in the second step. Notice that although our proposed approach has two consecutive steps, it properly propagates the uncertainty, and thus provides a valid approximate procedure to sample from the full posterior density. Our proposed procedure is very fast for a variety of reasons. First, the (approximate) sampling scheme is such that the unnormalized marginal posterior density of the hyperparameters can be expressed analytically, making it straightforward to sample from it. Second, the conditional density of the latent parameters given the hyperparameters is Gaussian, which is straightforward to sample from. Third, because the hyperparameters can be sampled independently from the latent parameters, similarly to Geirsson et al. (2020), their cross-correlation is reduced, which yields better MCMC mixing properties with a higher effective sample size. Fourth, as the computational cost of the second step (i.e., fitting the pseudo model) does not depend on the number of data replicates, our proposed procedure is especially well-suited for datasets with a large number of independent replicates. Finally, further speed-ups can be obtained by specifying the Gaussian prior density for the parameter vector at the latent level to be a Gaussian Markov random field (GMRF, Rue and Held, 2005) with a sparse precision matrix. When such GMRFs are used, Max-and-Smooth scales well with increasing dimension of the latent parameter vector.
Our proposed methodology involves approximations at two levels. First, the likelihood function is approximated by a Gaussian likelihood and may therefore be misspecified. Second, the variance of the Gaussian approximation to the likelihood has to be estimated from data in the first step, but is then treated as exact in the second step. Intuitively, if the shape of the "true" likelihood function is close to a Gaussian likelihood, our inference approach will be accurate. With a perfectly (or nearly) Gaussian likelihood, our inference approach will be very close to being exact provided the variance of the Gaussian approximation is properly estimated. In contrast, when the number of data replicates per latent model parameter is low, the Gaussian approximation may become a poor approximation to the likelihood, which might negatively impact the posterior inference. However, owing to the asymptotic behavior of the likelihood function in posterior inference (see, e.g., Schervish, 1995), the errors of these two levels of approximation will typically become negligible as the number of data replicates per latent model parameter grows. In other words, Max-and-Smooth is expected to perform increasingly well as the number of data replicates gets larger, with a negligible effect on the overall computational time. The Gaussian approximation based on the normalized likelihood function is more suitable for highly non-Gaussian likelihood functions, because the mean and covariance are propagated more precisely compared to the Gaussian approximation based on the ML estimate.
The paper is organized as follows. In Section 2, we introduce the extended latent Gaussian modeling framework, and in Section 3 we detail our proposed approximate Bayesian inference methodology and introduce Max-and-Smooth. In Section 4 we illustrate the strengths and weaknesses of our approach by simulation studies using different types of extended LGMs. We apply the proposed methodology to a real dataset in Section 5. Finally, Section 6 concludes with a discussion and directions for future research.

LGMs with a univariate link function
Latent Gaussian models are a subset of Bayesian hierarchical models in which parameters at the latent level have a joint Gaussian prior distribution, conditional on hyperparameters.
LGMs are a subclass of structured additive regression models where the observations, y_i (i = 1, . . . , n), are assumed to have a density from the exponential family and the mean or a particular quantile, μ_i, is then linked to a structured additive predictor, η_i, through a univariate link function g(·) such that g(μ_i) = η_i; see Rue et al. (2009). The structured additive predictor accommodates covariates and random effects in an additive way, namely, in vector notation,

η = Xβ + Au + ε,   (2.1)

where β denotes fixed effects with design matrix X, u denotes random effects with weight matrix A, and ε is an unstructured model error term. An LGM with a univariate link function may be represented hierarchically as follows in terms of the data level, the latent level and the hyperparameter level:

Data level The observations y = (y_1, . . . , y_n)^T are assumed to be dependent on the latent parameters x and have a density π(y|x, θ). Often, conditional independence is assumed for simplicity, that is, π(y|x, θ) factorizes as ∏_i π(y_i|x, θ), where π(y_i|x, θ) = π(y_i|η_i, θ). When μ_i is a particular quantile, it is defined in terms of an appropriate quantile function Q_p.

Latent level
The latent parameters x have a Gaussian prior density and are potentially dependent on some hyperparameters θ. The density of x may be written as

π(x|θ) = N(x | μ(θ), Σ(θ)),

where the right-hand side denotes a Gaussian density with mean vector μ(θ) and covariance matrix Σ(θ). This is in line with the notation in Gelman et al. (2013).

Hyperparameter level The hyperparameters θ are assigned a prior density π(θ).
An LGM is fully specified by the definition of these three levels. In the next section, we extend this framework to LGMs characterized by a multivariate link function.

LGMs with a multivariate link function
LGMs with a multivariate link function are referred to as extended LGMs (Geirsson et al., 2020). We assume here conditional independence at the data level for simplicity, and that the data can be lined up according to groups, e.g., sites, time points, spatiotemporal elements or categories. The models presented here have the same structure as the LGMs in Section 2.1, except that the assumption of the data density being in the exponential family is dropped and the vector x refers to several subsets of parameters found at the data level, each subset with its separate set of linear predictors at the latent level. In contrast, in the case of classical LGMs defined in Rue et al. (2009), only one single parameter of the data distribution is modeled at the latent level.
We assume that each group i has n_i observations. The total number of groups is G. Observations from the same group or from different groups are assumed to be conditionally independent given the latent process. The groups can represent various types of sampling setups. For example, the groups may be geological sites observed over time, i.e., each group corresponds to a site; or the groups may be time points where several observations are made at the same time point. The groups may also be spatio-temporal elements such that multiple observations are collected for each spatially and temporally referenced element. Furthermore, the groups may also represent generic categories that do not have any spatial or temporal reference, yet several observations are made within each category. Each group is described by M parameters. The general setup is such that the M subsets of parameters at the data level are mapped to M subsets of parameters at the latent level through an M-variate link function. Each of these subsets at the latent level is modeled with a linear model of the form in (2.1).
The probability density function of y_{i,j}, the j-th observation from group i, is denoted by π(y_{i,j}|ψ_{1,i}, ψ_{2,i}, . . . , ψ_{M,i}), where ψ_{1,i}, . . . , ψ_{M,i} are the parameters within group i such that (ψ_{1,i}, . . . , ψ_{M,i})^T ∈ D, and D is a subspace of R^M. Let y be the vector containing all the observations, let y_i be the observations from group i, and let ψ_1, ψ_2, . . . , ψ_M denote the M subsets of parameters at the data level, each vector containing only one type of parameters, e.g., all the location parameters or the regression slope coefficients for all groups. Conditional on ψ_1, ψ_2, . . . , ψ_M, the probability density function of y is

π(y|ψ_1, . . . , ψ_M) = ∏_{i=1}^{G} ∏_{j=1}^{n_i} π(y_{i,j}|ψ_{1,i}, . . . , ψ_{M,i}).

The parameters of each group are transformed to the latent scale through the M-variate link function, η_{m,i} = g_m(ψ_{m,i}), m = 1, . . . , M, so the domain of each η_{m,i} is the whole real line. Linear models for the M subsets of parameters at the latent level, i.e., the vectors η_m = (η_{m,1}, . . . , η_{m,G})^T, m = 1, . . . , M, are then specified and they may be expressed in vector notation as

η_m = X_m β_m + A_m u_m + ε_m,   m = 1, 2, . . . , M,   (2.2)

where β_1, β_2, . . . , β_M are fixed effects, X_1, X_2, . . . , X_M are the corresponding design matrices containing covariate information, u_1, u_2, . . . , u_M are random effects, A_1, A_2, . . . , A_M are their corresponding weight matrices, and ε_1, ε_2, . . . , ε_M are independent and unstructured error terms, referred to as model errors. The terms β_m, u_m, and ε_m, m = 1, 2, . . . , M, are assigned Gaussian prior densities and assumed to be a priori mutually independent.
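To make the data-level structure concrete, the following minimal Python sketch evaluates the log-likelihood of an extended LGM by summing group contributions under the conditional-independence assumption. The Gaussian data density with M = 2 latent-scale parameters, the function names, and the toy data are illustrative assumptions, not part of the model specification above.

```python
import numpy as np
from scipy.stats import norm

def group_loglik(eta_i, y_i):
    """Log-likelihood contribution of one group.

    eta_i : array of length M with the transformed (latent-scale) parameters of
            the group; here M = 2 with eta_i = (mu, log-sigma) as an illustrative
            Gaussian data density.
    y_i   : observations in the group.
    """
    mu, log_sigma = eta_i
    return np.sum(norm.logpdf(y_i, loc=mu, scale=np.exp(log_sigma)))

def data_loglik(eta, y_groups):
    """Total data-level log-likelihood: sum over conditionally independent groups.

    eta      : (G, M) array; row i holds the latent-scale parameters of group i.
    y_groups : list of G arrays, possibly of different lengths n_i.
    """
    return sum(group_loglik(eta[i], y_i) for i, y_i in enumerate(y_groups))

# Small usage example with G = 3 groups and M = 2 parameters per group.
rng = np.random.default_rng(0)
y_groups = [rng.normal(loc=1.0, scale=2.0, size=n_i) for n_i in (15, 20, 12)]
eta = np.column_stack([np.ones(3), np.log(2.0) * np.ones(3)])
print(data_loglik(eta, y_groups))
```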
In the following section, we develop an approximate inference scheme for the LGMs with a multivariate link function.

General idea
In this section, we detail an approximation to the posterior density used to perform inference for extended LGMs (recall Section 2.2). We then introduce Max-and-Smooth, a two-step approach that is fully Bayesian and utilizes a Gaussian approximation to the likelihood. This approximation, together with the conditional Gaussian prior at the latent level, results in a Gaussian-Gaussian model (conditional on the hyperparameters) that is referred to here as the pseudo model. We perform inference both for the hyperparameters and for the latent parameters in the original extended LGM by exploiting this Gaussian-Gaussian model. In Section 3.2, we describe the posterior density of extended LGMs, and in Section 3.3 we detail the two consecutive steps of our inference approach. The first step relies on a Gaussian approximation to the likelihood function. We actually propose two different approximations; one is based on the ML estimates and the inverse of the observed information, while the other one is based on the mean and covariance of the normalized likelihood function. The second step consists of inferring the resulting Gaussian-Gaussian pseudo model. For more details about the approximate inference scheme, see the Supplementary Material.

The posterior density of extended LGMs
The vectors for the model in (2.2) are gathered as follows:

η = (η_1^T, η_2^T, . . . , η_M^T)^T,   ν = (β_1^T, u_1^T, β_2^T, u_2^T, . . . , β_M^T, u_M^T)^T,   ε = (ε_1^T, ε_2^T, . . . , ε_M^T)^T.

A priori, the vectors β_1, u_1, β_2, u_2, . . . , β_M and u_M are assumed to be independent. Denote the means and precision (i.e., inverse covariance) matrices of these vectors by μ_{β,1}, μ_{u,1}, μ_{β,2}, μ_{u,2}, . . . , μ_{β,M}, μ_{u,M}, Q_{β,1}, Q_{u,1}, Q_{β,2}, Q_{u,2}, . . . , Q_{β,M} and Q_{u,M}, respectively. The prior mean of ν is therefore

μ_ν = (μ_{β,1}^T, μ_{u,1}^T, μ_{β,2}^T, μ_{u,2}^T, . . . , μ_{β,M}^T, μ_{u,M}^T)^T,

while the precision matrix of ν is a block diagonal matrix,

Q_ν = bdiag(Q_{β,1}, Q_{u,1}, Q_{β,2}, Q_{u,2}, . . . , Q_{β,M}, Q_{u,M}).

The precision matrices of ε_1, ε_2, . . . , ε_M are diagonal matrices that are denoted by

Q_{ε,m} = σ_{ε,m}^{−2} I,   m = 1, 2, . . . , M,

where I is the identity matrix and σ_{ε,1}, σ_{ε,2}, . . . , σ_{ε,M} are the corresponding standard deviations. The precision matrix of ε is thus given by

Q_ε = bdiag(σ_{ε,1}^{−2} I, σ_{ε,2}^{−2} I, . . . , σ_{ε,M}^{−2} I),

where the off-diagonal blocks are zero matrices. Simple matrix multiplication implies that (2.2) can be rewritten more compactly as

η = Zν + ε,   Z = bdiag((X_1 A_1), (X_2 A_2), . . . , (X_M A_M)),

where bdiag(·) denotes a block diagonal matrix with the given blocks. Since the model is an extended LGM, the prior densities of ν, and of η conditional on ν, are assumed to be Gaussian. Furthermore, the precision matrices Q_{β,1}, Q_{β,2}, . . . , Q_{β,M} are assumed to be fixed, while hyperparameters govern the precision matrices for the random effects u_1, u_2, . . . , u_M, i.e., Q_{u,1}, Q_{u,2}, . . . , Q_{u,M}. When the dimensions of u_1, u_2, . . . , u_M are large, and we require fast computation, then Q_{u,1}, Q_{u,2}, . . . , Q_{u,M} need to be sparse and thus specifying them with Gaussian Markov random fields (GMRF, Rue and Held, 2005) becomes crucial.
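The block structure above translates directly into sparse-matrix code. The sketch below assembles Q_ν, Q_ε and Z for a small example so that η = Zν + ε; the dimensions, the random-walk structure chosen for Q_{u,m} (which is intrinsic, hence singular, but sufficient for illustrating sparsity), and all numerical values are assumptions made purely for the example.

```python
import numpy as np
import scipy.sparse as sp

G, M = 50, 2                      # groups and latent parameter types
p = 3                             # number of fixed effects per type (assumed)

# Illustrative design and weight matrices (X_m: G x p, A_m: G x G identity here).
rng = np.random.default_rng(1)
X = [sp.csr_matrix(rng.normal(size=(G, p))) for _ in range(M)]
A = [sp.identity(G, format="csr") for _ in range(M)]

def rw1_structure(n):
    """Sparse first-order random-walk structure matrix (illustrative GMRF prior)."""
    D = sp.diags([np.ones(n - 1), -np.ones(n - 1)], offsets=[0, 1], shape=(n - 1, n))
    return (D.T @ D).tocsr()

tau_u = [4.0, 2.0]                                     # assumed hyperparameter values
Q_beta = [sp.identity(p) * 1e-4 for _ in range(M)]     # vague prior on fixed effects
Q_u = [tau_u[m] * rw1_structure(G) for m in range(M)]

# nu = (beta_1, u_1, ..., beta_M, u_M); its precision is block diagonal.
Q_nu = sp.block_diag([blk for m in range(M) for blk in (Q_beta[m], Q_u[m])], format="csr")

# epsilon has diagonal precision with one scale per parameter type.
sigma_eps = [0.1, 0.2]
Q_eps = sp.block_diag([sp.identity(G) / sigma_eps[m] ** 2 for m in range(M)], format="csr")

# Z is block diagonal with blocks (X_m  A_m), so that eta = Z nu + eps.
Z = sp.block_diag([sp.hstack([X[m], A[m]]) for m in range(M)], format="csr")

nu = rng.normal(size=Z.shape[1])
eps = rng.normal(scale=np.repeat(sigma_eps, G))
eta = Z @ nu + eps
print(Q_nu.shape, Q_eps.shape, Z.shape, eta.shape)
```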
The joint posterior density of (η, ν, θ) may be expressed as

π(η, ν, θ|y) ∝ π(y|η) π(η, ν|θ) π(θ),

where y denotes the data vector, π(y|η) is the data density defined at the data level, π(η, ν|θ) is the Gaussian prior density defined at the latent level, and π(θ) is the prior density for the hyperparameters defined at the hyperparameter level. When the data density π(y|η) is used for inference, it is referred to as the likelihood function and viewed as a function of η. We next describe our proposed approximate inference scheme.

Max-and-Smooth: A two-step approximate inference approach
Our proposed approximate Bayesian inference scheme (Max-and-Smooth) is based on approximating the likelihood function with a Gaussian density function (Step 1, Max), and on fitting the resulting Gaussian-Gaussian pseudo model (Step 2, Smooth). We now describe each step separately.

Step 1 (Max): Gaussian approximation of the likelihood function
We here propose two different Gaussian likelihood approximations that we subsequently exploit for fast fully Bayesian inference.
The first Gaussian approximation is based on the mode of the likelihood function, i.e., the ML estimate (hence the term "Max"), and the observed information evaluated at the ML estimate. Let L(η|y) denote the likelihood function, where L(η|y) = π(y|η), and let L̂ denote the first Gaussian approximation; then

ĉ L̂(η|y) = ĉ N(η | η̂, Σ_{ηy}) ≈ L(η|y),

where ĉ is a constant independent of η, η̂ is the ML estimate for η, i.e., it is the mode of L(η|y), and Σ_{ηy} = (−H_{ηy})^{−1}, where H_{ηy} denotes the Hessian matrix of log L(η|y) evaluated at η = η̂. Furthermore, the observed information evaluated at η̂ is written as I_{ηy} = −H_{ηy}. Due to the assumed conditional independence, the first Gaussian approximation is straightforward to evaluate because it can be computed for each group separately. More precisely, let L(η_i|y_i) = π(y_i|η_i) denote the likelihood contribution of the i-th group, and let η̂_i and I_{ηy_i} denote the corresponding ML estimate and observed information; the inverse of I_{ηy_i} is equal to the covariance matrix Σ_{ηy_i}. Now we can approximate the likelihood contribution of group i as

ĉ_i L̂(η_i|y_i) = ĉ_i N(η_i | η̂_i, Σ_{ηy_i}) ≈ L(η_i|y_i),

where ĉ_i is a constant independent of η_i. Therefore, the approximated posterior density π̂(η, ν, θ|y), based on the approximated full likelihood L̂(η|y) = ∏_{i=1}^{G} L̂(η_i|y_i), is such that π̂(η, ν, θ|y) ≈ π(η, ν, θ|y), and it is given by

π̂(η, ν, θ|y) ∝ π(θ) π(η, ν|θ) ∏_{i=1}^{G} N(η_i | η̂_i, Σ_{ηy_i}).   (3.1)

The second Gaussian approximation relies on normalizing the likelihood function such that the function d^{−1} L(η|y) integrates to one (over the domain of η), where d ∈ (0, ∞) is an appropriate normalization constant that is independent of η. Here we assume that d is finite. If this is not the case, we may either find a more adequate model parametrization or replace the likelihood function with an alternative generalized likelihood that consists of the likelihood times an extra prior density for η. Bayes' theorem ensures the finiteness of d under the generalized likelihood since it is proportional to a posterior density. This second Gaussian approximation is also designed to "maximize" the match with the true likelihood function, especially in skewed scenarios.
Then, similarly to the first Gaussian approximation, the likelihood function, or the generalized likelihood function, is approximated with a Gaussian density that has mean and covariance matrix equal to those of the normalized likelihood function. If there is a need to ensure that the mean and variance are finite, then the extra prior density can be given a finite support. We exploit again the assumed conditional independence, i.e., L(η|y) = ∏_{i=1}^{G} L(η_i|y_i), now approximating the i-th likelihood contribution as

d_i L̃(η_i|y_i) = d_i N(η_i | μ̃_{ηy_i}, Σ̃_{ηy_i}) ≈ L(η_i|y_i),

where d_i is a constant independent of η_i, L̃(η_i|y_i) is the Gaussian approximation, and the mean and the covariance matrix are

μ̃_{ηy_i} = d_i^{−1} ∫ η_i L(η_i|y_i) dη_i,   Σ̃_{ηy_i} = d_i^{−1} ∫ (η_i − μ̃_{ηy_i})(η_i − μ̃_{ηy_i})^T L(η_i|y_i) dη_i.

Similarly to (3.1), the alternative approximation to the posterior density, π̃(η, ν, θ|y), which is based on L̃(η|y) = ∏_{i=1}^{G} L̃(η_i|y_i), may be expressed as

π̃(η, ν, θ|y) ∝ π(θ) π(η, ν|θ) ∏_{i=1}^{G} N(η_i | μ̃_{ηy_i}, Σ̃_{ηy_i}).   (3.2)

Because the Gaussian density is the asymptotic form of the likelihood function under mild regularity conditions (Schervish, 1995), these two types of approximations are expected to work increasingly well when the number of data replicates grows. With a low number of replicates (i.e., fewer than 10-20 per distinct model parameter involved at the data level and further described at the latent level), a small bias might be expected, although we have found it to be relatively negligible in the settings we have considered. More details on the quality of these Gaussian approximations are given in Section 4.
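The sketch below illustrates both approximations for a single group with a one-dimensional η_i, using an exponential data density parameterized by a log-rate as a deliberately skewed toy example (our own choice, not an example used in the paper): the first approximation uses the numerically obtained ML estimate and observed information, and the second uses the mean and variance of the normalized likelihood computed by quadrature.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

rng = np.random.default_rng(2)
y_i = rng.exponential(scale=0.5, size=8)    # few replicates -> visibly skewed likelihood

def loglik(eta):
    # Exponential data density with rate exp(eta); eta lives on the whole real line.
    return y_i.size * eta - np.exp(eta) * y_i.sum()

# --- First Gaussian approximation: ML estimate and inverse observed information.
opt = minimize_scalar(lambda e: -loglik(e), bounds=(-10.0, 10.0), method="bounded")
eta_hat = opt.x
h = 1e-5                                     # simple central-difference second derivative
d2 = (loglik(eta_hat + h) - 2.0 * loglik(eta_hat) + loglik(eta_hat - h)) / h**2
var_hat = 1.0 / (-d2)                        # inverse observed information

# --- Second Gaussian approximation: mean and variance of the normalized likelihood.
def lik(eta):
    return np.exp(loglik(eta) - loglik(eta_hat))    # rescaled for numerical stability

d = quad(lik, -10, 10)[0]                           # normalizing constant
mean_tilde = quad(lambda e: e * lik(e), -10, 10)[0] / d
var_tilde = quad(lambda e: (e - mean_tilde) ** 2 * lik(e), -10, 10)[0] / d

print(f"first approx:  mean={eta_hat:.3f}, var={var_hat:.3f}")
print(f"second approx: mean={mean_tilde:.3f}, var={var_tilde:.3f}")
```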
The computational benefit of the approximations in (3.1) and (3.2) lies in the fact that π(η, ν|θ) is Gaussian with respect to (η, ν) and the functional form of both L̂(η|y) and L̃(η|y) with respect to η is proportional to a Gaussian density. As a result, the conditional posterior density of (η, ν) is Gaussian, and posterior samples can be obtained directly from this density with standard methods for sampling from Gaussian distributions. The information about η stemming from the data is quantified with reasonable accuracy in L̂(η|y) or L̃(η|y), provided that at least one of these two approximations is fairly good. This information is correctly weighted against the prior information about η, which is quantified in π(η, ν|θ). Since the inference scheme is Bayesian and the parameters are inferred simultaneously, the information about η is correctly propagated to ν and θ through π(η, ν|θ). Notice that the likelihood approximation L̂(η|y) may be faster to compute than L̃(η|y) as it is free from integrals, while L̃(η|y) is likely to provide a more accurate approximation in the case of a small number of replicates within each group. Hereafter, the Gaussian approximation based on the ML estimates and the inverse of the observed information will be referred to as the first Gaussian approximation, while the Gaussian approximation based on the mean and covariance of the normalized likelihood function will be referred to as the second Gaussian approximation.

Step 2 (Smooth): Inference for the Gaussian-Gaussian pseudo model

In the second step, the point estimates from the first step, η̂ (or μ̃), are treated as noisy observations of η with the Gaussian data density N(η̂|η, Q_{ηy}^{−1}), where Q_{ηy} = Σ_{ηy}^{−1} is the precision matrix obtained in the first step and is treated as known. The posterior density of the pseudo model is

π(η, ν, θ|η̂) ∝ π(θ) π(η, ν|θ) N(η̂|η, Q_{ηy}^{−1}).

This posterior density stems from viewing N(η̂|η, Q_{ηy}^{−1}) as a function of η and taking η̂ as a fixed quantity, which gives N(η̂|η, Q_{ηy}^{−1}) = N(η|η̂, Q_{ηy}^{−1}). Thus, the above posterior density is exactly the same as the approximated posterior density in (3.1) for the extended LGM in Section 2.2. The pseudo model is a Gaussian-Gaussian model and it is convenient to approach the inference for the unknown parameters through this model.
Samples of (x, θ), where x = (η^T, ν^T)^T, are obtained by sampling first from the marginal posterior density of θ, and then from the posterior density of x conditional on θ. The marginal posterior density of θ given η̂ is π(θ|η̂) ∝ π(θ)π(η̂|θ), and it can be represented as

π(θ|η̂) ∝ π(θ) π(x|θ) π(η̂|x, θ) / π(x|η̂, θ),

which holds for any value of x, e.g., the conditional posterior mean. The densities π(η̂|x, θ) and π(x|θ) have precision matrices Q_{ηy} and Q_x, respectively, and if Q_{ηy} and Q_x are sparse matrices, then the precision matrix of π(x|η̂, θ) is a sparse matrix; see details in the Supplementary Material. Samples from the marginal posterior density of θ can be obtained by using grid sampling if the dimension of θ is small (i.e., four or less), a Metropolis step or a Metropolis-Hastings step, or other samplers that are well suited for densities with a non-tractable form. Since the conditional posterior density of x is Gaussian, samples of x are straightforward to obtain, and if the precision matrix of π(x|η̂, θ) is sparse, the computational cost is relatively low.
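The sketch below illustrates the Smooth step on a small, dense toy problem (assumed random-walk-type prior, an assumed scale-invariant hyperprior, and simulated pseudo data η̂): it evaluates the unnormalized marginal posterior of a single precision hyperparameter via the ratio identity above, grid-samples θ, and then draws x from its Gaussian conditional posterior. In a real application the matrices would be sparse and the linear algebra would use sparse Cholesky factorizations.

```python
import numpy as np

rng = np.random.default_rng(3)

# ---- Inputs from the Max step (assumed here): per-group estimates eta_hat with
# ---- known Gaussian "noise" precisions q_i (the observed information).
n = 30
x_true = np.cumsum(rng.normal(scale=0.3, size=n))          # smooth latent signal
q = np.full(n, 5.0)                                         # observed information per group
eta_hat = x_true + rng.normal(scale=1.0 / np.sqrt(q))       # pseudo data
Q_etay = np.diag(q)

# ---- Latent prior: proper first-order random-walk-type GMRF, precision theta * C
# ---- (the small ridge keeps the prior proper; an assumption for this sketch).
D = np.diff(np.eye(n), axis=0)
C = D.T @ D + 1e-4 * np.eye(n)

def log_mvn(x, mu, Q):
    """Log density of N(mu, Q^{-1}) evaluated at x."""
    sign, logdet = np.linalg.slogdet(Q)
    r = x - mu
    return 0.5 * (logdet - r @ Q @ r - len(x) * np.log(2.0 * np.pi))

def conditional_posterior(theta):
    """Gaussian conditional posterior of x given theta and eta_hat (prior mean zero)."""
    Q_post = theta * C + Q_etay
    mu_post = np.linalg.solve(Q_post, Q_etay @ eta_hat)
    return mu_post, Q_post

def log_marginal_post(theta):
    """log pi(theta | eta_hat) up to a constant, via the ratio identity
    pi(theta) pi(x|theta) pi(eta_hat|x) / pi(x|eta_hat, theta), at x = posterior mean."""
    mu_post, Q_post = conditional_posterior(theta)
    log_prior_theta = -np.log(theta)                        # assumed scale-invariant prior
    return (log_prior_theta
            + log_mvn(mu_post, np.zeros(n), theta * C)
            + log_mvn(eta_hat, mu_post, Q_etay)
            - log_mvn(mu_post, mu_post, Q_post))

# ---- Grid sampling of theta, then exact Gaussian sampling of x | theta, eta_hat.
thetas = np.exp(np.linspace(np.log(0.1), np.log(100.0), 41))
logw = np.array([log_marginal_post(t) for t in thetas])
w = np.exp(logw - logw.max()); w /= w.sum()

theta_s = rng.choice(thetas, p=w)                           # one posterior draw of theta
mu_post, Q_post = conditional_posterior(theta_s)
L = np.linalg.cholesky(Q_post)
x_s = mu_post + np.linalg.solve(L.T, rng.normal(size=n))    # one draw of x | theta, eta_hat
print(theta_s, np.round(x_s[:5], 2))
```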
Further methodological and computational details on the inference scheme for the Gaussian-Gaussian pseudo model are presented in the Supplementary Material (Hrafnkelsson et al., 2020).

Settings
In this section, we assess the accuracy of our proposed approximate Bayesian inference scheme, Max-and-Smooth, by evaluating by simulation how close the approximate posterior density of an LGM is to its exact posterior density. In particular, in Section 4.2 we consider an LGM where the data are independent mean zero Gaussian random variables on a lattice, with spatially-varying variance at each lattice point. The approximate marginal posterior densities of the latent parameters and the hyperparameters are compared to the exact posterior densities inferred by an "exact" MCMC sampler. Moreover, as the link function in this specific example is univariate, INLA can also be applied and we include it in our experiments for comparison.
In the Supplementary Material  we explore three other models, namely a linear regression model on a lattice with Gaussian error terms, which is also applied to real data in Section 5, a linear regression model with t-distributed error terms and temporally varying coefficients, and a spatio-temporal model for Poisson counts. Our results suggest that the Gaussian likelihood approximation is accurate in finite sample sizes (even with just one single replicate in the case of the Poisson distribution with large counts), and therefore, that our Max-and-Smooth approach performs well in a rich variety of realistic settings.

Gaussian data with spatially-varying log-variance
In this section, we apply Max-and-Smooth to mean-zero Gaussian data with spatially-varying variance. We first simulate a single realization of a Gaussian Markov random field (GMRF; see Rue and Held, 2005), {x_i}, on a regular lattice of size 10 × 10, where the index i = (i_1, i_2) corresponds to the lattice point with horizontal coordinate i_1 and vertical coordinate i_2. With the mean fixed to zero, the conditional mean and variance of this GMRF at an interior point of the lattice are

E(x_i | x_{−i}, τ) = (1/4)(x_{i_1−1,i_2} + x_{i_1+1,i_2} + x_{i_1,i_2−1} + x_{i_1,i_2+1}),   Var(x_i | x_{−i}, τ) = 1/(4τ),

while the GMRF has different means and variances on the boundaries (i.e., along the edges and at the corners) of the lattice; see further details in the Supplementary Material. For the simulation, the precision parameter of the GMRF is fixed at τ = 1. At each of the N = 100 lattice points, we simulate T = 10, 20, 50 independent Gaussian variates with zero mean and log-variance x_i, i.e., y_{i,t} ∼ N(0, exp(x_i)) for t = 1, . . . , T and i = 1, . . . , N. Therefore, in this simulation example, the "groups" mentioned in Section 2.2 represent the data replicates available at each lattice point, and thus, the total number of groups is equal to the number of lattice points, i.e., G = N = 100. The goal is to exploit Max-and-Smooth to infer the latent variables x = {x_i} and the precision hyperparameter τ from the observed data y = {y_{i,t}}. The model is simple enough to also infer the latent variables and the hyperparameter using an MCMC sampler that uses the true likelihood function, allowing us to compare the approximation with an "exact" fully Bayesian procedure. The model parameters are inferred assuming that the mean at the data level is equal to zero. As the link function is here univariate (i.e., M = 1), INLA is also used to infer the model parameters for the sake of comparison (but note that this would not be possible for more complex models with multivariate link functions). The full details of this simulation study are reported in the Supplementary Material.

Figure 1 shows the posterior densities of the latent parameters x_i at three different lattice points, inferred from datasets with T = 10, 20, 50 replicates per location. We chose the three locations with the smallest, closest to zero, and largest elements of x. Each graph in Figure 1 shows four posterior densities: one based on a fully Bayesian MCMC simulation, two based on the approximate inference schemes from Section 3.3, and one based on INLA. By comparing the density curves, we see that the second Gaussian approximation (with mean and variance derived from the normalized likelihood function) is generally closer to the true posterior density than the first Gaussian approximation (with mean and variance based on the maximum likelihood estimate (MLE) and observed information). Both approximate posterior inference schemes capture the exact posterior densities very well when T is sufficiently large (T ≥ 20). In the case of T = 10, the discrepancy between the approximated and exact posterior density is relatively small for the small and median values of x_i, while it is more pronounced in the case of large x_i. The reason might be that the Gaussian approximation of the likelihood function does not properly capture the right-skewness of the exact likelihood function. When the spatial prior for the vector x has a strong effect (pulling the estimate toward zero), the right-skewness of the marginal likelihood function will show up most prominently in the case of large x_i.
The posterior densities estimated by INLA are slightly closer to the exact densities for these selected latent variables than our approximations.
The exact and approximate posterior densities of the precision hyperparameter τ of this model are shown in Figure 2 for sample sizes T = 10, 20, 50. Figure 2 shows that when T ≥ 20 there is a negligible difference between the approximated and the exact densities.

In addition to the accuracy of the approximation, we also compare the computational speed of each approach. To calculate the joint density π(y, x, τ) = π(y|x, τ)π(x|τ)π(τ), the "exact" Bayesian inference requires T evaluations of the Gaussian density per lattice point to calculate the likelihood π(y|x, τ). In contrast, under the approximate inference schemes, calculating the approximation of the likelihood requires only a single evaluation of the Gaussian density per lattice point. We should thus expect the computational cost of the exact inference to scale linearly with T, while the computational cost of Max-and-Smooth should be constant in T. Figure 3 shows the time to draw 10,000 MCMC samples of τ and x for different numbers of temporal replicates T and different numbers of grid points N. Max-and-Smooth has constant computation cost as a function of T, and does not suffer from the same slowdown with increasing T as the exact method. As might be expected, for small sample sizes and grid sizes, the speedup is only moderate. But for large grid sizes such as N = 50 × 50 and T = 100 replicates, Max-and-Smooth provides a substantial speedup over the exact MCMC scheme.
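For the model of this subsection the Max step is available in closed form: with y_{i,t} ∼ N(0, exp(x_i)), the group log-likelihood is maximized at x̂_i = log(T^{−1} Σ_t y_{i,t}^2), and the observed information equals T/2, so the variance of the Gaussian approximation is 2/T. The short sketch below applies this closed form to simulated data; for brevity it uses an i.i.d. stand-in for the latent field rather than the lattice GMRF used in the paper, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, tau = 100, 20, 1.0

# Crude stand-in for the latent log-variance field (the paper uses a GMRF on a
# 10 x 10 lattice; an i.i.d. field is used here only to keep the sketch short).
x = rng.normal(scale=1.0 / np.sqrt(tau), size=N)
y = rng.normal(scale=np.exp(x / 2.0), size=(T, N))      # y[t, i] ~ N(0, exp(x_i))

# Max step in closed form: ML estimate and inverse observed information per site.
x_hat = np.log(np.mean(y ** 2, axis=0))                 # ML estimate of x_i
var_hat = np.full(N, 2.0 / T)                           # (observed information)^{-1} = 2/T

print(np.round(x_hat[:5], 2), var_hat[0])
```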

Predictions of meteorological variables on a lattice
As alluded to in the introduction, a wide variety of Bayesian and frequentist approaches, involving different types of approximations or simplifications, have been proposed to deal with high-dimensional spatio-temporal data, often under the assumption of data being exactly Gaussian. These include low-rank approaches (Cressie and Johannesson, 2008), the predictive process (Banerjee et al., 2008), covariance tapering (Furrer et al., 2006; Anderes et al., 2013), multi-resolution models (Nychka et al., 2015; Katzfuss, 2017), hierarchical nearest-neighbor Gaussian processes (Datta et al., 2016), the Vecchia approximation (Vecchia, 1988; Stein et al., 2004; Katzfuss and Guinness, 2019), the integrated nested Laplace approximation (Rue et al., 2009; Bakka et al., 2018), and, more recently, a frequentist approach for data modeled by a generalized extreme-value (GEV) distribution (Risser et al., 2019; Russell et al., 2020). See Heaton et al. (2019) for a recent review and comparison of some of these methods. In this section, we apply our proposed Max-and-Smooth approach to analyze a moderately high-dimensional real climate dataset, assuming that the data are well described with an extended LGM.
More precisely, the dataset used in our analysis is from a seasonal climate forecasting experiment, consisting of retrospective surface temperature forecasts and their corresponding "verifying" observations. The forecast data were produced by a global climate model ensemble (28 forecasts started from perturbed initial conditions) downloaded from the ECMWF C3S Seasonal catalog (https://apps.ecmwf.int/data-catalogues/c3s-seasonal/) via the MARS API on 21 February 2018. Forecasts were started from perturbed May 1 initial conditions each year from 1993 to 2015, i.e., the sample size is T = 23 in time. Each model run predicts atmospheric conditions several months into the future. The particular forecast target analyzed for this paper is surface air temperature on a 1-by-1 degree latitude-longitude grid over a rectangular region with corners 20W/40N and 40E/60N (N = 1281 grid points in total, covering most of Europe). Here each grid point forms a group, so the total number of groups is G = N. The forecasts were averaged over the 28 ensemble members, and over the Boreal summer period June/July/August, yielding a single scalar prediction per grid point per year. Since these forecasts were initialized in May and predict June-August climate, the forecast lead time is 1-3 months. Verifying observations are from the ERA-Interim reanalysis dataset (Dee et al., 2011), which is available on the same 1-by-1 degree grid as the forecasts and can also be downloaded from the ECMWF database. Observation data were averaged over the same June-August period as the forecasts.
Due to structural errors and missing physical processes in the climate model, and due to the chaotic nature of atmospheric dynamics, numerical model forecasts have systematic biases in the forecast mean. Furthermore, for forecasts of the climate system on seasonal time scales, the correlation between forecasts and verifying observations tends to be low. The biases in the numerical model forecasts can be partly corrected through a linear regression of the observations on forecasts. The adjustment of model forecasts by linear regression, known as model output statistics (MOS), has a long tradition in the weather forecasting community (Glahn and Lowry, 1972;Glahn et al., 2009), and is part of an active area of research known as forecast recalibration or forecast post-processing (e.g., Siegert and Stephenson, 2019). In this section, we will infer the post-processing parameters on a spatial grid in an extended LGM framework, using spatial priors for the regression coefficients to reduce their estimation uncertainty, and ultimately improve the predictive skill of the recalibrated model forecasts.
The statistical model linking the observed climate y_{i,t} at time t and location i to the climate model forecast f_{i,t} for the same time and location is assumed to be

y_{i,t} = α_i + β_i (f_{i,t} − f̄_i) + ε_{i,t},   (5.1)

where f̄_i denotes the local mean forecast over the data period, and the residuals ε_{i,t} are independent Gaussian variates with mean zero and variance exp(τ_i). It is common to estimate the regression parameters η = (α^T, β^T, τ^T)^T = (η_α^T, η_β^T, η_τ^T)^T individually for each grid point, using maximum likelihood estimation. However, this may lead to high estimation variability, due to the limited number of samples per grid point.
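As a point of reference for the Max step at a single grid point, the sketch below computes the ML estimates of (α_i, β_i, τ_i) and the corresponding observed information for a model of the form (5.1); the data are simulated, and the block-diagonal form of the information matrix follows from the orthogonality of the OLS residuals and the regressors at the ML estimate. This is an illustrative sketch, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 23                               # years of forecasts/observations (as in the paper)

# Illustrative data for ONE grid point (in practice this runs over all N grid points).
f = rng.normal(size=T)               # ensemble-mean forecasts
y = 0.5 + 0.8 * (f - f.mean()) + rng.normal(scale=0.7, size=T)

def max_step_one_gridpoint(y, f):
    """ML estimates (alpha, beta, tau) and covariance of the Gaussian approximation
    for y_t = alpha + beta (f_t - fbar) + eps_t, eps_t ~ N(0, exp(tau))."""
    X = np.column_stack([np.ones_like(f), f - f.mean()])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)         # OLS = ML for (alpha, beta)
    resid = y - X @ coef
    tau_hat = np.log(np.mean(resid ** 2))                # ML estimate of the log-variance
    eta_hat = np.array([coef[0], coef[1], tau_hat])

    # Observed information is block diagonal at the ML estimate:
    # exp(-tau_hat) X^T X for (alpha, beta), and T/2 for tau.
    info = np.zeros((3, 3))
    info[:2, :2] = np.exp(-tau_hat) * X.T @ X
    info[2, 2] = len(y) / 2.0
    return eta_hat, np.linalg.inv(info)

eta_hat, Sigma_hat = max_step_one_gridpoint(y, f)
print(np.round(eta_hat, 2))
print(np.round(Sigma_hat, 3))
```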
To exploit the spatial structure in the data, and to borrow strength from data at neighboring grid points when estimating the regression parameters η, we use a spatial prior distribution for the spatial fields of regression parameters as outlined in Section 2.2 of the Supplementary Material, and we exploit Max-and-Smooth for Bayesian inference. The spatial field of each regression parameter is decomposed additively into a spatially correlated component u and an unstructured component ε, i.e.,

α = η_α = u_α + ε_α,

and respectively for β = η_β and τ = η_τ. We model the structured term u as a first-order intrinsic Gaussian Markov random field on a regular lattice (Rue and Held, 2005, Section 3.3.2, pp. 104-108), and the unstructured component ε as a mean-zero Gaussian process with a diagonal covariance matrix. All latent processes u and ε are mutually independent.
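A minimal sketch of the structure matrix of a first-order intrinsic GMRF on a regular lattice (diagonal entries equal to the number of neighbours, −1 for each neighbouring pair), which, scaled by a precision hyperparameter, gives the prior precision of a structured component. The helper function and the chosen precision value are ours; the 21 × 61 lattice merely matches the size of the forecast grid described above.

```python
import numpy as np
import scipy.sparse as sp

def igmrf1_structure(nrow, ncol):
    """Structure matrix R of a first-order intrinsic GMRF on a regular lattice
    (4-nearest-neighbour differences); the prior precision is tau * R."""
    def idx(r, c):
        return r * ncol + c
    rows, cols, vals = [], [], []
    for r in range(nrow):
        for c in range(ncol):
            nbrs = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < nrow and 0 <= c + dc < ncol]
            rows.append(idx(r, c)); cols.append(idx(r, c)); vals.append(float(len(nbrs)))
            for rr, cc in nbrs:
                rows.append(idx(r, c)); cols.append(idx(rr, cc)); vals.append(-1.0)
    n = nrow * ncol
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

R = igmrf1_structure(21, 61)          # e.g., 21 latitudes x 61 longitudes = 1281 points
tau_u = 10.0                          # illustrative precision hyperparameter
Q_u = tau_u * R
print(Q_u.shape, Q_u.nnz)
```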
We begin by exploring the marginal posterior distributions of the precision hyperparameters θ obtained using Max-and-Smooth (recall Section 3). Preliminary studies suggested that the hyperparameters are nearly uncorrelated under their joint posterior, which allows us to explore their posterior distributions individually. We evaluated each unnormalized marginal posterior at 41 points that are equidistant on the log scale and centered around the posterior mode, spanning ±4 posterior standard deviations (estimated via a Laplace approximation from the numerical second derivative).
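The sketch below mimics this construction for a single hyperparameter: it locates the posterior mode numerically, obtains a Laplace-approximation standard deviation from the numerical second derivative, lays out an equidistant grid spanning ±4 standard deviations, and normalizes the resulting weights for grid sampling. The toy log-posterior and all names are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def hyperparameter_grid(log_post, n_points=41, half_width=4.0, x0=0.0):
    """Evaluate an unnormalized log marginal posterior of a (log-)hyperparameter on a grid
    centred at the posterior mode and spanning +/- half_width posterior standard
    deviations, with the standard deviation from a Laplace approximation."""
    mode = minimize_scalar(lambda z: -log_post(z), bounds=(x0 - 10, x0 + 10),
                           method="bounded").x
    h = 1e-4
    d2 = (log_post(mode + h) - 2 * log_post(mode) + log_post(mode - h)) / h ** 2
    sd = 1.0 / np.sqrt(-d2)
    grid = np.linspace(mode - half_width * sd, mode + half_width * sd, n_points)
    logw = np.array([log_post(z) for z in grid])
    w = np.exp(logw - logw.max()); w /= w.sum()
    return grid, w

# Toy unnormalized log posterior of a log-precision (an assumption for illustration).
log_post = lambda z: -0.5 * (z - 1.3) ** 2 / 0.2 ** 2
grid, w = hyperparameter_grid(log_post)
rng = np.random.default_rng(6)
samples = rng.choice(grid, size=1000, p=w)        # grid sampling of the hyperparameter
print(grid[0], grid[-1], samples.mean())
```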
The (normalized) marginal posteriors of the hyperparameters are shown in Figure 4. For better interpretability, the hyperparameters were transformed from precisions τ_{v,l} to standard deviations σ_{v,l} = 1/√τ_{v,l} for v ∈ {u, ε} and l ∈ {α, β, τ}. The standard deviations of the unstructured components σ_{ε,l} are small compared to the spatial variability seen in the ML estimates of the regression coefficients, and compared to the standard deviation parameters of the spatial components σ_{u,l}. This suggests that a simpler model, without an unstructured component, might be fitted to the data. The standard deviation corresponding to the spatial effect of the intercept, σ_{u,α}, is large compared to σ_{u,β} and σ_{u,τ}. Thus, there is greater spatial variability in the intercept than in the slope and log-variance. By comparing the posterior densities with their PC prior distributions (not shown), we found no evidence that the posterior is unduly influenced by the prior, indicating that the amount of data available is sufficient to properly constrain the model's hyperparameters.

Results for the latent fields, obtained with the first Gaussian approximation (based on the ML estimates and observed information), are shown in Figure 5; the corresponding plots for the second Gaussian approximation (mean and variance of the normalized likelihood) are almost indistinguishable. The intercept parameters of the regression model in the vector α are not smoothed very much compared to the corresponding ML estimates. This is because the maximum likelihood estimates of α are generally well-constrained by the data, as indicated by the low sampling uncertainty derived from the observed information matrix. In contrast, β and τ exhibit considerable smoothing. The pointwise posterior standard deviations of the latent fields (not shown) are smaller than the sampling standard deviations of the maximum likelihood estimates: they are on average about 5% smaller for α and around 60% smaller for β and τ.
We then test the performance of the spatial regression model using cross-validated predictions of surface temperatures calculated using equation (5.1), integrated over the posterior distributions of α, β and τ. In practice, forecast and observation data are typically available at all locations in the historical dataset, but no data are available for the present year, for which bias-corrected forecasts are required. We mimic this setting by applying "leave-one-year-out" cross-validation: we leave out all forecasts and observations from one year, fit the regression parameters using data from the remaining years, and then use the fitted parameters to forecast the temperature at all grid points for the left-out year. For each year and grid point, we generated N_s = 1000 samples from the posterior predictive distribution by repeating the following algorithm N_s times:

• Sample hyperparameters θ^(s) independently from their marginal posteriors, i.e., π(τ_{u,l}|y_{−t}) and π(τ_{ε,l}|y_{−t}), where l ∈ {α, β, τ} and y_{−t} denotes the forecast and observation data with year t left out of the training dataset. Sampling of the hyperparameters is done by grid sampling, i.e., after evaluating the marginal density of each hyperparameter at 21 equally spaced points, one of the 21 values is sampled with probability proportional to the marginal density.

• Sample the latent fields, including the regression parameters η^(s), from their Gaussian conditional posterior density given θ^(s) and the training data y_{−t}.
• Using the sampled regression parameters in η^(s), simulate a spatial field of responses y_t from the regression model conditional on η_α^(s), η_β^(s), and η_τ^(s), using the covariates f_t from the left-out year t.
The procedure is repeated leaving each year out in turn, ultimately resulting in N_s out-of-sample posterior predictive samples per grid point for each year. The predictive samples are compared to the verifying observations by the mean squared error (MSE) of the posterior predictive mean, by the continuous ranked probability score (CRPS, see Winkler, 1969; Gneiting and Raftery, 2007) of the posterior predictive distribution, by the average width of the central 95% posterior prediction interval (W95), and by the average coverage frequencies of the posterior predictive 5, 50 and 95-percentiles (COV05, COV50, COV95, respectively). The CRPS is preferred here because it can be evaluated solely from samples drawn from the posterior predictive distribution, without having to know the predictive distribution in closed form, using the following relation:

CRPS ≈ (1/N_s) Σ_{s=1}^{N_s} |x_s − y| − 1/(2 N_s^2) Σ_{s=1}^{N_s} Σ_{s'=1}^{N_s} |x_s − x_{s'}|,

where x_1, . . . , x_{N_s} are samples from the posterior predictive distribution, and y is the corresponding observed temperature value (see Gneiting and Raftery, 2007, Equation (20)). Lower CRPS values indicate better forecasts. The different measures of sharpness, reliability and accuracy of the posterior predictive distribution are compared to two benchmark predictions. We draw the same number of samples from the classical predictive distributions based on the pointwise ML estimates for the linear regression models. This forecast scheme is denoted MLE and is used to characterize the performance of the forecast post-processing method when spatial correlation in the regression parameters is ignored. An even simpler benchmark model is constructed by sampling 1000 times from a Gaussian distribution which has the mean and standard deviation (taken over time) of the observations y_i, calculated separately at each grid point i. This benchmark prediction is almost constant between the years, with slight differences only due to the leave-one-out procedure. This forecast scheme is denoted CLIM (for climatology) and quantifies the average predictability of atmospheric surface temperature if no further forecast information is available. The prediction scheme CLIM is thus used to quantify the merit of the forecast information available from the numerical model. Finally, we use the acronyms SPAT1 and SPAT2 to denote the posterior predictive distributions derived from the extended LGM with a spatial prior, inferred using the first and second Gaussian approximations, respectively; due to computational limitations we have not implemented an exact Bayesian inference using an MCMC sampler. Table 1 shows that spatially smoothing the regression estimates using a hierarchical prior leads to improvements in the mean squared error of the predictive mean and in the CRPS of the predictive distribution. Simultaneously, forecast uncertainty is reduced by our method, as indicated by the narrower average prediction intervals. Coverage frequencies are quite good overall.
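The CRPS estimate above requires only posterior predictive draws. A minimal sample-based implementation (with made-up predictive draws and a made-up observation, purely for illustration) might look as follows.

```python
import numpy as np

def crps_from_samples(x, y):
    """Sample-based CRPS estimate: mean |x_s - y| - 0.5 * mean |x_s - x_s'|,
    where x are draws from the posterior predictive distribution and y is the
    verifying observation (lower is better)."""
    x = np.asarray(x)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

rng = np.random.default_rng(7)
x = rng.normal(loc=18.2, scale=1.1, size=1000)   # illustrative predictive draws (deg C)
y_obs = 19.0                                      # illustrative verifying observation
print(round(crps_from_samples(x, y_obs), 3))
```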
The seemingly small improvements in MSE and CRPS of SPAT1/2 over MLE might suggest that the benefit of our approach is negligible. However, the magnitude of this improvement has to be compared to the improvement of MLE over CLIM. Informally speaking, the improvement of MLE over CLIM quantifies the advantage of a forecaster who has access to the numerical model forecast compared to a forecaster who only knows the climatological distribution. Due to the inherent unpredictability of the atmosphere on long time scales, this advantage tends to be small in seasonal climate prediction. The improvement of SPAT1/2 over MLE thus has to be judged relative to the improvement of MLE over CLIM. The improvement of SPAT1 and SPAT2 over MLE relative to the improvement of MLE over CLIM is about 20% for the MSE, and about 25% for the CRPS. Both improvements are substantial. A basic uncertainty assessment of the CRPS differences is reported in the Supplementary Material, suggesting that they are indeed significant.
We note that a leave-one-year-out cross-validation is not the only method that can be applied to illustrate the benefits of a spatial model for the forecast post-processing problem. An alternative cross-validation method would be to remove all observed data from a number of spatially contiguous blocks of locations or randomly sampled locations, and infer the post-processing parameters at these locations from the spatial model. We reserve such an analysis for future studies. Finally, it should be noted that evaluating the marginal posterior distributions of the 6 hyperparameters at 21 values each, drawing 1000 samples of the spatial fields of the regression parameters from their posterior distributions, and calculating 1000 posterior predictive samples of atmospheric temperature over the entire grid takes less than 1 minute on a standard laptop.

Discussion
In this paper, we have introduced Max-and-Smooth, a new two-step approximate posterior sampling scheme for extended LGMs with independent data replicates. Extended LGMs include a diverse range of important applications, such as regression models with spatial or temporal effects in regression coefficients and error variances. Our proposed Max-and-Smooth approach is fast and well-suited for this class of LGMs with a complex and high-dimensional latent structure and multivariate link functions. The first step of the inference scheme involves approximating the likelihood function around its mode by its asymptotic Gaussian density, and we also explored another Gaussian approximation which uses the mean and variance of the normalized likelihood. We suggest that the adequacy of the Gaussian approximation be explored case by case. The second step involves Bayesian inference for the latent parameters and the hyperparameters of the model such that the uncertainty from the first step is correctly propagated into the uncertainty of the posterior distributions. Max-and-Smooth contrasts with the INLA software, which is designed for LGMs with a univariate link function but not for extended LGMs.
Our approach scales well with the dimensions of the multivariate link function and the latent parameter vector, provided that the precision matrices of the Gaussian priors at the latent level are sparse. This is important for high-dimensional applications, e.g., models with high-dimensional temporal and spatial effects. Additionally, by exploiting the conjugacy of the Gaussian-Gaussian pseudo model (similar to the INLA method), posterior samples from the marginal density of the hyperparameters can be generated efficiently with low sample autocorrelation. For our approach to be successful, we need the Gaussian likelihood approximation in the first step of our two-step inference approach to be accurate. It turns out that models with non-Gaussian likelihood functions can often be aggregated over groups of independently replicated observations, such that the joint likelihood function of the group is close to a multivariate Gaussian likelihood. In our paper, we have provided evidence that Max-and-Smooth indeed provides accurate and reliable inference in various settings. The computational cost of our proposed inference scheme is close to being insensitive to the number of independent data replicates. In the case of a high-dimensional latent vector modeled with sparse precision matrices and a large number of independent data replicates, the speedup is substantial in comparison to a Markov chain Monte Carlo inference scheme which samples from the exact posterior density. A case study comparing Max-and-Smooth with INLA (in a simplified scenario with a univariate link function) showed that INLA seems to do better at approximating marginal posteriors of latent variables, whereas Max-and-Smooth produces better approximations of marginal posteriors of hyperparameters. The observed differences between posterior approximations are minor, and the extent to which the results generalize to different applications will be the subject of future research.
Max-and-Smooth is designed for a rich class of flexible models. It is straightforward to implement, and when sparse precision matrices are used at the latent level, high-dimensional latent vectors can be handled. This paves the way for an additional set of complex models that are feasible for the analysis of complex and high-dimensional datasets. Future research involves exploring ways to drop the conditional independence assumption at the observation level, and computing the Gaussian approximation based on the normalized likelihood when analytical results for the mean and variance are not available and when the dimension of the parameter vector within groups is high.

Supplementary Material
Supplementary material for "Max-and-Smooth: a two-step approach for approximate Bayesian inference in latent Gaussian models" (DOI: 10.1214/20-BA1219SUPP; .pdf). The Supplementary Material presents further methodological and computational details on Max-and-Smooth, and demonstrates applications of Max-and-Smooth to data simulated from statistical models.