Spatial 3D Matérn priors for fast whole-brain fMRI analysis

Bayesian whole-brain functional magnetic resonance imaging (fMRI) analysis with three-dimensional spatial smoothing priors has been shown to produce state-of-the-art activity maps without pre-smoothing the data. However, the proposed inference algorithms are computationally demanding, and the proposed spatial priors have several less appealing properties, such as being improper and having infinite spatial range. Our paper proposes a statistical inference framework for fMRI analysis based on the class of Matérn covariance functions. The framework uses the Gaussian Markov random field (GMRF) representation of Matérn fields via the stochastic partial differential equation (SPDE) approach of Lindgren et al. (2011). This allows for more flexible and interpretable spatial priors, while maintaining the sparsity required for fast inference in the high-dimensional whole-brain setting. We develop an accelerated stochastic gradient descent (SGD) optimization algorithm for empirical Bayes (EB) inference of the spatial hyperparameters. Conditional on the inferred hyperparameters, we make a fully Bayesian treatment of the main parameters of interest, that is, the brain activity coefficients. We apply the Matérn prior to both experimental and simulated task-fMRI data and clearly demonstrate that this is a more reasonable choice than the previously used priors, using prior simulation, cross-validation and visual inspection of the resulting activation maps. Additionally, to illustrate the potential of the SPDE formulation, we derive an anisotropic version of the 3D Matérn prior.


Introduction
Functional magnetic resonance imaging (fMRI) is a noninvasive technique for making inferences about the location and magnitude of neuronal activity in the living human brain. fMRI has provided neuroscientists with countless new insights on how the brain operates (Lindquist, 2008). By observing changes in blood oxygenation in a subject during an experiment, a researcher can apply statistical methods such as the general linear model (GLM) (Friston et al., 1995) to draw conclusions regarding task-related brain activations.
fMRI data can be seen as a sequence of three-dimensional images collected over time, where each image can be divided into a large number of voxels. A problem with the GLM approach and many of its successors is that the model is mass-univariate, that is, it analyzes each voxel separately and ignores the inherent spatial dependencies between neighboring brain regions. Normally, this is accounted for by pre-smoothing the data and applying post-hoc correction for multiple hypothesis testing, but this strategy is unsatisfactory from a modeling perspective and has been shown to lead to spurious results in many cases (Eklund et al., 2016).
One of the earliest Bayesian spatial smoothing priors for neuroimaging is the two-dimensional prior for slice-wise analysis proposed by Penny et al. (2005). The spatial prior on the activity coefficients reflects the prior knowledge that activated regions are spatially contiguous and locally homogeneous. Penny et al. (2005) use the variational Bayes (VB) approach to approximate the posterior distribution of the activations. Sidén et al. (2017) extend that prior to the 3D case and propose fast Markov chain Monte Carlo (MCMC) and more accurate VB approaches to inference.
In this paper, we show how the spatial priors used in the previous articles can be seen as special cases of the Gaussian Markov random field (GMRF) representation of Gaussian fields of the Matérn class, using the stochastic partial differential equation (SPDE) approach presented in Lindgren et al. (2011). The Matérn family of covariance functions, attributed to Matérn (1960) and popularized by Handcock and Stein (1993), is seeing increasing use in spatial statistical modeling. It is also a standard choice for Gaussian process (GP) priors in machine learning (Rasmussen and Williams, 2006). In his practical suggestions for prediction of spatial data, Stein (1999) notes that the properties of a spatial field depend strongly on the local behavior of the field, and that this behavior is unknown in practice and must be estimated from the data. Moreover, some commonly used covariance functions, for example the Gaussian (also known as the squared exponential), do not provide enough flexibility with regard to this local behavior, and Stein summarizes his suggestions with "Use the Matérn model". Using the Matérn prior on large-scale 3D data such as fMRI data is computationally challenging, however, in particular with MCMC. We present a fast Bayesian inference framework that makes Stein's appeal feasible in practical work.
Even though the empirical spatial auto-correlation functions of raw fMRI data seem more fat-tailed than a Gaussian (Eklund et al., 2016; Cox et al., 2017), standard practice has traditionally been to pre-smooth data using a Gaussian kernel, with reference to the matched filter theorem. The Gaussian covariance function has also been used directly in the model as a spatial GP prior (Groves et al., 2009), but the standard GP formulation results in a dense covariance matrix which becomes too computationally expensive to invert even with only a few thousand voxels. For this reason, much work on spatial modeling of fMRI data has used GMRFs instead, see for example Gössl et al. (2001); Woolrich et al. (2004); Penny et al. (2005); Harrison and Green (2010); Sidén et al. (2017). GMRFs have sparse precision matrices, which makes them computationally very fast to use, but they do not always correspond to simple covariance functions, especially the intrinsic GMRFs often used as priors, whose precision matrices are not invertible (Rue and Held, 2005). A different branch of Bayesian spatial models for fMRI has treated the selection of active voxels as a variable selection problem, modeling the spatial dependence between the activity indicators rather than between the activity coefficients (see, among others, Smith and Fahrmeir, 2007; Vincent et al., 2010; Lee et al., 2014; Zhang et al., 2014; Bezener et al., 2018). These articles mostly use Ising priors or GMRF priors squashed through a cumulative distribution function (CDF) for the indicator dependence, which also gives sparsity. However, these priors are rarely defined over the whole brain, but are applied independently to parcels or slices, probably due to computational costs. The SPDE approach of Lindgren et al. (2011) has been applied to fMRI data before, slice-wise by Yue et al. (2014), and on the sphere by Mejia et al. (2019) after transforming the volumetric data to the cortical surface.
In both cases integrated nested Laplace approximations (INLA) (Rue et al., 2009) were used for approximating the posterior, which is efficient but presently cumbersome to apply directly to volumetric fMRI data, as the R-INLA R-package currently lacks support for three-dimensional data.
Our paper makes a number of contributions. First, we develop a fast Bayesian inference algorithm that allows us to use spatial three-dimensional whole-brain priors of the Matérn class on the activity coefficients, using the SPDE approach. The algorithm applies empirical Bayes (EB) to optimize the hyperparameters of the spatial prior and the parameters of the auto-regressive noise model, using an accelerated version of stochastic gradient descent (SGD). The link to the Matérn covariance function gives the spatial hyperparameters nice interpretations, in terms of range and marginal variance of the corresponding Gaussian field. Given the maximum a posteriori (MAP) values of the optimized parameters, we make a fully Bayesian treatment of the main parameters of interest, that is, the activity coefficients, and compute brain activity posterior probability maps (PPMs). The convergence of the optimization algorithm is established and the resulting EB posterior is compared to the exact MCMC posterior for the prior used in Sidén et al. (2017), showing the results to be very similar. Second, we apply the Matérn prior with two levels of smoothness to both real and simulated fMRI datasets, and compare with the prior used in Sidén et al. (2017) by observing differences in the PPMs, and by examining the plausibility of new random samples of the different spatial priors. Third, we also develop a framework using cross-validation (CV) on left-out voxels, for comparing the predictive performance of the different spatial priors, both in terms of point predictions and in terms of uncertainty using proper scoring rules (Gneiting and Raftery, 2007). The CV framework aims at evaluating the spatial priors in isolation, neglecting the uncertainty from nuisance regressors, as for example head motion, which might otherwise bias the analysis. 
Collectively, our results strongly suggest that the higher level of smoothness is more reasonable for fMRI data, and also indicate that the M(2) prior (see the definition in Section 2.2) is more sensible than its intrinsic counterpart. Fourth, we suggest an anisotropic version of the Matérn prior, derived through an alternative SPDE. The anisotropic prior allows the spatial dependence to vary in the x-, y- and z-direction; we choose a parameterization such that the new parameters can be shown not to affect the marginal variance of the field, and we can choose priors for the new parameters that are symmetric with respect to the axes.
We begin by reviewing the model of Penny et al. (2005); Sidén et al. (2017) and introducing the different spatial priors and associated hyperpriors in Section 2. In Section 3, we derive the optimization algorithm for the EB method, and we describe the PPM computation. Experimental and simulation results are shown in Section 4. Section 5 contains conclusions and recommendations for future work. The more mathematical details of the model and priors, the derivation of the gradient and approximate Hessian used in the SGD optimization algorithm, and the CV framework are given in the supplementary material. Numbered references to the supplementary material are preceded with an S, as in for example Eq. (S6.1).
The new methods in this article have been implemented and added to the BFAST3D extension to the SPM software, available at http://www.fil.ion.ucl.ac.uk/spm/ext/#BFAST3D.

Model and priors
The model can be divided into three parts: (i) the measurement model, which consists of a regression model that relates the observed blood oxygen level dependent (BOLD) signal in each voxel to the experimental paradigm and nuisance covariates, and a temporal noise model (Section 2.1), (ii) the spatial prior that models the dependence of the regression parameters across voxels (Sections 2.2 and 2.3), and (iii) the priors on the spatial hyperparameters and noise model parameters (Sections 2.4 and 2.5).

Measurement model
The single-subject fMRI data are collected in a T × N matrix Y, with T denoting the number of volumes collected over time and N the number of voxels. The experimental paradigm is represented by the T × K design matrix X, with K regressors representing, for example, the hemodynamic response function (HRF) convolved with the binary time series of task events. The model can be written as Y = XW + E, where W is a K × N matrix of regression coefficients and E is a T × N matrix of error terms. We will also work with the equivalent vectorized formulation y = X̃β + e, where y = vec(Y^T), X̃ = X ⊗ I_N, β = vec(W^T) and e = vec(E^T). The error terms are modeled as Gaussian and independent across voxels, possibly following voxel-specific Pth order AR models, described by the N × 1 vector λ of noise precisions and the P × N matrix A of AR parameters. For ease of presentation we will in what follows only consider the special case P = 0, that is, error terms that are independent across both time and voxels, and treat the more general case in Section S6.
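The equivalence between the matrix form Y = XW + E and the vectorized form y = (X ⊗ I_N)β + e can be checked numerically. A minimal NumPy sketch, with made-up toy dimensions (not from the paper's data):

```python
import numpy as np

# Toy dimensions (hypothetical, for illustration only).
T, K, N = 40, 3, 5
rng = np.random.default_rng(0)

X = rng.standard_normal((T, K))   # design matrix (e.g. HRF-convolved regressors)
W = rng.standard_normal((K, N))   # regression coefficients, one column per voxel
E = rng.standard_normal((T, N))   # i.i.d. Gaussian errors (the P = 0 special case)
Y = X @ W + E                     # matrix form: Y = XW + E

# Equivalent vectorized form y = (X kron I_N) beta + e, with
# y = vec(Y^T), beta = vec(W^T), e = vec(E^T) (vec stacks columns).
y = Y.T.flatten(order='F')
beta = W.T.flatten(order='F')
e = E.T.flatten(order='F')
X_til = np.kron(X, np.eye(N))     # TN x KN expanded design matrix
```

The identity vec(Y^T) = (X ⊗ I_N)vec(W^T) + vec(E^T) follows from vec(AXB) = (B^T ⊗ A)vec(X) applied to Y^T = W^T X^T.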
We can divide our parameters into three groups: β, θ n and θ s . Here, β describes the brain activity coefficients which we are mainly interested in, θ n = {λ, A} are parameters of the noise model, and θ s are spatial hyperparameters that will be introduced in the next subsection.

Spatial prior on activations
We assume spatial, three-dimensional GMRF priors (Rue and Held, 2005; Sidén et al., 2017) for the regression coefficients, which are independent across regressors; that is, we assume β|θ_s ∼ N(0, Q^{−1}). Here Q = blkdiag_{k∈{1,...,K}}[Q_k], that is, a KN × KN block diagonal matrix with the N × N matrix Q_k as the kth block, and θ_s = {θ_{s,1}, ..., θ_{s,K}} are the spatial hyperparameters that the different Q_k, and thereby Q, depend on. Q_k may be chosen differently for different k.
In this paper, we will construct the different Q_k using the SPDE approach (Lindgren et al., 2011), which allows for sparse GMRF representations of Matérn fields. An overview of the four priors we will focus on can be seen in Table 1; they are described in more detail below. Sidén et al. (2017) focus on the unweighted graph Laplacian prior Q_k = τ²G, which we refer to here as the ICAR(1) (intrinsic conditional autoregression) prior. The matrix G is defined by

G_{i,j} = n_i if i = j,  G_{i,j} = −1 if i ∼ j,  G_{i,j} = 0 otherwise,  (2.1)

where i ∼ j means that i and j are adjacent voxels and n_i is the number of voxels adjacent to voxel i. The ICAR(1) prior can be derived from the local assumption that x_i − x_j ∼ N(0, (τ²)^{−1}), for all unordered pairs of adjacent voxels (i, j), where x denotes the GMRF (Rue and Held, 2005). Thus, one can see that τ² controls how much the field can vary between neighboring voxels, where large values of τ² enforce a field that is spatially smooth. The ICAR(1) prior is the default in the SPM software for fMRI analysis. The ICAR(2) prior is a smoother alternative, corresponding to a similar local assumption for the second-order differences, and has been used earlier for fMRI analysis in 2D (Penny et al., 2005). The ICAR priors can be extended by adding κ² to the diagonal of G as in the right hand column of Table 1, and when κ > 0 we refer to these as M(α) (Matérn) priors. The reason for this is the SPDE link established by Lindgren et al. (2011). For example, the M(2) prior can be seen as the distribution of the solution u to

τ(κ²I + G)u = z,  z ∼ N(0, I_N),  (2.2)

which can in turn be seen as a numerical finite difference approximation to the SPDE

(κ² − ∆)^{α/2} τu(s) = W(s)  (2.3)

when α = 2. Here s denotes a point in space, α is a smoothness parameter, ∆ is the Laplace operator, and W(s) is spatial white noise. Also define the smoothness parameter ν = α − d/2, where d is the dimension of the domain. For ν > 0 and κ > 0, it can be shown that a Gaussian field u(s) is a solution to the SPDE in Eq.
(2.3), when it has the Matérn covariance function (Whittle, 1954, 1963)

C(δ) = σ² (2^{1−ν}/Γ(ν)) (κδ)^ν K_ν(κδ),  (2.4)

where δ is the Euclidean distance between two points in R^d, K_ν is the modified Bessel function of the second kind and

σ² = Γ(ν) / (Γ(α)(4π)^{d/2} κ^{2ν} τ²)  (2.5)

is the marginal variance of the field u(s). As d = 3 in our case, for α = 2 we have ν = 1/2, which is a special case where the Matérn covariance function is the same as the exponential covariance function. In this paper we also consider the SPDE when κ = 0 or ν = −1/2, in which case the solutions no longer have Matérn covariance, but are still well-defined random measures, and we will refer to them as generalized Matérn fields.
We also define K = κ²I + G, in which case the solution to Eq. (2.2) is u ∼ N(0, (τ²KK)^{−1}), which is largely the same as the solution obtained in Lindgren et al. (2011) using the finite-element method when the triangle basis points are placed at the voxel locations, apart from some minor differences at the boundary. We use the same definition as Lindgren et al. (2011) for the range, ρ = √(8ν)/κ, for κ > 0 and ν > 0, which is the distance at which two points in the field have correlation near 0.13. Similar simple discrete solutions of the SPDE are available also for other values of α. In particular, for α = 1 we have u ∼ N(0, (τ²K)^{−1}). Extensions to higher integer values of α, such as α = 3, 4, ..., are straightforward in theory (Lindgren et al., 2011), but will result in less sparse precision matrices Q_k and thereby longer computing times, and more involved gradient expressions for the parameter optimization in Section 3.1.
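The exponential special case (ν = 1/2) and the range definition ρ = √(8ν)/κ can both be verified numerically from the Matérn correlation function; a small sketch using SciPy's Bessel function (κ chosen arbitrarily):

```python
import numpy as np
from scipy.special import gamma, kv

def matern_corr(delta, kappa, nu):
    """Matérn correlation 2^(1-nu)/Gamma(nu) * (kappa*delta)^nu * K_nu(kappa*delta),
    for delta > 0 (the limit at delta = 0 is 1)."""
    kd = kappa * np.asarray(delta, dtype=float)
    return 2.0 ** (1.0 - nu) / gamma(nu) * kd ** nu * kv(nu, kd)

kappa = 0.7
delta = np.linspace(0.1, 10.0, 50)

# For d = 3 and alpha = 2 we get nu = 1/2: the exponential correlation exp(-kappa*delta).
nu = 0.5
corr = matern_corr(delta, kappa, nu)

# Range rho = sqrt(8*nu)/kappa: the correlation at delta = rho is near 0.13.
rho = np.sqrt(8.0 * nu) / kappa
corr_at_range = matern_corr(rho, kappa, nu)
```

For ν = 1/2 the correlation at the range is exp(−2) ≈ 0.135, matching the "near 0.13" convention.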
For each choice of Q_k, we have spatial hyperparameters θ_{s,k} = {τ²_k, κ²_k}, which will normally be estimated from data. For regressors not related to the brain activity, that is, head motion regressors and voxel intercepts, we do not use a spatial prior, but instead a global shrinkage (GS) prior Q_k = τ²_k I. We could here infer τ²_k from the data, but will normally fix it to some small value, for example τ²_k = 10^{−12}, which gives a non-informative prior that provides some numerical stability.
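To make the construction concrete, the graph Laplacian G of Eq. (2.1) for a 3D voxel grid with 6-neighbour adjacency, and the resulting ICAR(1) and M(2) precision matrices, can be assembled as sparse matrices. This is a sketch on a tiny toy grid, not the paper's implementation:

```python
import numpy as np
import scipy.sparse as sp

def graph_laplacian_3d(nx, ny, nz):
    """Unweighted graph Laplacian G for a 3D grid with 6-neighbour adjacency:
    G[i, i] = n_i (number of neighbours of voxel i), G[i, j] = -1 if i ~ j, else 0."""
    N = nx * ny * nz
    idx = np.arange(N).reshape(nx, ny, nz)
    rows, cols = [], []
    for axis in range(3):
        # Pair each voxel with its successor along this axis.
        a = np.take(idx, np.arange(idx.shape[axis] - 1), axis=axis).ravel()
        b = np.take(idx, np.arange(1, idx.shape[axis]), axis=axis).ravel()
        rows += [a, b]
        cols += [b, a]
    rows, cols = np.concatenate(rows), np.concatenate(cols)
    A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(N, N))
    return sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A

G = graph_laplacian_3d(4, 4, 4)
N = G.shape[0]

tau2, kappa2 = 1.0, 0.5              # hypothetical hyperparameter values
K = kappa2 * sp.identity(N) + G
Q_icar1 = tau2 * G                   # ICAR(1): intrinsic (G is singular)
Q_m2 = (tau2 * (K @ K)).tocsr()      # M(2): proper and still sparse
```

The row sums of G are zero (it is a Laplacian), interior voxels have n_i = 6, and Q_m2 stays sparse, which is what makes whole-brain inference feasible.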

Anisotropic spatial prior
The SPDE approach makes it possible to fairly easily construct anisotropic priors, for example using an SPDE of the form

(κ² − ∇ · H∇)^{α/2} τu(s) = W(s),  H = diag(h_x, h_y, h_z),  (2.6)

with h_z defined as h_z = 1/(h_x h_y) for identifiability. For α = 2, this SPDE has a finite-difference solution with precision matrix τ²KK, with K now defined as K = h_x G_x + h_y G_y + h_z G_z + κ²I. Here, G_x is defined as in Eq. (2.1), after redefining the neighbors as being only the adjacent voxels in the x direction. G_y and G_z are defined correspondingly, so that G = G_x + G_y + G_z. When using this prior for regressor k we have four parameters, θ_{s,k} = {τ²_k, κ²_k, h_{x,k}, h_{y,k}}. The new parameters h_x and h_y allow for different relative length scales of the spatial dependence in the x-, y- and z-direction, which is reasonable considering that the data might not have voxels of equal size in all dimensions and that the data collection is normally not symmetric with respect to the three axes. Conveniently, h_x = h_y = 1 gives the standard isotropic Matérn field defined earlier.
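The anisotropic precision K = h_x G_x + h_y G_y + h_z G_z + κ²I can be built from directional graph Laplacians; a sketch on a toy grid with made-up parameter values:

```python
import numpy as np
import scipy.sparse as sp

def directional_laplacians(nx, ny, nz):
    """G_x, G_y, G_z: graph Laplacians that only count neighbours along one axis,
    so that G_x + G_y + G_z equals the full 6-neighbour graph Laplacian G."""
    N = nx * ny * nz
    idx = np.arange(N).reshape(nx, ny, nz)
    Gs = []
    for axis in range(3):
        a = np.take(idx, np.arange(idx.shape[axis] - 1), axis=axis).ravel()
        b = np.take(idx, np.arange(1, idx.shape[axis]), axis=axis).ravel()
        rows = np.concatenate([a, b])
        cols = np.concatenate([b, a])
        A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(N, N))
        Gs.append(sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A)
    return Gs

Gx, Gy, Gz = directional_laplacians(4, 4, 4)
tau2, kappa2 = 1.0, 0.5          # hypothetical hyperparameter values
hx, hy = 1.3, 0.9                # anisotropy parameters
hz = 1.0 / (hx * hy)             # identifiability constraint h_z = 1/(h_x h_y)

K = hx * Gx + hy * Gy + hz * Gz + kappa2 * sp.identity(Gx.shape[0])
Q_aniso = (tau2 * (K @ K)).tocsr()   # anisotropic M(2) precision tau^2 * K * K
```

Setting hx = hy = 1 recovers the isotropic K = κ²I + G, since the directional Laplacians sum to G.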
Proposition 2.1. For α > d/2, the anisotropic field u defined in Eq. (2.6) has the marginal variance defined in Eq. (2.5), which is the same as for the isotropic field with the same τ² and κ², and the variance thus does not depend on h_x and h_y. Furthermore,

Cov(u(s), u(t)) = C(√((s − t)^T H^{−1} (s − t))),

where C is the isotropic Matérn covariance function in Eq. (2.4).

Proof. We show the covariance formula first, and then the statement about the marginal variance follows as Cov(u(s), u(s)) = C(√(0^T H^{−1} 0)) = C(0). By using a certain definition of the Fourier transform, the spectral density of u in the anisotropic SPDE in Eq. (2.6) is

S(ω) = (2π)^{−d} τ^{−2} (κ² + ω^T Hω)^{−α},

so the covariance function can be written as

Cov(u(s), u(t)) = ∫_{R^d} exp(iω^T(s − t)) S(ω) dω.

An isotropic field v can be written as an anisotropic field with H = I, so its covariance function for δ = ‖s − t‖₂ is

C(δ) = ∫_{R^d} exp(iz^T(s − t)) (2π)^{−d} τ^{−2} (κ² + z^T z)^{−α} dz.  (2.8)

On the other hand,

C(√((s − t)^T H^{−1}(s − t))) = ∫_{R^d} exp(iz^T H^{−1/2}(s − t)) (2π)^{−d} τ^{−2} (κ² + z^T z)^{−α} dz
= ∫_{R^d} exp(iω^T(s − t)) (2π)^{−d} τ^{−2} (κ² + ω^T Hω)^{−α} det(H^{1/2}) dω,

where the last step used the variable substitution z = H^{1/2}ω. Since det(H^{1/2}) = (h_x · h_y · 1/(h_x h_y))^{1/2} = 1, the last expression equals Cov(u(s), u(t)), which for δ = ‖s − t‖₂ with H = I reduces to the expression in Eq. (2.8).

So Proposition 2.1 implies that changing h_x or h_y does not affect the marginal variance of the field, which is convenient because it means that the anisotropic parameterization does not change the interpretation of τ² and κ², apart from that ρ = √(8ν)/κ will now be the (in some sense) average range in the x-, y- and z-direction, and we can use the same priors for them as in the isotropic case. By putting log-normal priors on h_x and h_y, as explained in the next subsection, we get priors that are symmetric with respect to the x-, y- and z-direction.

Hyperparameter priors
We will now specify priors for the spatial hyperparameters θ s = {θ s,1 , . . . , θ s,K }, which we let be independent across the different k. For brevity, we drop subindexing with respect to k in what follows.
Penalised complexity (PC) priors (Simpson et al., 2017) provide a framework for specifying weakly informative priors that penalize deviation from a simpler base model. Fuglstad et al. (2019) showed the usefulness of PC priors for the hyperparameters of Matérn Gaussian random fields, where the base model for κ² is chosen as the intrinsic field κ² = 0, and the base model for τ²|κ² is chosen as the model with zero variance, that is, τ² = ∞ (note that our definition of τ² corresponds to τ^{−1} in Fuglstad et al. (2019)). This means exponential priors for κ^{d/2} and for τ^{−1}|κ². The PC prior for M(2) allows the user to be weakly informative about the range and standard deviation of the spatial activation coefficient maps, by a priori controlling the lower tail probability for the range, Pr(ρ < ρ_0) = α_1, and the upper tail probability for the marginal variance, Pr(σ² > σ²_0) = α_2, of the field. By default, we will set α_1 = α_2 = 0.05, ρ_0 to 2 voxel lengths, and σ²_0 such that there is 5% probability that the marginal standard deviation of the activity coefficients is larger than 2% of the global mean signal. See Section S6.6 for full details about the PC prior for the M(2) hyperparameters.
For M(1), PC priors are not straightforward to specify, since the range and marginal variance are not available for ν = −1/2 in the continuous space, so we will instead use log-normal priors for τ 2 and κ 2 , as specified in Section S6.6.
For ICAR(1) and ICAR(2) we use the PC prior for Gaussian random effects for τ² in Simpson et al. (2017), which means an exponential prior for τ^{−1}. Since these spatial priors are intrinsic, and do not have a finite marginal variance, we control the PC prior using the conditional variance of a voxel given its neighbors, determined by the conditional precision Q_{i,i} = 6τ² for ICAR(1) and Q_{i,i} = 42τ² for ICAR(2) (for non-boundary voxels).
For the anisotropic priors we use log-normal priors for h_x and h_y,

log h_x, log h_y ∼ N(0, σ²_h),  Corr(log h_x, log h_y) = −1/2,

which means that also log h_z = log(1/(h_x h_y)) ∼ N(0, σ²_h), with correlation −1/2 with log h_x and log h_y. The motivation for this prior is that it is centered at the isotropic model h_x = h_y = 1, and it is symmetric with respect to the x-, y- and z-direction. We will use σ²_h = 0.01 as default, which roughly corresponds to a (0.8, 1.2) 95%-interval for h_x.
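The stated properties can be checked with a few lines of algebra. One joint structure consistent with them, assumed here for the check, is a bivariate Gaussian on (log h_x, log h_y) with correlation −1/2; then log h_z = −log h_x − log h_y has the same marginal variance, and the quoted 95%-interval follows:

```python
import numpy as np

sigma2_h = 0.01                          # default prior variance for log h_x and log h_y
sigma_h = np.sqrt(sigma2_h)

# Assumed structure: (log h_x, log h_y) bivariate Gaussian, each N(0, sigma2_h),
# with correlation -1/2, so that log h_z = -log h_x - log h_y is also N(0, sigma2_h).
rho = -0.5
var_log_hz = 2.0 * sigma2_h * (1.0 + rho)          # Var(-log h_x - log h_y)
cov_hz_hx = -(sigma2_h + rho * sigma2_h)           # Cov(log h_z, log h_x)
corr_hz_hx = cov_hz_hx / np.sqrt(var_log_hz * sigma2_h)

# 95% prior interval for h_x: exp(+-1.96*sigma_h), roughly (0.8, 1.2).
lo, hi = np.exp(-1.96 * sigma_h), np.exp(1.96 * sigma_h)
```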

Noise model priors
We use priors for the noise model parameters θ_n = {λ, A} that are independent across voxels and across AR parameters within the same voxel, with λ_n ∼ Γ(u_1, u_2) and A_{p,n} ∼ N(0, (τ²_A)^{−1}), which is the same prior as in Penny et al. (2005). Normally we use u_1 = 10 and u_2 = 0.1, which are the default values in the SPM software, and τ²_A = 10^{−3}, which is the value used in Penny et al. (2005). We have seen that the spatial prior for the AR parameters used previously (Penny et al., 2007; Sidén et al., 2017) gives similar results in practice, which is why we use the computationally simpler independent prior for A. This is verified empirically in Section 4.2 below.

Bayesian inference algorithm
The fast MCMC algorithm in Sidén et al. (2017) does not extend trivially to a 3D model with a Matérn prior, as the updating step for κ²_k conditional on the other parameters requires the computation of log determinants such as log|κ²_k I + G| for various κ²_k. This in general requires the Cholesky decomposition of κ²_k I + G, which has overwhelming memory and time requirements for large N and would normally not be feasible for whole-brain analysis. In addition, κ²_k would require some proposal density for a Metropolis-within-Gibbs step, as a conjugate prior is not available. The same problems apply to the MCMC steps for h_{x,k} and h_{y,k} when using the anisotropic model. The lack of conjugate priors also makes the spatial VB (SVB) method in Sidén et al. (2017) more complicated, as the mean-field VB approximate marginal posterior of κ²_k will not have a simple closed form.
We instead take an EB approach and optimize the spatial and noise model parameters θ = {θ_s, θ_n}, whose uncertainty we are not directly interested in, with respect to the log marginal posterior log p(θ|y). Conditional on the posterior mode estimates of θ, we then sample from the joint posterior of the parameters of interest, the activation coefficients in β, from which we construct posterior probability maps (PPMs) of activations. Optimizing θ is computationally attractive as we can use fast stochastic gradient methods (see Section 3.1) tailored specifically to our problem. We also note that VB tends to underestimate the posterior variance of the hyperparameters (Bishop, 2006; Rue et al., 2009; Sidén et al., 2017), and the approximate posterior for β in Sidén et al. (2017) only depends on the posterior mean of the hyperparameters. Thus, if EB is seen as approximating the distribution of each hyperparameter in θ as a point mass, it might not be much of a restriction compared to VB.
The marginal posterior of θ can be computed as

p(θ|y) = p(y|β*, θ) p(β*|θ) p(θ) / (p(β*|y, θ) p(y)),

for an arbitrary value of β*, where all involved distributions are known in closed form apart from p(y), but this term disappears when taking the derivative of log p(θ|y) with respect to θ_i. In Section 3.1 we present the optimization algorithm, leaving the finer details to Section S6. Given the optimal value θ̂, we will study the full joint posterior β|y,θ̂ of activity coefficients, which is normally the main interest in task-fMRI analysis. This distribution is a GMRF with mean μ̂ and precision matrix Q̂, as described in Section S6.2, and can be used for example to compute PPMs, as described in Section 3.2.

Parameter optimization
By using the EB approach with SGD optimization, we avoid the costly log determinant computations needed for MCMC, since the posterior density of θ itself never needs to be evaluated. Our algorithm instead uses the gradient of log p(θ|y) to optimize θ, for which there is a cheap unbiased estimate. We also use an approximation of the Hessian and other techniques to obtain an accelerated SGD algorithm, as described below.
The optimization of θ is carried out iteratively. At iteration j, each θ_i is updated with some step ∆θ_i^{(j)} as θ_i^{(j)} = θ_i^{(j−1)} + ∆θ_i^{(j)}. Let g_i = ∂ log p(θ|y)/∂θ_i |_{θ=θ^{(j−1)}} denote the gradient and H_i = ∂² log p(θ|y)/∂θ_i² |_{θ=θ^{(j−1)}} denote the Hessian for θ_i (note that we here use the term Hessian to describe a single number for each i, rather than the full Hessian matrix for θ, which would be too large to consider). Ideally, one would use the Newton method with ∆θ_i^{(j)} = −α g_i/H_i, with some learning rate α. It turns out that for our model this is not computationally feasible in general, since the gradient depends on various traces of the form tr(Q̂^{−1}T), for some matrix T with a similar sparsity structure as Q̂. For small problems, such traces can be computed exactly by first computing the selected inverse Q̂_inv of Q̂ using the Takahashi equations (Takahashi et al., 1973; Rue and Martino, 2007; Sidén et al., 2017), but this is prohibitive for problems of size larger than, say, KN > 10^5. However, the Hutchinson estimator (Hutchinson, 1990) gives a stochastic unbiased estimate of the trace as

tr(Q̂^{−1}T) ≈ (1/N_s) Σ_{j=1}^{N_s} v_j^T Q̂^{−1} T v_j,

where each v_j is an N × 1 vector with independent random elements 1 or −1 with equal probability. Hence, we can obtain an unbiased estimate of the gradient, which allows for SGD optimization. Using a learning rate α^{(j)} with the decay properties Σ_j (α^{(j)})² < ∞ and Σ_j α^{(j)} = ∞ guarantees convergence to a local optimum (Robbins and Monro, 1951; Asmussen and Glynn, 2007).
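The Hutchinson estimator itself is easy to demonstrate on toy matrices, where the exact trace is still computable for comparison. A sketch with made-up sparse matrices standing in for Q̂ and T:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

rng = np.random.default_rng(1)
n = 200

# Toy sparse SPD stand-in for the posterior precision Q-hat, and a sparse T.
B = sp.random(n, n, density=0.02, random_state=1, format='csr')
Q = (B @ B.T + 10.0 * sp.identity(n)).tocsc()
T = sp.diags(rng.uniform(0.5, 1.5, size=n)).tocsc()

# Reference value (only feasible for small n): tr(Q^{-1} T) via a dense solve.
exact = np.trace(splu(Q).solve(T.toarray()))

# Hutchinson estimator: tr(Q^{-1} T) ~ (1/Ns) sum_j v_j' Q^{-1} (T v_j),
# where v_j are Rademacher (+-1) vectors; each term costs one sparse linear solve.
Ns = 500
lu = splu(Q)
est = np.mean([v @ lu.solve(T @ v)
               for v in (rng.choice([-1.0, 1.0], size=n) for _ in range(Ns))])
```

In the whole-brain setting the sparse LU solve would be replaced by the PCG solver discussed in Section 3.1; the estimator is unbiased either way.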
To speed up the convergence of the spatial hyperparameters θ_s, in addition to SGD, we use an approximation H̃_θ of the Hessian, which improves the step length (Lange, 1995; Bolin et al., 2019). This approximation is also stochastically estimated using Hutchinson estimators of various traces; conveniently, Q̂^{−1}v_j needs only to be computed once for each j. The final optimization algorithm, presented in Algorithm 1, also uses: i) averaging over iterations (lines 4-5), which gives robustness to the stochasticity in the estimates, ii) momentum (line 6), which gives acceleration in the relevant direction and dampens oscillations, and iii) Polyak averaging (line 10), which reduces the error in the final estimate of θ by assuming that the last few iterations are just stochastic deviations from the mode. In practice, all parameters are reparametrized to be defined over the whole real line, see Section S6.

Algorithm 1 Parameter optimization algorithm
Require: Initial values θ^(0) and parameters N_iter, γ_1, γ_2, α_mom, {α^(j)}_{j=1}^{N_iter}, N_Polyak, N_s
1: for j = 1 to N_iter do
2:    Estimate the gradient and the approximate Hessian H̃_θ using Hutchinson estimators with N_s samples
3:    Compute θ_s step sizes ∆θ_s and θ_n step sizes ∆θ_n
4-5:  Average the stochastic estimates over iterations (using γ_1 and γ_2)
6:    Update θ with momentum (using α_mom and the learning rate α^(j))
7-9:  ...
10:   Return the Polyak average of the last N_Polyak iterates

Some practical details about the optimization algorithm follow. Normally, the maximum number of iterations used is N_iter = 200, the averaging parameters are γ_1 = 0.2 and γ_2 = 0.9, the momentum parameter is α_mom = 0.5, the learning rate decreases as α^(j) = 0.9/(0.1 max(0, j − 100) + 1), the learning rate for θ_n is α_n = 0.001, we use N_Polyak = 10 values for the Polyak averaging, and N_s = 50 samples for the Hutchinson estimator. These parameter values led to desirable behavior when monitoring the optimization algorithm on different datasets. We initialize the noise parameters by pre-estimating the model without the spatial prior, and the spatial parameters are normally initialized near the prior mean. We also start the algorithm by running a few (normally 5) iterations of SGD with a small learning rate.
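The interplay of a decaying learning rate, momentum, and Polyak averaging can be illustrated on a toy objective. This is a generic sketch of those three ingredients, not the authors' Algorithm 1; the learning-rate schedule below is an assumed form, and the noisy gradient merely mimics the Hutchinson-based stochastic gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
target = np.array([2.0, -1.0])

def noisy_grad(theta):
    """Unbiased noisy gradient of f(theta) = 0.5*||theta - target||^2, standing in
    for a stochastic gradient estimate of -log p(theta|y)."""
    return (theta - target) + 0.1 * rng.standard_normal(2)

n_iter, alpha_mom, n_polyak = 200, 0.5, 10
theta = np.zeros(2)
step = np.zeros(2)
tail = []
for j in range(1, n_iter + 1):
    alpha = 0.9 / (0.1 * max(0, j - 100) + 1.0)          # decaying learning rate
    step = alpha_mom * step - alpha * noisy_grad(theta)  # momentum accumulates past steps
    theta = theta + step
    if j > n_iter - n_polyak:
        tail.append(theta.copy())                        # collect the last iterates
theta_hat = np.mean(tail, axis=0)                        # Polyak averaging
```

The Polyak average of the last iterates damps the residual stochastic wiggle around the mode, mirroring the role of line 10 above.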
The computational bottleneck of the algorithm is the computation of large-dimensional linear algebra expressions such as Q̂^{−1}v_j, involving the multiplication of the inverse of a large sparse precision matrix with a vector. This is carried out using the fast preconditioned conjugate gradient (PCG) iterative solver of the corresponding equation system Q̂u = v_j, as in Sidén et al. (2017), where it is also illustrated that PCG is many times faster than directly solving the equation system using the Cholesky decomposition in these models.
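The PCG idea, solving Q̂u = v_j rather than ever forming Q̂^{−1}, can be sketched with SciPy on a toy sparse system. The incomplete-LU preconditioner below is a simple generic choice for illustration; the paper's preconditioner may differ:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, spilu, LinearOperator

rng = np.random.default_rng(3)
n = 500

# Toy sparse SPD matrix standing in for the posterior precision Q-hat.
B = sp.random(n, n, density=0.01, random_state=3, format='csr')
Q = (B @ B.T + 5.0 * sp.identity(n)).tocsc()
v = rng.choice([-1.0, 1.0], size=n)          # e.g. a Rademacher probe vector

# Preconditioned conjugate gradients: u solves Q u = v, so u = Q^{-1} v
# without any matrix inversion or Cholesky factorization.
ilu = spilu(Q, drop_tol=1e-4)
M = LinearOperator((n, n), matvec=ilu.solve)
u, info = cg(Q, v, M=M)
```

Each Hutchinson probe and each GMRF sample reduces to one such solve, which is why the iterative solver dominates the runtime profile.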

PPM computation
PPMs are used to summarize the posterior information about active voxels. The marginal PPM is computed for each voxel n and contrast vector c as P(c^T W_{·,n} > γ | y, θ̂), for some activity threshold γ, recalling that vec(W^T) = β are the activity coefficients.
Since β|y,θ̂ ∼ N(μ̂, Q̂^{−1}) is a GMRF (see Section S6.2), it is clear that c^T W_{·,n}|y,θ̂ is univariate Gaussian, and the PPM would be simple to compute for any c if we only had access to the mean and covariance matrix of W_{·,n}|y,θ̂ for every voxel n. The mean is known, but the covariance matrix is non-trivial to compute, since the posterior is parameterized using the precision matrix. We therefore use the simple Rao-Blackwellized Monte Carlo (simple RBMC) estimate in Sidén et al. (2018) to approximate this covariance matrix using

Var(W_{·,n}|y,θ̂) = E_{W_{·,−n}}[Var(W_{·,n}|W_{·,−n}, y, θ̂)] + Var_{W_{·,−n}}[E(W_{·,n}|W_{·,−n}, y, θ̂)],

where −n denotes all voxels but n. The first term on the right hand side is cheaply computed as the inverse of a K × K subblock of Q̂. The second term is approximated by producing N_RBMC samples W^(j) from W|y,θ̂, computing E(W_{·,n}|W^(j)_{·,−n}, y, θ̂) analytically for each j, and computing the Monte Carlo approximation of the variance. We leave out the details for brevity, but this computation is straightforward due to the Gaussianity and computationally cheap due to the sparsity structure of Q̂. The PPM computation time will normally be dominated by the GMRF sampling, which is done using the technique introduced in Papandreou and Yuille (2010).
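Once the voxel-wise mean and covariance are available, the marginal PPM is a single Gaussian tail probability. A sketch with hypothetical toy numbers (the mean, covariance, contrast and threshold below are made up):

```python
import numpy as np
from scipy.stats import norm

# Toy posterior moments for one voxel: mean and RBMC-style covariance
# estimate of the K x 1 coefficient vector W[:, n] (hypothetical values).
mu_n = np.array([0.8, 0.1, -0.2])
Sigma_n = np.array([[0.04, 0.01, 0.00],
                    [0.01, 0.09, 0.00],
                    [0.00, 0.00, 0.16]])

c = np.array([1.0, 0.0, 0.0])        # contrast vector
gamma_thr = 0.5                      # activity threshold gamma

# c'W[:, n] | y, theta-hat is univariate Gaussian, so the marginal PPM is
# P(c'W > gamma) = 1 - Phi((gamma - c'mu) / sd).
m = c @ mu_n
sd = np.sqrt(c @ Sigma_n @ c)
ppm = 1.0 - norm.cdf((gamma_thr - m) / sd)
```

Thresholding this probability (e.g. at 0.9) across voxels yields the activation map.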

Results
This section is divided into four subsections. We start by describing the real experimental data used. We then test the performance of the EB method against full MCMC and evaluate the convergence of the optimization algorithm. In the third subsection we present results for the experimental fMRI datasets, estimated using our optimization algorithm, implemented in Matlab for the presented model. We compare the different isotropic spatial priors by inspecting the resulting posterior activity maps, by examining the plausibility of new random samples from the spatial priors with estimated hyperparameters, and by computing measures of the predictive performance of the spatial priors using cross-validation. We also interpret the estimated parameters for the M(2) prior and present results for the anisotropic prior. In the last subsection we evaluate the method's ability to infer ground truth activity maps and compare the different priors on simulated data.

Description of experimental data
We evaluate the method on two different real fMRI datasets, the face repetition dataset (Henson et al., 2002) previously examined in Penny et al. (2005); Sidén et al. (2017), and the word object (word and object processing) dataset (Duncan et al., 2009). The face repetition dataset is available at SPM's homepage (http://www.fil.ion.ucl.ac.uk/spm/data/face_rep/) and the word object dataset is available at OpenNEURO (https://openneuro.org/datasets/ds000107/versions/00001) (Poldrack and Gorgolewski, 2017). The face repetition dataset was preprocessed using the same steps as in Penny et al. (2005) using SPM12 (including motion correction, slice timing correction and normalization to a brain template, but no smoothing), and small isolated clusters with less than 400 voxels were removed from the brain mask. The resulting mask has N = 57184 voxels and there are T = 351 volumes. For the word object data, preprocessing consisted only of motion correction and removal of isolated clusters of voxels, as the slice time information was not available. We selected subject 10, which had relatively little head motion; the resulting brain mask has N = 41486 voxels and the number of volumes is T = 166. For both datasets, the voxels are of size 3 × 3 × 3 mm, and the design matrices X both have K = 15 columns, with 4 columns corresponding to the standard canonical HRF, 4 columns corresponding to the HRF derivative, 6 columns corresponding to head motion parameters and 1 column corresponding to the intercept.

Evaluation of the EB method
To evaluate how well the EB posterior with optimized hyperparameters approximates the full posterior, we also fit the model with the ICAR(1) prior using MCMC as described in Sidén et al. (2017), on the face repetition data. The posterior mean, posterior standard deviation and PPM of the contrast for the EB method can be seen in the third row of Fig. 2; the corresponding maps computed with MCMC look very similar and are therefore not shown. As one comparison, the MCMC PPM contains 10 more active voxels in the whole brain than the EB PPM, 213 in total. The estimated values for the spatial hyperparameters were also similar. For this exercise, the same conjugate gamma prior for τ²_k as in Sidén et al. (2017) was used also for EB. The only model difference was that the spatial prior for the AR parameters (Penny et al., 2007; Sidén et al., 2017) was used with MCMC, while the independent prior was used with EB. However, the resulting AR parameters were similar, verifying that this spatial prior makes little difference compared to the independent prior for the AR coefficients: for example, 99% of the AR parameters estimated using EB differed by less than 0.1 from the corresponding posterior means from MCMC. The MCMC method used 10,000 iterations after 1,000 burn-in samples and a thinning factor of 5.
The convergence behavior of Algorithm 1 for the M(2) prior is depicted in Fig. 1. The hyperparameter optimization trajectories in Fig. 1a suggest that the parameters reach the right level in about 50 iterations for the face repetition data. The wiggliness from the stochasticity in the algorithm starts to die off with the decrease in learning rate after iteration 100. For the word object data, the trajectories are less wiggly, but take longer to reach the right level, and a higher learning rate would probably have led to faster convergence for this dataset. More careful tuning of the various optimization settings could of course lead to a faster and more stable method, but the default settings presented in Section 3.1 seem reasonable given all the datasets we have tried. The results presented in this paper are all after 200 iterations of optimization; future work could include devising an automatic convergence criterion based on the change of some parameters over the iterations. Fig. 1b shows how the PPM of the word object data converges. The computation times on a computing cluster, with two 8-core (16 threads) Intel Xeon E5-2660 processors at 2.2 GHz, were 0.7h, 1.7h, 3.0h and 5.4h for 10, 50, 100 and 200 iterations respectively.
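The learning-rate schedule described above (a constant rate for the first 100 iterations, then a decay) can be illustrated on a toy stochastic objective. This is a hedged sketch only: the objective, noise level and the 1/it decay rule are assumptions for illustration and are not the paper's Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: minimize E[(theta - target)^2] from noisy gradients,
# with a constant learning rate for 100 iterations followed by 1/it decay.
theta, target = 10.0, 2.0
base_rate = 0.1
trajectory = []
for it in range(1, 201):
    grad = 2.0 * (theta - target) + rng.normal(0.0, 1.0)  # stochastic gradient
    rate = base_rate if it <= 100 else base_rate * 100.0 / it
    theta -= rate * grad
    trajectory.append(theta)

print(trajectory[-1])   # close to the target 2.0, wiggle damped by the decay
```

The trajectory reproduces the qualitative behavior in Fig. 1a: rapid movement to the right level, residual wiggle, and damping once the rate decays.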

Comparison of spatial priors
Activity maps
Fig. 2 and Fig. 3 display posterior brain activity maps (mean, standard deviation and PPMs) for the two real datasets, when using the spatially independent GS prior and the different spatial priors. The figures show that one gets rather different results when using α = 2 instead of α = 1 in both datasets. The choice between κ = 0 (ICAR(α)) and κ > 0 (M(α)) has a very small impact on the results for the face repetition data, since the estimated κ is close to zero for M(α) for this dataset. For the word object data, the estimated κ is larger and there are clear differences in the activation inferences between M(2) and ICAR(2). The maps for the GS prior show that the inferred activations are clearly much more noisy without spatial regularization.

Interpretations for the M(2) prior
We show the estimated spatial range and marginal standard deviation for the Matérn field prior with smoothness α = 2 (M(2)), for the different datasets and conditions, in Table 2. The spatial range is much larger for the face repetition data, which should be interpreted as there being more long-range dependence between the activity coefficients at different voxels for this dataset. Possible explanations for this difference are different scanner properties and the fact that the slice timing and normalization preprocessing steps impose some smoothness on the face repetition data. Estimating our method on other OpenNEURO datasets resulted in ranges similar to those of the word object dataset. Having the flexibility to adapt to datasets with different spatial information is a good property of the M(2) prior. The corresponding exponential autocovariance and autocorrelation functions are depicted in Fig. 4. Their fat tails make these functions resemble the empirical autocorrelation functions for fMRI data found in Eklund et al. (2016, Supplementary Fig. 17) and Cox et al. (2017, Fig. 3).
Figure 3: Posterior maps of the first condition (words) for the word object data, when using the spatially independent GS prior and the different spatial priors. Posterior mean (left), posterior standard deviation (middle) and PPMs (right). Note the different color scale used for the GS prior. The PPMs show probabilities of exceeding 0.5% of the global mean signal, thresholded at 0.9. See the definition of the spatial priors in Table 1.
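The range and marginal standard deviation reported in Table 2 can be computed from the SPDE hyperparameters with the standard Matérn identities ρ ≈ √(8ν)/κ and σ² = Γ(ν)/(Γ(α)(4π)^{d/2} κ^{2ν} τ²), where ν = α − d/2 (for α = 2 in 3D, ν = 1/2, which gives the exponential covariance mentioned above). A minimal sketch, with illustrative hyperparameter values that are not the paper's estimates:

```python
import math

def matern_spde_summary(kappa2, tau2, alpha=2, d=3):
    """Spatial range and marginal std of an SPDE Matern field
    (standard identities from the Lindgren et al. (2011) framework)."""
    nu = alpha - d / 2.0                 # = 1/2 for alpha = 2 in 3D
    kappa = math.sqrt(kappa2)
    rho = math.sqrt(8.0 * nu) / kappa    # distance where correlation ~ 0.1
    sigma2 = (math.gamma(nu) /
              (math.gamma(alpha) * (4.0 * math.pi) ** (d / 2.0)
               * kappa ** (2.0 * nu) * tau2))
    return rho, math.sqrt(sigma2)

# Illustrative values (in voxel units; multiply the range by 3 mm voxel size)
rho, sigma = matern_spde_summary(kappa2=0.02, tau2=1.0)
print(rho, sigma)
```

For this α = 2, d = 3 case the variance formula reduces to σ² = 1/(8πκτ²), a useful sanity check on the implementation.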

Prior simulation
In an attempt to give a better understanding of the practical meaning of the different priors, Fig. 5 displays samples from the spatial priors for the first regressor, using the estimated values of τ²_1 and κ²_1 for the different datasets and priors. The same seed has been used for the same α and dataset. The visual impression is that M(1) and ICAR(1) produce fields that vary quite rapidly, while M(2) and ICAR(2) give fields that are smoother. For the word object dataset we note a difference in that the short estimated range for M(2) gives a sample with much faster variability than the sample from ICAR(2), which looks unrealistically smooth.
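Drawing such prior samples amounts to sampling from a GMRF with precision Q = τ²(G + κ²I)² for α = 2. The following is a minimal sketch on a small 2D lattice, assuming G is the graph Laplacian of the voxel neighbourhood and using a dense Cholesky factor for clarity; the actual whole-brain implementation would use sparse 3D operators and a sparse factorization.

```python
import numpy as np

def lattice_laplacian(nx, ny):
    """Graph Laplacian G of an nx-by-ny lattice (GMRF neighbour structure)."""
    n = nx * ny
    G = np.zeros((n, n))
    for i in range(nx):
        for j in range(ny):
            v = i * ny + j
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < nx and 0 <= jj < ny:
                    G[v, v] += 1.0
                    G[v, ii * ny + jj] -= 1.0
    return G

def sample_m2_prior(G, kappa2, tau2, rng):
    """Draw x ~ N(0, Q^{-1}) with Q = tau2 * (G + kappa2*I)^2 via Cholesky."""
    K = G + kappa2 * np.eye(G.shape[0])
    Q = tau2 * (K @ K)                  # alpha = 2 precision matrix
    L = np.linalg.cholesky(Q)           # Q = L L^T
    z = rng.standard_normal(G.shape[0])
    return np.linalg.solve(L.T, z)      # x = L^{-T} z has precision Q

rng = np.random.default_rng(1)
G = lattice_laplacian(20, 20)
x = sample_m2_prior(G, kappa2=0.1, tau2=1.0, rng=rng).reshape(20, 20)
print(x.shape)
```

Varying `kappa2` in this sketch reproduces the qualitative effect seen in Fig. 5: small κ² gives long-range, smooth fields, large κ² gives short-range variability.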

Cross validation
Many studies, including this one, evaluate models for fMRI data by displaying the estimated brain activity maps and deciding whether they look plausible or not. A more scientifically sound approach would be to compare models based on their ability to predict the values of unseen data points, which is the standard procedure in many other statistical applications. The problem for fMRI data is that the main object of interest, the set of activity coefficients W corresponding to activity related covariates, is not directly observable, but only indirectly through the observed noisy BOLD signal Y. This makes direct comparison to ground truth activation impossible. We will here attempt to evaluate the performance of the spatial priors for brain activity by measuring the out-of-sample predictive performance by computing various prediction error scores on Y instead. We cannot, however, expect to find large differences between the different priors, as only a small fraction of the signal is explained by brain activation; most is explained by the intercept and various noise sources.
We compute both the in-sample fit across all voxels and the out-of-sample fit using CV when we repeatedly leave out 50% and 90% of the voxels randomly over the whole brain, and compare the estimated and actual signal Y in those voxels. As we are mainly interested in evaluating the spatial priors, we compute the errors as explained in Section S7, in order to reduce the impact of the noise model, head motion and intercept regressors.
We use the (mean) absolute error (MAE) and (root mean) square error (RMSE) to evaluate the predicted mean of Y for each prior, and the continuous ranked probability score (CRPS), the ignorance score (IGN, also known as the logarithmic score) and the interval score (INT) to evaluate the whole predictive distribution for Y. All these scores are examples of proper scoring rules (Gneiting and Raftery, 2007), which encourage the forecaster to be honest: the expected score is optimized when the predictive distribution equals the generative distribution of the data points. Since the predictive distribution is Gaussian given the hyperparameters, all the scores can be computed using simple formulas, as explained in Section S7.
The results can be seen in Table 3. The differences between the priors may seem small, but recall that most of the error comes from noise that is unrelated to brain activity, making it hard for a spatial activity prior to substantially reduce the error. For the face repetition data, we note that the M(2) prior performs better than the other priors in all cases, both in-sample and out-of-sample, and the differences are significant except for the ignorance score. For the word object data the differences between priors are smaller, which can partly be explained by the higher noise level in this dataset.
Table 3: Cross validation measures for the two datasets, comparing the different spatial priors. All the scores are computed as means across voxels, and presented in negatively oriented forms, so that smaller values are always better. In-sample refers to the average across all voxels, while the other columns show means and standard errors across 50 random sets of left-out voxels. The same random sets were left out for all priors. The small differences between the models can be explained by the fact that the signal Y depends more on the intercept, head motion and other noise sources than on the activity coefficients.
The large RMSE for the ICAR(2) prior for the face repetition data indicates that this prior can give relatively large out-of-sample errors, possibly due to over-smoothing. The fact that the M(1) and ICAR(1) priors often perform better than the M(2) prior in-sample, but not out-of-sample, for the word object data could be a sign of over-fitting for these less smooth priors.

Anisotropic priors
We also estimated the model with the anisotropic priors, defined in Section 2.3, for the two datasets, but found it hard to set the tuning parameters of Algorithm 1 to obtain a stable optimization that converges. The estimated anisotropic parameters for the word object data are shown in Table 4. It can be seen that the hyperparameters corresponding to the main HRF regressors get reasonable values fairly close to 1. The interpretation of h_x and h_y being larger than 1 is that this prior is able to capture a spatial dependence that is stronger in the x- and y-directions than in the z-direction.
For the hyperparameters corresponding to the HRF derivative regressors, however, we note a diverging behavior, with very small or very large estimated values. The estimates for the HRF derivatives were also not consistent when rerunning the optimization with different random seeds and initial values. Similar problems were also encountered for the main HRF hyperparameters when analyzing the face repetition dataset. Our conclusion is that, although the anisotropic prior seems to give useful results for some datasets, it is as yet too unstable to use in practice without careful supervision of the convergence. Future work should address making this process more stable, for example using stronger priors on the hyperparameters to enhance identifiability.

Simulated data
We simulate data using the results from real datasets with the M(2) prior as template. We use the same brain mask, and subsets of the rows and columns from the design matrix X corresponding to the HRF and intercept to create a new design matrix.
We take the noise model estimates of λ and A and the posterior means of intercept regression coefficients as true values. We do two simulations: the first one based on the face repetition data with T = 351 and K = 5 and the second one based on the word object data with T = 100 and K = 3.
In the first simulation, for the four regressors corresponding to the task conditions, we choose values for the prior hyperparameters τ 2 k and κ 2 k according to Table 5, chosen to reflect the parameter estimates in the real datasets (Table 2). We then simulate values from the spatial M(2) priors for the remaining regression coefficients in W, and after that values for Y through our generative model. The true sampled values of W for k = 1 to 4 can be seen in Fig. 6.
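The generative step, going from sampled coefficients W to data Y, can be sketched as follows. This is a toy illustration with made-up dimensions and AR(1) noise parameters; it stands in for, but is not, the paper's simulation code, and W here is random rather than drawn from the M(2) prior.

```python
import numpy as np

rng = np.random.default_rng(2)
T, K, N = 100, 3, 500                  # volumes, regressors, voxels (toy sizes)
X = rng.standard_normal((T, K))
X[:, -1] = 1.0                         # intercept column
W = rng.standard_normal((K, N))        # stand-in for prior-sampled coefficients

def ar1_noise(T, n, a, lam, rng):
    """AR(1) noise with coefficient a and innovation precision lam, per voxel.
    (Initialized from the innovation distribution, fine for a sketch.)"""
    e = rng.standard_normal((T, n)) / np.sqrt(lam)
    for t in range(1, T):
        e[t] += a * e[t - 1]
    return e

# Generative model: Y = X W + E with voxel-wise autoregressive noise
Y = X @ W + ar1_noise(T, N, a=0.3, lam=4.0, rng=rng)
print(Y.shape)
```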
The estimated spatial range and standard deviation with the M(2) prior are shown in Table 5. We note that the estimated values tend to be closer to the truth when the range is short. This is probably due to boundary effects that have higher impact on the results when the spatial range becomes closer to the size of the domain.
The posterior maps for the first condition for the different spatial priors are quite similar across priors and the results are therefore only shown in supplementary Fig. S8. The small differences can be explained by the fact that the data are rather informative, with T = 351 and small noise variance (large λ), which makes the impact of the priors small.
In the second simulation we make the data less informative, by using the estimated parameters and brain mask from the word object dataset, which has higher noise variance, and only T = 100 volumes. Posterior maps for the first condition are shown in Fig. 7. This condition had a true range of 15 mm and a standard deviation of 1.3, which were chosen to be close to the estimated values for the real dataset. We see larger differences between the priors for this simulation. In particular, we note that the ICAR(2)-prior tends to oversmooth slightly.

Conclusions and directions for future research
We propose an efficient Bayesian inference algorithm for whole-brain analysis of fMRI data using the flexible and interpretable Matérn class of spatial priors. We study the empirical properties of the prior on simulated data and two real fMRI datasets, and conclude that the M(2) prior is in general the preferred choice for future studies. The priors with α = 1 are clearly inferior in the sense that they do not find the likely correct activity patterns that are found by the priors with α = 2, they produce new samples that appear too speckled, and they perform worse in the cross validation. The differences between M(2) and ICAR(2) are less evident, but the facts that they produce somewhat different activity maps for some datasets, that new samples from ICAR(2) look too smooth, that M(2) performed consistently better in the cross validation, and that the M(2) prior parameters are more easily interpreted all argue in favor of the M(2) prior.
The optimization algorithm appears satisfactory, with relatively fast convergence. Using SGD is an improvement relative to the coordinate descent algorithm employed for SVB in Sidén et al. (2017), because following the gradient is in general a shorter path to the optimum and there exist better theoretical guarantees for the convergence. Also, well-known acceleration strategies, such as using momentum or approximate Hessian information, are easier to adapt to SGD, so one can avoid the more ad hoc acceleration strategies used in Sidén et al. (2017, Appendix C).
The EB method seems to approximate the exact MCMC posterior well based on our comparison, suggesting that properly accounting for the uncertainty in the spatial hyperparameters is of minor importance when the main objects of interest are the activity maps.
As the smoothness parameter α appears to be the most important for the resulting activity maps, it would in future research be interesting to estimate it as a non-integer value, which could be addressed using the method in Bolin and Kirchner (2017).
In this work we have not focused on solving the multiple comparison problem which arises when classifying a large number of voxels as active or inactive. In our previous article (Sidén et al., 2017), we suggested using joint PPMs (Yue et al., 2014; Mejia et al., 2019), rather than marginal PPMs, which are based on the excursion set method introduced in Bolin and Lindgren (2015). The joint PPM finds the active region, defined as the largest region such that all voxels in the region are jointly active with some large probability. It is likely that joint PPMs would show larger differences between the M(2) and ICAR(2) priors, since they depend more on the spatial correlation. Joint PPMs are easily computed from MCMC output, but harder to obtain from the high-dimensional GMRF posterior of the EB method, since the posterior covariance matrix is not directly available. The block RBMC method for approximating the posterior covariance matrix in Sidén et al. (2018) could be used to compute the probability of joint activation in such regions, and this is straightforward when the regions are spatially concentrated. However, when the active region is spread out across the brain, further approximations would be required, and this should be addressed in future work.
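The distinction between marginal and joint PPMs can be made concrete with a toy Monte Carlo estimate. This sketch is only a stand-in for the excursion-set machinery of Bolin and Lindgren (2015): it assumes a small Gaussian posterior where sampling is cheap, whereas the point of the RBMC discussion above is precisely that such sampling is hard at whole-brain scale.

```python
import numpy as np

def joint_activation_prob(mu, cov, threshold, region, n_samples=20000, seed=3):
    """Monte Carlo estimate of P(all voxels in `region` exceed `threshold`)
    under a Gaussian posterior N(mu, cov)."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu, cov, size=n_samples)
    return np.mean(np.all(samples[:, region] > threshold, axis=1))

# Toy 5-voxel posterior with positive spatial correlation
mu = np.array([1.0, 1.2, 0.9, 0.1, 0.0])
cov = 0.2 * np.eye(5) + 0.1 * np.ones((5, 5))
p_joint = joint_activation_prob(mu, cov, threshold=0.5, region=[0, 1, 2])
p_marg = [joint_activation_prob(mu, cov, 0.5, [i]) for i in (0, 1, 2)]
print(p_joint, p_marg)   # the joint probability never exceeds any marginal
```

The joint probability is necessarily below each marginal, which is why thresholding joint PPMs is the more conservative, multiplicity-aware choice.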
The estimated spatial hyperparameters for the real datasets in Table 2 have strikingly similar values across the different HRF regressors. A natural idea is therefore to let these regressors share the same spatial hyperparameters, at least when the tasks in the experiment are similar. This would give more power to estimate the hyperparameters, which could be particularly interesting for the anisotropic priors. Our Bayesian inference algorithm is straightforwardly extended to this setting.

Stein, M. L. (1999). Interpolation of spatial data: Some theory for kriging. Springer-Verlag, New York.
Takahashi, K., Fagan, J., and Chen, M. S. (1973). Formation of a sparse bus impedance matrix and its application to short circuit study.
Yue, Y. R., Lindquist, M., Bolin, D., Lindgren, F., Simpson, D., and Rue, H. (2014). A Bayesian general linear modeling approach to slice-wise fMRI data analysis. Preprint.

Log marginal likelihood gradient and Hessian
For the optimization of the parameters θ = {θ_s, λ, A}, we need the gradient and approximate Hessian, with respect to the different hyperparameters, of the log marginal likelihood log p(y|θ) = log p(y|β, θ) + log p(β|θ) − log p(β|y, θ), which is constant with respect to β. The gradient of the log posterior is then simply ∂/∂θ_i log p(θ|y) = ∂/∂θ_i log p(y|θ) + ∂/∂θ_i log p(θ), and similarly for the approximate Hessian. We start this derivation by considering the log likelihood log p(y|β, θ) and the conditional log posterior log p(β|y, θ), before producing the expressions for the log marginal likelihood gradient as well as the approximate log marginal likelihood Hessian. Finally, the corresponding posterior gradient and Hessian can be computed by adding the prior contributions derived in the last subsection.
To get a more robust optimization algorithm in practice, we use the reparameterizations τ_{0,k} = log τ²_k, κ_{0,k} = log κ²_k and λ_{0,n} = log λ_n, so that the parameters are defined over the whole of R, and then perform the optimization over these new variables. The gradient for the new variables is easily obtained from the gradient for the old ones using the chain rule, for example ∂ log p(y|θ)/∂τ_{0,k} = τ²_k ∂ log p(y|θ)/∂τ²_k. For A_{p,n} we use a logit reparameterization A_{0,p,n}. The log likelihood can be written

log p(Y_{·,n}|W_{·,n}, θ) = (T/2) log λ_n − (λ_n/2) l_n(W_{·,n}) + const,   (6.1)

where l_n(W_{·,n}) = Y^T_{·,n}Y_{·,n} − 2Y^T_{·,n}XW_{·,n} + W^T_{·,n}X^T X W_{·,n} in the case when the noise is independent over time, and

l_n(W_{·,n}) = Y^T_{·,n}Y_{·,n} − 2Y^T_{·,n}XW_{·,n} + W^T_{·,n}X^T X W_{·,n} − 2Y^T_{·,n}d^T_n A_{·,n} + A^T_{·,n}d_n d^T_n A_{·,n} + (W^T_{·,n}B^T_n A_{·,n} + A^T_{·,n}B_n W_{·,n}) − (W^T_{·,n}RA_{·,n} + (RA_{·,n})^T W_{·,n}) − (A^T_{·,n}D_n W_{·,n} + (D_n W_{·,n})^T A_{·,n}) + W^T_{·,n}(A^T_{·,n}SA_{·,n})W_{·,n}

when the noise follows an AR(P) process in each voxel with AR parameters A_{·,n}. We follow the notation in Sidén et al. (2017), except that here β = vec(W^T). The motivation for using this seemingly cumbersome notation is that it allows precomputation of all sums and matrix products over the time dimension, so that these can be avoided in each iteration of the algorithm, greatly reducing computation times.
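The chain-rule identity for the log reparameterization, ∂f/∂τ_{0,k} = τ²_k ∂f/∂τ²_k with τ_{0,k} = log τ²_k, can be verified numerically. The test function `f` below is an arbitrary stand-in for log p(y|θ), chosen only to be smooth.

```python
import math

# Finite-difference check of d f / d tau0 = tau2 * d f / d tau2,
# where tau0 = log(tau2), on an arbitrary smooth test function f.
def f(tau2):
    return math.log(tau2) ** 2 + 3.0 * tau2      # stand-in for log p(y|theta)

def num_deriv(g, x, h=1e-6):
    return (g(x + h) - g(x - h)) / (2.0 * h)

tau2 = 0.7
lhs = num_deriv(lambda t0: f(math.exp(t0)), math.log(tau2))  # d/d tau0
rhs = tau2 * num_deriv(f, tau2)                              # tau2 * d/d tau2
print(lhs, rhs)   # the two derivatives agree
```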

Conditional log posterior
Following the derivation in Sidén et al. (2017, Appendix A), we obtain β|y, θ ∼ N(μ̃, Q̃⁻¹), where for P = 0 the parameters take a simple form, and for the case with autoregressive noise (P > 0) we have

Q̃_n = X^T X − RA_{·,n} − (RA_{·,n})^T + A^T_{·,n}SA_{·,n},   q̃_n = Y^T_{·,n}X − A^T_{·,n}B_n + A^T_{·,n}D_n A_{·,n},   (6.3)

where P_KN is the permutation matrix defined such that vec(W) = P_KN vec(W^T), used to assemble the full posterior precision Q̃ from the voxel-wise blocks. We note that we can also write the parameters of the i.i.d. case in the second, slightly more complicated form in Eq. (6.3) by instead choosing Q̃_n = X^T X and q̃_n = Y^T_{·,n}X.

Gradient
In this section we derive the gradient for the M(2) case. The gradients for the other spatial priors can be obtained using the same strategy and are left out for brevity. We begin by writing down the log marginal likelihood gradient; the derivation then follows.
The log marginal likelihood gradient components ∂ log p(y|θ)/∂τ²_k, ∂ log p(y|θ)/∂κ²_k and ∂ log p(y|θ)/∂λ_n are given in Eq. (6.5), where K_k = G + κ²_k I so that Q_k = τ²_k K_k K_k, and J^{ij} is the square single-entry matrix which is zero everywhere except in position (i, j), where it is 1. The size of J^{ij} is clear from the context; it is used here together with the Kronecker product to construct single-block matrices, where everything but one block is zero. M is the K × N matrix such that μ̃ = vec(M^T) (compare with β = vec(W^T)). The terms in the expression for ∂ log p(y|θ)/∂A_{·,n} are given in Eq. (6.6).
We begin by computing the gradient with respect to τ²_k and κ²_k, which do not appear in the likelihood, which is why the gradient is the same for the cases P = 0 and P > 0 (given Q̃ and μ̃). Thereafter we treat λ_n, whose gradient differs between these cases through the log likelihood term l_n, and lastly A_{p,n}, which is only relevant for the case P > 0. A reference for some of the matrix-algebraic operations used here is Petersen and Pedersen (2012).
Gradient with respect to τ²_k and κ²_k
By setting β = 0 and removing everything that is constant with respect to τ² and κ², Eq. (6.4) becomes

log p(y|θ) = (1/2) Σ_{k=1}^K log |Q_k| − (1/2) log |Q̃| + (1/2) μ̃^T Q̃ μ̃ + const,

from which we obtain the derivatives with respect to τ²_k and κ²_k in Eq. (6.5).
Gradient with respect to λ_n
One can note that the corresponding expression is zero for β = μ̃. Thus, the gradient with respect to λ_n in Eq. (6.5) can be obtained by taking the derivative of Eq. (6.4) and evaluating at β = μ̃.

Gradient with respect to A_{p,n}
Here R_p and S_p are of sizes K × K and K × K × P, and refer to the pth sub-tensor along the appropriate dimension of R and S, respectively. The expression in Eq. (6.4) is derived after noting that the remaining terms of the log likelihood become zero after taking the derivative and evaluating at β = μ̃.
Approximate Hessian
The derivation starts by noting that the non-constant part of the augmented log likelihood with respect to τ_{0,k} and κ_{0,k} is

log p(y, β|θ) = (1/2) log |Q| − (1/2) β^T Qβ + const.   (6.8)

Taking the derivative twice with respect to τ_{0,k} and κ_{0,k} and computing the expectation gives the result, after noting that E_{β|Y,θ}[β^T Tβ] = tr(Q̃⁻¹T) + μ̃^T Tμ̃ for a general KN × KN matrix T (Petersen and Pedersen, 2012, Eq. (318)). We write out the derivation for τ_{0,k} as an example: taking the derivatives of Eq. (6.8) and then the conditional expectation with respect to β|Y, θ gives the result in Eq. (6.7).
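The expectation identity E_{β|Y,θ}[β^T Tβ] = tr(Q̃⁻¹T) + μ̃^T Tμ̃ used above can be checked by Monte Carlo on a small toy precision matrix. The matrices below are random stand-ins; the fixed seed makes the check deterministic.

```python
import numpy as np

# Monte Carlo check of E[beta^T T beta] = tr(Q^{-1} T) + mu^T T mu
# for beta ~ N(mu, Q^{-1}) (Petersen and Pedersen, Eq. (318)).
rng = np.random.default_rng(4)
k = 4
A = rng.standard_normal((k, k))
Q = A @ A.T + k * np.eye(k)            # small SPD precision matrix
Qinv = np.linalg.inv(Q)
mu = rng.standard_normal(k)
T = rng.standard_normal((k, k))

analytic = np.trace(Qinv @ T) + mu @ T @ mu
samples = rng.multivariate_normal(mu, Qinv, size=200_000)
mc = np.mean(np.einsum('si,ij,sj->s', samples, T, samples))
print(analytic, mc)   # the two values agree to Monte Carlo accuracy
```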

The anisotropic case
We will in this subsection present the gradient and Hessian for the anisotropic M(2) model. We only consider h_{0,k,x} = log h_{k,x}, as the results for h_{k,y} are completely symmetric. The derivations can be performed analogously to Subsections 6.3 and 6.4. The gradient is

∂ log p(y|θ)/∂h_{0,k,x} = tr(K_k⁻¹ ∂K_k/∂h_{0,k,x}) − τ²_k tr(Q̃⁻¹ (J^{kk} ⊗ K_k ∂K_k/∂h_{0,k,x})) − τ²_k M_{k,·} K_k (∂K_k/∂h_{0,k,x}) M^T_{k,·},   (6.9)

where ∂K_k/∂h_{0,k,x} = exp(h_{0,k,x}) G_x − exp(−h_{0,k,x}) exp(−h_{0,k,y}) G_z. The approximate Hessian involves ∂²K_k/∂h²_{0,k,x} = exp(h_{0,k,x}) G_x + exp(−h_{0,k,x}) exp(−h_{0,k,y}) G_z and H_{k,x} = (∂K_k/∂h_{0,k,x})(∂K_k/∂h_{0,k,x}) + K_k ∂²K_k/∂h²_{0,k,x}. The first trace of this expression is approximated using the Hutchinson estimator.

Cross validation error computations
We partition the regression coefficients so that Ẇ corresponds to the K_act × N activity regression coefficients, which have spatial priors, and Ẅ to the nuisance regression coefficients, with Ẋ and Ẍ denoting the corresponding columns of the design matrix. We define the cross validation error time series as

E^CV_{·,D} = R_{·,D} − Ẍ E(Ẅ_{·,D}|R_{·,D}, θ),   R_{·,D} = Y_{·,D} − Ẋ E(Ẇ_{·,D}|Y_{·,−D}, θ),   (7.2)

which is computed in two steps. First the out-of-sample residuals R_{·,D} of the spatial part of the model are computed, and then a new model R_{·,D} = ẌẄ_{·,D} + E_{·,D} is fitted for each voxel independently, using the original values for the parameters θ, before the error time series E^CV_{·,D} can be computed. The reason for this seemingly complicated procedure is to reduce the impact of the nuisance regressors on the evaluation of the spatial model. We compute the in-sample errors as E^IS = Y − X E(W|Y, θ), and the MAE and RMSE for a voxel set D as the mean absolute value and the root mean square of the errors across all time points and voxels in D (Eq. (7.3)). The spatial posterior predictions for the dropped voxels, E(Ẇ_{·,D}|Y_{·,−D}, θ), are computed using Eq. (6.3) after replacing Q̃_n and q̃_n with 0 for all n ∈ D.
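The Hutchinson trace estimator used for the anisotropic Hessian above replaces an exact trace with an average of quadratic forms over random Rademacher probes, requiring only matrix-vector products. A minimal sketch on a dense random SPD test matrix (the real use case would pass a sparse solve as `matvec`):

```python
import numpy as np

def hutchinson_trace(matvec, n, m, rng):
    """Hutchinson estimator: tr(A) ~ (1/m) * sum_i v_i^T A v_i
    with Rademacher probe vectors v_i; only matvec products are needed."""
    est = 0.0
    for _ in range(m):
        v = rng.choice([-1.0, 1.0], size=n)
        est += v @ matvec(v)
    return est / m

rng = np.random.default_rng(5)
n = 200
A = rng.standard_normal((n, n))
A = A @ A.T                                 # SPD test matrix
est = hutchinson_trace(lambda v: A @ v, n, m=500, rng=rng)
print(est, np.trace(A))                     # estimate is close to the exact trace
```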
For the proper scoring rules CRPS, IGN and INT, we also need to compute the predictive standard deviation in each voxel. This is done in a way that neglects the uncertainty in the intercept and head motion regressors. For an unseen datapoint Ỹ_{t,n}, the law of total variance gives

Var(Ỹ_{t,n}|Y_{·,−D}, θ, Ẅ) = E_{Ẇ|Y_{·,−D}}[Var(Ỹ_{t,n}|θ, W)] + Var_{Ẇ|Y_{·,−D}}[E(Ỹ_{t,n}|θ, W)] = Var(Ỹ_{t,n}|θ, W) + X_{t,(1:K_act)} Var(Ẇ_{·,n}|Y_{·,−D}) X^T_{t,(1:K_act)},   (7.4)

where the first term is simply the variance of the AR noise process in voxel n, which does not depend on W and can be obtained from the AR parameters A_{·,n} through the Yule–Walker equations (see for example Cryer and Chan, 2008). The second term can be computed using the simple RBMC estimator as in Eq. (3.2) in the main article, after replacing Q̃_n with 0 for all n ∈ D, as was done for the mean. The predictive distribution for x = E_{t,n} is Gaussian with mean µ = 0 and variance σ² as in Eq. (7.4), which gives the closed-form scores

CRPS(x) = σ [z(2Φ(z) − 1) + 2ϕ(z) − 1/√π],   z = (x − µ)/σ,
IGN(x) = (1/2) log(2πσ²) + (x − µ)²/(2σ²),
INT(x) = 2Aσ + (2/α)[(µ − Aσ − x)⁺ + (x − µ − Aσ)⁺],

where ϕ and Φ denote the standard normal PDF and CDF and A = Φ⁻¹(1 − α/2) ≈ 1.96 for α = 0.05, which is used by default. The presented values for the scores are averages across all time points and left-out voxels.
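The closed-form Gaussian scores can be sketched in a few lines. The formulas follow the standard expressions in Gneiting and Raftery (2007); the function name `gaussian_scores` is illustrative, and all three scores are returned in the negatively oriented form used in Table 3 (smaller is better).

```python
import math
from statistics import NormalDist

_N = NormalDist()

def gaussian_scores(x, mu, sigma, alpha=0.05):
    """Closed-form CRPS, IGN and INT for a Gaussian predictive N(mu, sigma^2),
    negatively oriented (smaller is better)."""
    z = (x - mu) / sigma
    crps = sigma * (z * (2.0 * _N.cdf(z) - 1.0)
                    + 2.0 * _N.pdf(z) - 1.0 / math.sqrt(math.pi))
    ign = 0.5 * math.log(2.0 * math.pi * sigma ** 2) + 0.5 * z * z
    A = _N.inv_cdf(1.0 - alpha / 2.0)            # ~1.96 for alpha = 0.05
    lo, hi = mu - A * sigma, mu + A * sigma
    intv = (hi - lo) + (2.0 / alpha) * (max(lo - x, 0.0) + max(x - hi, 0.0))
    return crps, ign, intv

print(gaussian_scores(0.0, 0.0, 1.0))
```

At x = µ with σ = 1 the CRPS equals 2ϕ(0) − 1/√π ≈ 0.234, a handy reference value for unit tests.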

Additional results
Posterior maps for the first condition of the first simulated dataset for the different spatial priors can be seen in Fig. 8. The results for the M(1) prior are almost visually indistinguishable from the ICAR(1) results, so these are not shown. The maps for the α = 1 priors also look quite similar to the maps for α = 2. The M(2) and ICAR(2) results look very much alike and are also very similar to the "true" posterior maps obtained by estimating the true M(2) model with the true spatial hyperparameters and noise parameters. When comparing the corresponding maps for the other conditions, which have larger prior range (results not shown), the difference between the α = 1 and α = 2 priors is larger, but the M(2) and ICAR(2) results remain very similar.