Intuitive joint priors for variance parameters

. Variance parameters in additive models are typically assigned independent priors that do not account for model structure. We present a new framework for prior selection based on a hierarchical decomposition of the total variance along a tree structure to the individual model components. For each split in the tree, an analyst may be ignorant or have a sound intuition on how to attribute variance to the branches. In the former case a Dirichlet prior is appropriate to use, while in the latter case a penalised complexity (PC) prior provides robust shrinkage. A bottom-up combination of the conditional priors results in a proper joint prior. We suggest default values for the hyperparameters and oﬀer intuitive statements for eliciting the hyperparameters based on expert knowledge. The prior framework is applicable for R packages for Bayesian inference such as INLA and RStan . Three simulation studies show that, in terms of the application-speciﬁc measures of interest, PC priors improve inference over Dirichlet priors when used to penalise diﬀerent levels of complexity in splits. However, when expressing igno-rance in a split, Dirichlet priors perform equally well and are preferred for their simplicity. We ﬁnd that assigning current state-of-the-art default priors for each variance parameter individually is less transparent and does not perform better than using the proposed joint priors. We demonstrate practical use of the new framework by analysing spatial heterogeneity in neonatal mortality in Kenya in 2010–2014 based on complex survey data.


Introduction
Bayesian hierachical models (BHMs) are ubiquitous in science due to their flexibility and interpretablity (Gelman and Hill, 2007;Gelman et al., 2013;Banerjee et al., 2014).In this paper, we consider BHMs where the latent level consists of an additive combination of model components that are classified as fixed effects and random effects.This subclass covers a range of common model classes such as generalised linear mixed models (GLMMs) and generalised additive mixed models (GAMMs) (Fahrmeir and Lang, 2001).In additive models, the total latent variance of the sum of the random effects decomposes into the sum of the variance contributed by each random effect, and each random effect has a variance parameter that controls its a priori contribution.We such as the size of t or preference for A over B or A+B over C in a transparent and intuitive way.
An obvious alternative is to parametrize the variance parameters as t and the proportion of t assigned to each random effect (ω A , ω B , ω C ), where 0 ≤ ω A , ω B , ω C ≤ 1 and ω A + ω B + ω C = 1.This is illustrated in Figure 1a by splitting A+B+C into the models A, B and C.This parametrization is suitable for expressing ignorance about how the variance should be attributed to the random effects.A simple way to assign the joint prior is to set (ω A , ω B , ω C ) ∼ Dir(a, a, a), a > 0, where Dir denotes the Dirichlet distribution (Balakrishnan and Nevzorov, 2003).This prior has no preference for one of the random effects over the other and is invariant to the ordering of the random effects, and we can select a > 0 to make the prior suitably vague.Together with the conditional prior π(t|ω A , ω B , ω C ), this implicitly defines a proper joint prior for (σ 2  A , σ 2 B , σ 2 C ) that is invariant to permutations in the order of the random effects, but can incorporate prior knowledge on t.This has a similar flavor as the Dirichlet-Laplace prior by Bhattacharya et al. (2015), which is a global-local shrinkage prior (Polson and Scott, 2010) that induces sparsity in regression.However, in this paper we will focus on random effects and not fixed effects.
The simple split strategy is not always suitable and Riebler et al. (2016) demonstrated that for the BYM (Besag, York and Mollié) model, which is a sum of a Besag random effect and an unstructured random effect, a PC prior that penalises the added complexity of the structured effect relative to the unstructured effect improves inference.For A+B+C, fewer levels of hierarchy may be preferred so that B is preferred to A and C is preferred over A+B.This knowledge about relative complexity of the random effects can be incorporated by splitting A+B+C hierarchically as shown in Figure 1b.
Here we first split A+B+C into A+B and C through ω 1 = (σ 2 A + σ 2 B )/t, and then split A+B into A and B through ω 2 = σ 2 A /(σ 2 A + σ 2 B ), where 0 ≤ ω 1 , ω 2 ≤ 1.The joint prior for (σ 2  A , σ 2 B , σ 2 C ) is then constructed by first selecting π(ω 2 ), then π(ω 1 |ω 2 ), and finally π(t|ω 1 , ω 2 ).Priors inducing shrinkage towards ω 2 = 0 and ω 1 = 0 can be chosen in the lower and upper split, respectively.The shrinkage can be illustrated graphically as shown in Figure 1c.For LGMs, PC priors offer a robust choice, but the framework is general and other priors can be selected by the analyst.For example, if shrinkage is only required at the top level, a Dirichlet prior for (ω 2 , 1 − ω 2 ) could be combined with a shrinkage prior for ω 1 |ω 2 .
The ideas generalize to more random effects through the selection of a hierarchical decomposition of the model in the form of a tree, and the selection of a conditional distribution for the attribution of the total variance to the branches for each split.The joint prior is calculated in a bottom-up approach using these conditional distributions.We suggest default values for the hyperparameters of the Dirichlet distribution based on the marginal prior distributions for the proportions of variance assigned to each branch of the split.This ensures that the default setting for the prior is well-behaved as the number of branches in a split increases.Default values for the PC priors can be selected based on moderate shrinkage of the proportion of variance.Additionally, we discuss how to include expert knowledge through interpretable statements on the total variance and the distribution of variance in the tree.The joint prior can contain a mix of expert knowledge and default values that provide a weakly informative prior (Gelman et al., 2008;Simpson et al., 2017).This means the prior framework with joint priors is appropriate for default priors for software packages such as INLA and RStan.
The properties of the proposed priors are compared to the properties of default priors from software and vague priors from literature.This is a fair comparison since even though the new priors account for model structure, they do not incorporate strong expert knowledge and are suggested to be used in a default way in Bayesian software.The comparison is performed through three simulation studies: a simple random intercept model with Gaussian responses, a latin square experiment with Gaussian responses, and a spatial model with Binomial responses.To ease the presentation of the comparisons and not overload the reader with results, we choose a set of targets for each simulation study and compare the posteriors resulting from the different prior choices with respect to the targets.Additional results are provided in the Supplementary Materials.Furthermore, we provide example code in the Supplementary Materials for producing results for different priors for the latin square model in Section 5.2.The code is described in Section S4.3 in the Supplementary Materials.
We start by introducing the general framework in Section 2, then we introduce LGMs and suitable priors for developing a new class of priors for LGMs in Section 3. The new class of priors for LGMs is introduced in Section 4 and is applied to simulation studies with Gaussian responses in Section 5.In Section 6 we present one simulation study with Binomial response and explain how the approach can be used in practice.The paper ends with a discussion in Section 7.

Tree-based hierarchical variance decomposition
In this section we cover basic notation, and formally introduce additive models, hierarchical variance decomposition, and the new framework for joint priors for variances.

Additive models
Let y = (y 1 , . . ., y n ) be a vector of n > 0 observations.We model the expected values E(y i ) = g −1 (η i ), i = 1, . . ., n, through a vector of linear predictors η = (η 1 , . . ., η n ) and a link function g : R → R. We consider models where the likelihood has parameters θ L and factors as π(y|η, θ L ) = n i=1 π(y i |η i , θ L ).This covers models such as GLMMs and GAMMs.We term η and its description as the latent part of the model.
We assume that the linear predictor is described as where β 0 is the intercept, x i is the vector of covariates associated with observation i, β is a vector of coefficients, and u j = (u 1 , . . ., u mj ) is a random vector and k j [i] is the associated element of u j for observation i for j = 1, . . ., N .The two first terms will be called fixed effects and the last N terms will be called random effects.To focus on the joint prior for variance parameters, we will assume that each random effect u j has a single model parameter, which is a variance σ 2 j .In general, the random effects may have other parameters such as correlation parameters and we discuss how to handle this in Section 7.
We denote the vector of model parameters by θ M = (σ 2 1 , . . ., σ 2 N ).The BHM is completed by specifying the latent model through π(u j |σ 2 j ) for j = 1, . . ., N , and the prior π(β 0 , β, θ L , θ M ).We follow common practice so that the prior satisfies π The major improvement over common practice is that we will develop a framework for selecting intuitive joint priors for the variance parameters that does not require that π(θ M ) = N j=1 π(σ 2 j ).

Hierarchical variance decomposition
The additivity in Equation (2.1) causes the total latent variance Var[η i |β 0 , β, θ M ] of linear predictor i to decompose as the variance contributed by each random effect Var[u kj [i] |β 0 , β, σ 2 j ], j = 1, . . ., N , for i = 1, . . ., n.If random effect j is homogeneous, the variance parameter of random effect j will be a marginal variance in the sense that Var[u kj [i] |β 0 , β, σ 2 j ] = σ 2 j for i = 1, . . ., n.If all random effects are homogeneous, the total latent variance of the linear predictors is homogeneous, varies for different values of i, the variance parameter σ 2 j is selected to be comparable to a marginal variance; see the discussion in Section 3.1.We term the parameter t = σ 2 1 + . . .+ σ 2 N the total latent variance.We describe the attribution of t to the individual random effects through a tree T .The construction of T starts with a root node T 0 = {1, . . ., N } that contains all the random effects, and in the first step we introduce We continue this recursively for each child node until all leaf nodes are singletons.This results in a tree T with S splits where there are K s child nodes for split s = 1, . . ., S. We have S ≤ N − 1, where S = 1 is achieved by directly splitting the root node to singletons as in Figure 1a and the maximum value is achieved by only using dual splits such as in Figure 1b.
For each split s, the parent node P s is split into K s child nodes C 1 , . . ., C Ks and we will define a vector of parameters ω s = (ω s,1 , . . ., ω s,Ks ), s = 1, . . ., S. The child nodes describe a partitioning of the random effects in the parent node, and we let ω s describe the proportion of the total variance in the parent node, j∈Ps σ 2 j , that is assigned to each child node through We denote the K − 1 simplex by x k ≥ 0 ∀k} so that the restrictions are ω s ∈ ∆ Ks for s = 1, . . ., S. This means that the parameters ω s,Ks are superfluous for s = 1, . . ., S, but we keep them for ease of notation and interpretability.
For any split s = 1, . . ., S, we term a child node and its decendants as a branch of the split.The description of the model structure through a tree structure defines a reparametrization of (σ 2 1 , . . ., σ 2 N ) to (t, ω 1 , . . ., ω S ), where S is the number of splits in the tree.The examples discussed in the introduction can be rephrased in this terminology, and demostrate that there is no unique selection of the tree.
Example 1 (Tree structure).Consider three random effects A, B and C with marginal variances (σ 2 A , σ 2 B , σ 2 C ).Let the root node be T 0 = {A, B, C}. Figure 1a, describes the case that the root node is partitioned into three children T 1 = {A}, T 2 = {B} and T 3 = {C}.This leads to a reparametrization (t, ω), where 1b shows the case that T 0 is first partitioned into T 1 = {A, B} and T 2 = {C}, and then T 1 is partitioned into T 3 = {A} and T 4 = {B}.This results in a reparamerization (t, ω 1 , ω 2 ), where

Hierachical decomposition priors
The tree-based hierarchical variance decomposition facilitates the construction of joint priors that include prior belief about the relative sizes of groups of random effects.The tree structure must be selected so that the desired comparisons can be made.Trees such as shown in Figure 1a are useful for expressing ignorance about the attribution of variance to the random effects, whereas trees such as shown in Figure 1b are useful for imposing shrinkage to one of the branches in each dual split.Generally, a tree may consist of a mixture of splits where the analyst wants to be informative and splits where the analyst wants to express ignorance.
We propose to construct a joint prior for the marginal variance parameters in a bottom-up approach where the prior for a given split only depends on descendant nodes of the parent node.
This means that the joint prior for the decomposition uses a directed acyclic graph so that parameters that belong to subsplits in different branches of a split are marginally imsart-ba ver.2014/10/16 file: Manuscript.tex date: September 25, 2019 independent.We combine the prior for the decomposition of the variance with a conditional prior on the total variance of the random effects to form what we will call hierarchical decomposition (HD) priors.
Definition 1 (Hierarchical decomposition (HD) priors).Consider a BHM with an additive latent structure with N random effects with marginal variance parameters σ 2 1 , . . ., σ 2 N .Assume that the model structure is described by a tree that recursively partitions the set of random effects into singletons.Then a hierarchical decomposition (HD) prior is given by where t = σ 2 1 +. ..+σ 2 N , S is the number of splits, and D(s) denotes the set of descendant splits for the parent node in split s and ω s describes the proportions of the total variance of a parent node assigned to its branches for s = 1, . . ., S.
3 Latent Gaussian models and priors for the splits

This section introduces
LGMs and the priors we will use for the splits to build the intuitive class of joint priors for the variance parameters for LGMs.

Latent Gaussian models
LGMs constitute a subclass of BHMs with additive latent structure where the model components are Gaussian conditional on the model parameters.We write the additive model in Equation (2.1) in vector form, η = 1β 0 + Xβ + N j=1 A j u j , where 1 = (1, . . ., 1) is a column vector of length n, X is the n × p design matrix that contains the covariates for each observation as rows, and A j are sparse n × m j matrices that select the appropriate elements of the random effects for j = 1, . . ., N .The latent Gaussian structure is achieved by β 0 ∼ N (0, σ 2 I ), β ∼ N p (0, σ 2 F I p ), and u j |σ 2 j ∼ N mj (0, σ 2 j Σ j ) for j = 1, . . ., N .It is common to give σ 2 I and σ 2 F suitably vague values, and we will assume that σ 2 I and σ 2 F are fixed and focus on the variance parameters σ 2 1 , . . ., σ 2 N .For non-intrinsic Gaussian random effects, such as independent and identically distributed (i.i.d.) random effects, stationary autoregressive processes and Matérn Gaussian random fields, the covariance matrix Σ of the random effect u is chosen to be a correlation matrix and the variance parameter σ 2 is the marginal variance.However, this does not work for intrinsic Gaussian Markov random fields (GMRFs) (Rue and Held, 2005) such as the Besag model (Besag et al., 1991), the first-order random walk and the second-order random walk (Rue and Held, 2005, Chapter 3).In this case there is no well-defined concept of a marginal variance since they are defined through singular precision matrices that cannot be inverted to find a covariance matrix.We follow Sørbye and Rue (2014) and choose the variance parameter σ 2 to be a representative value for the marginal variance.

Penalising complexity
The fundamental basis for introducing robust shrinkage in our proposed class of priors are the PC priors introduced in Simpson et al. (2017), which uses a set of principles to derive model-component-specific prior distributions.The main idea is to regard a single model component as a flexible extension of a so-called base model.In the simplest case of an unstructured random effect, the base model would be to remove the effect entirely from the linear predictor by letting the variance parameter go to zero.The idea is to follow Occam's razor and favour a simpler, more sparse or more intuitive model as long as the data does not indicate otherwise.The PC priors have been used successfully in a variety of contexts such as BYM models (Riebler et al., 2016), correlation parameters (Guo et al., 2017), autoregressive processes (Sørbye and Rue, 2018) and Matérn Gaussian random fields (Fuglstad et al., 2019).Simpson et al. (2017) proposed to compute the complexity of the alternative model relative to the base model using the Kullback-Leibler divergence (KLD) defined as where ξ is the flexibility parameter, and ξ = 0 at the base model.The KLD is consequently transformed to an interpretable distance measure between two densities f 1 and In contrast to defining a prior for ξ directly, a prior is defined for d.See Simpson et al. (2017) for detailed motivation.
We follow Simpson et al. (2017) and select an exponential distribution, where information provided by the user is used to determine the rate λ.Usually this information is provided by a probability statement about the tail probability of the prior, Here, X(ξ) is an interpretable transformation of the parameter of the flexible extension, U can be thought of as a sensible upper bound, and α is a small probability.A user can express their knowledge by constraining tail probabilities of X(ξ) as above.Selecting U near a large plausible value for X(ξ) and α small encodes weak information about ξ (Simpson et al., 2017).This means that it is a priori unlikely that the value of X(ξ) exceeds U .Finally, the prior can be transformed to the corresponding prior for the flexibility parameter ξ.An attractive feature of this principle-based construction is that the resulting priors are proper and have a natural link to Jeffreys' priors.

Shrinking a marginal variance parameter
In the case of a single Gaussian random effect with marginal variance σ 2 , the PC prior with base model σ 2 = 0 is an exponential prior on σ.The rate parameter λ can be set, for example, by an a priori statement P(σ > U ) = 0.05 so that the 95th percentile of the prior for σ is U > 0. Then the prior is an exponential prior with rate parameter λ = − log(α)/U which we denote as σ ∼ PC SD (U, α); see Simpson et al. (2017) for details and derivation.

Shrinking a weight parameter
Consider the situation that the linear predictor only contains two random effects A and B with variances σ2 A and σ 2 B , respectively.The proportion of t = σ 2 A + σ 2 B assigned to each random effect is described by ω ).If one a priori prefers the attribution ω = ω 0 = (1 − ω 0 , ω 0 ), shrinkage can be induced in the joint prior for the variance parameters using a PC prior where ω = ω 0 is the base model.Here we apply the KLD from Equation (3.1) to express distance from the base model ω 0 to the alternative model ω, and penalise deviations from the base model according to the difference in model complexity.
Theorem 1 (PC prior for dual split).Let u 1 and u 2 be random effects of an LGM that enter the linear predictor through The PC prior for ω with base model where λ > 0 is the hyperparameter.We suggest to set λ so that the median is ω m = 0.25.
For base model 0 < ω 0 < 1, the PC prior whose median is equal to ω 0 is , ω 0 < ω < 1, where λ > 0 is a hyperparameter.We suggest to set λ so that Base model equal to ω 0 = 1 follows directly by reversing the roles of u 1 and u 2 .
Proof.See Section S1.1 in the Supplementary Materials.
The default values in each case are specified as to place most of the prior mass in a small interval on the ω scale around ω 0 , but to also ensure large deviations from ω 0 are a priori plausible; in this sense they are weakly informative (Gelman, 2006;Gelman et al., 2008).Sections 5.1 and 5.2 show that the results from the inference are stable to changes in these hyperparameters; which in turn shows that these λ's provide weak information.If the analyst has expert knowledge this should be used instead of the default values.Large ω might be 0.75 for test-retest reliability in a psychology study (Cicchetti, 1994) but 0.4 for the genetic heritability of a trait (Shen et al., 2016).

Expressing a priori ignorance about a split Exchangeability
In some cases the analyst does not want to express an a priori preference for any of the branches in a split in the tree.This can be achieved indirectly through a series of dual splits.For example, by replacing the split in Figure 1a by the series of dual splits as shown in Figure 1b where the left-hand side has a base model of 2/3 in the first split and the left-hand side has a base model of 1/2 for the second split.In total this is specifying a base model of 1/3 of the total variance to each random effect, but the resulting prior is not invariant to permutations of A, B and C in Figure 1b.See Section S2 of the Supplementary Materials for details.When the goal is to express ignorance about the decomposition of the variance, one can use a base model of equal attribution of the total variance to each random effect and choose an exchangeable prior for (σ 2 A , σ 2 B , σ 2 B ).This can be done, for example, through a Dirichlet prior.

Dirichlet prior
The Dirichlet prior of order K ≥ 2 with parameters a 1 , . . ., a K > 0 is given by where B is the multivariate beta function, and ∆ K is the K − 1 simplex.Since there is no preference for any random effect, we consider the symmetric Dirichlet distribution where a 1 = . . .= a K = a > 0, where a is the hyperparameter that must be selected by the analyst.For a = 1 the prior is uniform, for a < 1 the prior has peaks at the vertrices of ∆ K , and for a > 1 the mode is ω = (1, . . ., 1)/K.The prior is invariant to permutations of the elements of ω for any value of a > 0 and it is computationally cheap for arbitrary dimensions K.

Hierarchical decomposition priors for LGMs
In this section we introduce the new class of intuitive joint priors for the variance parameters in LGMs.

Accounting for model structure
In the general formulation of HD priors in Definition 1, the prior is composed of conditional priors that for each split depends on all descendant splits.This is impractical because computing PC priors would require new KLDs to be computed every time the prior is evaluated.We take a pragmatic approach where we decide on a set of base models, which expresses our best prior guess, and condition on these.

Under this assumption a new class of HD priors for
LGMs are constructed by combining intuition about shrinkage and ignorance through independent priors for the splits.
Prior class 1 (HD priors for LGMs).Assume the LGM contains N random effects with variances σ 2 1 , . . ., σ 2 N and that the hierarchical decomposition of the variance is described through a tree with S splits.Under base models {ω 0 1 , . . ., ω 0 S }, the prior is where the total latent variance is t = σ 2 1 +. ..+σ 2 N , and ω i ∈ ∆ ls , where l s is the number of branches in split s, s = 1, . . ., S.
For each of the S splits, the analyst can express ignorance through a Dirichlet prior or sequence of PC priors as described in Section 3.3, or express preference to the selected base models as described in Section 3.2.The selection of π(t|{ω s } S s=1 ) must be done in the context of the likelihood as described in Section 4.2.
This prior is computationally inexpensive since the overall prior probability density factorises into independent conditional distributions that consist of PC priors, which can be precomputed, and Dirichlet priors, which are cheap to compute.
We demonstrate the use of HD priors through one example where the analyst wants to express ignorance and one example where the analyst wants to penalise complexity.
Example 2 (Non-nested random effects).Consider responses y 1 , . . ., y n , described by the Gaussian linear model y i |η i ∼ N (η i , σ 2 R ) with where µ is the intercept, h 1 , h 2 and h 3 are smooth effects of the covariates expressed as second-order random walks (Rue and Held, 2005), and σ 2 R is the residual variance.Assume that one has no a priori preference for the three smooth effects, and decide to encode the decomposition of the total latent variance as shown Figure 1a, where A, B and C represents the three smooth of covariates effects.Let ω 1 denote the proportions of variance assigned to model components and let t denote the total latent variance.We construct an HD prior by assigning a Dirichlet prior to ω 1 , and handle t|ω 1 as discussed in Section 4.2.
Example 3 (Shrinkage in multilevel models).The latent part of the multilevel model in Section 1 can be written in vector form as η = A A u A + A B u B + A C u C , where A A , A B and A C are sparse matrices selecting the appropriate group, individual and measurement effects, respectively.Assume we use an LGM, then , where G is the number of groups, P is the number of individuals per group, and K is the number of measurements per individual.
If we prefer shrinkage towards fewer levels in the multilevel model as shown in Figure 1c, we decompose the total latent variance t = σ 2 A + σ 2 B + σ 2 C through two splits.For the split at the root node, we decompose t according to the proportions ).We use an HD prior where we apply base models ω 0 1 = (0, 1), which prefers C over A+B, and ω 0 2 = (0, 1), which prefers B over A. Due to the desire for shrinkage we apply PC priors and use Theorem 1 with base model ω 0 2 to compute π(ω 2 ).We define ũ1 = A A u A + A B u B and ũ2 = A C u C .Then if we condition on ω 2 , the top split in Figure 1c ), and the conditional prior π(ω 1 |ω 2 = ω 0 2 ) can be computed using Theorem 1 with base model ω 0 1 conditional on ω 2 = ω 0 2 .The joint prior is then , and an appropriate prior is chosen for π(t|ω 1 , ω 2 ) as described in Section 4.2.

Accounting for the likelihood
Meaningful priors for the total latent variance t depend on the likelihood and prior beliefs about the responses in the specific application (Gelman et al., 2017b).We provide tools for expressing scale-invariance for the variances of the random effects and the measurement error when the responses are Gaussian, or shrinkage for the total latent variance of the random effects.
Under a Gaussian likelihood, the selection of the unit of measurement by the analyst affects the sizes of the variances.However, when the residual variance σ 2 R is expected to be well-identified, we can define the prior on t relative to σ 2 R and shrink t by preferring to describe the total variance V = t+σ 2 R in the model by σ 2 R .This can be complemented by a scale-independent Jeffreys' prior on V to achieve a scale-invariant joint prior for the variance parameters.
Prior class 2 (HD priors with Gaussian likelihoods).Assume an HD prior from Prior class 1 is desired for an LGM with Gaussian responses with residual variance σ 2 R .First select the prior on the decomposition of the total latent variance t.Then augment the tree by an extra node on the top with variance V = t + σ 2 R .The new top node has one branch with residual variance and the other branch is the subtree describing the latent model.
) with base model ω 0 R = (0, 1).If V is assigned a scale-invariant prior, the full joint prior is and ω s ∈ ∆ ls , where l s is the number of branches in split s, for s = 1, . . ., S.
imsart-ba ver.2014/10/16 file: Manuscript.tex date: September 25, 2019 Proof.The scale-invariant prior is π( If the likelihood is binomial with a logit link function, a scale for the random effects is induced through their effects on the odds-ratio.Similarily, for a Poisson likelihood with a log link function, there is a scale for the random effects through their effects on the relative risk.In these cases, scale-invariance is not meaningful and we can induce shrinkage on the total variance of the random effects by using the PC prior for variance from Simpson et al. (2017).
Prior class 3 (HD priors with shrinkage on latent variance).Assume an HD prior from Prior class 1 is desired for an LGM where shrinkage on the total latent variance is appropriate.First select the prior on the decomposition of the total latent variance t.Then t can be shrunk towards 0 by a PC prior π(t|{ω s } S s=1 ) with base model t 0 = 0.This results in t > 0, and ω i ∈ ∆ ls , where l s is the number of branches in split s, for s = 1, . . ., S, and λ > 0 is a hyperparameter.
Proof.The conditional PC prior for t with base model t 0 = 0 is given by π(t|{ω Simpson et al., 2017).
We illustrate how the hyperparameter can be selected by considering the prior on the total latent variance in the case of a Binomial likelihood.
Example 4 (Shrinking latent variance).Let logit(p) = µ + x, where x ∼ N (0, t), for a t > 0, and µ is considered fixed.The latent variance t is difficult to interpret directly due to the non-linear link function, but we can interpret it through the effect on the odds-ratio, p/(1 − p) = exp(µ) exp(x).The hyperparameter λ in Prior class 3 can, for example, be set so that the relative change in the odds-ratio, exp(x), is between 1/2 and 2 with probability 90%, P(1/2 < exp(x) < 2) = 0.90.

Case studies: Gaussian responses
In this section we investigate the performance of HD priors compared to a set of commonly used standard priors for two simulation studies with Gaussian responses.

Random intercept model
The random intercept model is given by y i,j = α i + ε i,j for j = 1, . . ., n i , i = 1, . . ., n g , where n i is the size of group i, and n g is the number of groups.The random intercepts are i.i.d.Gaussian with variance σ 2 α and the residual effects are i.i.d.Gaussian with variance σ 2 R .The total latent variance is t = σ 2 α and the total variance is V = σ 2 R + σ 2 α .We introduce the proportion of the total variance explained by the latent model and decompose V as σ 2 α = ωV and σ 2 R = (1 − ω)V .We desire shrinkage towards the base model ω 0 = 0 and use an HD prior based on the tree structure in Figure 2a, where the prior on ω is calculated using Theorem 1 and we use the scale-invariant prior from Prior class 2. The specification of the hyperparameter of the HD prior is done through the median ω m of π(ω).The resulting prior for ω is shown in Figure 2b for ω m = 0.25 and the corresponding prior for the distance d(ω) discussed in Section 3.2 is shown in 2c.Further details can be found in Section S3.1 of the Supplementary Materials.
The intraclass correlation (ICC) for the random intercept model is given by σ 2 α /(σ 2 R + σ 2 α ), which equals the weight parameter ω.Thus the shrinkage of the ICC is completely controlled in the construction of the prior and expert knowledge about the ICC can be incorporated directly.Further, ω can be linked to a generalised version of the coefficient of determination, R 2 , suggested by Gelman and Hill (2007); see Section S3.2 in the Supplementary Materials for details.
We use the R-package RStan (Stan Development Team, 2018a) to perform the inference for the simulation study.We use HD priors from Prior class 2 with shrinkage from PC priors on ω with hyperparameters ω m = 0.25 (P-HD-25), ω m = 0.5 (P-HD-50) and ω m = 0.75 (P-HD-75), and an HD prior from Prior class 2 where the PC prior is replaced by a Dirichlet prior on (ω, 1 − ω) (P-HD-D) with default hyperparameter.Additional priors are Jeffreys' prior on the residual variance combined with different priors on the random intercepts variance or standard deviation: the default INLA prior InvGamma(1, 5 × 10 −5 ) (P-INLA), Half-Cauchy(25) (P-HC), and PC SD (3, 0.05) (P-PC).This gives seven joint priors.Each scenario in the simulation study consists of 500 datasets which are simulated from the random intercept model for n g ∈ {5, 10, 50}, and 10, 50, or varying number of individuals in each group.We select true values ω ∈ {0.1, 0.25, 0.5, 0.75, 0.9} and select true total variance V = 1 in every scenario.
We evaluate the performance of the different priors with respect to posterior inference for total variance V and ICC ω.We use the bias of log(V ) and logit(ω), calculated using the estimated median minus the true value, and the 80% empirical coverage, found by counting the number of times the true value is contained in the 80% equal-tailed credible interval.We use the same settings for the call to the stan function for all priors and scenarios in the simulation study.RStan reports a divergent transition for each iteration of the MCMC sampler that runs into numerical instabilities (Carpenter et al., 2017).In Figure S3.1 in the Supplementary Materials we report the proportion of datasets that resulted in at most 0.1% divergent transitions for each prior and scenario.This is used as a measure of stability of the inference scheme for each prior, and the dataset and prior combinations causing unstable inference are removed from the study.
The results in Figure 3 are for n g ∈ {10, 50} and group size 10, and show that P-HD-25 performs at least as good in terms of bias and coverage of logit(ω) as P-INLA, P-HC and P-PC.The magnitude of the bias decreases and the coverage approaches 80% for all four priors when the number of groups increases, which is expected as the amount of information about the parameters in the datasets increases.Figures S3.3-S3.7 in the Supplementary Materials show that the HD priors perform at least as good in terms of bias and coverage for logit(ω) as P-INLA, P-HC and P-PC also for the other combinations of the number of groups and group sizes, and that the same conclusions as for logit(ω) also holds for log(V ).
Furthermore, Figures S3.3-S3.7 show that the behaviour of the four HD priors is stable with respect to the choice of ω m when group size is 10, and that P-HD-D performs worse than P-HD-25, P-HD-50 and P-HD-75 for all values of the true weight except 0.5.For 10 groups with two observations per group, the risk of overfitting is high because low information about the parameters may lead to overestimating the weight parameter and estimating spurious signals in the group effect.In this setting, P-HD-25 leads to overfitting for true weight equal to 0.1, but underfitting for true weight equal to 0.25, 0.5, 0.75 and 0.9.P-HD-50, P-HD-75 and P-HD-D result in overfitting for true weight equal to 0.1 and 0.25, but underfitting for true weight equal to 0.5, 0.75 and 0.9.See Section S3.4 in the Supplementary Materials for additional details.
Figure S3.1 shows that P-INLA is the only prior that is heavily affected by divergent transitions during the inference for scenarios with 10 or 50 groups.Part of the problem with P-INLA is that it results in a bi-modal posterior for σ 2 α ; see Figure S3.2.The new HD priors are preferred for the random intercept model due to their intuitive definition, where the structure of the shrinkage is directly available in Figure 2a, and interpretability of the parametrization which aids prior elicitation.

Latin square experiment
Consider an experiment where a latin square design (Hinkelmann and Kempthorne, 1994) is used to control for two nuisance sources of noise.For example, a field split into rows and columns where different levels of strength of a new fertilizer is applied to each plot.We assume there are nine possible levels of the treatment so that a 9 × 9 grid of plots is necessary for a full latin square design.We focus on random effects and exclude fixed effects from the model, and assume that the responses can be modelled by (5.1) ) is the effect of the treatment, k[i, j] denotes the treatment assigned to row i and column j, and ε = (ε 1,1 , . . ., ε 9,9 ) ∼ N 81 (0, σ 2 R I 81 ) is the residual noise.We believe that the effect of the treatment is ordered, and that the treatment effect consists of a smooth signal of interest γ (1) = (γ (1) 1 , . . ., γ (1) 9 ) and random noise γ (2) = (γ 9 ) we have to control for.The signal is given a second-order random walk model described by N 9 (0, σ 2 RW2 Q −1 RW2 ), where σ 2 RW2 is the variance and Q −1 RW2 is a slight abuse of notation to describe the intrinsic second-order random walk defined by the precision matrix Q RW2 , and the noise is γ (2) ∼ N 9 (0, σ 2 t I 9 ).We use the constraints (1) i = 0 and 9 i=1 iγ (1) i = 0 to remove the implicit intercept and linear effect, respectively.
We set the true standard deviations equal, σ r = σ c = σ t = σ R = 0.1, and let the true effect of treatment be given by x i = C (i − 5) 2 − 20/3 , i = 1, . . ., 9. We entertain three scenarios: C = 0 for no effect of treatment (S1), C = 0.05 for medium effect of treatment (S2) and C = 0.2 for strong effect of treatment (S3).More details on the true treatment effect is included in Section S4.1 in the Supplementary materials, see especially Figure S4.2.We simulate 500 datasets for each scenario and analyse them with four choices of priors.
The three default priors used are Jeffreys' prior for σ 2 R combined with InvGamma(1, 5× 10 −5 ) for σ 2 r , σ 2 c , σ 2 t and σ 2 RW2 (P-INLA), or Half-Cauchy(25) (P-HC) or PC SD (3, 0.05) (P-PC) for σ r , σ c , σ t and σ RW2 .We select an HD prior from Prior class 2 using the model structure in Figure 4a, where the triple split has a Dirichlet prior and the two other splits have PC priors (P-HD-D3).We also decompose the triple split into the two dual splits as shown in Figure 4b, and use a PC prior on all four splits according to the shrinkage structure in the figure .In all cases we use default values for α, β, γ (1) , γ (2) , ε α, β, γ (1) , γ (2)  ε γ (1) , γ (2)  α β γ (1) γ (2)   1/3 1/3 1/3 (a) Original structure α, β, γ (1) , γ (2) , ε α, β, γ (1) , γ (2)  ε γ (1) , γ (2)  α, β  The targets of the analysis are the posterior distribution of the structured treatment effect γ (1) and the model fit.The former will be assessed by the continuous rank probability score (CRPS) (Gneiting and Raftery, 2007) and the latter by the leave-one-out log predictive score (LOO-LPS) − 1 81 81 i=1 log π(y i |y −i ).The CRPS is a proper scoring rule and given by 1 2 dx, where F i is the cumulative distribution function for the posterior of γ (1) i , x i is the true effect of treatment i, and I is the Heaviside function, and is estimated using the procedure of Jordan et al. (2017).We imsart-ba ver.2014/10/16 file: Manuscript.tex date: September 25, 2019 Intuitive joint priors for variance parameters report the proportion of datasets leading to no more than 0.1% divergent transitions for each prior and scenario, and use this as a measure on stability of the inference.These numbers can be seen in Figure S4.5 in the Supplementary Materials, and show that all priors lead to similar stability.The datasets leading to more than 0.1% divergent transitions for one or more priors are removed from the study.
The main results from the simulation study are displayed in Figure 5. Low LOO-LPS indicates good model fit and low CRPS indicates good predictive power for the treatment effect.P-INLA gives a poorer model fit than the other priors, and with respect to predictive power, the HD priors P-HD-D3 and P-HD-25 perform best for S2 and S3.The high predictive power of P-INLA for S1 is due to the fact that P-INLA has a peak at low variance and produces a posterior for the treatment effect with mean closer to zero and lower variance.Overall, the HD prior performs well across all scenarios.The results are stable to changes in the construction of the HD prior and the choice of hyperparameters; see Section S4.2 in the Supplementary Materials for details.The HD priors are preferable to the other priors because of their intuitive parametrization and the interpretability of the a priori assumptions placed on the joint prior of the variance parameters.Further, P-HD-D3 is preferred to P-HD-25 since they perform similar and P-HD-D3 is more intuitive.

Case studies: Binomial responses
In this section we study neonatal mortality counts arising from complex surveys through a simulation study, and show how to practically apply the HD priors.

Background
Neonatal mortality is an important indicator of health and well-being in a country and is included in Goal 3.2 of the Sustainable Development Goals (SDGs) (General Assembly of the United Nations, 2015), and mapping child mortality is an important area of current research (Golding et al., 2017;Wakefield et al., 2018;Li et al., 2019).We define neonatal mortality as the rate of deaths within the first month of life per live birth.An important source of data for neonatal mortality is the nationally-representative household surveys performed by Demographic and Health Surveys (DHS).The survey performed by DHS in 2014 in Kenya targets its 47 counties, which is the relevant administrative level for health policies (Kenya National Bureau of Statistics et al., 2015).The target of the simulation study in Section 6.2 and the analysis in Section 6.3 is the spatial heterogeneity in neonatal mortality in Kenya in the time period 2010 to the time of the survey.
From the survey we can extract the number of live births, b i,j,k , and the number of neonatal deaths, y i,j,k , in household k in cluster j in county i.We also have an indicator x i,j specifying whether the cluster is rural (0) or urban (1) and each household has an inclusion probability π i,j,k of being included in the survey sample.See the Section S5.1 in the Supplementary Materials for more background.

Simulation study
In this section we use the n = 290 constituencies shown in Figure 6a 2 .We assume that m i = 6 clusters are visited in constituency i, i = 1, . . ., n, and consider births b i,j and neonatal deaths y i,j in cluster j in constituency i.We assume that there are b i,j = 25 live births in each cluster and the outcomes are simulated according to the model y i,j |p i,j ∼ Binomial(b i,j , p i,j ) for where µ is a joint intercept, u = (u 1 , . . ., u n ) has a Besag distribution with variance σ 2 B and a sum-to-zero constraint, v = (v 1 , . . ., v n ) ∼ N n (0, σ 2 IID I n ), and ν = (ν 1,1 , . . ., ν n,mn ) ∼ N M (0, σ 2 C I M ) with M = m 1 + . . .+ m n = 6 • 290 = 1740.We use the structure for the prior shown in Figure 6b to make an HD prior from Prior class 3 with PC priors on all splits according to the base models indicated in the figure (P-HD-25) and an HD prior from Prior class 3 where a Dirichlet prior distributes variance to the three model components (P-HD-D).In all cases, the splits have default hyperparameter values and we select the hyperparameter in the PC prior on total variance, t = σ 2 B +σ 2 IID +σ 2 C , so that P(t > 3) = 0.05.Further, we use InvGamma(1, 5×10 −5 ) for σ Based on the final report from the survey (Kenya National Bureau of Statistics et al., 2015) the estimated national level of neonatal mortality is 0.022 for 2010-2014, and we set µ = logit(0.022).Further, we choose σ 2 C = 0.1 and create five scenarios  ).We simulate 500 datasets for each scenario.The main targets of the simulation study are the structured part of the spatial heterogeneity through the posterior of u, the degree of structure in the spatial heterogeneity through ω (2) = σ 2 B (σ 2 B + σ 2 IID ) −1 , and how well the underlying neonatal mortality is estimated through the posterior of the intercept µ.The performance is assessed through the CRPS (see Section 5.2) of u, the bias of the posterior median of ω (2) , and the bias of the posterior median and the coverage of the 80% equal-tailed credible interval for µ.We use the proportion of datasets leading to at most 0.1% divergent transitions as a measure of stability in the inference, these numbers can be seen in Figure S5.1 in the Supplementary Materials, and show that P-INLA leads to more unstable inference than the others.
Figure 7 shows the main results from the simulation study.We drop datasets that cause more than 0.1% divergent transitions for at least one of the priors from each scenario.All priors have a tendency to overestimate the intercept, with P-INLA doing worse than the others, P-INLA gives close to exact estimates when the true value of ω (2) is 0 (in S2) and 1 (in S5), but performs worse than the other priors for S3 and S4. Figure S5.2 in the Supplementary Materials shows that P-HD-25 performs better than P-HD-D except in S3 where the Dirichlet prior is closest to the truth, and that ω (1)  tends to be underestimated under all the priors.P-HD-25 is preferred because overall it performs at least as good as the other priors P-HC and P-PC, and P-HD-25 is an intuitive and well-behaved prior that takes the hierarchical structure of the model into account.

Neonatal mortality in Kenya
This section follows the notation introduced in Section 6.1.The survey consists of 13183 households with one or more live births, distributed over 1593 clusters that are dis-tributed over n = 47 counties.In total there are 376 deaths among 17664 children.Figure 8c shows the counties and the weighted neonatal mortality by the inverse inclusion probabilities, and it is unclear if there is a structured spatial pattern.The neonatal mortality is assumed to follow a survival model with constant hazard through the first month of life, and we use a latent Gaussian model with a binomial likelihood, y i,j,k |b i,j,k , p i,j,k ∼ Binomial(b i,j,k , p i,j,k ), a logit link function, and a linear latent Gaussian model where µ is an overall intercept, β is the effect of urban, u is a Besag model with variance σ 2 11 , v is a Gaussian i.i.d.effect of county with variance σ 2 12 , ν is a Gaussian i.i.d.effect of cluster with variance σ 2 2 , and ε is a Gaussian i.i.d.effect of household with variance σ 2 3 .In this model, u and v provide structured and unstructured, respectively, betweencounty variation, ν provides between-cluster variation, and ε provides within-cluster variation.The Besag effect has a sum-to-zero constraint to make the overall intercept identifiable.The random effects of cluster and household are necessary to account for the dependence induced between sampled households due to the clustering in the sampling design.We assume that there is no difference between the effect of urbanicity between different counties.
The model has four variance parameters that must be assigned a joint prior.The first step is to choose the tree structure.For simplicity's sake, the alternatives to the full model (6.1) we would entertain are first η i,j,k = µ + x i,j β + v i , then we would add u i , so ν i,j , and at last ε i,j,k .We prefer coarser unstructured effects over finer unstructured effects since we would like to explain the data at a coarser level if possible, and we prefer the unstructured spatial effect over the structured spatial effect since we want to reduce the risk of estimating spurious spatial signals.This gives the nested tree structure in Figure 8a where the household effect, cluster effect and Besag effect are sequentially split off from the total latent variance.We construct an HD prior based on the tree structure with PC priors with default hyperparameter values for the splits, and induce shrinkage on the total latent variance as in Prior class 3 with a PC prior where P(Total variance > 11.296) = 0.05.This corresponds to a priori equal-tailed 90% credible interval of (0.1, 10) for the effect of the random effects on the odds-ratio, exp(u i + v i + ν i,j + ε i,j,k ).This allows for high variation in the data and is used because the data is observed at the household level.The splits in Figure 8a are given PC priors with default hyperparameters and bases models as indicated in the figure .The model is parameterized by total standard deviation σ T , and proportion of household variance to total variance of the random effects ω (1) , proportion of cluster variance to the sum of cluster and county variance ω (2) , and the proportion of structured spatial variance to county variance ω (3) .The priors and posteriors of the proportions ω (1) , ω (2)  and ω (3) are shown in Figure 8e.The total standard deviation has a posterior median of 1.47, and the prior and posterior can be seen in Figure S5.3 in the Supplementary Materials.The results show that the data only weakly informs about the proportion of structured to unstructured spatial effects, which indicates that the data provide no strong evidence in favor of or against a structured spatial effect.Also the posterior of ω (2) is similar to the prior, but there is a strong signal in the posterior of ω (1) that there (e) The priors and posteriors for the proportion of household variance to total variance of the random effects ω (1) , the proportion of cluster variance to cluster-and household-level variance ω (2) , and the proportion of structured spatial variance to total between-county variance ω (3) . is non-negligible household-level dependence.A plausible explanation for the weak signals in ω (2) and ω (3) is that there is substantial noise coming from high variance in the household-level random effect and weak information from the Binomial likelihood due to few successes and few numbers of trials.
As shown in Figure 8b the proportion of the total latent variance attributed to the structured spatial effect is low and the posterior median is 0.56%.The estimated spatial effect in Figure 8d only explains a small part of the variation seen in the observed data in Figure 8c.One should be careful to draw conclusions about spatial variation based on Figure 8d because the data is only weakly informative about the split between the structured and the unstructured spatial random effects ω (3) , and there is only weak evidence for the spatial effect being different from 0 as shown in Figure S5.5 in the Supplementary Materials.The fact that the comparisons of priors and posteriors for ω (2) and ω (3) directly informs about the weak signal in the data is an advantage of the parametrization through proportions of variance, and a strong argument for setting priors on ω (2) and ω (3) rather than independent priors on the variance of each effect since the resulting posteriors for ω (2) and ω (3) are strongly dependent on the resulting implicit priors for ω (2) and ω (3) .
One could argue for other splits in the tree in Figure 8a such as preferring finer level effects to coarser level effects because one does not want to estimate spurious clusterlevel or county-level effects, but the key point of this application is that it is easy to set up the prior based on a priori assumptions and the assumptions are available to other scientists at a glance.With the traditional approach of independent priors, the resulting prior on the total variance of the random effects and the distribution of this total variance to the different random effects is obfuscated.Furthermore, if expert knowledge indicates that stronger relative shrinkage of the variances than the default setting is needed, the medians of the conditional priors for ω (1) , ω (2) and ω (3) can be reduced.

Discussion
Independent priors for the variance parameters in a BHM result in an implicit prior on the total variance of the random effects, t, and the attribution of t to the random effects.Additive models are typically built in a modular fashion, but these implict priors are not consistent with respect to adding or removing random effects.In the case of Gaussian responses, both the prior for t and the prior for t relative to the size of the residual variance change.The proposed HD priors overcomes these shortcomings, and respect the defined model structure and are consistent for t and the attribution of t to the different random effects for different selections of random effects.
The HD priors admit a visual representation through trees that allow transparent communication of the assumptions made in constructing the priors and facilitate discussion around the assumptions.The tree clearly specifices where shrinkage has been applied, and in some cases lead to more intuitive parametrization that is more suitable for elicitation of priors.For the random intercept model, the tree-based hierarchical variance decomposition leads to a parameterisation in terms of t and the ICC.A prior on these parameters is more interpretable than separate priors on the group variance and individual variance, which obfuscates the joint effect of the priors.The increased interpretability of joint priors compared to independent priors addresses concerns raised about transparency for point processes where prior sensitivity is a major concern (Sørbye et al., 2018).
The mix of robust PC priors for shrinkage and simple Dirichlet priors for expressing ignorance, allows principled priors that respect the relative complexity of the random effects when shrinkage is necessary, and intuitive exchangeability when no random effects are preferred or no model structure is apparent.The simulation studies show that this approach performs better than a completely unstructured approach with a Dirichlet prior attributing t to the different random effects, but that Dirichlet priors perform well for subgroups of the random effects where there is no nested structure or difference in complexity.
HD priors with default settings for the hyperparameters performs well, but there are corner cases like no treatment effect in the latin square experiment and no structured spatial effect for the binomial data, which are best handled by the default INLA prior.However, this prior has a peak in the prior distribution for low variances and generally performs surprisingly bad.The HD priors perform comparable to component-wise PC priors and separate half-cauchy priors for the marginal variances.The main benefit of the HD priors over other default priors is their combination of intuitive graphical representation with robust inference that behaves well across a range of different scenarios.
The calculation of PC priors is more complex in the context of correlation parameters, but multivariate PC priors have been developed for more complex random effects such as autoregressive processes (Sørbye and Rue, 2017) and spatial Matérn models (Fuglstad et al., 2019).These can be integrated into the HD prior framework by first defining priors on the correlation parameters, and then constructing the joint prior for the variance parameters with the correlation parameters fixed to reasonable values.This follows the pragmatic mindset of Assumption 2 of producing priors that are computationally feasible, intuitive and practically useful.
A key focus for future work is to exploit sparsity in the precision matrices of the random effects.This is important when shrinkage is desired through PC priors because many models such as random walks, Besag models, and Gaussian random fields (Lindgren et al., 2011) have dense covariance matrices, but can be expressed through sparse precision matrices.Assume that the total variance is split between random effects with sparse precision matrices Q 1 and Q 2 , where Q 1 corresponds to the base model.Let 0 < ω < 1, then the KLD used in Theorem 1 consists of the trace of Q , which can be computed quickly through the techniques in Rue and Held (2010, Section 12.1.7.10), and the determinant det ) −1 , which can be computed quickly through Cholesky factorizations.
We aim to further broaden the advantages of the HD priors in the future by constructing a joint prior for the variance parameters and the fixed effects.However, this will require re-thinking of the concept of total latent variance as it is the values of the imsart-ba ver.2014/10/16 file: Manuscript.tex date: September 25, 2019 coefficients of the fixed effects and not their variance that determines the amount of variance they explain.Instead of starting with the concept of marginal variances, it is natural to begin with the classical concept of explained variance and use ideas from block-wise g-priors (Som et al., 2014) to distribute variance inside a group of covariates.In a multilevel model this would connect the attribution of explained variance to different levels to generalised coefficients of determinations.Additionally, towards non-parametric regression by including a combination of a linear effect of a covariate and a smooth effect of a covariate, and explicitly putting a prior on the degree of nonlinearity (Simpson et al., 2017, Section 7).However, there are still open questions and this addition is outside the scope of this paper.
The choice of tree structure for HD priors should be guided by the application at hand, for example, by considering the relative complexity of the random effects.When expert knowledge is available, the default values for the hyperparameters should be replaced by values elicited based on expert knowledge.We believe that the advantages of the HD priors over independent priors mean that they should be used as the default option in software for Bayesian analysis.However, it is necessary to make the selection and computation of HD prior for a specific problem easier for analysts.We plan to address this by providing a separate R package, which is compatible with INLA, that provides a graphical user interface for selecting the tree structure and selecting priors for the splits, and has the option to pre-compute priors for use in RStan.This will allow analysts to experiment with different a priori assumptions and produce graphical figures that summarize their assumptions and can be communicated to fellow scientists.This will encourage transparancy and clarity in a priori assumptions in the scientific community.

Figure 2 :
Figure 2: Model structure and prior for ω in the random intercept model with 10 individuals in each group and prior median ω m = 0.25.The prior is independent of the number of groups.a) Tree structure, b) prior for ω, and c) prior for distance d(ω).

Figure 3 :
Figure 3: Results for logit(ω) for the random intercept simulation study.True value of ω shown on the x-axis, the number of groups is shown on left-hand side, and the group size is 10.Results for P-INLA are only shown when it leads to stable inference.

Figure 5 :
Figure 5: Results from the latin square experiment simulation study.
Model structure.Gray nodes indicate base models.

Figure 6 :
Figure 6: Map and model structure for the Kenya neonatal simulation study.

Figure 7 :
Figure 7: Main results from the Kenya neonatal mortality simulation study.Left to right: bias of the intercept µ, CRPS of u and bias of ω(2) .Scenario shown on the x-axes.
Model structure.(b) Variance of u relative to total variance.(c) Weighted average of neonatal mortality.(d) Posterior median of e u .

Figure 8 :
Figure 8: Description of model structure, map of observed mortality, and results for neonatal mortality in Kenya.