Multilevel Linear Models, Gibbs Samplers and Multigrid Decompositions

Abstract. We study the convergence properties of the Gibbs Sampler in the context of posterior distributions arising from Bayesian analysis of conditionally Gaussian hierarchical models. We develop a multigrid approach to derive analytic expressions for the convergence rates of the algorithm for various widely used model structures, including nested and crossed random effects. Our results apply to multilevel models with an arbitrary number of layers in the hierarchy, while most previous work was limited to the two-level nested case. The theoretical results provide explicit and easy-to-implement guidelines for optimizing practical implementations of the Gibbs Sampler, such as indications on which parametrization to choose (e.g. centred or non-centred), which constraints to impose to guarantee statistical identifiability, and which parameters to monitor in the diagnostic process. Simulations suggest that the results remain informative in the context of non-Gaussian distributions and more general MCMC schemes, such as gradient-based ones.


Introduction
Markov chain Monte Carlo (MCMC) is established as the computational workhorse of most Bayesian statistical analyses for complex models. For hierarchical models with conditionally conjugate priors, the Gibbs sampler (Gelfand and Smith, 1990; Smith and Roberts, 1993) remains one of the most natural algorithms of choice, thanks to its simplicity of implementation and low computational cost per iteration (owing to conjugacy and conditional independence). Nonetheless, the speed of convergence of the resulting Markov chain can be a major issue and can be highly sensitive to the model structure and to implementation details, such as the choice of parametrization (Hills and Smith, 1992; Gelfand et al., 1995) or of identifiability constraints (Vines et al., 1996; Gelfand and Sahu, 1999; Xie and Carlin, 2006). This work contributes towards a quantitative understanding of the interaction between Bayesian hierarchical structures and the behaviour of MCMC algorithms, which lies at the heart of the practical success of Bayesian statistics.
While there is some previous work in the area (Roberts and Sahu, 1997; Meng and Van Dyk, 1997; Papaspiliopoulos et al., 2003; Jones and Hobert, 2004; Papaspiliopoulos et al., 2007; Yu and Meng, 2011), current theoretical understanding of the interaction between Bayesian hierarchical models and MCMC convergence is still very limited, and almost nothing is known for models of hierarchical depth greater than two. The present paper offers a contribution towards such an understanding, focusing on theory for Gaussian hierarchical models and seeking quantitative results. In particular, we derive analytic expressions for the convergence rates of the Gibbs Sampler for various multilevel linear models and explore the dependence of these rates on the model structure, the choice of parametrization and the introduction of identifiability constraints. The theoretical results given in this paper extend and improve substantially on existing literature (Roberts and Sahu, 1997; Yu and Meng, 2011; Bass and Sahu, 2016a; Gao and Owen, 2017) both in terms of generality of hierarchical structure and the availability of explicit rates. We also show by simulations that the understanding gained from the Gaussian case can be extrapolated to more general settings.
In general, the Gibbs sampler can be elegantly described in terms of orthogonal projections (Amit, 1991, 1996; Diaconis et al., 2010). While in principle this theory provides the tools to extract practical convergence information for Gibbs samplers in the context of multivariate Gaussian distributions, in order to apply it to practically used Bayesian multilevel models one needs detailed knowledge of the spectrum of non-trivial high-dimensional matrices, which has drastically limited its applicability to derive analytic results. In this paper we combine this general framework with a novel multigrid decomposition approach that allows us to focus on low-dimensional Markov chains and derive explicit analytic results concerning Gibbs sampler rates of convergence for multilevel linear models, such as nested and crossed random effect models with an arbitrary number of layers and/or factors.
Our results have various practical implications. First, they can be readily used in the popular context of conditionally Gaussian models, where there are unknown variances at various levels of the hierarchy (Gelman and Hill, 2006). In that case our results describe, for example, the optimal updating strategies for the hierarchical mean structure conditional on the variances, allowing one to optimize the mean parametrization on the fly (Section 3.2) or the computationally optimal way of imposing statistical identifiability (Section 4.2), and provide theoretically grounded indications of which parameters to monitor in the convergence diagnostic process (Section 2.1). Also, our results can be used as a building block to derive computational complexity statements about the Gibbs Sampler in the context of multilevel linear models (see e.g. Papaspiliopoulos et al., 2019 for work in that direction). Note that in the context of conditionally Gaussian models the entire Gaussian mean component could be updated in a single block, thus avoiding convergence issues related to single-site updates. However, these block updates can in principle be computationally expensive (up to O(n^3) cost in the dimension n of the Gaussian to be updated), while single-site updating schemes with provably bounded convergence rates can offer a more scalable alternative. For some classes of models, sparse linear algebra methods can reduce the cost of the block update by exploiting sparsity in the posterior precision matrix, but the resulting computational cost depends on the model structure and can still be super-linear (see e.g. Section 4 for models leading to dense precision matrices and Papaspiliopoulos et al., 2019 for more discussion).
While impressive results are being obtained with black-box software implementations of Hamiltonian Monte Carlo (HMC) such as STAN (Carpenter et al., 2017), our results suggest that Gibbs Sampling schemes built on our methodological guidance can be substantially cheaper than gradient-based ones in the context of hierarchical models, leading to improved performance (Section 5). Moreover, our simulations show that the methodological results we develop in this paper are also helpful when fitting multilevel models with gradient-based schemes (Section 5.1), leading to drastic improvements in efficiency even when using generic software such as STAN.
Throughout the paper, we shall couch all our results in terms of L^2 rates of convergence. Specifically, let (β(s))_{s=1,2,...} be a Markov chain with stationary distribution π and transition operator defined by P^s f(β(0)) = E[f(β(s)) | β(0)]. The rate of convergence ρ(β(s)) associated to (β(s))_{s=1,2,...} is defined as the smallest number ρ such that, for all r > ρ,

||P^s f - E_π[f]||_{L^2(π)} / r^s → 0 as s → ∞, for every f ∈ L^2(π),   (1.1)

where L^2(π) denotes the space of square π-integrable functions, ||·||_{L^2(π)} is its associated L^2-norm and E_π[f] = ∫ f dπ is the expectation of f with respect to π. The rate of convergence ρ(β(s)) characterizes the speed at which (β(s))_{s=1,2,...} converges to its stationary distribution π: values of ρ close to 1 correspond to slow convergence, with the number of iterations needed to reach a fixed accuracy growing like (1 - ρ)^{-1}.
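For intuition, the L^2 rate is easy to see on a toy two-component Gaussian target, where the deterministic-scan Gibbs sampler is known to converge at rate ρ^2 when the two coordinates have correlation ρ. A minimal sketch (the bivariate target and its correlation are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.9          # correlation of the toy bivariate Gaussian target (assumed)
S = 20_000
x = y = 0.0
xs = np.empty(S)
for s in range(S):
    # exact full conditionals of N(0, [[1, rho], [rho, 1]])
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    xs[s] = x

# the x-marginal evolves as an AR(1) with coefficient rho**2, which equals the
# L2 rate of convergence of this two-component Gibbs sampler
rate_hat = float(np.corrcoef(xs[100:-1], xs[101:])[0, 1])
```

Here `rate_hat` estimates the theoretical rate ρ^2 = 0.81 from the lag-1 autocorrelation of the chain, illustrating how the abstract quantity in (1.1) manifests as autocorrelation decay.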

Paper overview and structure
Section 2 carefully introduces the 3-level hierarchical models we shall consider and provides motivating simulations. Then in Section 3 we give a complete analysis for 3-level symmetric models (i.e. homogeneous variances and symmetric data structure). At the heart of the analysis is a multigrid decomposition of the Gibbs sampler into completely independent Markov chains describing different levels of hierarchical granularity (Theorem 1). This multigrid decomposition applies simultaneously to every Gibbs sampler induced by the centred/non-centred parametrizations and is fundamentally a statistical property of the hierarchical models under consideration. Although multigrid ideas have already been used in methodological contexts to design improved MCMC schemes (Goodman and Sokal, 1989; Liu and Sabatti, 2000), to our knowledge they had never been used in theoretical contexts to study convergence rates. We demonstrate that the slowest of these independent chains is always the one corresponding to the coarsest level, regardless of the values of the variance components and of the number of branches in the hierarchy, and thus derive explicit expressions for the rates of convergence.
In Section 4 we focus on crossed effect models, using again a multigrid decomposition approach to derive explicit convergence rates. The results show that in the context of crossed models, centred/non-centred reparametrizations are not sufficient to guarantee fast convergence of the resulting Gibbs Sampler. On the other hand, we show that the latter can be achieved by imposing stronger statistical identifiability through additional linear constraints and our theory provides indications on which constraints lead to faster convergence. Finally, a simulation study reported in Section 5 suggests that the analysis of the Gaussian case leads to useful guidance also in the case of non-Gaussian models for both the Gibbs Sampler and Hamiltonian Monte Carlo algorithms (Neal et al., 2011). Section 6 considers 3-level non-symmetric hierarchical models, providing bounds on convergence rates based on comparisons with related symmetric models and discussing the use of bespoke parametrizations, where the choice of centred or non-centred parametrization in each branch of the hierarchy depends on branch-specific parameters.
Section 7 considers hierarchical models with arbitrary depth (≥ 4). Using an appropriate auxiliary random walk, whose evolution through the hierarchical tree is governed by the parameters' squared partial correlations, we are able to extend the multigrid analysis to general tree structures and some non-symmetric cases. We again demonstrate a fundamental multigrid decomposition in Theorem 9, where the coarsest level chain converges the slowest, and give explicit formulae for convergence rates.

Three level hierarchical linear models
The theoretical innovation in this paper is centred around an important case in which we can obtain explicit Gibbs sampler rates of convergence and, as a result, study explicitly the effects of particular models, parametrization schemes and blocking strategies. We begin with a detailed study of the following three-level Gaussian linear model, providing a fairly complete understanding of the interaction between model structure, parametrization and the convergence behaviour of the Gibbs Sampler.

Model S3 (Symmetric 3-level hierarchical model). Suppose

y_ijk = μ + a_i + b_ij + ε_ijk,   (2.1)

where i, j and k run from 1 to I, J and K respectively and the ε_ijk are iid normal random variables with mean 0 and variance σ_e^2. We employ the standard Bayesian model specification assuming a_i ~ N(0, σ_a^2), b_ij ~ N(0, σ_b^2) and a flat prior on μ.
For the theoretical analysis, we will consider the variance terms σ_a^2, σ_b^2 and σ_e^2 to be known (in contrast with the simulations, where we generalize to the case of unknown variances). Defining a = (a_i)_i, b = (b_ij)_{i,j} and y = (y_ijk)_{i,j,k}, the Gibbs Sampler explores the posterior distribution of (μ, a, b) | y by iteratively sampling from the full conditional distributions of μ, a and b as follows (see below for the motivation for denoting such a sampler by GS(1, 1)).
Given the conditional independence structure of the model, Sampler GS(1, 1) is equivalent to a blocked Gibbs sampler with components μ, a and b, i.e. a scheme performing consecutive updates of μ|a, b, a|μ, b and b|μ, a at each iteration.
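To fix ideas, here is a minimal sketch of Sampler GS(1, 1) with known variances, using the standard conjugate full conditionals implied by Model S3 (the data sizes and variance values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# --- simulate data from Model S3 (illustrative sizes and variances) ---
I, J, K = 20, 5, 5
sa, sb, se = 1.0, 1.0, 1.0
a_true = rng.normal(0, sa, I)
b_true = rng.normal(0, sb, (I, J))
y = a_true[:, None, None] + b_true[:, :, None] + rng.normal(0, se, (I, J, K))  # mu = 0

def gs11(y, sa2, sb2, se2, n_iter, rng):
    """Sampler GS(1,1): update mu | a, b, then a | mu, b, then b | mu, a,
    with variances known and a flat prior on mu."""
    I, J, K = y.shape
    mu, a, b = 0.0, np.zeros(I), np.zeros((I, J))
    trace = np.empty(n_iter)
    for s in range(n_iter):
        # mu | a, b, y: Gaussian around the residual grand mean (flat prior)
        resid = y - a[:, None, None] - b[:, :, None]
        mu = rng.normal(resid.mean(), np.sqrt(se2 / (I * J * K)))
        # a_i | mu, b, y: conjugate Gaussian updates, independent across i
        prec_a = J * K / se2 + 1 / sa2
        mean_a = (y - mu - b[:, :, None]).sum(axis=(1, 2)) / se2 / prec_a
        a = rng.normal(mean_a, np.sqrt(1 / prec_a))
        # b_ij | mu, a, y: conjugate Gaussian updates, independent across (i, j)
        prec_b = K / se2 + 1 / sb2
        mean_b = (y - mu - a[:, None, None]).sum(axis=2) / se2 / prec_b
        b = rng.normal(mean_b, np.sqrt(1 / prec_b))
        trace[s] = mu
    return trace

trace = gs11(y, sa**2, sb**2, se**2, 3000, rng)
```

In this balanced design with a flat prior, the posterior mean of μ coincides with the grand mean of the data, which offers a quick sanity check on the sampler's output.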
The parametrization (μ, a, b) induced by (2.1) is often referred to as the non-centred parametrization (NCP) and is contrasted with the centred parametrization (CP) obtained by replacing a_i and b_ij with γ_i = μ + a_i and η_ij = γ_i + b_ij respectively. Under the centred parametrization (μ, γ, η), the model formulation becomes y_ijk ~ N(η_ij, σ_e^2) with η_ij ~ N(γ_i, σ_b^2) and γ_i ~ N(μ, σ_a^2). Figures 1b and 1a provide a graphical representation of the two parametrizations. In the (μ, a, b) case, the label (1, 1) refers to the fact that both levels 1 and 2 use a non-centred parametrization, while in the (μ, γ, η) case (0, 0) indicates that both levels use a centred parametrization. The resulting Gibbs sampler for the centred parametrization is as follows.
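The change of variables between the two parametrizations is a direct transcription of γ_i = μ + a_i and η_ij = γ_i + b_ij:

```python
import numpy as np

def to_centred(mu, a, b):
    """NCP (mu, a, b) -> CP (mu, gamma, eta): gamma_i = mu + a_i, eta_ij = gamma_i + b_ij."""
    gamma = mu + a
    eta = gamma[:, None] + b
    return mu, gamma, eta

def to_noncentred(mu, gamma, eta):
    """Inverse map: a_i = gamma_i - mu, b_ij = eta_ij - gamma_i."""
    return mu, gamma - mu, eta - gamma[:, None]

# round trip on arbitrary values
rng = np.random.default_rng(0)
mu, a, b = 0.3, rng.normal(size=4), rng.normal(size=(4, 2))
mu2, a2, b2 = to_noncentred(*to_centred(mu, a, b))
```

The mixed parametrizations GS(0, 1) and GS(1, 0) apply one of these substitutions at a single level only.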

Illustrative example
As an illustrative example, we simulated data from Model S3 with I = J = 100, K = 5, μ = 0, σ_a = σ_e = 10 and σ_b = 10^{-0.5}. This corresponds to a scenario with a high level of noise in the measurements. We fit Model S3 assuming the standard deviations (σ_a, σ_b, σ_e) to be unknown and placing weakly informative priors, namely 1/σ_a^2, 1/σ_b^2 and 1/σ_e^2 a priori distributed according to an Inverse Gamma distribution with shape and rate parameters equal to 0.01. We compare the efficiency of the Gibbs sampling schemes corresponding to the four different parametrizations, denoting them by GS(1, 1), GS(0, 0), GS(0, 1) and GS(1, 0), initializing the chains at the true values of the parameters (μ, a, b) and (σ_a, σ_b, σ_e). The more realistic case of starting the chains from randomly chosen states led to the same conclusions.
Rows 1-2 of Figure 2 plot traces of the global mean μ and display the potentially dramatic difference among the mixing properties of the Gibbs Sampler under different parametrizations. Based on those, one would certainly exclude using GS(1, 1) and GS(1, 0) to fit this model, leaving GS(0, 0) and GS(0, 1) as possibly feasible algorithms. However, as an additional check, a cautious practitioner may also explore the mixing of the parameters at the first level, namely a and γ. Rows 3-4 of Figure 2 display the behaviour of the global averages of such parameters, namely a_· = Σ_i a_i / I and γ_· = Σ_i γ_i / I, in the first 1000 iterations. Again, we see a dramatic difference induced by different parametrizations and, somewhat surprisingly, despite having good mixing behaviour at level 0 (i.e. μ), GS(0, 0) displays very poor mixing behaviour at level 1 (i.e. γ). It is then natural to explore also the mixing behaviour at level 2, and rows 5-6 of Figure 2 do so, again by plotting the global averages b_·· = Σ_{ij} b_ij / (IJ) and η_·· = Σ_{ij} η_ij / (IJ). In this case GS(1, 1) and GS(0, 1) are the only chains mixing well. Based on Figure 2 it is natural to choose to fit the model using the sampler GS(0, 1), corresponding to the mixed parametrization (μ, γ, b), as it is the only one mixing well at each of the three levels.

Figure 2: Mixing behaviour at level 0 (μ; rows 1-2), level 1 (a_· and γ_·; rows 3-4) and level 2 (b_·· and η_··; rows 5-6) under four different parametrizations. For each level, the ranges of the y-axes are constant across parametrizations sharing the same parameters.
This simple example shows many typical issues arising when fitting Bayesian multilevel models and raises many questions. For example, one would like to know which parameters to use to diagnose convergence, in order to avoid misleading conclusions like the one suggested by rows 1-2 of Figure 2. In fact, while in two-level models good mixing of global hyperparameters such as μ typically indicates good global mixing, this is not true in other multilevel models. Indeed, it is legitimate to wonder whether diagnoses based only on the global means, as in Figure 2, are enough to deduce good mixing of the whole Markov chain, which in our example has more than 10^4 dimensions (1 + I + IJ mean components and 3 precision components). Below we will show that for Model S3, mixing of the global means ensures mixing of the whole (1 + I + IJ)-dimensional mean component of the chain given the variances (see e.g. Corollary 1). Therefore it is enough to monitor the three global means and the three variances to ensure a reliable check of the chain's mixing properties.
Even more crucially, it is desirable to have simple and theoretically grounded guidance in choosing a computationally efficient parametrization, given the huge impact it can have on computational performances. The theoretical analysis developed in the next section will provide useful guidance in this respect.

Multigrid decomposition for the three level hierarchical model
The basic ingredient of our analysis is the following multigrid decomposition. Consider the four possible parametrizations of Model S3: (μ, a, b), (μ, γ, η) and the mixed parametrizations (μ, γ, b) and (μ, a, η). In order to provide a unified treatment, regardless of the chosen parametrization we denote the parameters used by (β^(0), β^(1), β^(2)) and the resulting Gibbs Sampler by GS(β). For example, in the NCP case β^(0) = μ, β^(1) = a, β^(2) = b and GS(β) coincides with GS(1, 1). First consider the map δ sending β = (β^(0), β^(1), β^(2)) to δβ = (δ^(0)β, δ^(1)β, δ^(2)β), where, loosely speaking, δ^(i)β represents the increments of β at the i-th coarseness level. It is easy to see that the map δ is a bijection between R^d and its image, a product of linear subspaces defined by zero-sum constraints on the increments. The dimensionality of δβ equals that of β, which is 1 + I + IJ, because δβ has 3 + 2I + IJ components and 2 + I constraints. The following theorem shows that the Markov chain induced by GS(β) factorizes under the transformation δ, as illustrated in Figure 3.
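One concrete reading of the map δ, consistent with Remark 1 below and with the 3 + 2I + IJ counting (the exact definition of the increments is our assumption here), can be sketched as follows:

```python
import numpy as np

def delta(mu, a, b):
    """Hypothetical multigrid map: split (mu, a, b) into level averages and
    within-level increments; matches the 3 + 2I + IJ component count."""
    a_bar = a.mean()
    b_rows = b.mean(axis=1)                 # b_i.
    b_bar = b.mean()                        # b_..
    d0 = (mu, a_bar, b_bar)                 # 3 components (coarsest level)
    d1 = (a - a_bar, b_rows - b_bar)        # 2I components, 2 zero-sum constraints
    d2 = b - b_rows[:, None]                # IJ components, I zero-sum constraints
    return d0, d1, d2

def delta_inv(d0, d1, d2):
    """Inverse map, witnessing that delta is a bijection."""
    mu, a_bar, b_bar = d0
    a = d1[0] + a_bar
    b = d2 + (d1[1] + b_bar)[:, None]
    return mu, a, b

rng = np.random.default_rng(2)
I, J = 4, 3
mu, a, b = 0.5, rng.normal(size=I), rng.normal(size=(I, J))
d0, d1, d2 = delta(mu, a, b)
mu2, a2, b2 = delta_inv(d0, d1, d2)
```

The round trip recovers (μ, a, b) exactly, and the increments satisfy the zero-sum constraints, so the 3 + 2I + IJ components minus 2 + I constraints indeed match the 1 + I + IJ free parameters.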
While the posterior independence of δ (0) β, δ (1) β and δ (2) β is well-known, Theorem 1 shows the much stronger statement that the Markov chains induced by the Gibbs Sampler are independent.
Remark 1. The three subspaces of R^d spanned by the vectors δ^(0)β, δ^(1)β and δ^(2)β, respectively, do not depend on the choice of parametrization β. For example, the four instances of δ^(0)β, i.e. (μ, a_·, b_··), (μ, a_·, η_··), (μ, γ_·, b_··) and (μ, γ_·, η_··), span the same subset of the parameter space of Model S3. In this sense, the multigrid decomposition of Theorem 1 factorizes the Gibbs Sampler into three independent chains operating on subspaces that depend only on the model under consideration; the decomposition is thus intrinsic to the model rather than specific to the chosen parametrization.
Theorem 1 provides a useful tool to analyse the Markov chain of interest, β(s). In fact the factorization into independent Markov chains implies that the rate of convergence of β(s) is simply given by the slowest rate of convergence among δ (0) β(s), δ (1) β(s) and δ (2) β(s). Interestingly, the slowest chain is always the chain at the highest level δ (0) β(s), regardless of the choice of parametrization and the values of (I, J, K, σ a , σ b , σ e ).

Explicit rates of convergence under different parametrizations
The multigrid decomposition developed in Section 3 permits a direct analysis of the convergence properties of β(s). Since this is a Gibbs Sampler targeting a multivariate Gaussian distribution, in principle it could be analysed using the tools developed in Amit (1996), Roberts and Sahu (1997) and Khare et al. (2009). Applying these results requires a spectral decomposition of a matrix derived from Σ. However, given the high dimensionality of β(s), which has 1 + I + IJ parameters, it is hard to apply such results directly, and in fact the convergence properties of β(s) have been studied heuristically or numerically in the literature (see e.g. Gelfand et al., 1995, Section 4 and Roberts and Sahu, 1997, Section 4.2). Circumventing these theoretical difficulties, Corollary 1 implies that it suffices to study the skeleton chain δ^(0)β(s), which is a low-dimensional chain (namely 3-dimensional) amenable to direct analysis. Therefore we can derive analytic expressions for the rates of convergence of the Gibbs Sampler under different parametrizations.

Theorem 3. Given an instance of Model S3, the rates of convergence of the four Gibbs Sampler schemes GS(0, 0), GS(1, 1), GS(0, 1) and GS(1, 0) admit explicit closed-form expressions in terms of (I, J, K, σ_a, σ_b, σ_e).

Theorem 3 provides explicit and informative formulas regarding the interaction between the choice of parametrization and the resulting efficiency of the Gibbs Sampler for Model S3. Figure 4 summarizes graphically the dependence of the convergence rates of the different parametrizations on the values of the variances at the various levels. Roughly speaking, the figure suggests that there is a partition of the hyperparameter space (corresponding to the white regions in each plot) such that in each region exactly one of the four parametrizations performs well.
Consider the illustrative example of Section 2.1. Applying Theorem 3 in that context, we obtain that the L^2 rates of convergence (up to the third decimal digit) of the various Gibbs Samplers under consideration given (I, J, K, σ_a, σ_b, σ_e) = (100, 100, 5, 10, 10^{-0.5}, 10) are (ρ_00, ρ_11, ρ_01, ρ_10) = (0.995, 0.998, 0.007, 0.999). Recall that values of ρ close to 1 mean slow convergence, see (1.1) and the discussion thereafter. These numbers provide a quantitative and theoretically grounded description of the behaviour heuristically observed in Section 2.1 and can easily be used to optimize performance (see e.g. Section 3.2 below).

Conditionally optimal parametrization
We now consider the optimal parametrization (among the four possible choices (μ, a, b), (μ, γ, b), (μ, a, η) and (μ, γ, η)) as a function of the normalized variance components (σ̃_a^2, σ̃_b^2, σ̃_e^2). Using the formulae of Theorem 3 we obtain the following explicit answers.

Corollary 2 (Optimal parametrization for Model S3). The rate of convergence of the Gibbs Sampler targeting Model S3 is minimized by the following parametrization choice:
• use a centred parametrization η at the lowest level if and only if σ̃_b^2 ≥ σ̃_e^2;
• use a centred parametrization γ at the middle level if and only if σ̃_a^2 ≥ σ̃_b^2 + σ̃_e^2.
The resulting Gibbs Sampler has a rate of convergence ρ bounded above by 2/3, with the equality ρ = 2/3 holding if and only if σ̃_a^2 = σ̃_b^2 + σ̃_e^2 and σ̃_b^2 = σ̃_e^2 (in which case all parametrizations are equivalent).

Table 1 provides a graphical representation of the decision process. This simple rule guarantees that the resulting Gibbs Sampler has a rate of convergence smaller than 2/3, thus ensuring high sampling efficiency for fixed variances. Table 1 implies that the choice of parametrization at a given level (i.e. whether it is computationally most efficient to use a centred or non-centred parametrization) depends on the ratio between the normalized variance at the level under consideration and the sum of the normalized variances of the levels below. These results extend previous intuition for the two-level case (Papaspiliopoulos et al., 2003) to hierarchical models with three levels.

Figure 5: Autocorrelation functions of the global means and the standard deviations (σ_a, σ_b, σ_e) under the three updating schemes (Centred, Optimal and Blocked) described in Section 3.2. The combination with the parameter expansion methodology is denoted as "+PX".
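The decision rule of Corollary 2 is simple enough to implement directly; a sketch (taking the normalized variances as inputs, with 0 denoting centred and 1 non-centred, as in the GS(λ_1, λ_2) labels):

```python
def optimal_parametrization(var_a, var_b, var_e):
    """Decision rule of Corollary 2. Inputs are the *normalized* variance
    components; returns (lambda1, lambda2) with 0 = centred, 1 = non-centred."""
    lambda1 = 0 if var_a >= var_b + var_e else 1  # middle level: gamma vs a
    lambda2 = 0 if var_b >= var_e else 1          # lowest level: eta vs b
    return lambda1, lambda2
```

In the unknown-variance samplers discussed below, this rule would be evaluated at each iteration, given the current draw of the variance components.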
Corollary 2 allows for simple and effective strategies to guarantee high sampling efficiency in practical implementations of Gibbs Sampling for Model S3 in the case of unknown variances. Common implementations choose a fixed parametrization β of the Gaussian component, such as the fully centred parametrization β = (μ, γ, η), and alternate updating β | (σ_a, σ_b, σ_e) with GS(β) and (σ_a, σ_b, σ_e) | β with direct sampling (which is straightforward given the conditional independence of σ_a, σ_b and σ_e given β). Using Corollary 2, one can instead choose at each iteration the optimal parametrization β given (σ_a, σ_b, σ_e) according to Table 1, with essentially no additional computational cost on top of a Gibbs Sampling iteration. This ensures that the sampling step β | (σ_a, σ_b, σ_e) has high efficiency, regardless of the values of (I, J, K, σ_a, σ_b, σ_e).
We implement and illustrate this strategy in Figure 5, where we compare MCMC autocorrelation functions in the context of the illustrative example of Section 2.1, with unknown variances (σ_a, σ_b, σ_e). We compare the following three schemes: the sampler updating (μ, γ, η) | (σ_a, σ_b, σ_e) with GS(0, 0) and (σ_a, σ_b, σ_e) | (μ, γ, η) exactly; the sampler choosing the optimal parametrization β according to Table 1 and then updating β | (σ_a, σ_b, σ_e) with GS(β) and (σ_a, σ_b, σ_e) | β exactly; and the sampler updating both β | (σ_a, σ_b, σ_e) and (σ_a, σ_b, σ_e) | β exactly (which can be implemented because the distribution of β | (σ_a, σ_b, σ_e) is multivariate Gaussian). We call the three samplers "Centred", "Optimal" and "Blocked", respectively. The results are displayed in the first three rows of Figure 5 and show that the Optimal sampler reduces the autocorrelation significantly compared to the Centred one, and achieves mixing essentially equivalent to that of the Blocked sampler. The potential benefit of the Optimal sampler compared to the Blocked one is that the Gibbs update of β | (σ_a, σ_b, σ_e) in the Optimal sampler only requires univariate updates and has a potentially lower computational cost than a full multivariate block update of β | (σ_a, σ_b, σ_e), which requires large matrix operations. While these matrix operations can be performed efficiently in the context of nested linear models (see e.g. Papaspiliopoulos and Zanella, 2017), their cost becomes significantly larger, for example, in the context of crossed random effect models (see Section 4 below and Papaspiliopoulos et al., 2019). Note that the similarity in performance between the Optimal and Blocked samplers is not surprising given our theoretical results above. In fact, Corollary 2 guarantees that the samplers GS(β) used in the Gibbs updates have a rate of convergence bounded above by 2/3, which is well separated from 1.
When such updates are nested within a larger sampler (e.g. the one updating β | (σ_a, σ_b, σ_e) and (σ_a, σ_b, σ_e) | β), the difference between an exact update of β and a Gibbs one with a good rate of convergence can easily become negligible.
We then combined the three schemes described above with the parameter expansion (PX) methodology of Meng and Van Dyk (1999) and Liu and Wu (1999), and denote the resulting schemes as Centred+PX, Optimal+PX and Blocked+PX. The PX methodology aims to avoid potential slow mixing due to strong dependencies between β and (σ_a, σ_b, σ_e), and in this case it is successful in doing so for the slowly mixing parameter σ_b (see Figure 5, rows 4-6). The results suggest that the optimal choice of parametrization for β can be conveniently combined with the PX methodology, and that the two have complementary roles in speeding up the convergence of the Gibbs Sampler.

Multigrid decomposition for crossed effect models
The multigrid decomposition can be used to analyse non-nested models. In this section we focus on the following crossed effect model.
Similarly to Sections 2 and 3, we use bold letters to denote vectors, writing a = (a^(1), . . . , a^(k)). The standard Gibbs Sampler to sample from the posterior distribution L(μ, a | y) of Model Ck is defined as follows.
Model Ck and Sampler GS-crossed have recently been analysed in Papaspiliopoulos et al. (2019) using the multigrid decomposition approach developed in Section 3 of this paper to derive expressions for the convergence rate of Sampler GS-crossed. In particular, for each s ∈ {1, . . . , k}, Papaspiliopoulos et al. (2019) considered the linear functions ā^(s) of a defined in (4.2) and proved the following result.
Theorem 4 (Papaspiliopoulos et al. (2019)). Let (μ(t), a(t))_{t≥0} be the Markov chain generated by Sampler GS-crossed. Then the time-wise transformations of the chain induced by the linear functions in (4.2) evolve as independent Markov chains.

Theorem 4 implies that the convergence properties of Sampler GS-crossed deteriorate as N increases, because max_{s∈{1,...,k}} Nτ_e (Nτ_e + n_s τ_s)^{-1} goes to 1 as N → ∞. Motivated by this consideration, Papaspiliopoulos et al. (2019) propose a collapsed Gibbs Sampler that avoids such slowdown for increasing data size while preserving the same computational cost per iteration as Sampler GS-crossed. In the following two sections we extend the analysis of Model Ck performed in Papaspiliopoulos et al. (2019), focusing on the role of, respectively, reparametrizations and statistical identifiability.
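A minimal numerical illustration of this slowdown, taking the per-factor rate of Sampler GS-crossed to be Nτ_e/(Nτ_e + n_s τ_s) (our reading of the expression above, chosen so that the rate lies in [0, 1) and tends to 1 as N grows) in a balanced two-factor design with n_1 = n_2 = √N and unit precisions (an assumption for illustration):

```python
def gs_crossed_rate(N, tau_e, n, tau):
    """Rate of Sampler GS-crossed (as read from Theorem 4):
    max over factors s of N*tau_e / (N*tau_e + n_s*tau_s)."""
    return max(N * tau_e / (N * tau_e + ns * ts) for ns, ts in zip(n, tau))

# balanced two-factor designs with n1 = n2 = sqrt(N), unit precisions (assumed)
rates = [gs_crossed_rate(N, 1.0, [int(N**0.5)] * 2, [1.0] * 2)
         for N in (100, 10_000, 1_000_000)]
# rates increase towards 1 as N grows, matching the deterioration noted above
```

Since n_s grows like √N in this design, n_s τ_s / (N τ_e) vanishes and the rate approaches 1, which is precisely the slowdown the collapsed sampler of Papaspiliopoulos et al. (2019) is designed to avoid.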

Reparametrizations and crossed effects models
In the context of nested models, reparametrization techniques based on hierarchical centering offer a way to make the Gibbs Sampler robust to large datasets (see e.g. Corollary 2). We now show that this is not the case in the crossed effects context of Model Ck. In this section we focus on the case k = 2, which is a case often studied theoretically in the literature (see e.g. Gao and Owen (2017) and Brown et al. (2018) for recent examples). In this case, hierarchical centering leads to four possible parametrizations, indexed by (λ_1, λ_2) ∈ {0, 1}^2. Each parametrization corresponds to a different Gibbs Sampler, which at each iteration updates μ from L(μ | β^(1), β^(2), y), β^(1) from L(β^(1) | μ, β^(2), y), and β^(2) from L(β^(2) | μ, β^(1), y). The following result characterizes the rate of convergence ρ_{λ_1 λ_2} of such Gibbs Samplers for all combinations (λ_1, λ_2) ∈ {0, 1}^2.

Figure 6: Plot of the rates of convergence in (4.5). Color levels correspond to values of log(1 − ρ), where ρ is the rate of convergence (or its lower bound for ρ_11), as a function of log(r_1/(1 − r_1)) and log(r_2/(1 − r_2)).
Figure 6 summarizes graphically the results of Theorem 5, showing the dependence of the convergence rates on the choice of parametrization. The rate displayed in Figure 6 for the fully centred parametrization is the lower bound given in (4.5).
Theorem 5 also implies that min{ρ_00, ρ_01, ρ_10, ρ_11} → 1 as n_1, n_2 → ∞. Therefore, regardless of the parametrization chosen, the convergence of Gibbs Samplers targeting Model Ck deteriorates as the numbers of factor levels n_1 and n_2 increase. This is in contrast with the nested case analysed in Section 3, where reparametrization techniques succeed in providing samplers with good convergence properties for all choices of hyperparameter values. In the next section we show that a more effective way to achieve good convergence properties is to impose stronger identifiability constraints.

Connections to statistical identifiability
The parameters (μ, a^(1), . . . , a^(k)) in Model Ck are not identifiable, in the sense that the mapping (μ, a^(1), . . . , a^(k)) → L(y | μ, a^(1), . . . , a^(k)) is not injective. While this is not, strictly speaking, an issue for Bayesian inference, one may wonder whether imposing identifiability on the model parameters avoids the degradation of mixing described in the previous sections (see e.g. Vines et al., 1996; Gelfand and Sahu, 1999; Xie and Carlin, 2006; Kaufman et al., 2010; Vallejos et al., 2015 for related discussion and some examples in applications). We consider imposing identifiability by conditioning on linear constraints c_s = 0, such as the commonly used choices of fixing a single coefficient or the average of each factor.

Theorem 6. Consider Sampler GS-crossed conditioned on c_s = 0 for s = 1, . . . , k, meaning that μ gets updated from L(μ | a, y, c_1 = · · · = c_k = 0) rather than L(μ | a, y), and a^(s) from L(a^(s) | μ, a^(−s), y, c_1 = · · · = c_k = 0) rather than L(a^(s) | μ, a^(−s), y). The rate of convergence of the resulting chain is given by (4.6).

Comparing (4.6) with (4.3) we can see that, since (1 − q_s) ∈ [0, 1), the rate of convergence always decreases after imposing the identifiability constraints c_s = 0 for s = 1, . . . , k. Thus, Theorem 6 implies that, in this context, imposing identifiability always improves the convergence properties of the Gibbs Sampler. To our knowledge, this is the first rigorous result of this kind in the Bayesian computation literature. On the other hand, the result also shows that imposing identifiability per se does not guarantee fast convergence. For example, Theorem 6 implies that conditioning Sampler GS-crossed on a^(s)_1 = 0 for s ∈ {1, . . . , k} leads to a convergence rate that can still go to 1 as the data size increases. Interestingly, (4.6) implies that the rate of convergence is minimized when q_s is maximized, which happens when the weights in the linear constraints are constant, i.e. when the constraint fixes the average ā^(s).
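Conditioning the Gaussian updates on a zero-average constraint such as ā^(s) = 0 need not be expensive. When the relevant full conditional is isotropic (an assumption that holds, for instance, in a balanced design), an exact constrained draw is obtained by sampling unconstrained and recentring, since the average and the residuals of an isotropic Gaussian are independent. A sketch under that assumption:

```python
import numpy as np

def sample_given_zero_mean(m, sigma, rng):
    """Draw x ~ N(m, sigma^2 I) conditioned on the linear constraint mean(x) = 0.
    For an isotropic Gaussian, the average and the residuals are independent,
    so sampling unconstrained and recentring is an exact constrained draw."""
    z = rng.normal(m, sigma)
    return z - z.mean()

rng = np.random.default_rng(0)
m = np.array([1.0, 2.0, 3.0, 4.0])
x = sample_given_zero_mean(m, 1.0, rng)  # satisfies the constraint exactly
draws = np.array([sample_given_zero_mean(m, 1.0, rng) for _ in range(20_000)])
# the conditional mean is the projection of m onto the constraint set: m - mean(m)
```

This recentring trick is what makes constrained Gibbs updates of the above kind essentially free compared to the unconstrained ones.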

Beyond the Gaussian case: a Poisson example
The results of Section 4.2 provide guidance on which linear constraints to use to impose identifiability also beyond the Gaussian case. As an example, we consider the following crossed random effect model with Poisson likelihood, which is the simplest analogue of Model Ck in the context of count data.
Consider sampling from the posterior distribution L(μ, a|y) of Model CkP using the standard Gibbs Sampler that, similarly to Sampler GS-crossed, at each iteration updates μ from L(μ|a, y) and then a^(s) from L(a^(s)|μ, a^(−s), y) for s = 1, . . . , k. Here y, a and a^(−s) are defined as at the beginning of Section 4.
We explore the extent to which the conclusions drawn from Theorem 6 apply also to Model CkP by means of simulations. We consider the case k = 2 with three different combinations of values of n_1 and n_2. The data (y_{i_1 i_2}) are generated from the model with (a^(1)_{i_1}) and (a^(2)_{i_2}) sampled from their prior distributions and μ set to 1. For the prior hyperparameters we use α_1 = α_2 = α_μ = 2 and β_1 = β_2 = β_μ = 0.1. We compare the standard Gibbs Sampler with no constraints to the versions obtained by imposing the linear constraints a^(1)_1 = a^(2)_1 = 1 and ā^(1) = ā^(2) = 1, respectively, where ā^(1) and ā^(2) are defined as in (4.2). Table 2 reports the resulting effective sample sizes (minimum and median across parameters). The results are consistent with the theoretical guidance offered by Theorem 6 since: (a) imposing identifiability always improves the mixing of the samplers; (b) imposing constraints on ā^(1) and ā^(2) leads to faster convergence than imposing constraints on a^(1)_1 and a^(2)_1; (c) the difference in efficiency between the two sets of linear constraints increases with n_1 and n_2.
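The effective sample sizes in Table 2 are computed with the coda R package; a rough stand-alone analogue can be sketched in Python. This is a simple truncated-autocovariance estimator, not coda's exact spectral method, so the numbers will differ slightly:

```python
import numpy as np

def ess(x):
    """Effective sample size: n / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first non-positive autocorrelation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n  # lags 0, 1, ..., n-1
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, n):
        if rho[k] <= 0:  # truncate: later lags are dominated by noise
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)
```

For i.i.d. draws the estimate is close to the sample size, while for a strongly autocorrelated chain it is much smaller.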

Comparison with Hamiltonian Monte Carlo
Finally, we also explore whether the results in Theorem 6 can be useful to guide the implementation of other MCMC schemes targeting Model CkP, such as Hamiltonian Monte Carlo (HMC) (Neal et al., 2011) and the No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014) as implemented in the widely used software STAN (Carpenter et al., 2017). We consider the same setting as the rightmost column of Table 2, where n_1 = n_2 = 100, comparing the Gibbs Sampler with HMC and NUTS (using their STAN implementations with default settings). Table 3 reports effective sample sizes (ESS), runtimes, numbers of leapfrog steps per iteration and ESS per leapfrog iteration for HMC/NUTS, ESS per sweep (one update of μ, a^(1) and a^(2)) for Gibbs Sampling, and ESS per unit of computation time for all schemes. The three versions of each algorithm correspond to no constraints (v1), the constraints a^(1)_1 = a^(2)_1 = 1 (v2) and ā^(1) = ā^(2) = 1 (v3). ESS and runtimes of each algorithm refer to 10^4 iterations, with the first half of the samples discarded as burn-in; all numbers are averaged over 10 replicates. Traceplots and autocorrelation functions are provided in the supplement. All simulations reported in Tables 2 and 3 were performed on the same desktop computer with 16 GB of RAM and an Intel Core i7-7700 @ 3.60 GHz processor, using R (R Core Team, 2018). Effective sample sizes are estimated using the coda R package. The supplementary material provides the R code used to implement the Gibbs Samplers and the Stan code used to specify the models with linear constraints. First, Table 3 suggests that imposing identifiability through linear constraints also helps gradient-based samplers such as HMC and NUTS significantly, with a pattern in accordance with Theorem 6. The improvement involves both faster convergence (higher ESS per iteration) and reduced runtime, the latter because the number of leapfrog steps per iteration (which is automatically tuned in STAN) decreases for better-identified and less correlated targets, such as the ones with linear constraints.
The only exception to this pattern is that HMC-v1 (no constraints) is more efficient than HMC-v2 (constraints a^(1)_1 = a^(2)_1 = 1). This is mainly due to the very high number of leapfrog steps per iteration employed by HMC-v1 compared to HMC-v2, which ends up being beneficial in terms of efficiency despite a major increase in runtime. This phenomenon is related to the specific tuning procedures implemented in STAN and arguably does not contradict the general fact that improving identifiability by imposing linear constraints is beneficial for HMC and NUTS. Second, Table 3 suggests that the Gibbs Sampler can be substantially more efficient than HMC and NUTS for random effect models such as Model CkP, thanks to a lower runtime and, arguably, a more direct use of conditional independence among random effects. We note that empirical runtimes can be highly dependent on software implementations, which could be unfavourable to a generic implementation such as STAN. In order to obtain an implementation-independent comparison of efficiency one should combine the ESS per leapfrog iteration and ESS per sweep values reported in Table 3 with theoretical considerations on the computational costs of such operations, which however are non-trivial and highly model-dependent. We leave a more detailed investigation of these aspects, both theoretical and computational, to future work.

Remark 2. Interestingly, the multigrid decomposition can be applied also to Model CkP, with the appropriate modifications. In this case the Markov chain ((μ, a)(t))_{t=1}^∞ induced by the Gibbs Sampler can be decomposed into (k + 1) independent Markov chains involving the ratios a^(s)_{i_s}/ã^(s). In this case the rate of convergence of the original chain coincides with that of ((μ, ã^(1), . . . , ã^(k))(t))_{t=1}^∞, which evolves according to a (k + 1)-dimensional Gibbs Sampler whose full conditionals, for s ∈ {1, . . . , k}, involve y_· = Σ_{i_1,...,i_k} y_{i_1...i_k}. We expect such a (k + 1)-dimensional Gibbs Sampler to be potentially amenable to analysis using the framework of iterated random functions (Diaconis and Freedman, 1999), in order to obtain an upper bound on convergence rates (see e.g. Alsmeyer and Fuh, 2001, Theorem 2.1.(b)). We leave these extensions to future work and mention them in Section 8 as a possible avenue for future research.

Non-symmetric hierarchical models
Section 3 describes how to optimize the parametrization as a function of (I, J, K, σ_a, σ_b, σ_e) for Model S3. In general, both the variance terms σ²_b and σ²_e and the numbers of branches J and K could depend on i and j. In this section we consider non-symmetric cases for two- and three-level hierarchical models. In these non-symmetric cases the computationally optimal strategy will involve centering some branches of the hierarchy and non-centering others: we will call these strategies bespoke parametrizations.
Consider the following non-symmetric 2-levels model (which we describe in terms of precisions rather than variances for notational convenience).

Model NS2 (Non-symmetric 2-levels hierarchical model). Consider the following 2-levels model with centred parametrization
where the precision components (τ_a, (τ_{e,i})_i) are assumed to be known. Papaspiliopoulos et al. (2003) studied the symmetric version of Model NS2, where J_i = J and τ_{e,i} = τ_e for all i and some fixed J and τ_e, and showed that the convergence rates induced by the centred and non-centred parametrizations are as given in (6.1). Equation (6.2) shows that in the non-symmetric case the Gibbs Sampler rate of convergence is given by a weighted average of the precision ratios τ_a/(τ̃_i + τ_a) and τ̃_i/(τ̃_i + τ_a), depending on whether each component is centred or not. This has clear analogies with the symmetric case in (6.1). The weights in the average of (6.2) are themselves functions of (λ_1, . . . , λ_I), thus introducing dependence across components in terms of centering and the overall convergence rate. Nonetheless, the following corollary shows that, even in the context of Model NS2, the parametrization of each branch of the tree can be optimized independently following the same intuition as in the symmetric case: for each i in {1, . . . , I} use the centred parametrization γ_i if and only if τ_a ≤ J_i τ_{e,i}, and otherwise use the non-centred parametrization a_i = γ_i − μ. By (6.2), the strategy described in Corollary 3 ensures that ρ_{λ_1...λ_I} ≤ 1/2. This is the same upper bound one can obtain in the symmetric case (see (6.1) and Papaspiliopoulos et al. (2003)), meaning that in this case bespoke parametrizations are successful in dealing with the lack of symmetry. Consider now the three-level non-symmetric case.
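The branch-by-branch rule of Corollary 3 is straightforward to apply in practice. A minimal sketch (the function name is ours):

```python
def bespoke_ns2(tau_a, J, tau_e):
    """For Model NS2, centre branch i iff tau_a <= J_i * tau_{e,i}
    (the rule of Corollary 3). J and tau_e list the J_i and tau_{e,i}."""
    return ["centred" if tau_a <= J[i] * tau_e[i] else "non-centred"
            for i in range(len(J))]
```

For example, a branch with many precise observations gets centred, while a branch with few noisy observations gets non-centred.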

Model NS3 (Non-symmetric 3-levels hierarchical model). Consider a more general 3-levels model with centred parametrization
where variance components are assumed to be known.
In this case the multigrid factorization of Theorem 1 does not apply directly to Model NS3, but it can nonetheless be used to obtain upper bounds on the rates of convergence by combining it with monotonicity properties of the spectral radius of non-negative matrices (see the supplement and Roberts and Sahu, 1997, Theorem 7 for details).

Theorem 8. Given an instance of Model NS3 satisfying suitable conditions for all i ∈ {1, . . . , I}, the rate of convergence of the Gibbs Sampler with centred parametrization (μ, γ, η) satisfies an explicit upper bound.

The results of Theorem 8 suggest that as the number of data points increases, the efficiency of the Gibbs Sampler with centred parametrization increases. In fact, as the K_ij increase the assumptions of Theorem 8 are eventually satisfied, and the bound on the convergence rate goes to 0 as the J_i and K_ij increase. Theorem 8 thus provides rigorous theoretical support for, and a characterization of, the well-known fact that the centred parametrization is to be preferred in contexts of large and informative datasets (Gelfand et al., 1995; Papaspiliopoulos et al., 2003). We note that the convergence rate of the Gibbs Sampler targeting Model NS3 is not easily tractable, and deriving analytic expressions for the optimal bespoke parametrization in this context is still an open problem.

Hierarchical linear models with arbitrary number of levels
Here we consider Gaussian hierarchical models with k levels for arbitrary k. We refer to the highest level of the hierarchy (i.e. the one furthest from the data) as level 0, down to level k − 1, the lowest level (i.e. the one closest to the data). The 3-level model of Section 3 is the special case k = 3 of the theory developed here.

Model formulation
In order to allow for more generality and keep the notation concise, in this section we will use graphical models notation. In particular, T will denote a finite tree with k levels and root t_0 ∈ T. For each node t ∈ T we will denote by pa(t) the unique parent of t and by ch(t) the collection of children of t. Moreover, we write s ⪯ t and s ⪰ t if s is, respectively, an ancestor or a descendant of t (with s and t possibly being equal), while s ≺ t and s ≻ t denote the same notions with the additional condition s ≠ t. For each node t ∈ T we denote by ℓ(t) the level of node t (i.e. its distance from t_0). For each d ∈ {0, . . . , k − 1} we denote by T_d = {t ∈ T : ℓ(t) = d} the collection of nodes at level d; for example T_0 = {t_0} and T = ∪_{d=0}^{k−1} T_d. Noisy observations will occur only at leaf nodes. The collection of leaf nodes is denoted by T_L = {t ∈ T : ch(t) = ∅}. For simplicity we assume that all leaf nodes are at level k − 1, i.e. T_L = T_{k−1}, although this assumption could easily be relaxed to allow some branches to be longer than others.
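The tree notation above translates directly into code. A minimal sketch of the bookkeeping (pa, ch, ℓ and the leaf set T_L), under the assumption that the tree is supplied as a child-to-parent map:

```python
class Tree:
    """Minimal finite tree: pa(t), ch(t), level(t) and the leaves, as in the text."""

    def __init__(self, parent):
        # parent: dict mapping each node to its parent; the root maps to None
        self.parent = parent
        self.children = {t: [] for t in parent}
        for t, p in parent.items():
            if p is not None:
                self.children[p].append(t)

    def pa(self, t):
        return self.parent[t]

    def ch(self, t):
        return self.children[t]

    def level(self, t):
        # distance from the root t0
        d = 0
        while self.parent[t] is not None:
            t = self.parent[t]
            d += 1
        return d

    def leaves(self):
        # T_L = {t : ch(t) is empty}
        return [t for t in self.parent if not self.children[t]]
```

For instance, a two-level tree with root "t0" and branches "a" and "b" is encoded as {"t0": None, "a": "t0", "b": "t0", ...}.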
Model NSk (k-levels hierarchical model). Suppose that we have a hierarchy described by a tree T with observations occurring at the leaf nodes T_L. We assume the following hierarchical model, where i ∈ {1, . . . , n_t} with n_t the number of observed data at leaf node t, (τ_t)_{t∈T\{t_0}} and (τ^(e)_t)_{t∈T_L} are known precision components, and all normal random variables are sampled independently. Following the standard Bayesian model specification we assume a flat prior on γ_{t_0}.
We are interested in sampling from the posterior distribution of γ_T = (γ_t)_{t∈T} given some observations y = (y_t)_{t∈T_L}. The centred parametrization γ_T of Model NSk leads to the following Gibbs Sampler.

Non-centering and hierarchical reparametrizations
Model NSk expresses Gaussian hierarchical models using a centred parametrization. The corresponding non-centred version is given by the following example.
Example 1 (Fully non-centred parametrization). Under the same setting as Model NSk, define α_T accordingly and assume a flat prior on α_{t_0}.
The non-centred parametrization α T can be obtained as a linear transformation of the centred version γ T of Model NSk. More generally, we will consider the class of parametrizations that can be obtained by reparametrizing γ T as follows.
Definition 1 (Hierarchical reparametrizations). Let γ_T = (γ_t)_{t∈T} be a random vector with elements indexed by a tree T. We say that β_T is a hierarchical reparametrization of γ_T if it satisfies (7.3) for some real-valued coefficients Λ = (λ_tr)_{r⪯t, t∈T} with λ_tt ≠ 0 for all t ∈ T. We denote (7.3) by β_T = Λγ_T.
Using the terminology of Papaspiliopoulos et al. (2003), we refer to the family of hierarchical reparametrizations of γ_T = (γ_t)_{t∈T} as partially non-centred parametrizations (PNCP) of Model NSk. Note that (7.3) does not span the space of all linear transformations of γ_T: Λ = (λ_tr)_{r⪯t, t∈T} can be thought of as a |T| × |T| matrix Λ = (λ_tr)_{r,t∈T} inducing a linear transformation of γ_T, with the additional sparsity constraint that λ_tr = 0 whenever r is not an ancestor of t. The following Lemma shows that the definition of the class of PNCP does not depend on the starting parametrization used to formulate Model NSk: for example, one could equivalently define the class of PNCP of Model NSk as the collection of hierarchical reparametrizations of the non-centred parametrization α_T of Example 1.

Lemma 1. If β T is a hierarchical reparametrization of γ T , then also γ T is a hierarchical reparametrization of β T .
As in the 3-levels case, we are interested in assessing the computational efficiency of the different Gibbs Sampling schemes arising from different PNCPs. For each PNCP β_T the corresponding Gibbs Sampler scheme GS(β_T) is defined analogously to GS(γ_T).
Sampler GS(β_T) is easy to implement because the family of PNCP preserves the hierarchical structure of Model NSk. In fact, any PNCP of Model NSk exhibits the following conditional independence structure:

β_r ⊥ β_t | β_{T\{r,t}} unless r ⪯ t or t ⪯ r. (H)

Note that this is a weaker condition than the one holding for the centred parametrization γ_T. In the latter case the conditional independence graph corresponds exactly to the tree T, meaning that, for r ≠ t,

γ_r ⊥ γ_t | γ_{T\{r,t}} unless r = pa(t) or t = pa(r). (T)
Despite being weaker than (T), condition (H) still guarantees that all parameters at the same level are conditionally independent (thus simplifying the update of β_{T_d} | β_{T\T_d}) and that the full conditional distribution of each β_t depends only on the descendants and ancestors of t. The following Lemma and Corollary provide a simple way to check that any PNCP of Model NSk satisfies (H).
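Since the vectors involved are Gaussian, conditions (T) and (H) are statements about the sparsity of the precision matrix: β_r ⊥ β_t | β_{T\{r,t}} exactly when the corresponding precision entry is zero. A minimal numerical check of (T) for a three-node chain t_0 → t_1 → t_2 in centred parametrization (the precision values are arbitrary):

```python
import numpy as np

# g0 ~ N(0, 1/t0), g1 | g0 ~ N(g0, 1/t1), g2 | g1 ~ N(g1, 1/t2).
# The joint log-density is quadratic; its Hessian is the precision matrix Q.
t0, t1, t2 = 1.0, 2.0, 3.0
Q = np.array([[t0 + t1, -t1,      0.0],
              [-t1,      t1 + t2, -t2],
              [0.0,     -t2,       t2]])
# The zero in position (0, 2) encodes g0 independent of g2 given g1,
# i.e. condition (T): non-neighbouring tree nodes are conditionally independent.
```

The same check applies to any PNCP: (H) corresponds to allowing non-zeros between any ancestor-descendant pair, not only parent-child pairs.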

Symmetry assumption
To provide a full analysis of the convergence properties of Sampler GS(β_T) we need a symmetry assumption that we now define. Let ρ_tr denote the partial correlation Corr(β_t, β_r | β_{T\{t,r}}), namely ρ_tr = −Q_tr/√(Q_tt Q_rr) for t ≠ r and ρ_tt = 1 for all t, where Q is the precision matrix of β_T. Let X = (X_ℓ)_{ℓ=0}^{k−1} be a random walk going through T from root to leaves as follows: X_0 = t_0 almost surely and then, for ℓ ∈ {0, . . . , k − 2}, X_{ℓ+1} is drawn from ch(X_ℓ) with probabilities proportional to the squared partial correlations ρ²_{X_ℓ t}, t ∈ ch(X_ℓ).
Equation (7.4) implies that at each step X jumps from the current state r to one of its children t ∈ ch(r), choosing t proportionally to the squared partial correlation between β_r and β_t. Since ℓ(X_d) = d almost surely for all d ∈ {0, . . . , k − 1}, we can use the following simplified notation: for any t and r in T we use P(t), P(t|r) and P(t ∩ r) to denote, respectively, P(X_{ℓ(t)} = t), P(X_{ℓ(t)} = t | X_{ℓ(r)} = r) and P(X_{ℓ(t)} = t, X_{ℓ(r)} = r).
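The law of X is easy to compute for a given tree and given squared partial correlations. A small sketch (function and argument names are ours) that returns the marginal distribution of X_ℓ at each level:

```python
import numpy as np

def level_marginals(children, rho2, root, k):
    """Marginal law of X_0, ..., X_{k-1}: from node r, jump to a child t of r
    with probability proportional to rho2[(r, t)], the squared partial
    correlation between beta_r and beta_t, as in (7.4)."""
    probs = {root: 1.0}
    out = [dict(probs)]
    for _ in range(k - 1):
        nxt = {}
        for r, pr in probs.items():
            w = np.array([rho2[(r, t)] for t in children[r]], dtype=float)
            w /= w.sum()  # normalize over ch(r)
            for t, wt in zip(children[r], w):
                nxt[t] = nxt.get(t, 0.0) + pr * wt
        probs = nxt
        out.append(dict(probs))
    return out
```

When all the rho2 weights are equal, the walk reduces to the natural random walk on T.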
Given the above definitions, we define the following symmetry condition: there exists a k × k symmetric matrix C = (c_dp)_{d,p=0}^{k−1} such that ρ_tr is determined by c_{ℓ(t)ℓ(r)} and the law of X whenever r ⪯ t or t ⪯ r, and ρ_tr = 0 otherwise; we refer to this as condition (S). Note that ρ_tr is invariant to coordinate-wise rescaling of β_T, and therefore both property (S) and the transition kernel of X are invariant to rescalings. We can therefore consider, without loss of generality, a rescaled version β̃_T = (β̃_t)_{t∈T} of β_T. Condition (S) can then be written in terms of the precision matrix Q̃ of β̃_T as

Q̃_tt = P(t) and −Q̃_tr = c̃_{ℓ(t)ℓ(r)} P(t ∩ r) for t ≠ r. (S̃)
The rescaled version β̃_T will be helpful in designing an appropriate multigrid decomposition of β_T. Moreover, property (S̃) is closed under symmetric hierarchical reparametrizations.
Definition 2 (Symmetric hierarchical reparametrizations). We say that β_T = Λα_T is a symmetric hierarchical reparametrization of α_T if the coefficients of Λ = (λ_tr)_{r⪯t, t∈T} depend only on the levels of r and t in the hierarchy T.
Lemma 3. Property (S̃) is closed under symmetric hierarchical reparametrizations, meaning that if β̃_T satisfies (S̃) then any symmetric hierarchical reparametrization of β̃_T satisfies (S̃) too.
Various special cases of Model NSk satisfy assumption (S). For example, we now consider three cases: a fully symmetric case (both the tree T and the variances (τ t ) t∈T are symmetric), a weakly symmetric case (non-symmetric tree and symmetric variances) and a non-symmetric case (both tree and variances non-symmetric).

Model Sk (Symmetric k-levels hierarchical model). Consider the k-level Gaussian hierarchical model in which the observed data are generated from the symmetric analogue of Model NSk. It is easy to see that the posterior distribution of Model Sk, conditioned on the observed data y = (y_{i_1,...,i_{k−1},j})_{i_1,...,i_{k−1},j}, satisfies (S). In this case the random walk X defined by (7.4) coincides with the natural random walk on T.
Example 2 (Weakly symmetric case). Another special case of Model NSk satisfying (S) is given by a general tree T with precision terms defined as τ_t = τ_{ℓ(t)} ∏_{s≺t} |ch(s)| for all t ∈ T and τ^(e)_t = (τ_e/n_t) ∏_{s≺t} |ch(s)|, where (τ_1, . . . , τ_k, τ_e) ∈ R^{k+1}_+ are level-specific precision terms. This is an extension of Model Sk in which the lack of symmetry of T is compensated by appropriate variance terms. Condition (S) can be checked by evaluating the partial correlations (ρ_tr)_{t,r∈T} of the resulting vector γ_T.

Example 3 (Non-symmetric cases). In both Model Sk and Example 2 the auxiliary Markov chain X defined in (7.4) follows a natural random walk, in the sense that at each time the chain chooses the next state uniformly at random among the children nodes. However, condition (S) is also satisfied by non-symmetric cases where X is not a natural random walk. In particular, any instance of Model NSk satisfying (S*) for some (k − 1)-dimensional vector (c_0, . . . , c_{k−2}) induces a posterior distribution satisfying (S). In fact, in the context of Model NSk conditions (S*) and (S) are equivalent (this can be derived from (T) and (7.4)).
The cases considered above are expressed in terms of the centred parametrization, meaning that, like all instances of Model NSk, they satisfy (T). Nevertheless, Lemma 3 shows that any symmetric hierarchical reparametrization of a vector satisfying (S̃) still satisfies (S̃). This implies, for example, that the fully non-centred version of Model Sk, and any mixed strategy where some levels are centred and some are not, still satisfies (S̃) (after rescaling). Moreover, the exact analysis we now provide for the Gibbs Sampler on models satisfying (S̃) can be used to provide bounds on general cases that do not satisfy (S̃) (see for example Theorem 8).

Multigrid decomposition
We now show how to use the multigrid decomposition to analyse the Gibbs Sampler for random vectors β_T satisfying (H) and (S). Our aim is to provide a transformation of β_T that factorizes the Gibbs Sampler Markov chain into independent and more tractable sub-chains. Similarly to Section 3, in the following we will often denote β_{T_d} = (β_t)_{t∈T_d} by β^(d). We proceed in two steps, first introducing the averaging operators φ^(p) and then the residual operators δ^(p).

For any p ≤ d the averaging operator φ^(p) is defined by φ^(p)β^(d) = E[β_{X_d} | β_T, X_p], where X = (X_ℓ)_{ℓ=0}^{k−1} is the Markov chain defined by (7.4). Loosely speaking, φ^(p)β^(d) can be interpreted as the collection of averages of β^(d) at the coarseness corresponding to the p-th level of the hierarchy. In particular φ^(d)β^(d) = β^(d), while φ^(0)β^(d) is a single global (weighted) average.

Example 4 (Averaging operators in the symmetric case). Let β_T = γ_T be given by Model Sk. Then the averaging operators reduce to plain averages over the sub-branches of the tree.

Given φ, we define the residual operators δ^(p) as the differences between averaging operators at consecutive coarseness levels.

Theorem 9 (Multigrid decomposition for k levels). Let (β(s))_{s∈N} be a Markov chain evolving according to GS(β_T) with β_T satisfying (H) and (S̃). Then the functionals (δ^(0)β(s))_s, . . . , (δ^(k−1)β(s))_s are k independent Markov chains. Moreover, each chain δ^(p)β(s) = (δ^(p)β^(d)(s))_{d=p}^{k−1} evolves according to a blocked Gibbs Sampler scheme with (k − p) components: for d going from p to k − 1, sample δ^(p)β^(d) from its full conditional, where L(X|Y) denotes the conditional distribution of X given Y.
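For the symmetric two-level special case (the leaf layer of a Model S3-type hierarchy, viewed as an I × J array), the residual decomposition can be sketched concretely. In general the operators use the weights P(t | r) of the walk X, which here reduce to plain averages, so this is an illustrative special case rather than the general definition:

```python
import numpy as np

def multigrid_residuals(beta):
    """Decompose leaf values beta (shape I x J) into residual components in the
    spirit of delta^(0), delta^(1), delta^(2): a global average, group-average
    residuals, and within-group residuals."""
    phi0 = beta.mean()            # level-0 coarseness: global average
    phi1 = beta.mean(axis=1)      # level-1 coarseness: group averages
    delta0 = phi0
    delta1 = phi1 - phi0          # group averages, recentred
    delta2 = beta - phi1[:, None] # within-group residuals
    return delta0, delta1, delta2
```

The three components reconstruct beta exactly, and the residual components each average to zero, mirroring the fact that the sub-chains carry non-overlapping information.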

Hierarchical ordering of rates
Combining the results in Roberts and Sahu (1997, Section 2.2) with the multigrid decomposition, we can characterize the rates of convergence of the k independent Markov chains described above as follows.
Theorem 10. The rate of convergence of (δ^(p)β(s))_s is given by the largest-modulus eigenvalue of (I_{k−p} − L)^{−1}U. Here I_{k−p} is the (k − p)-dimensional identity matrix, while L and U are, respectively, the strictly lower and strictly upper triangular parts of the matrix (c̃_{dℓ})_{d,ℓ=p}^{k−1}, with the c̃_{dℓ} given by (S̃).
Theorem 10 implies that the convergence properties of the k independent Markov chains are closely related. In particular, from the point of view of rates of convergence, the k Markov chains updating δ^(p)β for p = 0, . . . , k − 1 behave as Gibbs Samplers targeting a decreasing number of dimensions (from k down to 1) of a single k-dimensional Gaussian distribution with precision matrix given by −C̃, where C̃ = (c̃_{dℓ}) is given by (S̃). This suggests that the convergence properties of the sub-chains will typically improve from (δ^(0)β(s))_s to (δ^(k−1)β(s))_s, and that the rate of convergence of (δ^(0)β(s))_s will typically determine the rate of the whole sampler GS(β_T). In the centred parametrization case we can use the well-known Cauchy interlacing theorem (see e.g. Bhatia, 2013) to show that the rate of convergence is indeed monotonically non-increasing from (δ^(0)β(s))_s to (δ^(k−1)β(s))_s.
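The rate in Theorem 10 is straightforward to evaluate numerically. A sketch for a generic Gaussian target with precision matrix Q, following the deterministic-scan formula of Roberts and Sahu (1997) (the function name is ours):

```python
import numpy as np

def gibbs_rate(Q):
    """L2-rate of the deterministic-scan Gibbs Sampler on a Gaussian with
    precision Q: the spectral radius of (I - L)^{-1} U, where I - L - U is Q
    rescaled to unit diagonal."""
    d = 1.0 / np.sqrt(np.diag(Q))
    C = Q * np.outer(d, d)        # unit-diagonal rescaling of the precision
    L = -np.tril(C, -1)           # strictly lower part (sign flipped)
    U = -np.triu(C, 1)            # strictly upper part (sign flipped)
    B = np.linalg.solve(np.eye(len(Q)) - L, U)
    return max(abs(np.linalg.eigvals(B)))
```

For a bivariate Gaussian with correlation r this returns r², the classical rate of the two-component Gibbs Sampler; for a diagonal precision it returns 0, since independent coordinates converge in one sweep.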
Theorem 11 (Ordering of rates for centred parametrization). Let γ_T be a Gaussian vector satisfying (T) and (S̃), and let (γ(s))_{s∈N} be the corresponding Markov chain evolving according to GS(γ_T). Then the convergence rates of the k independent Markov chains (δ^(0)γ(s))_s, . . . , (δ^(k−1)γ(s))_s satisfy

ρ(δ^(0)γ(s)) ≥ ρ(δ^(1)γ(s)) ≥ · · · ≥ ρ(δ^(k−1)γ(s)) = 0. (7.9)

In Theorem 11 we needed the additional assumption (T) to prove (7.9). The reason is that, while in most cases the convergence rate of a deterministic-scan Gibbs Sampler targeting an n-dimensional Gaussian distribution improves if one of the coordinates is conditioned to a fixed value and the sampler targets only the remaining (n − 1) coordinates, this is not true in general. Example 2 of Roberts and Sahu (1997) provides a counter-example (see also Whittaker, 1990, page 319); although in Roberts and Sahu (1997) this example was used as a counter-example regarding blocking strategies, it also works in the present context. We note that, if one were to consider a random-scan version of the Gibbs Sampler, the reversibility of the induced Markov chains would allow us to prove the ordering result of Theorem 11 without assuming (T). We leave this as a direction of future research and briefly mention it in Section 8.
Theorem 11 implies the following corollary.

Corollary 6. Let γ_T be a Gaussian vector satisfying (T) and (S̃). Then the rate of convergence of GS(γ_T) is given by the largest squared eigenvalue of the
In the special case of Model Sk it is easy to deduce the following result.

Example: rates of convergence for 4-level models
The results developed in Sections 7.4 and 7.5 allow the analysis of hierarchical models with an arbitrary number of levels. For example we could consider 4-level extensions of Model S3.

Model S4 (Symmetric 4-levels hierarchical model). Suppose that the data are generated as in the 4-level analogue of Model S3, where i, j, k and ℓ run from 1 to I, J, K and L respectively and the error terms ε_{ijkℓ} are i.i.d. normal random variables with mean 0 and variance σ²_e. We employ a standard Bayesian model specification with a flat prior on μ.
When λ_1 = 0 the expressions for the convergence rates are still explicit, although slightly more involved, and are reported in Section 3.1 of the supplementary material. These rates can be derived from Corollary 5 and Theorem 10. It is worth noting that also in this 4-level case the skeleton chain δ^(0)β is always the slowest chain for all centred and non-centred parametrizations (which can be checked by computing the rates of convergence of δ^(1)β, δ^(2)β and δ^(3)β using Theorem 10 and comparing them to that of δ^(0)β), even though for the general k-level case we were able to prove this fact only for the fully-centred parametrization (Theorem 11). The expressions given here can easily be used to derive conditionally optimal parametrizations for Model S4 given the rescaled variance components (σ²_i)_{i=1}^4. For example, choosing whether or not to center each level by comparing the level-specific rescaled variance with the sum of the rescaled variances of the lower levels, as in Section 3.2, leads to rates of convergence upper bounded by 3/4.
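A sketch of the resulting recipe. The direction of the comparison (center a level when its rescaled variance is at least the sum of those below it, matching the two-level rule of Section 6) is our reading of Section 3.2, so treat this as illustrative:

```python
def centre_levels(sigma2):
    """sigma2: rescaled variance components, ordered from the top level down to
    the observation noise. Returns, for each non-noise level d, whether to
    center it: center iff sigma2[d] >= sum of the rescaled variances below."""
    return [sigma2[d] >= sum(sigma2[d + 1:]) for d in range(len(sigma2) - 1)]
```

For example, with a dominant top-level variance the top level gets centred while an intermediate level dominated by lower-level noise does not.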

Conclusions and future work
In this work we studied the convergence properties of the Gibbs Sampler in the context of Gaussian multilevel models. To do so we developed a novel analytic approach based on multigrid decompositions that allows the factorization of the Markov chain of interest into independent and easier-to-analyse sub-chains. This decomposition enables us to evaluate explicitly the L²-rate of convergence for various models of interest. The results offer detailed insight into the interaction between multilevel structures (e.g. nested and crossed) and the Gibbs Sampler, provide guidance on the choice of computationally optimal parametrizations or linear constraints, which can potentially be relevant also beyond the Gaussian case (see e.g. Section 5), and give indications of which parameters to monitor in the convergence diagnostic process (see Theorem 2 and the discussion at the end of Section 2.1). Since the first preprint version of this paper, the multigrid decomposition developed here has already found other practical applications: in particular, Papaspiliopoulos et al. (2019) have successfully exploited it to analyse the computational complexity of the Gibbs Sampler in the context of crossed random effect models (see also Gao and Owen, 2017) and to design an algorithmic modification with linear computational complexity.
Together with explicit formulas for L²-rates of convergence, the multigrid decomposition developed in this paper provides a simple and intuitive theoretical characterization of behaviours commonly observed in practice when fitting hierarchical models with MCMC, such as slower mixing for hyperparameters at higher levels (see Theorems 2 and 11), algorithmic scalability with the width of the hierarchy but not with its height (e.g. Theorem 3 and Corollary 7), and good performance of the centred parametrization in data-rich contexts (Theorem 8). The results presented in this paper are a first step towards a quantitative understanding of the behaviour of MCMC algorithms (even beyond the Gibbs Sampler) in the extremely popular context of Bayesian hierarchical and multilevel models.
The present work could be extended in many directions. For example, it would be interesting to extend the results to non-symmetric cases, either by generalizing the bounds of Theorem 8 or by weakening the symmetry assumption (S). In terms of the classes of models considered, a natural and important extension would be the multivariate case (where each parameter γ_t is a multivariate random vector) and the regression case. We expect many of the results developed in this work to extend to the multivariate and regression cases, although there the role played by non-symmetric structures will be more crucial. Another important class of models that would be worth approaching with methodologies analogous to the ones developed here are models based on Gaussian processes, commonly used, for example, in spatial statistics (see e.g. Bass and Sahu, 2016b).
An important and ambitious aim would be to extend the results to other tractable distributions within the exponential family beyond the Gaussian case. A starting point for this could be to analyse Model CkP as mentioned in Remark 2. Also, many non-Gaussian hierarchical models can be well-approximated by Gaussian ones for sufficiently large data sets, so that it is reasonable to conjecture that the qualitative conclusions (at least) of our study might remain valid when extrapolated to non-Gaussian models, rather like the analysis given in Sahu and Roberts (1999). A detailed study of this question is left for future work.
We have concentrated in this paper on deterministic-scan samplers. However, explicit rates of convergence for random-scan samplers are also available in the Gaussian case, as described in Amit (1996) and extended in Roberts and Sahu (1997). Deterministic- and random-scan samplers can sometimes differ substantially in their convergence properties (see for example Roberts and Rosenthal, 2015), although no general theory of this phenomenon is available, so the insights of this work could be particularly useful in this direction. Also, in the random-scan case the reversibility of the induced Markov chains would allow us to apply the Cauchy interlacing theorem under weaker assumptions than in Theorem 11, and thus to prove ordering results for general hierarchical parametrizations β_T = (β_t)_{t∈T}.
While this work focuses on L²-rates of convergence, the same approach could be used to derive bounds on the distance (e.g. total variation or Wasserstein) between the distribution of the Markov chain at a given iteration and the target distribution (see e.g. Amit, 1996; Roberts and Sahu, 1997, (15); and Khare et al., 2009, Section 4.4). Such a formulation would be useful to connect with the recent growth of literature providing rigorous characterizations of the computational complexity of Bayesian hierarchical linear models; see for example Rajaratnam and Sparks (2015); Roberts and Rosenthal (2016); Johndrow et al. (2015). In order to provide full characterizations, however, the case of unknown variances should be considered (see e.g. Jones and Hobert, 2004 for the two-level case).

Supplementary Material
Proofs and additional material on the simulations (DOI: 10.1214/20-BA1242SUPP; .pdf). The supplementary material contains the proofs of the theoretical results presented in the paper, as well as additional formulas and material related to the simulations.