Scalable Bayesian computation for crossed and nested hierarchical models

,


Motivation and objectives
In this article we explore efficient computational frameworks to perform Bayesian inference in large-scale hierarchical models.We focus on two canonical hierarchical structures, which are arguably the two most common in applied Statistics; crossed effects models, where the data is organized in a multidimensional contingency table and the within-cell distribution is modeled in terms of a linear expansion of a priori independent effects; and nested multilevel models, where data is organized in hierarchical clusters and parameters in a given level of the hierarchy are shrunk to (fewer) parameters in the deeper level.Crossed effects models are the canonical framework for modeling the dependence of an output variable on a number of categorical input variables.In the literature they appear under various names, e.g.cross-classified data, variance component models or multiway analysis of variance [12,31,37].Nested multilevel models are the basic hierarchical structure for data and parameters that are naturally organized in nested clusters [13].The detailed specifications of these models are given in Sections 2 and 3, respectively.
The number of parameters in these models will be denoted by p and the number of observations by n.The computational challenge is to sample from the corresponding posterior distributions in high-dimensional scenarios where both n and p are large.The "holy grail" is MCMC algorithms whose total computational cost scales linearly in max{n, p}.We define carefully the notion of computational cost later in the article that involves both the cost per iteration and the number of iterations needed.We call algorithms whose cost is linear in max{n, p}, scalable.
Both structures, crossed and nested, involve a high-dimensional vector of regression parameters, and a low-dimensional vector of hyperparameters.Priors for the regression parameters are usually Gaussian while the data likelihood is generic, and in this article we consider methods and numerics for a wide variety of such choices.The class of sampling algorithms we consider are (variations or approximations of) block-updating coordinate-wise schemes -a.k.a.blocked Gibbs samplers -that alternate updating the regression parameters given the hyperparameters, and vice-versa.The main computational challenge, which is the focus of this paper, is how to update the high-dimensional regression parameters in a scalable way.
The paper is structured as follows.Section 2 is devoted to crossed effect models.After describing the models and the high-level computational approach, in Section 2.2 we propose a methodology for scalable sampling of the high-dimensional regression parameters in crossed models with generic likelihood, using what we call a local centering approach.In Section 2.3, we consider the practically important special case of multicategorical logistic likelihood where, beyond applying local centering ideas, we introduce a novel scheme for sampling efficiently without imposing identifiability constraints.These can instead be imposed, if desired, at a post-sampling stage.In Section 2.4, in the specific context of binomial likelihoods, we introduce an alternative scheme that combines the collapsed Gibbs sampler recently proposed in [24] with Polya-Gamma data augmentation [25].Finally, Section 2.5 provides a theoretical analysis of the complexity of the above schemes, we establish scalability under certain assumptions and provide comparison results.In Section 3 we consider nested multilevel models.We first show, in Section 3.1, how to use sparse linear algebra methods to update the high-dimensional regression coefficients for Gaussian likelihoods in a scalable way (which is not directly possible for crossed models).Then, in Section 3.2, we combine the this methodology with data augmentation approaches to obtain scalable samplers for non-Gaussian likelihoods, such as binomial, Cauchy or Poisson.The article contains various numerical experiments, including both simulated data and illustrations on two challenging large-scale applications, one on electoral survey data and one on real-estate modeling.In all cases, we compare our proposed methods to state of the art alternatives, including Hamiltonian Monte Carlo, showing superior performance, often by more than an order of magnitude.Our numerical experiments also suggest that the conclusions from our theory extrapolate beyond the specific assumptions required for the theoretical analysis.Throughout the article, we use boldface letters to denote vectors (when lower case) and matrices (when upper case).

Crossed effects models
We consider the following class of crossed effects models.There are K categorical input variables, known as factors, each with p k categories, known as levels, and their numbering is arbitrary.In this context, p = K k=1 p k corresponds to the sum of levels over all factors.The combination of the levels of the different factors defines a p 1 × • • • × p k contingency table.For each observation j ∈ {1, . . ., n} we record the values of the categorical input variables and an output variable that could be multivariate and will be denoted by y j , with L denoting the dimension of the response.Typical examples for L = 1 are continuous, binary or count observations y j , and multicategorical observations y j for L > 1, with details following below.The distribution of y j depends on a linear predictor η j : where X j are continuous input variables (excluding an intercept), β and a (0) are what we typically call fixed effects and associated to each level of each categorical factor there is a random effect.The notation we use, which is quite standard in some of the relevant literature (see, e.g., Section 1.1 and Chapter 11 of [13]) is that i k [j] refers to the level that the jth observation has on the kth factor.Therefore, the available data per "individual" j (in our election survey example, a survey respondent) are the output data y j (e.g., a 1-hot encoded vector for the party they plan to vote for), the strata i k [j] they belong on each categorical factor k (e.g., reside in Catalunya, unemployed, higher education etc), and other, if any, continuous inputs X j .In our notation, a (k) i is the random effect associated to the level i of the kth factor, and its dimension is L, just as for the observations.We model all random effects as a priori independent with factor-dependent variances.The main reason why we allow for multivariate random effects in this article is to deal with multi-categorical responses, as in the election survey application, in which case there are precision matrices describing the dependence of the elements of each multivariate random effect.
At this level of generality, with L denoting laws of random variables, the crossed effects model is as follows: i k [j] , j = 1, . . ., n, where N denotes Gaussian distributions, L(y j | •) denotes the law of y j given all other variables, ⊗ stands for the Kronecker product, T k are L×L precision matrices, a (k) is the vector of all stacked a (k) i 's, and the I's are identity matrices of appropriate dimension.We understand the limiting case µ pr = 0, T pr = 00 T as the improper flat prior on a (0) .When L = 1 we denote the corresponding precision scalars by τ k and assign them gamma priors.The "fixed effects" coefficients β are typically assigned a Gaussian or flat prior.
In our numerical illustrations we focus on three important special cases: Gaussian models, where L = 1 and L(y j | η j , τ ) = N (η j , 1/τ ) (with a gamma prior on τ ); binomial models, where L = 1 and L(y j | η j ) = Binomial(m j , (1 + exp{−η j }) −1 ); and multicategorical logit models, where L > 1, y j is an L-dimensional vector of zeros and a single one (one-hot encoding), and L(y j | η j ) = Categorical(softmax{η j }), with The multicategorical logistic model introduces the further complication of being illidentified due to softmax being invariant under translations of η.In Section 2.3 we address this and other methodological issues specific to this model.

Computational strategy
For simplicity we focus on the case of no fixed effects (β = 0), but our methodology extends easily to when the model includes fixed effects, as we discuss below.In this context the regression-type parameters are a = (a (0) , a (1) , . . ., a (K) ).We let γ refer collectively to the precision matrices T k and potential parameters specific to the likelihood L(y j | η j ).The methodology developed in the following sections is about sampling from L(a | y, γ), where y = (y j ) j=1,...,n denotes all the observed data.We augment such methods with steps that update L(γ | y, a), which in our context is easily done by sampling from gamma and (low-dimensional) Wishart distributions to obtain a block coordinate-wise sampling algorithm for L(a, γ | y).Potential issues related to dependence between a and γ in the joint posterior distribution L(a, γ | y) are discussed in the numerical parts and in the discussion on future work in Section 4.
The methods we discuss are generally based on partitioning the regression parameters into K + 1 blocks, (a (0) , a (1) , . . ., a (K) ).Previous work by [8,24] has proved that the complexity of the vanilla Gibbs sampler that updates each a (k) conditionally on the rest (k = 0, 1, . . ., K) is typically super-linear in n and p. [24] developed an approach based on a joint update of (a (0) , a (k) ) for each k = 1, . . ., K, which under certain conditions is provably scalable.Their methodology, which they call collapsed Gibbs sampling, is only applicable for Gaussian likelihoods.Below we develop methodologies that extend these ideas to the case of general likelihoods.In some cases one can take advantage of data augmentation techniques, such as Polya-gamma data augmentation for binomial likelihoods, to restore conditional Gaussianity and directly apply the collapsed Gibbs sampler.We explore this in Section 2.4.On ther other hand, in Section 2.2 we develop a methodology, called local centering, that works with any (sufficiently smooth) likelihood, while inheriting the properties of the collapsed Gibbs sampler both in terms of convergence speed and cost per iteration.Note that, for Gaussian likelihoods, one could also perform a direct block update of the high-dimensional multivariate Gaussian distribution L(a | y, γ).However, the cost of this operation would be super-linear in max{n, p} for crossed effect models and thus not scalable.See also Section 2.6 and related discussions in [8,19,24,14].
Regarding the inclusion of fixed effects β in the model, we can adapt the computational approach we develop here in various ways.The simplest strategy would be to update the global coefficients β from their full conditional distribution through a Gibbs or Metropolis-within-Gibbs update, following the update of a detailed below.Regardless of the specific implementation, the key aspect is that the dimensionality of β is much lower than the dimensionality of a in our applications of interest.More robust strategies are possible, such as updating β jointly with the intercept a (0) , but we defer the discussion of such strategies, as well as of the discussion of the case of high-dimensional β, to future work.

Methodology for generic non-Gaussian likelihoods: local centering
In the same spirit as the collapsed Gibbs sampler of [24] for Gaussian likelihoods, we develop a scheme for the joint update of (a (0) , a (k) ) that is feasible for generic likelihoods.To put our proposal into perspective, we first note that off-the-shelf sampling methods will not be successful.When factor k has a large number of levels, this will be a high-dimensional sampling problem.Additionally, there is strong posterior dependence between a (0) and a (k) , due to the constraint imposed by the data via the linear predictor.The prior precision does not capture this dependence in crossed effects models, since a priori all variables are independent.Thus one cannot use gradient-based samplers for latent Gaussian models that precondition using the prior precision, as for example in [33].Moreover, even with hypothetically good preconditioning, the convergence time of standard gradient-based samplers targeting the high-dimensional joint distribution of (a (0) , a (k) ) will typically grow with p k (see e.g.[26,7,36,40] and related literature for various analysis of the dimensionality dependence of gradient-based MCMC algorithms), resulting in algorithms with super-linear complexity w.r.t.n and p (see Section 2.5 for proper definitions on convergence).
Instead, we explicitly leverage the sparse conditional independence structure of L(a (0) , a (k) | y, γ, a (−0,−k) ), where a (−0,−k) is a short form of (a (ℓ) ) ℓ̸ =0,k .We propose to use local hierarchical centering within each block: Working locally with a centered parameterisation is motivated by the general principles of parameterisation of hierarchical models: as discussed in [11,23], the model correlation structure is such that the conditional posterior dependence is weaker between the centered (a (0) , ξ (k) ) than the non-centered (a (0) , a (k) ) pair.Now, note that due to the hierarchical structure, we obtain that L(a (0) | y, γ, a are independent across i conditionally on a (0) and the levels of the other factors, hence a high-dimensional sampling problem reduces to p k independent L-dimensional ones.With these observations made, we propose Algorithm 1.

Algorithm 1 Gibbs sampler with local centering for non-Gaussian likelihoods
for k = 1 : We now detail a gradient-based scheme that leaves L(ξ The scheme is most concisely described by using the symbol f i ) to refer to the loglikelihood of y as a function of a (k) i , holding other parameters fixed.When L = 1, we use a second-order Metropolis-Hastings proposal that requires no tuning parameters and is easy to implement.The approach is based on a second-order expansion of the corresponding density around the current value and amounts to generating a proposal i (ξ When L > 1, we use the gradient-based Metropolis-Hastings sampler of [33], which and accepting it with probability .
The scalars δ are level-specific step size parameters that we tune adaptively according to the Robbins-Monro procedure described in, e.g., Algorithm 4 of [1], with target acceptance rate equal to 0.5 and learning rate at iteration t set to t −0.5 .Notice that the matrices D (k) i for different i's share the same eigenspace.Accordingly, an efficient implementation first eigendecomposes T k = V ΛV T , then computes then simulates the proposal and evaluates its density.
Note that, while the local parameterisation proposed above is effective (as we show below), a global reparameterisation of the model would not be successful.The intuition is that one cannot simultaneously center all K factors.Indeed, [41,Theorem 5] show that for crossed effects models with K = 2 factors, all global reparameterisations based on hierarchical centering lead to a complexity that is super-linear in n and p.

Numerical illustration
To demonstrate the scalability of the local centering approach, we generate data of growing dimension, and infer the corresponding models using the local centering and i , and dashed to τ −1 k , for k = 1, 2. Estimates are computed by averaging over 10 MCMC runs with 10000 iterations each.We set the priors τ k ∼ Gamma(1/2, 1/2), a flat prior on a (0) and τ ∼ Gamma(1/2, 1/2) for the Gaussian model.Note that whereas the simulation of data was done with different values of p k , here we show the results as a function of the implied n.
the vanilla Gibbs samplers.Elements of a are sampled from a standard Gaussian with settings L = 1, K = 2, p 1 = p 2 and increasing values of p 1 .Observations y are missing completely at random, where each cell in the contingency table is missing with probability 0.9, therefore n ≈ 0.1p 2  1 .For the non-empty cells we simulate observations either from the Gaussian distribution L(y j | η j ) = N (η j , 1), or from the binomial law L(y j | η j ) = Binomial(1, (1 + exp(−η j )) −1 ), and we use the corresponding likelihood when we learn the models.The precision parameters τ k and the residual precision τ of the Gaussian model are treated as unknown.
We monitor the empirical estimates of integrated autocorrelation times (IAT) for various parameters, and report them in Figure 1.The IAT is a measure of MCMC sampling efficiency (less being better), whose definition and interpretation can be found in Appendix B. For the local centering algorithms this quantity remains upper bounded with respect to n (and thus also of p).This happens for all parameters and suggests that the convergence properties of Algorithm 1 do not deteriorate as n and p increase.Combining this with considerations of cost per iteration, we conjecture that Algorithm 1 is scalable.Section 2.5 contains a more detailed discussion and some theoretical results that support the conjecture.

Methodology for categorical likelihoods
The local centering approach of Section 2.2 applies to any sufficiently smooth likelihood.A case of special interest, see e.g. the application to electoral survey data below, is the categorical logistic regression model with likelihood with softmax defined as in (2).In this model, L(y j | η j ) = L(y j | η j + c1) for any constant c, where 1 denotes a vector of 1's of appropriate dimensionality.Thus, η j is only identified by the likelihood up to such translations, which is also the case with the factor levels a i .Of course, given proper priors, the resulting posterior is proper.Therefore, we set proper priors throughout this section, including on a (0) , albeit weakly informative.Even with a proper prior, the posterior is typically much more informative about the differences in the elements of a (k) i than it is about the mean of said vector.As a consequence, the elements have strong posterior dependence, and an MCMC algorithm would need to propose highly correlated moves to fully traverse the posterior.In particular, the Metropolis-Hastings algorithm introduced in Section 2.2 will fail to propose such moves, since it approximates L(ξ ) by way of its gradient and the prior covariance, neither of which capture the extent of the dependence.Therefore, its mixing for categorical likelihoods can be poor, and we provide some numerical evidence to that effect in Figure 3. Since the differences of the elements of a (k) i are usually well-identified, we do expect the Metropolis-Hastings algorithm to mix adequately once the magnitude of the vector has been fixed.Therefore, one way of addressing this issue is to change the prior on the factor levels by imposing an identifiability constraint.One such constrain is to set the L-th element a (k) iL of the vector a (k) i to 0, define the remaining L − 1 elements as ȧi , and specify the Gaussian prior with an analogous constraint on a (0) .We may then proceed as in Section 2.2, replacing the symbols (a (0) , a (k) , T 0 , T k ) with ( ȧ(0) , ȧ(k) , Ṫ 0 , Ṫ k ).In doing so, we keep in mind that since a (k) iL = 0, the linear predictors have a trailing element fixed to 0, i.e.
which needs to be accounted for when evaluating the likelihood and its gradient.
While this solution is computationally effective, it is statistically unsatisfactory because different identifiability constraints lead to different posterior inferences for the outcome probabilities of the categorical variable.We think it preferable to improve the algorithm without changing the model, and leave it to the modeler to impose constraints post-hoc if desired.Hence, we propose a novel scheme that, although conceptually close to the identifiability constraint described above, does not change the prior but rather changes the sampling algorithm.Rather than tethering one of the elements to 0, we follow the principle set out in Section 2.2 by updating the well-identified quantities by Metropoliswithin-Gibbs, and the ill-identified translation through an additional Gibbs step which only involves the prior.To obtain the well-identified quantity, define ä(k iL 1, and similarly ä(0) as the first L−1 elements of a (0) −a Note that the data y are conditionally independent of a then the implied prior on ä(k) with corresponding expressions for ä(0) .Thus, our proposed scheme is to apply the methodology of Section 2.2 starting from (ä i , ä(0) ) and their implied priors, merely replacing (a (0) , a (k) , T 0 , T k ) with (ä (0) , ä(k) , T 0 , T k ), and again keeping in mind that the linear predictors are as in (8).We then recover a with similar expressions for a (0) L .Thereafter, we transform back according to In effect, the resampling of a (0) L according to simple prior-induced Gaussian computations induces the required positive dependence in the update.This procedure is repeated for all factors.We refer to an algorithm using the above procedure as a projected algorithm.The numerical experiments of the following section suggest that the projected algorithm with local centering mixes at least as well as its counterpart based on prior constraints.If required, such constraints can still be imposed post-sampling.

Illustration on electoral survey data
We experimentally assess algorithmic performance on a large-scale electoral survey dataset.The former previously featured in [20], and allows us to assess algorithms under realistic k factor name no. of levels (p k ) conditions that arise in applied work.Electoral surveys, consisting of the self-reported voting intention and various demographic traits for a sample of potential voters, provide one of the most natural applications for crossed-effects modeling, see for example [20,16] and references therein.Here the response variable is the multicategorical voting intention, with each category corresponding to a different party.Demographic traits such as location and age serve as input variables; these may be stratified into multicategorical variables.Electoral surveys are useful both for disaggregating the national voting intention to local districts, which is crucial for predicting parliamentary representation, and for dealing with insurgent political parties with significant electoral success, which is becoming a common phenomenon in European politics.The survey was carried out by the Centro de Investigaciones Sociológicas ahead of the November 2019 Spanish general elections.For the experiments in this paper we focus on a simplified data structure with L = 3 parties, K = 7 input variables, and 210 regression parameters overall (3 per level, and p = 70 levels overall).There are n = 9234 survey respondents and the input variables define a contingency table with 33696 cells, of which only 4764 are non-empty, hence the table is sparse.Table 1 summarises the dataset characteristics.
The response being multi-categorical, we find ourselves in the setting of Section 2.3.In view of that discussion, we compare four algorithms: the vanilla Metropolis-within-Gibbs sampler for the constrained model (const/Van-MwG), the locally centered algorithm for the constrained model (const/LC-MwG), the locally centered algorithm for the unconstrained model (free/LC-MwG), and the projected local centering algorithm for the unconstrained model (free/PLC-MwG).Table 2 in the Appendix provides a list of the algorithms discussed in this article and their acronyms.Notice that the first two sample from a different posterior than the last two.
We carry out two experiments on this elections dataset, one of which is a comparison between const/Van-MwG, const/LC-MwG and the generic Stan implementation of NUTS, showing that const/LC-MwG provides a consistent speedup of more than an order of magnitude compared to Stan and const/Van-MwG, see identifies the mean of a (k) and breaks dependence with a (0) .Factors with few levels remain strongly dependent with a (0) a posteriori, and therefore mix slowly.This finding is in accordance with the theoretical results of [24] (e.g., their Theorem 3) that predict this behaviour, albeit under conditions not met in this situation.
The other experiment is a comparison between the various bespoke Gibbs samplers, which illustrates the importance of either constraining or projecting before local centering, see Figure 3.Both the const/LC-MwG and the free/PLC-MwG algorithms perform well, while the mixing of most parameters in the free/LC-MwG algorithm is two orders of magnitude slower.The extent of the slowdown depends on the strength of the prior, which can restrict the possible range of translations in the softmax; a (0) and factors k with many levels and hence well-identified hyperparameter T k mix better.Since const/LC-MwG and free/PLC-MwG lead to similar computational efficiency, we recommend the use of the unconstrained model with the free/PLC-MwG algorithm and impose, if required, additional constraints in a post-sampling stage.

Methodology for binomial likelihoods: Polya-gamma augmentation and collapsed Gibbs sampling
While the local centering approach described above has the appeal of applying to general likelihoods, it is not optimal for Gaussian likelihood since in that instance we can directly sample L(a (0) , a (k) | y, γ, a (−0,−k) ) by exploiting Gaussian properties, as shown in [24] (and revisited later in this section).For certain non-Gaussian likelihoods we can still benefit from this direct approach when there are latent variables z such that L(a (0) , a (k) | y, z, γ, a (−0,−k) ) is Gaussian.We focus here on one such case of significant practical importance, that of binomial likelihod.[25] observed that when L(y j | η j ) = Binomial(m j , (1 + e −η j ) −1 ), then the corresponding probability mass function admits the following mixture representation: where PG(z; α, β) denotes the density of the so-called Polya-gamma distribution with parameters α and β evaluated at z.The Polya-gamma distribution is constructed in [25] as a weighted, infinite sum of Gamma random variables.For integer-valued α, a PG(α, β)-variable is obtained as a sum of α PG(1, β)-variables, while simulation of a PG(1, β)-variable is done by exploiting the representation of its density as a bounded Cauchy series whose increments have alternating signs.
This neat construction leads to the following computational strategy.We augment the space with the vector z = (z 1 , . . ., z n ) and design a sampler for L(a, γ, z | y).It is fairly simple to check (see [25]) that with m j and η j defined above.Another direct calculation shows that L(a | y, z, γ) is Gaussian.Then, we can follow the same paradigm as in Section 2.2 but sample L(a (0) , a (k) | y, z, γ, a (−0,−k) ) directly from the corresponding Gaussian distribution.Algorithm 2 summarizes how this step is carried out.

Algorithm 2
The collapsed Gibbs sampler for L(a|y, γ, z).
Define the quantities An obvious advantage of this approach over the one based on local centering, when applicable, is the avoidance of the Metropolis-Hasting step, which typically slows down mixing and requires tuning.However, there are two disadvantages to the Polya-gamma augmentation.One is that the dimension of the distribution to be sampled is increased by the n additional latent variables.Another is that sampling PG(m j , η j ) has computational cost O (m j ), hence it will be inefficient for responses with large counts (although see [39] for some mitigation of this problem).
Notice that we may contrast the above algorithm, which we call the Polya-Gamma collapsed Gibbs sampler (PG/Col-G), with another sampler on the augmented space that updates all effects one at a time and we call the Polya-Gamma vanilla Gibbs sampler (PG/Van-G).We also compare those samplers to the local centering algorithm (LC-MwG) applied to binomial likelihood.Figure 4 shows that the scalable algorithms LC-MwG and PG/Col-G significantly outperform the non-scalable algorithm PG/Van-G on the elections dataset.As in Figure 2, the vanilla algorithm slows down the most for the factors with few levels.Note that while PG/Col-G and LC-MwG have similar ESS per iteration, PG/Col-G has substantially higher iteration time due to the latent variable update.

Theoretical analysis
We now perform some theoretical analysis of the complexity of the algorithms described above.The computational complexity of an MCMC algorithm depends both on the cost per iteration and on the number of iterations required to obtain each effective sample.
In this work, we quantify the number of iterations required through the relaxation time of the associated Markov chain (other options are possible, but we do not discuss them here), leading to the following definition.
Definition 1.We define the complexity of an MCMC algorithm as where the relaxation time refers to the reciprocal of 1 minus the geometric rate of convergence of the associated Markov chain [28, Proposition 1].
We now analyse the cost per iteration and the relaxation time of the proposed schemes.Following the discussion in Sections 1 and 2.1, our theoretical analysis applies to the algorithm for sampling L(a | y, γ), even though all our numerics pertain to sampling L(a, γ | y).

Cost per iteration of the proposed schemes
The cost per iteration of the Gibbs sampler based on local centering proposed in Section 2.2 is of order O nKL + KL 3 + pL 2 , as we show now.The evaluation of the gradients terms in (4) has a O (nL) cost for each factor, since implies that exactly N terms of the form ∇ a for all i at O p k L 2 cost.Summing over factors we obtain the O nKL + KL 3 + pL 2 cost.The update from L(a (0) | ξ (k) ) requires O L 3 operations, which does not influence the overall cost and can further be reduced to O L 2 if the prior precision T pr is diagonal and T k has already been eigendecomposed.
Conveniently, with L = 1, this formula also holds for the collapsed Gibbs sampler for binomial likelihoods of Section 2.4, i.e. the complexity of updating L(a (0) , a (k) |y, a (−0,−k) , γ, z) for all k = 1, . . ., K is O (nK + p).This is due to needing to sample from p Gaussian distributions, and carrying out O (nK) summations to calculate their parameters.When sampling L(z | y, a, γ) we incur the additional cost of O j m j , for a total cost of O nK + p + j m j .
For the categorical const/LC and free/pLC algorithms, notice that if the response is L+1categorical, then local centering is done in L dimensions.The additional computations in the free/pLC algorithm can also be carried out with complexity O KL 3 .Conversely, for free/pLC, the update to γ is still L + 1-dimensional, therefore the complexity of the full algorithm is O nKL + K(L + 1) 3 + pL 2 .

Relaxation time analysis
While the relaxation time of MCMC algorithms is hard to characterize in general, that of the Gibbs sampler for crossed effects models with Gaussian likelihood has become more amenable to explicit computations combining the classical results of [27] with the so-called multigrid decomposition developed in [41,24].In this section we extend those results to analyze and compare the sampling algorithms discussed in previous sections, providing two main results.The first, contained in Proposition 2, Corollary 1 and Theorem 1, provides an analysis of local centering and a comparison of the latter with the collapsed Gibbs Sampler.These results show that local centering always increases the relaxation time compared to exact collapsing, but does so only by a constant factor, thus leading to the same asymptotic complexity.The second contribution, formalized in Theorem 2, is to expand the results in [24] by leveraging connections to random graph theory to show that both collapsing and local centering induce algorithms that are scalable as n and p diverge for average-case observation designs.
In this section we focus on models with Gaussian likelihood.Also, we assume L = 1 to keep the notation simple, even if most of the results would naturally extend to the L > 1 case.More precisely, we study algorithms targeting the law L(a | y, γ) induced by the following model for fixed hyperparameters γ = (τ, τ 1 , . . ., τ k ).We compare the following three schemes: (GS) the vanilla Gibbs sampler that updates from L(a (k) | y, γ, a (−k) ) for k = 0, . . ., K; (cGS) the collapsed Gibbs sampler updating from L(a (0) , a (k) | y, γ, a (−0,−k) ) for k = 1, . . ., K; (Loc) the Gibbs sampler based on local centering that executes Algorithm 1 with the update of ξ being the exact update from the conditional L(ξ ), which is feasible in the Gaussian case.
We first define some concepts and quantities related to the data observation pattern.Definition 2. We denote occurrence and co-occurrence counts associated to levels of different factors by We say that a design has balanced levels if n (k) i = n/p k for each factor k and level i.We define n = Kn/p for p = k p k , to be interpreted as the average number of observed data points per factor level.
The multigrid decomposition relates the rate of convergence of the Gibbs sampling Markov chain to that of simpler Markov chains.We summarise the main result in Proposition 1 below, which is an amalgamation of Theorem 1, Lemmas 1 and 2 of [24], extended to the case of local centering.The proof is analogous to the ones of the results just mentioned, and is omitted here for brevity.
Proposition 1 (Multigrid decomposition).For a factor k, let ā(k) = i a (k) i /p k be the factor average, and δa i − ā(k) be the factor residuals.Let {a(t)}, where t indexes iteration, be the Markov chain generated by either (GS), (cGS) or (Loc).Let {ϕ(t)} and {ψ(t)} be the processes that correspond to the factor averages (including a (0) ) and factor residuals respectively.Then, under balanced levels, the processes {ϕ(t)} and {ψ(t)} are Markov chains and independent of each other, and the rate of convergence of {a(t)} is the maximum of their convergence rates.Additionally, the convergence rate of {ψ(t)} is the same for all three algorithms.Proposition 1 reduces the computation of the rate of convergence of the algorithm of interest to that of the rates of two different Markov chains, one low-dimensional that operates on factor averages, and another high-dimensional on factor residuals.The latter is common for all algorithms considered, therefore the algorithms differ in the speed of the chain of the averages.Although defined on a space of fixed dimension, the chain of the averages can have a mixing time that deteriorates as n and p grow.In fact, Proposition 3 of [24] shows that vanilla Gibbs sampling is not scalable precisely for this reason in regimes where both n and p grow.On the other hand, the same proposition shows that the chain of averages generates independent samples in the case of the collapsed Gibbs sampler, therefore the rate of convergence of this algorithm is completely determined by that of the residuals.While in the case of local centering the chain of averages does not consist of independent draws, in Proposition 2 we show that it has a relaxation time that is bounded by constants that do not depend on n and p.The remaining results of this section, whose proofs can be found in Appendix A, are derived under the following assumption: (A1) Assume model (16) with balanced levels and an improper flat prior on a (0) , i.e.
The previous results imply that (cGS) and (Loc) have similar convergence properties, as formally shown below.
Corollary 1.Under (A1), the relaxation times of (cGS) and (Loc), denoted by T cgs and T loc respectively, satisfy where C is a constant depending only on γ and not on n, p or K.
Obviously, for non-Gaussian likelihoods for which the sampling of the ξ (k) i cannot be done directly, but where a gradient-based sampler is used instead, one would expect a further increase in relaxation time.However, as shown by numerical results in the above sections, this will typically involve only constants that do not depend on n and p.
Therefore, in view of Corollary 1, for the rest of the section we focus on the collapsed Gibbs sampler, and refer to its computational complexity as Cost(cGS).We now provide a more refined result on its scalability in the two-factor case, i.e.K = 2, building on previous work in [24].On the basis of the previous discussion, a characterisation of the relaxation time of the chain of residuals is central to the following result.It involves an auxiliary Markov chain on a discrete state-space that we introduce first.Definition 3. T aux is the relaxation time of the two-component deterministic-scan Gibbs sampler that targets the discrete distribution with state space {1, . . ., p 1 } × {1, . . ., p 2 } and probability mass function proportional to n for (i, i ′ ) being an element of the state space.
Theorem 1. Assume (A1) and K = 2. Then where the constants depend only on γ and not on n or p.
The factor max{n, p} comes from the cost per iteration, and min{n, T aux } comes from the relaxation time, which in view of the previous discussion is that of the chain of residuals.
To obtain some intuition about the implications of this result, it is convenient to interpret the 2-factor (K = 2) model as a recommender system, where one factor pertains to users, and the other to products.Note that n can be interpreted as the average number of products rated by a customer; it is precisely this number when p 1 = p 2 and the design has balanced levels.In very sparse designs, increasing values of n will not be associated with increasing values of n (more products and customers, but each customer rates a bounded number of products).In such regimes the sampler is scalable.In less sparse designs where n increases with n, the contingency table is more populated, and provided it is populated enough for the auxiliary Markov chain to mix well, the sampler will again be scalable.
The above discussion suggest that n and T aux typically exhibit complementary behaviour with respect to sparsity of the data co-occurrence matrix.However, the value of T aux does not only depend on the level of sparsity, but also on the specific observation pattern, encoded in the co-occurrence counts n (1,2) = (n ii ′ ) i=1,...,p 1 ;i ′ =1,...,p 2 , and can vary greatly across different designs with the same sparsity level.Indeed, T aux is the only term in the bound (17) that does not only depend on the amount of data and parameters in the model, but also on the observation pattern.We now perform a more refined analysis of Cost(cGS) in ( 17) by considering random observation patterns and relating T aux to the rich literature on spectral properties of random walks on random graphs.
We consider random binary designs with balanced levels.Specifically, given non-negative integers d 1 , d 2 and n such that n is a multiple of both d 1 and d 2 , we define D(n, d 1 , d 2 ) as the collection of all p 1 × p 2 binary designs with balanced levels and n observations in total, where p 1 = n/d 1 , p 2 = n/d 2 .In other words D(n, d 1 , d 2 ) contains all designs n (1,2)  such that n ∈ {0, 1} for all i = 1, . . ., p 1 and i ′ = 1, . . ., p 2 .We will assume that the observed design is sampled uniformly from such collection, i.e. n (1,2) ∼ Unif(D(n, d 1 , d 2 )).The following theorem provides a characterization of T aux and thus of Cost(cGS) in such regimes.
Theorem 2 suggests that the collapsed Gibbs Sampler for 2-factor models is scalable in the average case, where average refers to the data co-occurrence pattern n (1,2) .In particular, the almost sure statements ( 18)- (19) imply that for fixed d 1 and d 2 , the fractions of balanced levels designs in D(n, d 1 , d 2 ) for which the collapsed Gibbs sampler is scalable goes to 1 as n → ∞.
As shown in the proof of Theorem 2, which can be found in Appendix A, (18) follows from relating T aux to the spectrum of random bipartite bi-regular graphs.This allows to then apply an extension for bipartite graphs, which was recently developed in [5], of the so-called Friedman's second largest eigenvalue theorem.The statement in (19) then follows by combining (18) and Theorem 1.Thus, at a high level, the positive result in ( 19) is related to the fact that random graphs have near-optimal connectivity properties (i.e. they are expanders), as shown by a large body of random graph theory literature, see e.g.references in [5].
Interestingly, the upper bound in (18) goes to 1 as d 1 , d 2 → ∞.This suggests that in regimes where d 1 and d 2 grow with n, which is a reasonable assumption in many contexts, then T aux will get closer and closer to 1, which in turn by Theorem 1 implies that the relaxation time of the collapsed Gibbs Sampler will converge to 1.While this cannot be deduced from Theorem 2, since the regime studied there keeps d 1 and d 2 fixed and does not allow for d 1 and d 2 to grow n, the above results still point to a blessing of dimensionality, where the sampling algorithm actually converges faster and faster as n and p grow.This conjecture is coherent with the empirical results in, e.g., Figure 1 as well as results in [24,14].
The theory developed in this section holds for Gaussian likelihoods.We see this more as a necessity of our strategy in proving the result, as opposed to the nature of the result itself, which we believe relates to the conditional independence structure in the model.This is, again, corroborated by the findings in Figure 1, which suggest that both the change from Gibbs to Metropolis-within-Gibbs, from known to unknown γ and from Gaussian to non-Gaussian likelihood slow down convergence only by a factor that is constant w.r.t.n and p, and do not affect the scalability of the resulting MCMC algorithm.

Related literature and alternative approaches
Early work on Bayesian computation for crossed effects models includes [34] and [10, Section 6], who discuss, respectively, the use of identifiability constraints and reparameterisation for computational purposes.[8,9] provide a systematic discussion of the super-linear cost of standard frequentist fitting procedures when K = 2, and develop a scalable method of moments.[24] provide a detailed analysis of the Gibbs sampler complexity, covering also sparse balanced designs and K ≥ 2 factors, and propose the collapsed Gibbs sampler for Gaussian likelihood.[14] propose a coordinate-wise descent method for computing the MAP estimator (phrased in their article as a backfitting procedure to compute generalized least squares estimates) that has close connections to collapsed Gibbs sampling.They prove O (max{n, p}) complexity results also for somewhat unbalanced designs, their arguments are based on simple but effective concentration inequalities and linear algebra techniques.
A common practice among practitioners is the imposition of constraints within each factor, for example i a Of course, such constraints are not needed for the inference to make sense in a Bayesian formulation, if desired they can be imposed a posteriori and different constraints will lead to different inferences.To avoid confusion this is an entirely different issue compared to the identifiability issues that arise specifically to categorical logistic models and are discussed in Section 2.3.From the point of view of this article it is worth considering their impact on algorithmic performance.Theorem 6 of [41] derives the relaxation times of the vanilla Gibbs sampler for crossed effects models with Gaussian likelihood under identifiability constraints.Some constraints can make the vanilla Gibbs sampler scalable, whereas others cannot.Unfortunately, those found to be effective lead to difficult sampling problems when considering non-Gaussian likelihoods.
One could combine collapsing with techniques that seek to reduce posterior dependence between a and γ, such as parameter expansion [17,18].While useful and applicable, in the simulation studies of [24] for crossed effects models with Gaussian likelihoods, this strategy was found to have relatively limited impact on the resulting MCMC efficiency and it did not appear to change the overall complexity of the parent algorithm.Nevertheless, this is an important aspect of designing useful algorithms for large scale problems.
[19] and [16] propose mean field variational Bayes procedures for crossed models with Gaussian and logistic likelihoods, respectively, motivated by the overly high computational cost of Laplace approximations and HMC sampling for such models.There is an interesting interplay and scope for cross-fertilization between the methodology we develop in this article and variational inference and we will report our progress in that direction in future work.
Many popular software for models with Gaussian latent variables, such as lme4 [3] or INLA [30], perform approximate integration of a using variants of the Laplace approximation.However, these approaches rely on sparse linear algebra techniques, which in general incurs super-linear complexity in max{n, p} for crossed effects models, see e.g.numerical results and related discussions in [8,19,24,9,14,15].On the contrary, such methods are entirely effective for nested multilevel models, as discussed in the following section.

Nested multilevel models
We consider the following broad class of nested multilevel hierarchical models.There is an underlying hierarchical organization in the data that is structured as a tree.To give a concrete example, in the large scale applications we consider later in this Section, the spatial domain in Spain is organized according to 5-digit postal codes where any longer sequence of digits (from 2 to 5) corresponds to an administrative unit (e.g., municipality) that is nested geographically within those that omit the last digits (e.g., city, province etc).A different example is classes, nested within schools, nested within municipalities, etc.We denote the root of the tree as i 0 , its offsprings as i 0 i 1 , for i 1 = 1, . . ., n 0 , for each such offspring its offsprings are i 0 i 1 i 2 , for i 2 = 1, . . ., n i 0 i 1 , etc. Overall there are K levels apart from the root (hence K = 4 in the postal code example).In full generality we observe a matrix of covariates X i 0 •••i k and a vector of responses y i 0 •••i k and at every node in the tree, although it is more common that such observations are available only at level K.For example, in one of the real estate models we consider in the sequel for each leaf at K = 4 the covariate is a national house pricing index and the response local average house sale prices, with replications given over different quarters.The model relates the response to the covariates via a linear predictor: Inference borrows strength across units by shrinking the regression parameters to those at the deeper level β i 0 •••i k−1 in terms of a Gaussian shrinkage prior: where A i 0 •••i k 's are design matrices, often simply the identity matrices.Putting these ingredients together the full model is as follows: With Gaussian likelihoods the posterior L(β | y, γ) is multivariate Gaussian that can be sampled directly in a provably scalable way using sparse linear algebra.The result is given in Proposition 3.For non-Gaussian likelihoods this result can be leveraged in different ways.One, which we explore in this work, is in situations where latent variables z can be augmented such that L(β | y, z, γ) is Gaussian.This is in the same spirit as the collapsed Gibbs sampling with Polya-gamma augmentation that we developed in Section 2.4 for crossed effects models.In Section 3.2 we show results for binomial and Cauchy likelihoods.A different strategy appropriate for light-tailed likelihoods, such as the Poisson, is to combine the efficient Gaussian sampling with distributed gradient-based sampling of the regression parameters on the leaves of the tree; this is also discussed in Section 3.2.

Methodology for Gaussian likelihoods: sparse linear algebra
For nested multilevel models with Gaussian likelihoods, we can generate samples from the posterior L(β | y, γ) directly at O (max{n, p}) cost, as shown in Proposition 3. Specifically, we consider the model ( 20) with likelihood As a consequence of ( 20)-( 21), the conditional distribution L(β | y, γ) is multivariate Gaussian.Sampling a multivariate Gaussian distribution can be reduced to solving linear systems; for details on this perspective and associated tools we refer to [29], who provides results that can be used to prove the following claims.When, given appropriate matrices V , U and R, the model can be expressed as (the second equation up to constants in β), then L(β | y, γ) is also a Gaussian distribution with precision Q = U + V .Let L be the lower-triangular Cholesky factor of Q, and z be a standard Gaussian vector.Then, a sample from the posterior L(β | y, γ) is obtained by solving the following two systems: It is well-known that L can be computed column-wise from first to last column.Hence, the computation of L, and the simultaneous solution of the top equation in ( 23) by forward substitution, can be thought of as a forward pass.Having completed the forward pass, we can then solve the second equation in ( 23) by backward substitution, in what we can think of as a backward pass.The cost of solving these systems is dominated by the cost of computing L, which is in general cubic in the dimension of β.
A direct calculation verifies the following claims.Let Q denote the posterior precision associated with the Gaussian L(β | y, γ).We refer to the elements of Q by indicating what blocks of regression coefficients they pertain to, rather than labeling the blocks with ordinal indices.For instance, we write In the nested specification (20), Accordingly, Q is sparse in the sense that it only has O (p) non-zero elements.These are located in the following positions (with the obvious additional non-zero blocks due to symmetry): where we define T i 0 i −1 = T pr for k = 0, and we recall that n i 0 •••i k is the number of offsprings of node β i 0 •••i k in the graphical model.With respect to (20)-( 22), for nested multilevel models V m is a vector of zeros except for the location that corresponds to β 0 , in which case it takes the value T pr µ pr .Thus, the top equation in ( 23) is simplified accordingly.Finally, as defined in (22), Ry can trivially be computed at O (max{n, p}) cost.
The Cholesky factor L does not necessarily inherit the sparsity of Q, and it is known that the ordering of the indices plays a role in the sparsity of the Cholesky factor.This role can be understood through the concept of future sets, see for example in Theorem 2.8 of [29].Therefore, the computational performance of linear algebra methods for sampling from Gaussian distributions depends strongly on the way β and Q are organized.We now define a specific ordering that turns out to be optimal for nested multilevel models.
Definition 4. For nested multilevel models as defined in (20), we define the "depth-last" ordering, according to which β is organized by placing β 0 last, then the β i 0 i 1 's (the order among them is irrelevant), then the β i 0 i 1 i 2 's, and so on, until we reach the level furthest from the root.
For nested multilevel models, Proposition 3 shows that under the "depth-last" ordering the Cholesky factor L has the same sparsity as Q, i.e. it has O (p) non-zero elements, and that it may be computed at O (p) cost.Therefore, (23) can be used at a O (max{n, p}) cost to draw samples from p(β | y, γ).The proposition shows how the non-zero elements can be computed sequentially.However, the main consequence of the Proposition is to establish the sparsity of L under the given ordering.Any sparse linear algebra software (e.g.CHOLMOD/SuiteSparse) will efficiently compute L from a representation of Q as a sparse matrix with "depth-last" ordering, hence the user would not need to hard-code the iterations described in the proposition.
Proposition 3.For the nested multilevel model as defined in (20) and (21), and under the "depth-last" ordering defined in Definition 4, let Q be the corresponding posterior precision matrix and L its associated Cholesky factor.Then, L has the same sparsity structure as Q and its only non-zero blocks are computed according to where as before where and the notation j k+1 Ci 0 •••i k j k+1 is to be understood as summing over the children of node The cost of carrying out the recursion in Proposition 3 is dominated by O (p) decompositions of L × L-dimensional matrices, where recall that L is the dimension of the regression parameters at each node.Therefore, the non-zero elements are obtained at O pL 3 cost.The Proposition can be proved in different ways, one using a future set argument, and another using belief propagation.In fact, the sequence of computational steps induced by sparse linear algebra based on "depth-last" ordering is almost identical to the one induced by belief propagation algorithms applied to the joint distributions defined by the posterior of the nested multilevel model.Belief propagation, as well as the general junction tree algorithm, is a framework for computations in graphical models, see Chapter 6 of [6] and Chapter 2 of [38], but for nested multilevel models the two paradigms are essentially equivalent.For an introduction to belief propagation see Section 8.4 in [4].For details on how belief propagation can be used for nested multilevel models see the technical report [22], which also provides a historical perspective on this and links to various references.For other references discussing scalable Sparse Linear Algebra (SLA) methods to perform matrix factorisation for sparse matrices with non-zero elements corresponding to a tree see e.g.[35,Section 13.2] and references therein.Given that the proof can be worked out on the basis of one of these established approaches, we omit it from the present article.

Methodology for non-Gaussian likelihoods: data augmentation and blocked Gibbs sampling
The direct sampling approach based on belief propagation or sparse linear algebra can still be useful to build scalable sampling algorithms even when the likelihoods are non-Gaussian.One such setting arises when conditionally on latent variables We have already discussed this approach with the binomial likelihood in Section 2.4, and we can transfer the result to the nested setting.Moreover, heavier-tailed likelihoods, e.g. the Laplace or the Student-t, can be represented as scale mixtures of Gaussians, i.e.

L(y
Letting z denote the vector of all such latent variables, our proposed scheme is to work with the data augmented target L(β, γ, z | y), and target it with a Gibbs sampler, where we sample from L(β | y, γ, z) using the methodology of Section 3.1.L(z | y, β, γ) decomposes according to with each z i 0 •••i k easily updated.
Various important specifications fall outside of this class, such as the Poisson regression model L( This instance is most easily described when the data y is only observed at the leaf level K, which in any case is the most common.Defining β (K) as the set of all regression parameters β i 0 •••i K at level K, we suggest the strategy of alternating updates to L(β (−K) | β (K) , γ) and L(β (K) | y, β (K−1) , γ), noting that β (−K) is conditionally independent of y.Indeed, β (−K) is now structured as a tree with K − 1 levels and Gaussian data β (K) at the new leaves.Thus, we sample from the full conditional distribution law of β (−K) according to the methodology of Section 3.1.Conversely, while L(β (K) | y, β (K−1) , γ) is not Gaussian, it decomposes according to with each β i 0 •••i K updated independently by a Metropolis-within-Gibbs step akin to the one presented in Section 2.2, at a total cost of order O (p) for the full update from L(β (K) | y, β (K−1) , γ).While the cost of the combined update remains O (p), the main potential weakness of that strategy lies in the induced dependence between β (K) and β (−K) , which in some settings increases the relaxation time of the algorithm as p increases.The work in [23] and the associated theory in [21] suggests that for likelihood with tails lighter than the Gaussian, such as the Poisson, this centered parameterisation (in the terminology of [23]) is expected to perform well, whereas for heavy-tailed likelihoods it would not.Hence, we view the two strategies proposed in this Section as complementary.

Numerical experiments with a real estate model
We assess algorithmic performance on a semi-synthetic dataset that originates from a broader task of temporal prediction of real estate prices at high spatial resolution.Our approach combines a type of multilevel capital asset pricing model, which relates the local average square meter price of different type of real estate properties to national economic indices (e.g.GDP), with a multivariate time series model for these indices.In this article we focus on the multilevel regression component of this model.The spatial domain is Spain, organized according to 5-digit postal codes which define K = 4 levels, plus the root.The first two digits define a province in the country and the later codes correspond to higher spatial resolutions.There are 44 observations at each postal code, each of which is the difference in consecutive quarters of the logarithm of local average sales price.The quarters span the period 2007-2018.The predictors, an intercept and a national housing price index (hence L = 2), are only available at the leaf-level.In this data structure, there are p ≈ 8×10 3 regression parameters and n ≈ 4×10 6 observations.Due to confidentiality constraints, we work with a dataset that has been simulated from a misspecified version of (20), using the part of the dataset that is publicly available and parameter values β that are consistent with the real data.
In two further experiments, we adopt the latent variable augmentation setting of 3.2, but keep the same prior structure as in the Gaussian experiment.In one instance, we transform the response into a binary variable according to its sign, set L( ), and use Polya-gamma augmentation to obtain conditionally Gaussian observations, as in Section 2.4.We show results in Figure 6.Remarkably, the elements of β appear to mix instantly, while elements of γ have significantly larger IATs.These are ordered by the level that the parameter pertains to: the higher the level, the slower the mixing.This is due to the rising number of coefficients per hyperparameter, and therefore increasing strength of the prior.This could be remedied, e.g. by way of parameter expansion.In the other instance, we postulate a heterogeneous observation model where L(y ) for a proportion α of the terminal leaves, and Here, the t subscript denotes replications over times for t = 1, . . ., 44.We vary α = 0.1, 0.2, . . ., 1, with α = 0 corresponding to the model and associated algorithm that is analyzed in Figures 5.For observations y i 0 •••i K t with non-Gaussian observation density we augment with latent variables z i 0 •••i K t , given which the density is Gaussian, and proceed with the blocked Gibbs sampler outlined above for each of the resulting 10 models.In Figure 7 we plot the estimated ACF for the likelihood of the data (marginalizing over the latent variables z, where present).The experiment shows that as the proportion of latent variables increases the mixing of the algorithm slows down, but the deterioration is bounded.
In particular, the autocorrelation is almost negligible after 10 iterations in all settings.
Considering the large value of n and p of this dataset, the results are coherent with the conjecture that the algorithm's relaxation time remains small as n and p diverge.
In a final experiment, the response y i 0 •••i K t is the number of transactions at the location i 0 • • • i K and time t, and we model it as Pois(exp{η There is a trend in the transaction numbers, which we account for by introducing the time index t as an additional covariate, and therefore L = 3.We apply our strategy for light-tailed likelihoods as laid out in Section 3.2, with updates to 1) , γ).We show results in Figure 8.The algorithm achieves good mixing throughout the tree, with the exception of a few nodes located on the penultimate level L = 3.Our investigation suggests that this is due to the fact that the model we use for the data is purposely mis-specified and this creates the slower, although by no means slow, mixing of a few internal nodes.In fact, the mis-specification creates a non-trivial dependence between precision hyperparameters and a few regression parameters.

Discussion
We have developed efficient sampling algorithms for large scale Bayesian inference with two of the most popular families of models for applied research, crossed effects and nested multilevel models with Gaussian priors and non-Gaussian likelihoods, and we have obtained rigorous results on their scalability properties.Our results support the claim that appropriately designed blocked Gibbs samplers can effectively exploit the sparse conditional dependence structure inherent to these models in order to achieve an overall computational complexity that is O (max{n, p}).From a practical point of view, these algorithms are observed to provide dramatic computational speed-ups compared to off-the-shelf HMC, both for simulated data and in challenging applications.
There are various open questions and directions for future research that would complement and extend the present work, two of which we elaborate on.Firstly, we worked under the assumption that scalable sampling from the conditional distribution of the high-dimensional regression parameters, i.e.L(a | y, γ) and L(β | y, γ) for crossed and nested models respectively, is sufficient to scalably sample from the joint distributions L(a, γ | y) and L(β, γ | y), at least in commonly encountered scenarios.While this is strongly supported by simulations, theoretical results on when the assumption is met, and when it isn't, would further strengthen our conclusions.A first step in this direction is made in [2], which analyses the two-block Gibbs sampler targeting the joint distribution L(β, γ | y) for two-level nested hierarchical models.The theory developed therein is asymptotic and supports the claim that under appropriate assumptions, the convergence time of the above two block sampler remains bounded as p, n → ∞.Secondly, while we demonstrated that SLA methods are ideally suited to nested models, it turns out that they are not effective for crossed effect models.In ongoing work we are performing a systematic exploration of SLA for crossed models, both to substantiate this claim, as well as to assess the effectiveness of approximate SLA methods, which sample from an

A. Proofs
Proof of Theorem 1.The cost per iteration has been established to be O (2n + p) = O (max{n, p}), hence the proof focuses on quantifying the relaxation time.Denote the relaxation time of (cGS) by T cgs .We now prove that for any balanced level design and for some positive and finite constants c and C, which depend only on γ.The thesis follows then directly from (30).
Given ā(t) = (a (0) (t), ā(1) (t), . . ., ā(K) (t)) do: Set a (0) (t, 0) ← a (0) (t) for k = 1 : K do Draw a (0) (t, k) from where we use "const" to indicate fixed values that do now depend on ā(t).It follows from ( 32) that E[ā (1)  .Therefore, the transition from ā(t) to ā(t + 1) corresponds to a linear autoregressive process with a (K + 1) × (K + 1) autoregressive matrix B. By well-known theory for Gaussian autoregression processes, the rate of convergence of such a Markov chain equals the largest modulus eigenvalue of the matrix B, which we denote by ρ(B).Defining c 0 = 1, the above results imply that B kℓ = c k for all k, ℓ = 0, . . ., K, meaning that the K + 1 columns of B are all equal to (c 0 , . . ., c K ) T .It follows that B is a rank 1 matrix with K eigenvalues equal to 0 and one non-zero eigenvalue equal to K k=0 c k .Hence, to bound ρ(B) we need to upper bound the absolute value of K k=0 c k .Define d k = k ℓ=1 c ℓ for k = 1, . . ., K. We now prove by induction that: d k ∈ (−1, 0), d k ≤ −ρ k and d k is monotonously decreasing in k = 1, . . ., K. The hypothesis is true for k = 1 since d 1 = −ρ 1 and ρ k ∈ (0, 1) for all k = 1, . . ., K. For k ≥ 2, we have where τ min = min k=1,...,K τ k is a fixed constant in our analysis.Above we used p k ≤ n for all k = 1, . . ., K, which holds by construction as every level of every factor gets at least one observation.
Proof of Theorem 2. By the argument in the proof of [24,Prop.4],we obtain that where P 1 = n (1,2) /d 1 and P 2 = (n (1,2) ) T /d 2 are, respectively, p 1 ×p 2 and p 2 ×p 1 stochastic matrices; and λ * 2 (M ) denotes the second largest modulus eigenvalue of a matrix M .The spectrum of the p 1 × p 1 transition matrix P 1 P 2 is related to the one of the (p 1 + p 2 ) × (p 1 + p 2 ) matrix where 0 m×r denotes a zero matrix of dimension m × r.The matrix A can be interpreted as the adjacency matrix of a bipartite graph with p 1 + p 2 nodes, one for each parameter in a (1) and a (2) , and edges between a i and a (2) i ′ if and only if n (1,2) ii ′ = 1.The definitions of A and P 1 P 2 imply that λ * 2 (P where λ 2 (M ) denotes the second largest eigenvalue of a matrix M 1 .
The bipartite graph associated to A, denoted by G(A), is also bi-regular, or more specifically (d 1 , d 2 )-regular, meaning that the degree of all nodes in the k-th component is d k , for k ∈ {1, 2}.Moreover, it can be seen that the map from n (1,2) to G(A), with A is as in (34), defines a bijection from D(n, d

B. MCMC performance metrics
The integrated autocorrelation time (IAT), is defined as where X is a stochastic process and acf t X is the autocorrelation function (ACF) of X evaluated at t-th lag.In practice, we estimate ACF by empirical averages from a single trajectory of X.We then use these estimates to estimate the IAT following Sokal's truncation heuristic [32].The heuristic consists of computing a running sum over ACF lags and stopping as soon as the running sum exceeds a multiple of the next lag.We modify this heuristic slightly and only truncate at even lags to account for NUTS' tendency to produce estimates of ACF of different sign at successive lags.We define the effective sample size (ESS) as the run length divided by the IAT estimate.Since we are comparing methods with vastly different computing time per iteration, we monitor the run time of the algorithm for both algorithms and use effective sample size per second as our performance metric.Notice that we include transience time in the ESS/sec estimate.Moreover, when overlaying ACFs for both methods, we plot them as a function of computing time, as opposed to iteration lag. C.

5 Gaussian | factor 2 1 Figure 1 :
Figure 1: [Crossed effects simulation study, binomial and Gaussian likelihoods] Integrated autocorrelation time estimates as a function of n for the locally centered (blue) and the vanilla (orange) Metropolis-within-Gibbs sampler.Solid lines refer to a (0) and p −1 k

Figure 2 .Figure 2 :
Figure 2: [Crossed effects elections model, constrained categorical likelihood ] Comparison of sampling efficiency between the locally centered sampler (const/LC-MwG, blue), the vanilla Metropolis-within-Gibbs sampler for the constrained model (const/Van-MwG) and Stan/NUTS (orange).The left panel overlays parameter ACFs of a and γ as a function of wall time.The right panel shows the distribution of effective samples per second.Estimates are computed from MCMC runs with 10000 (const/LC-MwG, const/Van-MwG) and 1000 iterations (NUTS) respectively, with an identical number of iterations used for adaptation.We set the priors Ṫ k ∼ Wishart(L − 1, I/(L − 1)), ȧ(0) ∼ N (0, I).

Figure 3 :
Figure 3: [Crossed effects election model, categorical likelihood ] Comparison of sampling efficiency between the locally centered sampler on the constrained model (const/LC-MwG, blue), the local centering sampler on the free model (free/LC-MwG, orange) and the projected local centering sampler on the free model (free/PLC-MwG, green).The left panel overlays parameter ACFs of a and γ as a function of wall time.The right panel shows the distribution of effective samples per iteration.Estimates are computed from MCMC runs with 10000 iterations each, with an identical number of iterations used for adaptation.We set the priors T k ∼ Wishart(L, I/L), a (0) ∼ N (0, I) in the free likelihood case and Ṫ k ∼ Wishart(L − 1, I/(L − 1)), ȧ(0) ∼ N (0, I) in the constrained likelihood case.

Figure 4 :
Figure 4: [Crossed effects election model, binomial likelihood ] Comparison of sampling efficiency between the collapsed Gibbs sampler on the augmented space (PG/Col-G, green), the locally centered marginal sampler (LC-MwG, blue) and the vanilla Gibbs sampler on the augmented space (PG/Van-G, orange).The left panel overlays parameter ACFs of a and γ as a function of wall time.The right panel shows the distribution of effective samples per iteration.Estimates are computed from MCMC runs with 10000 iterations each.We set the priors τ k ∼ Gamma(1/2, 1/2) and a flat prior on a (0) .
(k) i log p(y j | a) for j = 1, . . ., n need to be computed, at O (L) cost each.By similar arguments also the likelihood evaluations required by the acceptance probability computation require O (nL) operations per factor.Given the gradient and likelihood computations, one can eigendecompose T k at O L 3 cost and then perform the updates of ξ (k) i

Figure 5 :
Figure 5: [Nested effects real estate model, Gaussian likelihood ] Comparison of sampling efficiency between MCMC based on Gibbs with SLA (blue) and Stan/NUTS (orange).The left panel overlays ACFs of β and γ as a function of wall time.The right panel shows the distribution of effective samples per second.Due to the large number of parameters, we only plot ACFs for a subset.For each algorithm, we include ACFs for the 20 slowest parameters and a representative subsample of the rest.Estimates are computed from MCMC runs with 10000 (SLA) and 1000 iterations (NUTS) respectively, with an warm-up period of 1000 iterations for NUTS.We set the hyperparameters ν k = L for T k , a flat prior on β 0 , and τ ∼ Gamma(1/2, 1/2).

Figure 6 :
Figure 6: [Nested effects real estate model, binomial likelihood ] Sampling efficiency the PG/SLA algorithm for binomial likelihoods.The plot overlays ACFs of β and γ as a function of MCMC lag.Due to the large number of parameters, we only plot ACFs for a subset: we include ACFs for the 20 slowest parameters and a representative subsample of the rest.Estimates are computed from an MCMC run with 10000 iterations.We set the hyperparameters ν k = L for T k and (µ pr = 0, T pr = I) for β 0 , corresponding to a standard Gaussian prior.

1 Figure 8 :
Figure 8: [Nested effects real estate model, Poisson likelihood ] Sampling efficiency of the LT/SLAwG algorithm for Poisson likelihoods.The plot overlays ACFs of β and γ as a function of MCMC lag.Due to the large number of parameters, we only plot ACFs for a subset: we include ACFs for the 20 slowest parameters and a representative subsample of the rest.Estimates are computed from an MCMC run with 10000 iterations.We set the hyperparameters ν k = L for T k and (µ pr = 0, T pr = I) for β 0 , corresponding to a standard Gaussian prior.

Table 1 :
Categorical factors included in the crossed effects elections model.

Table 2 :
Table of Algorithms Algorithms and corresponding acronyms.