Convergence Analysis of a Collapsed Gibbs Sampler for Bayesian Vector Autoregressions

We propose a collapsed Gibbs sampler for Bayesian vector autoregressions with predictors, or exogenous variables, and study the proposed algorithm's convergence properties. The Markov chain generated by our algorithm converges to its stationary distribution at least as fast as those of competing (non-collapsed) Gibbs samplers and is shown to be geometrically ergodic regardless of whether the number of observations in the underlying vector autoregression is small or large in comparison to its order and dimension. We also give conditions under which the geometric ergodicity is asymptotically stable as the number of observations tends to infinity. Specifically, the geometric convergence rate is shown to be bounded away from unity asymptotically, either almost surely or with probability tending to one, depending on what is assumed about the data generating process. Our results are among the first of their kind for practically relevant Markov chain Monte Carlo algorithms.


Introduction
Markov chain Monte Carlo (MCMC) is often used to explore the posterior distribution of a vector of parameters θ given data D. To ensure the reliability of an analysis using MCMC it is essential to understand the convergence properties of the chain in use [6,7,9,10,20,26,48] and, accordingly, there are numerous articles establishing such properties for different MCMC algorithms [e.g. 1,2,13,15,17,21,32,38,46]. It has been common in this literature to treat the data D as fixed, or realized. Thus, the model for how the data are generated has typically been important only insofar as it determines the likelihood function based on an arbitrary realization; the stochastic properties of the data prescribed by that model have not been emphasized. This is natural since the target distribution, i.e. the posterior distribution, treats the data as fixed. On the other hand, due to the rapid growth of data available in applications, it is also desirable to understand how performance is affected as the number of observations increases. When this happens, the data are more naturally thought of as stochastic; each time the sample size increases by one, the additional observation is randomly generated. The study of how convergence properties of MCMC algorithms are affected by changes in the data is known as convergence complexity analysis [35] and it has attracted increasing attention recently [16,32,34,49,50].
Accounting for randomness in the data and letting the sample size grow leads to a more complicated analysis than when the data are fixed. In fact, to date, convergence complexity analysis has only been successfully carried out for a few, maybe even one [32], practically relevant MCMC algorithms, by which we mean MCMC algorithms for settings where one could not easily sample by less complicated methods. We study an MCMC algorithm for a fundamental model in time series analysis: a Bayesian vector autoregression with predictors (VARX). Briefly, the VARX we consider assumes that Y t ∈ R r and X t ∈ R p satisfy, for t = 1, . . . , n,

Y t = A 1 Y t−1 + · · · + A q Y t−q + B T X t + U t , (1)

with U 1 , . . . , U n independent and multivariate normally distributed with mean zero and common covariance matrix Σ ∈ S r ++ , the set of r × r symmetric and positive definite (SPD) matrices, A i ∈ R r×r , i = 1, . . . , q, and B ∈ R p×r . As we will see, the likelihood for the VARX is closely related to that of a multivariate linear regression and, consequently, our results apply also to that model. Moreover, (1) simplifies to a multivariate linear regression if A i = 0 for all i. We focus on the VARX because it motivates our study and, in particular, our choice of priors. More details on the model specification, priors, and resulting posterior distributions are given in Section 2.
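To make the model concrete, the following sketch simulates data from a VARX of the form (1); all dimensions and parameter values are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: r-dimensional response, p predictors, order q.
r, p, q, n = 2, 1, 1, 200

A = [0.5 * np.eye(r)]          # A_1, chosen with spectral radius < 1 (stable)
B = np.ones((p, r))            # coefficient matrix for the predictors
Sigma = 0.1 * np.eye(r)        # innovation covariance (SPD)
L = np.linalg.cholesky(Sigma)

Y = np.zeros((n + q, r))       # rows Y_0 (non-stochastic start) through Y_n
X = rng.normal(size=(n, p))    # strongly exogenous predictors

for t in range(n):
    U_t = L @ rng.normal(size=r)                       # U_t ~ N(0, Sigma)
    Y[t + q] = sum(A[i] @ Y[t + q - 1 - i] for i in range(q)) + B.T @ X[t] + U_t

Y_obs = Y[q:]  # the observations Y_1, ..., Y_n
```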
The target distribution of our algorithm is the posterior distribution of θ = (A 1 , . . . , A q , B, Σ) given D = {(Y 1 , X 1 ), . . . , (Y n , X n )}. We will consider both fixed and growing data and refer to the two settings as the small-n and large-n setting, respectively. By n being small we mean that it is fixed and possibly small in comparison to r and q, but n > p throughout. Many large VARs in the literature [3,11,24] are covered by this setting. By n being large we mean that it is increasing and that the data are stochastic.
The algorithm we consider is a collapsed Gibbs sampler that simplifies to a commonly considered (non-collapsed) Gibbs sampler when there are no predictors in the model; that is, when B ≡ 0 in (1). To discuss our results, we require some more notation. Let F (·|D) denote the VARX posterior distribution having density f (θ|D) with support on Θ ⊆ R d for some d ≥ 1 and let K h (K ≡ K 1 ) be the h-step transition kernel for a Markov chain with state space Θ, started at a point θ ∈ Θ. We assume throughout that all discussed Markov chains are irreducible, aperiodic, and Harris recurrent [31], and that sets on which measures are defined are equipped with their Borel σ-algebra. Our analysis is focused on convergence rates in total variation distance, by which we mean the rate at which ‖K h (θ, ·) − F (·|D)‖ T V approaches zero as h tends to infinity, where ‖·‖ T V denotes the total variation norm. If this convergence happens at a geometric (or exponential) rate, meaning there exist a ρ ∈ [0, 1) and an M : Θ → [0, ∞) such that, for every θ ∈ Θ and h ∈ {1, 2, . . . },

‖K h (θ, ·) − F (·|D)‖ T V ≤ M (θ)ρ h , (2)

then the Markov chain, or the kernel K, is said to be geometrically ergodic. The geometric convergence rate ρ is the infimum of the set of values in [0, 1] for which (2) holds [32]. Since all probability measures have unit total variation norm, ρ is always in [0, 1], and K is geometrically ergodic if and only if ρ < 1. A substantial part of the literature on convergence of MCMC algorithms is centered around geometric ergodicity, for good reasons: under moment conditions, a central limit theorem (CLT) holds for functionals of geometrically ergodic Markov chains [5,18] and the variance in the asymptotic distribution given by that CLT can be consistently estimated [8,19,47], allowing principled methods for ensuring reliability of the results [39,48]. Our main result in the small-n setting gives conditions that ensure ρ < 1 when the data are fixed and K is the kernel corresponding to the algorithm we consider.
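The definition in (2) can be checked by hand in a toy finite case. The following sketch uses a hypothetical two-state chain, for which the total variation distance to stationarity decays geometrically with rate equal to the modulus of the second eigenvalue of the transition matrix.

```python
import numpy as np

# A hypothetical two-state chain; pi solves pi K = pi.
K = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])
rho = 0.7  # second eigenvalue of K: this chain's geometric convergence rate

# Total variation distance between the h-step kernel started in state 0
# and the stationary distribution, for h = 1, ..., 20.
tv = []
Kh = K.copy()
for h in range(1, 21):
    tv.append(0.5 * np.abs(Kh[0] - pi).sum())
    Kh = Kh @ K

# For this chain, (2) holds with M(0) = 1/3: tv[h-1] equals rho**h / 3 exactly.
```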
Notice that, although it is suppressed in the notation, K, M , and, hence, ρ typically depend on D. In the large-n setting, we are no longer considering a single dataset, but a sequence of datasets {D n } := {D 1 , D 2 , . . . }, where D n here denotes a dataset with n observations. Consequently, for every n there is a posterior distribution F (·|D n ) and a Markov chain with kernel K n that is used to explore it. To each kernel K n there also corresponds a geometric convergence rate ρ n . Since ρ n depends on D n , the sequence {ρ n } is now one of random variables, ignoring possible issues with measurability. If the convergence rate ρ n tends to one in probability or almost surely as n → ∞, we say that the convergence rate is unstable; in practice, we expect an algorithm that generates a chain with geometric convergence rate tending to one to be less reliable when applied to large datasets. Thus, we are interested in bounding {ρ n } away from unity asymptotically, in either one of two senses: first, if there exists a sequence of random variables {ρ̄ n } such that ρ n ≤ ρ̄ n for every n and lim sup n→∞ ρ̄ n < 1 almost surely, then we say that {K n } is asymptotically geometrically ergodic almost surely, or the geometric ergodicity is asymptotically stable almost surely. Second, if instead of the upper limit being less than unity almost surely it holds that lim n→∞ P(ρ n < 1) = 1, then we say that {K n } is asymptotically geometrically ergodic in probability, or that the geometric ergodicity is asymptotically stable in probability. In the large-n setting, we give conditions for asymptotically stable geometric ergodicity, in both of the two senses, of the Markov chain generated by our algorithm. An intuitive, albeit somewhat loose, interpretation of our main results is that the geometric ergodicity is asymptotically stable if the sample covariance matrix tends to some positive definite limit.
In particular, as long as this holds, the geometric ergodicity of the Markov chain we study is asymptotically stable under arbitrary model misspecification. Our small- and large-n results are complementary: the small-n results establish geometric ergodicity for fixed n but do not ensure asymptotic geometric ergodicity, while the large-n results give asymptotic geometric ergodicity but do not imply geometric ergodicity for fixed n.
The rest of the paper is organized as follows. We begin in Section 2 by completing the specification of the model and priors. Because some of the priors may be improper we derive conditions which guarantee the posterior exists and is proper. In Section 3 we develop a collapsed Gibbs sampler for exploring the posterior. Conditions for geometric ergodicity for small n are presented in Section 4 and conditions for asymptotically stable geometric ergodicity are given in Section 5. Some concluding remarks are given in Section 6.

Bayesian vector autoregression with predictors
Recall the definition of the VARX in (1). To complete the specification, we assume that the starting point (Y −q+1 , . . . , Y 0 ) is non-stochastic and known and that the predictors are strongly exogenous. By the latter we mean that {X t } is independent of {U t } and has a distribution that does not depend on the model parameters. With these assumptions the following lemma is straightforward. Its proof is provided in Appendix B for completeness.
where A = [A 1 , . . . , A q ] T ∈ R qr×r and α = vec(A), and where vec(·) is the vectorization operator, stacking the columns of its matrix argument.
For fixed data, f (Y | X, A, B, Σ) is the same as for a multivariate linear regression with partitioned design matrix [Z, X] ∈ R n×(qr+p) and coefficient matrix [A T , B T ] T . However, our choice of priors is guided by the vector autoregression. Let S r + denote the set of r × r symmetric positive semi-definite (SPSD) matrices and, to define priors, let m ∈ R qr 2 , C ∈ S qr 2 + , D ∈ S r + , and a ≥ 0 be hyperparameters. Our prior on θ = (α, B, Σ) has density

f (θ) ∝ exp(−(α − m) T C(α − m)/2) |Σ| −a/2 exp(−tr(Σ −1 D)/2),

where |·| means the determinant when applied to matrices; in particular, the prior on B is flat. The flat prior on B is standard in multivariate scale and location problems, including in particular the multivariate regression model. The priors on α and Σ are common in macroeconomics [23] and the prior on Σ includes the inverse Wishart (D ∈ S r ++ , a > 2r) and Jeffreys prior (D = 0, a = r + 1) as special cases. In other work on similar models it is often assumed that [Z, X] has full rank or that the prior for [A T , B T ] T is proper [1,2,11,13,25,44]. Treating A and B differently is appealing in the current setting: it adheres to the common practice of using flat priors for regression coefficients such as B, while m and C can be chosen to reflect the fact that many commonly studied time series are known to be nearly non-stationary in the unit root sense. In particular, with economic and financial data it is often reasonable to expect the diagonal elements of A 1 to be near one. For a more thorough discussion of popular priors in Bayesian VARs we refer the reader to Karlsson [23].
The following result gives two different sets of conditions that lead to a proper posterior. Though we focus on proper normal or flat priors for α in the rest of the paper, it may be relevant for other work to note that the proposition holds for any prior f (α) satisfying the conditions. For example, f (α) could be truncated to impose stability of the VARX, which if q = 1 corresponds to a prior with support only on those α for which the spectral radius of A 1 is less than one [see e.g. 28, for definitions].
The first set of conditions is relevant to the small-n setting. It implies that if the prior on Σ is a proper inverse Wishart density, so that a > 2r and D ∈ S r ++ , then the posterior is proper if f (α) is proper and X has full column rank. In particular, r or q can be arbitrarily large in comparison to n. Thus, this setting is compatible with large VARs [3,11,24]. The second set of conditions allows for the use of improper priors also on α and Σ when n is large in comparison to all of p, q, and r. The full column rank of [Y, Z, X] is natural in large-n settings. In practice, one expects it to hold unless the least squares regression of Y on Z and X gives residuals that are identically zero.
The literature on convergence properties of MCMC algorithms for Bayesian VAR(X)s is limited. An MCMC algorithm for a multivariate linear regression model has been proposed and its convergence rate in the small-n setting studied [13]. By the preceding discussion, this includes the VARX as a special case; however, the (improper) prior used there is f (θ) ∝ |Σ| −a , which is not compatible with the large VARXs we allow for in the small-n setting. A two-component (A and Σ) Gibbs sampler for Bayesian vector autoregressions without predictors has been proposed [22], but its analysis is simulation-based and as such does not provide theoretical guarantees. Our results address this since, as we will discuss in more detail below, the algorithm we consider simplifies to this Gibbs sampler when there are no predictors.

A collapsed Gibbs sampler
If the precision matrix in the prior on α, C, is a matrix of zeros, then the VARX posterior is a normal-(inverse) Wishart for which MCMC is unnecessary. However, when C ∈ S qr 2 ++ the posterior is analytically intractable and there are many potentially useful MCMC algorithms. For example, the full conditional distributions have familiar forms, so it is straightforward to implement a three-component Gibbs sampler. Another sensible option is to group A and B and update them together. Here, we will instead make use of the particular structure the partitioned matrix [Z, X] offers and devise a collapsed Gibbs sampler [27]; that is, a Gibbs sampler where some updates are not made using full conditional distributions, but conditional distributions with one or more components integrated out. As we will see, the structure of the collapsed sampler, and in particular the relations between the convergence rates of the marginal chains and the full chain, is instrumental to our development. For a more general discussion of how the operator norms of collapsed and non-collapsed Gibbs samplers compare we refer the reader to [27].
For the case where the precision matrix C in the prior on α is positive definite and B ≡ 0, so that the predictor matrix X plays no role in the model, a two-component Gibbs sampler has been proposed [23]. We will show that the algorithm we study includes this Gibbs sampler as a special case, with minor modifications, and as a consequence our results apply almost verbatim to that sampler. A formal description of one iteration of the collapsed Gibbs sampler is given in Algorithm 1.

Algorithm 1. Collapsed Gibbs sampler
We next give the conditional distributions necessary for its implementation. Let M(M, U, V ) denote the matrix normal distribution with mean M and scale matrices U and V , that is, the distribution of a matrix whose vectorization is multivariate normal with mean vec(M ) and covariance matrix V ⊗ U , where ⊗ is the Kronecker product. Let also W −1 (U, c) denote the inverse Wishart distribution with scale matrix U and c degrees of freedom. For any real matrix M , define P M to be the projection onto its column space and Q M the projection onto the orthogonal complement of its column space.
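The two distributions and the projections just introduced are easy to realize numerically. The sketch below (dimensions hypothetical) draws one matrix normal variate via its vectorized definition and constructs P M and Q M using the Moore–Penrose pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample from the matrix normal MN(M, U, V) via its defining property:
# vec(W) ~ N(vec(M), V ⊗ U).  Dimensions here are hypothetical.
nrow, ncol = 4, 3
M = np.zeros((nrow, ncol))
U = 2.0 * np.eye(nrow)           # row covariance
V = 0.5 * np.eye(ncol)           # column covariance

vecW = rng.multivariate_normal(M.flatten(order="F"), np.kron(V, U))
W = vecW.reshape((nrow, ncol), order="F")   # un-vectorize column by column

# Projection onto the column space of a matrix X, and onto the
# orthogonal complement of that column space.
X = rng.normal(size=(10, 2))
P_X = X @ np.linalg.pinv(X)      # P_M in the paper's notation
Q_X = np.eye(10) - P_X           # Q_M in the paper's notation
```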

Lemma 3.1. If one of the two sets of conditions in Proposition 2.2 holds, then
The collapsed Gibbs sampler in Algorithm 1 simulates a realization from a Markov chain having one-step transition kernel K C , where the subscript C is short for collapsed. However, instead of working directly with K C we will use its structure to reduce the problem in a convenient way.
Consider the sequence {(α h , Σ h )}, h = 0, 1, . . . , obtained by ignoring the component for B in Algorithm 1. The sequence {(α h , Σ h )} is essentially generated by a two-component Gibbs sampler. More precisely, if Q X is replaced by the n × n identity I n in the conditional distributions of Σ and α used in Algorithm 1, then the algorithm defined by steps 1, 2, 3, and 5 is a two-component Gibbs sampler; denote its transition kernel by K G . A routine calculation shows that, since K G , by construction, has the appropriate invariant distribution, the {α h } sequence is itself a Markov chain with a transition kernel K A . The transition kernel, K Σ , for the {Σ h } sequence is constructed similarly. The kernel K A satisfies detailed balance with respect to the posterior marginal F A (· | Y, X), and similarly for K Σ , and hence each has the respective posterior marginal as its invariant distribution. However, the kernels K C and K G do not satisfy detailed balance with respect to their invariant distributions. In Sections 4 and 5 we will establish geometric ergodicity of K C and study its asymptotic stability, respectively. Our approach, which is motivated by the following lemma, will be to analyze K A in place of K C ; the lemma says we can analyze any of K G , K A , or K Σ in place of K C (see also [36]). The proof of the lemma uses only well-known results about de-initializing Markov chains [37] and can be found in Appendix B.
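The idea of analyzing a marginal sequence in place of the full chain can be illustrated with a toy two-component Gibbs sampler for a bivariate normal (a hypothetical example, unrelated to the VARX posterior): the x-marginal is itself a Markov chain, here an AR(1) with autoregressive coefficient ρ², and its convergence can be studied directly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-component Gibbs sampler for a bivariate normal with correlation rho:
#   x | y ~ N(rho*y, 1 - rho^2),   y | x ~ N(rho*x, 1 - rho^2).
# The x-marginal sequence is itself a Markov chain (an AR(1) with
# coefficient rho^2) -- the analogue of the marginal kernel K_A above.
rho = 0.8
n_iter = 100_000
x = np.empty(n_iter)
x[0], y = 0.0, 0.0
for h in range(1, n_iter):
    y = rho * x[h - 1] + np.sqrt(1 - rho**2) * rng.normal()
    x[h] = rho * y + np.sqrt(1 - rho**2) * rng.normal()

# Lag-one autocorrelation of the x-chain should be close to rho^2 = 0.64.
acf1 = np.corrcoef(x[:-1], x[1:])[0, 1]
```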

Lemma 3.2. For any
The primary tool we will use for investigating both geometric ergodicity and asymptotic stability is the following well-known result [40, Theorem 12], which has been specialized to the current setting. Note that the kernel K A acts to the left on measures; that is, for a measure ν, we define νK A (·) = ∫ K A (α, ·)ν(dα).

Theorem 3.3. Suppose there exist a V : R qr 2 → [0, ∞), a λ ∈ [0, 1), and an L < ∞ such that, for every α′,

∫ V (α)K A (α′, dα) ≤ λV (α′) + L. (5)

Also suppose there exist ε > 0, a measure R, and some T > 2L/(1 − λ) such that, for every α′ with V (α′) ≤ T,

K A (α′, ·) ≥ εR(·). (6)

Then K A is geometrically ergodic and, moreover, if c ∈ (0, 1), Λ = 1 + 2(λT + L), and κ = (1 + 2L + λT )/(1 + T ), then, for any initial distribution ν,

‖νK A h − F A (· | Y, X)‖ T V ≤ (1 − ε) ch + (κ 1−c Λ c ) h (1 + L/(1 − λ) + ∫ V (α)ν(dα)). (7)

It is common for the initial value to be chosen deterministically, in which case (7) suggests choosing a starting value to minimize V . Theorem 3.3 has been successfully employed to determine sufficient burn-in in the sense that the upper bound on the right-hand side of (7) is below some desired value [20,21,41], but, unfortunately, the upper bound is often so conservative as to be of little utility. However, our interest is twofold: setting ρ̄ = max{(1 − ε) c , κ 1−c Λ c }, it is easy to see that there is a c ∈ (0, 1) such that ρ̄ < 1, and hence if K A satisfies the conditions, then it is geometrically ergodic and, as developed and exploited in other recent research [32], the geometric convergence rate ρ is upper bounded by ρ̄. Outside of toy examples, we know of no practically relevant general state space Markov chains for which ρ is known.
Consider the setting where the number of observations tends to infinity; that is, there is a sequence of data sets {D n } and corresponding transition kernels {K A,n } with n → ∞. If lim inf n→∞ ρ̄ n = 1 almost surely, then we say the drift (5) and minorization (6) are asymptotically unstable in the sense that, asymptotically, they provide no control over ρ n . On the other hand, because ρ n ≤ ρ̄ n , establishing that lim sup n→∞ ρ̄ n < 1 almost surely, or that lim n→∞ P(ρ n < 1) = 1, leads to asymptotically stable geometric ergodicity as defined in the introduction.
Notice that ρ̄ depends on the drift function V through ε, λ, and L. Thus the choice of drift function which establishes geometric ergodicity for a fixed n may not result in asymptotic stability as n → ∞. Indeed, in Section 4 we use one V to show that K A is geometrically ergodic under weak conditions when n is fixed, while in Section 5 a different drift function and slightly stronger conditions are needed to achieve asymptotically stable geometric ergodicity of K A .
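A drift condition of the form (5) can be verified by hand in simple cases. For a hypothetical AR(1) chain with V (x) = x², one step gives E[V (x h+1 ) | x h ] = a²x h ² + s², so (5) holds with λ = a² and L = s²; the sketch below confirms the one-step expectation by Monte Carlo at a single state.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy AR(1) chain x_{h+1} = a * x_h + e, with e ~ N(0, s^2).
# With V(x) = x^2:  E[V(x_{h+1}) | x_h] = a^2 * x_h^2 + s^2,
# i.e. the drift condition holds with lambda = a^2 and L = s^2.
a, s, x0 = 0.5, 1.0, 2.0
draws = a * x0 + s * rng.normal(size=1_000_000)
mc = np.mean(draws**2)           # Monte Carlo estimate of E[V | x0]
exact = a**2 * x0**2 + s**2      # = 2.0
```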

Geometric ergodicity
In this section we consider the small-n setting. That is, n is fixed and the data Y and X are observed, or realized, and hence treated as constant. Accordingly, we do not use a subscript for the sample size on the transition kernels. We next present some preliminary results that will lead to geometric ergodicity of K A , and hence of K G and K C .
We fix some notation before stating the next result. Let ‖·‖ denote the Euclidean norm when applied to vectors and the spectral (induced) norm when applied to matrices, let ‖·‖ F denote the Frobenius norm for matrices, and let superscript + denote the Moore–Penrose pseudo-inverse. Least squares estimators of A and α are denoted by Â = (Z T Q X Z) + Z T Q X Y and α̂ = vec(Â), respectively, and y = vec(Y ).
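The estimator Â just defined is the Z-block of the ordinary least squares fit of Y on the partitioned design [Z, X] (the Frisch–Waugh–Lovell theorem); a quick numerical check, with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)

# A_hat = (Z' Q_X Z)^+ Z' Q_X Y is the coefficient of Z in the least
# squares regression of Y on [Z, X].  Dimensions are hypothetical.
n, qr_, p, r = 50, 3, 2, 2
Z = rng.normal(size=(n, qr_))
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, r))

Q_X = np.eye(n) - X @ np.linalg.pinv(X)     # projection off col(X)
A_hat = np.linalg.pinv(Z.T @ Q_X @ Z) @ Z.T @ Q_X @ Y

# Compare with the Z-block of the full regression of Y on [Z, X]:
full = np.linalg.lstsq(np.hstack([Z, X]), Y, rcond=None)[0]
```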
Lemma 4.1. If C ∈ S qr 2 ++ and at least one of the two sets of conditions in Proposition 2.2 holds, then the drift condition (5) holds for K A with V (α) = ‖α‖ 2 , any λ ≥ 0, and an L < ∞ depending on the data and the hyperparameters.

Proof. Assume first Q X = I n ; the general case is then recovered by replacing Z and Y by Q X Z and Q X Y everywhere. Using (4) and Fubini's theorem, and then applying Lemma 3.1 together with standard expressions for the moments of the multivariate normal distribution [42, Theorem 10.18], we obtain an expression for the inner integral. We work separately on its last two summands. First, since Σ −1 ⊗ Z T Z is SPSD, we bound the first of these summands using Lemma A.2. Second, by Lemma A.3, with (Σ −1/2 ⊗ I n )y and (Σ −1/2 ⊗ Z)C −1/2 taking the roles of what is there denoted y and X, we obtain a bound, (8), valid for any generalized inverse (denoted by superscript g); Lemma A.4 supplies such a generalized inverse. Using that the Moore–Penrose pseudo-inverse distributes over the Kronecker product [30], the middle part of this generalized inverse can be written as (Σ −1 ⊗ Z T Z) + = Σ ⊗ (Z T Z) + . Thus, for this particular choice of generalized inverse, (8) simplifies, and using also standard properties of the trace we arrive at an upper bound that does not depend on Σ. The proof is completed upon integrating both sides with respect to f (Σ | α′, Y ) dΣ.
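The key matrix identity used in the proof, that the Moore–Penrose pseudo-inverse distributes over Kronecker products, can be checked numerically (the matrices below are arbitrary stand-ins for Σ and Z):

```python
import numpy as np

rng = np.random.default_rng(5)

# Verify (S^{-1} ⊗ Z'Z)^+ = S ⊗ (Z'Z)^+, even when Z'Z is singular.
S = np.array([[2.0, 0.5],
              [0.5, 1.0]])                   # an SPD matrix (plays Sigma)
Z = rng.normal(size=(6, 4))
Z[:, 3] = Z[:, 0] + Z[:, 1]                  # make Z'Z rank-deficient

lhs = np.linalg.pinv(np.kron(np.linalg.inv(S), Z.T @ Z))
rhs = np.kron(S, np.linalg.pinv(Z.T @ Z))
```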
Lemma 4.2. If at least one of the two sets of conditions in Proposition 2.2 holds, then for any T > 0 the minorization condition (6) holds for K A with V (α) = ‖α‖ 2 .

Proof. We will prove that there exists a function g : S r ++ → (0, ∞), depending on the data and hyperparameters, such that ∫ g(Σ) dΣ > 0 and g(Σ) ≤ f (Σ | A, Y ) for every α such that ‖α‖ 2 ≤ T or, equivalently, ‖A‖ 2 F ≤ T . This suffices since, if such a g exists, then we may take ε = ∫ g(Σ) dΣ and define the distribution R by, for any Borel set A ⊆ R qr 2 , To establish existence of a g with the desired properties we will lower bound the first and third terms in f (Σ | A, Y ) using two inequalities. We prove the former inequality first.
and therefore which is clearly SPSD.
For the second inequality we use the triangle inequality, submultiplicativity, and the fact that the Frobenius norm upper bounds the spectral norm. Since the spectral norm of an SPSD matrix is its maximum eigenvalue, we have shown that c 1 I r − E T Q X E is SPSD. Thus, Σ −1/2 (c 1 I r − E T Q X E)Σ −1/2 is also SPSD and, hence, the second inequality holds, which is what we wanted to show. We have thus established the required lower bound on f (Σ | A, Y ). We are ready for the main result of this section.

Theorem 4.3.
If C ∈ S qr 2 ++ and at least one of the two sets of conditions in Proposition 2.2 holds, then the transition kernels K C , K G , and K A are geometrically ergodic.
Proof. By Lemma 3.2 it suffices to show this for K A . Lemma 4.1 establishes that a drift condition (5) holds for K A with V (α) = ‖α‖ 2 and all λ ∈ [0, 1), while Lemma 4.2 establishes a minorization condition (6) for K A . The claim now follows immediately from Theorem 3.3.
We note that, since V (α) is unbounded off compact sets and it is routine to show that K A is weak Feller [31, p. 124], the theorem can in fact be proven without using Lemma 4.2 [31, Lemma 15.2.8]. However, with Lemma 4.2 we also get an explicit bound on the convergence rate, through Theorem 3.3, which will be useful in what follows.

Asymptotic stability
We consider asymptotically stable geometric ergodicity as n → ∞. Motivated by Lemma 3.2, we focus on the sequence of kernels {K A,n }, where K A,n is the kernel K A with the dependence on the sample size n made explicit; we continue to write K A when n is arbitrary but fixed. Similar notation applies to the kernels K C and K G .
It is clear that as n changes so do the data Y and X. Treating Y and X as fixed is not appropriate unless we only want to discuss asymptotic properties holding pointwise, i.e. for particular, or all, paths of the stochastic process {(Y t , X t )}, which is unnecessarily restrictive. We assume that (Y t , X t ), t = 1, 2, . . . are defined on a common probability space so the joint distribution of Y and X exists for every n, and we allow for model misspecification; that is, {Y t } and {X t } need not satisfy (1).
Recall that Theorem 3.3 is instrumental to our strategy: if K A,n satisfies Theorem 3.3 with some V = V n , λ = λ n , L = L n , ε = ε n , and T = T n , then there exists a ρ̄ n < 1 that upper bounds ρ n . We focus on the properties of those ρ̄ n , n = 1, 2, . . . , as n tends to infinity. Throughout the section we assume that the priors, and in particular the hyperparameters, are the same for every n. The latter is not necessary and could be replaced by appropriate bounds on how the hyperparameters change with n; however, doing so complicates notation and does not lead to fundamental insights in our setting. For example, C could be allowed to vary with n as long as its eigenvalues are bounded away from zero and from above.
Clearly, the choice of drift function V n is important for the upper bound ρ̄ n one obtains. The drift function used for the small-n regime is not well suited for the asymptotic analysis in this section. Essentially, problems occur if λ n → 1, L n → ∞, or ε n → 0, since then the corresponding upper bounds satisfy lim n→∞ ρ̄ n = 1 almost surely [32, Proposition 2]. Consider Theorem 4.3. Since we can take λ n = 0 for all n, only L n or ε n can lead to problems. Because L n is essentially a quadratic in α̂, it is clear that L n (Y, X) = O P (1) if and only if α̂ = O P (1), while we show in Appendix B.1 that ε n → 0 almost surely as n → ∞ under a condition on the relative growth of ‖Q X Z‖ and ‖Q [X,Z] Y ‖. We expect this to occur in many relevant configurations. Indeed, we expect the order of ‖Q X Z‖ will often be at least that of ‖Q [X,Z] Y ‖. To see why, consider the case without predictors and data generated according to the VARX.
where γ max (·) denotes the maximum eigenvalue, and Y T Q Z Y /n is the maximum likelihood estimator of Σ, which is known to be consistent, for example, if data are generated from a stable VAR with i.i.d. Gaussian innovations [28]. For such VARs it also holds that Z T Z/n converges in probability to some SPD limit [28], and hence ‖Z‖ 2 = nγ max (Z T Z/n) = O P (n). Similar arguments can be made for any other data generating process for which Y T Q Z Y /n and Z T Z/n are suitably bounded in probability or almost surely as n → ∞.
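The convergence of Z T Z/n for a stable VAR can be seen in simulation. The sketch below uses a hypothetical stable VAR(1) with A = 0.5 I and unit innovation covariance, for which the stationary covariance is (1 − 0.25)⁻¹ I = (4/3) I.

```python
import numpy as np

rng = np.random.default_rng(6)

r = 2
A = 0.5 * np.eye(r)   # stable: spectral radius 0.5

def scaled_gram(n):
    """Simulate the VAR(1) and return Z'Z/n, where Z stacks the lags."""
    Y = np.zeros((n + 1, r))
    for t in range(n):
        Y[t + 1] = A @ Y[t] + rng.normal(size=r)
    Z = Y[:-1]
    return Z.T @ Z / n

S_n = scaled_gram(200_000)
# S_n is close to the stationary covariance (4/3) * I, so
# ||Z||^2 = n * gamma_max(Z'Z/n) grows like n.
```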
The intuition as to why the drift function that works in the small-n regime is not suitable for convergence complexity analysis is that the drift function should be centered (minimized) at a point the chain in question can be expected to visit often [32]. The function defined by V (α) = ‖α‖ 2 is minimized when α = 0, but there is in general no reason to believe the α-component of the chain will visit a neighborhood of the origin often. On the other hand, if the number of observations grows fast enough in comparison to other quantities and we suppose momentarily that the data are generated from the VARX, then we expect the marginal posterior density of A to concentrate around the true A, i.e. the A according to which the data are generated. We also expect that for large n the least squares and maximum likelihood estimator Â = (Z T Q X Z) + Z T Q X Y is close to the true A. Thus, intuitively, the α-component of the chain should visit the vicinity of α̂ = vec(Â) often. Formalizing and extending this intuition to cases where the model can be misspecified, so that no true A exists, leads to the main result of the section.
Let us re-define V by V (α) = ‖Q X ZÂ − Q X ZA‖ 2 F . We will use the following lemma to verify the drift condition in (9) for all large enough n, almost surely or with probability tending to one. The probabilistic qualifications are needed because, in contrast to the small-n setting, V here depends on the data. Accordingly, the λ given in the lemma depends on Â and, consequently, need not be less than one for a fixed n or a particular sample.
Proof. Suppose first that Q X = I n and notice that Z has full column rank, and hence (Z T Z) −1 exists. Since f (α | Σ, Y, X) is a multivariate normal density, standard expressions for the moments of the multivariate normal distribution give For the second term we use cyclical invariance of the trace to write Since C and Σ −1 ⊗ Z T Z are both SPD, the last expression is in the form required by Lemma A.2, and hence where the last line uses that the trace of a projection matrix is the dimension of the space onto which it projects. Focusing now on the first term on the right-hand side in (9) we have, defining H = Σ −1 ⊗ Z T Z and using α̂ = H −1 (Σ −1 ⊗ Z T )y, that it is upper bounded by Moreover, since B = C + H, the Woodbury identity applies, and (10) can be upper bounded as follows: Here, the power G t , t ∈ R, for an SPD matrix G is defined by taking the spectral decomposition G = U G diag(γ max (G), . . . , γ min (G))U T G , where γ max (·) and γ min (·) denote the largest and smallest eigenvalues, respectively, and setting G t = U G diag(γ max (G) t , . . . , γ min (G) t )U T G . Now by standard properties of eigenvalues and eigenvectors of Kronecker products [14, Theorem 4.2.12] we get In addition, using Lemma A.2, It remains to deal with the second term in (10). Using a similar technique as with the previous term, applying sub-multiplicativity and Lemma A.2 twice, we have Putting things together we have shown that, for any Σ, and hence we get from (9) The proof for the case Q X = I n is completed by upper bounding ‖Σ‖ ≤ tr(Σ), integrating both sides with respect to f (Σ | α′, Y, X) dΣ, and noting that P Z Y = ZÂ. The general case is recovered by replacing Z and Y by Q X Z and Q X Y , respectively, and invoking Lemma A.1. That Z T Q X Z is invertible also in the general case follows from the same lemma.
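The Woodbury identity invoked in the proof reduces, for B = C + H with both summands SPD, to B⁻¹ = C⁻¹ − C⁻¹(C⁻¹ + H⁻¹)⁻¹C⁻¹, which the following sketch verifies on random SPD matrices:

```python
import numpy as np

rng = np.random.default_rng(7)

# Woodbury identity for a sum of two SPD matrices, B = C + H:
#   B^{-1} = C^{-1} - C^{-1} (C^{-1} + H^{-1})^{-1} C^{-1}.
k = 4
M1 = rng.normal(size=(k, k)); C = M1 @ M1.T + np.eye(k)   # SPD
M2 = rng.normal(size=(k, k)); H = M2 @ M2.T + np.eye(k)   # SPD
B = C + H

lhs = np.linalg.inv(B)
Ci = np.linalg.inv(C)
rhs = Ci - Ci @ np.linalg.inv(Ci + np.linalg.inv(H)) @ Ci
```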
Proof. The proof idea is similar to that for Lemma 4.2. We prove that there exists a g : S r ++ → [0, ∞), depending on the data and the hyperparameters, such that ∫ g(Σ) dΣ = ε > 0 and g(Σ) ≤ f (Σ | A, Y, X) for every A such that ‖Q X ZÂ − Q X ZA‖ 2 F ≤ T . Assume first that Q X = I n and let c = n + a − r − p − 1 be the degrees of freedom in the full conditional distribution for Σ. Using that (Y − ZA) T (Y − ZA) = (Y − ZÂ) T (Y − ZÂ) + (ZÂ − ZA) T (ZÂ − ZA), we can bound the determinant term. Moreover, for any A such that ‖ZÂ − ZA‖ 2 F ≤ T , where the first inequality follows from the fact that ‖ZÂ − ZA‖ 2 = ‖(ZÂ − ZA) T (ZÂ − ZA)‖ and that, therefore, the relevant matrix is SPSD, and the second inequality follows because the Frobenius norm upper bounds the spectral norm, so that With the determinant and trace inequalities just established, we have that f (Σ | A, Y, X) is, for any A satisfying the hypotheses, lower bounded by Noticing that g so defined is proportional to an inverse Wishart density and using the well-known expression for its normalizing constant finishes the proof for the case where Q X = I n . The general case is recovered upon replacing Z and Y by Q X Z and Q X Y everywhere and invoking Lemma A.1.
We are ready to state the main result of the section. Recall that γ max (·) denotes the largest eigenvalue of its argument matrix; let γ min (·) denote the smallest.

Theorem 5.3. Suppose that (a) C ∈ S qr 2 ++ and that (b) there exists an M ∈ [1, ∞) such that, with W = [Y, Z, X] ∈ R n×(r+qr+p) and S W = W T W/n, it holds that M −1 ≤ γ min (S W ) ≤ γ max (S W ) ≤ M for all large enough n, almost surely as n → ∞. Then {K C,n }, {K G,n }, and {K A,n } are asymptotically geometrically ergodic almost surely.
Proof. By Lemma 3.2, it is enough to prove that lim sup n→∞ ρ̄ n < 1 almost surely for the ρ̄ n corresponding to K A,n . Inspecting the definition of ρ̄ n in Theorem 3.3, one sees that it suffices to show that Lemmas 5.1 and 5.2 apply and that the λ = λ n , L = L n , T = T n , and ε = ε n they give almost surely satisfy, respectively: (i) lim sup n→∞ λ n < 1, (ii) lim sup n→∞ L n < ∞, (iii) lim sup n→∞ T n < ∞, and (iv) lim inf n→∞ ε n > 0. The prior f (α) is bounded. Next, upper bounding the Frobenius norm by the spectral norm times √r and using the Minimax Principle [4, Corollary III.1.2], we find ‖Â‖ F = O(1); in particular, since [Z, X] T [Z, X]/n and Y T Y /n are the trailing and leading blocks of S W , respectively, their eigenvalues must be bounded between the smallest and largest eigenvalues of S W , and so also between M −1 and M almost surely as n → ∞. Since we have shown ‖Â‖ F = O(1), it follows that λ n = O(1/n), so (i) holds. That (ii) holds follows similarly, and hence we can pick a sequence T n > 2L n /(1 − λ n ), n = 1, 2, . . . , such that (iii) holds. For (iv), we have, with τ j denoting the jth eigenvalue, Now, we have picked T n so that T̄ = lim sup n→∞ T n < ∞, and the final inequality follows because the fraction in parentheses is 1 + (2 −1 M −1 T̄ )/n and the exponent is of the same order as n; this finishes the proof.
If assumption (b) is relaxed to hold with probability tending to one instead of almost surely, then the conclusion weakens accordingly, giving the following corollary. We omit the proof since it is essentially the same as that of Theorem 5.3, arguing that the necessary conditions hold with probability tending to one.

Example
We illustrate the behavior of λ = λ_n, L = L_n, and ε = ε_n from Lemmas 5.1 and 5.2 in a stable VARX of order q = 1; that is, A = A_1 and ‖A‖ < 1 in (1). Stability makes some of the arguments in this example easy to motivate formally, but we emphasize that it is not needed for the preceding results. We take X_t = 1 for all t, B = 1_r^T, Σ = σ²I_r, and pick hyperparameters C = I_{r²}, m = 0, D = 0, and a = 0. We first examine λ_n, L_n, and ε_n analytically and then illustrate those calculations using simulations.
In the present setting, the expressions for λ_n and L_n in Lemma 5.1 simplify considerably. Because the VARX is stable, Â and Y^T Q_{[Z,X]} Y/n are consistent for A and Σ = σ²I_r, respectively, as n → ∞ [28]. Clearly, λ_n = O_P(1/n) and L_n = O_P(1). To get some intuition for how r affects λ_n and L_n, let us ignore the stochastic terms that are of lower order when n grows with r fixed; that is, consider the approximations λ̃_n and L̃_n. Because ‖A‖²_F ≤ r‖A‖² ≤ r, it holds that λ̃_n < 1 for large enough n even if r grows, as long as r = o(n). By contrast, L̃_n is of order r² if r grows, regardless of n. This suggests, at least informally, that λ_n can stay below one if r grows with n, but that L_n may not be bounded in such settings.
To investigate how ε_n behaves, note that we can take T_n = T̃_n + o_P(1) with T̃_n = 2L̃_n = 2rσ²(r + ‖A‖²_F) and have T_n > 2L_n/(1 − λ_n) with probability tending to one as n grows with r fixed. Thus, using that Y^T Q_{[Z,X]} Y/n → Σ = σ²I_r in probability and that the determinant is a continuous mapping, consider the approximation ε_n ≈ ε̃_n = exp[−r²(r + ‖A‖²_F)] and note that, as argued previously, ‖A‖²_F ≤ r. Thus, if r grows, then ε̃_n is of order exp(−r³), regardless of n, and therefore we do not expect ε_n to behave well if r grows with n.
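As a quick numerical sanity check (a minimal sketch of our own, not part of the repository code; the helper name is hypothetical), the approximation ε̃_n = exp[−r²(r + ‖A‖²_F)] is best evaluated on the log scale, since for r = 10 it is already far below floating-point range:

```python
def log_eps_tilde(r, frob_sq_A):
    """Log of the approximation eps_tilde = exp(-r^2 (r + ||A||_F^2)).

    Working on the log scale avoids underflow: for r = 10 the
    approximation itself is roughly exp(-1260), far below what a
    double-precision float can represent.
    """
    return -r**2 * (r + frob_sq_A)

# Setting of the example: r = 10 and ||A||_F^2 ~= 2.6 gives
# log eps_tilde = -100 * 12.6 = -1260, i.e. eps_tilde ~= exp(-1260).
print(log_eps_tilde(10, 2.6))

# Cubic decay in r: with the worst case ||A||_F^2 = r,
# log eps_tilde = -2 r^3, regardless of n.
for r in (5, 10, 20):
    print(r, log_eps_tilde(r, r))
```

This makes the exp(−r³) scaling concrete: doubling r from 10 to 20 drives the log of the bound from −2000 to −16000 in the worst case.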
To investigate the finite-sample behavior, we generate one sample path from the VARX and compute λ_n, L_n, and ε_n using the first n observations of that sample path, for different values of n. We compare these quantities to the corresponding ones from Lemmas 4.1 and 4.2, that is, those obtained in the small-n setting. For simplicity, we also refer to the latter quantities as obtained using a non-centered drift function. Code producing the results is available at https://github.com/koekvall/gibbs-bvarx.
When generating data, we fix r = 10 and σ² = 1, so Σ = I_10. We construct A by letting U ∈ R^{r×r} have entries drawn independently from the uniform distribution on [−1/2, 1/2] and taking A = (U + U^T + I_r)/(‖U + U^T + I_r‖ + 0.1), which ensures ‖A‖ < 1.
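This construction, and a sample path from the resulting VARX, can be sketched in a few lines (our own illustration, not the repository code; the recursion Y_t = B^T X_t + A Y_{t−1} + E_t with X_t = 1 is assumed from the example's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
r, sigma2, n = 10, 1.0, 200

# A = (U + U^T + I_r) / (||U + U^T + I_r|| + 0.1), with the spectral norm,
# so that ||A|| < 1 by construction.
U = rng.uniform(-0.5, 0.5, size=(r, r))
S = U + U.T + np.eye(r)          # symmetric numerator
A = S / (np.linalg.norm(S, 2) + 0.1)

# Stability check: the denominator strictly exceeds ||S||.
assert np.linalg.norm(A, 2) < 1

# One sample path with X_t = 1, B = 1_r^T, Sigma = sigma2 * I_r.
Y = np.zeros((n, r))
for t in range(1, n):
    Y[t] = np.ones(r) + A @ Y[t - 1] + rng.normal(0.0, np.sqrt(sigma2), size=r)
```

Because S is symmetric, its spectral norm is its largest absolute eigenvalue, so ‖A‖ = ‖S‖/(‖S‖ + 0.1) < 1 and the VAR(1) recursion is stable.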
The first plot in Figure 1 shows that the λ_n calculated using the centered drift function from the large-n setting is less than 1 when n is greater than 40 in our sample, but greater than one for smaller samples; this illustrates that our large-n results do not control the convergence rate in small (fixed) samples.
The second plot in Figure 1 shows that the observed values of L_n, calculated using the centered drift function, appear to tend to their probability limit, where ‖A‖²_F ≈ 2.6 for the A used to generate our sample path. Figure 1 also shows that the behavior of λ_n and L_n calculated using the centered drift function is, as n changes, similar to that observed if Â and Y^T Q_{[Z,X]} Y/n are replaced by A and Σ. We have not plotted the λ_n calculated using the non-centered drift function since it can be identically zero. However, the L_n obtained from that drift function is plotted, and it is smaller (better) than that for the centered drift function for every n considered (second plot, Figure 1).
The first plot in Figure 2 indicates that the ε_n calculated using the centered drift function decreases towards ε̃_n ≈ exp[−100(10 + 2.6)] = exp(−1260) as n increases, and that there is decent agreement with the approximation obtained by replacing Â and Y^T Q_{[Z,X]} Y/n by A and Σ. In comparison to λ_n and L_n, it seems a much larger n is needed for ε_n to get close to its probability limit. Note also that, unlike for λ_n and L_n, larger values of ε_n correspond to better convergence rate bounds.
The second plot in Figure 2 shows the ε_n from Lemma 4.2, calculated using the non-centered drift function. That ε_n appears to tend to zero at an exponential rate (linear for its logarithm). This rate is straightforward to verify analytically by noting that, in the expression for ε_n in Lemma 4.2, the exponent is of order n and the quantity being exponentiated tends in probability to a number strictly between zero and one in the current setting. Hence, unlike the ε_n from Lemma 5.2, it does not provide any control over the convergence rate asymptotically.
In summary, this example shows that the drift parameters λ_n and L_n are typically smaller (better) for the non-centered drift function used in the small-n setting, but that the minorization parameter ε_n obtained using that drift function tends to zero as n tends to infinity.

Discussion
Markov chain Monte Carlo is used in a wide range of problems, including but not limited to the Bayesian settings considered here. However, the theoretical properties of algorithms used by practitioners are not always well understood. We have focused on the case of Bayesian vector autoregressions with predictors. This is one of the most common models in time series, and in particular in the analysis and forecasting of macroeconomic time series. Moreover, due to similarities of the likelihoods of vector autoregressions and multivariate linear models, our results apply also to the latter. The Gibbs sampler has been suggested for exploring the posterior distribution of the parameters A and Σ when there are no predictors [23], but there has been a lack of theoretical support. We have addressed this by proposing a collapsed Gibbs sampler that handles predictors and studying its convergence properties. Since our algorithm simplifies to the usual Gibbs sampler when there are no predictors, our results apply also in that setting.
We have proven that our algorithm generates a geometrically ergodic Markov chain under reasonable assumptions (Theorem 4.3). This result is applicable both in classical settings where the sample size is large (but fixed) in comparison to the number of parameters, and in large VARXs where the dimension of the process or the lag length is large in comparison to the number of observations. Though we have not emphasized it, this geometric ergodicity holds also if the model is misspecified. Indeed, once the data are fixed and the posterior specified, how the data were actually generated has no bearing on the convergence rate, as long as they satisfy the conditions laid out in the relevant lemmas and theorem. Thus, with the algorithm we consider, characteristics of the posterior distribution can be reliably estimated using principled approaches to ensuring the simulation results are trustworthy [7,20,48]. Our asymptotic analysis, or convergence complexity analysis, indicates our algorithm should perform well also in large samples; we have proven that, as the sample size tends to infinity, the geometric ergodicity of the sequence of transition kernels corresponding to our algorithm is asymptotically stable. This result is one of the first of its kind for practically relevant MCMC algorithms. As with our results for small samples, our asymptotic results hold under almost arbitrary model misspecification, as specified in Theorem 5.3.
Avenues for future research include convergence complexity analysis of cases where the dimension of the process or the lag length tends to infinity, either together with the sample size or for a fixed sample size. Our proof uses the intuition that the posterior concentrates near the least squares estimator, and it may be possible to formalize this intuition also in settings where r, p, or q grows with n. However, it is likely that the drift function would need to be adjusted, perhaps by using a different norm: the Frobenius norm has convenient properties that we used in our proofs but is often less useful in high-dimensional settings [45]. If the sample size is fixed or grows slowly in comparison to other quantities, one would likely have to use a different drift function altogether or move to an approach that avoids the use of the minorization condition [33,34].
Thus, to show that f(Y | A, B, Σ, X) f(α) f(Σ) can be normalized to a proper posterior, we need only show that it is integrable. Let us consider the two sets of conditions separately, starting with the first. Since tr(E^T Q_X E Σ⁻¹) = tr(Σ^{−1/2} E^T Q_X E Σ^{−1/2}) ≥ 0, we can upper bound the integrand in (12) by a function which, since we are assuming that n − p + a − r − 1 > r − 1, i.e. that n + a > 2r + p, and that D is SPD, is the product of a proper inverse Wishart density and a proper density for α. This finishes the proof for the first set of conditions. For the second set of conditions, notice that, since D is SPSD and hence f(Σ) and f(α) are both bounded, it suffices for (12) to show that the remaining factor is integrable. Using the same decomposition as before, we have for the last integrand an expression which, under the second set of assumptions, is proportional to the product of an inverse Wishart density for Σ with scale matrix Ỹ^T Q_Z̃ Ỹ and n + a − p − qr − r − 1 degrees of freedom and a matrix normal density for A with mean HZ̃Ỹ and scale matrices (Z̃^T Z̃)⁻¹ and Σ, and is hence integrable. The assumption that [Y, X, Z] has full column rank ensures, by Lemma A.1, that Ỹ^T Q_Z̃ Ỹ and Z̃^T Z̃ are positive definite.
Proof of Lemma 3.1. The full conditional distribution of B is immediate upon dropping terms not depending on B in (11). Consider next the integrand in (12). The first term in the exponential expands to a quadratic in α; thus, the log of the integrand is quadratic as a function of α, with Hessian −B = −Σ⁻¹ ⊗ Z^T Q_X Z − C and gradient −(Σ⁻¹ ⊗ Z^T Q_X) vec(Q_X Y) − Cm, which implies the desired distribution for α | Σ, Y. Finally, the distribution of Σ | α, Y is immediate upon dropping terms in the integrand in (12).

Proof. Since C⁻¹ is SPD, the term tr(C⁻¹) is positive, and so dropping it and the term involving Cm in the expression for L_n(Y, X) gives L_n(Y, X) > ‖C^{−1/2}‖² ‖C^{1/2}ᾱ‖² = γ_min(C)⁻¹ ‖C^{1/2}ᾱ‖².