Convergence rates for MCMC algorithms for a robust Bayesian binary regression model

Abstract: The most common regression models for analyzing binary random variables are the logistic and probit regression models. However, it is well known that the estimates of the regression coefficients for these models are not robust to outliers [26]. The robit regression model [1, 16] is a robust alternative to the probit and logistic models; it is obtained by replacing the normal (logistic) distribution underlying the probit (logistic) regression model with the Student's t-distribution. We consider a Bayesian analysis of binary data with the robit link function. We construct a data augmentation (DA) algorithm that can be used to explore the corresponding posterior distribution. Following [10], we further improve the DA algorithm by adding a simple extra step to each iteration. Although the two algorithms are essentially equivalent in terms of computational complexity, the second algorithm is theoretically more efficient than the DA algorithm. Moreover, we analyze the convergence rates of these Markov chain Monte Carlo (MCMC) algorithms. We prove that, under certain conditions, both algorithms converge at a geometric rate. Geometric convergence has important theoretical and practical ramifications: it guarantees that the ergodic averages used to approximate posterior expectations satisfy central limit theorems, which in turn allows for the construction of asymptotically valid standard errors. These standard errors can be used to choose an appropriate (Markov chain) Monte Carlo sample size and allow one to use the MCMC algorithms developed in this paper with the same level of confidence that one would have using classical (iid) Monte Carlo. The results are illustrated using a simple numerical example.


Introduction
The logistic and probit regression models are commonly used in practice to analyze binary data. However, the estimators of the regression coefficients for these popular models are not robust to outliers [26]. [16] proposed a robust alternative to the logistic and probit models which he called the robit regression model. In order to describe the robit model, suppose that Y = (Y_1, Y_2, ..., Y_n) is a vector of n independent binary random variables such that P(Y_i = 1) = F_ν(x_i^T β), where F_ν(·) is the cdf of the univariate Student's t distribution with known and fixed degrees of freedom ν, the x_i's, i = 1, 2, ..., n, are p × 1 known vectors of covariates associated with Y_i, and β is the p × 1 vector of unknown regression coefficients. The joint probability mass function (pmf) of Y is given by

p(y | β) = ∏_{i=1}^{n} [F_ν(x_i^T β)]^{y_i} [1 − F_ν(x_i^T β)]^{1−y_i},    (1.1)

where y = (y_1, y_2, ..., y_n). The above model, as an alternative to the logistic model, was previously suggested by [24] and [1]. Both the probit and logistic regression models can be well approximated by the robit model: a robit link with about seven degrees of freedom provides an excellent approximation to the logit link, and the probit link can be well approximated by a robit link with large degrees of freedom. Gelman and Hill [7, chap. 6] showed that in the presence of outliers, the robit model, unlike the logistic and probit models, can effectively downweight the discordant data points for a better model fit. On the other hand, in the absence of any discrepancy in the data set, if the data actually come from, say, a logistic model, then the estimated response curve obtained by fitting the robit model is close to the true logistic model. We consider a Bayesian analysis of binary data with the pmf of Y as defined in (1.1) and a normal prior on β. The posterior density π(β | y) is given by

π(β | y) = p(y | β) φ_p(β; β_a, Σ_a^{-1}) / m(y),    (1.2)

where φ_p(β; β_a, Σ_a^{-1}) is the density of the p-dimensional normal distribution with mean β_a and dispersion matrix Σ_a^{-1}, evaluated at β, and m(y) is the normalizing constant, that is, m(y) = ∫_{R^p} p(y | β) φ_p(β; β_a, Σ_a^{-1}) dβ. Proper choice of the dispersion matrix Σ_a^{-1} is important. One choice is to take Σ_a = cX^T X for some constant c, where X is the n × p design matrix [see e.g. 38]. The appeal of this choice of Σ_a is that only one hyperparameter, c, needs to be specified.
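For concreteness, the likelihood in (1.1) is straightforward to evaluate numerically. The following sketch (our own illustration, not code from the paper) uses SciPy's Student's t cdf for F_ν and computes the robit log-likelihood for simulated data:

```python
import numpy as np
from scipy import stats

def robit_loglik(beta, X, y, nu):
    """Robit log-likelihood from (1.1): P(Y_i = 1) = F_nu(x_i^T beta)."""
    p = stats.t.cdf(X @ beta, df=nu)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log1p(-p))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))                 # n = 50, p = 2
beta = np.array([1.0, -0.5])
y = (rng.random(50) < stats.t.cdf(X @ beta, df=7)).astype(float)
ll = robit_loglik(beta, X, y, nu=7)
print(ll)                                        # a finite negative number
```

With ν = 7 this link closely mimics the logit link, as noted above.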
In Bayesian inference, interest typically centers on posterior expectations of the form E_π h = ∫_{R^p} h(β) π(β | y) dβ for functions h of the regression coefficients. Unfortunately, E_π h is a ratio of two intractable integrals which are not available in closed form. Moreover, classical Monte Carlo methods based on independent and identically distributed (iid) samples are problematic when p is large. In this case we may resort to MCMC methods. Here we construct a data augmentation (DA) algorithm that can be used to explore the posterior density π(β | y). DA algorithms, like their deterministic counterpart, the EM algorithm, often suffer from slow convergence [see e.g. 36]. In the case when the prior mean β_a = 0, following [10] we present an alternative MCMC algorithm, called the sandwich algorithm, that is computationally equivalent to the DA algorithm but converges faster to the stationary distribution and is more efficient (see Section 3 for details).
Let {β_m}_{m=0}^∞ be the Markov chain associated with either the DA or the sandwich algorithm, and let h̄_m := (1/m) Σ_{j=0}^{m−1} h(β_j) denote the ergodic average of h along the chain. Let L²(π) denote the vector space of real-valued functions that are square integrable with respect to the target density π(β | y).
The ergodic average h̄_m is a consistent estimator of E_π h since, by the ergodic theorem, h̄_m → E_π h with probability 1 as m → ∞. However, in practice we need to choose the sample size m. The sample size m must be large enough to ensure that the Monte Carlo error h̄_m − E_π h is sufficiently small, and this is where an approximate distribution of h̄_m − E_π h can be used. In fact, if there is a central limit theorem (CLT) for h, that is,

√m (h̄_m − E_π h) →_d N(0, σ_h²) as m → ∞,    (1.3)

and if σ̂_h² is a consistent estimator of the asymptotic variance σ_h², then an asymptotic 95% confidence interval (CI) for E_π h based on m iterations of the chain can be constructed as h̄_m ± 2σ̂_h/√m. If we are satisfied with the width of the CI, we stop; otherwise, we increase the sample size m to achieve the desired level of precision. Unfortunately, unlike in the case of classical Monte Carlo methods based on iid draws from the target distribution, the second moment condition (h ∈ L²(π)) is no longer enough to guarantee a Markov chain CLT (1.3). The most common method of establishing the CLT in (1.3) is to show that the Markov chain {β_m}_{m=0}^∞ is geometrically ergodic (see Section 2 for the definition), and this requires a deep theoretical convergence analysis of the chain. Moreover, due to the serial correlation in the Markov chain, the asymptotic variance σ_h² has a complicated form, and consistent estimation of σ_h² is a challenging problem. On the other hand, if {β_m}_{m=0}^∞ is geometrically ergodic, then results in [9, 12, 2] and [6] show that specialized techniques such as regenerative simulation and batch means can be used to construct consistent estimators of σ_h². (For more on standard errors of MCMC-based estimates, see [13] and [5].) The main result in this paper is a proof that, under certain easily verifiable conditions on n, ν, c, and the design matrix X, the Markov chains underlying the DA and sandwich algorithms presented here converge at a geometric rate. As described above, our convergence rate results allow one to use the MCMC algorithms developed in this paper with the same level of confidence that one would have using classical (iid) Monte Carlo.
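To illustrate how such standard errors are computed in practice, here is a minimal batch-means sketch (our own; the AR(1) chain merely stands in for MCMC output, and the batch count is an arbitrary choice):

```python
import numpy as np

def batch_means_se(chain, n_batches=30):
    """Batch-means estimate of the standard error of the ergodic average."""
    b = len(chain) // n_batches                 # batch length
    means = np.asarray(chain[:b * n_batches]).reshape(n_batches, b).mean(axis=1)
    sigma2_hat = b * means.var(ddof=1)          # estimates the asymptotic variance
    return np.sqrt(sigma2_hat / (b * n_batches))

rng = np.random.default_rng(1)
x = np.empty(20000)
x[0] = 0.0
for t in range(1, len(x)):                      # AR(1) chain: a stand-in for MCMC output
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
se = batch_means_se(x)
print(x.mean(), se)                             # asymptotic 95% CI: mean +/- 2*se
```

Consistency of such estimators is exactly what geometric ergodicity buys, per the references above.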
The remainder of this paper is organized as follows. Section 2 contains a description of the DA algorithm as well as the statement of our results regarding its convergence rate. In Section 3, we show how the group action recipe of [10] can be used to improve the DA algorithm. Our results are illustrated using a simple example in Section 4. A short discussion appears in Section 5. Technical results and proofs are relegated to the Appendices.

The Data Augmentation algorithm
We mentioned in the Introduction that sampling directly from (1.2) is rather difficult. But, as we explain now, it is easy to construct a data augmentation algorithm for (1.2) by introducing two sets of new (latent) random variables. In particular, let z_1, ..., z_n be n independent variables with z_i ~ t_ν(x_i^T β, 1), where t_ν(µ, 1) denotes the univariate Student's t distribution with location µ, scale 1 and degrees of freedom ν. If we define Y_i = I_{R_+}(z_i), then Y_1, ..., Y_n are n independent Bernoulli random variables with P(Y_i = 1) = P(z_i > 0) = F_ν(x_i^T β). Thus z_1, ..., z_n can be thought of as latent variables underlying the binary data y. Now we use the fact that the t distribution can be expressed as a scale mixture of normal distributions, that is, if z_i | λ_i ~ N(µ, 1/λ_i) and λ_i ~ Gamma(ν/2, ν/2), then the marginal distribution of z_i is t_ν(µ, 1). The joint posterior density of (β, λ, z) given y is

π(β, (λ, z) | y) = (1/m(y)) φ_p(β; β_a, Σ_a^{-1}) ∏_{i=1}^{n} [y_i I_{R_+}(z_i) + (1 − y_i) I_{R_−}(z_i)] φ(z_i; x_i^T β, 1/λ_i) q(λ_i; ν/2, ν/2),    (2.1)

where φ(x; a, b) is the density of the univariate normal distribution with mean a and variance b, evaluated at the point x (that is, φ ≡ φ_1), and q(ω; a, b) is the gamma density with shape parameter a and rate parameter b, evaluated at ω (i.e., q(ω; a, b) = b^a ω^{a−1} e^{−bω}/Γ(a)).
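The scale-mixture representation is easy to check by simulation. The following sketch (ours, not from the paper) compares the empirical cdf of z drawn via the gamma mixture with the t_ν(µ, 1) cdf:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
nu, mu, N = 5, 1.0, 200_000
lam = rng.gamma(nu / 2, 2.0 / nu, size=N)       # lambda_i ~ Gamma(nu/2, rate nu/2)
z = rng.normal(mu, 1.0 / np.sqrt(lam))          # z_i | lambda_i ~ N(mu, 1/lambda_i)
# The marginal of z should be t_nu(mu, 1): compare cdfs at a few points
diffs = [abs((z <= x).mean() - stats.t.cdf(x - mu, df=nu)) for x in (-1.0, 0.0, 2.0)]
print(diffs)
```

The discrepancies are at the Monte Carlo noise level, consistent with the claimed mixture representation.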
It can be shown that, conditional on (β, y), z_1, ..., z_n are independent with z_i | β, y ~ Tt_ν(x_i^T β, y_i), where Tt_ν(x_i^T β, y_i) denotes the Student's t distribution with location x_i^T β, scale 1 and degrees of freedom ν that is truncated left at 0 if y_i = 1 and truncated right at 0 if y_i = 0. Sampling from the truncated t distribution can be done by the inversion method. Next, conditional on (z, β, y), the λ_i's are independent with λ_i | z, β, y ~ Gamma((ν + 1)/2, (ν + (z_i − x_i^T β)²)/2) for i = 1, ..., n. Lastly, β | λ, z, y ~ N_p((X^T ΛX + Σ_a)^{-1}(X^T Λz + Σ_a β_a), (X^T ΛX + Σ_a)^{-1}), where Λ is the n × n diagonal matrix with the elements of λ on the diagonal. A single iteration of the DA algorithm uses the current state β to produce the new state β′ through the following two steps:

1. Draw {(λ_i, z_i), i = 1, 2, ..., n} by first drawing z_i ~ Tt_ν(x_i^T β, y_i) and then drawing λ_i | z_i ~ Gamma((ν + 1)/2, (ν + (z_i − x_i^T β)²)/2).
2. Draw β′ ~ N_p((X^T ΛX + Σ_a)^{-1}(X^T Λz + Σ_a β_a), (X^T ΛX + Σ_a)^{-1}).

The Markov transition density of the DA algorithm is given by

k(β′ | β) = ∫ π(β′ | λ, z, y) π(λ, z | β, y) dλ dz.

Note that, while the Markov transition density does depend on the data, y, and the design matrix X, these quantities are fixed, so this dependence is suppressed in the notation. The basic theory of DA algorithms implies that k(β′ | β) is reversible with respect to the posterior density π(β | y); i.e., we have

k(β′ | β) π(β | y) = k(β | β′) π(β′ | y) for all β, β′ ∈ R^p.

It follows immediately that the posterior density is invariant for the chain; i.e., ∫_{R^p} k(β′ | β) π(β | y) dβ = π(β′ | y) for all β′ ∈ R^p. Let Z denote the subset of R^n where z lives; that is, Z is the Cartesian product of n positive and negative half lines (R_+ and R_−), where the ith component is R_+ if y_i = 1 and R_− if y_i = 0. The DA Markov chain is Harris ergodic, that is, irreducible, aperiodic and Harris recurrent (Tan and Hobert [34, Lemma 1], Hobert [8]). See [22] for definitions of irreducibility, aperiodicity and Harris recurrence. Harris ergodicity implies that, no matter how the chain is started, the chain converges to its stationary distribution, and that the ergodic average h̄_m based on the DA algorithm converges almost surely to its population counterpart, which is, of course, the (posterior) expectation E_π h.
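Putting the two steps together, one iteration of the DA algorithm can be sketched as follows (a sketch under the conditional distributions stated above; the truncated t draws use the inversion method via SciPy's t cdf/ppf, and all variable names are ours):

```python
import numpy as np
from scipy import stats

def da_step(beta, X, y, nu, Sigma_a, beta_a, rng):
    """One DA iteration: draw (z, lambda) given beta, then beta' given (lambda, z)."""
    mu = X @ beta
    # Step 1a: z_i ~ t_nu(x_i^T beta, 1) truncated to (0, inf) if y_i = 1,
    # and to (-inf, 0) if y_i = 0, sampled by inversion.
    p0 = stats.t.cdf(-mu, df=nu)                           # P(z_i <= 0)
    u = rng.random(len(y))
    q = np.where(y == 1, p0 + u * (1.0 - p0), u * p0)
    z = mu + stats.t.ppf(q, df=nu)
    # Step 1b: lambda_i | z_i ~ Gamma((nu+1)/2, rate (nu + (z_i - x_i^T beta)^2)/2)
    lam = rng.gamma((nu + 1) / 2.0, 2.0 / (nu + (z - mu) ** 2))
    # Step 2: beta' ~ N_p((X^T Lam X + Sigma_a)^{-1}(X^T Lam z + Sigma_a beta_a), V)
    V = np.linalg.inv(X.T @ (lam[:, None] * X) + Sigma_a)
    m = V @ (X.T @ (lam * z) + Sigma_a @ beta_a)
    return rng.multivariate_normal(m, V)

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 2))
y = (rng.random(20) < stats.t.cdf(X @ np.array([0.8, -0.4]), df=7)).astype(float)
Sigma_a = 30.0 * X.T @ X                                   # Sigma_a = c X^T X, c = 30
beta = np.zeros(2)
for _ in range(100):
    beta = da_step(beta, X, y, 7, Sigma_a, np.zeros(2), rng)
print(beta)
```

The per-iteration cost is dominated by forming and inverting the p × p matrix X^T ΛX + Σ_a.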
V. Roy

Geometric convergence of the DA algorithm
We begin by defining what it means for the DA algorithm to converge at a geometric rate. Let K(·, ·) denote the Markov transition function (Mtf) of the DA Markov chain; that is, K(β, A) = ∫_A k(β′ | β) dβ′ for measurable A ⊆ R^p, and, for m ≥ 2, let K^m(β, A) = ∫_{R^p} K^{m−1}(β′, A) k(β′ | β) dβ′ denote the m-step Mtf, where K¹ ≡ K. Let Π(· | y) denote the probability measure corresponding to the posterior density π(β | y). Harris ergodicity implies that the total variation distance between the probability measures K^m(β, ·) and Π(· | y) decreases to zero as m gets large; that is, for any starting value β ∈ R^p, we have ||K^m(β, ·) − Π(· | y)||_TV → 0 as m → ∞. Note that this gives no information about the rate at which the total variation distance converges to zero. The Markov chain is called geometrically ergodic if there exist a function M : R^p → [0, ∞) and a constant r ∈ (0, 1) such that, for all m,

||K^m(β, ·) − Π(· | y)||_TV ≤ M(β) r^m.

It is known that if a reversible Markov chain is geometrically ergodic, then there is a CLT (1.3) for every function that is square integrable with respect to the stationary distribution [29]. Unfortunately, Harris ergodicity of a Markov chain, which is generally easy to verify, does not imply geometric ergodicity. We will establish geometric ergodicity of the DA algorithm by establishing the so-called drift condition, which we now describe. (See Jones and Hobert [13] for a gentle introduction to these ideas.) A function V : R^p → R_+ is said to be unbounded off compact sets if, for each α > 0, the level set {β : V(β) ≤ α} is compact. We say that a geometric drift condition holds if there exist a function V : R^p → R_+ that is unbounded off compact sets and constants ρ ∈ [0, 1) and L < ∞ such that

E[V(β′) | β] ≤ ρ V(β) + L for all β ∈ R^p,

where the expectation is with respect to the Markov transition density k(β′ | β). The function V is called the drift function. Since π(β, (λ, z) | y) > 0 for all (β, λ, z) ∈ R^p × R_+^n × Z, a geometric drift condition implies that the DA algorithm is geometrically ergodic (Meyn and Tweedie [22, chap. 15], Hobert [8]).
Let W be an n × p matrix whose ith row is w_i^T, where w_i = x_i I_{{0}}(y_i) − x_i I_{{1}}(y_i). We now define two conditions on y and X which are used to prove the geometric ergodicity of the DA algorithm.
A1 The design matrix X has full rank; and
A2 there exists a vector a = (a_1, ..., a_n)^T with a_i > 0 for all i = 1, 2, ..., n such that W^T a = 0.
[32] provides a simple way to check condition A2 that can be easily implemented in most statistical software languages. By establishing a drift condition for the DA algorithm, we prove the following theorem in Appendix B.
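Condition A2 is a linear feasibility problem, so one way to implement such a check (a hedged sketch of ours; [32]'s exact formulation may differ) is via linear programming: a strictly positive a with W^T a = 0 exists if and only if, after rescaling, some a with every a_i ≥ 1 and W^T a = 0 exists.

```python
import numpy as np
from scipy.optimize import linprog

def check_A2(W):
    """Check A2: does some a with all a_i > 0 satisfy W^T a = 0?
    By rescaling, such an a exists iff the LP (a_i >= 1, W^T a = 0) is feasible."""
    n, p = W.shape
    res = linprog(c=np.zeros(n), A_eq=W.T, b_eq=np.zeros(p),
                  bounds=[(1, None)] * n, method="highs")
    return res.status == 0                      # status 0: a feasible optimum was found

# Columns of W_ok mix signs, so a positive combination can vanish ...
W_ok = np.array([[1.0, 2.0], [-1.0, 1.0], [0.5, -3.0]])
# ... while a one-signed column forces a = 0, violating A2.
W_bad = np.array([[1.0, 2.0], [1.0, 1.0], [0.5, 3.0]])
print(check_A2(W_ok), check_A2(W_bad))          # True, then False
```

The LP has a zero objective, so the solver is used purely as a feasibility test.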
Theorem 1. Assume that A1 and A2 hold. The DA algorithm is geometrically ergodic if Σ_a = cX^T X, ν > 2 and

n < cν/(ν + 1).    (2.2)

Ideally, we would like to be able to say that the DA algorithm is geometrically ergodic for any n, ν, y, X, and Σ_a. As mentioned in the Introduction, the assumption Σ_a = cX^T X is made in the literature, and this simple choice requires only one hyperparameter c to be specified. Several inequalities used in the proof of Theorem 1 depend heavily on this assumption. Similarly, the conditions A1 and A2 are crucial in our proof. The condition ν > 2 guarantees the existence of a finite second moment for the latent t random variables. In our opinion, the most restrictive of all the conditions is (2.2), which is a direct consequence of the several inequalities we have used in the proof. Note that (2.2) implies that n < c. Thus, when n is large, the elements of the prior covariance matrix become small. We prove Theorem 1 by establishing a drift condition using the drift function V(β) = β^T X^T Xβ. We believe that a substantial weakening of the conditions in Theorem 1 would require the use of a different drift function V(β), which means starting again from square one.
In practice it is often assumed that β_a = 0. For the rest of this article, we assume that β_a = 0, and in this case the posterior density of β becomes

π(β | y) = p(y | β) φ_p(β; 0, Σ_a^{-1}) / m(y).

We can derive the corresponding complete posterior density π(β, (λ, z) | y) simply by replacing φ_p(β; β_a, Σ_a^{-1}) with φ_p(β; 0, Σ_a^{-1}) in (2.1), and use it to construct a data augmentation algorithm for π(β | y). A single iteration of this DA algorithm uses the current state β to produce the new state β′ through the following two steps:

1. Draw {(λ_i, z_i), i = 1, 2, ..., n} by first drawing z_i ~ Tt_ν(x_i^T β, y_i) and then drawing λ_i | z_i ~ Gamma((ν + 1)/2, (ν + (z_i − x_i^T β)²)/2).
2. Draw β′ ~ N_p((X^T ΛX + Σ_a)^{-1} X^T Λz, (X^T ΛX + Σ_a)^{-1}).

Let {β̃_m}_{m=0}^∞ be the Markov chain underlying the above DA algorithm for π(β | y). From Theorem 1 it follows that, under conditions A1 and A2, {β̃_m}_{m=0}^∞ is geometrically ergodic if Σ_a = cX^T X, ν > 2 and n < cν/(ν + 1).
As explained in [36], standard DA algorithms often suffer from slow convergence. In the next section we present an algorithm that converges faster than {β̃_m}_{m=0}^∞.
The sandwich algorithm
Suppose that R((λ, z), (dλ′, dz′)) is a Markov transition function on R_+^n × Z that is reversible with respect to π(λ, z | y). Consider adding an extra step to the DA algorithm where, after (λ, z) is drawn in the first step, we move to a new value (λ′, z′) ~ R((λ, z), ·) before drawing the new value of β. To be more specific, let {β*_m}_{m=0}^∞ be a new Markov chain that proceeds from the current state β to the next state β′ via the following three steps:

1. Draw {(λ_i, z_i), i = 1, 2, ..., n} by first drawing z_i ~ Tt_ν(x_i^T β, y_i) and then drawing λ_i ~ Gamma((ν + 1)/2, (ν + (z_i − x_i^T β)²)/2).
2. Draw (λ′, z′) ~ R((λ, z), ·).
3. Draw β′ ~ N_p((X^T Λ′X + Σ_a)^{-1} X^T Λ′z′, (X^T Λ′X + Σ_a)^{-1}),

where Λ′ is the n × n diagonal matrix with the elements of λ′ on the diagonal. Note that the first and third steps are the same as in the DA algorithm. [37] call the above algorithm the "sandwich algorithm" since the draw from R((λ, z), ·) is sandwiched between the two steps of the DA algorithm. We will provide a specific R later in this section.
A routine calculation shows that the reversibility of R with respect to π(λ, z | y) implies that the sandwich chain, {β*_m}_{m=0}^∞, is reversible with respect to the target (posterior) density π(β | y). The sandwich algorithm is known to converge faster than the DA algorithm. In order to make precise comparisons of the DA and sandwich algorithms, we need to introduce some notation. Let

L²₀(π) = {h ∈ L²(π) : E_π h = 0};

that is, L²₀(π) is the subspace of L²(π) consisting of mean-zero functions. Then L²₀(π) is a Hilbert space with inner product ⟨h₁, h₂⟩ = ∫_{R^p} h₁(β) h₂(β) π(β | y) dβ and norm ||h|| = √⟨h, h⟩. Let K, K* : L²₀(π) → L²₀(π) denote the usual Markov operators defined by the DA chain {β̃_m}_{m=0}^∞ and the sandwich chain {β*_m}_{m=0}^∞, respectively. In particular, K maps h ∈ L²₀(π) to

(Kh)(β) = ∫_{R^p} h(β′) k(β′ | β) dβ′,

where k(β′ | β) is the Markov transition density of the DA chain {β̃_m}_{m=0}^∞ defined in Section 2. The Markov operator K* is similarly defined. The reason that we define K as an operator on L²₀(π) (instead of L²(π)) is to eliminate the eigenvalue 1 associated with the constant eigenfunction from its spectrum [see e.g. 23]. Let ||K|| and ||K*|| denote the (operator) norms of K and K*; for example, ||K|| = sup_{h ∈ L²₀(π), ||h|| = 1} ||Kh||. In general, the closer the norm of a Markov operator is to 0, the faster the corresponding Markov chain converges [see, e.g., 18]. There are close connections between the convergence properties discussed in the previous section and the norm of a Markov operator. Indeed, a reversible Markov chain is geometrically ergodic if and only if the norm of the corresponding Markov operator is strictly less than 1 [29, 30]. From Roy [31] we know that ||K*|| ≤ ||K||, and hence the sandwich chain converges at least as fast as the DA chain.
Another criterion used to compare Markov chains (with the same stationary distribution) is efficiency ordering. Let h ∈ L²(π) and suppose we want to estimate E_π h. Define σ²_h to be the asymptotic variance in the CLT for the ergodic averages h̄_m based on the DA algorithm if such a CLT exists, and ∞ otherwise. Similarly define σ*²_h for the sandwich algorithm. The sandwich chain is said to be at least as efficient as the DA chain if σ*²_h ≤ σ²_h for all h ∈ L²(π). In this case, after running the two chains for an equal number of iterations, we may expect the sandwich algorithm to produce a shorter CI for E_π h than the DA algorithm. This is why, if the two algorithms are similar in terms of simulation effort, we prefer the sandwich algorithm over the DA algorithm. In fact, the results in Hobert and Marchev [10] can be used to show that K − K* is a positive operator, that is, ⟨(K − K*)h, h⟩ ≥ 0 for all h ∈ L²₀(π), and this implies that the sandwich chain is at least as efficient as the DA chain [23].
Recall that the only difference between a single iteration of the DA algorithm and that of the sandwich algorithm is that the sandwich algorithm has an extra step according to the Mtf R. Following [19] and [17], [10] gave a recipe for constructing R that involves group actions and (left) Haar measure. In order to use Hobert and Marchev's [2008] group action recipe, let G be the multiplicative group R_+, where the group composition is multiplication. The multiplicative group R_+ is unimodular with Haar measure ϱ(dg) = dg/g, where dg denotes Lebesgue measure on R_+. We now define a (left) group action of G on R_+^n × Z (the support of the density π(λ, z | y)) as g(λ, z) = (λ, gz), where gz = (gz_1, ..., gz_n). (It is easy to verify that this is a valid group action.) If g ∈ G and h : R_+^n × Z → R is an integrable function (with respect to Lebesgue measure), straightforward calculations show that

∫ h(g(λ, z)) dλ dz = χ(g^{-1}) ∫ h(λ, z) dλ dz,    (3.1)

where χ(g) = g^n. Also, it is easy to see that χ(g^{-1}) = 1/χ(g) and χ(g₁g₂) = χ(g₁)χ(g₂). Here χ(g) plays the role of the function j defined in Hobert and Marchev [10, page 543]. Consider a distribution on G with density function (with respect to the Haar measure ϱ(dg))

f_{λ,z}(g) ∝ π(λ, gz | y) χ(g),    (3.2)

which is well defined provided the normalizing constant d(λ, z) := ∫_G π(λ, gz | y) χ(g) ϱ(dg) is positive for all (λ, z) ∈ R_+^n × Z and finite for almost all (λ, z) ∈ R_+^n × Z. We now provide an R for Step 2 of the sandwich algorithm. Given (λ, z), we make the transition (λ, z) → (λ′, z′) by drawing g from (3.2) and setting λ′ = λ (that is, λ is left unaltered) and z′ = gz. In other words, the corresponding Markov operator R on L²₀(π(λ, z | y)) maps h to (Rh)(λ, z) = ∫_G h(λ, gz) f_{λ,z}(g) ϱ(dg). From Hobert and Marchev's [2008] Proposition 3, it follows that R is reversible with respect to π(λ, z | y). Of course, the Markov chain driven by R is reducible, which is a common feature of efficient sandwich algorithms [14]. Note that the reducibility of R does not stop the sandwich algorithm {β*_m}_{m=0}^∞ from being Harris ergodic. From our discussion above, we know that this sandwich algorithm is more efficient and converges faster than the DA algorithm. The extra step R, which is the sole difference between the DA and sandwich algorithms, is just a single draw from a univariate gamma density, and since computing the parameters of this gamma density does not involve any extra (compared to the DA algorithm) computationally demanding calculation such as a matrix inversion, the two algorithms are essentially equivalent in terms of computer time per iteration. Following the proof of Corollary 1 in [32], we can show that the sandwich algorithm inherits the geometric ergodicity of the DA algorithm.
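For this model (with β_a = 0), carrying out the g-integral in (3.2) suggests that the draw reduces to g² ~ Gamma(n/2, rate Q/2) with Q = z^T Λz − (X^T Λz)^T (X^T ΛX + Σ_a)^{-1} (X^T Λz); we state this as our own derivation rather than a quotation of the paper's formula. A sketch of the extra step under that assumption:

```python
import numpy as np

def sandwich_extra_step(z, lam, X, Sigma_a, rng):
    """Extra step R: draw g with g^2 ~ Gamma(n/2, rate Q/2) and set z' = g z,
    leaving lambda unchanged; Q is the residual quadratic form below."""
    n = len(z)
    M = np.linalg.inv(X.T @ (lam[:, None] * X) + Sigma_a)  # (X^T Lam X + Sigma_a)^{-1}
    b = X.T @ (lam * z)                                    # X^T Lam z
    Q = z @ (lam * z) - b @ M @ b
    g = np.sqrt(rng.gamma(n / 2.0, 2.0 / Q))               # numpy gamma takes scale = 2/Q
    return g * z

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 2))
z = rng.standard_normal(20)
lam = rng.gamma(3.5, 2.0 / 7.0, size=20)
Sigma_a = 30.0 * X.T @ X
z_new = sandwich_extra_step(z, lam, X, Sigma_a, rng)
```

Note that (X^T ΛX + Σ_a)^{-1} is needed anyway for the β-draw in Step 3 (where λ′ = λ), so, as claimed above, the extra step adds no additional matrix inversion.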
Note that, unlike most examples of sandwich algorithms available in the literature, in the above we do not let the group G act on R_+^n × Z through component-wise multiplication; i.e., we do not define g(λ, z) = (gλ, gz). A similar group action is used in [20] to improve a DA algorithm for a Bayesian logistic model. Now we discuss what happens when G is allowed to act on the entire augmented space R_+^n × Z through component-wise multiplication, that is, g(λ, z) = (gλ, gz). Note that, under this group action, (3.1) holds if we replace χ(g) with g^{2n}. To construct the sandwich algorithm in this case, we must first demonstrate that there is a probability density (with respect to the Haar measure ϱ(dg)) that is proportional to π(gλ, gz | y) g^{2n}. In other words, we must show that d₁(λ, z) := ∫_G π(gλ, gz | y) g^{2n} ϱ(dg) < ∞ for all (λ, z) ∈ R_+^n × Z. It can be shown that, as a function of g, π(gλ, gz | y) g^{2n} is of the form a g^{n(3+ν)/2} e^{−gν Σ_{i=1}^n λ_i/2} times a factor arising from integrating β out of the normal terms, where the constant a does not depend on g. Since Σ_a is positive semidefinite, it follows [28, page 70] that (gX^T ΛX)^{-1} − (gX^T ΛX + Σ_a)^{-1} is positive semidefinite; using this fact, the β-integral factor can be bounded above, and it follows that d₁(λ, z) < ∞ for all (λ, z) ∈ R_+^n × Z. Now, in order to construct an effective sandwich algorithm, we need a fast and efficient way of sampling from the resulting density on R_+, t_{λ,z}(g) say (with respect to Lebesgue measure on R_+).
Of course, in this case Step 2 of the sandwich algorithm would be to draw g ~ t_{λ,z}(g) and set z′ = gz as well as λ′ = gλ. Note that, even though t_{λ,z}(g) is a univariate density, it may be difficult to sample from efficiently. Also, in this case, since λ is updated in Step 2, an extra (compared to the DA algorithm) matrix inversion is required in Step 3 of the sandwich algorithm. This is why we prefer the sandwich algorithm with Step 2 as defined in (3.2).
the standard errors corresponding to the DA and sandwich algorithms are 2.684 and 1.169, respectively. (Note that neither Corollary 1 nor Corollary 2 is applicable here anymore, since c ≯ n(ν + 1)/ν; we are simply assuming that the CLT still holds in this case.) These estimates suggest that, even in this simple example, the DA algorithm requires about 2.684²/1.169² ≈ 5.3 times as many iterations as the sandwich algorithm to achieve the same level of precision. We repeated the simulation 1,000 times independently, and the above ratio estimates ranged between 2.34 and 24.41.

Discussion
We present two MCMC algorithms for exploring the posterior density associated with a Bayesian robit model with a normal prior on the regression coefficients.
The first one is a DA algorithm, obtained using the fact that the t distribution can be expressed as a scale mixture of normal distributions. We then use the group action recipe given in [10] to construct an efficient sandwich algorithm. Unlike most of the sandwich algorithms available in the literature, we do not consider a group action on the entire augmented space defined through component-wise multiplication. The sandwich algorithm converges faster than the DA algorithm in the Markov operator norm sense. Also, the sandwich algorithm is more efficient than the DA algorithm in the sense that the asymptotic variance in the CLT under the sandwich algorithm is no larger than that under the DA algorithm. Since the only difference between a single iteration of the sandwich algorithm and that of the DA algorithm is a univariate draw from a gamma distribution, the two algorithms are essentially equivalent in terms of computational effort. Thus, we prefer the sandwich algorithm to the DA algorithm. We prove that, under certain conditions, both the DA and sandwich algorithms converge at a geometric rate. These convergence rate results are important from a practical standpoint because geometric ergodicity guarantees the existence of central limit theorems, which are essential for the calculation of valid asymptotic standard errors for MCMC-based estimates. Our results are illustrated through a numerical example.
In this paper we have considered a multivariate normal prior distribution for the regression coefficients β. As a possible avenue for future work, it would be interesting to see whether the methods presented here can be used to establish geometric convergence of MCMC algorithms for Bayesian robit models with other priors on β, e.g., a multivariate Student's t prior or an improper uniform prior. Recently, [20] [see also 11] showed that a DA algorithm can be obtained for Bayesian logistic models using the mixture normal representation of the logistic distribution. It would also be interesting to see whether the DA algorithms for Bayesian logistic regression converge at a geometric rate.

Appendix A: A Mills-ratio-type result for the Student's t distribution
The following result can be gleaned from [25], but we give a proof here for completeness.
Proof. Let U ~ t_ν(0, 1); that is, U follows the Student's t distribution with location 0, scale 1 and ν degrees of freedom, with pdf

f_ν(u) = [Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))] (1 + u²/ν)^{−(ν+1)/2}.

Since u > 0, applying a well-known bound for the Mills ratio [3, p. 175] to this pdf yields the stated inequality, which completes the proof.

Appendix B: Proof of Theorem 1

We establish a geometric drift condition for the drift function V(β) = β^T X^T Xβ. By iterated conditional expectations,

E[V(β′) | β] = E[ E[ E[V(β′) | λ, z, y] | z, β, y ] | β, y ],    (B.1)

where, as the notation suggests, the (conditional) expectations are with respect to the densities π(β′ | λ, z, y), π(λ | z, β, y) and π(z | β, y), in the given order. The innermost expectation in (B.1) is with respect to π(β′ | λ, z, y), which is a multivariate normal density. The next expectation is with respect to π(λ | z, β, y), which is a product of univariate gamma densities. And, lastly, the outermost expectation is with respect to π(z | β, y), which is a product of truncated Student's t densities.
Starting with the innermost expectation, since π(β′ | λ, z, y) is a multivariate normal density, we have

E[β′^T X^T Xβ′ | λ, z, y] = µ^T X^T Xµ + tr((X^T X)(X^T ΛX + Σ_a)^{-1}),

where µ denotes the conditional mean of β′ and tr(·) denotes the trace of a matrix. Note that tr((X^T X)(X^T ΛX + Σ_a)^{-1}) = (1/c) tr(Σ_a(X^T ΛX + Σ_a)^{-1}), where the equality follows from our assumption that Σ_a = cX^T X. Since X^T ΛX + Σ_a is a positive definite matrix and X^T ΛX is positive semidefinite, from Roy and Hobert's [2010] Lemma 3 it follows that this trace term is bounded. Now we consider the second term in (B.2), µ^T X^T Xµ, which can be decomposed into three terms, A, B and C say. All three terms are functions of the random vectors z and λ, as well as the prior covariance matrix Σ_a^{-1} and the data (y, X), and we analyze each of them separately. We begin with the second term, B, which can be bounded using the fact that (1/c)X^T ΛX is positive semidefinite together with our assumption that Σ_a = cX^T X. Also, since X^T ΛX is positive semidefinite, it follows [see e.g. 28, page 70] that Σ_a^{-1} − (X^T ΛX + Σ_a)^{-1} is positive semidefinite, which yields corresponding bounds for the remaining terms. For the middle expectation, recall that π(λ | z, β, y) is a product of n univariate gamma densities with λ_i | z, β, y ~ Gamma((ν + 1)/2, (ν + (z_i − x_i^T β)²)/2), so that E(λ_i | z, β, y) = (ν + 1)/(ν + (z_i − x_i^T β)²); combining this with (B.5) bounds the middle expectation. Finally, we consider the outermost expectation in (B.1). In order to calculate Σ_{i=1}^n E(z_i² | β, y), recall that, conditional on (β, y), z_1, ..., z_n are independent with z_i | β, y ~ Tt_ν(x_i^T β, y_i). Since ν > 2, simple calculations using the results in [15] show that E(z_i² | β, y) can be expressed in terms of the function κ defined in Appendix A. Note that Σ_{i=1}^n E(z_i² | β = 0, y) = nν/(ν − 2). In order to bound this expression when β ∈ R^p \ {0}, as in [32] we construct a partition of the set R^p \ {0} using the n hyperplanes defined by w_i^T β = 0. For a positive integer m, define N_m = {1, 2, ..., m}. Let D_1, D_2, ..., D_{2^n} denote all the subsets of N_n and, for each j ∈ N_{2^n}, define a subset of the p-dimensional Euclidean space as follows:

S_j = {β ∈ R^p \ {0} : w_i^T β ≤ 0 for all i ∈ D_j and w_i^T β > 0 for all i ∈ D̄_j},

where D̄_j denotes the complement of D_j; that is, D̄_j = N_n \ D_j. Note that the sets S_j are disjoint, ∪_{j=1}^{2^n} S_j = R^p \ {0}, and some of the S_j may be empty. Since condition A2 is in force, following [32] we can show that if S_j is nonempty, then so are D_j and D̄_j. Now define E = {j ∈ N_{2^n} : S_j ≠ ∅}. For each j ∈ E, define

R_j(β) = Σ_{i∈D_j} (w_i^T β)² / [Σ_{i∈D_j} (w_i^T β)² + Σ_{i∈D̄_j} (w_i^T β)²]    and    ρ_j = sup_{β∈S_j} R_j(β).
From [32] we know that ρ_j < 1 for all j ∈ E. Next, consider the Mills-ratio-type expression for the Student's t distribution defined in Appendix A; its supremum over u > 0, M say, satisfies M ∈ (0, ∞), and from Lemma 1 we have the corresponding bound when u > 0. Fix j ∈ E. It follows from (B.7) and the above two results that the drift inequality holds for all β ∈ S_j. Therefore, since ⊎_{j∈E} S_j = R^p \ {0}, the drift inequality holds for all β ∈ R^p with drift coefficient ρ determined by max_{j∈E} ρ_j; since E is a finite set, ρ is also less than 1. So the DA chain is geometrically ergodic.