Geometric ergodicity of Gibbs samplers for Bayesian error-in-variable regression

Multivariate Bayesian error-in-variable (EIV) linear regression is considered to account for additional additive Gaussian error in the features and response. A 3-variable deterministic scan Gibbs sampler is constructed for multivariate EIV regression models using classical and Berkson errors with independent normal and inverse-Wishart priors. These Gibbs samplers are proven to always be geometrically ergodic, which ensures a central limit theorem for many time averages from the Markov chains. We demonstrate the strengths and limitations of the Gibbs sampler with simulated data for large data problems and robustness to misspecification, and also analyze a real-data example in astrophysics.

Bayesian approaches develop a strategy for additional error in the variables by constructing a new model incorporating that error. We consider multivariate Bayesian EIV linear regression (Charisse Farr et al., 2020; Dellaportas and Stephens, 1995; Fang et al., 2017; Huang, 2010; Mallick and Gelfand, 1996; Muff et al., 2015; Richardson S, 1993; Rodrigues and Bolfarine, 2007; Torabi et al., 2021; Vidal and Arellano-Valle, 2010) accounting for additive Gaussian error in the features (covariates) and response. We assume the variability of the additive Gaussian error is known beforehand, which arises often in astrophysics in the presence of known instrumentation error (Hilbe, de Souza and Ishida, 2017; Kelly, 2012). Alternative approaches to EIV models attempt to correct existing parameter estimation methods such as least squares or the method of moments with weighting and other techniques (Fuller, 1987; Stefanski and Carroll, 1985). Several other strategies for EIV modeling are discussed in more comprehensive treatments of the topic (Buonaccorsi, 2010; Carroll et al., 2006; Fuller, 1987).
We write x ∼ N_d(m, C) to mean the d-dimensional normal distribution with mean m and symmetric, positive-definite (SPD) covariance matrix C. We also write x ∼ W⁻¹_d(ν, V) to mean the inverse-Wishart distribution with positive integer degrees of freedom ν ≥ d and SPD scale matrix V ∈ R^{d×d}. Let vec(A) denote the vectorization of a matrix A by stacking its columns. Let (Y_i, X_i, Z_i), i = 1, . . ., n, be independent and identically distributed (i.i.d.), where the response Y_i takes values in R^m, the features X_i take values in R^p, and the fixed, known features Z_i ∈ R^q, with m, n, p, q positive integers. Let θ = vec(Θ) ∈ R^{qm}, β = vec(B) ∈ R^{pm}, and the SPD matrix Σ ∈ R^{m×m} be the unknown regression and covariance parameters, respectively. We introduce new parameters A = (A_1, . . ., A_n)^T with A_i ∈ R^p to model additional error in X_i using classical or Berkson errors (Berkson, 1950). The classical error model specifies X_i | A_i, and the Berkson error model (Berkson, 1950) instead assumes a data-dependent prior on A_i | X_i. When there is additional error in X_i, the EIV linear regression model for i ∈ 1, . . ., n is i.i.d. with

Y_i | A_i, θ, β, Σ ∼ N_m(Θ^T Z_i + B^T A_i, Σ),  X_i | A_i ∼ N_p(A_i, V_i) (classical) or A_i | X_i ∼ N_p(X_i, V_i) (Berkson),  (1)

where the SPD matrices V_i ∈ R^{p×p} are known. When there is also additional error in the responses Y_i, we assume an i.i.d. hierarchical regression model with

Y_i | ν_i ∼ N_m(ν_i, U_i),  ν_i | A_i, θ, β, Σ ∼ N_m(Θ^T Z_i + B^T A_i, Σ),  (2)

together with the feature-error specification from (1), where the U_i ∈ R^{m×m} are known SPD matrices.
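To make the model concrete, the Berkson variant of model (1) can be simulated as below. This is a minimal sketch: the dimensions, parameter values, and the use of a single common error covariance V for all i are hypothetical choices, assuming the conditional mean structure Θ^T Z_i + B^T A_i.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p, q = 50, 2, 3, 1

# True parameters (hypothetical values for illustration).
Theta = rng.normal(size=(q, m))   # coefficients for the known features Z_i
B = rng.normal(size=(p, m))       # coefficients for the latent features A_i
Sigma = np.eye(m)                 # response covariance
V = 0.25 * np.eye(p)              # known feature-error covariance (V_i ≡ V here)

Z = rng.normal(size=(n, q))       # fixed, known features
X = rng.normal(size=(n, p))       # observed features

# Berkson errors: A_i | X_i ~ N_p(X_i, V).
A = X + rng.multivariate_normal(np.zeros(p), V, size=n)

# Responses: Y_i | A_i ~ N_m(Theta^T Z_i + B^T A_i, Sigma).
Y = Z @ Theta + A @ B + rng.multivariate_normal(np.zeros(m), Sigma, size=n)
```

The classical variant would instead draw the latent A_i first and set X_i | A_i ∼ N_p(A_i, V).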
We will be interested in the posterior for both models (1) and (2) using independent normal and inverse-Wishart priors on the parameters (A, θ, β, Σ). The independent prior is a popular choice in Bayesian regression models with and without measurement error (Carroll et al., 2006; Dellaportas and Stephens, 1995; Ekvall and Jones, 2021; Rajaratnam and Sparks, 2015). For the EIV regression models (1) and (2), the independent priors are chosen with a normal prior on (θ, β) with mean j_0 ∈ R^{m(q+p)} and SPD covariance matrix J_0 ∈ R^{m(q+p)×m(q+p)}, together with an inverse-Wishart prior on Σ (3). The classical and Berkson error models assume either a normal prior A_i ∼ N_p(k_i, K_i) or an improper flat prior on A_i (4), where k_i ∈ R^p and the K_i ∈ R^{p×p} are SPD matrices. For example, an exposure model (Gustafson, 2003) utilized often in epidemiology would assume classical errors with a data-dependent prior on each A_i depending on Z_i. In the Berkson error model, each A_i | X_i is already specified and it is natural to assume an improper flat prior on each A_i. Previous work has proposed Gibbs sampling (Geman and Geman, 1984) to draw samples from the posterior, denoted by Π_n, in Bayesian EIV regression models (Bhadra and Carroll, 2016; Carroll et al., 2006; Dellaportas and Stephens, 1995; Richardson S, 1993). However, trustworthy estimation from a Gibbs sampler requires the Markov chain to converge to the posterior distribution at a sufficiently fast rate. Consider a vector-valued function f with ∫ ‖f‖^{2+δ} dΠ_n < ∞ for some δ ∈ (0, ∞), and denote by f̄_m the time average of m samples from the Gibbs sampler. In order to be confident in the estimator f̄_m in applications, a standard error and confidence interval are essential. A Gibbs sampler is geometrically ergodic if, from any initialization point, its marginal distribution converges to Π_n at an exponential rate in total variation. Geometrically ergodic Gibbs samplers provide rich theoretical guarantees which are of practical relevance in applications. These Gibbs samplers satisfy a central limit theorem (Chan and Geyer, 1994; Jones, 2004), that is, √m (f̄_m − ∫ f dΠ_n) is asymptotically normally distributed and, under
suitable assumptions, the covariance matrix in this normal limit can be consistently estimated (Flegal and Jones, 2010). Further pertinent tools for ensuring reliable estimation, such as estimates of the effective sample size, consistent confidence ellipsoids, and consistent confidence intervals for quantile estimation, are also available (Doss et al., 2014; Vats, Flegal and Jones, 2019a).
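The standard error estimation referenced above can be sketched with a simple nonoverlapping batch means estimator of the asymptotic covariance Σ* in the Markov chain central limit theorem. This is a generic illustration, not the authors' code; the AR(1) chain is a stand-in for Gibbs output and all numerical settings are arbitrary.

```python
import numpy as np

def batch_means_cov(chain, n_batches=30):
    """Nonoverlapping batch means estimate of the asymptotic covariance
    Sigma* in the CLT: sqrt(T) * (mean(chain) - mu) -> N(0, Sigma*)."""
    T, d = chain.shape
    b = T // n_batches                       # batch length
    batches = chain[: n_batches * b].reshape(n_batches, b, d)
    bm = batches.mean(axis=1)                # batch means, shape (n_batches, d)
    centered = bm - chain[: n_batches * b].mean(axis=0)
    # Scale by the batch length so the estimate targets Sigma*.
    return b * (centered.T @ centered) / (n_batches - 1)

# Placeholder chain: a stationary AR(1) process standing in for Gibbs output.
rng = np.random.default_rng(1)
T, d, rho = 20_000, 3, 0.5
chain = np.zeros((T, d))
for t in range(1, T):
    chain[t] = rho * chain[t - 1] + rng.normal(size=d)

Sigma_star = batch_means_cov(chain)
```

The eigenvalues of such an estimate are exactly the diagnostics plotted in the simulation sections later in the paper.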
To the best of our knowledge, the rate of convergence for Gibbs sampling in EIV regression models has not been previously investigated. Related approaches have instead proposed variational Bayesian methods (Bresson et al., 2021; Pham, Ormerod and Wand, 2013) and the integrated nested Laplace approximation (INLA) (Håvard Rue, 2009; Muff et al., 2015). We construct a general density which, in special cases, is the posterior for the 4 Bayesian EIV regression models (1) and (2) using the independent normal and inverse-Wishart prior choice on the parameters (3) and (4). Our main contribution constructs a 3-variable deterministic scan Gibbs sampler for this general density, and we show it is always geometrically ergodic using a drift and minorization condition (Hairer and Mattingly, 2011; Meyn and Tweedie, 2009). The 3-variable Gibbs sampler we construct can be simulated efficiently on a computer without the need for complex Metropolis-Hastings or rejection sampling steps at each iteration.
The organization of this paper is as follows. In Section 2, we construct a general EIV regression density and a 3-variable Gibbs sampler for this density. We show the Gibbs sampler is always geometrically ergodic and apply this to the 4 multivariate Bayesian EIV regression models presented in this introduction. Section 3 studies the algorithm empirically, where we demonstrate limitations of the Gibbs sampler with simulated data for large data problems and also the behavior of the Gibbs sampler under model misspecification. Section 4 studies a real-data example in astrophysics concerning supermassive black hole mass (Harris, Poole and Harris, 2014; Hilbe, de Souza and Ishida, 2017). Finally, in Section 5, we discuss our results and future research directions.

General Gibbs Sampler for EIV regression
For positive integers p, define the p-norms by ‖·‖_p and the Frobenius norm by ‖·‖_F. Let ⊗ denote the Kronecker product. The posteriors for the Bayesian EIV regression models (1) and (2) using the independent prior choices (3) and (4), for both classical and Berkson errors, share a common general form which we study in this section. The posterior densities for these Bayesian EIV regression models are special cases of the density (5) but differ depending on the EIV modeling choice, as illustrated in the subsequent sections. For i ∈ 1, . . ., n, define hyperparameters accordingly. We will construct a 3-variable deterministic scan Gibbs sampler using a specific update order for the density (5). We also derive the conditional densities for the Gibbs sampler, which can be sampled directly. Initialize (θ_0, β_0, Σ_0) and

Finally, generate independently
For points (A, θ, β, Σ) and (A′, θ′, β′, Σ′), the Gibbs sampler has a Markov transition density and a Markov transition kernel P defined for suitable sets B. The Markov kernel at larger iteration times t ≥ 2 is defined recursively with P¹ ≡ P. We will use a drift function V combined with a minorization condition to show there are ρ ∈ (0, 1) and M_0 ∈ (0, ∞) so that, for any initialization (A, θ, β, Σ),

‖P^t((A, θ, β, Σ), ·) − Π_n‖_TV ≤ M_0 ρ^t (1 + V(A, θ, β))  (6)

(Hairer and Mattingly, 2011). The condition (6) implies the Gibbs sampler is geometrically ergodic. We now state our main result.
Theorem 1. The 3-variable deterministic scan Gibbs sampler for the general density (5) is geometrically ergodic.
Proof. Using a special property of the Gibbs sampler, it will be sufficient to develop a drift and minorization condition based only on the marginal chain (A_t, θ_t, β_t)_t (Roberts and Rosenthal, 2001, Example 3.6). In particular, we will use the special property of this Gibbs Markov kernel P that, for suitable sets B, P(·, B) is a function of only the parameters (A, θ, β) and does not depend on Σ. We first show a minorization condition. For ε ∈ (0, ∞), define the function g_ε and the constant Z_{g_ε} = ∫ g_ε(A′, θ′, β′, Σ′) dA′ dθ′ dβ′ dΣ′. The drift function V is a continuous, strongly convex function on a closed, convex domain, so its sublevel sets are closed and bounded (Nesterov, 2018, Corollary 3.2.2). For fixed θ′, β′, Σ′, the function is continuous and achieves its minimum over compact sets. Thus, Z_{g_ε} is not 0 and we can define the corresponding probability measure. For any ε ∈ (0, ∞) and any suitable set B, taking the infimum over the sublevel set yields the minorization. It remains to show a drift condition. Fix A_0, θ_0, β_0, and fix i ∈ 1, . . ., n. Taking the expectation with respect to A | θ_0, β_0, Σ and using the singular value decomposition (Horn and Johnson, 2012, Theorem 2.6.3), choose the corresponding orthogonal and diagonal matrices. Using (8) and properties of the trace, the inequality valid for x ∈ [0, ∞) and a ∈ (0, ∞), the inequalities (8) and (9), the sub-multiplicativity of the matrix norm, and ‖U_i‖_2 = 1, we obtain an upper bound. Define the matrix X = (d_1, . . ., d_n)^T. Applying these upper bounds to (7) and combining for each i ∈ 1, . . ., n, then using convexity and the sub-multiplicativity of the matrix norm, we have the stated bound. Now take the expectation with respect to θ, β | A_0, θ_0, β_0, Σ.
Using the singular value decomposition (Horn and Johnson, 2012, Theorem 2.6.3), choose matrices U ∈ R^{mn×mn} and V ∈ R^{m(p+q)×m(p+q)} with U^T U = U U^T = I_{mn} and V^T V = V V^T = I_{m(p+q)}, and a rectangular diagonal matrix Σ_{A_0} ∈ R^{mn×m(p+q)} with nonnegative singular values on the diagonal. Using (11) and properties of the trace, convexity, the inequality (9), and the identity (11), we combine the resulting upper bounds. Now using (10) and (12) and taking the iterated expectation with respect to θ, β | A_0, θ_0, β_0, Σ, and since Σ⁻¹ has a Wishart distribution, properties of the trace apply. Similarly, we use the second moment formula of the Wishart distribution (Letac and Massam, 2004) to get the corresponding upper bound. Taking the iterated expectation with respect to Σ | A_0, θ_0, β_0 in (13), there is a constant L ∈ (0, ∞) so that the drift condition is satisfied.
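For reference, the drift and minorization framework used in the proof can be summarized as follows. This is a paraphrase of the Harris-type theorem in Hairer and Mattingly (2011), with generic constants γ, L, R, ε not tied to the specific bounds derived above:

```latex
\begin{align*}
&\textbf{Drift:}\quad \mathbb{E}\left[V(X_{t+1}) \mid X_t = x\right] \le \gamma V(x) + L
  \quad\text{for some } \gamma \in (0,1),\ L \in (0,\infty), \\
&\textbf{Minorization:}\quad \inf_{x\,:\,V(x) \le R} P(x, B) \ge \varepsilon\, \nu(B)
  \quad\text{for some } R > \tfrac{2L}{1-\gamma},\ \varepsilon \in (0,1),
  \text{ and probability measure } \nu, \\
&\Longrightarrow\quad \exists\, \rho \in (0,1),\ M_0 \in (0,\infty):\quad
  \left\| P^t(x, \cdot) - \Pi_n \right\|_{\mathrm{TV}} \le M_0\, \rho^t \left(1 + V(x)\right).
\end{align*}
```

The proof above establishes a drift inequality of the first form for the marginal chain (A_t, θ_t, β_t)_t and a minorization on the sublevel sets of V.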

Bayesian EIV regression with errors in the features
Using Theorem 1, we develop geometrically ergodic Gibbs samplers for Bayesian EIV regression with additive Gaussian error in the features. For the remainder, we write the observed data as Y = (y_1, . . ., y_n)^T ∈ R^{n×m}, X = (x_1, . . ., x_n)^T ∈ R^{n×p}, and Z = (Z_1, . . ., Z_n)^T ∈ R^{n×q}. Consider the Bayesian EIV regression (1) with Berkson errors and priors (3) and (4). We write the posterior density π_n for this Bayesian model accordingly. This posterior density is a special case of the general density (5), choosing M ≡ Z, R ≡ Y, (c_0, C_0) ≡ (j_0, J_0), and so on. We can define a 3-variable deterministic scan Gibbs sampler which generates a Markov chain (A_t, θ_t, β_t, Σ_t)_{t=0}^∞ for this posterior density as a special case of the Gibbs sampler constructed in Section 2. Initialize (A_0, θ_0, β_0, Σ_0) and, for t = 1, 2, . . ., run the updates. Applying Theorem 1, we have the following result.
Corollary 1. The 3-variable Gibbs sampler (A_t, θ_t, β_t, Σ_t)_{t=0}^∞ for the posterior in Bayesian EIV regression (1) with Berkson errors and priors (3) and (4) is geometrically ergodic.

Now consider Bayesian EIV regression (1) with additive Gaussian error in X_i using classical errors and priors (3) and (4). The posterior has the corresponding density. This posterior density is also a special case of the general density (5) when M ≡ Z, R ≡ Y, and (c_0, C_0) ≡ (j_0, J_0). We define a 3-variable deterministic scan Gibbs sampler similarly. Initialize (A_0, θ_0, β_0, Σ_0) and, for t = 1, 2, . . ., run the updates. We also have the following as a direct result of Theorem 1.

Corollary 2. The 3-variable Gibbs sampler (A_t, θ_t, β_t, Σ_t)_{t=0}^∞ for the posterior in Bayesian EIV regression (1) with classical errors and priors (3) and (4) is geometrically ergodic.
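As an illustration of the sampler structure, a univariate special case (m = p = q = 1, with Z_i ≡ 1 an intercept) of the classical-error sampler admits standard conjugate full conditionals and can be sketched as follows. This is not the authors' implementation: the prior hyperparameter values are hypothetical, and the conditionals are standard conjugate derivations for this simplified model.

```python
import numpy as np

def eiv_gibbs(Y, X, n_iter=5000, V=0.25, k=0.0, K=10.0,
              j0=np.zeros(2), J0=100.0 * np.eye(2), a0=2.0, b0=2.0, seed=0):
    """3-variable deterministic scan Gibbs sampler for a univariate
    classical-error EIV regression: Y_i | A_i ~ N(theta + beta*A_i, sig2),
    X_i | A_i ~ N(A_i, V), priors A_i ~ N(k, K), (theta, beta) ~ N(j0, J0),
    sig2 ~ Inverse-Gamma(a0, b0). All hyperparameter values are hypothetical."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    theta, beta, sig2 = 0.0, 0.0, 1.0
    J0_inv = np.linalg.inv(J0)
    draws = np.empty((n_iter, 3))
    for t in range(n_iter):
        # 1) A | theta, beta, sig2: independent conjugate normal updates.
        prec = beta**2 / sig2 + 1.0 / V + 1.0 / K
        mean = (beta * (Y - theta) / sig2 + X / V + k / K) / prec
        A = mean + rng.normal(size=n) / np.sqrt(prec)
        # 2) (theta, beta) | A, sig2: Bayesian linear regression update.
        W = np.column_stack([np.ones(n), A])
        P = W.T @ W / sig2 + J0_inv
        mu = np.linalg.solve(P, W.T @ Y / sig2 + J0_inv @ j0)
        L = np.linalg.cholesky(np.linalg.inv(P))
        theta, beta = mu + L @ rng.normal(size=2)
        # 3) sig2 | A, theta, beta: inverse-gamma update.
        resid = Y - theta - beta * A
        sig2 = 1.0 / rng.gamma(a0 + n / 2, 1.0 / (b0 + 0.5 * resid @ resid))
        draws[t] = theta, beta, sig2
    return draws
```

The deterministic scan order A, then (θ, β), then the variance parameter mirrors the 3-variable blocking used throughout the paper; no Metropolis-Hastings or rejection steps are needed since every block is a direct draw.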

Bayesian EIV regression with errors in the response and features
Similar to the previous section, we develop geometrically ergodic Gibbs samplers for Bayesian EIV regression with additional additive Gaussian error in the features and response. Consider the Bayesian EIV regression (2) with Berkson errors in X_i and additional error in Y_i, along with priors (3) and (4).

Corollary 3. The 3-variable Gibbs sampler (A_t, ν_t, θ_t, β_t, Σ_t)_{t=0}^∞ for Bayesian EIV regression (2) with Berkson errors and priors (3) and (4) is geometrically ergodic.

Now consider the Bayesian EIV regression (2) with classical errors in X_i and additional error in Y_i with priors (3) and (4). The posterior Π_n for this Bayesian model has a density which is also a special case of the density (5) when redefining θ ≡ (ν, θ)^T, M ≡ −I Z, R ≡ 0, c_0 = (y, j_0)^T, and so on. We define a 3-variable deterministic scan Gibbs sampler similarly. Initialize (A_0, ν_0, θ_0, β_0, Σ_0) and, for t = 1, 2, . . ., run the updates. Using Theorem 1, we have the following result.

Corollary 4. The 3-variable Gibbs sampler (A_t, ν_t, θ_t, β_t, Σ_t)_{t=0}^∞ for Bayesian EIV regression (2) with classical errors and priors (3) and (4) is geometrically ergodic.

Limitations of the Gibbs sampler in large problem sizes
Theoretically, we developed a qualitative convergence result for the Gibbs sampler in Bayesian multivariate EIV regression. It is important in practice to understand the relationship between the scaling of the problem size and the reliability of estimation from the Gibbs sampler. We use artificially generated data to empirically demonstrate the dependence of the Gibbs sampler on the dimension of the response m and the dimension of the features p, increasing in the configurations (m, p) = (1, 1), (2, 7), (3, 7) in the Bayesian posterior. Artificial data is generated according to the multivariate EIV Berkson linear regression model for i = 1, . . ., 50. We simulate T = 10^5 MCMC realizations from the Gibbs sampler in each configuration, using 10^4 realizations for burn-in, and analyze diagnostics for β_t taking values in R^{mp}. We independently replicate the simulation 5 times to reduce variability. Geometric ergodicity guarantees that, as T → ∞,

√T (β̄_T − ∫ β dΠ_n) → N_{mp}(0, Σ*)

in distribution, where β̄_T is the time average of the samples and Σ* is an SPD covariance matrix. The central limit theorem can be seen to hold in this case due to the established drift condition with the drift function. Figure 1a and Figure 1b plot the largest and smallest eigenvalues of a batch means estimate of the multivariate standard error matrix Σ*^{1/2} in the central limit theorem. This simulation shows an increase in the largest eigenvalue for larger problem sizes, which suggests more iterations are needed for appropriate estimation in practice. Figure 1c plots the multivariate effective sample size (Vats, Flegal and Jones, 2019b). We can also see a relatively sharp decrease in the estimated effective sample size as the problem size increases, suggesting the algorithm should be run for many iterations even in moderately sized problems. This simulation demonstrates that even though the algorithm is always geometrically ergodic and can scale reasonably well to larger problem sizes, the Gibbs
sampler generally requires many more iterations for reliable estimation even in moderately sized problems. Simulation code, written in Python (Van Rossum and Drake Jr, 1995) with NumPy (Harris et al., 2020) for matrix calculations, is available at https://github.com/austindavidbrown/BayesEIV.
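The multivariate effective sample size diagnostic (Vats, Flegal and Jones, 2019b) used in these simulations can be sketched as follows, pairing the sample covariance of the chain with a batch means estimate of the asymptotic covariance. This is a generic implementation of the published estimator, not the authors' code.

```python
import numpy as np

def multivariate_ess(chain, n_batches=30):
    """Multivariate effective sample size of Vats, Flegal and Jones (2019):
    mESS = T * (det(Lambda) / det(Sigma*))^(1/d), where Lambda is the sample
    covariance of the chain and Sigma* is a batch means estimate of the
    asymptotic covariance in the Markov chain CLT."""
    T, d = chain.shape
    Lambda = np.cov(chain, rowvar=False)
    # Nonoverlapping batch means estimate of Sigma*.
    b = T // n_batches
    bm = chain[: n_batches * b].reshape(n_batches, b, d).mean(axis=1)
    centered = bm - chain[: n_batches * b].mean(axis=0)
    Sigma_star = b * (centered.T @ centered) / (n_batches - 1)
    # Ratio of generalized variances, via slogdet for numerical stability.
    _, ld1 = np.linalg.slogdet(Lambda)
    _, ld2 = np.linalg.slogdet(Sigma_star)
    return T * np.exp((ld1 - ld2) / d)
```

For an i.i.d. chain the two covariance estimates roughly agree and mESS ≈ T; positive autocorrelation inflates Σ* and drives mESS below T, which is the sharp decrease seen in Figure 1c.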

Robustness to model misspecification
Although the multivariate Bayesian model for EIV accounts for additional error in the features, this error can be misspecified. In particular, the error X_i | A_i from the model in Section 3.1 may be a multivariate t distribution with heavier tails in practical problems. We are interested in empirically studying the robustness of the convergence of the Gibbs sampler to misspecification in this modeling error. Denote by t_d(v, m, V) a multivariate t distribution in dimension d with v degrees of freedom, location vector m, and scale matrix V. With df denoting the degrees of freedom, artificial data is generated according to the misspecified multivariate EIV Berkson linear regression model for i = 1, . . ., 50. We compare more dispersed tail behavior with df = 2 and less dispersed tail behavior with df = 10. We replicate the simulation 5 times, where we simulate T = 10^5 MCMC realizations from the Gibbs sampler in each configuration, using 10^4 realizations for burn-in, and analyze diagnostics for β_t. Similar to the previous simulation in Section 3.1, Figure 2a and Figure 2b plot the largest and smallest eigenvalues of a batch means estimate of the multivariate standard error matrix, and Figure 2c plots the multivariate effective sample size (Vats, Flegal and Jones, 2019b). We see similar behavior in the estimation from the Gibbs sampler for both the smaller and larger degrees of freedom. The simulation results suggest the Gibbs sampler is reasonably robust to misspecification of the tails in the error distribution of the features X_i | A_i.

Real-data example in astrophysics

We look at the Bayesian EIV linear regression proposed and analyzed in (Harris, Poole and Harris, 2014; Hilbe, de Souza and Ishida, 2017). The dataset consists of the central galaxy supermassive black hole mass and the stellar bulge velocity dispersion from n = 46 different galaxies (Harris, Poole and Harris, 2014). The response Y_i is the logarithm of the observed central black hole mass and the predictor variable X_i is the
logarithm of the observed velocity dispersion. The measurement errors for both the response and predictor variables are known beforehand. The EIV linear regression model studied in (Hilbe, de Souza and Ishida, 2017) uses the priors

σ² ∼ Inverse-Gamma(10⁻³, 10⁻³),  α ∼ N_1(0, 10³),  β ∼ N_1(0, 10³).

We generate 10^5 MCMC realizations from the Gibbs sampler. Figure 3 plots the autocorrelation, estimates of the standard errors in the central limit theorem, and effective sample sizes from these realizations. The autocorrelations are computed up to lag 20. Overall, we see the Gibbs sampler performs well. However, the standard error and effective sample size plots suggest that, even though the Gibbs sampler is geometrically ergodic, many iterations are still recommended even in low dimensions. These figures suggest empirical diagnostics for the regression parameter β, rather than the other parameters α and σ², are a reasonable choice for determining the reliability of the algorithm in practice.
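The lag-20 autocorrelation diagnostics can be computed with a short routine like the following. This is a generic sketch, not the paper's code; the normalization by the lag-0 sample autocovariance is the standard convention.

```python
import numpy as np

def autocorr(x, max_lag=20):
    """Sample autocorrelations of a scalar chain at lags 0, 1, ..., max_lag,
    normalized by the lag-0 sample autocovariance."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    c0 = x @ x / n  # lag-0 autocovariance
    return np.array([(x[: n - k] @ x[k:]) / (n * c0)
                     for k in range(max_lag + 1)])
```

Applied to each marginal chain (α_t, β_t, σ²_t), this produces the curves shown in Figure 3a.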

Conclusion and Future Directions
We showed that a 3-variable deterministic scan Gibbs sampler used to sample the posterior in 4 different multivariate Bayesian EIV regression models with additive Gaussian errors and independent priors is always geometrically ergodic. This is of pragmatic importance to practitioners, as trustworthy estimation from a Gibbs sampler depends on the speed of convergence of the Markov chain. More specifically, time averages from the Markov chains enjoy many practically relevant theoretical guarantees such as a central limit theorem. Secondly, these Gibbs samplers can be simulated efficiently without the need for complex, intermediate Metropolis-Hastings or rejection sampling steps. One drawback, however, is that our convergence analysis is qualitative, as we do not construct an explicit convergence rate.
There are many future research directions in studying the convergence of Gibbs samplers for EIV models. It appears reasonable that some Gibbs samplers for generalized linear models, such as the Pólya-Gamma sampler, will also be geometrically ergodic (Choi and Hobert, 2013; Polson, Scott and Windle, 2013; Wang and Roy, 2018). It would also be interesting to consider alternative errors in the variables, such as non-Gaussian or non-additive errors.
Fig 1: (a), (b) Largest and smallest eigenvalues of the MCMC standard error matrix targeting the average of β across iterations of the Gibbs sampler; (c) the multivariate effective sample size across iterations of the Gibbs sampler.

Fig 2: (a), (b) Largest and smallest eigenvalues of the MCMC standard error matrix; (c) the multivariate effective sample size across iterations of the Gibbs sampler.

Fig 3: (a) Autocorrelation for each regression parameter; (b) MCMC standard errors for the regression parameters; (c) MCMC effective sample size for each regression parameter.