A Metropolized adaptive subspace algorithm for high-dimensional Bayesian variable selection

A simple and efficient adaptive Markov Chain Monte Carlo (MCMC) method, called the Metropolized Adaptive Subspace (MAdaSub) algorithm, is proposed for sampling from high-dimensional posterior model distributions in Bayesian variable selection. The MAdaSub algorithm is based on an independent Metropolis-Hastings sampler, where the individual proposal probabilities of the explanatory variables are updated after each iteration using a form of Bayesian adaptive learning, in a way that they finally converge to the respective covariates' posterior inclusion probabilities. We prove the ergodicity of the algorithm and present a parallel version of MAdaSub with an adaptation scheme for the proposal probabilities based on the combination of information from multiple chains. The effectiveness of the algorithm is demonstrated via various simulated and real data examples, including a high-dimensional problem with more than 20,000 covariates.


Introduction
Variable selection in regression models is one of the big challenges in the era of high-dimensional data, where the number of explanatory variables may largely exceed the sample size. During the last two decades, many classical variable selection algorithms have been proposed, often based on finding the solution to an appropriate optimization problem. As the most famous example, the Lasso (Tibshirani, 1996) relies on an ℓ1-type relaxation of the original ℓ0-type optimization problem. Convex methods like the Lasso are computationally very efficient and are therefore routinely used in high-dimensional statistical applications. However, such classical methods mainly focus on point estimation and do not, per se, provide a measure of uncertainty concerning the best model, although recent works aim at addressing these issues as well (see e.g. Wasserman and Roeder, 2009, Meinshausen and Bühlmann, 2010, and Lee et al., 2016). On the other hand, a major advantage of a fully Bayesian approach is that it automatically accounts for model uncertainty. In particular, Bayesian model averaging (Raftery et al., 1997) and the median probability model (Barbieri and Berger, 2004) can be used for predictive inference. Furthermore, posterior inclusion probabilities of the individual covariates can be computed to quantify the Bayesian evidence. Important ℓ0-type criteria like the Bayesian Information Criterion (BIC, Schwarz, 1978) and the Extended Bayesian Information Criterion (EBIC, Chen and Chen, 2008) can be derived as asymptotic approximations to a fully Bayesian approach (compare e.g. Liang et al., 2013). It has been argued that ℓ0-type methods possess favourable statistical properties in comparison to convex ℓ1-type methods with respect to variable selection and prediction (see e.g. Raskutti et al., 2011 and Narisetty and He, 2014).
Since solving the associated, generally NP-hard, discrete optimization problems by an exhaustive search is computationally prohibitive, there have been recent attempts at providing more efficient methods for such problems, for example mixed integer optimization methods (Bertsimas et al., 2016) and Adaptive Subspace (AdaSub) methods (Staerk, 2018; Staerk et al., 2021).
A challenging practical issue of a fully Bayesian approach is similar to that of optimizing ℓ0-type information criteria: computing (approximate) posterior model probabilities for all possible models is not feasible if the number of explanatory variables p is very large, since in general there are 2^p possible models to be considered. Often, Markov Chain Monte Carlo (MCMC) methods based on Metropolis-Hastings steps (e.g. Madigan et al., 1995), Gibbs samplers (e.g. George and McCulloch, 1993; Dellaportas et al., 2002) and "reversible jump" updates (e.g. Green, 1995) are used in order to obtain a representative sample from the posterior model distribution. However, the effectiveness of MCMC methods depends heavily on a sensible choice of the proposal distributions being used. Therefore, such methods may suffer from bad mixing, resulting in a slow exploration of the model space, especially when the number of covariates is large. Moreover, tuning of the proposal distribution is often only feasible after manual "pilot" runs of the algorithm.
Adaptive MCMC methods aim to address these issues by updating the proposal parameters "on the fly" during a single run of the algorithm, so that the proposal distribution automatically adjusts according to the currently available information. Recently, a number of different adaptive MCMC algorithms have been proposed in the Bayesian variable selection context, see e.g. Nott and Kohn (2005), Lamnisos et al. (2013), Ji and Schmidler (2013), Griffin et al. (2014), Griffin et al. (2021) and Wan and Griffin (2021). In this work we propose an alternative, simple and efficient adaptive independent Metropolis-Hastings algorithm for Bayesian variable selection, called the Metropolized Adaptive Subspace (MAdaSub) algorithm, and compare it to existing adaptive MCMC algorithms. In MAdaSub the individual proposal probabilities of the explanatory variables are sequentially adapted after each iteration. The employed updating scheme is inspired by the AdaSub method introduced in Staerk et al. (2021) and can itself be motivated in a Bayesian way, such that the individual proposal probabilities finally converge to the true respective posterior inclusion probabilities. In the limit, the algorithm can be viewed as a simple Metropolis-Hastings sampler using a product of independent Bernoulli proposals which is the closest to the unknown target distribution in terms of Kullback-Leibler divergence (among the distributions of independent Bernoulli form).
The paper is structured as follows. The considered setting of Bayesian variable selection in generalized linear models (GLMs) is briefly described in Section 2. The MAdaSub algorithm is motivated and introduced in Section 3. By making use of general results obtained by Roberts and Rosenthal (2007), it is shown that the MAdaSub algorithm is ergodic despite its continuing adaptation, i.e. that "in the limit" it samples from the targeted posterior model distribution (see Theorem 1). Alternative adaptive approaches are also briefly discussed and conceptually compared to the newly proposed algorithm. In Section 4, a parallel version of MAdaSub is presented where the proposal probabilities can be adapted using the information from all available chains, without affecting the ergodicity of the algorithm (see Theorem 3). Detailed proofs of the theoretical results of Sections 3 and 4 can be found in the Supplement to this paper. The adaptive behaviour of MAdaSub and the choice of its tuning parameters are illustrated via low- and high-dimensional simulated data applications in Section 5, emphasizing that the speed of convergence to the targeted posterior depends on an appropriate choice of these parameters. In Section 6 various real data applications demonstrate that MAdaSub provides an efficient and stable way for sampling from high-dimensional posterior model distributions. The paper concludes with a discussion in Section 7. An R implementation of MAdaSub is available at https://github.com/chstaerk/MAdaSub.

The setting
In this work we consider variable selection in univariate generalized linear models (GLMs), where the response variable Y is modelled in terms of p possible explanatory variables X_1, . . . , X_p. More precisely, for a sample of size n, the components of the response vector Y = (Y_1, . . . , Y_n)^T are assumed to be independent, with each of them having a distribution from a fixed exponential dispersion family with
g(E(Y_i | X_{i,*})) = µ + Σ_{j=1}^{p} β_j X_{i,j}, i = 1, . . . , n, (1)
where g is a (fixed) link function, µ ∈ R is the intercept and β = (β_1, . . . , β_p)^T ∈ R^p is the vector of regression coefficients. Here, X = (X_{i,j}) ∈ R^{n×p} is the design matrix; its i-th row X_{i,*} corresponds to the i-th observation and its j-th column X_{*,j} ≡ X_j corresponds to the values of the j-th predictor. For a subset S ⊆ {1, . . . , p}, the model induced by S is defined by a GLM of the form (1) but with design matrix X_S ∈ R^{n×|S|} in place of X ∈ R^{n×p} and corresponding vector of coefficients β_S ∈ R^{|S|}, where X_S denotes the submatrix of the original design matrix X containing only the columns with indices in S.
For brevity, we often simply refer to the model S. Without further notice, we assume that an intercept µ is always included in the corresponding GLM with design matrix X_S. We denote by P = {1, . . . , p} the index set of the covariates and by M the space of all considered models S ⊆ P.
In a fully Bayesian approach we assign prior probabilities π(S) to each of the considered models S ∈ M, as well as priors π(µ, ψ, β_S | S) for the parameters of each model S ∈ M, where ψ denotes a possibly present dispersion parameter (e.g. the variance in a normal linear model). After observing data D = (X, y), with X ∈ R^{n×p} and y ∈ R^n, the posterior model probabilities satisfy
π(S | D) ∝ π(y | X, S) π(S), S ∈ M,
where
π(y | X, S) = ∫ f(y | X, S, µ, ψ, β_S) π(µ, ψ, β_S | S) dµ dψ dβ_S
is the marginal likelihood of the data y under model S, and f(y | X, S, µ, ψ, β_S) denotes the likelihood of the data y under model S given the parameter values µ, ψ, β_S and the values of the explanatory variables X. Note that the marginal likelihood π(y | X, S) is generally only available in closed form when conjugate priors are used.
Remark 2.1. A prominent example in normal linear models is a conjugate prior structure, where the prior on the variance ψ = σ² is given by Jeffreys prior (independent of the model S) and the prior on the vector of coefficients β_S in model S ∈ M is given by a multivariate normal distribution, i.e.
β_S | S, σ² ~ N_{|S|}(ϑ_S, σ² g W_S), π(σ²) ∝ 1/σ²,
where ϑ_S ∈ R^{|S|}, g > 0 and W_S ∈ R^{|S|×|S|} are hyperparameters. After centering each of the covariates X_j, j ∈ P, the improper prior π(µ) ∝ 1 is a common choice for the intercept µ (again, independent of the model S). With no specific prior information, the prior mean of β_S can be set to the zero vector (ϑ_S = 0). The matrix W_S is often chosen to be the identity matrix I_{|S|} of dimension |S| or to be W_S = (X_S^T X_S)^{−1}, yielding Zellner's g-prior (Zellner, 1986). The first choice corresponds to Ridge Regression and implies prior independence of the regression coefficients, while the second choice with g = n corresponds to a unit information prior. In case no specific prior information is available about the possible regressors, a natural choice for the model prior is an independent Bernoulli prior of the form
π(S) = ω^{|S|} (1 − ω)^{p−|S|}, S ∈ M,
where ω = π(j ∈ S) is the prior probability that variable X_j is included in the model, for all j ∈ P. One can either set the prior inclusion probability ω to some fixed value or consider an additional hyperprior for ω, with the latter option yielding more flexibility. A convenient choice is the (conjugate) beta prior ω ~ Be(a_ω, b_ω), where a_ω > 0 and b_ω > 0 can be chosen in order to reflect the prior expectation and prior variance of the model size s = |S|, S ∈ M (see Kohn et al., 2001 for details). In practice, one often imposes an a-priori upper bound s_max on the model size (with s_max ≤ n) by setting π(S) = 0 for |S| > s_max (cf. Liang et al., 2013; Rossell, 2021), while for fixed control variables X_j one can enforce the inclusion of such variables by setting π(j ∈ S) = 1.
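Under the g-prior structure of Remark 2.1 (zero prior mean ϑ_S = 0, W_S = (X_S^T X_S)^{−1}, Jeffreys prior on σ² and a flat prior on the intercept after centering), the Bayes factor against the null model has the well-known closed form BF(S) = (1 + g)^{(n−1−|S|)/2} [1 + g(1 − R_S²)]^{−(n−1)/2}, where R_S² is the coefficient of determination of the least-squares fit of model S. The following Python sketch is purely illustrative (the function name and interface are ours, not the authors' R implementation):

```python
import numpy as np

def log_marginal_gprior(y, X, S, g):
    """Log Bayes factor of model S versus the null (intercept-only) model
    under Zellner's g-prior with Jeffreys prior on the error variance."""
    n = len(y)
    yc = y - y.mean()                      # centering absorbs the flat intercept prior
    if len(S) == 0:
        return 0.0                         # Bayes factor of the null model vs. itself
    Xs = X[:, S] - X[:, S].mean(axis=0)    # centered covariates of model S
    beta_hat, *_ = np.linalg.lstsq(Xs, yc, rcond=None)
    rss = np.sum((yc - Xs @ beta_hat) ** 2)
    r2 = 1.0 - rss / np.sum(yc ** 2)       # R^2 of the least-squares fit
    k = len(S)
    return 0.5 * (n - 1 - k) * np.log(1 + g) \
        - 0.5 * (n - 1) * np.log(1 + g * (1 - r2))
```

Combined with a model prior π(S), this gives log π(S | D) up to an additive constant, which is all that Metropolis-Hastings acceptance ratios require.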
In the general non-conjugate case the marginal likelihood is not readily computable and numerical methods may be used for deriving an approximation to the marginal likelihood.
Laplace's method yields an asymptotic analytic approximation to the marginal likelihood (Kass and Raftery, 1995). Similarly, different information criteria like the Bayesian Information Criterion (BIC, Schwarz, 1978) or the Extended Bayesian Information Criterion (EBIC, Chen and Chen, 2008) can be used directly as asymptotic approximations to fully Bayesian posterior model probabilities under suitable choices of model priors. Under a uniform model prior, i.e. π(S) = 1/2^p for all S ∈ M, the BIC can be derived as an approximation to −2 log(BF(S)) = −2 log(PO(S)), where BF(S) = π(y | X, S)/π(y | X, ∅) denotes the Bayes factor of model S ∈ M versus the null model ∅ ∈ M and PO(S) denotes the corresponding posterior odds (Schwarz, 1978; Kass and Wasserman, 1995). In a high-dimensional but sparse situation, in which only a few of the many possible predictors contribute substantially to the response, a uniform prior on the model space is a naive choice since it induces severe overfitting. Therefore, Chen and Chen (2008) propose the prior
π(S) ∝ C(p, |S|)^{−γ}, S ∈ M, (6)
where γ ∈ [0, 1] is an additional parameter and C(p, s) denotes the binomial coefficient. If γ = 1, then π(S) = (p + 1)^{−1} C(p, |S|)^{−1}, so the prior gives equal probability to each model size, and equal probability to each model of the same size; note that this prior also coincides with the beta-binomial model prior discussed above when setting a_ω = b_ω = 1, providing automatic multiplicity correction (Scott and Berger, 2010).
If γ = 0, then we obtain the uniform prior used in the original BIC. Similarly to the derivation of the BIC, one asymptotically obtains the EBIC with parameter γ ∈ [0, 1] as
EBIC_γ(S) = −2 log f(y | X, S, µ̂_S, ψ̂_S, β̂_S) + |S| log(n) + 2γ log C(p, |S|), (7)
where f(y | X, S, µ̂_S, ψ̂_S, β̂_S) denotes the maximized likelihood under the model S ∈ M (compare Chen and Chen, 2012). Under the model prior (6) and a unit-information prior on the regression coefficients for each model S ∈ M, one can asymptotically approximate the model posterior by
π(S | D) ≈ exp(−EBIC_γ(S)/2) / Σ_{V∈M} exp(−EBIC_γ(V)/2). (8)
In this work we consider situations where the marginal likelihood π(y | X, S) is available in closed form due to the use of conjugate priors (see Remark 2.1) or where an approximation to the posterior π(S | D) is used (e.g. via equation (8) with the EBIC or any other ℓ0-type criteria such as the risk inflation criterion, cf. Foster and George, 1994; Rossell, 2021).
This assumption allows one to focus on the essential part of efficient sampling in very large model spaces, avoiding challenging technicalities regarding sampling of model parameters in non-conjugate cases. It also facilitates empirical comparisons with other recent adaptive variable selection methods, which focus on conjugate priors (Zanella and Roberts, 2019; Griffin et al., 2021). Furthermore, conjugate priors such as the g-prior as well as normalized ℓ0-type selection criteria such as the EBIC in equation (8) have been shown to provide concentration of posterior model probabilities on the (Kullback-Leibler) optimal model under general conditions even in case of model misspecification (Rossell, 2021), as well as model selection consistency for the true model in GLMs without misspecification (Chen and Chen, 2012; Liang et al., 2013).
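The EBIC-based posterior approximation in equation (8) can be sketched as follows for a normal linear model, where −2 times the maximized log-likelihood equals n log(RSS/n) up to a model-independent constant that cancels in the normalization. This is an illustrative Python sketch with our own helper names, not the authors' implementation:

```python
import numpy as np
from math import lgamma, log

def ebic(y, X, S, gamma):
    """EBIC_gamma(S) for a normal linear model (up to a constant):
    n*log(RSS/n) + |S|*log(n) + 2*gamma*log(C(p, |S|))."""
    n, p = X.shape
    Xs = np.column_stack([np.ones(n)] + ([X[:, S]] if len(S) else []))
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    log_binom = lgamma(p + 1) - lgamma(len(S) + 1) - lgamma(p - len(S) + 1)
    return n * log(rss / n) + len(S) * log(n) + 2 * gamma * log_binom

def approx_posterior(y, X, models, gamma=1.0):
    """Approximate pi(S|D) over a candidate set of models via exp(-EBIC/2)."""
    e = np.array([ebic(y, X, S, gamma) for S in models])
    w = np.exp(-(e - e.min()) / 2)   # subtract the minimum for numerical stability
    return w / w.sum()
```

In practice the sum in equation (8) runs over all of M and is intractable for large p; the candidate-set version above only illustrates the normalization, while the MCMC methods discussed next avoid it entirely by working with unnormalized posterior probabilities.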

The MAdaSub algorithm
A simple way to sample from a given target distribution is to use an independent Metropolis-Hastings algorithm. Clearly, the efficiency of such an MCMC algorithm depends on the choice of the proposal distribution, which is in general not an easy task (see e.g. Rosenthal, 2011). In the ideal situation, the proposal distribution for an independence sampler should be the same as the target distribution π(S | D), leading to an independent sample from the target distribution with corresponding acceptance probability of one. Adaptive MCMC algorithms aim to sequentially update the proposal distribution during the algorithm based on the previous samples such that, in case of the independence sampler, the proposal becomes closer and closer to the target distribution as the MCMC sample grows (see e.g. Holden et al., 2009, Giordani and Kohn, 2010). However, especially in high-dimensional situations, it is crucial that both the adaptation of the proposal and sampling from the proposal can be carried out efficiently. For this reason, we restrict ourselves to proposal distributions of independent Bernoulli form: if S ∈ M is the current model, then the probability of proposing the model V ∈ M is given by
q(V; r) = Π_{j∈V} r_j Π_{j∉V} (1 − r_j), (9)
for some vector r = (r_1, . . . , r_p) ∈ (0, 1)^p of individual proposal probabilities.
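A single independent Metropolis-Hastings step with a Bernoulli-type proposal of the form (9) can be sketched as follows. This is illustrative Python (helper names are ours); `log_post` stands for any function returning log π(S | D) up to an additive constant, e.g. a log marginal likelihood plus log model prior:

```python
import numpy as np

def propose(r, rng):
    """Draw a model V from the independent Bernoulli proposal q(.; r):
    variable j is included independently with probability r_j."""
    return rng.random(len(r)) < r            # boolean inclusion vector

def log_q(V, r):
    """Log proposal probability of the model V under q(.; r)."""
    return np.sum(np.where(V, np.log(r), np.log(1.0 - r)))

def mh_step(S, r, log_post, rng):
    """One independence Metropolis-Hastings step targeting pi(S|D)."""
    V = propose(r, rng)
    # acceptance ratio: [pi(V) q(S; r)] / [pi(S) q(V; r)] on the log scale
    log_alpha = (log_post(V) + log_q(S, r)) - (log_post(S) + log_q(V, r))
    if np.log(rng.random()) < min(0.0, log_alpha):
        return V, True
    return S, False
```

Note that the proposal does not depend on the current model S, which is what makes this an independence sampler; the acceptance ratio corrects for the discrepancy between q(·; r) and the target.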

Serial version of the MAdaSub algorithm
The fundamental idea of the newly proposed MAdaSub algorithm (given below as Algorithm 1) is to sequentially update the individual proposal probabilities according to the currently "estimated" posterior inclusion probabilities. In more detail, after initializing the vector of proposal probabilities r^(0) = (r_1^(0), . . . , r_p^(0)), the individual proposal probabilities r_j^(t) of the variables X_j are updated after each iteration t of the algorithm, such that r_j^(t) finally converges to the actual posterior inclusion probability π_j = π(j ∈ S | D), as t → ∞ (see Corollary 2 below). Therefore, in the limit, we make use of the proposal
q(·; r*) with r*_j = π_j for j ∈ P, (10)
which is the closest distribution (in terms of Kullback-Leibler divergence) to the actual target π(S | D) among all distributions of independent Bernoulli form (9) (see Clyde et al., 2011). Note that the median probability model (Barbieri and Berger, 2004; Barbieri et al., 2021), defined by S_MPM = {j ∈ P : π_j ≥ 0.5}, has the largest probability under the limiting proposal (10) of MAdaSub, i.e. arg max_{V∈M} q(V; r*) = S_MPM. Thus, MAdaSub can be interpreted as an adaptive algorithm which aims to adjust the proposal so that models in the region of the median probability model are proposed with increasing probability.
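The claim that the median probability model maximizes the limiting proposal (10) follows because q(V; r) factorizes over coordinates, so each variable j is best included exactly when r_j ≥ 0.5. For small p this can be checked by exhaustive enumeration; the following Python sketch (helper names ours) does so:

```python
import numpy as np
from itertools import chain, combinations

def q_prob(V, r):
    """Probability of the model V (a set of indices) under q(.; r) of eq. (9)."""
    inc = np.zeros(len(r), dtype=bool)
    inc[list(V)] = True
    return float(np.prod(np.where(inc, r, 1.0 - r)))

def brute_force_mode(r):
    """Most probable model under q(.; r), found by enumerating all 2^p models."""
    p = len(r)
    models = chain.from_iterable(combinations(range(p), k) for k in range(p + 1))
    return set(max(models, key=lambda V: q_prob(V, r)))
```

For instance, with inclusion probabilities (0.9, 0.55, 0.4, 0.1) the brute-force mode is {0, 1}, i.e. exactly the variables with π_j ≥ 0.5.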
For j ∈ P, the concrete update of r_j^(t) after iteration t ∈ N is given by
r_j^(t) = (L_j r_j^(0) + Σ_{i=1}^{t} 1_{S^(i)}(j)) / (L_j + t), (11)
where, for j ∈ P, L_j > 0 are additional parameters controlling the adaptation rate of the algorithm and 1_{S^(i)} denotes the indicator function of the set S^(i). If j ∈ S^(t) (i.e. 1_{S^(t)}(j) = 1), then variable X_j is included in the sampled model in iteration t of the algorithm and the proposal probability r_j^(t) of X_j increases in the next iteration t + 1;
similarly, if j ∉ S^(t) (i.e. 1_{S^(t)}(j) = 0), then the proposal probability decreases. The additional "truncation" step 2 (a) in the MAdaSub algorithm ensures that the truncated individual proposal probabilities r̃_j^(t), j ∈ P, are always included in the compact interval I = [ε, 1 − ε], where ε ∈ (0, 0.5) is a pre-specified "precision" parameter. This adjustment simplifies the proof of the ergodicity of MAdaSub. Note that the mean size of the proposed model V from the proposal q(V; r̃) in equation (9) with r̃ ∈ [ε, 1 − ε]^p is at least E|V| ≥ ε·p; thus, in practice we recommend setting ε ≤ 1/p, so that models of small size, including the null model, can be proposed with sufficiently large probability. On the other hand, if ε is chosen to be very small, then the MAdaSub algorithm may take longer to converge in case the proposal probabilities of informative variables are close to the lower bound ε during the initial burn-in period of the algorithm. Simulations and real data applications show that the choice ε = 1/p works well in all considered situations (see Sections 5 and 6). The updating scheme of the individual proposal probabilities is inspired by the AdaSub method proposed in Staerk (2018) and Staerk et al. (2021) and can itself be motivated in a Bayesian way: since we do not know the true posterior inclusion probability π_j of variable X_j for j ∈ P, we place a beta prior on π_j with the parametrization
π_j ~ Be(L_j r_j^(0), L_j (1 − r_j^(0))), (12)
where r_j^(0) is the prior expectation of π_j and L_j > 0 controls the variance of π_j via
Var(π_j) = r_j^(0) (1 − r_j^(0)) / (L_j + 1). (13)
If L_j → 0, then Var(π_j) → r_j^(0) (1 − r_j^(0)), which is the variance of a Bernoulli random variable with mean r_j^(0). If L_j → ∞, then Var(π_j) → 0. Now, one might view the samples S^(1), . . . , S^(t) obtained after t iterations of MAdaSub as "new" data and interpret the information learned about π_j as t approximately independent Bernoulli trials, where j ∈ S^(i) corresponds to "success" and j ∉ S^(i) corresponds to "failure".
Then the (pseudo) posterior of π_j after iteration t of the algorithm is given by
π_j | S^(1), . . . , S^(t) ~ Be(L_j r_j^(0) + Σ_{i=1}^{t} 1_{S^(i)}(j), L_j (1 − r_j^(0)) + Σ_{i=1}^{t} (1 − 1_{S^(i)}(j))), (14)
with posterior expectation
E(π_j | S^(1), . . . , S^(t)) = (L_j r_j^(0) + Σ_{i=1}^{t} 1_{S^(i)}(j)) / (L_j + t) (15)
and posterior variance
Var(π_j | S^(1), . . . , S^(t)) = E(π_j | S^(1), . . . , S^(t)) (1 − E(π_j | S^(1), . . . , S^(t))) / (L_j + t + 1). (16)
The interpretation of r_j^(0) as the prior expectation for the posterior inclusion probability π_j motivates the choice of r_j^(0) = π(j ∈ S) as the actual prior inclusion probability of variable X_j. If no particular prior information about specific variables is available, but the prior expected model size is equal to q ∈ (0, p), then we recommend to set r_j^(0) = q/p and L = L_j = p for all j ∈ P, corresponding to the prior π_j ~ Be(q, p − q) in equation (12). In this particular situation, equation (15) reduces to
r_j^(t) = (q + Σ_{i=1}^{t} 1_{S^(i)}(j)) / (p + t). (17)
Even though it seems natural to choose the parameters r_j^(0) and L_j of MAdaSub as the respective prior quantities, this choice is not imperative. While the optimal choices of these parameters generally depend on the setting, various simulated and real data applications of MAdaSub indicate that choosing r_j^(0) = q/p with q ∈ [2, 10] and L_j ∈ [p/2, 2p] for j ∈ P yields a stable algorithm with good mixing in sparse high-dimensional set-ups, irrespective of the actual prior (see Sections 5 and 6). Furthermore, if one has already run and stopped the MAdaSub algorithm after a certain number of iterations T, then one can simply restart the algorithm with the already updated parameters r_j^(T) and L_j + T (compare equation (16)) as new starting values for the corresponding parameters.
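Putting the proposal, acceptance and adaptation steps together, a minimal Python sketch of the serial algorithm reads as follows. It is an illustration of Algorithm 1 with the recommended defaults r_j^(0) = q/p and L_j = p (not the authors' R implementation); `log_post` again denotes any log unnormalized posterior over models, encoded as boolean inclusion vectors:

```python
import numpy as np

def madasub(log_post, p, T, q=5.0, eps=None, seed=0):
    """Sketch of serial MAdaSub: an independence sampler whose Bernoulli
    proposal probabilities are adapted via the Bayesian update rule
    r_j^(t) = (L_j r_j^(0) + sum_i 1_{S^(i)}(j)) / (L_j + t),
    truncated to [eps, 1-eps] before proposing."""
    rng = np.random.default_rng(seed)
    eps = 1.0 / p if eps is None else eps     # recommended precision parameter
    L = float(p)                              # adaptation-rate parameter L_j = p
    r0 = np.full(p, q / p)                    # initial proposal probabilities q/p
    counts = np.zeros(p)                      # running sums of 1_{S^(i)}(j)
    S = np.zeros(p, dtype=bool)               # start from the null model
    samples = np.zeros((T, p), dtype=bool)
    for t in range(T):
        r = np.clip((L * r0 + counts) / (L + t), eps, 1.0 - eps)  # truncation step
        V = rng.random(p) < r                 # propose V ~ q(.; r)
        lq = lambda M: np.sum(np.where(M, np.log(r), np.log(1.0 - r)))
        log_alpha = log_post(V) + lq(S) - log_post(S) - lq(V)
        if np.log(rng.random()) < min(0.0, log_alpha):
            S = V                             # accept the proposed model
        counts += S                           # adapt using the sampled model S^(t)
        samples[t] = S
    r_final = (L * r0 + counts) / (L + T)     # converges to pi_j (Corollary 2)
    return samples, r_final
```

Note that the adaptation uses the accepted chain states, not the raw proposals, so the Metropolis-Hastings correction keeps the chain targeting π(S | D) at every stage of the adaptation.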
Using general results for adaptive MCMC algorithms by Roberts and Rosenthal (2007), we show that MAdaSub is ergodic despite its continuing adaptation.
The proof of Theorem 1 can be found in Section A of the Supplement, where it is shown that MAdaSub satisfies both the simultaneous uniform ergodicity condition and the diminishing adaptation condition (cf. Roberts and Rosenthal, 2007). As an immediate consequence of Theorem 1 we obtain the following important result.
Corollary 2. For all choices of r^(0) ∈ (0, 1)^p, L_j > 0 and ε ∈ (0, 0.5), the proposal probabilities r_j^(t) of the explanatory variables X_j in MAdaSub converge (in probability) to the respective posterior inclusion probabilities π_j = π(j ∈ S | D), i.e. for all j ∈ P it holds that r_j^(t) → π_j in probability as t → ∞.

Comparison to related adaptive approaches
In this section we conceptually compare the proposed MAdaSub algorithm (Algorithm 1) with other approaches for high-dimensional Bayesian variable selection, focusing on adaptive MCMC algorithms most closely related to the new algorithm (see Section D of the Supplement for details on further related methods).
In a pioneering work, Nott and Kohn (2005) propose an adaptive sampling algorithm for Bayesian variable selection based on a Metropolized Gibbs sampler, showing empirically that the adaptive algorithm outperforms different non-adaptive algorithms in terms of efficiency per iteration. However, since their approach requires the computation of inverses of estimated covariance matrices, it does not scale well to very high-dimensional settings. Recently, several variants and extensions of the original adaptive MCMC sampler of Nott and Kohn (2005) have been developed, including an adaptive Metropolis-Hastings algorithm by Lamnisos et al. (2013), where the expected number of variables to be changed by the proposal is adapted during the algorithm. Zanella and Roberts (2019) propose a tempered Gibbs sampling algorithm with adaptive choices of components to be updated in each iteration. Furthermore, different individual adaptation algorithms have been developed in Griffin et al. (2014) as well as in the follow-up works of Griffin et al. (2021) and Wan and Griffin (2021), which are closely related to the proposed MAdaSub algorithm. These strategies are based on adaptive Metropolis-Hastings algorithms, where the employed proposal distributions are of the following form: if S ∈ M is the current model, then the probability of proposing the model V ∈ M is given by
q̃(V; S, η) = Π_{j∉S} A_j^{1_V(j)} (1 − A_j)^{1 − 1_V(j)} × Π_{j∈S} D_j^{1 − 1_V(j)} (1 − D_j)^{1_V(j)}, (18)
where η = (A_1, . . . , A_p, D_1, . . . , D_p)^T ∈ (0, 1)^{2p} is a vector of tuning parameters with the following interpretation: for j ∈ P, A_j is the probability of adding variable X_j if it is not included in the current model S, and D_j is the probability of deleting variable X_j if it is included in the current model S. An important difference is that the adaptation strategies in Griffin et al.
(2021) specifically aim to guard against low acceptance rates of the proposal (18), while MAdaSub aims at obtaining a global independent proposal with the largest possible acceptance rate, focusing on regions close to the median probability model. A further related approach is based on Thompson sampling for multi-armed bandits in reinforcement learning, which has recently been investigated in the context of non-parametric Bayesian variable selection (Liu and Ročková, 2021). In contrast to MAdaSub, Thompson Variable Selection (TVS) does not provide samples from the posterior distribution but is designed to minimize the regret (i.e. the difference between optimal and actual rewards); as a consequence, the sampling probabilities in TVS are not guaranteed to converge to the posterior inclusion probabilities.

Parallelization of the MAdaSub algorithm
In this section we present a parallel version of the MAdaSub algorithm which aims at increasing the computational efficiency and accelerating the convergence of the chains. The simplest approach to parallelization would be to independently run the MAdaSub algorithm in parallel on each of K ∈ N different workers, yielding K individual chains which, in the limit, sample from the posterior model distribution (see Theorem 1). However, it is desirable that the information learned about the adaptive parameters can be shared efficiently between the different chains, so that the convergence of the adaptive parameters to their optimal values can be accelerated, leading to a faster convergence of the chains to their common limiting distribution.
We propose a parallel version of MAdaSub, where the workers sample individual MAdaSub chains in parallel, but the acquired information is exchanged periodically between the chains and the adaptive proposal probabilities are updated jointly (see Algorithm 2 in Section B of the Supplement for full algorithmic details). More specifically, let S^(k,1), . . . , S^(k,T) denote the models sampled by MAdaSub (see Algorithm 1) for the first T iterations on worker k, for k ∈ {1, . . . , K}. Then, for each worker k ∈ {1, . . . , K}, we define the jointly updated proposal probabilities after the first round (m = 1) of T iterations by
r̄_j^(k,1) = (L_j^(k) r_j^(k,0) + Σ_{l=1}^{K} Σ_{i=1}^{T} 1_{S^(l,i)}(j)) / (L_j^(k) + K T), (19)
where r_j^(k,0) denotes the initial proposal probability for variable X_j and L_j^(k) the corresponding adaptation parameter on worker k (both can be different across the chains).
After the joint update, each MAdaSub chain is resumed (with r̄_j^(k,1) as initial proposal probabilities and L_j^(k) + T K as initial prior variance parameters for j ∈ P) and is run independently on each of the workers for T additional iterations in a second round (m = 2); then the proposal probabilities are updated jointly again to r̄_j^(k,2), and so on (up to m = R rounds in Algorithm 2 of the Supplement). The joint updates of the proposal probabilities after m ∈ N rounds of T iterations are given by
r̄_j^(k,m) = (L_j^(k) r_j^(k,0) + Σ_{l=1}^{K} Σ_{i=1}^{mT} 1_{S^(l,i)}(j)) / (L_j^(k) + m K T). (20)
Similarly to the serial version of MAdaSub, the adaptive learning of its parallel version can be naturally motivated in a Bayesian way: each worker k = 1, . . . , K can be thought of as an individual subject continuously updating its prior belief about the true posterior inclusion probability π_j of variable X_j through new information from its individual chain; additionally, after a period of T iterations the subject also updates its belief by obtaining new information from the K − 1 other subjects. If the (possibly different) priors
π_j ~ Be(L_j^(k) r_j^(k,0), L_j^(k) (1 − r_j^(k,0)))
are employed, where r_j^(k,0) is the prior expectation of subject k about π_j and L_j^(k) > 0 controls its prior variance, then the (pseudo) posterior of subject k about π_j after m rounds of T iterations of the parallel MAdaSub algorithm is given by
π_j | {S^(l,i) : l = 1, . . . , K, i = 1, . . . , mT} ~ Be(L_j^(k) r_j^(k,0) + Σ_{l,i} 1_{S^(l,i)}(j), L_j^(k) (1 − r_j^(k,0)) + Σ_{l,i} (1 − 1_{S^(l,i)}(j)))
(compare to equation (14)), with posterior expectation r̄_j^(k,m) (compare to equation (15)), corresponding to the joint update in equation (20).
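The pooled update (20) itself is a one-liner once each chain reports its running inclusion counts Σ_i 1_{S^(k,i)}(j). The following Python sketch (the interface is ours) illustrates how the counts of all K chains are combined:

```python
import numpy as np

def joint_update(r0, L, counts_per_chain, m, T):
    """Jointly updated proposal probabilities after m rounds of T iterations,
    pooling the inclusion counts of all K chains (the update of eq. (20))."""
    counts_per_chain = np.asarray(counts_per_chain, dtype=float)  # shape (K, p)
    K = counts_per_chain.shape[0]
    pooled = counts_per_chain.sum(axis=0)   # sum over chains l and iterations i
    return (L * np.asarray(r0) + pooled) / (L + m * K * T)
```

Because only p running counts per chain have to be communicated at the end of each round, the synchronization cost is small compared to the posterior evaluations performed within each round.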
Although the individual chains in the parallel MAdaSub algorithm make use of the information from all the other chains in order to update the proposal parameters, the ergodicity of the chains is not affected.
Corollary 4. For each worker k ∈ {1, . . . , K} and all choices of r^(k,0) ∈ (0, 1)^p, L_j^(k) > 0 for j ∈ P, and ε ∈ (0, 0.5), the proposal probabilities r̄_j^(k,m) of the explanatory variables X_j converge (in probability) to the respective posterior inclusion probabilities π_j = π(j ∈ S | D), i.e. for all j ∈ P and k = 1, . . . , K it holds that r̄_j^(k,m) → π_j in probability as m → ∞.
Thus, the same convergence results hold for the parallel version as for the serial version of MAdaSub. The benefit of the parallel algorithm is that the convergence of the proposal probabilities to the posterior inclusion probabilities can be accelerated via the exchange of information between the parallel chains, so that the MCMC chains can converge faster to the full posterior distribution. There is a practical trade-off between the effectiveness of the joint updates of the proposal probabilities and the efficiency regarding the communication between the different chains. If the number of rounds R is chosen to be small with a large number of iterations T per round, the available information from the multiple chains is not fully utilized during the algorithm; however, if the number of rounds R is chosen to be large with a small number of iterations T per round, then the computational cost of communication between the chains increases and may outweigh the benefit of the accelerated convergence of the proposal probabilities.
If T max denotes the maximum number of iterations, we observe that choosing the number of rounds R ∈ [10, 100] with T = T max /R iterations per round works well in practice (see Sections 5 and 6 as well as Table G.4 of the Supplement).

Illustrative example
We first illustrate the adaptive behaviour of the serial MAdaSub algorithm (Algorithm 1) in a relatively low-dimensional setting. In particular, we consider an illustrative simulated dataset D = (X, y) with sample size n = 60 and p = 20 explanatory variables, where the covariates are generated with a Toeplitz correlation structure and the responses are simulated independently as Y_i ~ N(X_{i,*} β_0, 1), i = 1, . . . , n, for a fixed true coefficient vector β_0 ∈ R^p. We employ the g-prior with g = n and an independent Bernoulli model prior with inclusion probability ω = 0.5, resulting in a uniform prior over the model space (see Remark 2.1). In the MAdaSub algorithm we set r_j^(0) = 1/2 for j ∈ P, i.e. we use the prior inclusion probabilities as initial proposal probabilities. We first consider the choice L_j = p (for j ∈ P) for the variance parameters of MAdaSub, corresponding to equation (17). As a benchmark we consider the non-adaptive independence sampler with proposal probabilities r_j^(t) = π(j ∈ S | D), corresponding to the targeted proposal distribution, which is, as stated above, the closest independent Bernoulli proposal to the target π(· | D) in terms of Kullback-Leibler divergence (Clyde et al., 2011).
Note that the non-adaptive independence sampler with posterior inclusion probabilities as proposal probabilities (r_j^(t) = π(j ∈ S | D)) is only considered as a benchmark and cannot be used in practice, since the true posterior probabilities are initially unknown and are to be estimated by the MCMC algorithms. Furthermore, we also present comparisons with a standard local "Markov chain Monte Carlo model composition" (MC^3) algorithm (Madigan et al., 1995), which in each iteration proposes to delete or add a single variable to the current model.

[Figure 1: sampled quantities (black) and sizes |S^(t)| of the sampled models (red) along the first 5,000 iterations t, shown for the non-adaptive sampler with prior marginals as proposal probabilities, for MAdaSub (with L_j = p), for the non-adaptive sampler with posterior marginals as proposal probabilities and for the local add-delete MC^3 sampler (from top to bottom).]

For the non-adaptive sampler with prior marginals as proposal probabilities, the acceptance rate quickly drops close to zero. On the other hand, the non-adaptive sampler with posterior marginals as proposal probabilities leads to fast mixing with a corresponding acceptance rate of approximately 0.54.
Even though the MAdaSub algorithm starts with exactly the same "initial configuration" as the non-adaptive sampler with prior marginals, it quickly adjusts the proposal probabilities accordingly, so that the resulting acceptance rate approaches the target value of 0.54 from the non-adaptive sampler with posterior marginals. In particular, when inspecting the evolution of the sampled model sizes in Figure 1, the MAdaSub algorithm is very difficult to distinguish from the sampler with posterior marginals after a very short burn-in period (see also Figure E.1 of the Supplement).
To illustrate the behaviour of the MAdaSub algorithm with respect to the variance parameters L_j, in addition to the choice L_j = p we examine two further runs of MAdaSub with the same specifications as before, but with L_j = p/n and with L_j = 100p, respectively. Figure 2 indicates that the original choice L_j = p is favourable, yielding a fast and "sustainable" increase of the acceptance rate, as the proposal probabilities r_j^(t) are continuously adjusted towards the current empirical inclusion frequencies f_j^(t) (see equation (11)). An automatic stopping rule may be based on the criterion max_{j∈P} |r_j^(t) − f_j^(t)| ≤ δ for a small threshold δ > 0. Even when automatic stopping may be applied, we additionally recommend to investigate the convergence of the MAdaSub algorithm via the diagnostic plots presented in this section and in Section E of the Supplement.

[Figure 2: Evolution of the acceptance rates for the non-adaptive independence sampler with prior marginals (blue) and posterior marginals (red) as proposal probabilities, for the add-delete MC^3 sampler (gray), as well as for MAdaSub with L_j = p (black), L_j = p/n (orange) and L_j = 100p (purple) for j ∈ P.]
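The adaptation towards the empirical inclusion frequencies and an automatic stopping check can be sketched as follows (our reading of the update: r_j^(0) acts as a prior guess with weight L_j in a Beta-Bernoulli posterior-mean update, truncated to [ε, 1 − ε]; treat the exact form as an assumption and check it against equation (11) of the paper):

```python
import numpy as np

def adapt_proposal(r0, counts, t, L, eps):
    """Adaptation sketch: Beta-Bernoulli posterior-mean update that
    shrinks the initial proposal probabilities r0 (prior weight L)
    towards the empirical inclusion frequencies counts / t, truncated
    to [eps, 1 - eps] to keep the sampler ergodic."""
    r = (L * np.asarray(r0) + np.asarray(counts)) / (L + t)
    return np.clip(r, eps, 1.0 - eps)

def stop_adapting(r, counts, t, delta=0.01):
    """Automatic stopping check: proposal probabilities are uniformly
    within delta of the empirical inclusion frequencies."""
    f = np.asarray(counts) / t
    return bool(np.max(np.abs(np.asarray(r) - f)) <= delta)
```

Large L_j slow the adaptation down (the prior guess dominates longer), while very small L_j let early, noisy inclusion frequencies dominate; this matches the qualitative behaviour of the L_j = 100p and L_j = p/n runs above.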

Low-dimensional simulation study
In this simulation study we further investigate the performance of the serial MAdaSub algorithm in relation to local non-adaptive and adaptive algorithms. In particular, we analyse how the algorithms are affected by high correlations between the covariates.
We consider a similar low-dimensional setting as in the illustrative data application with p = 20 covariates and sample size n = 60. To evaluate the performance in a variety of different data situations, for each simulated dataset the number s_0 of informative variables is randomly drawn from {0, 1, . . . , 10} and the true active set S_0 ⊆ P of size |S_0| = s_0 is randomly selected from the full set of covariates P = {1, . . . , p}; then, for each j ∈ S_0, the j-th component β_{0,j} of the true coefficient vector β_0 ∈ R^p is simulated from a uniform distribution β_{0,j} ~ U(−2, 2). As before, the covariates are simulated using a Toeplitz correlation structure, while the response is simulated from a normal linear model with error variance σ^2 = 1. We consider three different correlation settings by varying the correlation ρ between adjacent covariates in the Toeplitz structure: a low-correlated setting with ρ = 0.3, a highly-correlated setting with ρ = 0.9 and a very highly-correlated setting with ρ = 0.99. For each of the three settings, 200 different datasets are simulated as described above; in each case, we employ a g-prior with g = n on the regression coefficients and a uniform prior on the model space.
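The simulation design above can be sketched as follows (our helper code, assuming the Toeplitz structure means Corr(X_i, X_j) = ρ^{|i−j|}; the authors' exact generator may differ):

```python
import numpy as np

def simulate_data(n, p, rho, s0_max=10, sigma=1.0, seed=0):
    """Simulate one dataset of the study design described above:
    Toeplitz-correlated Gaussian covariates (assumed AR(1)-type,
    Corr = rho^|i-j|), a randomly drawn active set S0 of size at most
    s0_max, and U(-2, 2) coefficients on the active covariates."""
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    s0 = int(rng.integers(0, s0_max + 1))            # number of signals
    S0 = rng.choice(p, size=s0, replace=False)       # true active set
    beta = np.zeros(p)
    beta[S0] = rng.uniform(-2.0, 2.0, size=s0)
    y = X @ beta + sigma * rng.standard_normal(n)    # normal linear model
    return X, y, np.sort(S0), beta
```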
For each simulated dataset we apply MAdaSub with 20,000 iterations, using L_j = p for j ∈ P and ε = 1/p. In order to investigate the influence of the initial proposal probabilities r_j^(0), we additionally apply MAdaSub with initial proposal probabilities informed by the marginal posterior odds PO_j, setting r_j^(0) to the corresponding marginal posterior inclusion probabilities truncated to the interval [0.1, 0.9] to prevent the premature focus of the algorithm on some covariates (if π_j^marg ≈ 1) or the avoidance of other covariates (if π_j^marg ≈ 0).
Here, the marginal posterior odds PO_j are crude approximations to the true posterior odds, derived under the assumption of posterior independence of variable inclusion. The local MC^3 algorithm (Madigan et al., 1995) is applied as before, as well as with additional swap moves to potentially improve the mixing (as in Griffin et al., 2021). Using the R-package scaleBVS (Zanella and Cabezas Gonzalez, 2020), we apply the adaptive weighted tempered Gibbs sampling algorithm of Zanella and Roberts (2019) to obtain (weighted) frequency estimates (as for the other algorithms) and Rao-Blackwellized estimates of posterior inclusion probabilities (PIPs). Exact PIPs are again derived using the BAS algorithm. For each sampler we record the number of iterations until convergence of the estimated PIPs to the true PIPs, where PIP convergence is defined to occur at the smallest iteration t_c for which the maximal absolute difference max_{j∈P} |f_j^(t) − π_j| between estimated and true PIPs falls below a given threshold; if t_c ≥ 20,000, then the number of iterations for convergence is displayed as 20,000 in the corresponding figures.
Similar to the low-dimensional simulations, covariates are generated from a Toeplitz correlation structure with ρ = 0.6 and the response is simulated independently via y_i ~ N(X_{i,*} β_0, σ^2).
are the same for all chains k. For the parallel version, we consider different random initializations of proposal probabilities r_j^(k,0) = q^(k)/p, j ∈ P, with q^(k) ~ U(2, 10), and variance parameters L_j^(k) = L^(k), j ∈ P, with L^(k) ~ U(p/2, 2p). Detailed results are reported in Table G.1 of the Supplement, comparing the performance of MAdaSub also with the adaptive approaches in Griffin et al. (2021). Table 1 shows that in all considered settings the median estimated time-standardized effective sample size for both MAdaSub versions is several orders of magnitude larger than for the MC^3 algorithm. For low SNRs (e.g. SNR = 0.5), both MAdaSub versions tend to show larger improvements compared to the MC^3 algorithm than for high SNRs (e.g. SNR = 3). Note that for high SNRs, the posterior distribution tends to be more concentrated around the true model S_0 = {1, . . . , 10}, so that local proposals of the add-delete-swap MC^3 algorithm may also be reasonable. On the other hand, for low SNR, the posterior tends to be less concentrated, so that global moves of MAdaSub have a larger potential to improve the mixing compared to the MC^3 algorithm. The acceptance rates of MAdaSub are also larger in small SNR scenarios, as the posterior model distribution tends to be better approximated by independent Bernoulli proposals. However, in all considered settings, the acceptance rates of MAdaSub are reasonably large with median acceptance rates between 5.1% and 54.2% (see Table 1) and are considerably larger compared to the MC^3 algorithm with median acceptance rates between 0.6% and 5.8% (detailed results not shown).
For low SNRs (SNR ≤ 1), serial updating in MAdaSub tends to yield larger (for p = 500) or similar (for p = 5000) time-standardized effective sample sizes compared to parallel updating, as both versions appear to have converged to stationarity with similar acceptance rates, while the parallel version tends to yield larger computation times as a result of communicating chains. For large SNRs (SNR ≥ 2), MAdaSub with parallel updating performs favourably, since the proposal probabilities tend to converge faster than with serial updating, which leads to considerably larger acceptance rates and outweighs the computational cost of communicating chains. Previous results for the same simulation set-up indicate that the two alternative individual adaptation algorithms of Griffin et al. (2021) tend to yield the largest improvements compared to the MC^3 algorithm for higher SNR (particularly for SNR = 2). The proposal (18) of these algorithms allows for larger moves than the add-delete-swap proposal in MC^3, but, in contrast to the independence proposal of MAdaSub, the proposal (18) still locally depends on the previously sampled model. Overall, MAdaSub shows a competitive performance compared to the adaptive algorithms of Griffin et al. (2021), with advantages of MAdaSub in low SNR settings and advantages of the adaptive algorithms of Griffin et al. (2021) in high SNR settings (see Table G.1 of the Supplement).
6 Real data applications

Tecator data
We first examine the Tecator dataset which has already been investigated in Griffin and Brown (2010), Lamnisos et al. (2013) and Griffin et al. (2021). The data have been recorded by Borggaard and Thodberg (1992); details on computation times can be found in Section H of the Supplement. As the covariates represent 100 channels of the near-infrared absorbance spectrum, adjacent covariates are highly correlated and it is not surprising that they have similar posterior inclusion probabilities.
If one is interested in selecting a final single model, the median probability model (which includes all variables with posterior inclusion probability greater than 0.5, see Barbieri and Berger, 2004) might not be the best choice in this particular situation, since then only variables corresponding to the "global mode" and no variables from the two other "local modes" in Figure 4 are selected. Alternatively, one may choose one or two variables from each of the three "local modes" or make use of Bayesian model averaging (Raftery et al., 1997) for predictive inference.
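For instance, given estimated PIPs, the median probability model is obtained by simple thresholding (a generic sketch, not the authors' code):

```python
import numpy as np

def median_probability_model(pip, threshold=0.5):
    """Median probability model (Barbieri and Berger, 2004): select all
    variables whose posterior inclusion probability exceeds 0.5.
    Returns the indices of the selected variables."""
    return np.flatnonzero(np.asarray(pip) > threshold)
```

As discussed above, for strongly multimodal PIP profiles (such as the three "local modes" in Figure 4) this rule can discard entire groups of correlated, jointly relevant channels, which motivates the model-averaging alternative.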

PCR and Leukemia data
We illustrate the effectiveness of MAdaSub for two further high-dimensional datasets. In particular, we consider the polymerase chain reaction (PCR) dataset of Lan et al. (2006) with p = 22,575 explanatory variables (expression levels of genes), sample size n = 60 (mice) and continuous response data (the dataset is available in JRSS(B) Datasets Vol. 77(5), Song and Liang, 2015). Furthermore, we consider the leukemia dataset of Golub et al. (1999) with 6817 gene expression measurements of n = 72 patients and binary response data (the dataset can be loaded via the R-package golubEsets, Golub, 2017). For the PCR dataset we face the problem of variable selection in a linear regression framework, while for the leukemia dataset we consider variable selection in a logistic regression framework. We have preprocessed the leukemia dataset as described in Dudoit et al. (2002), resulting in a restricted design matrix with p = 3571 columns (genes). Furthermore, in both datasets we have mean-centered the columns of the design matrix after the initial preprocessing.
Here we adopt the posterior approximation induced by EBIC_γ with γ = 1 (see equation (8)), corresponding to a beta-binomial model prior with a_ω = b_ω = 1 as parameters in the beta distribution (see Section 2). For both datasets we run 25 independent serial MAdaSub chains with 1,000,000 iterations and 25 parallel MAdaSub chains exchanging information after each of R = 50 rounds of T = 20,000 iterations (yielding also 1,000,000 iterations for each parallel chain). For each serial and parallel chain k = 1, . . . , 50, we set ε = 1/p and randomly initialize the proposal probabilities r_j^(k,0) = q^(k)/p, j ∈ P, with q^(k) ~ U(2, 5), and the variance parameters L_j^(k) = L^(k), j ∈ P, with L^(k) ~ U(p/2, 2p). For the leukemia dataset we make use of a fast C++ implementation for ML-estimation in logistic regression models via a limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm, which is available in the R-package RcppNumerical (Qiu et al., 2016). For both datasets, the 50 MAdaSub chains are run in parallel on a computer cluster with 50 CPUs, yielding overall computation times of 2,836 seconds for the PCR data (2,310 seconds for a single chain) and 1,402 seconds for the leukemia data (995 seconds for a single chain). Note that in very high-dimensional settings such as for the PCR data (with p = 22,575), the classical MC^3 algorithm (Madigan et al., 1995) does not yield stable estimates due to slow mixing (cf. Griffin et al., 2021), while the BAS algorithm (Clyde, 2017) using sampling without replacement is computationally intractable. Further results in Griffin et al.

[Figure: Results of 25 serial MAdaSub chains (Algorithm 1, top) and 25 parallel MAdaSub chains exchanging information after every round of 20,000 iterations (Algorithm 2, bottom). Bold lines represent median frequencies with 5%- and 95%-quantiles (shaded area) over the chains within each round, for most informative variables X_j (with final estimate f_j ≥ 0.05 for at least one chain).]
(2021) show that several competing adaptive algorithms, including the sequential Monte Carlo algorithms of Schäfer and Chopin (2013), face difficulties in such very high-dimensional settings.

The reliable estimation of posterior inclusion probabilities is particularly important for Bayesian inference, since the median probability model (MPM), including all variables with posterior inclusion probability larger than 0.5, has been shown to yield optimal predictions for uncorrelated covariates (Barbieri and Berger, 2004) and also a favourable performance for correlated designs (Barbieri et al., 2021). Since MAdaSub is based on adaptive independent proposal distributions, in each iteration of the algorithm the proposed model is (almost) independent of the current model, so that "distant" moves in the model space are encouraged. This can be advantageous in comparison to Gibbs samplers and Metropolis-Hastings algorithms based on local proposal distributions, which may yield larger acceptance rates but are more prone to get stuck in local modes of the posterior model distribution. In future work one may also consider combinations of the adaptive independent proposals in MAdaSub with adaptive local proposals as, for example, in Lamnisos et al. (2013) and Zanella and Roberts (2019). While MAdaSub yields competitive results without the use of Rao-Blackwellization compared to the related adaptive algorithms of Griffin et al. (2021), the incorporation of Rao-Blackwellized estimates of posterior inclusion probabilities in the burn-in phase or as initial proposal probabilities may further increase the speed of convergence of MAdaSub. Finally, the extension of MAdaSub to settings with non-conjugate priors remains to be investigated, for example by considering data augmentation approaches with additional latent variables or by incorporating reversible-jump moves (Green, 1995; Wan and Griffin, 2021).

A Ergodicity of the MAdaSub algorithm
In this section we present a detailed proof for the ergodicity of the serial MAdaSub algorithm (see Theorem 5), i.e. we show that "in the limit" MAdaSub samples from the targeted posterior model distribution π(· | D) despite the continuing adaptation of the algorithm.
We will make use of a general ergodicity result for adaptive MCMC algorithms by Roberts and Rosenthal (2007). In order to state the result directly for the specific setting of the MAdaSub algorithm, we first introduce some notation.
For S ≠ S', the one-step transition kernel of MAdaSub with fixed proposal vector r̄ is given by P(S' | S; r̄) = q(S'; r̄) α(S' | S; r̄), where q(S'; r̄) is the probability of proposing the model S' and α(S' | S; r̄) is the corresponding acceptance probability.
Let P^t(· | S; r̄) denote the t-step transition kernel of MAdaSub when the vector of proposal probabilities r̄ is fixed (i.e. not adapted during the algorithm). Similarly, let P^(t)(· | S, r̄) denote the t-step transition kernel for the first t iterations of the adaptive MAdaSub algorithm, given only the initial conditions S^(0) = S and r̄^(0) = r̄.
The following theorem provides the ergodicity result of Roberts and Rosenthal (2007, Theorem 1) adjusted to the specific setting of MAdaSub.
Theorem A.1 (Roberts and Rosenthal, 2007). Consider the MAdaSub algorithm with initial parameters r^(0) ∈ (0, 1)^p, L_j > 0 and ε ∈ (0, 0.5). Suppose that for each fixed vector of proposal probabilities r̄ ∈ [ε, 1 − ε]^p, the one-step kernel P(· | · ; r̄) of MAdaSub is stationary for the target distribution π(· | D), i.e. for all S' ∈ M we have

Σ_{S∈M} π(S | D) P(S' | S; r̄) = π(S' | D).

Further suppose that the following two conditions hold:

(a) The simultaneous uniform ergodicity condition is satisfied, i.e. for all δ > 0 there exists an integer T ∈ N such that

‖P^T(· | S; r̄) − π(· | D)‖_TV ≤ δ for all S ∈ M and r̄ ∈ [ε, 1 − ε]^p,

where ‖P_1 − P_2‖_TV = sup_{A∈A} |P_1(A) − P_2(A)| denotes the total variation distance between two distributions P_1 and P_2 defined on some common measurable space (Ω, A).
(b) The diminishing adaptation condition is satisfied, i.e. we have

max_{S∈M} ‖P(· | S; r̄^(t)) − P(· | S; r̄^(t−1))‖_TV → 0 in probability as t → ∞,

where r̄^(t) and r̄^(t−1) are random vectors of proposal probabilities induced by the MAdaSub algorithm (see Notation A.1).
Then the MAdaSub algorithm is ergodic, i.e. for all S ∈ M and r̄ ∈ [ε, 1 − ε]^p we have

lim_{t→∞} ‖P^(t)(· | S, r̄) − π(· | D)‖_TV = 0.

Furthermore, the weak law of large numbers holds for MAdaSub, i.e. for any function g : M → R it holds that (1/T) Σ_{t=1}^{T} g(S^(t)) → E[g | D] in probability as T → ∞, where E[g | D] = Σ_S g(S) π(S | D) denotes the posterior expectation of g.
In the following we will show that MAdaSub satisfies both the simultaneous uniform ergodicity condition and the diminishing adaptation condition, so that Theorem A.1 can be applied.
Proof. Here we make use of a very similar argumentation as in the proof of Lemma 1 in Griffin et al. (2021). We show that M is a 1-small set (see Roberts and Rosenthal, 2004, Section 3.3), i.e. there exist β > 0 and a probability measure ν on M such that

P(S' | S; r̄) ≥ β ν(S') for all S, S' ∈ M and all r̄ ∈ [ε, 1 − ε]^p.

Then by Theorem 8 in Roberts and Rosenthal (2004), the simultaneous uniform ergodicity condition is satisfied.
In order to prove that M is 1-small (note that M is finite), it suffices to show that there exists a constant β_0 > 0 such that P(S' | S; r̄) ≥ β_0 for all S, S' ∈ M and all r̄ ∈ [ε, 1 − ε]^p.
Indeed, for S, S' ∈ M and r̄ ∈ [ε, 1 − ε]^p it holds that q(S'; r̄) ≥ ε^p, while the acceptance probability α(S' | S; r̄) is uniformly bounded away from zero, since the model space M is finite and the proposal probabilities are bounded away from zero and one; hence P(S' | S; r̄) ≥ q(S'; r̄) α(S' | S; r̄) ≥ β_0 for some constant β_0 > 0. This completes the proof.
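Spelled out, a valid uniform lower bound is the following (our reconstruction from the truncation r̄ ∈ [ε, 1 − ε]^p and the Metropolis-Hastings acceptance probability; the constant β_0 is not unique):

```latex
P(S' \mid S;\bar r)
\;\ge\; q(S';\bar r)\,\alpha(S'\mid S;\bar r)
\;\ge\; \varepsilon^{p}\,
\min\!\left\{1,\;\varepsilon^{p}\,
\frac{\min_{V\in\mathcal M}\pi(V\mid D)}{\max_{V\in\mathcal M}\pi(V\mid D)}\right\}
\;=:\;\beta_0\;>\;0,
```

using q(S; r̄) ≥ ε^p, q(S'; r̄) ≤ 1 and α(S' | S; r̄) = min{1, π(S' | D) q(S; r̄) / (π(S | D) q(S'; r̄))}; the 1-small property then holds with ν the uniform distribution on M and β = β_0 |M|.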
In order to show that the diminishing adaptation condition is satisfied for the MAdaSub algorithm, we will make repeated use of the following simple observation.
Lemma A.2. Let (a_j^(t))_{t∈N_0}, j = 1, . . . , m, be bounded real sequences with |a_j^(t) − a_j^(t−1)| → 0 as t → ∞ for all j. Then

|∏_{j=1}^{m} a_j^(t) − ∏_{j=1}^{m} a_j^(t−1)| → 0 as t → ∞. (35)

Proof. Since (a_j^(t))_{t∈N_0} are bounded sequences, there are constants L_j > 0 so that |a_j^(t)| ≤ L_j for all t ∈ N_0 and j ∈ {1, . . . , m}. We proceed by induction on m ∈ N: equation (35) obviously holds for m = 1. Now suppose that the assertion holds for m − 1; we show that it also holds for m. Then we have

|∏_{j=1}^{m} a_j^(t) − ∏_{j=1}^{m} a_j^(t−1)| ≤ |a_m^(t)| · |∏_{j=1}^{m−1} a_j^(t) − ∏_{j=1}^{m−1} a_j^(t−1)| + |a_m^(t) − a_m^(t−1)| · ∏_{j=1}^{m−1} |a_j^(t−1)|,

and both terms vanish as t → ∞ by the induction hypothesis and the boundedness of the sequences.

Lemma A.3. Consider the application of the MAdaSub algorithm on a given dataset D with some tuning parameter choices r^(0) ∈ (0, 1)^p, L_j > 0 and ε ∈ (0, 0.5). Then, for j ∈ P, we have |r_j^(t) − r_j^(t−1)| → 0 almost surely as t → ∞. Furthermore, for all S, S' ∈ M it holds that

|P(S' | S; r̄^(t)) − P(S' | S; r̄^(t−1))| → 0 almost surely as t → ∞. (37)

In particular, MAdaSub fulfils the diminishing adaptation condition.
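The product-difference argument in Lemma A.2 rests on a telescoping bound of the following form (our reconstruction, with L_j the bounds on |a_j^(t)|):

```latex
\Bigl|\prod_{j=1}^{m} a_j^{(t)} - \prod_{j=1}^{m} a_j^{(t-1)}\Bigr|
\;\le\; \sum_{j=1}^{m} \Bigl(\prod_{i \ne j} L_i\Bigr)\,
\bigl|a_j^{(t)} - a_j^{(t-1)}\bigr|,
```

which is obtained by replacing the factors a_j^(t) by a_j^(t−1) one at a time and bounding every remaining factor by its bound L_i.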
Proof. For j ∈ P, the adaptive update changes the proposal probability only slightly in each iteration, so that |r_j^(t) − r_j^(t−1)| → 0 almost surely as t → ∞ (the change is of order 1/(L_j + t)). With Lemma A.2 (set m = p and note that the number of variables p = |P| is fixed for the given dataset) we conclude that for V ∈ M it holds that |q(V; r̄^(t)) − q(V; r̄^(t−1))| → 0 almost surely. Let S, S' ∈ M and suppose that S ≠ S'. Then we have

P(S' | S; r̄^(t)) − P(S' | S; r̄^(t−1)) = q(S'; r̄^(t)) α(S' | S; r̄^(t)) − q(S'; r̄^(t−1)) α(S' | S; r̄^(t−1)).
Since q(S; r̄^(t)) − q(S; r̄^(t−1)) → 0 for all S ∈ M, we also have

|α(S' | S; r̄^(t)) − α(S' | S; r̄^(t−1))| ≤ |C(S') q(S; r̄^(t)) / (C(S) q(S'; r̄^(t))) − C(S') q(S; r̄^(t−1)) / (C(S) q(S'; r̄^(t−1)))| → 0,

where we made use of Lemma A.2 with m = 2 and the fact that the proposal probabilities q(·; r̄^(t)) are uniformly bounded away from zero. Again by using Lemma A.2 and combining equations (38), (39) and (40) we conclude that |P(S' | S; r̄^(t)) − P(S' | S; r̄^(t−1))| → 0 almost surely. Finally, we consider the case S = S'. Then it holds that

|P(S | S; r̄^(t)) − P(S | S; r̄^(t−1))| ≤ Σ_{S'≠S} |q(S'; r̄^(t)) α(S' | S; r̄^(t)) − q(S'; r̄^(t−1)) α(S' | S; r̄^(t−1))| → 0,

since M is finite. Thus we have shown that equation (37) holds for all S, S' ∈ M. In particular, we conclude that the diminishing adaptation condition is satisfied for MAdaSub (recall that almost sure convergence implies convergence in probability).
Proof. The MAdaSub algorithm fulfils the simultaneous uniform ergodicity condition (see Lemma A.1) and the diminishing adaptation condition (see Lemma A.3). Furthermore, for each fixedr ∈ [ , 1 − ] p , the corresponding transition kernel P (· | · ;r) is induced by a simple Metropolis-Hastings step and therefore has the desired target distribution π(· | D) as its stationary distribution. Hence, by Theorem A.1 the MAdaSub algorithm is ergodic and fulfils the weak law of large numbers.
Corollary 6. For all choices of r^(0) ∈ (0, 1)^p, L_j > 0 and ε ∈ (0, 0.5), the proposal probabilities r_j^(t) of the explanatory variables X_j in MAdaSub converge (in probability) to the respective posterior inclusion probabilities π_j = π(j ∈ S | D), i.e. for all j ∈ P it holds that r_j^(t) → π_j in probability as t → ∞.
Proof. Since MAdaSub fulfils the weak law of large numbers (Theorem 5), for j ∈ P the empirical inclusion frequency f_j^(t) = (1/t) Σ_{i=1}^{t} 1_{S^(i)}(j) converges in probability to π_j as t → ∞. Hence, for j ∈ P, we also have r_j^(t) → π_j in probability as t → ∞, since r_j^(t) differs from f_j^(t) only through the prior weight L_j/(L_j + t), which vanishes as t → ∞.
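Written out, the convergence in Corollary 6 follows from the decomposition below (our reconstruction of the update, ignoring the truncation to [ε, 1 − ε], which we assume inactive in the limit):

```latex
r_j^{(t)}
\;=\; \frac{L_j\, r_j^{(0)} + \sum_{i=1}^{t} \mathbb{1}_{S^{(i)}}(j)}{L_j + t}
\;=\; \underbrace{\frac{L_j}{L_j + t}\, r_j^{(0)}}_{\to\, 0}
\;+\; \underbrace{\frac{t}{L_j + t}}_{\to\, 1}\;
\underbrace{\frac{1}{t}\sum_{i=1}^{t} \mathbb{1}_{S^{(i)}}(j)}_{\xrightarrow{\;P\;}\, \pi_j}
\;\xrightarrow{\;P\;}\; \pi_j .
```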

B Algorithmic details of parallel version of MAdaSub
Algorithm 2 Parallel version of MAdaSub

Input:
• Number of workers K ∈ N.
• Number of rounds R ∈ N.
• Number of iterations per round T ∈ N.
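A natural way to combine the chains' information at the end of a round, consistent with the Beta-Bernoulli form of the serial update, is to pool the inclusion counts of all K workers (a sketch of our understanding of Algorithm 2, not a verbatim transcription; the exact weighting used in the paper may differ):

```python
import numpy as np

def combine_chains(r0, counts_per_chain, iters_per_chain, L, eps):
    """Parallel adaptation sketch: after a round, pool the inclusion
    counts of all K chains and let every chain restart the next round
    from the same combined proposal probabilities, truncated to
    [eps, 1 - eps] as in the serial algorithm."""
    total_counts = np.sum(np.asarray(counts_per_chain), axis=0)
    total_iters = int(np.sum(iters_per_chain))
    r = (L * np.asarray(r0) + total_counts) / (L + total_iters)
    return np.clip(r, eps, 1.0 - eps)
```

Pooling K chains of T iterations each gives the combined estimate an effective sample of K·T model draws per round, which is why the parallel version adapts faster per unit of wall-clock time once communication costs are amortized.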

C Ergodicity of parallel version of MAdaSub
In this section we extend the ergodicity result for the serial MAdaSub algorithm (Algorithm 1) of Section A to the parallel version of MAdaSub (Algorithm 2).
Proof. The proof of the simultaneous uniform ergodicity condition for each of the parallel chains is along the lines of the proof for the serial version of MAdaSub (see Lemma A.1).
As before, we can conclude with Theorem A.1 that each parallel chain is ergodic and fulfils the weak law of large numbers, provided that the diminishing adaptation condition is also satisfied for each of the parallel chains.
In order to show the diminishing adaptation condition for the chain on worker k ∈ {1, . . . , K} it suffices to show that for j ∈ P it holds that

|r_j^(k,t) − r_j^(k,t−1)| → 0 in probability as t → ∞, (41)

where r_j^(k,t) denotes the proposal probability of variable X_j after t iterations of the chain on worker k; the remaining steps of the proof are analogous to the proof of diminishing adaptation for the serial version of MAdaSub (see Lemma A.3). Note that in equation (42) we make use of the convention that Σ_{i=a}^{b} c_i = 0 for b < a; additionally, ⌊c⌋ denotes the greatest integer less than or equal to c ∈ R. Furthermore, note that for t = mT with m ∈ N it holds that r_j^(k,t) = r̄_j^(k,m) for j ∈ P, k ∈ {1, . . . , K}.
Using the triangle inequality (compare the proof of Lemma A.3) and noting that for all t, T ∈ N we have ⌊t/T⌋ − ⌊(t−1)/T⌋ ≤ 1, we conclude that for k ∈ {1, . . . , K} the differences |r_j^(k,t) − r_j^(k,t−1)| vanish in probability as t → ∞. Thus, we have shown that equation (41) holds and this completes the proof.
Proof. Since each chain in the parallel MAdaSub algorithm fulfils the weak law of large numbers (Theorem 7), for j ∈ P and k ∈ {1, . . . , K} the empirical inclusion frequencies converge in probability to π_j as t → ∞. Hence, for j ∈ P and k ∈ {1, . . . , K}, we also have r̄_j^(k,m) → π_j in probability as m → ∞.

D Further approaches related to MAdaSub

In the BAS algorithm, which samples models without replacement, it has to be ensured that no model is sampled twice and therefore, after each iteration of the algorithm, the sampling probabilities of some of the remaining models have to be renormalized. Additionally, BAS differs from the other methods discussed in Section 3.2 since it is not an MCMC algorithm and may yield biased estimates of posterior inclusion probabilities after a limited number of iterations.
Another related adaptive method for Bayesian variable selection has been proposed by Ji and Schmidler (2013). They consider an adaptive independence Metropolis-Hastings algorithm for sampling directly from the posterior distribution of the regression coefficients β = (β_1, . . . , β_p)^T, assuming that the prior of β_j for j ∈ P is given by a mixture of a point-mass at zero (indicating that the corresponding variable X_j is not included in the model) and a continuous normal distribution (indicating that variable X_j is "relevant").

G Additional results for the high-dimensional simulation study of Section 5.3

In this section we present additional results for the high-dimensional simulation study of Section 5.3 of the main document.
[Table: Performance comparison for the high-dimensional simulation settings (n, p) and varying SNR. In contrast to the results reported in Table 1 of the paper for the 20 variables with the largest estimated PIPs, here the median is taken over all variables, even though the majority of variables receives very small posterior probability. *Results for the exploratory individual adaptation (EIA) and adaptively scaled individual adaptation (ASI) algorithms are taken from Table 1 in Griffin et al. (2021). Comparisons between MAdaSub and the algorithms of Griffin et al. (2021) should be interpreted in a holistic way, as the computational systems, implementations and the specific simulated datasets for each setting may differ.]
[Table: Results for MAdaSub in the high-dimensional simulation setting with n = 500 and p = 500, with fixed choices of r_j^(k,0) = 10/p for all serial and parallel chains k. Performance of the MAdaSub algorithms (A) with serial and parallel updating schemes compared to the add-delete-swap MC^3 algorithm (B) in terms of median estimated ratios r̂_{A,B}^(20) of the relative time-standardized effective sample size for PIPs over the 20 variables with the largest estimated PIPs, and in terms of median acceptance rates (Acc.), for SNR ∈ {0.5, 1, 2, 3} and different initializations.]

[Table: Results for the parallel MAdaSub algorithm with L_j^(k) = L^(k) ~ U(p/2, 2p) for each chain k. Performance of the parallel MAdaSub algorithm (A) compared to the add-delete-swap MC^3 algorithm (B) in terms of median estimated ratios r̂_{A,B}^(20) of the relative time-standardized effective sample size for PIPs over the 20 variables with the largest estimated PIPs and in terms of median acceptance rates (Acc.).]
[Table: Performance in terms of median estimated ratios r̂_{A,B}^(20) of the relative time-standardized effective sample size for PIPs over the 20 variables with the largest estimated PIPs, median acceptance rates (Acc.) and median computation times (in seconds), for fixed (the same) initialisations of the tuning parameters across chains (see also Table G.2).]

Despite this, to avoid optimistic biases in the evaluation of the proposed algorithm (cf. Buchka et al., 2021), in Table 1 of the main document we still report the results for the parallel version with the originally considered random initializations of both tuning parameters. If the parallel chains communicate more frequently (i.e. for larger numbers of rounds R), then the convergence of the proposal probabilities is accelerated, leading to larger acceptance rates (for SNR ≥ 1); however, the higher frequency of communication between the chains comes at the price of larger computation times. For settings with high signal-to-noise ratios (SNR ≥ 2), the resulting median estimated ratios of the relative time-standardized effective sample size are largest for R ∈ [20, 100]. Note that we considered the number of parallel chains to be the same as the number of assigned CPUs (i.e. 5 parallel chains with 5 CPUs, see Section 5.3), which is the most natural choice. However, in practice the "optimal" choice of the number of rounds R may also depend on the number of available CPUs for parallel computation (especially in case this number is considerably different from the number of parallel MAdaSub chains).
H Additional results for Tecator data application of Section 6.1

Here we provide additional results regarding the efficiency of the serial MAdaSub algorithm under the same setting as in Lamnisos et al. (2013), where several adaptive and non-adaptive MCMC algorithms are compared using normal linear models for the Tecator data. In particular, Lamnisos et al. (2013) consider a classical MC^3 algorithm (Madigan et al., 1995), the adaptive Gibbs sampler of Nott and Kohn (2005) and adaptive and non-adaptive Metropolis-Hastings algorithms based on the tunable model proposal of Lamnisos et al. (2009). In the comparative study of Lamnisos et al. (2013) each algorithm is run for 2,000,000 iterations, including an initial burn-in period of 100,000 iterations. Furthermore, thinning is applied using only every 10th iteration, so that the finally obtained MCMC sample has size 190,000. For comparison reasons, after a burn-in period of 100,000 iterations, we run the serial MAdaSub algorithm for 190,000 iterations, so that the considered MCMC sample has the same size as in Lamnisos et al. (2013). In the serial MAdaSub algorithm we set r_j^(0) = 5/100 for j ∈ P, i.e. we use the prior inclusion probabilities as the initial proposal probabilities in MAdaSub; further, we set L_j = p for j ∈ P and ε = 1/p. Since the acceptance rate of MAdaSub is already sufficiently large in the considered setting, yielding a well-mixing algorithm, we do not consider additional thinning of the resulting chain. In fact, the acceptance rate of the serial MAdaSub chain is approximately 0.38 for the 190,000 iterations (excluding the burn-in period). We note that in this example the relatively large number of 100,000 burn-in iterations is not necessarily required for MAdaSub and is only used for comparison reasons. Lamnisos et al.
(2013) report estimated median effective sample sizes of the different samplers for the evolution of the indicators (γ_j^(t))_{t=1}^{T} for j ∈ P, where γ_j^(t) = 1_{S^(t)}(j) indicates whether variable X_j is included in the sampled model S^(t) in iteration t. The estimated median effective sample size for the 190,000 iterations of the serial MAdaSub algorithm is approximately 38,012 (using the R-package coda), which is slightly larger than the values for the competing algorithms reported in Lamnisos et al. (2013) (the largest one is 37,581 for the "optimally" tuned Metropolis-Hastings algorithm). Note that when using 1,900,000 iterations with thinning (every 10th iteration after 100,000 burn-in iterations) as in the other algorithms, the estimated median effective sample size for MAdaSub is much larger (178,334), yielding almost independent samples of size 190,000.
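For intuition, a crude autocorrelation-based effective sample size for a single inclusion-indicator chain can be computed as follows (a simple stand-in we wrote for illustration; coda's effectiveSize uses a spectral-density estimate and will generally give different numbers):

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """Crude effective sample size for a (binary) MCMC trace:
    ESS = T / (1 + 2 * sum of positive-lag autocorrelations),
    truncating the sum at the first non-positive autocorrelation."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    x = x - x.mean()
    var = np.dot(x, x) / T
    if var == 0.0:                       # constant chain: no information
        return 0.0
    if max_lag is None:
        max_lag = min(T - 1, 1000)
    s = 0.0
    for lag in range(1, max_lag + 1):
        rho = np.dot(x[:-lag], x[lag:]) / (T * var)
        if rho <= 0.0:                   # truncate at first non-positive lag
            break
        s += rho
    return T / (1.0 + 2.0 * s)
```

An ESS close to the run length T, as reported for MAdaSub above, indicates nearly independent draws of the inclusion indicators.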
We finally provide details on the computational costs of the serial and parallel versions of MAdaSub for the analysis of the Tecator data presented in Section 6.1 of the main document. The computation time for 5,000 iterations of the serial MAdaSub algorithm is approximately 3.5 seconds (using an R implementation of MAdaSub on an Intel(R) Core(TM) i7-7700K, 4.2 GHz processor); thus, even without parallelization, one obtains accurate posterior estimates with the serial MAdaSub algorithm within seconds using a usual desktop computer (e.g. after 10,000 or 15,000 iterations, see Figure 4 of the main document). Lamnisos et al. (2013) report that the computation times for each of the other considered MCMC methods were in the order of 25,000 seconds for the total number of 2,000,000 iterations (using a MATLAB implementation). Although the computation times are not directly comparable, these results indicate that the serial MAdaSub algorithm is already very efficient. The timings for MAdaSub are also of a similar order as for the recent adaptive algorithms of Griffin et al. (2021), who report that short runs of 6,000 iterations of the exploratory individual adaptation algorithm yield stable estimates for the Tecator data with computation times of about 5 seconds. When using a computer cluster with 50 CPUs, the overall computation time for all considered 50 MAdaSub chains (each with a large number of 290,000 iterations) is 460 seconds, while the computation time for a single chain is 231 seconds on the same system. This shows that, even though 25 of the 50 MAdaSub chains communicate with each other after every 5,000 iterations, the parallelization yields a substantial speed-up in comparison to a serial application of 50 independent chains.
I Additional results for PCR and Leukemia data applications of Section 6.2

To further illustrate the stability of the results, we examine three independent runs of the serial MAdaSub algorithm for the PCR and leukemia data, each with T = 1,000,000 iterations, setting r_j^(0) = q/p as initial proposal probabilities with different expected search sizes q: for the first run we set q = 2, for the second run q = 5 and for the third run q = 10.
Further tuning parameters are set to L_j = p and ε = 1/p for each of the three MAdaSub runs.

[Figure: Results of the serial MAdaSub runs and of 25 parallel MAdaSub chains exchanging information every 20,000 iterations (Algorithm 2, bottom) in terms of empirical variable inclusion frequencies f_j for most informative variables X_j (with final f_j ≥ 0.1 for at least one chain).]