Informed reversible jump algorithms

Incorporating information about the target distribution in proposal mechanisms generally produces efficient Markov chain Monte Carlo algorithms (or at least, algorithms that are more efficient than uninformed counterparts). For instance, incorporating gradient information in fixed-dimensional algorithms has proved successful, as seen with algorithms such as Hamiltonian Monte Carlo. In trans-dimensional algorithms, Green (2003) recommended sampling the parameter proposals during model switches from normal distributions with informative means and covariance matrices. These proposal distributions can be viewed as asymptotic approximations to the parameter distributions, where the limit is with regard to the sample size. Models, in contrast, are typically proposed using uninformed uniform distributions. In this paper, we build on the approach of Zanella (2020) for discrete spaces to incorporate information about neighbouring models. We rely on approximations to posterior model probabilities that are asymptotically exact. We prove that, in some scenarios, the samplers combining this approach with that of Green (2003) behave in the large-sample regime like ideal ones that use the exact model probabilities and sample from the correct parameter distributions. We show that the implementation of the proposed samplers is straightforward in some cases. The methodology is applied to a real-data example. The code is available online.


Reversible jump algorithms
Reversible jump (RJ, Green (1995)) algorithms are Markov chain Monte Carlo (MCMC) methods that one uses to sample from a target distribution π( · , · | D_n) defined on a union of sets ∪_{k∈K} {k} × R^{d_k}, K being a countable set and the d_k positive integers. This distribution corresponds in Bayesian statistics to a joint posterior distribution of a model indicator K ∈ K and the parameters of Model K, denoted by X_K ∈ R^{d_K}, given D_n, a data sample of size n. Such a posterior distribution allows one to jointly infer about (K, X_K), or in other words, to simultaneously achieve model selection/averaging (Hoeting et al., 1999) and parameter estimation. In the following, we assume for simplicity that the parameters of all models are continuous random variables. Again for simplicity, we will abuse notation by also using π( · , · | D_n) to denote the joint posterior density with respect to a product of the counting and Lebesgue measures. We will use π( · | D_n) and π( · | k, D_n) to denote the marginal posterior probability mass function (PMF) of K and the conditional posterior distribution/density of X_K given that K = k, respectively. Note that in this paper, we assume that K is defined such that π(k | D_n) > 0 for all k.
The variable K ∈ K can take different forms in practice. In mixture modelling (Richardson and Green, 1997), K is the number of components and K is a subset of the positive integers and defines a sequence of nested models, i.e. Model 1 is nested in Model 2, which is nested in Model 3, and so on. Often there is no such "ordering" between the models. This is for instance the case in variable selection. In this framework, K is a vector of 0's and 1's indicating which covariates are included in a model; for instance, Model k_0 = (1, 1, 1, 0, 0, 1, 0, . . . , 0) is the model with covariates 1, 2, 3 and 6 (as in Figure 1 below). K is thus a collection of such vectors. A RJ algorithm explores both the model space and the parameter spaces. We focus in this paper on the case where K does not define a sequence of nested models. When K defines a sequence of nested models, K can be recoded as an ordinal discrete variable as above; this case has recently been studied by Gagnon and Doucet (2021), who proposed efficient non-reversible trans-dimensional samplers.
The proposal mechanism in a RJ algorithm can be outlined as follows: given a current state of the Markov chain (k, x_k), the algorithm first proposes a model, say Model k′, by using a PMF g(k, · ), and then parameter values for this model by applying a diffeomorphism D_{k→k′} to x_k and auxiliary variables u_{k→k′} ∼ q_{k→k′}, yielding the parameter proposal y_{k′} and another vector of auxiliary variables u_{k′→k}. The proposal is accepted, meaning that the next state of the Markov chain is (k′, y_{k′}), with probability (assuming that the current state has positive density under the target):

α_RJ((k, x_k), (k′, y_{k′})) := 1 ∧ [π(k′, y_{k′} | D_n) g(k′, k) q_{k′→k}(u_{k′→k})] / [π(k, x_k | D_n) g(k, k′) q_{k→k′}(u_{k→k′})] × |J_{D_{k→k′}}(x_k, u_{k→k′})|,    (1)

where x ∧ y := min(x, y) and |J_{D_{k→k′}}(x_k, u_{k→k′})| is the absolute value of the determinant of the Jacobian matrix of the function D_{k→k′}. If the proposal is rejected, the chain remains at the same state (k, x_k).
Recall that a diffeomorphism is a differentiable map having a differentiable inverse, and note that we abused notation by using q_{k→k′} to denote both the distribution and the probability density function (PDF) of u_{k→k′}. The notation k → k′ in subscript is used to highlight a dependence on the model transition that is proposed, which is from Model k to Model k′. In the particular case where k′ = k, we say that a parameter update is proposed; otherwise, we say that a model switch is proposed. Model k′ is reachable from Model k if it belongs to the support of g(k, · ), which is considered in this paper to be the neighbourhood of Model k, denoted by N(k).
In trans-dimensional samplers, the neighbourhoods are usually formed of the "closest" models. The notion of "proximity" is often natural, making these closest models straightforward to identify. For instance, in Section 4 we apply the proposed methodology in a variable-selection example; there, the closest models are those with one additional covariate and those with one covariate fewer, to which the current model itself is added. In this paper, we consider that Model k belongs to the support of g(k, · ) for all k; this will be seen to be useful when the posterior model PMF concentrates.
Algorithm 1 presents a general RJ. Looping over the steps described in Algorithm 1 produces Markov chains that are reversible with respect to the target distribution π( · , · | D_n). If the chains are in addition irreducible and aperiodic, then they are ergodic (Tierney, 1994), which guarantees, among other things, that the law of large numbers holds, implying that ergodic averages are approximations to expectations under π( · , · | D_n).
Algorithm 1 (general RJ). Given the current state (k, x_k):
1. Generate k′ ∼ g(k, · ).
2. Generate u_{k→k′} ∼ q_{k→k′} and compute (y_{k′}, u_{k′→k}) = D_{k→k′}(x_k, u_{k→k′}).
3. Generate u ∼ Uniform(0, 1). If u ≤ α_RJ((k, x_k), (k′, y_{k′})), set the next state of the chain to (k′, y_{k′}). Otherwise, set it to (k, x_k).
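To make the mechanics concrete, here is a minimal Python sketch of one RJ iteration on a hypothetical toy example (not the paper's): Model k has k independent standard-normal parameters, the two models have equal prior probabilities, and the birth move 1 → 2 draws u ∼ N(0, 1) and sets y = (x, u), so the diffeomorphism is the identity map and the Jacobian term is 1.

```python
import math
import random

# Hypothetical toy: Model k has k parameters, all independent standard
# normals, and the two models have equal prior probabilities.

def log_norm_pdf(x):
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x

def log_target(k, xk):  # unnormalized log pi(k, x_k | D_n) for the toy
    return math.log(0.5) + sum(log_norm_pdf(x) for x in xk)

def rj_step(k, xk, rng):
    g = {1: {2: 1.0}, 2: {1: 1.0}}  # each model has a single neighbour
    k_new = 2 if k == 1 else 1
    if k_new > k:                   # birth: append an auxiliary draw
        u = rng.gauss(0.0, 1.0)
        y = xk + (u,)
        log_alpha = (log_target(k_new, y) - log_target(k, xk)
                     + math.log(g[k_new][k]) - math.log(g[k][k_new])
                     - log_norm_pdf(u))   # divide by q_{k->k'}(u)
    else:                           # death: drop the last coordinate
        y, u = xk[:-1], xk[-1]
        log_alpha = (log_target(k_new, y) - log_target(k, xk)
                     + math.log(g[k_new][k]) - math.log(g[k][k_new])
                     + log_norm_pdf(u))   # multiply by q_{k'->k}(u)
    if math.log(rng.random()) <= min(0.0, log_alpha):
        return k_new, y
    return k, xk
```

In this particular toy the acceptance ratio equals 1, so every proposal is accepted; in general the ratio reflects how well the proposal mechanism matches the target.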

Problems with RJ
Algorithm 1 takes as inputs: an initial state (k, x_k), a total number of iterations, a PMF g(k, · ) for each Model k, and functions q_{k→k′} and D_{k→k′} for each pair (k, k′) with Model k′ being reachable from Model k. Implementing RJ is well known for being a difficult task, considering the large number of functions that need to be specified and the frequent lack of intuition about how one should achieve their specification, the latter being especially true for the functions involved in model switches. A significant amount of work has been carried out to address the specification of the functions q_{k→k′} and D_{k→k′} involved in model switches in a principled and informed way; see, e.g., Green (2003) and Brooks et al. (2003). The approaches of these authors are arguably the most popular. Their objective is the following: given x_k ∼ π( · | k, D_n), we want to identify q_{k→k′}, q_{k′→k} and D_{k→k′} such that applying the transformation D_{k→k′} to (x_k, u_{k→k′}) ∼ π( · | k, D_n) ⊗ q_{k→k′} leads to (y_{k′}, u_{k′→k}) ∼ π( · | k′, D_n) ⊗ q_{k′→k} (at least approximately). They essentially look for a way to sample from the conditional distributions π( · | k′, D_n), in this constrained framework. This in turn aims at increasing the acceptance probability α_RJ defined in (1) towards

α_MH(k, k′) := 1 ∧ [π(k′ | D_n) g(k′, k)] / [π(k | D_n) g(k, k′)],    (2)

which corresponds to the acceptance probability in a Metropolis–Hastings (MH, Metropolis et al. (1953) and Hastings (1970)) algorithm targeting the PMF π( · | D_n). The approach of Green (2003), for instance, proceeds as if the conditional distributions π( · | k′, D_n) were normal. Notwithstanding the merit of this objective, it is to be noticed that, even when it is achieved, only "half" of the work for maximizing α_RJ is done: as shown in Figure 1, poor models may be proposed more often than they should be if g is not well designed, leading to smaller acceptance rates.
Note that, when considering candidate proposal distributions g(k, · ) that all have the same support N(k), choosing the one that maximizes the acceptance probability α_RJ represents a first step towards proving that it is the best (or at least one of the best), given that they all allow the chain to reach the same models; this choice is indeed expected to improve mixing.

Objectives
The specification of g has been overlooked; this PMF is indeed typically set to a uniform distribution, as in Figure 1(b). The first objective of this paper is to propose methodology to incorporate information about neighbouring models in its design, so as to make fully-informed RJ available. To achieve this, we rely on a generic technique recently introduced by Zanella (2020) that is used to construct informed samplers for discrete state-spaces. Let us assume for a moment that we have access to the unnormalized posterior model probabilities and that we use a MH algorithm to sample from π( · | D_n); the technique consists in constructing what the author calls locally-balanced proposal distributions:

g(k, k′) ∝ h(π(k′ | D_n) / π(k | D_n)) 1(k′ ∈ N(k)),    (3)

where h is a continuous positive function such that h(0) = 0 and h(x) = x h(1/x) for all positive x (the square root satisfies this condition, for instance), and 1 is the indicator function. Such a function h leads to an acceptance probability in the MH sampler given by α_MH(k, k′) = 1 ∧ c(k)/c(k′), where c(k) is the normalizing constant of g(k, · ). A motivation for using this technique is that, in the limit, when the ambient space becomes larger and larger (but the neighbourhoods have a fixed size), there is no need for an accept-reject step anymore; the proposal distributions leave the distribution π( · | D_n) invariant.

[Figure 1: (a) the neighbourhood of the current model in the variable-selection example of … (1989); the models with an additional covariate can be found in the upper part, whereas those with one less variable are in the lower part; (b) probabilities of proposing the models forming the neighbourhood (represented by the size of the dots) when the current model is that with covariates {1, 2, 3, 6} and g is the uniform over the neighbourhood; (c) the same probabilities when g is instead an informed version (as defined in Section 1.3).]
Zanella (2020) indeed proves that c(k)/c(k′) −→ 1 for all possible pairs (k, k′) under some conditions. These properties suggest that locally-balanced samplers are efficient, at least in high dimensions. The author in fact empirically shows that they perform better than alternative solutions to sample from PMFs in some practical examples, with a highly marked difference in the examples with high-dimensional spaces.
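As a sketch, the locally-balanced weights can be computed as follows, assuming unnormalized posterior model probabilities are available for the current model and its neighbours (the dictionary `pi_un`, the model labels and the function names are all illustrative):

```python
import math

# Locally-balanced model-proposal weights in the style of Zanella (2020).

def sqrt_h(x):           # h(x) = sqrt(x)
    return math.sqrt(x)

def barker_h(x):         # h(x) = x / (1 + x), the Barker choice
    return x / (1.0 + x)

def informed_proposal(k, neighbours, pi_un, h):
    """g(k, k') proportional to h(pi(k' | D_n) / pi(k | D_n)) over N(k)."""
    w = {kp: h(pi_un[kp] / pi_un[k]) for kp in neighbours}
    c = sum(w.values())  # normalizing constant c(k)
    return {kp: wk / c for kp, wk in w.items()}, c

# Both choices satisfy the balance condition h(x) = x * h(1/x):
for h in (sqrt_h, barker_h):
    for x in (0.5, 2.0, 7.3):
        assert abs(h(x) - x * h(1.0 / x)) < 1e-12
```

The resulting proposal mass is skewed towards the neighbours with the largest posterior probabilities, which is precisely what Figure 1(c) illustrates.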
The first obstacle to achieving our first objective is that we typically do not have direct access to the model probabilities, because they involve integrals over the parameter spaces. Drawing inspiration from the approach of Green (2003), which can be viewed as using asymptotic approximations to π( · | k′, D_n) to design RJ, where the limit is with regard to the sample size, we propose to use approximations to the unnormalized version of the model probabilities π(k | D_n) whose accuracy increases with n. We prove that, in some scenarios, RJ using both approximations behave asymptotically as RJ that set g(k, k′) ∝ h(π(k′ | D_n)/π(k | D_n)) 1(k′ ∈ N(k)) and that sample parameters from the correct conditional distributions π( · | k′, D_n). These ideal RJ have acceptance probabilities equal to α_MH. All this suggests that the resulting RJ with asymptotically locally-balanced proposal distributions are efficient, at least when n is large enough. We show that this is the case even in a moderate-size real-data example. In this example, the PMF π( · | D_n) concentrates on several models. We also provide evidence that the proposed RJ are efficient when n is large and the PMF π( · | D_n) is highly concentrated, so that the mass is non-negligible only for a handful of models.
The approximations on which the proposed methodology is based are more accurate when all parameters of all models take values on the real line. We thus recommend applying a transformation to parameters for which this is not the case. Such a practice is often beneficial for MCMC methods. It is in fact required when using Hamiltonian Monte Carlo (HMC, see, e.g., Neal (2011)), which is employed to perform parameter updates within RJ in the real-data application of Section 4.
The second objective of this paper is to make clear how each function required for implementation should be specified, at least in some cases, allowing a fully-automated procedure. These cases are those where each of the log conditional densities log π( · | k, D n ) has a well-defined mode and second derivatives that exist and that are continuous. The procedure can be executed even if the model space is large or infinite, as long as the posterior model PMF concentrates on a reasonable number of models, which is expected in practice.

Organization of the paper
In Section 2, we address the specification of the inputs required for the implementation of the proposed samplers; in particular, we discuss the design of the PMFs g(k, · ), so that both objectives of the paper are achieved. We next present in Section 3 a theoretical result about the asymptotic behaviour of the proposed RJ, and an analysis of the limiting RJ. In Section 4, the methodology is evaluated in a real-data variable-selection example. The paper finishes in Section 5 with retrospective comments and possible directions for future research.
The approximations on which the design of the functions g(k, · ), q_{k→k′} and D_{k→k′} rely can be inaccurate when the sample size is not sufficiently large. There exist methods that allow one to build upon the approximations used to design q_{k→k′} and D_{k→k′} to improve the parameter-proposal mechanism; the methods of Karagiannis and Andrieu (2013) and Andrieu et al. (2018) are examples. An overview of these methods is provided in Section 2, while the details are presented in Appendix A. In Section 2, it is explained that these methods are also useful for improving the model-proposal mechanism. In fact, as the precision parameters of these methods increase without bounds, the resulting samplers converge towards ideal ones that have access to the unnormalized model probabilities to design g and that sample parameters from the correct conditional distributions π( · | k′, D_n), for fixed n (see Appendix A for the details). The methods of Karagiannis and Andrieu (2013) and Andrieu et al. (2018) are sophisticated and come at a computational cost (especially that of Karagiannis and Andrieu (2013)), which may not be worthwhile when n is large enough for the problem at hand. This is the case in our variable-selection example, highlighting that the simple RJ proposed in this paper, which use simple approximations, are efficient in certain practical situations. The methods of Karagiannis and Andrieu (2013) and Andrieu et al. (2018) are nevertheless useful in the variable-selection example, as they allow us to show which of the approximations used in the design of g(k, · ), q_{k→k′} and D_{k→k′} explain the gap between the proposed RJ and the ideal ones. Their computational cost is offset by a significant enough improvement in, for instance, the change-point problem presented in Green (1995) (as shown in Karagiannis and Andrieu (2013) and Andrieu et al. (2018)).
Some details of the variable-selection example are presented in Appendix B. All proofs of theoretical results are provided in Appendix C.

Input design and proposed RJ
We start in Section 2.1 with the proposed design of g. We next turn in Section 2.2 to the proposed design of the functions q_{k→k′} and D_{k→k′}, which will be seen to represent a simplified version of that in Green (2003). We present in Section 2.3 the resulting RJ. In Section 2.4, we present an overview of the methods of Karagiannis and Andrieu (2013) and Andrieu et al. (2018) to improve upon the approximations used in the design of the proposal mechanism when these approximations are not accurate enough.

Specification of the model proposal distributions
The design of g starts with the definition of neighbourhoods around all models which specify the support of g(k, · ) for all k. As mentioned in Section 1.1, it is typically possible to define these neighbourhoods in a natural way. It will thus be considered that the neighbourhoods are given and that we can build on these to address the specification of g.
In practice, g(k, · ) is commonly set to the uniform distribution over N(k): g(k, k′) = 1/|N(k)| for all k′ ∈ N(k), where |N(k)| is the cardinality of N(k). In the simple situation where u_{k→k′} = y_{k′} ∼ π( · | k′, D_n) and |N(k)| = |N(k′)| for all possible pairs (k, k′) (the latter is true in variable selection with neighbourhoods defined as in Section 1.1), the RJ acceptance probability (1) reduces to 1 ∧ π(k′ | D_n)/π(k | D_n). It is thus easily seen that a chain may get stuck for several iterations when there are a lot of poor models with π(k′ | D_n) ≪ π(k | D_n) in the neighbourhood; this is especially true in high dimensions because the number of poor models may be very high. Our goal is to include local neighbourhood information in the PMFs g(k, · ) to skew the latter towards high-probability models.
We propose here to follow the strategy of Zanella (2020), implying that ideally we would set g(k, k′) ∝ h(π(k′ | D_n)/π(k | D_n)) for all k′ ∈ N(k) (recall (3)), where, for all k, the unnormalized model probability is proportional to π(k) ∫ π_un.(x_k | k, D_n) dx_k, π_un.( · | k, D_n) := L( · | k, D_n) π( · | k) being the unnormalized posterior density under Model k, given by the product of the likelihood function and the prior parameter density under Model k, and π(k) being the prior probability assigned to Model k. These integrals are typically intractable. We propose to approximate them, and for this we assume that each log conditional density log π( · | k, D_n) has a well-defined mode and that its second derivatives exist and are continuous. The approximation that we propose to apply is the Laplace approximation, meaning that we write π_un.( · | k, D_n) = exp(log π_un.( · | k, D_n)) and log π_un.(x_k | k, D_n) is developed using a Taylor series expansion (see, e.g., Davison (1986)). This yields:

π̂(k | D_n) := π(k) (2π)^{d_k/2} |Î_k|^{−1/2} π_un.(x̂_k | k, D_n),    (4)

where x̂_k is the maximizer of π_un.( · | k, D_n) (and π( · | k, D_n)) and Î_k is minus the matrix of second derivatives of log π_un.( · | k, D_n) (and log π( · | k, D_n)). We use Î_k to denote this matrix because, when the prior density of the parameters is uniform, it corresponds to the observed information matrix. It is evaluated at x̂_k, but we leave this implicit to simplify the notation because it will always be the case in this paper. Note that other approximations may be employed. In our framework, we are interested in approximations that are easy to compute (at least in some cases) and asymptotically exact as n −→ ∞. This is the case for π̂(k | D_n) in (4); it is more precisely a consistent estimator of π(k | D_n) under the assumptions mentioned above. We thus propose to define the model proposal distributions as follows:

g(k, k′) ∝ h(π̂(k′ | D_n)/π̂(k | D_n)) for all k′ ∈ N(k).

The function h needs to be specified.
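A minimal sketch of the Laplace approximation above, on a toy one-dimensional model where log π_un. is exactly quadratic, so that the approximation recovers the integral exactly (all names are illustrative):

```python
import math

# Laplace approximation of the unnormalized model probability:
# pi_hat(k | D_n) = pi(k) (2 pi)^(d_k / 2) |I_hat_k|^(-1/2)
#                   * pi_un(x_hat_k | k, D_n).

def laplace_unnorm_prob(prior_k, log_pi_un, x_hat, info, d_k=1):
    # `info` is minus the second derivative of log pi_un at x_hat
    return (prior_k * (2.0 * math.pi) ** (d_k / 2.0)
            * info ** (-0.5) * math.exp(log_pi_un(x_hat)))

# Toy: pi_un(x) = exp(-(x - 2)^2 / (2 * 0.25)), i.e. a N(2, 0.25)
# density without its normalizing constant sqrt(2 pi * 0.25).
log_pi_un = lambda x: -((x - 2.0) ** 2) / (2.0 * 0.25)
x_hat, info = 2.0, 1.0 / 0.25   # mode, and minus the second derivative
approx = laplace_unnorm_prob(1.0, log_pi_un, x_hat, info)
exact = math.sqrt(2.0 * math.pi * 0.25)  # true value of the integral
```

For non-quadratic log-densities the approximation is no longer exact, but it becomes so as n → ∞ under the stated assumptions.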
A natural choice (that does not satisfy the conditions mentioned in Section 1.3) is the identity function: g(k, k′) ∝ π̂(k′ | D_n)/π̂(k | D_n) ∝ π̂(k′ | D_n). The resulting proposal distributions are called globally-balanced in Zanella (2020). To understand why, consider the situation where the size of K is small and it is feasible to switch from any model to any other one, so that we can set N(k) = K for all k. In this case, g(k, k′) is exactly equal to π̂(k′ | D_n) and, for large enough n, π̂(k′ | D_n) ≈ π(k′ | D_n), corresponding to independent Monte Carlo sampling when u_{k→k′} = y_{k′} ∼ π( · | k′, D_n). Using such global proposal distributions g(k, · ) with N(k) = K thus asymptotically leaves the target distribution invariant without an accept-reject step, hence the name globally-balanced. In contrast, when the ambient space becomes larger and larger and the neighbourhoods have a fixed size, i.e. when the locally-balanced proposal distributions asymptotically leave the target distribution invariant, setting h to the identity function instead asymptotically leaves invariant a distribution under which X_K | K ∼ π( · | K, D_n) and K has a PMF proportional to π( · | D_n)². We recommend setting h to the identity only when it is feasible to set N(k) = K for all k. Otherwise, it seems better to use locally-balanced proposal distributions; this is the case, for instance, in Zanella (2020) and in our moderate-size (both in n and dimension) variable-selection example. Two choices of locally-balanced functions h are considered in Zanella (2020): h(x) = √x and h(x) = x/(1 + x). The choice h(x) = x/(1 + x) yields what the author calls Barker proposal distributions because of the connection with Barker (1965)'s acceptance-probability choice: with this h, the unnormalized proposal weights take the form π(k′ | D_n)/(π(k | D_n) + π(k′ | D_n)). The empirical results of Zanella (2020) suggest that this latter choice is superior because h is bounded, which stabilizes the normalizing constants c(k) and c(k′) and thus the acceptance probabilities.
Livingstone and Zanella (2019) reached the same conclusion using locally-balanced proposal distributions for continuous random variables. In our numerical analyses, both choices led to similar performance. In practice, a user will choose one; we thus recommend setting h(x) = x/(1 + x).

Specification of the parameter proposal distributions
In this section, we discuss the specification of the functions q_{k→k′} and D_{k→k′} that are used when k′ ≠ k, i.e. during model-switch attempts. As mentioned in Section 1.3, HMC is used during parameter-update attempts; the tuning of the HMC free parameters is discussed in Section 2.3. Green (2003) proposed to set q_{k→k′} and D_{k→k′} as if the parameters of both Model k and Model k′ were normally distributed. More precisely, when d_{k′} > d_k (the other cases are similar), the author proposed to set

y_{k′} = µ_{k′} + L_{k′} ((L_k^{−1}(x_k − µ_k))^T, u_{k→k′}^T)^T,    (5)

where µ_k and µ_{k′} are vectors and L_k and L_{k′} are matrices. If π( · | k, D_n) and π( · | k′, D_n) are normal distributions with means µ_k and µ_{k′} and covariance matrices L_k L_k^T and L_{k′} L_{k′}^T, then setting q_{k→k′} to a (d_{k′} − d_k)-dimensional standard normal yields y_{k′} ∼ π( · | k′, D_n). The function yielding y_{k′} in (5) defines D_{k→k′}. Note that in this case u_{k′→k} is non-existent. Note also that, in the following, we will write ϕ( · ; µ, Σ) to denote both the distribution and the PDF of a normal with mean µ and covariance matrix Σ.
The rationale behind the approach of Green (2003) is to propose y_{k′} around a suitable location with a suitable covariance structure using a normal distribution. This approach is asymptotically valid in some cases, by virtue of Bernstein–von Mises theorems (see, e.g., Theorem 10.1 in Van der Vaart (2000) and Kleijn and Van der Vaart (2012)), meaning that, for regular models, X_k | k, D_n is asymptotically normally distributed as n −→ ∞, for all k, when K and the d_k are fixed and do not depend on n. In fact, π( · | k, D_n) concentrates around x̂_k, with x̂_k −→ x*_k, at a rate of 1/√n, with a shape that resembles a normal for large enough n (see Section 2.1 for the definition of x̂_k). When Model k is well specified, x*_k is the true parameter value and the covariance matrix is I_k^{−1}, the latter denoting the inverse information matrix based on the whole data set, evaluated at x*_k. We make the dependence on x*_k implicit to simplify; the inverse information matrix will always be evaluated at x*_k in the following. The information matrix I_k is equivalent to n times the information of one data point in the independent and identically distributed (IID) setting. In the misspecified case, x*_k is the best possible value under Model k and the covariance matrix is slightly different from I_k^{−1} (see Kleijn and Van der Vaart (2012) for the details). The approach to specifying q_{k→k′} and D_{k→k′} proposed here is a simpler version of that of Green (2003), but one that is based on the same idea: set

u_{k→k′} = y_{k′} ∼ q_{k→k′} = ϕ( · ; x̂_{k′}, Î_{k′}^{−1}),

for any reachable Model k′, regardless of k (see Section 2.1 for the definition of Î_k). Consequently, u_{k′→k} = x_k and D_{k→k′}(x_k, u_{k→k′}) = (u_{k→k′}, x_k) = (y_{k′}, u_{k′→k}), implying that q_{k′→k} = ϕ( · ; x̂_k, Î_k^{−1}) and |J_{D_{k→k′}}(x_k, u_{k→k′})| = 1. Thus, if Model k′ is a well-specified and regular model, y_{k′} is asymptotically generated from the correct conditional distribution, i.e.
q_{k→k′} and π( · | k′, D_n) are asymptotically equal, because π( · | k′, D_n) is asymptotically equal to ϕ( · ; x̂_{k′}, I_{k′}^{−1}) (Bernstein–von Mises theorem) and Î_{k′}/n − I_{k′}/n −→ 0 in probability, where we say that a matrix converges in probability whenever all of its entries converge in probability.
It may be the case that q_{k→k′} and π( · | k′, D_n) are not asymptotically equal because Model k′ may be misspecified, implying that the limiting covariance of π( · | k′, D_n) is not I_{k′}^{−1}. The difference between I_{k′}^{−1} and the correct limiting covariance depends on how close Model k′ is to the true model, in a Kullback–Leibler divergence sense. In general, the true model is unknown; we therefore cannot obtain an analytical expression for the correct limiting covariance. The recommended methodology thus represents, in some cases, an approximate version of the asymptotically exact one, but it is a methodology that can easily be applied and, fortunately, the resulting algorithms are expected to rarely propose models that are far from the true model, thanks to the design of g, as long as a subset of the models in K are suitable approximations to the true model. In fact, when the latter is true, the resulting algorithms are expected to behave similarly to the asymptotically exact ones. This discussion is made more precise in Section 3. In our numerical example, the resulting algorithms are sufficiently close to the asymptotically exact ones and n is sufficiently large so as to yield efficient samplers. Note that the covariance matrices can be estimated otherwise to obtain a generic asymptotically exact methodology; for instance, they can be estimated using pilot MCMC runs. This has the disadvantage of being computationally expensive and challenging.
We finish this section by noting that the approach of Green (2003) is also an approximate version of the asymptotically exact one when µ_k = x̂_k and L_k is such that L_k L_k^T = Î_k^{−1} for all k. We also note that a RJ user employing the approach of Green (2003), or that described above, already computes what is required to design g(k, · ) as in Section 2.1. This user can therefore implement the proposed methodology without much additional effort.
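The simplified proposal mechanism of this section can be sketched as follows, for one-dimensional parameters; `fits`, `log_pi_un` and the toy setting in the usage note are illustrative. In the special case where both conditionals are exactly normal and match the fitted approximations, the parameter-related terms cancel and the acceptance probability reduces to α_MH.

```python
import math
import random

# Model-switch attempt with the simplified Green-style proposal:
# y_{k'} ~ N(x_hat_{k'}, I_hat_{k'}^{-1}) independently of x_k, so
# q_{k'->k} = N(x_hat_k, I_hat_k^{-1}) and the Jacobian term is 1.

def log_norm(x, mu, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)

def switch_attempt(k, x, kp, log_pi_un, g, fits, rng):
    """fits[k] = (x_hat_k, var_k), var_k playing the role of I_hat_k^{-1}."""
    mu_p, var_p = fits[kp]
    y = rng.gauss(mu_p, math.sqrt(var_p))       # parameter proposal
    mu_c, var_c = fits[k]
    log_alpha = (log_pi_un(kp, y) - log_pi_un(k, x)
                 + math.log(g[kp][k]) - math.log(g[k][kp])
                 + log_norm(x, mu_c, var_c)     # q_{k'->k}(x_k)
                 - log_norm(y, mu_p, var_p))    # q_{k->k'}(y_{k'})
    if math.log(rng.random()) <= min(0.0, log_alpha):
        return kp, y
    return k, x
```

When the normal approximations are poor, log_alpha becomes systematically negative and switches are rejected, which is precisely the situation addressed in Section 2.4.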

Resulting RJ
We now present in Algorithm 2 a special case of Algorithm 1 with inputs specified as in Sections 2.1 and 2.2.
Algorithm 2 (informed RJ). Given the current state (k, x_k):
1. Generate k′ ∼ g(k, · ), where g(k, k′) ∝ h(π̂(k′ | D_n)/π̂(k | D_n)) for all k′ ∈ N(k), as in Section 2.1.
2. (a) If k′ = k, attempt a parameter update while keeping the value of the model indicator k fixed. (b) If k′ ≠ k, generate y_{k′} ∼ ϕ( · ; x̂_{k′}, Î_{k′}^{−1}), as in Section 2.2.
3. If a model switch was proposed, generate u ∼ Uniform(0, 1); if u ≤ α_RJ((k, x_k), (k′, y_{k′})), set the next state of the chain to (k′, y_{k′}). Otherwise, set it to (k, x_k).
4. Go to Step 1.
Recall that we recommend to set h to the identity function when N(k) = K for all k and to a locally-balanced function, namely h(x) = x/(1 + x), otherwise.
We recommend applying HMC to update the parameters in Step 2.(a). HMC generates paths that evolve according to Hamiltonian dynamics, thus exploiting the local structure of the target through gradient information; the endpoints of these paths are used as proposals within a MH scheme. This proposal mechanism allows the chain to take large steps while not ending up in areas with negligible density. HMC thus has the ability to decorrelate quickly, a feature that is especially important in RJ given that the samplers may not spend many consecutive iterations updating the parameters.
There exist several variants of HMC, the most popular being the No-U-Turn sampler (Hoffman and Gelman, 2014). The latter can be run automatically for parameter estimation using the probabilistic programming language Stan (see, e.g., Carpenter et al. (2017)). To our knowledge, it is not possible to directly use the Stan implementation within RJ to update the parameters, and given that the No-U-Turn sampler is difficult to implement, we have chosen to employ the "vanilla" version of HMC (see, e.g., Neal (2011)). The implementation of this version for a given model requires the specification of three free parameters: a step size, a trajectory length and a mass matrix. To tune these parameters for a given model, we recommend first running Stan with the option static HMC (meaning that vanilla HMC is employed), and extracting the step size and the marginal empirical standard deviations of all parameters. This step size results from a tuning procedure and can thus be used directly. The mass matrix is set to a diagonal matrix whose diagonal elements are these standard deviations. For the trajectory length, a grid search is applied and the best value found is used. The merit of each trajectory length is evaluated through the minimum of the marginal effective sample sizes. Note that in this trans-dimensional framework, the momentum variables need, in theory, to be considered with care. Theoretically, we may consider that a momentum refreshment is performed every odd iteration, and that the algorithm proceeds as in Algorithm 2 every even iteration. Also, we need (in theory) to add or withdraw momentum variables when switching models. In practice, we do not have to proceed in this way. Given that momentum variables are only required when updating the parameters, and that the expectations that one wants to approximate typically do not depend on these variables, we may generate them only when it is known that a parameter update is proposed (i.e. k′ = k).
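A minimal sketch of a vanilla HMC update with identity mass matrix (a simplification of the tuned diagonal mass matrix described above; all names are illustrative):

```python
import math
import random

# Vanilla HMC update: L leapfrog steps of size eps, followed by a MH
# correction based on the Hamiltonian. grad_log_pi is the gradient of
# the log conditional posterior under the current model.

def hmc_step(x, log_pi, grad_log_pi, eps, L, rng):
    p = [rng.gauss(0.0, 1.0) for _ in x]      # momentum refreshment
    x_new, p_new = list(x), list(p)
    for _ in range(L):                        # leapfrog integration
        g = grad_log_pi(x_new)
        p_new = [pi + 0.5 * eps * gi for pi, gi in zip(p_new, g)]
        x_new = [xi + eps * pi for xi, pi in zip(x_new, p_new)]
        g = grad_log_pi(x_new)
        p_new = [pi + 0.5 * eps * gi for pi, gi in zip(p_new, g)]
    # Hamiltonian = negative log density + kinetic energy
    h_cur = -log_pi(x) + 0.5 * sum(pi * pi for pi in p)
    h_new = -log_pi(x_new) + 0.5 * sum(pi * pi for pi in p_new)
    if math.log(rng.random()) <= min(0.0, h_cur - h_new):
        return x_new
    return list(x)
```

The momenta are generated inside the step, consistent with the remark above that they only need to exist when a parameter update is attempted.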
Some authors (for instance, Green (2003)) mentioned that using informed RJ samplers like Algorithm 2 may be problematic when it is required to gather information for each model before running the algorithms, because this is infeasible for large (or infinite) model spaces. The information that is required for an iteration of Algorithm 2 is in fact x̂_{k′} and Î_{k′} for all k′ ∈ N(k), and possibly free parameters for an HMC step in Step 2.(a). The required information can thus be gathered on the fly as the chains reach new models; it is unnecessary to gather all of it beforehand. The maximizers x̂_k, matrices Î_k and free parameters should be stored and reused the next time they are needed. We only need to make sure that these are independent of the chain paths to have a valid procedure. For the maximizers, any general-purpose optimization tool can be employed, but a starting point, if needed, should consequently not depend on the chain path to make sure there is no interference; the same recommendation applies to the computation of Î_k and the tuning of the HMC free parameters.
This strategy makes the implementation of informed RJ samplers possible, even if the model space is large or infinite, provided that the posterior model PMF concentrates on a reasonable number of models (in the sense that the number of different models visited during algorithm runs is on average reasonable). When the PMF concentrates on few models, this implementation strategy is expected to be highly effective as the information required for model switches and parameter updates will in practice be gathered for these few models and their neighbours only (if the algorithm is well initialized).
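The on-the-fly gathering of x̂_k and Î_k can be sketched as a simple cache, assuming a deterministic fitting routine (so that the stored quantities do not depend on the chain path); the class and function names are illustrative:

```python
# On-the-fly gathering of model fits (x_hat_k, I_hat_k): each fit is
# computed at most once, by a deterministic routine that does not
# depend on the chain path, then cached and reused.

class FitCache:
    def __init__(self, fit_model):
        self._fit = fit_model   # k -> (x_hat_k, I_hat_k), deterministic
        self._cache = {}

    def __call__(self, k):
        if k not in self._cache:
            self._cache[k] = self._fit(k)
        return self._cache[k]
```

When the posterior model PMF concentrates, only the fits of a handful of models and their neighbours are ever computed, which is what makes this strategy effective.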

Methods to use when the approximations are not accurate
For finite n, the shapes of the posterior parameter densities under Model k and Model k′ involved during a model-switch attempt may be quite different from bell curves. When this is the case, using normal approximations to the distributions, i.e. q_{k→k′} = ϕ( · ; x̂_{k′}, Î_{k′}^{−1}) and q_{k′→k} = ϕ( · ; x̂_k, Î_k^{−1}), may lead to high rejection rates. The method of Karagiannis and Andrieu (2013) allows one to build upon these approximations by generating a sequence of intermediate states, the endpoint being the proposal for the parameters of Model k′. More precisely, the starting point is moved through Markov kernels, the kernel used at time t being, for each t, reversible with respect to an annealing intermediate distribution given by

ρ^{(t)}_{k→k′} ∝ [π(k, · | D_n) ⊗ ϕ( · ; x̂_{k′}, Î_{k′}^{−1})]^{1−γ_t} [ϕ( · ; x̂_k, Î_k^{−1}) ⊗ π(k′, · | D_n)]^{γ_t}.

The free parameter T is a positive integer, γ_0 = 0, γ_T = 1 and γ_t = t/T for each t ∈ {1, . . . , T − 1}.
We notice that when switching from Model k to Model k', we start with distributions ρ_{k→k'}^{(t)} close to π(k, · | D_n) ⊗ ϕ( · ; x̂_{k'}, Î_{k'}^{-1}) and finish, after a transition phase, with distributions close to ϕ( · ; x̂_k, Î_k^{-1}) ⊗ π(k', · | D_n). Under some regularity conditions, choosing T large enough allows a smooth transition and makes the last steps of the sequence, with t ≥ t*, similar to steps of a time-homogeneous Markov chain with invariant distribution ϕ( · ; x̂_k, Î_k^{-1}) ⊗ π(k', · | D_n). A proof can be found in Gagnon and Doucet (2021). The acceptance probability in the resulting RJ is given in (7). As T −→ ∞, the resulting RJ thus behave like those in which the parameter proposals are sampled from the correct conditional distributions π( · | k', D_n), for fixed n.
Given that r_RJ2 is an estimator of π(k' | D_n)/π(k | D_n), Andrieu et al. (2018) proposed to exploit this through a scheme allowing one to generate N such estimates in parallel and average them to reduce the variance. With T fixed, increasing N thus brings the resulting RJ closer to those in which the parameter proposals are sampled from the correct conditional distributions π( · | k', D_n). In Appendix A, it is shown that the methods of Karagiannis and Andrieu (2013) and Andrieu et al. (2018), with their estimators of ratios of posterior probabilities, can be used to enhance the approximations π̂(k' | D_n)/π̂(k | D_n) in g(k, · ).
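In symbols, writing r_RJ2^{(1)}, . . . , r_RJ2^{(N)} for the N parallel estimates (our notation for the individual replicates), the averaging step and its variance-reduction effect are simply

```latex
\bar{r}(k, k') = \frac{1}{N} \sum_{j=1}^{N} r_{\mathrm{RJ2}}^{(j)}, \qquad
\operatorname{Var}\big(\bar{r}(k, k')\big) = \frac{1}{N}\operatorname{Var}\big(r_{\mathrm{RJ2}}^{(1)}\big)
```

when the estimates are generated independently, the expectation being unchanged.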

Asymptotic result and analysis
In Section 2.2, we described situations where using an informed but approximate design for q_{k→k'} is expected to yield RJ that behave like asymptotically exact ones, the latter being asymptotically equivalent to ideal RJ that sample parameters y_{k'} during model switches from the correct conditional distributions π( · | k', D_n) and that use the unnormalized version of the model probabilities π(k' | D_n) to construct g. In Section 3.1, we make this discussion precise and present a weak convergence result. We next provide in Section 3.2 an analysis of the limiting RJ.

Asymptotic result
The weak convergence result presented in this section is about weak convergence of Markov chains, but the convergence does not happen for almost all data sets D_n. It rather occurs with a probability that approaches 1 as n −→ ∞, where the randomness comes from D_n. A first statement of this kind in MCMC recently appeared in Schmon et al. (2021). It allows one to exploit Bernstein-von Mises theorems, which are about the convergence of posterior parameter distributions towards normal distributions in total variation (TV), with a probability that converges to 1 (thus not for almost all data sets D_n). In this section, the randomness for all convergences in probability comes from D_n; the source of randomness is thus omitted to simplify the statements.
The first condition for the weak convergence result to hold is an explicit independence of the state space from n.
Assumption 1. The model space K and the parameter dimensions, i.e. d k for all k ∈ K, do not change with n.
The second condition is that π̂(k | D_n) is an asymptotically exact estimator of π(k | D_n), the latter admitting a limit π̄(k), for all k. In some cases, the posterior model mass concentrates. For instance, when a model, say Model k*, is well specified and regular, π̄(k*) = 1 (see, e.g., Johnson and Rossell (2012) in linear regression).
The last assumption is about the asymptotic equivalence in TV between q_{k→k'}, which is used to generate u_{k→k'} = y_{k'}, and π( · | k', D_n), for all k'. This happens when a Bernstein-von Mises theorem holds for each Model k' and q_{k→k'} = ϕ( · ; x̂_{k'}, Σ̂_{k'}/n), with x̂_{k'} and Σ̂_{k'} being consistent estimators of x*_{k'} and Σ_{k'}, respectively, where Σ_{k'} is the limiting covariance matrix (once scaled by n) of π( · | k', D_n). When, for instance, Model k' is well specified and regular and the data points are IID, Σ̂_{k'} = (Î_{k'}/n)^{-1} and Σ_{k'} is the inverse of the information matrix of one data point.
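The construction of such a normal approximation from a log-posterior is a standard Laplace-type computation; here is a generic numerical sketch using Newton's method from a fixed starting point and finite-difference derivatives. The helper names are ours and this is not the paper's code:

```python
import numpy as np


def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    d = len(x)
    g = np.empty(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g


def num_hess(f, x, eps=1e-4):
    """Central finite-difference Hessian of f at x."""
    d = len(x)
    H = np.empty((d, d))
    E = np.eye(d) * eps
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x + E[i] + E[j]) - f(x + E[i] - E[j])
                       - f(x - E[i] + E[j]) + f(x - E[i] - E[j])) / (4 * eps ** 2)
    return H


def laplace_normal_approx(log_post, x0, n_iter=50):
    """Normal approximation q = N(x_hat, I_hat^{-1}): x_hat is found by
    Newton's method from a fixed (path-independent) starting point x0,
    and I_hat is the observed information, i.e. minus the Hessian of the
    log-posterior at x_hat."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        step = np.linalg.solve(num_hess(log_post, x), num_grad(log_post, x))
        x = x - step
        if np.linalg.norm(step) < 1e-10:
            break
    I_hat = -num_hess(log_post, x)
    return x, I_hat
```

Running it on a toy Gaussian log-posterior recovers the sample mean and an observed information equal to n, in line with the IID well-specified case discussed above.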
Assumption 3. TV(π( · | k', D_n), q_{k→k'}) −→ 0 in probability, for all k' (thus regardless of k).
As mentioned in Section 2.2, the proposed methodology represents in some cases an approximation to the asymptotically exact one, because for some probable transitions from Model k to Model k', TV(ϕ( · ; x̂_{k'}, Σ_{k'}/n), q_{k→k'}) does not converge to 0 if Model k' is misspecified. In fact, the proposed methodology is asymptotically exact only if all models are well specified, or at least, if some are misspecified, then π̄ must assign a probability of 0 to them. Indeed, in the latter case, considering that the chain is currently visiting Model k with π̄(k) > 0, g(k, k') −→ 0 for all misspecified models k' ≠ k, so that the proposal distributions q_{k→k'} for which TV(ϕ( · ; x̂_{k'}, Σ_{k'}/n), q_{k→k'}) does not converge to 0 will asymptotically never be used.
In practice, it is expected that all models are misspecified, but that some well approximate the true model. We now briefly explain why the proposed methodology is a good approximation to the asymptotically exact one in this case, and next present the weak convergence result. Denote by K* ⊂ K the subset formed of the good models, and consider that for all k' ∈ K* and for large enough n, TV(ϕ( · ; x̂_{k'}, Σ_{k'}/n), q_{k→k'}) is small (because Σ̂_{k'} = (Î_{k'}/n)^{-1} and Σ_{k'} are similar), implying that TV(π( · | k', D_n), q_{k→k'}) is small. Consider also, to simplify, that all models are either good or poor approximations to the true model, and that for all the poor models, i.e. those with k' ∉ K*, g(k, k') −→ 0 whenever k ∈ K* (because π̂(k' | D_n)/π̂(k | D_n) −→ 0). Therefore, for large enough n, if the chain is currently visiting Model k ∈ K*, then with probability close to 1, a model in K* is proposed, say Model k', and next, parameters y_{k'} = u_{k→k'} ∼ q_{k→k'} are proposed, with TV(π( · | k', D_n), q_{k→k'}) small. In this situation, the proposed RJ well approximate the asymptotically exact ones. We keep this situation in mind for the rest of Section 3.
We now introduce the notation required to present the weak convergence result. Use {(K, X_K)^n(m) : m ∈ N} to denote a Markov chain simulated by a RJ that targets π( · , · | D_n) and sets g(k, k') ∝ h(π̂(k' | D_n)/π̂(k | D_n)). Because the target distribution concentrates, we need to apply a transformation to the Markov chains to obtain a non-trivial limit. We apply a transformation reflecting that, even for large n, the samplers continue to explore the parameter spaces, but at different scales, given that the parameters are continuous variables. In contrast, if the posterior PMF π( · | D_n) concentrates on, say, one model, then fewer and fewer models are visited during an algorithm run as more and more of them have negligible mass as n −→ ∞. The transformation that is applied aims at reflecting this reality and thus at obtaining a limiting situation that represents an approximation to what one encounters in practice. More precisely, we standardize the parameter variable X_K, but leave the model indicator untouched: (K, Z_K)^n(m) := (K, √n(X_K − x̂_K))^n(m), for all m. To be an admissible limit, the invariant distribution of {(K, Z_K)^n(m) : m ∈ N} would need to be independent of n, but this is readily seen not to be the case. For instance, the invariant conditional density of Z_K given K is π( · /√n + x̂_k | k, D_n)/√n, which is asymptotically equivalent to a normal with mean 0 and covariance Σ_k, but not equal to it for all n. We thus introduce a process for which the transformation is independent of n and prove that {(K, Z_K)^n(m) : m ∈ N} and {(K, Z_K)^ideal(m) : m ∈ N} weakly converge towards it, showing that they all share a similar behaviour for large enough n.
Denote by {(K, X_K)^limit(m) : m ∈ N} this process. It is a RJ which targets a distribution such that X_K | K ∼ ϕ( · ; x̂_K, Σ_K/n) and K ∼ π̄. It sets g(k, k') ∝ h(π̄(k')/π̄(k)), q_{k→k'} = ϕ( · ; x̂_{k'}, Σ_{k'}/n), and D_{k→k'}(x_k, u_{k→k'}) = (u_{k→k'}, x_k) = (y_{k'}, u_{k'→k}), for all k with π̄(k) > 0 and k' ∈ N(k). Because the chains will be assumed to start in stationarity, there is no need to define the functions for k with π̄(k) = 0. Denote the standardized version of {(K, X_K)^limit(m) : m ∈ N} by {(K, Z_K)^limit(m) : m ∈ N}, which is such that (K, Z_K)^limit(m) = (K, √n(X_K − x̂_K))^limit(m), for all m. We denote weak convergence by ⇒. Theorem 1 (Weak convergence). Under Assumptions 1 to 3, we have {(K, Z_K)^n(m) : m ∈ N} ⇒ {(K, Z_K)^limit(m) : m ∈ N} and {(K, Z_K)^ideal(m) : m ∈ N} ⇒ {(K, Z_K)^limit(m) : m ∈ N}, both in probability as n −→ ∞, provided that all chains start in stationarity.
Formally, in Theorem 1, it is considered that, for each sampler, the proposal mechanism for parameter updates and model switches is the same. For instance, in the implementable RJ (those that simulate {(K, Z_K)^n(m) : m ∈ N}), a normal distribution is used to sample the parameter proposals in parameter-update attempts. This is often not the case in practice; for instance, in our numerical example, we use HMC steps. To accommodate such parameter-update mechanisms, an additional, rather technical, assumption is required about the convergence of the associated Markov kernels in some sense as n −→ ∞. We proceeded in this way for simplicity and brevity.

Analysis of the limiting RJ
The stochastic process {(K, Z_K)^limit(m) : m ∈ N} can be used as a proxy for {(K, Z_K)^n(m) : m ∈ N} in the large-sample regime. An analysis of it thus helps understand how {(K, Z_K)^n(m) : m ∈ N}, and thus {(K, X_K)^n(m) : m ∈ N}, behave in this regime. The stationary distribution of {(K, Z_K)^limit(m) : m ∈ N} is such that Z_K | K ∼ ϕ( · ; 0, Σ_K) and K ∼ π̄.
A theoretical result that can be established in the situation where the posterior model PMF concentrates on one model is the dominance of the limiting RJ over any RJ targeting the same distribution and using the same functions, except g(k, · ). This is true because the limiting RJ does not waste iterations trying to propose switches to models other than the best. This shows the relevance of using informed RJ, which automatically incorporate such features of the target. Let us denote by P_limit,1 the Markov kernel of the limiting RJ setting g(k, k') ∝ h(π̄(k')/π̄(k)), and by P_limit,2 that of the other RJ, the only difference being in g. The result is a Peskun-Tierney ordering (Peskun, 1973; Tierney, 1998), meaning an order on the Markov kernels, implying an order on the asymptotic variances. We use var(f, P_limit,i) to denote the asymptotic variance of the function f applied to the Markov chain of transition kernel P_limit,i at equilibrium, i = 1, 2.
Proposition 1. Assume that there exists k* with π̄(k*) = 1. Then, for any z_{k*} ∈ R^{d_{k*}} and measurable set A, P_limit,1((k*, z_{k*}), ({k*} × A) \ {(k*, z_{k*})}) ≥ P_limit,2((k*, z_{k*}), ({k*} × A) \ {(k*, z_{k*})}), implying that for any square-integrable function f, var(f, P_limit,1) ≤ var(f, P_limit,2).
Note that this result holds for globally-balanced proposal distributions as well. What is required is that h(x) = 0 when x = 0 and h(x) > 0 when x > 0, which holds when h is the identity function.
In the limiting RJ, the parameters are sampled from the correct conditional distributions, and therefore the transition probabilities for K are the same as those of a MH sampler targeting π̄ using g(k, k') ∝ h(π̄(k')/π̄(k)), with acceptance probabilities given by α_MH(k, k') = min{1, [π̄(k') g(k', k)]/[π̄(k) g(k, k')]} = min{1, c(k)/c(k')}, the last equality holding when h is locally balanced, i.e. h(x) = x h(1/x). This implies that any analysis of MH samplers with h set to a locally-balanced function (which is what we recommend in typical cases) applies to the limiting RJ. For a thorough analysis of such samplers, we refer the reader to Zanella (2020). Recall that c(k) and c(k') are the normalizing constants of g(k, · ) and g(k', · ), respectively. As mentioned in Section 1.3, Zanella (2020) proved that the ratios c(k)/c(k') are close to 1 in some situations, suggesting that locally-balanced samplers are efficient. Note that in practice, even when π( · | D_n) concentrates, we have π(k | D_n) > 0 for any finite n and any k, implying that before finding the best model with a significantly larger probability, the chain may move around for a while. For these transitions, with acceptance probabilities close to α_MH(k, k') in the large-sample regime, the result of Zanella (2020) stating that α_MH(k, k') −→ 1 as the ambient space becomes larger and larger holds under some conditions; this again suggests an efficiency in exploring the state space and thus in finding the best model.
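To make these discrete-space dynamics concrete, here is a toy sketch of a MH sampler on a model space with informed proposal g(k, ·) ∝ h(π(k')/π(k)) restricted to a neighbourhood N(k). All names are illustrative, and the target below is a small unnormalized PMF of our own choosing, not the paper's posterior:

```python
import numpy as np


def locally_balanced_mh(pi, neighbours, h, n_iter, k0, rng):
    """Metropolis-Hastings on a discrete model space with informed proposal
    g(k, .) proportional to h(pi(k')/pi(k)) over N(k). `pi` is a dict of
    unnormalized target probabilities; `neighbours` maps k to the list N(k)
    (assumed symmetric: k' in N(k) iff k in N(k'))."""
    def g(k):
        ks = neighbours[k]
        w = np.array([h(pi[kp] / pi[k]) for kp in ks])
        return ks, w / w.sum()

    chain, k = [], k0
    for _ in range(n_iter):
        ks, p = g(k)
        kp = ks[rng.choice(len(ks), p=p)]
        ks_rev, p_rev = g(kp)  # reverse-move proposal probabilities g(k', .)
        alpha = min(1.0, (pi[kp] * p_rev[ks_rev.index(k)])
                         / (pi[k] * p[ks.index(kp)]))
        if rng.uniform() < alpha:
            k = kp
        chain.append(k)
    return chain
```

With a locally-balanced h, e.g. h(x) = √x, the acceptance probability reduces to min{1, c(k)/c(k')} since h(x) = x h(1/x), where c(k) is the normalizing constant of g(k, ·).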
When h is the globally-balanced function, i.e. the identity function, and N(k) = K for all k, then the limiting RJ proposes models and parameters independently of the current state of the chain and these proposals are always accepted. It thus corresponds to independent Monte Carlo sampling, which is often seen as the gold standard in MCMC.

Application: Variable selection in wholly-robust linear regression
In this section, we apply the proposed RJ methodology to a variable-selection problem in a wholly-robust linear regression. An overview of wholly-robust linear regression is provided in Section 4.1, and then the computational results are presented and analysed in Section 4.2.

Wholly-robust linear regression
A new technique has emerged to gain robustness against outliers in parametric modelling: replace the traditional distribution assumption (a normal assumption in the problems previously studied) by a super-heavy-tailed distribution assumption (Desgagné, 2015; Desgagné and Gagnon, 2019; Gagnon et al., 2020a). The rationale is that this latter assumption is better adapted to the eventual presence of outliers by giving higher probabilities to extreme values. A proof of the effectiveness of the approach resides in the following: the posterior distribution converges towards that based on the non-outliers only (i.e. excluding the outliers) as the outliers move further and further away from the bulk of the data. This theoretical result corresponds to a property in Bayesian statistics called whole robustness (Desgagné, 2015), and implies a conflict resolution (O'Hagan and Pericchi, 2012). As explained in the papers cited above, the models with super-heavy-tailed distributions have built-in robustness that resolves conflicts due to contradictory sources of information in a sensible way. Full consideration is given to the non-outliers, and observations that are undoubtedly outlying are excluded; in between these two extremes, the impact of possible outliers is balanced and bounded, reflecting that there is uncertainty about whether or not these observations really are outliers. These features of wholly-robust models are appealing, but they come at a price, mainly a computational one. The super-heavy-tailed distribution used in the papers cited above indeed cannot be represented as a scale mixture of normal distributions, unlike the Student distribution, which is commonly assumed in Bayesian robust models. A scale-mixture representation of the Student allows a straightforward implementation of the Gibbs sampler in, for instance, linear regression, for parameter estimation and variable selection (Verdinelli and Wasserman, 1991).
A motivation for assuming a distribution with heavier tails than the Student is that the latter only allows one to reach partial robustness. In linear regression, partial robustness translates into robust regression-coefficient point estimates but inflated posterior variances, which contaminates uncertainty assessments and thus model selection (see Hayashi (2020) for a proof of partial robustness in linear regression).
In Gagnon et al. (2020a), the convergence of the posterior distribution is proved under the most general linear-regression framework, encompassing analysis of variance and covariance (ANOVA and ANCOVA), and variable selection. In this section, we apply the RJ methodology presented in the previous sections to sample from a joint posterior distribution of wholly-robust linear regression models and their parameters. RJ is thus required for this task. The data analysed are the same prostate-cancer data mentioned in Figure 1.
The models are thus linear regressions, but the errors follow a super-heavy-tailed distribution. See Appendix B for all the details about the models and the inputs required for RJ implementation. The super-heavy-tailed distribution used is called log-Pareto-tailed normal (LPTN). This distribution was introduced by Desgagné (2015). Its density exactly matches that of the normal on the interval [−τ, τ], where τ is such that P(−τ ≤ X ≤ τ) = ρ for X ∼ N(0, 1). Outside of this interval, the tails of this continuous density behave as a log-Pareto: (1/|x|)(log |x|)^{−λ−1}, hence its name. The only free parameter of this distribution is ρ: the parameter λ is a function of ρ and τ, the latter being itself a function of ρ. The linear regression with LPTN errors is thus expected to behave similarly to the traditional normal one in the absence of outliers. Not only is this the case in the absence of outliers, but the limiting robust-regression posterior distribution (as the distance between the outliers and the bulk of the data approaches infinity) is also similar to the normal one, the latter being instead based on the non-outliers only.
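For concreteness, here is a sketch of the standard LPTN density as we reconstruct it from the description above and Desgagné (2015). The matching of the tail constant λ to the tail mass 1 − ρ is our reading of that construction and should be treated as an assumption of this sketch:

```python
import math


def lptn_density(x, rho=0.95):
    """Density of a standard LPTN(rho) distribution: matches the standard
    normal on [-tau, tau] and has log-Pareto tails outside. Requires
    rho > 2*Phi(1) - 1 (approx. 0.683) so that tau > 1."""
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    # tau solves P(-tau <= Z <= tau) = rho for Z ~ N(0, 1), by bisection
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 2 * Phi(mid) - 1 < rho:
            lo = mid
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    # lambda chosen so that the two tails carry total mass 1 - rho
    lam = 2 * phi(tau) * tau * math.log(tau) / (1 - rho)
    ax = abs(x)
    if ax <= tau:
        return phi(x)                      # normal body
    return phi(tau) * (tau / ax) * (math.log(tau) / math.log(ax)) ** (lam + 1)
```

The constraint ρ ∈ (2Φ(1) − 1, 1) guarantees τ > 1, so that the log-Pareto tails are well defined.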
This resemblance is useful in our informed RJ framework because, first, it suggests that the proposed methodology is an approximation to the asymptotically exact one if some linear regressions are good approximations to the true model. Indeed, in the normal linear-regression framework, Assumptions 1 and 2 of Section 3.1 hold, and a Bernstein-von Mises theorem holds for each model. It is thus expected to be the case in the robust LPTN framework as well, although this is much more difficult to prove.
Second, the resemblance between the wholly-robust and normal regressions provides us with an approximation to the observed information matrix. We indeed set Î_k in the robust LPTN framework to the observed information matrix of the normal regression, but evaluated at the maximizer under the wholly-robust model. The matrix Î_k is thus basically the maximum a posteriori estimate of the scale parameter in the robust model multiplied by the observed covariate correlation matrix. The approximation is thus expected to be accurate when there are no severe outliers among the covariate observations. One may instead use a robust version of the correlation matrix (see  for such a robust alternative), but this is not investigated here for brevity. Whether or not a robust version of the covariate matrix is used is expected to have no impact on our comparison of the algorithms in the next section because they all use q_{k→k'} = ϕ( · ; x̂_{k'}, Î_{k'}^{-1}) and the matrix Î_{k'} only appears in q_{k→k'}. Indeed, |Î_{k'}|^{-1/2} in π̂(k' | D_n) (4) cancels with another term in the same expression.
Finally, the resemblance between the wholly-robust and normal regressions is useful to validate our RJ computer code. Indeed, the robust approach yields an outlier-detection method and a "clean" data set without outliers can thus be identified. Based on this data set, posterior coefficient estimates and model probabilities of the normal models can be computed using their closed-form expressions. These estimates and probabilities are expected to be similar to those obtained under the robust models.

Results and analysis
The performance of the different algorithms is summarized in Table 1 and Figure 2. The results for the uninformed RJ are based on 1,000 runs of 100,000 iterations, with burn-ins of 10,000. The uninformed RJ is in fact a "non-fully" informed RJ corresponding to Algorithm 2, with the only difference that g(k, · ) = U{N(k)} instead of locally- or globally-balanced proposal distributions such that g(k, k') ∝ h(π̂(k' | D_n)/π̂(k | D_n)) for all k' ∈ N(k). It may thus be seen as corresponding to the approach of Green (2003). For a fair comparison with this RJ, the number of iterations for which the fully informed RJ, i.e. Algorithm 2, is run is reduced to account for the additional complexity of using informed distributions g(k, · ). The number of iterations has to be reduced to 85,000 to reach the same runtime as the uninformed RJ; one run takes about one minute in R on an i9 computer. To measure performance, we display the model-switching acceptance rates, model-visit rates and relative increases in TV with respect to the best sampler, which is Algorithm 2 with h(x) = √x. The TVs are between model distributions approximated from RJ outputs and the posterior model PMF.
The model-switching acceptance rates are the acceptance rates computed considering only the iterations in which model switches are proposed. The visit rates are the average number of model switches in one run, reported per iteration. For both of these measures, we count the number of accepted model switches, and this number is divided by either the number of proposed model switches or the total number of iterations. These rates are thus similar, but they convey different information: model-switching acceptance rates reflect the quality of the proposal distributions used during model switches, while visit rates measure the propensity to propose a switch to another model (and accept it). Together, they allow one to understand why, for instance, Algorithm 2 with h(x) = √x is (slightly) better than Algorithm 2 with h(x) = x/(1 + x) in terms of TV. Indeed, even if it has a (slightly) worse model-switch proposal scheme (model-switching acceptance rates: 66% vs 67%), it proposes switching models more often (visit rates: 55% vs 53%), which has an impact on the TV.
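The bookkeeping behind these two rates can be sketched as follows (illustrative only; the input format is our own, not the paper's):

```python
def switch_rates(proposed_model, current_model, accepted):
    """Model-switching acceptance rate and visit rate from one RJ run.
    Inputs are per-iteration lists: the proposed model, the model at the
    start of the iteration, and whether the proposal was accepted."""
    switches_proposed = sum(1 for p, c in zip(proposed_model, current_model)
                            if p != c)
    switches_accepted = sum(1 for p, c, a in zip(proposed_model, current_model,
                                                 accepted) if p != c and a)
    # acceptance rate: accepted switches / proposed switches
    acc_rate = switches_accepted / switches_proposed if switches_proposed else 0.0
    # visit rate: accepted switches / total iterations
    visit_rate = switches_accepted / len(proposed_model)
    return acc_rate, visit_rate
```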
This robust-variable-selection application shows that even in a moderate-size example, in this case with 8 covariates and 256 models, the locally-balanced proposal distributions lead to RJ that can outperform RJ proposing models uniformly at random, at a fixed computational budget. In contrast, the globally-balanced proposal distribution does not yield an improvement that is significant enough to compensate for the decrease in the number of iterations. We observed that even when the RJ with the function h(x) = x is run for the same number of iterations as the uninformed one, the former only slightly outperforms the latter.
For the informed RJ to be effective, the approximations on which they rely have to be accurate. Recall that their accuracy depends on the sample size (at least in some scenarios regarding the approximations of π( · | k', D_n) by q_{k→k'}, as explained in Section 3.1). In this example, n = 97. To understand if this is actually large for such a robust-linear-regression problem with 8 covariates, we can compare the model-switching acceptance rates of Algorithm 2 with those of the ideal RJ they approximate. Recall that ideal RJ set g(k, k') ∝ h(π(k' | D_n)/π(k | D_n)) and q_{k→k'} = π( · | k', D_n). It is possible to evaluate their model-switching acceptance rates because these correspond to the acceptance rates of MH samplers targeting the PMF π( · | D_n) using g(k, k') ∝ h(π(k' | D_n)/π(k | D_n)), computed considering only iterations in which k' ≠ k is proposed. When h(x) = x/(1 + x), the acceptance rate of the MH sampler is 88%, which is sufficiently close to 1 to suggest that the asymptotic regime presented in Zanella (2020), in which the size of the discrete space increases and the acceptance probabilities converge to 1, is nearly reached. To understand which approximations explain the gap between Algorithm 2 with h(x) = x/(1 + x), with a model-switching acceptance rate of 67%, and its ideal counterpart, with a rate of 88%, we can look at the model-switching acceptance rate of a RJ using methods to sample parameter proposals from a distribution closer to π( · | k', D_n), and next at that of a RJ additionally using a method making g(k, k') closer to being proportional to h(π(k' | D_n)/π(k | D_n)) (recall the overview of the methods presented in Section 2.4). It is 84% for the first RJ (with T = N = 10), and it increases to 87% for the second RJ. Using such methods simply makes RJ too computationally expensive to be competitive in this example.
Recall that their precision parameters can be increased to make the second RJ arbitrarily close to the ideal one, which has a model-switching acceptance rate of 88%.
These results suggest that n is of moderate size, in the sense that the approximations used in Algorithm 2 are good enough to yield an improvement, but it is expected that the improvement would be even more marked if we had access to more data points. Also, the results suggest that the approximations explaining most of the gap between Algorithm 2 with h(x) = x/(1 + x), with a model-switching acceptance rate of 67%, and its ideal counterpart, with a rate of 88%, are the approximations of π( · | k', D_n) by q_{k→k'}. We investigated whether there are significant differences between the covariance matrices used in q_{k→k'} and those of π( · | k', D_n), and this is not the case. This means that the approximations of π( · | k', D_n) by q_{k→k'} are not as accurate as we would like because the densities π( · | k', D_n) are significantly different from bell curves. Our analysis also allows us to rule out an issue with using the same observed covariance matrix as the normal regression in the robust LPTN framework. In fact, a robust-linear-regression analysis reveals the presence of outliers, but only for models with negligible posterior mass (both in the robust and non-robust frameworks). We finally note that, for models with non-negligible posterior mass, there is no gross violation of the assumptions underlying linear regression.

Discussion
In this paper, we proposed simple fully-informed RJ and described the situations in which they are expected to be efficient: when the sample size is sufficiently large, Assumptions 1-2 of Section 3.1 are verified, some models among those considered well approximate the true data-generating process, and Bernstein-von Mises theorems hold for these models. Informed proposal distributions are crucial when the model probabilities and parameter densities vary significantly within neighbourhoods of states of the Markov chains. They are expected to vary significantly in the large-sample regime in which the target concentrates. But we noticed with our numerical example that they may vary greatly even when this large-sample regime is not reached. The proposed RJ show major improvements in this example, as the chains spend fewer iterations at the same state, compared with the RJ that naively proposes models uniformly at random and thus often tries to reach low-probability models, leading to higher rejection rates. Yet, the proposed samplers are reversible, which allows them to return to recently visited models often. The next step in this line of research on trans-dimensional samplers for non-nested model selection is to propose sampling schemes which do not suffer from this diffusive behaviour, but instead induce persistent movement in the model-indicator process. A first step in this direction has recently been made by Gagnon and Maire (2020), but their approach cannot be used in all contexts of non-nested model selection. It can nevertheless be applied in any variable-selection problem. It is however expected to be efficient when the posterior model mass is diffused over a large number of models with different numbers of covariates, which is not the case in our numerical example.

A Improving the approximations
In practice, the sample size may not be large enough for the approximations in the proposal mechanisms to be accurate. In Sections A.1 and A.2, we present the details of the methods reviewed in Section 2.4 that allow one to improve upon the normal-distribution approximations to π( · | k, D_n). These methods turn out to be useful for improving upon the approximations forming the model proposal distributions g(k, · ) as well; this is explained in detail in Section A.3.
A.1 RJ with the method of Karagiannis and Andrieu (2013)
In the definition of the annealing intermediate distributions ρ_{k→k'}^{(t)} in (6), it is considered that D_{k→k'}(x_k, u_{k→k'}) = (u_{k→k'}, x_k) = (y_{k'}, u_{k'→k}), implying that |J_{D_{k→k'}}(x_k, u_{k→k'})| = 1, and that q_{k→k'} = ϕ( · ; x̂_{k'}, Î_{k'}^{-1}). For the general definition, see Karagiannis and Andrieu (2013).
We now present in Algorithm 3 an informed RJ incorporating the method of Karagiannis and Andrieu (2013). Recall that the acceptance probability in this RJ is defined in (7). Note that Algorithm 3 corresponds to Algorithm 2 when T = 1; no path is sampled in Step 2.(b).
Algorithm 3 RJ incorporating the method of Karagiannis and Andrieu (2013)
1. Sample k' ∼ g(k, · ) with g(k, k') ∝ h(π̂(k' | D_n)/π̂(k | D_n)) for all k' ∈ N(k).

2.(a)
If k' = k, attempt a parameter update using a MCMC kernel of invariant distribution π( · | k, D_n) while keeping the value of the model indicator k fixed.
2.(b) If k' ≠ k, sample a path y_{k'}^{(0)}, . . . , y_{k'}^{(T−1)} and, with the acceptance probability defined in (7), set the next state of the chain to (k', y_{k'}^{(T−1)}). Otherwise, set it to (k, x_k).

3. Go to Step 1.
Using the same proof technique as in Gagnon and Doucet (2021), it can be proved that, under regularity conditions, a Markov chain simulated by Algorithm 3 converges weakly, as T −→ ∞ and for fixed n, to that of a RJ which is able to sample from π( · | k, D_n) for all k, with acceptance probabilities α_MH, but using g(k, k') ∝ h(π̂(k' | D_n)/π̂(k | D_n)). Note that the weak convergence in this case is not in probability because the target is considered non-random (contrarily to the framework under which Theorem 1 is stated).
As T increases, it is expected that the acceptance probabilities increase steadily towards α_MH until convergence is reached. A RJ user thus has to find a balance between a RJ that is close to being ideal and the associated computational cost. As explained in Karagiannis and Andrieu (2013), the computational cost scales linearly with T. In some situations, the improvement as a function of T is very marked for T ≤ T_0, effectively offsetting the computational cost. This is not the case in our numerical example. Karagiannis and Andrieu (2013) prove that, under two conditions, Algorithm 3 is valid, in the sense that the target distribution is an invariant distribution. These conditions are the following.
Symmetry condition: for t = 1, . . . , T − 1, the pairs of transition kernels K_{k→k'}^{(t)}( · , · ) and K_{k'→k}^{(T−t)}( · , · ) satisfy (10).
Reversibility condition: for t = 1, . . . , T − 1, and for any (x_k, y_{k'}) and (x'_k, y'_{k'}), (11) holds.
As mentioned in Karagiannis and Andrieu (2013), (10) is verified if, for all t, K_{k→k'}^{(t)}( · , · ) and K_{k'→k}^{(T−t)}( · , · ) are MH kernels sharing the same proposal distributions. (11) is satisfied if K_{k→k'}^{(t)}( · , · ) is a MH kernel targeting ρ_{k→k'}^{(t)}. We recommend using MALA (Metropolis-adjusted Langevin algorithm, Roberts and Tweedie (1996)) proposal distributions whenever possible; see Karagiannis and Andrieu (2013) for other examples. If MALA proposal distributions are used, a step size is required for each pair (k, k') defining a transition from Model k to Model k'. For any of these model switches, we recommend setting the step size to ℓ/(d_k + d_{k'})^{1/6}, according to the findings in Roberts and Rosenthal (1998), and tuning ℓ. To achieve the latter, do a line search with a fixed value of T and choose the ℓ that maximizes the model-switching acceptance rate, so that the resulting RJ is closest to the ideal one.
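A single MALA transition with the recommended step size can be sketched as follows (Python; ℓ is assumed to have already been tuned by the line search, the target stands in for an intermediate distribution, and all names are ours):

```python
import numpy as np


def mala_step(x, log_target, grad_log_target, ell, d_k, d_kp, rng):
    """One MALA transition with step size sigma = ell / (d_k + d_kp)^(1/6),
    as recommended above; `log_target` plays the role of the log-density of
    an annealing intermediate distribution."""
    sigma = ell / (d_k + d_kp) ** (1 / 6)

    def q_mean(z):
        # mean of the MALA (Langevin) proposal started at z
        return z + 0.5 * sigma ** 2 * grad_log_target(z)

    y = q_mean(x) + sigma * rng.standard_normal(x.shape)
    # log proposal density log q(a | b), up to a constant
    log_q = lambda a, b: -np.sum((a - q_mean(b)) ** 2) / (2 * sigma ** 2)
    log_alpha = log_target(y) - log_target(x) + log_q(x, y) - log_q(y, x)
    return y if np.log(rng.uniform()) < log_alpha else x
```

Applied within Step 2.(b) of Algorithm 3, one such kernel would be used for each intermediate distribution ρ^{(t)}_{k→k'}.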
A.2 RJ with additionally the method of Andrieu et al. (2018)
Denote the N estimates of π(k' | D_n)/π(k | D_n) produced by the scheme of Karagiannis and Andrieu (2013) by r_RJ2^{(1)}, . . . , r_RJ2^{(N)}. Denote their average (with simplified notation) by r̄(k, k'). We now present in Algorithm 4 the RJ additionally incorporating the method of Andrieu et al. (2018).
No assumptions additional to those presented in Section A.1 are required to guarantee that Algorithm 4 is valid. Andrieu et al. (2018) prove that increasing N monotonically decreases the asymptotic variances of the Monte Carlo estimates produced by RJ incorporating their approach.
It is expected that increasing N (like increasing T in the last section) leads to a steady increase in the acceptance probabilities until convergence is reached. As with Algorithm 3, there exists a balance to strike between a RJ that is close to being ideal and the associated computational cost.
Algorithm 4 RJ additionally incorporating the method of Andrieu et al. (2018)
1. Sample k' ∼ g(k, · ) with g(k, k') ∝ h(π̂(k' | D_n)/π̂(k | D_n)) for all k' ∈ N(k).
2.(a) If k' = k, attempt a parameter update using a MCMC kernel of invariant distribution π( · | k, D_n) while keeping the value of the model indicator k fixed.
set the next state of the chain to (k', y_{k'}^{(T−1, j*)}). Otherwise, set it to (k, x_k).
set the next state of the chain to (k , y (T −1,1) k ). Otherwise, set it to (k, x k ).

Go to
Step 1.
between a RJ close to be ideal and the computational cost associated to this. An advantage of the approach presented in this section is that several operations can be executed in parallel, so that the computational burden is alleviated. The computational cost (over that of using the approach of Karagiannis and Andrieu (2013)) nevertheless depends on N because computational overheads have to be taken into account. In our numerical example, the improvement as a function of N is not significant enough to offset the computational cost.
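A minimal sketch of the averaging step, with the hypothetical callable `estimate_ratio_once` standing for one run of the annealing scheme of Karagiannis and Andrieu (2013); the N runs are independent, so they could be executed in parallel.

```python
def r_bar(estimate_ratio_once, N):
    # Average of N independent estimates of pi(k' | D_n) / pi(k | D_n);
    # this is the quantity denoted r-bar(k, k') in the text.
    estimates = [estimate_ratio_once() for _ in range(N)]
    return sum(estimates) / N
```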

A.3 Improving the model proposal distribution
We have seen in Section A.2 that r̄(k, k') and the ratios r_RJ2 forming it are estimators of π(k' | D_n)/π(k | D_n). They can thus be used to enhance the approximations to the posterior probability ratios in g(k, · ). If we want to enhance the PMF g(k, · ), we need to improve π̂(l | D_n)/π̂(k | D_n) for all l ∈ N(k), as these are all involved in the construction of the PMF. Also, once k' has been sampled, we need to do the same for g(k', · ), given that this PMF comes into play in the computation of the acceptance probabilities (see, e.g., Algorithm 4). We thus need parameter proposals y_l^(T−1, 1), . . . , y_l^(T−1, N) for all Models l ∈ N(k), and also for all models belonging to N(k'). The ratios r_RJ2 are next computed.
There are several ways to combine these ratios with π̂(l | D_n)/π̂(k | D_n) and π̂(s | D_n)/π̂(k | D_n) to improve the estimation of π(l | D_n)/π(k | D_n) and π(s | D_n)/π(k | D_n). We define the improved version of the PMF g as follows to reflect this flexibility: for all j ∈ {1, . . . , N} and l ∈ N(k). Here we added the subscript k → l and the superscript (t, j) to highlight that the sequence generated using Step 2.(b) of Algorithm 3 depends on j (different sequences are generated for different j), and also on l through ρ^(t)_{k→l} in (6). Note that π̃(l | D_n)/π̃(k | D_n) is an estimator of π(l | D_n)/π(k | D_n) which is in fact a function of x_k^(0) and y_l^(0, j), . . . , y_l^(T−1, j) for all j ∈ {1, . . . , N}, additionally to k and l; we use this notation to simplify and to make the connection with π̂(l | D_n)/π̂(k | D_n).
Algorithm 5 includes the idea of improving π̂(l | D_n)/π̂(k | D_n) using the ratios r_RJ2 in a valid way (as indicated by Proposition 2 below). This algorithm is quite complicated, but once code for Algorithm 4 has been written, the connections between Algorithm 5 and Algorithm 4 can be used to facilitate the coding of the former. Another drawback of Algorithm 5 is that it requires performing the computations for g_imp.(k', · ) even when k' = k, because the acceptance probability involves g_imp.(k', k, x') (see Step 2.(i)), even when k' = k. At least, most of the computations for the two main steps (Steps 2.(i) and 2.(ii)) can be performed in parallel. Again, computational overheads have to be taken into account. In our numerical example, the improvement is not significant enough to offset the computational cost.

(c) Sample k' ∼ g_imp.(k, · ) and j* from a PMF such that P(J* = j) ∝ r_RJ2((k, x_k^(0)), (k', y_{k'}^(T−1, j))), and compute r̄(k, k').
If the proposal is accepted, set the next state of the chain to (k', y_{k'}^(T−1, j*)). Otherwise, set it to (k, x_k).

If the proposal is accepted, set the next state of the chain to (k', y_{k'}^(T−1, 1)). Otherwise, set it to (k, x_k).
It is natural to set π̃(l | D_n)/π̃(k | D_n) to 1 when l = k. This implies that we in fact do not need to sample proposals for Model k in the first parts of Steps 2.(i) and 2.(ii). Also, in the second parts of Steps 2.(i) and 2.(ii), it is not required to sample proposals for s = k' for the same reason. When k' = k, we actually perform a "vanilla" parameter-update step (an HMC step in our numerical example); in this case, the MH ratio has to be multiplied by the appropriate factor, whether Step 2.(i) or Step 2.(ii) is used. The function in (12) specifies the way the information is combined. It may for instance be set to the simple average. One may alternatively take the average of π̂(l | D_n)/π̂(k | D_n) and r̄(k, l); see (13). These choices reflect putting more or less weight on π̂(l | D_n)/π̂(k | D_n). We know that if T and N are large enough, then r̄(k, l) is close to π(l | D_n)/π(k | D_n), which may not be the case for π̂(l | D_n)/π̂(k | D_n) when n is not sufficiently large. The latter ratio may thus act as outlying/conflicting information against which the averages above are not robust. A robust approach consists in setting the function to the median of π̂(l | D_n)/π̂(k | D_n), r_RJ2((k, x_k^(0)), (l, y_l^(T−1, 1))), . . . , r_RJ2((k, x_k^(0)), (l, y_l^(T−1, N))). We recommend this approach and use it in our numerical example.
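The robust median combination can be sketched as follows (variable names are illustrative):

```python
import statistics

def combined_ratio(approx_ratio, r_rj2_estimates):
    # Median of the asymptotic approximation pi_hat(l|D_n)/pi_hat(k|D_n)
    # and the N estimates r_RJ2; unlike an average, the median is not
    # dragged away by an outlying approximation when n is small.
    return statistics.median([approx_ratio] + list(r_rj2_estimates))
```

With an outlying approximation of 100 and estimates close to 1, `combined_ratio(100.0, [0.9, 1.0, 1.2])` stays near 1, whereas the simple average would not.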
Furthermore, as T, N → ∞, π̃(l | D_n)/π̃(k | D_n) is a consistent estimator of π(l | D_n)/π(k | D_n) for fixed n when the median or (13) is used (recall the properties of r_RJ2 and r̄ mentioned in Sections A.1 and A.2). Therefore, if the function h is such that h(x) = x h(1/x) for x > 0, then the acceptance probabilities in Algorithm 5 converge towards 1 ∧ c(k)/c(k'), where c(k) and c(k') are the normalizing constants. In fact, the same technique as in the proof of Theorem 1 in Gagnon and Doucet (2021) allows one to prove that a Markov chain simulated by Algorithm 5 converges weakly, for fixed n, to that of an ideal RJ which has access to the unnormalized version of the posterior probabilities π(k | D_n) and is able to sample from the conditional distributions π( · | k, D_n) (and for which the acceptance probabilities are 1 ∧ c(k)/c(k')), with its good mixing properties as discussed in Section 1.3.

B Details of the variable-selection example
We review linear regression and introduce notation in Section B.1. We present in Section B.2 all the details needed to compute estimates for the normal linear-regression model. Some are useful for the computation of algorithm inputs, such as the observed information matrix, and for verifying the validity of our RJ code (as mentioned in Section 4.1). In Section B.3, we turn to the details of the robust linear regressions and the computation of algorithm inputs.

B.1 Linear regression
We first introduce notation. We define γ_1, . . . , γ_n ∈ R to be n data points from the dependent variable, and γ_n := (γ_1, . . . , γ_n)^T. We denote the full design matrix containing n observations from all covariates by C ∈ R^{n×p}, where p is a positive integer. For simplicity, we refer to the first column of C as the first covariate even if, as usual, it is a column of 1's. The design matrix associated with Model k, whose columns form a subset of those of C, is denoted by C_k, with rows denoted by c_{i,k}^T. We use d_k to denote the number of covariates in Model k; we therefore slightly abuse notation given that the number of parameters of Model k is d_k + 1 (one regression coefficient per covariate plus the scale parameter of the error term).
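To fix ideas, a model k can be represented by the set of column indices of C it includes; the sketch below (illustrative names) extracts C_k and d_k.

```python
import numpy as np

# Full design matrix: the first covariate is the column of 1's (intercept).
n, p = 5, 3
rng = np.random.default_rng(1)
C = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])

def model_design(C, k):
    # C_k keeps the columns of C indexed by the model k; d_k is the number
    # of covariates (the model has d_k + 1 parameters, including the scale).
    C_k = C[:, sorted(k)]
    return C_k, C_k.shape[1]
```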
The linear regression is γ_i = c_{i,K}^T β_K + ε_{i,K}, i = 1, . . . , n, where K is the model indicator, β_K is the random vector containing the regression coefficients of Model K and ε_{1,K}, . . . , ε_{n,K} ∈ R are the random errors of Model K. We assume that ε_{1,K}, . . . , ε_{n,K} and β_K are n + 1 conditionally independent random variables given (K, σ_K), with σ_K > 0 being the scale parameter of the errors of Model K. The conditional density of ε_{i,K} is given by

B.2 Normal linear regression
We present in this section a result giving the precise form of the joint posterior density for the normal linear regression.
Note that σ_K^2 | K, γ_n has an inverse-gamma distribution with shape and scale parameters given by (n − d_k)/2 and ‖γ_n − γ̂_k‖_2^2 / 2, respectively. We work on the log scale for the scale parameters so that all the parameters take values on the real line and, presumably, have densities that are closer to the bell curve. We thus define η_k := log σ_k. The associated conditional distribution follows by the change of variables. The other terms in the joint posterior do not change, except that we now consider that β_K | K, η_K, γ_n ∼ ϕ( · ; β̂_k, exp(2η_K)(C_K^T C_K)^{−1}). The maximizers of the conditional posterior densities given that K = k are denoted by β̂_k and η̂_k, and the observed information matrix evaluated at the maximizers is denoted by Î_k. Relying on improper priors such as π(β_k, σ_k | k) ∝ 1/σ_k may lead to inconsistencies in model selection (see, e.g., Casella et al. (2009)). When this problem happens, the phenomenon is referred to as the Jeffreys-Lindley paradox (Lindley (1957) and Jeffreys (1967)) in the literature. It can be shown that the Jeffreys-Lindley paradox does not arise under the framework of normal linear regression described above when the prior distribution on K is carefully chosen, and more precisely, set to π(k) ∝ |C_k^T C_k|^{1/2} / n^{d_k/2}; an analogous proof exists in a framework of principal component regression. When the covariates are orthonormal, |C_k^T C_k|^{1/2} = |n I_{d_k}|^{1/2} = n^{d_k/2} (if all covariate observations are standardized using a standard deviation in which the divisor is n). In this case, π(k) ∝ |C_k^T C_k|^{1/2} / n^{d_k/2} = 1. The proposed prior on K can thus be seen as a relative adjustment of the volume spanned by the columns of C_K^T C_K.
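The conditional posterior maximizers and the inverse-gamma parameters stated above can be computed as follows. This is a sketch under the improper prior π(β_k, σ_k | k) ∝ 1/σ_k; the expression for η̂_k is obtained by maximizing the density of η_k implied by the inverse-gamma distribution of σ_k^2, and is an assumed form to be checked against (15).

```python
import numpy as np

def normal_lr_quantities(C_k, gamma_n):
    n, d_k = C_k.shape
    # Conditional posterior mode of beta_k (ordinary least squares).
    beta_hat = np.linalg.solve(C_k.T @ C_k, C_k.T @ gamma_n)
    rss = float(np.sum((gamma_n - C_k @ beta_hat) ** 2))
    # sigma_k^2 | k, gamma_n ~ Inverse-Gamma((n - d_k)/2, rss/2), as in the text.
    shape, scale = (n - d_k) / 2, rss / 2
    # Mode of eta_k = log sigma_k under the implied density (assumed form):
    # setting the derivative of the log density to zero gives
    # exp(2 * eta_hat) = rss / (n - d_k).
    eta_hat = 0.5 * np.log(rss / (n - d_k))
    return beta_hat, eta_hat, shape, scale
```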
where x ∈ R and Φ( · ) is the cumulative distribution function (CDF) of a standard normal. The terms τ > 1 and λ > 0 are functions of ρ; in particular, τ = {τ : P(−τ ≤ Z ≤ τ) = ρ for Z ∼ ϕ( · ; 0, 1)}. Setting ρ to 0.95 has proved suitable for practical purposes (see Gagnon et al. (2020a)); accordingly, this is the value used in our numerical example. Using the same prior as for the normal regression yields the joint posterior density, and the change of variable η_k = log σ_k yields the corresponding density in η_k. Recall that for the robust models we set Î_k^{−1} as in (15) but use the maximizer η̂_k of the function above. We need the log conditionals and their gradients for the optimizers and HMC, where the proportional sign has to be understood in the original scale, and sgn( · ) denotes the sign function.
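For instance, τ can be computed from ρ via the standard normal quantile function, since P(−τ ≤ Z ≤ τ) = ρ gives τ = Φ^{−1}((1 + ρ)/2); λ, whose defining equation is omitted here, is likewise a function of ρ.

```python
from statistics import NormalDist

def tau_from_rho(rho):
    # Solve P(-tau <= Z <= tau) = rho for Z ~ N(0, 1).
    return NormalDist().inv_cdf((1 + rho) / 2)
```

With ρ = 0.95, this returns τ ≈ 1.96.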
To use the annealing distributions in Algorithms 3, 4 and 5, we work with the log densities; we simply multiply log π( · | k, γ_n) by 1 − t/T and log π( · | k', γ_n) by t/T to obtain log ρ^(t)_{k→k'}. To use MALA proposal distributions, we however need to compute the gradient of log ρ^(t)_{k→k'}. We now compute the gradient with respect to x_k^(t); the gradient with respect to y_{k'}^(t) is computed similarly. The proportional sign "∝" is with respect to everything that does not involve the parameters, and we omit the superscript "(t)" on the variables to simplify.
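The annealed log density and its x-gradient can be sketched with generic callables for the two conditional log posteriors (illustrative names):

```python
def log_rho_t(log_pi_k, log_pi_kp, x, y, t, T):
    # Annealed log density, up to an additive constant:
    # (1 - t/T) * log pi(x | k, gamma_n) + (t/T) * log pi(y | k', gamma_n).
    return (1 - t / T) * log_pi_k(x) + (t / T) * log_pi_kp(y)

def grad_x_log_rho_t(grad_log_pi_k, x, t, T):
    # Gradient of log rho^(t) with respect to x: the y-block does not depend
    # on x, so only (1 - t/T) * grad log pi(x | k, gamma_n) remains.
    return (1 - t / T) * grad_log_pi_k(x)
```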

C Proofs
We present the proofs of Theorem 1, Proposition 1, Proposition 2 and Proposition 3 in that order.
Proof of Theorem 1. We only prove here that {(K, Z_K)_n(m) : m ∈ N} ⇒ {(K, Z_K)_limit(m) : m ∈ N} in probability, as the proof of {(K, Z_K)_ideal(m) : m ∈ N} ⇒ {(K, Z_K)_limit(m) : m ∈ N} is similar. To prove this result, we use Theorem 2 of Schmon et al. (2021). We thus have to verify the following three conditions.
2. Use P_n and P_limit to denote the transition kernels of {(K, Z_K)_n(m) : m ∈ N} and {(K, Z_K)_limit(m) : m ∈ N}, respectively. These are such that the required convergence holds in probability as n → ∞ for all φ ∈ BL, where π_{K,Z_K}( · , · | D_n) denotes the stationary distribution/density of {(K, Z_K)_n(m) : m ∈ N} and BL denotes the set of bounded Lipschitz functions.
3. The transition kernel P limit is such that P limit φ is continuous for any φ ∈ C b (the set of continuous bounded functions).
We start with Condition 1. Given that K is finite (Assumption 1), it suffices to verify that π(k | D_n) P(Z_{k,n} ∈ A) − π̄(k) P(Z_{k,limit} ∈ A) → 0 in probability, for any k and measurable set A, where P(Z_{k,n} ∈ A) and P(Z_{k,limit} ∈ A) are computed using the conditional distributions given that K = k. Using the triangle inequality, we have that |π(k | D_n) P(Z_{k,n} ∈ A) − π̄(k) P(Z_{k,limit} ∈ A)| ≤ |π(k | D_n) P(Z_{k,n} ∈ A) − π̄(k) P(Z_{k,n} ∈ A)| + |π̄(k) P(Z_{k,n} ∈ A) − π̄(k) P(Z_{k,limit} ∈ A)|.
We now show that both absolute values converge towards 0 in probability, which will allow us to conclude by Slutsky's theorem and monotonicity of probabilities. We first have that |π(k | D_n) P(Z_{k,n} ∈ A) − π̄(k) P(Z_{k,n} ∈ A)| → 0 in probability by Assumption 2 and the fact that P(Z_{k,n} ∈ A) ≤ 1. Using now that π̄(k) ≤ 1, we have that |π̄(k) P(Z_{k,n} ∈ A) − π̄(k) P(Z_{k,limit} ∈ A)| → 0 in probability, by Jensen's inequality and Assumption 3, where A_n is the set A after applying the inverse transformation to retrieve the original random variables. Note that in the last inequality, we also used that A_n ⊂ R^{d_k}.
P_limit φ is thus continuous for any φ ∈ C_b, so Condition 3 above is satisfied. We also have that P_n((k, z_k), (k', dy_{k'})) = g(k, k') ϕ(dy_{k'}; 0, Σ̂_{k'}) α((k, z_k), (k', y_{k'})), where in this case α denotes the corresponding acceptance probability. Therefore, using the triangle inequality, we obtain (18). We now show that the first term on the RHS in (18) converges towards 0 in probability; the second term converges towards 0 in probability following similar arguments. We will thus be able to conclude by Slutsky's theorem and monotonicity of probabilities. Firstly, because we can assume that |φ| ≤ M, with M a positive constant, and using Jensen's inequality, and because the sums are over a finite number of terms, it suffices to prove that the integral converges to 0 for any k, k'. Note that we have not properly defined g_limit(k, k') when π̄(k) = 0. For the proof, we can consider that these g_limit(k, k') with π̄(k) = 0 have any definition, because we can use the triangle inequality, that 0 ≤ g, α, g_limit, α_limit ≤ 1, and that π(k | D_n) → π̄(k) = 0. We can thus restrict our attention to the case where π̄(k) > 0. Note that for similar reasons, we can restrict our attention to the case where g_limit(k, k') > 0 (implying that π̄(k') > 0). Indeed, consider that π̄(k) > 0; then all terms involved in g(k, k'), i.e. h(π(l | D_n)/π(k | D_n)) with l ∈ N(k), have a well-defined limit in probability given by h(π̄(l)/π̄(k)), by Slutsky's and continuous mapping theorems and Assumption 2. Given that g(k, k') is a continuous function of a finite number of random variables all converging towards a constant in probability, g(k, k') → g_limit(k, k') by Slutsky's theorem, and g_limit(k, k') = 0 when π̄(k') = 0.
We prove the convergence to 0 of the first term. The other convergence follows from a similar argument.
The first term converges to 0 by Assumption 3 after a change of variables, and the second one converges to 0 as well by Proposition 2.1 of Devroye et al. (2018) and Assumption 3 as seen above. This concludes the proof.
Proof of Proposition 1. Because π̄(k*) = 1, we only have to establish the inequality for any set A = {(k, z_k) ∈ {k*} × B}. Reaching A \ {(k*, z_{k*})} implies that a parameter update is proposed and accepted, with a parameter proposal, denoted here by y_{k*}, in B. When the chain is in stationarity, P_limit,1 always proposes to update the parameters. Therefore, P_limit,1((k*, z_{k*}), A \ {(k*, z_{k*})}) = P(y_{k*} ∈ B is accepted).
Proof of Proposition 2. We prove the result for the case T = 1 (without intermediate annealing distributions) to simplify; the general case is proved similarly. When T = 1, r_RJ2 = r_RJ, which is the ratio in (1). We prove that the probability of reaching the state {k'} × {y_{k'}^(j*) ∈ A_{k'}} from {k} × {x_k ∈ A_k} is equal to the probability of the reverse move. We denote the Markov kernel by P. Note that we abuse notation by denoting the integration variables y_{k'}^(j*) and x_k, because groups of vectors like y_•^(0,•) are used in the transitions and they are not of the same dimension as y_{k'} and x_k. To simplify, we will use notation like y_s^(j) := y_s^(0,j) to denote the j-th proposal, j ∈ {1, . . . , N}. We now introduce notation to improve readability. We define three joint densities that are used to enhance the approximations when Step 2.(i) is applied to sample the proposal, the last factor involved being q̃_{l→k}(z_k^(j)).
The densities q̃_{k→N(k)\{k'}} and q̃_{k→k'} together represent the joint density of the random variables sampled in the first part of Step 2.(i); q̃_{k'→N(k')\{k}} represents the joint density of the random variables sampled in the second part of Step 2.(i). We now define three joint densities that are used to enhance the approximations when Step 2.(ii) is applied to sample the proposal. The densities q̃_{k→N(k)\{k'}} and q̃_{k→k'} together represent the joint density of the random variables sampled in the first part of Step 2.(ii); q̃_{k'→N(k')\{k}} represents the joint density of the random variables sampled in the second part of Step 2.(ii). We have that

P((k, x_k), (k', dy_{k'}^(j*))) = (1/2) q̃_{k→N(k)\{k'}}(y_{N(k)\{k'}}^(•)) q̃_{k→k'}(y_{k'}^(•)) × g_imp.(k, k', x_k, y_{N(k)\{k'}}^(•), y_{k'}^(•)) r_RJ((k, x_k), (k', y_{k'}^(j*))) / (N r̄(k, k', x_k, y_{k'}^(•))) × q̃_{k'→N(k')\{k}}(z_{N(k')\{k}}^(•)) + δ_{(k, x_k)}(k', dy_{k'}^(j*)) P(rejection | (k', y_{k'}^(j*))),

where P(rejection | (k', y_{k'}^(j*))) is the rejection probability given that the current state is (k', y_{k'}^(j*)). Note that we considered that in Step 2.(ii) we set the index of the proposal uniformly at random. In practice this is however not important (which is why in Algorithm 5 we set it to 1) because of the form of the acceptance ratio. Note also that we use the notation r̄(k, k', x_k, y_{k'}^(•)) to be clear about which variables are involved.
The probability of reaching the state {k'} × {y_{k'}^(j*) ∈ A_{k'}} from {k} × {x_k ∈ A_k} is thus given by