The Median Probability Model and Correlated Variables

The median probability model (MPM) of Barbieri and Berger (2004) is defined as the model consisting of those variables whose marginal posterior probability of inclusion is at least 0.5. The MPM rule yields the best single model for prediction in orthogonal and nested correlated designs. This result was originally conceived under a specific class of priors, such as point-mass mixtures of non-informative and g-type priors. The MPM rule, however, has become so popular that it is now deployed for a much wider variety of priors and under correlated designs, where its properties are not yet completely understood. The main thrust of this work is to shed light on the properties of the MPM in these contexts by (a) characterizing situations in which the MPM is still safe under correlated designs, and (b) providing significant generalizations of the MPM to a broader class of priors (such as continuous spike-and-slab priors). We also provide new supporting evidence for the suitability of g-priors, as opposed to independent product priors, using new predictive matching arguments. Furthermore, we emphasize the importance of prior model probabilities and highlight the merits of non-uniform prior probability assignments using the notion of model aggregates.


Introduction
This paper investigates the extent to which the median probability model rule of Barbieri and Berger (2004) can be used for variable selection when the covariates are correlated. To this end, we consider the usual linear model

Y = X β + ε,   ε ∼ N_n(0, σ² I_n),   (1.1)

with submodels indexed by γ = (γ_1, ..., γ_q)', where γ_i ∈ {0, 1} indicates whether the i-th covariate is in or out of the model. We tacitly assume that the response and predictors have been centered and thereby omit the intercept term.
For prediction of a new observation y at covariates x under squared error loss, the optimal model γ_o is known (Lemma 1 of Barbieri and Berger (2004)) to minimize

R(γ) = (H_γ β̄_γ − β̄)' Q (H_γ β̄_γ − β̄),   (1.2)

where β̄ = Σ_γ π(γ | Y) H_γ β̄_γ is the overall posterior mean of β under the hierarchical prior π(γ) and π(β | γ); β̄_γ is the conditional posterior mean under γ; π(γ | Y) is the posterior probability of model M_γ; Q = E[x x'] = X'X, essentially the assumption that random covariates in the future will be like those arising in the data; and H_γ is the q × |γ| stretching matrix (defined in Section 2.2 of Barbieri and Berger (2004)) with entries

h_ij = 1 if γ_i = 1 and j = Σ_{r=1}^{i} γ_r, and h_ij = 0 otherwise.   (1.3)

As (1.2) reveals, γ_o can be regarded as the best single-model approximation to model averaging.
Contrary to what might commonly be conceived as an optimal predictive model, γ_o is not necessarily the modal (highest posterior probability) model. In orthogonal and nested correlated designs, Barbieri and Berger (2004) show that the optimal model γ_o is also the median probability model γ_MP, namely the model consisting of variables whose marginal inclusion probability π(γ_i = 1 | Y) is at least 0.5. Furthermore, compared to the maximum-a-posteriori (MAP) model, a major attraction of the median probability model (MPM) is the speed with which it can be well approximated via Markov chain Monte Carlo (MCMC) methods. Whereas the MAP must be approximated by finding a single high-probability model among the 2^p possible models, approximating the MPM only requires estimates of the p marginal inclusion probabilities, each of which can be quickly and more accurately estimated from MCMC-sampled binary inclusion indicators. Thus, even when the MAP and MPM are identical, which may often be the case, the MPM offers a much faster route to computing them both.
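To make the MPM/MAP distinction concrete, the following sketch computes both from a posterior over inclusion vectors. The function name and the toy posterior probabilities are purely illustrative (our own hypothetical numbers, not from the paper); they are chosen so that the MPM is a model whose own posterior probability is low.

```python
import numpy as np

def mpm_and_map(post, p):
    """Median probability model and MAP model from a posterior over
    inclusion vectors, given as {gamma (tuple of 0/1): probability}."""
    # marginal posterior inclusion probabilities pi(gamma_i = 1 | Y)
    incl = np.zeros(p)
    for gamma, prob in post.items():
        incl += prob * np.array(gamma, dtype=float)
    mpm = tuple(int(q >= 0.5) for q in incl)   # threshold marginals at 0.5
    map_model = max(post, key=post.get)        # highest-probability model
    return mpm, map_model, incl

# Hypothetical posterior over the 2^2 models of a two-variable problem.
post = {(0, 0): 0.10, (1, 0): 0.35, (0, 1): 0.35, (1, 1): 0.20}
mpm, map_model, incl = mpm_and_map(post, 2)
# both marginals are 0.55, so mpm == (1, 1), even though model (1, 1)
# has posterior probability only 0.20 and is not the MAP
```

In MCMC practice the marginal inclusion probabilities would simply be Monte Carlo averages of the sampled binary indicators, which is exactly why the MPM is so cheap to approximate.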
The MPM is now routinely used for distilling posterior evidence towards variable selection; Clyde et al. (2011); Feldkircher (2012); Garcia-Donato and Martinez-Beneito (2013); Ghosh (2015); Piironen and Vehtari (2017) and Drachal (2018) are some of the articles that have used and discussed the performance of the MPM. Despite its widespread use in practice, however, the optimality of the MPM has so far been shown under comparatively limited circumstances. In particular, the priors π(β | γ) are required to be such that the posterior mean β̄_γ is proportional to the maximum likelihood estimator (MLE) under γ. This property is satisfied by, e.g., the point-mass spike-and-slab g-type priors (Zellner, 1986; Liang et al., 2008). However, the also very popular continuous spike-and-slab mixtures (George and McCulloch, 1993; Ishwaran and Rao, 2003; Ročková and George, 2014; Ročková, 2018) fail to satisfy this requirement. Here, we will show that this condition is not necessary for the MPM to be predictively optimal. In particular, we provide significant generalizations of the existing MPM optimality results to a wider range of priors, such as the continuous spike-and-slab mixtures and, more generally, independent product priors. Barbieri and Berger (2004) presented a situation with correlated covariates (due to Merlise Clyde) in which the MPM was clearly not optimal. Thus there has been a concern that correlated covariates (reality) might make the MPM practically irrelevant. Hence another purpose of this paper is to explore the extent to which correlated covariates can degrade the performance of the MPM. We address this with theoretical studies concerning the impact of correlated covariates, together with numerical studies; the size of the problem space limits us (in the numerical studies) to a relatively exhaustive study of the two-variable case, made possible by geometric considerations.
The overall conclusion is that (in reality) there can be a small degradation of performance, but the degradation is less than that experienced by the MAP in correlated scenarios.
First, using predictive matching arguments (Berger and Pericchi, 2001; Bayarri et al., 2012; Fouskakis et al., 2018), we provide new arguments for the suitability of g-type priors as opposed to independent product priors. Going further, we highlight the importance of prior model probability assignments and discuss their "dilution" issues (George, 2010) in highly collinear designs. Introducing the notion of model aggregates, we showcase the somewhat peculiar behavior of separable model priors obtained with a fixed prior inclusion probability, and show that the beta-binomial prior copes far better with variable redundancy. We also characterize the optimal predictive model and relate it to the MPM through relative risk comparisons, and provide several "mini-theorems" showing predictive (sub)optimality of the MPM when q = 2.
The paper is structured as follows. Section 2 introduces the notion of model collectives and looks into some interesting limiting behaviors of the MPM when the predictors are correlated. Section 3 delves into a special case with 2 collinear predictors. Section 4 generalizes the optimality of the MPM to other priors and Section 5 wraps up with a discussion.

The Marginal Likelihood Under Many Highly Correlated Variables
One reasonable requirement for objective model selection priors is that they be properly matched across models that are indistinguishable from a predictive point of view. Recall that two models are regarded as predictive matching (Bayarri et al., 2012) if their marginal likelihoods are close in terms of some distance. In this section, we take a closer look at the marginal likelihood for the model (1.1) under the celebrated g-priors (Zellner, 1986), assuming that the n × (p + k) design matrix X_ε satisfies

X_ε = [B | x + ε δ_1 | · · · | x + ε δ_k]   (2.1)

for some ε > 0, where B consists of p possibly correlated regressors and where δ_1, ..., δ_k are (n × 1) perturbation vectors. For clarity of exposition, we first assume that σ² > 0 is fixed and later extend our considerations to the random case. We assume that δ_1, ..., δ_k are orthonormal and orthogonal to the (n × 1) vector x and to B, while x and B are not necessarily orthogonal. We will let ε be very small to model the situation of having k highly correlated variables. For the full model (1.1), the g-prior is

β ∼ N_{p+k}(0, g σ² (X_ε' X_ε)^{-1})   (2.2)

for some g > 0 (typically g = n). Assuming, for now, that σ² is fixed, the corresponding marginal likelihood is

Y ∼ N_n(0, σ² I_n + g σ² X_ε (X_ε' X_ε)^{-1} X_ε').

Note that X_ε (X_ε' X_ε)^{-1} X_ε' is the projection matrix onto the column space of X_ε. Hence, having near-duplicate columns in X_ε should not change this matrix much at all. Indeed, the following Lemma shows that, as ε → 0, this is a fixed matrix (depending only on B and x).
Proof. Let 1 be the k × 1 vector of ones, so that 11' is the k × k matrix of ones, and let v = B'x. Note first that the inverse Gram matrix (X_ε' X_ε)^{-1} can be written explicitly in terms of 11' and v. The result then follows by multiplying this matrix by X_ε and X_ε', and taking the limit as ε → 0.
Lemma 1 can perhaps be more readily understood through the following intuitive explanation. With k = 1, the limiting design matrix lim_{ε→0} X_ε = [B | x] is full rank and yields the projection matrix P in (2.3). With k > 1, the limiting matrix of X_ε is no longer full rank, but its columns [B | x | · · · | x] span the same space as those of [B | x], and so it is expected that the limiting projection matrix be the same as in the case [B | x].
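The k = 1 intuition is easy to check numerically: as ε shrinks, the hat matrix of [B | x + εδ] approaches the projection onto [B | x]. The sketch below is our own illustration (random B, x and δ; all names are assumptions of this example, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 25, 3
B = rng.standard_normal((n, p))
x = rng.standard_normal(n)

# a unit perturbation direction orthogonal to [B | x], as assumed in the text
M = np.column_stack([B, x])
d = rng.standard_normal(n)
d -= M @ np.linalg.lstsq(M, d, rcond=None)[0]   # remove the [B | x] component
d /= np.linalg.norm(d)

def hat(A):
    # projection matrix onto the column space of A
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T

P_limit = hat(M)                                 # projection onto [B | x]
errs = []
for eps in (1e-1, 1e-3, 1e-6):
    X_eps = np.column_stack([B, x + eps * d])    # k = 1 perturbed design
    errs.append(np.abs(hat(X_eps) - P_limit).max())
# errs shrinks towards 0: the projection matrix, and hence the g-prior marginal
# covariance sigma^2 I + g sigma^2 P, is essentially unchanged by the perturbation
```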
One important conclusion from Lemma 1 is that, no matter how many columns of highly correlated variables are present in the model, the marginal likelihood under the g-prior will essentially be Y ∼ N_n(0, σ² I + g σ² P) as ε → 0. Thereby, all models including all predictors in B and at least one replicate of x can essentially be regarded as predictive matching.
We let γ = (γ_1, γ_2) denote the global vector of inclusion indicators, where γ_1 is associated with B and γ_2 is associated with the k near duplicates. The same analysis holds for any sub-model γ_1 ∈ {0, 1}^p, defined by the design matrix B_{γ_1} consisting of the active variables corresponding to the 1's in γ_1. Before proceeding, we introduce the notion of a model collective, which will be useful for characterizing the properties of g-priors and the median probability model in collinear designs.

Definition 1 (A Model Collective). For a given γ_1 ∈ {0, 1}^p, the model collective M_{γ_1,x} is the set of all models that include the active variables in γ_1, none of the remaining variables in B, and at least one of the k near duplicates of x. We write M_{γ_1,0} for the single model that includes the active variables in γ_1 and none of the near duplicates.

Let P_{γ_1} be the limiting projection matrix corresponding to any of the models inside the model collective M_{γ_1,x}. The limiting marginal likelihood of such models under the g-prior is

m(y | M_{γ_1,x}) = φ(y | 0, σ² I_n + g σ² P_{γ_1}),   (2.5)

where φ(y | μ, Σ) denotes a multivariate Gaussian density with mean vector μ and covariance matrix Σ.
Lemma 2. Let m(y | γ_1) denote the marginal likelihood under the model γ_1. Then we have

m(y | M_{γ_1,x}) = ψ(y, x, γ_1) m(y | γ_1),   (2.7)

where the inflation factor ψ(y, x, γ_1) does not depend on how many of the k near duplicates are included.

Proof. We denote by P the projection matrix P_{B_{γ_1}} and let z = x'y. We use the fact that (I + gP)(I − P) = (I − P), and therefore (I + gP)^{-1}(I − P) = (I − P).

The statement then follows from (2.4).

Remark 3 (Random σ²). A similar conclusion holds when σ² is random with the prior (2.9) (as in George and McCulloch (1997)): using Lemma 1, the limiting marginal likelihood again satisfies (2.5), now with an inflation factor ψ(y, x, γ_1). For the orthogonal case (when x is orthogonal to B) with fixed σ², the factor takes the form ψ(y, x, g) and does not depend on γ_1. More generally, including at least one copy of the covariate x in a model γ_1 still inflates the marginal likelihood, where now the inflation factor ψ(y, x, γ_1) depends on γ_1. However, ψ(y, x, γ_1) still does not depend on the number of copies.

Dimensional Predictive Matching
As a first application of Lemma 2, we note that the (limiting) marginal likelihood under the g-prior is the same, no matter how many replicates of x are in the model. This property can be regarded as a variant of dimensional predictive matching, one of the desiderata relevant for the development of objective model selection priors (Bayarri et al., 2012). This type of predictive matching across dimensions is, however, new in the sense that the matching holds for all training samples, not only the minimal ones.
Corollary 2.1. Mixtures of g-priors are dimensional predictive matching in the sense that the limiting marginal likelihood of all models within the model collective is the same, provided that the mixing distribution over g is the same across all models.
Proof. Follows directly from Lemma 2.
Remark 4. According to Remark 3, Corollary 2.1 applies to both fixed and random σ 2 with a prior (2.9). Note that a similar conclusion also holds for the usual objective prior 1/σ 2 which is obtained as the limiting case as η → 0. Throughout the rest of the paper, we implicitly assume the prior (2.9) and/or the objective prior 1/σ 2 whenever we refer to random σ 2 .
In contrast, it is of interest to consider what happens under an alternative prior for β, such as an N(0, I) prior. If a model has j near-replicates of x, the effective parameter for x in that model is the sum of the j individual β's, which has an N(0, j) prior. The marginal likelihoods will therefore depend strongly on the number of replicates, even though there is no practical difference between the models.
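This contrast can be verified directly. With j exact replicates of x, the generalized g-prior marginal covariance σ²I + gσ²X(X'X)⁺X' is the same for every j, whereas an independent N(0, τ) prior on each coefficient yields σ²I + τXX', which changes with every added replicate. A small sketch under our own illustrative settings (the names n, g, tau and the random data are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, g, sigma2, tau = 12, 12.0, 1.0, 1.0
x = rng.standard_normal(n)
x *= np.sqrt(n) / np.linalg.norm(x)     # standardize so ||x||^2 = n
y = rng.standard_normal(n)

def log_marglik(y, Sigma):
    # log density of N(0, Sigma) evaluated at y
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(Sigma, y))

m_g, m_ind = [], []
for j in (1, 2, 3):                                  # number of exact replicates of x
    X = np.tile(x[:, None], (1, j))
    P = X @ np.linalg.pinv(X.T @ X) @ X.T            # equals x x'/n for every j
    m_g.append(log_marglik(y, sigma2 * np.eye(n) + g * sigma2 * P))
    m_ind.append(log_marglik(y, sigma2 * np.eye(n) + tau * X @ X.T))
# m_g is constant in j; m_ind moves with each added replicate
```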

When all Non-Duplicated Covariates are Orthogonal
To get insights into the behavior of the median probability model for correlated predictors, we consider an instructive example obtained by setting ε = 0 and B'B = n I_p in (2.1). In particular, we will be working with an orthogonal design that has been augmented with multiple copies of one predictor,

X = [x_1 | · · · | x_p | x | · · · | x],   (2.10)

where x_1, ..., x_p, x are orthogonal and standardized so that ‖x_i‖² = ‖x‖² = n. There is no qualitative difference between this and the previous assumption of highly correlated covariate vectors; going directly to the limiting case of replicate covariate vectors makes matters pedagogically easier to understand.
A few points are made with this toy example. First, we want to characterize the optimal predictive model and generalize the MPM rule when the designs have blocks of (nearly) identical predictors. Second, we want to understand how close to the optimal predictive model the MPM actually is in this limiting case. Third, we want to highlight the benefits of the g-prior correlation structure. We denote z = x'y and z_i = x_i'y for i = 1, ..., p, and write z = (z_1', z_2')', where z_1 = (z_1, ..., z_p)' and z_2 = z 1_k. We will again split the variable inclusion indicators into two groups γ = (γ_1', γ_2')' ∈ {0, 1}^{p+k}, where γ_1 is attached to the first p and γ_2 to the last k predictors. To begin, we assume the generalized g-prior on the regression coefficients, given the model γ,

β_γ ∼ N_{|γ|}(0, g σ² (X_γ' X_γ)^+),   (2.11)

where (X_γ' X_γ)^+ is the Moore-Penrose pseudo-inverse and X_γ denotes the subset of columns of X with active indicators γ. The following lemma characterizes the optimal predictive model under (2.11) and (2.10).
Proof. Due to the block-diagonal nature of the matrix

X'X = n [[ I_p , 0_{p×k} ], [ 0_{k×p} , 1_k 1_k' ]],

the posterior mean under the non-null model γ has a closed form. The optimal predictive model minimizes R(γ) defined in (1.2). Because Q is block-diagonal, the criterion R(γ) separates into two parts, R(γ) = R_1(γ_1) + R_2(γ_2) in (2.14) and (2.15), one involving the first p independent variables and the second involving the k identical copies. The statement then follows from (2.14) and (2.15). With duplicate columns, the optimal predictive model γ_o ≡ arg min_γ R(γ) is not unique: any model γ_o = (γ_1^o, γ_2^o) defined through (2.12) and (2.13) minimizes the criterion R(γ).
The last k variables in the optimal predictive model thus act jointly as one variable, where the decision to include x is based on the joint posterior probability π(γ_2 = 0 | Y). This intuitively appealing treatment of x is an elegant byproduct of the g-prior. We will see in the next section that such clustered inclusion no longer occurs in the optimal predictive model under independent product priors. The risk of the optimal model follows from (2.14) and (2.15). Contrastingly, recall that the median probability model is driven by the marginal inclusion probabilities. The median probability model γ_MP thus behaves as the optimal model γ_o for the first p variables. For the k duplicate copies, however, γ_2^MP consists of either all ones or all zeros. The MP rule correctly recognizes that the decision to include x is ultimately dichotomous: either all x's in or all x's out. Moreover, when the median model decides "all in", it is predictively optimal.
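The gap between joint and marginal inclusion can be seen with a small enumeration. Suppose a hypothetical posterior puts mass 0.6 on "at least one clone in", diluted uniformly over the 2^k − 1 clone subsets (the numbers here are our own toy choices, not from the paper):

```python
import itertools

k = 3
post_any = 0.6        # hypothetical aggregate posterior mass on "some clone included"
subsets = [s for s in itertools.product([0, 1], repeat=k) if any(s)]
post = {s: post_any / len(subsets) for s in subsets}   # uniform dilution
post[(0,) * k] = 1.0 - post_any

# joint inclusion of the cluster vs marginal inclusion of each clone
joint_in = sum(pr for s, pr in post.items() if any(s))            # 0.6 > 0.5
marginal = [sum(pr for s, pr in post.items() if s[i]) for i in range(k)]
# each marginal equals 0.6 * 2^(k-1) / (2^k - 1) ~ 0.343 < 0.5, so a rule
# thresholding marginals would drop every clone even though the cluster
# as a whole has posterior probability above 0.5 of belonging in the model
```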
The term R_1(γ_1^o) in (2.14) can be quite large when p is large, implying that the relative risk can be quite small. The MP model is thus not too far from the optimal predictive model in this scenario.
Several conclusions can be drawn from our analysis of this toy example. First, Lemma 3 shows that, in the presence of perfect correlation, it is the joint inclusion rather than the marginal inclusion probabilities that guide the optimal predictive model selection. Second, the clone variables ultimately act collectively as one variable, which has important implications for the assignment of prior model probabilities. We elaborate on this important issue in Sections 2.4, 2.5 and 2.6. Third, purely from a predictive point of view, all models in the model collective (including at least one x) are equivalent. The g-prior here appears to be egalitarian in the sense that it (rightly) treats all these models equally. This property is not retained under independent product priors, as shown below.
Remark 5 (Independent Product Priors). Let us replace (2.11) with an independent prior covariance structure

β_γ ∼ N_{|γ|}(0, σ² (g/n) I_{|γ|})   (2.16)

for g = n, which corresponds to the usual scaling when the predictors are centered and standardized so that ‖x_i‖² = n. The posterior mean β̂ then has a closed form, and the criterion R(γ) = R_1(γ_1) + R_2(γ_2) again separates into two parts. The optimal predictive model for the last k variables now has a somewhat less intuitive description. Denote c(|γ|) = |γ| − n|γ|/(1 + n|γ|). The optimal predictive model consists of any collection of variables of size |γ_2^o| for which c(|γ_2^o|) is as close as possible to the posterior mean of c(|γ_2|). It is worth noting that this does not need to be the null configuration or the full set of copies; outside these narrow situations, the optimal model γ_2^o will have a nontrivial size (other than 0 or k). The median probability model will still maintain the dichotomy by either including all or none of the x's. However, contrary to the g-prior, it is not guaranteed to be "optimal" when, for instance, γ_2^MP = 1. The mission of the optimal model under the independent prior thus seems a bit obscured: it is not obvious why models in the same model collective should be treated differentially and ranked based on their size. The independence prior correlation structure thus induces what appears to be an arbitrary identifiability constraint.

Prior Probabilities on Model Collectives
It is now standard to assume that each model of dimension |γ| has an equal prior probability,

π(γ) = π(|γ|) / C(p+k, |γ|),   (2.20)

where C(p+k, |γ|) is the number of models of size |γ| and π(|γ|) is the prior probability (usually 1/(p+k+1)) of the collection of models of dimension |γ|. One of the observations from Lemma 3 is that it is the aggregate posterior probability π(γ_2 = 0 | Y), rather than the individual inclusion probabilities π(γ_2i = 1 | Y), that drives the optimal predictive model γ_2^o in our collinear design. It is thereby natural to inspect the aggregate prior probability π(γ_2 = 0). We will be using the notion of model collectives introduced earlier in Definition 1. We investigate the prior probability of the model collective under two usual choices: a fixed prior inclusion probability (the separable case) and a random one (the non-separable case).

The Separable Case
Suppose that all variables have a known and equal prior inclusion probability θ = π(γ_j = 1 | θ) for j = 1, ..., p + k. Then the prior probability of the model aggregate, given θ, is

π(M_{γ_1,x}; θ) = θ^{|γ_1|} (1 − θ)^{p − |γ_1|} [1 − (1 − θ)^k],

and the prior probability of the "null model" M_{γ_1,0} (not including any of the correlated variables) is

π(M_{γ_1,0}; θ) = θ^{|γ_1|} (1 − θ)^{p − |γ_1|} (1 − θ)^k.

The ratio satisfies

π(M_{γ_1,x}; θ) / π(M_{γ_1,0}; θ) = [1 − (1 − θ)^k] / (1 − θ)^k.   (2.21)

This analysis reveals a rather spurious property of the separable prior: regardless of the choice θ ∈ (0, 1), the ratio (2.21) grows without bound in k, so that the model aggregate M_{γ_1,x} will have a far higher prior probability than the model M_{γ_1,0} without any x in it. Such preferential treatment of x is generally unwanted. We illustrate this issue with the uniform model prior (obtained with θ = 0.5), which is still widely used in practice.
With fixed θ = 0.5, all models have an equal prior probability of 2^{−(p+k)}. The number of models in the collective M_{γ_1,x} is 2^k − 1, and so

π(M_{γ_1,x}; 1/2) = (2^k − 1) 2^{−(p+k)} = (2^k − 1) π(M_{γ_1,0}; 1/2).   (2.22)

The collective can thus have much more prior probability than M_{γ_1,0}. Furthermore, the marginal prior probability of inclusion of x is Σ_{γ_1} π(M_{γ_1,x}; 1/2) = 1 − 2^{−k}. Hence, if k is even moderately large, the prior mass is concentrated on the models which include x as a covariate, and the posterior mass will almost certainly also be concentrated on those models. The model-averaged β̂ will reflect this, essentially only including models that have x as a covariate.
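These quantities are easy to reproduce; the sketch below just evaluates the separable-prior formulas above (k is an arbitrary illustrative choice of ours):

```python
k = 4
theta = 0.5
# prior probability that at least one of the k clones enters, vs none,
# conditionally on a fixed configuration of the first p variables
p_any = 1 - (1 - theta) ** k
p_none = (1 - theta) ** k
ratio = p_any / p_none              # equals 2^k - 1 at theta = 1/2
marginal_x = 1 - (1 - theta) ** k   # marginal prior inclusion probability of x
# with k = 4: ratio = 15 and marginal_x = 0.9375, so x is almost forced in
```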

Beta-Binomial Prior
It is generally acknowledged (Cui and George, 2008; Ley and Steel, 2009; Scott and Berger, 2010) that assigning equal prior probability to all models is a poor choice, since it does not adjust for the multiple testing that is effectively being done in variable selection. The common alternative (which does adjust for multiple testing) is to replace the separable prior with θ ∼ B(a, b). The prior probability of the model aggregate then satisfies an analogue of (2.21), given in (2.24), and this ratio is guaranteed to be smaller than under the separable case with a fixed θ under a condition on |γ_1|. With the usual choice a = b = 1, the ratio in (2.24) will be smaller than the one in (2.21) when |γ_1| is smaller than (p + 2)θ − 1, i.e., when the number |γ_1| of non-duplicated variables is roughly smaller than its expectation under the fixed-θ prior. This suggests that the beta-binomial prior can potentially cope better with variable redundancy. We elaborate on this point in the next section. In the forthcoming Lemma 5, we provide an approximation to (2.24) as p gets large.
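To see the beta-binomial adjustment at work, one can compare the aggregate-to-null prior ratio under θ ∼ B(1, 1) with the separable ratio 2^k − 1 from (2.22). The sizes p, k and |γ_1| below are illustrative choices of ours, and the function uses the standard beta-binomial model prior π(γ) = B(a + |γ|, b + q − |γ|)/B(a, b):

```python
from math import comb, exp, lgamma

p, k, a, b = 5, 4, 1, 1
q = p + k

def log_beta(x, y):
    # log of the Beta function B(x, y)
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def model_prior(size):
    # pi(gamma) for a model of the given size under theta ~ Beta(a, b)
    return exp(log_beta(a + size, b + q - size) - log_beta(a, b))

s1 = 1   # |gamma_1|: number of active non-duplicated variables (illustrative)
prior_none = model_prior(s1)                                      # pi(M_{gamma1,0})
prior_any = sum(comb(k, j) * model_prior(s1 + j) for j in range(1, k + 1))
ratio_bb = prior_any / prior_none
# ratio_bb = 2.0 here, far below the separable ratio 2^k - 1 = 15,
# so the clones no longer swamp the prior
```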

Posterior Inclusion Probabilities
In the previous section, we have shown that equal prior model probabilities can be problematic because each model collective M γ 1 ,x receives much more prior mass relative to M γ 1 ,0 , essentially forcing the inclusion of x. Going further, we show how this is reflected in the posterior inclusion probabilities. We first focus on the case when σ 2 is known.
With the usual choice g = n, it follows that the collective has posterior inclusion probability greater than 0.5 if and only if z² exceeds the threshold in (2.25), a threshold that decreases as the number of duplicates k grows. From Lemma 4 it thus follows that the optimal predictive model (characterized in Lemma 3) will include x if the number of duplicates k is large enough, even when x has a small effect (z is small). Thus, in the face of replicate covariates, the choice of equal prior model probabilities is potentially quite problematic for the optimal predictive model. If one is only developing a model for prediction in such a situation, such forced inclusion of x is probably suboptimal, but x is only one covariate and so will not typically have a large effect, unless only very small models have significant posterior probability. For prediction, one could presumably do somewhat better by only considering the first p + 1 variables in the model uncertainty problem, finding the model-averaged β̂ for this subset of variables.
This statement at first seems odd, because we 'know' the model-averaged answer in the original problem is optimal from a Bayesian perspective. But that optimality is from the internal Bayesian perspective, assuming we believe that the original model space and assignment of prior probabilities are correct. If we really believed, e.g., that any of k highly correlated genes could be in the model with prior inclusion probabilities each equal to 1/2 (equivalent to the assumption that all models have equal prior probability), then the original model-averaged answer would be correct and we should include x in the prediction. At the other extreme, if we felt that only the collection of all k genes has prior inclusion probability of 1/2, then the result will be like the model-averaged answer for the first p + 1 variables.
To get some feel for the general case (a non-uniform model prior), suppose π_{i,x} for some 0 ≤ i ≤ p is much bigger than the others, so that (2.26) simplifies to (2.27). Using (2.24), it is immediate that this is bigger than 0.5 under the condition (2.28). The following Lemma characterizes the behavior of π_{i,x}/π_{i,0} as p gets large.

Lemma 5. Suppose a and b are integers. As p gets large with i fixed, the ratio π_{i,x}/π_{i,0} admits first- and second-order asymptotic approximations.

Proof. The first-order result follows immediately, and the second-order result follows from expanding the products in the last term.
Using the first-order term in (2.28), again choosing g = n and assuming ‖x‖ = √n, yields a condition under which the collective has posterior inclusion probability greater than 0.5. Note that this condition is much less likely to be satisfied than (2.25) when k grows, since (1 + k/p)^{[i+a]} is then much smaller than 2^k; thus having many duplicate x's does not ensure that x will be included in the model, as it was in the equal model probability case.
Remark 6 (The Case of Unknown Variance). Lemma 4 was postulated for the simpler case when the variance σ² is known. We saw in Remark 3 that the inflation factor ψ(y, x, γ_1) depends on γ_1 even when x and B are orthogonal, which complicates the analysis. Sufficient characterizations can be obtained by bounding ψ(y, x, γ_1) between ψ_min = min_{γ_1} ψ(y, x, γ_1) and ψ_max = max_{γ_1} ψ(y, x, γ_1).
While the assumption of having duplicated predictors (used throughout this section) is somewhat theoretical, it sheds light on the most extreme case of correlation where one would expect the median probability model to fail. The performance of the median probability model in the more optimistic and realistic scenarios of near-correlation is shown through simulations in Section 3.4.

The Dilution Problem
Sets of predictors which are highly correlated with each other become proxies for one another in the linear model (1.1). This quickly leads to an excess of redundant models, each distinguished only by including a different subset of these proxies. To prevent such redundant models from accumulating too much posterior probability, dilution priors may be considered (George, 2010). Such priors downweight individual model probabilities based on their proximity to one another, and a variety of strategies for doing so may be considered.
When faced with a single identifiable cluster of highly correlated predictors, such as our k clones of x, a simple dilution strategy would be to first assign a reasonable amount of prior mass to the entire cluster, and then dilute this mass uniformly across all subset models within the cluster. More precisely, to smear out the prior aggregation on M_{γ_1,x}, one might consider different inclusion probabilities: let x_1, ..., x_p have prior inclusion probability θ_1 and let each of the x clones have prior inclusion probability θ_2, chosen according to (2.29). Under (2.29), variables with correlated copies have smaller inclusion probabilities (the more copies, the smaller the probability). This may correct the imbalance between π(M_{γ_1,x}) and π(M_{γ_1,0}) by treating the multiple copies of x essentially as one variable. This prior allocation would put x on an equal footing with x_1, ..., x_p in the optimal predictive model rule (based on π(γ_2 = 0 | Y)), but would disadvantage x in the median probability model. From the considerations above, there appears to be a fix to the dilution problem in our synthetic example (with clone x's). However, general recommendations for other correlation patterns are far less clear.
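One concrete instance of such a θ_2 (the paper's condition (2.29) is not reproduced in this excerpt, so this particular rule is our own assumption) equates the prior chance that at least one clone enters with θ_1:

```python
theta1 = 0.5                  # inclusion probability of each ordinary predictor
theta2s, p_anys = [], []
for k in (1, 2, 4, 8):
    # hypothetical dilution rule: P(at least one of k clones enters) = theta1
    theta2 = 1 - (1 - theta1) ** (1 / k)
    theta2s.append(theta2)
    p_anys.append(1 - (1 - theta2) ** k)   # = theta1 by construction
# theta2 shrinks as k grows (more copies -> smaller individual probability),
# while the cluster-level inclusion probability stays pinned at theta1
```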

The Geometric Representation
The situations analyzed in previous sections may also be considered from a geometric perspective. Using the definition of H_γ from (1.3), we define

α_γ = X H_γ β̄_γ   (3.1)

and denote by ᾱ = Σ_γ π(γ | Y) X H_γ β̄_γ the overall posterior mean. Under this transformation, the expected posterior loss (1.2) to be minimized may be written as the squared distance ‖α_γ − ᾱ‖². This implies that the preferred model will be the one whose corresponding α_γ is nearest to ᾱ in the Euclidean distance.
To formulate the predictive problem geometrically, each model M_γ may be represented by the point α_γ, and the set of models becomes a collection of points in p-dimensional space. The convex hull of these points is a polygon representing the set of possible model-averaged estimates ᾱ, as the π(γ | Y) vary over their range. Any point in this polygon is a possible ᾱ, and the goal is to characterize geometrically when each single model is optimal, given that a single model must be used.
In what follows, we refer to circumstances where β̄_γ is equal (up to a constant) to the least squares estimator β̂_γ = (X_γ' X_γ)^{-1} X_γ' Y, as happens under a non-informative prior or a g-prior on the model parameters (see Barbieri and Berger (2004) for a discussion of mixed default strategies for determining the posterior probability of each model and the posterior distribution of the parameters). In this context, α_γ is (proportional to) the projection of Y onto the space spanned by the columns of X_γ.
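The geometric rule can be sketched in a few lines: represent each model by α_γ (the projection of Y onto its column span, as above), form ᾱ from posterior weights, and pick the nearest model point. All data and posterior numbers below are illustrative assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n)   # a correlated pair of covariates
X = np.column_stack([x1, x2])
y = x1 + rng.standard_normal(n)

models = [(0, 0), (1, 0), (0, 1), (1, 1)]
post = {(0, 0): 0.05, (1, 0): 0.45, (0, 1): 0.30, (1, 1): 0.20}  # hypothetical pi(gamma | Y)

def alpha(gamma):
    cols = [j for j in (0, 1) if gamma[j]]
    if not cols:
        return np.zeros(n)                     # null model: alpha = 0
    Xg = X[:, cols]
    coef = np.linalg.lstsq(Xg, y, rcond=None)[0]
    return Xg @ coef                           # projection of y onto span(X_gamma)

alpha_bar = sum(post[g] * alpha(g) for g in models)     # model-averaged point
dists = {g: np.linalg.norm(alpha(g) - alpha_bar) for g in models}
best = min(dists, key=dists.get)               # optimal single model (nearest point)
```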
The solid lines divide the figures into the four optimality subregions associated with the four models, namely the sets of those ᾱ which are closest to one of the α_γ.
The colors in Figure 1 indicate the regions where the model-averaged point ᾱ (the best model-averaged answer) could lie when the model with the corresponding color is the median probability model. In the orthogonal case, the model-averaged answer and model optimality regions always coincide, i.e., the MPM is always optimal. In the other cases, this need not be so. In Case 1, for instance, the red region extends into the (blue) null model's optimality region; thus M_01 could be the MPM even when the null model is optimal. Likewise, the green region extends into the optimality region of the null model, and the grey region (corresponding to the full model) extends into the optimality regions of all other models. Only the null model is safe here: if the null model is the MPM, it is guaranteed to be optimal.

Characterizations of the Optimal Model
For the case of two correlated covariates, we obtained partial characterizations of the optimal predictive model and the median probability model. These are summarized by the "mini-theorems" below, whose proofs can be found in the Supplementary Material (Barbieri et al., 2020).
Theorem 1 ("Mini-Theorems"). Consider the model (1.1) with q = 2 and the three cases described in Table 1. Then the following statements hold.
1. In Case 1, if M 00 is the median, it is optimal.
2. In Case 2, if M 11 is the median, it is optimal.

3. In Cases 1 and 3, if at most one variable has posterior inclusion probability larger than 1/2, M_11 cannot be optimal.
4. In Cases 2 and 3, if at least one variable has posterior inclusion probability larger than 1/2, M_00 cannot be optimal.
5. In Cases 1 and 2, if M 00 or M 11 has posterior probability larger than 0.5, it is optimal.
6. In any case, if M 00 or M 11 has posterior probability larger than 0.5, the other cannot be optimal.
7. In Case 3, if M 00 or M 11 has posterior probability smaller than 0.5 it cannot be optimal.
The motivation for developing these mini-theorems was to generate possible theorems that might hold in general. Unfortunately, in going to the three-dimensional problem, we were able to develop counterexamples (not shown here) to each of the mini-theorems.

Numerical Study of the Performance of the MPM
We present a numerical study that investigates the extent to which the MPM and MAP agree, and how often they differ from the optimal predictive model. The goal was to devise a study that effectively spans the entire range of correlations that are possible, and this was easiest to do by limiting the study to the two-dimensional case. We considered (1) equal prior probabilities for the four models, (2) the unit information g-prior for the parameters, and (3) the more realistic scenario where the variance σ² is unknown and assigned the usual objective prior 1/σ².
The study considered the following correlations and sample sizes:

• r_12 = Corr(x_1, x_2) and the sample sizes n = 10, 50, 100;

• r_1y and r_2y vary over ranges meant to span the range of likely data under either the full model, one-variable model, or null model; the description (and derivation) of the various correlation ranges is given in the Supplementary Material.
The reason the numerical study is conducted in this way is to reduce the dimensionality of the problem. In terms of ordinary inputs, one would have to deal with a study over the space of x_1, x_2, β_1, β_2, and the random error vector ε (or Y). But, because the predictive Bayes risks only depend on r_12, r_1y, r_2y and n, we can reduce the study to a three-dimensional problem; and, since these are simply correlations, we can span their entire feasible range.

Table 2 summarizes the results, combining the three cases from Table 1, under each of the model correlation scenarios (full, one-variable, and null), i.e., the overall combination of values of r_12, r_1y and r_2y considered. We have 1602, 2100 and 2323 cases for the full, one-variable and null models, respectively. The table reports the percentage of times the MPM and MAP models equal the optimal predictive model (denoted OP), i.e., the model minimizing (1.2). The last two columns of the table contain the geometric averages of relative risks to the optimal predictive model when the MPM or MAP is not optimal, while the graphs in Figure 2 depict box-plots of the corresponding distributions.

Legend for Table 2: columns (a) to (f) contain percentages of cases, over combinations of different values of the correlations among variables; OP denotes the optimal predictive model; MPM>MAP (resp. MAP>MPM) means that MPM (resp. MAP) has a smaller value of the risk defined in (1.2) than MAP (resp. MPM); GM is the geometric mean of relative risks (to the optimal model) when MPM or MAP is not optimal; * denotes cases when OP is the lowest probability model.
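The reduction to (r_12, r_1y, r_2y, n) can be sketched in code. The function below computes the MPM, MAP and optimal predictive model for one correlation configuration; it simplifies the paper's study by fixing σ² = 1 (a known-variance g-prior with a uniform model prior) rather than using the objective 1/σ² prior, and the function name and example correlations are our own assumptions:

```python
import numpy as np

def study(r12, r1y, r2y, n, g=None):
    """MPM, MAP and optimal predictive model for two covariates, working
    directly from the correlations (fixed sigma^2 = 1 for simplicity)."""
    g = n if g is None else g
    XtX = n * np.array([[1.0, r12], [r12, 1.0]])     # X'X from the correlations
    Xty = n * np.array([r1y, r2y])                   # X'y (with ||y||^2 = n)
    models = [(0, 0), (1, 0), (0, 1), (1, 1)]

    logm, beta_full = {}, {}
    for gamma in models:
        idx = [j for j in (0, 1) if gamma[j]]
        b, ssr = np.zeros(2), 0.0
        if idx:
            sub = np.linalg.solve(XtX[np.ix_(idx, idx)], Xty[idx])
            ssr = Xty[idx] @ sub                      # y' P_gamma y
            b[idx] = g / (1 + g) * sub                # g-prior posterior mean
        # log marginal likelihood under the known-variance g-prior
        logm[gamma] = -0.5 * len(idx) * np.log(1 + g) + g / (2 * (1 + g)) * ssr
        beta_full[gamma] = b

    mx = max(logm.values())                           # uniform model prior
    w = {gm: np.exp(v - mx) for gm, v in logm.items()}
    tot = sum(w.values())
    post = {gm: v / tot for gm, v in w.items()}

    beta_bar = sum(post[gm] * beta_full[gm] for gm in models)
    risk = {gm: (beta_full[gm] - beta_bar) @ XtX @ (beta_full[gm] - beta_bar)
            for gm in models}                          # criterion (1.2) with Q = X'X

    incl = [sum(post[gm] for gm in models if gm[j]) for j in (0, 1)]
    mpm = tuple(int(q >= 0.5) for q in incl)
    map_model = max(post, key=post.get)
    op = min(risk, key=risk.get)
    return mpm, map_model, op

# one highly correlated configuration (illustrative values)
mpm, map_model, op = study(r12=0.9, r1y=0.5, r2y=0.45, n=50)
```

Looping such a function over a grid of feasible (r_12, r_1y, r_2y, n) values is exactly the kind of exhaustive sweep the study performs, without ever simulating x_1, x_2 or ε.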
Here are some observations from Table 2:
• Simpler models are more challenging for the MPM (and MAP); indeed, MPM = OP in 92.1% (1421+54 out of 1602), 85.2% (1767+22 out of 2100) and 84.5% (1895+68 out of 2323) of the cases for the full, one-variable, and null model, respectively; still, these are high success rates, given that correlations vary over the full feasible spectrum.
• As would be expected, both the MPM and MAP do better with larger sample sizes.
• The vast majority of the time, the MPM and MAP are the same model but, when they differ, the MPM is typically better:

Figure 2: The case of two covariates: boxplots of the relative risks (to the optimal model) when MAP or MPM is not optimal under the full, one-variable and null models (cases combined). Dots correspond to values more extreme than 1.5 times the interquartile range from the box.
- Combining all three model types (full, one-variable and null), the MPM does better than the MAP (from the MPM = OP ≠ MAP and MPM > MAP columns) in 2.7% of the cases (i.e. there are 162 cases with MPM = OP ≠ MAP or MPM > MAP out of 6025 = 1602 + 2100 + 2323 total cases), while the MAP does better than the MPM in 44 out of 6025 total cases (0.7%).
- When neither the MPM nor the MAP is optimal, the MAP can have values of the risk (relative to that of OP) more extreme than the MPM (see Figure 2); the geometric average of the MAP relative risk is higher than that of the MPM.
• In the cases denoted with a star * , the optimal model OP is, curiously, the lowest probability model.
Table 3 offers some insight into the role of correlation between covariates; it contains summaries under different values of r12 = Corr(x1, x2), after combining all cases and model scenarios (full, one-variable and null). When the correlation between x1 and x2 is small, the MPM is virtually always optimal (MPM = OP in 98.1% of the cases when r12 = 0.1), while, as the correlation increases, the rate of success degrades (MPM = OP in 75% of the cases when r12 = 0.9). However, the deterioration in the performance of the MPM appears slower than that of the MAP.

Tables S.2, S.3 and S.4, in the Supplementary Material, summarize some additional features of the numerical study. Note that the last sub-tables of Tables S.2-S.4 are included in Table 2. Each of those tables considers one model scenario (the full model, the one-variable model, or the null model) and presents the results separately for Cases 1, 2 and 3. It is very clear from these tables that the Case 1 scenario is very favorable for the MPM (it is then virtually always the optimal model), while, in Cases 2 and 3, the MPM fails to be the optimal model in roughly 12% of the cases.

Legend (Table 3): columns (a) to (f) contain the percentage over combinations of all cases (1, 2 and 3), model scenarios (full, one-variable and null), sample sizes (n = 10, 50, 100) and r_iy = Corr(x_i, Y) (i = 1, 2); OP denotes the optimal predictive model; MPM>MAP (resp. MAP>MPM) means that MPM (resp. MAP) has a smaller value of the risk defined in (1.2) than MAP (resp. MPM).

In the figures, a dot of the color of α00 indicates that the median probability model was M00.
As before, the true optimal model for a dot is the external vertex defining the quadrilateral in which the dot lies; thus, if the blue dot lies within the quadrilateral with α 00 as the external vertex, the MPM is the optimal model, while if the blue dot lies within the quadrilateral for which α 10 is the external vertex, the MPM is incorrectly saying that M 00 is optimal, when actually M 10 is optimal.
The figures reinforce the earlier messages; Case 1 is nice for the MPM and MAP (almost all the colored dots are in the quadrilateral with the external vertex being of the same color), while Case 2 and, especially, Case 3 here are problematic -in Case 3, the MPM is typically M 00 when M 10 is optimal. Careful examination of the figures shows that the MPM is slightly better than the MAP, but the improvement is not dramatic.
The interesting feature revealed by the figures is that, essentially always, when the MPM and MAP fail, they do so by selecting a model of smaller dimension than the optimal model. There are a handful of dots going the other way, but they are hard to find. (This same feature was present in the corresponding figures for the one-variable and null model correlation scenarios, so those figures are omitted.) We highlight this feature because it potentially generalizes; if the MPM and MAP fail, they may typically do so by choosing too-small models.
The numerical study has also been implemented with three possible predictors (x1, x2, x3), using the same ingredients, conveniently adapted. In particular, we used the same choices of prior distributions, sample sizes and grid of values of the correlations between covariates. Just as with two variables, we also considered the complete set of feasible correlations between the response and the explanatory variables, under each model scenario (full, two-variable, one-variable and null). Since the results are substantially comparable to those in two dimensions, we only report a concise summary of the conclusions. Indeed, more complex models and larger sample sizes are more suitable for the MPM and MAP: combining all correlations, under the full model and with n = 100, the MPM is the optimal model in 93.8% of the 196 198 cases (the MAP in 92%), while under the null model and with n = 10, MPM = OP in 52.4% (MAP in 47%) of the 198 043 cases.

Legend (Table 4): columns (a) to (f) contain the percentage over combinations of all model scenarios (full, two-variable, one-variable and null), sample sizes (n = 10, 50, 100) and all feasible values of the correlations between the response variable and each covariate; "Correlations combined" refers to the complete set of the correlations between the covariates; OP denotes the optimal predictive model; MPM>MAP (resp. MAP>MPM) means that MPM (resp. MAP) has a smaller value of the risk defined in (1.2) than MAP (resp. MPM); GM is the geometric mean of relative risks (to the optimal model) when MPM or MAP is not optimal.

Table 4 provides further hints on the effect of correlation between covariates. Combining all other ingredients of the study, when correlation is low the MPM and MAP are almost always both equal to the optimal model. Higher correlation is more challenging for the MPM and MAP: when the covariates are equicorrelated and the common value of the correlation is 0.9, the MPM and MAP are the same model in 84% of the cases and both optimal in 62%; when they differ, the MPM is the optimal model in almost half of the cases, while in the others the MAP, although not optimal, is preferable to the MPM in terms of risk. However, on average, the relative risks to the optimal model of the MPM and MAP (when neither is optimal) do not differ much.

Generalizations of the Median Probability Model Optimality
In orthogonal designs, the primary condition for optimality of the median probability model is that β γ, the conditional posterior mean of β under γ, is obtained by taking the relevant coordinates of the posterior mean under the full model (condition (17) of Barbieri and Berger (2004)). With X⊤X = D = diag{d_i}_{i=1}^q, the likelihood factors into independent likelihoods for each β_i, and thereby any independent product prior (4.1) will satisfy condition (17). This is a very important extension, because fat-tailed priors are often recommended over sharp-tailed priors such as the g-prior (for which the optimality results of the MPM were originally conceived).
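Condition (17) can be illustrated numerically. The following sketch (our own illustration, with arbitrary slab variances and a known error variance, both chosen for the example) builds an orthogonal design and checks that, under an independent normal product prior, the posterior mean under any submodel consists of the corresponding coordinates of the full-model posterior mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 30, 4
# orthogonal design: QR gives orthonormal columns, so X'X = I (d_i = 1)
X, _ = np.linalg.qr(rng.standard_normal((n, q)))
y = X @ np.array([1.5, 0.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(n)
sigma2 = 0.09                           # known error variance (assumed)
v = np.array([1.0, 2.0, 0.5, 1.0])      # independent prior variances, beta_i ~ N(0, v_i)

def post_mean(idx):
    # posterior mean of beta under the model containing the columns in idx
    Xs = X[:, idx]
    A = Xs.T @ Xs / sigma2 + np.diag(1.0 / v[idx])
    return np.linalg.solve(A, Xs.T @ y / sigma2)

full = post_mean([0, 1, 2, 3])
# condition (17): submodel posterior means are coordinates of the full-model mean
assert np.allclose(post_mean([0, 2]), full[[0, 2]])
assert np.allclose(post_mean([1, 3]), full[[1, 3]])
```

The equality holds because, with X⊤X diagonal and an independent product prior, the posterior factors coordinate-wise, so dropping a variable does not change the posterior for the remaining coordinates.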
Example 2 (Continuous Spike-and-Slab Priors). The point-mass spike is not needed for the MPM to be optimal. Consider another example of (4.1), the Gaussian mixture prior of George and McCulloch (1993). While the MPM was originally studied for point-mass spike-and-slab mixtures, it is optimal also under these continuous mixture priors. Indeed, to give an alternative argument, note that in an orthogonal design the posterior mean of β_i under a given model γ depends only on γ_i; write m_1i and m_0i for the posterior means of β_i under the slab (γ_i = 1) and the spike (γ_i = 0), and p_i for the posterior inclusion probability of the i-th variable. The overall posterior mean then has i-th coordinate p_i m_1i + (1 − p_i) m_0i, and the criterion R(γ) in (1.2) can be written as

R(γ) = Σ_{i=1}^q d_i (m_1i − m_0i)^2 (γ_i − p_i)^2,

which is easily seen to be minimized by the MPM model.

Barbieri and Berger (2004) show that the MPM is optimal also for correlated regressors when considering a nested sequence of linear models M_γ(j) (j = 0, . . . , q), where the first j elements of the index set γ(j) are equal to one and the last q − j to zero. Again, one of the sufficient conditions is that the posterior mean β γ is obtained by taking the relevant coordinates of the posterior mean under the full model, which holds under, e.g., the g-prior. Here, we generalize the class of priors under which the MPM is optimal for nested correlated designs. Assume q < n and denote by A the upper triangular Cholesky factor such that X⊤X = A⊤A. Then transform the linear model to

Y = X*β* + ε, with X* = XA⁻¹ and β* = Aβ,

where ε ∼ N(0, σ2 I_n). Note first that, since A⁻¹ is upper triangular, the nested sequence of models is unchanged; the parameterizations within each model have changed, but only by transforming the variables inside the model. We thus have the same nested model selection problem.
Next note that (X*)⊤X* = I_q, so the likelihood factors into independent likelihoods for the β*_i; and this independence holds within each of the nested models, since the columns of X* are orthonormal. Thus, if the prior is chosen to be π(β*) = Π_{i=1}^q π_i(β*_i), then it follows from the earlier considerations in this section that the median probability model is optimal. For example, assuming β* ∼ N(0, D) for some diagonal covariance matrix D, we obtain generalizations of the g-prior for which we already know that the MPM is optimal.
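The properties of this transformation are easy to verify numerically. The sketch below (our own check, with an arbitrary random design) confirms that X* = XA⁻¹ has orthonormal columns and that A⁻¹ is upper triangular, so the nested model structure is preserved.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 20, 3
X = rng.standard_normal((n, q))
# upper-triangular Cholesky factor A with X'X = A'A
# (numpy returns the lower factor L with X'X = L L', so A = L')
A = np.linalg.cholesky(X.T @ X).T
assert np.allclose(A.T @ A, X.T @ X)

Xstar = X @ np.linalg.inv(A)
# orthonormal columns: (X*)'X* = I_q
assert np.allclose(Xstar.T @ Xstar, np.eye(q))
# A^{-1} is upper triangular, hence span(x*_1,...,x*_j) = span(x_1,...,x_j)
# for every j: the nested sequence of models is unchanged
assert np.allclose(np.tril(np.linalg.inv(A), -1), 0.0)
```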

Discussion
The paper consists of two quite different parts. One part (mostly Section 4) focuses on generalizing previous theorems concerning the optimality of the median probability model. In addition to the generalizations therein, a number of other generalizations are suggested in the paper for the case in which groups of variables are orthogonal to others. Here are three such results, whose proofs are essentially obvious.
Result 1. If one group of variables is orthogonal to another, then finding the MPM and the optimal procedure can be done separately for each group of variables.

Result 2. If a variable is orthogonal to all others, it can be separated from the problem and handled on its own, and will belong in the optimal model if its inclusion probability is bigger than 1/2.

Result 3. If two groups of orthogonal variables each have a nested structure, then the median probability model is optimal and can be found separately in each group.
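Result 1 can be checked numerically. The sketch below (our own illustration, assuming a normal slab of variance v, a known error variance, and equal prior model probabilities, none of which are specified in the result itself) compares posterior inclusion probabilities computed jointly over all 16 models of four variables with those computed separately within each of two mutually orthogonal groups; the two computations agree exactly.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 40
# four covariates with orthonormal columns, so the two groups
# {x1, x2} and {x3, x4} are orthogonal to each other
X, _ = np.linalg.qr(rng.standard_normal((n, 4)))
y = X @ np.array([2.0, 0.0, -1.5, 0.0]) + rng.standard_normal(n)
sigma2, v = 1.0, 4.0  # known error variance and slab variance (assumed)

def log_marg(idx):
    # log marginal likelihood: y | gamma ~ N(0, sigma2*I + v * X_g X_g')
    S = sigma2 * np.eye(n)
    if idx:
        Xg = X[:, idx]
        S = S + v * Xg @ Xg.T
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (logdet + y @ np.linalg.solve(S, y))

# joint computation over all 16 models, equal prior model probabilities
models = list(itertools.product([0, 1], repeat=4))
lm = np.array([log_marg([i for i in range(4) if m[i]]) for m in models])
post = np.exp(lm - lm.max())
post /= post.sum()
incl_joint = np.array([sum(p for m, p in zip(models, post) if m[i])
                       for i in range(4)])

def group_incl(group):
    # inclusion probabilities from the 4 models involving one group only
    ms = list(itertools.product([0, 1], repeat=2))
    lmg = np.array([log_marg([group[j] for j in range(2) if m[j]]) for m in ms])
    pg = np.exp(lmg - lmg.max())
    pg /= pg.sum()
    return np.array([sum(p for m, p in zip(ms, pg) if m[i]) for i in range(2)])

incl_sep = np.concatenate([group_incl([0, 1]), group_incl([2, 3])])
# the group-wise and joint inclusion probabilities coincide
assert np.allclose(incl_joint, incl_sep)
```

The agreement reflects the fact that, with orthogonal groups and independent priors, the marginal likelihood factorizes over the groups, so the MPM can be found separately in each.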
In spite of the considerable generalizations of optimality afforded by Section 4 and these related results, the extent to which the median probability model is guaranteed to be optimal is still rather limited. Hence the second goal of the paper was to study the extent to which the MPM fails to be optimal. This was done in two ways: first, by looking at "worst cases," where the number of highly correlated variables grows; and second, through the numerical study over the feasible range of correlations. The theoretical conclusions (besides Section 2.5) were obtained for the cases of both fixed and random variance. Note that the extreme case of perfect correlation is where we would expect the performance of the median probability model to be most challenged, and several sections of the paper focused on this case.