A Loss-Based Prior for Variable Selection in Linear Regression Methods

. In this work we propose a novel model prior for variable selection in linear regression. The idea is to determine the prior mass by considering the worth of each of the regression models, given the number of possible covariates under consideration. The worth of a model consists of the information loss and the loss due to model complexity. While the information loss is determined objectively, the loss expression due to model complexity is ﬂexible and, the penalty on model size can be even customized to include some prior knowledge. Some versions of the loss-based prior are proposed and compared empirically. Through simulation studies and real data analyses, we compare the proposed prior to the Scott and Berger prior, for noninformative scenarios, and with the Beta-Binomial prior, for informative scenarios.


Introduction
In this paper, we propose a method to derive model prior probabilities for variable selection problems in linear regression. The obtained prior, in its general form, is designed to penalise complex models and, therefore, to favour sparsity. We focus on the general case in which the total number of covariates, d, is smaller than the number of observations n. The prior we propose is based on losses and it is compared with existing options, including the Beta-Binomial prior (George and McCulloch, 1993), where a beta prior is defined over the (unknown) covariate inclusion probability, and a particular case of it known as the Scott and Berger prior (Scott and Berger, 2010). The priors are compared by using simulated data as well as real data sets: the Hald data (Woods et al., 1932) and the human micro-array gene expression data in colon cancer patients (Calon et al., 2012).
Variable selection problems, in the Bayesian framework, are in line with any other inferential procedure. That is, a posterior distribution for the space of models is obtained in order to represent the posterior uncertainty about the true regression model (see Gelman et al. (2004)). There may be instances where the above is not appropriate, for example if there are models with a negligible posterior probability, in which case a subset of all the possible regression models can be considered. With a prior distribution on the space of models, representing the model uncertainty related to variable selection, one way to proceed is by using Bayesian model averaging (Hoeting et al., 1999). When the The paper is organised as follows. In Section 2 we define the notation used throughout the paper and formalise the problem of variable selection for linear regression models in a Bayesian framework. In Section 3 we discuss the current model priors for variable selection (the Beta-Binomial prior and its particular case the Scott and Berger prior) and we present the proposed prior based on losses. The results of simulation studies are provided in Section 4. In the section we examine the performance of the considered priors on the basis of the frequentist results of the corresponding model size posterior distributions. Section 5 reports the analysis results using real data sets widely discussed in literature. The final Section 6 concludes and provides some discussion points on the proposed prior and its comparison with the other priors.

Notation and problem specification
Given the vector y of n responses, the design matrix X of size n × d, an intercept α and a vector of coefficients β β β of dimension d, the response outcome y i is expressed as where ε i ∼ N (0, 1/φ) are the i.i.d. normally distributed errors with unknown variance 1/φ. We assume that the number of observations n is larger than the number of covariates (i.e. n > d), and the design matrix is of full rank. The variable selection problem can be seen as identifying which of the possible d covariates has impact on y. In other words, we aim to identify which of the regression parameters β j s are different from zero. Let us consider the binary vector γ, where the j-th element is zero if β j = 0 and one if β j = 0. Then, the generic Bayesian regression model is indicated by where f (y|α, β β β γ γ γ , φ) = N (y|α + X γ γ γ β β β γ γ γ , I/φ), and π γ γ γ (α, β β β γ γ γ , φ) represents the prior distribution for the parameters of the model, the socalled model-specific parameter prior. Note that |γ| (the number of ones in γ γ γ) indicates the number of covariates included in the model M γ . There are 2 d possible regression models and each one of them identified by γ γ γ.
The number of possible regression models grows exponentially with d. When d is large the model posterior probabilities are often small for most models, and posterior inclusion probabilities could give a better idea of the posterior uncertainty in comparison to model posterior probabilities. The posterior inclusion probability of the j-th covariate is defined as Prior and posterior inclusion probabilities originate from the common idea in Bayesian variable selection to consider variable inclusions as exchangeable Bernoulli trials, with ω ∈ [0, 1], implying where |γ| represents the number of covariates included in the model M γ .

536A Loss-Based Prior for Variable Selection in Linear Regression Methods
Model-specific parameter prior Prior choice on model-specific parameters has received much attention, and well-received priors include the Zellner-Siow prior (Zellner and Siow, 1980), the Zellner's g-prior (Zellner, 1986), the mixtures of g-priors (Liang et al., 2008), and the more recent robust prior by Bayarri et al. (2012), among others. The robust prior has the advantage of yielding closed-form marginal likelihoods and to not suffer from the information paradox (Liang et al., 2008). The robust prior is defined as where Σ γ γ γ is the covariance matrix of the maximum likelihood estimator of β β β γ γ γ . The distribution of g is given by where a, b > 0 and ρ γ γ γ ≥ b/(b + n). The prior (5) is called robust as its tails behave like the tails of a multivariate t density, therefore less sensible to outliers.
In this paper, as the focus is solely on model priors, most of the analysis are performed by using the same model-specific parameter prior, the robust prior with hyperparameter values a = 1/2, b = 1 and ρ γ γ γ = 1/(d + 1), as recommended in Bayarri et al. (2012), so differences in the results can be ascribed to differences in the model prior. For large d simulations and the large real data analysis we have used the g-Zellner prior, for computational convenience.

Model priors in objective variable selection
We begin this section with a short summary about the existing priors, then describe the loss-based prior in Section 3.1 followed by the choices for the penalty factor in Section 3.2 and the generalized version in Section 3.3.
In principle, the choice of the prior on the model space should incorporate any prior knowledge about the subsets of covariates which should be included in the model. A common way of achieving the above result is to subjectively determine ω in equation (4) (George and McCulloch, 1993). Furthermore, by subjectively fixing ω, so to represent the proportion of covariates that one believes should be included in the model, will induce multiplicity correction (Scott and Berger, 2010). To address this issue, Cui and George (2008) suggested a beta prior distribution on ω where B(·) is the beta function and a, b are the parameters of the beta prior. Thus, by setting a and b, one could represent prior knowledge about the true proportion of the covariates that should be included in the model.
Alternatively, one may wish to adopt an objective approach and, to the best of our knowledge, the choice of model prior probabilities which convey minimal information is limited to two options: the uniform prior and the Scott and Berger prior.
The model uniform prior is obtained by assigning equal prior mass to each regression model, that is p(M γ γ γ ) = 1/(2 d ) for any γ γ γ, and it yields a prior inclusion probability of ω = 1/2. Scott and Berger (2010) discuss the following model prior for variable selection Their model prior is obtained by assigning a beta prior to ω, with both the hyperparameters equal to one, and then marginalising over ω. This prior was previously discussed in the literature, see for example Ley and Steel (2009), with the aim of representing prior minimal information. The motivation behind the choice of the model prior by Scott and Berger (2010) lies in its property to correct for multiplicity, which can be seen as an issue when model choice is performed by multiple statistical testing with respect to a reference model (typically the null model or the full model). The Scott and Berger prior induces a marginal prior inclusion probability of ω = 1/2 for each covariate, same as the uniform model prior. However, as thoroughly discussed in Scott and Berger (2010), given that their prior is a function of d, it allows the multiplicity correction.
Both the uniform prior and the Scott and Berger prior assume that the expected number of covariates is d/2.

Model prior based on losses
The model prior based on losses has been introduced by Villa and Walker (2015), where model selection problems involving nested and non-nested models have been discussed. However, in Villa and Walker (2015) model complexity was not considered. The basic idea is that we can assign a worth to each model by objectively measuring what is lost if the model is removed from the space of models, and it is the true one. While in Villa and Walker (2015) the worth of a model was associated to a measure of loss in information only, in cases of variable selection it is sensible to include a component of loss due to the complexity of the model.
We introduce the idea to assign prior mass to models by means of the following illustration (from Villa and Walker (2015)). Let us consider three trivial models M j = {f j (x|θ j ), π j (θ j )}, for j = 1, 2, 3. Here each model represents a single density. We also assume that models M 1 and M 2 (and so densities f 1 and f 2 ) are very similar, and that the third model M 3 (density f 3 ) is significantly different from the other two. We do not question the rational behind this scenario set up, we just assume that there is one.
By analysing this scenario in the light of the utility of each model we note that the worth of models M 1 or M 2 is less than the one of model M 3 . In fact, should we lose either M 1 or M 2 , we would still have the remaining one to "represent" that position in the set of all possible models. On the other hand, M 3 would be more valuable, as its removal from the set of choices would lead to bad inference if it turned out to be the true model.
Having identified this approach to assign the mass to each model on the basis of worth, we see it takes into consideration the "position" of each model with respect to the others. The quantification of the worth comes from a result in Berk (1966) which says that, if the model is misspecified, the posterior distribution asymptotically tends to accumulate at the nearest model in terms of the Kullback-Leibler divergence. Therefore, if we were to remove model M j from the set of possible models, and it is the true one, the loss we would incur is given by the Kullback-Leibler divergence from it to the nearest of {f i }, i = j. Thus, by defining the Kullback-Leibler divergence between M j and M i by That is, the larger the value of min j =i D KL (f j f i ) the greater the utility (or, equivalently, the smaller the loss) of keeping the model.
If we consider the mass to be put on each model P (M j ), this can be linked to the worth of the model via the self-information loss function. The self-information loss function (also known as the log-loss function in machine learning) measures the performance of a probability statement with respect to an outcome. Thus, for every probability assignment P = {P (A), A ∈ Ω}, the self-information loss function is defined as l(P, A) = − log P (A).
More details and properties of this particular loss function can be found, for example, in Merhav and Feder (1998). Therefore, for each model M j we have a measure of the information loss related to its worth, given by (8), and related to the self-information, given by − log P (M j ). We then equate the two losses, yielding In other words, the mass that we assign to each model is proportional to the exponential of the Kullback-Leibler divergence between the model and the nearest one in the set of options.
We can now take the basis of the idea to regression models, that is in a variable selection scenario.
If we consider the regression model M γ γ γ , with the simplified notation θ θ θ γ γ γ = (α, β β β γ γ γ , φ), the loss in information associated to the model, equivalent to equation (8), can be written as where f (y|θ θ θ γ γ γ ) represents the regression distribution of M γ γ γ , that is, the regression model which is the most similar to f (y|θ θ θ γ γ γ ). We then see that the loss in information associated to model M γ γ γ is the expected minimum Kullback-Leibler divergence between M γ γ γ and the nearest one, where the expectation is taken with respect to the prior π γ γ γ (θ θ θ γ γ γ ), representing the prior uncertainty about the true values of α, β β β γ γ γ and φ.
As anticipated, to fully describe the worth of a regression model it is also necessary to take into considerations the complexity of the model. For the regression model M γ γ γ , we denote the loss due to complexity by L C (M γ γ γ ), which it is determined as follows. If we keep model M γ γ γ in the space of models, the loss would be proportional to the number of covariates that have to be considered and measured. Therefore, the loss of keeping a linear regression model increases with the number of covariates it contains, and we have where L(·) represents a loss and U(·) a utility.
The loss component due to complexity is easily fit in our framework and the model prior for M γ γ γ is In other words, the prior is constructed based upon a cumulative loss with a component representing the loss in information and a component representing the loss due to complexity.
The following Theorem 3.1 (which proof is in Appendix A in the Supplementary Material (Villa and Lee, 2019)) shows the expression of the minimum Kullback-Leibler divergence between regression models. (2), with design matrices, respectively, X γ γ γ and X γ γ γ .

linear normal regression models as in
Theorem 3.1 shows that the minimum Kullback-Leibler divergence between any two linear regression models is zero, regardless to the number of covariates in the models. This means that, in variable selection problems for linear regression models, there is 540A

Loss-Based Prior for Variable Selection in Linear Regression Methods
no loss in information in selecting the "wrong" model, as such the model prior in (10) becomes It is therefore equation (12), for a suitable c, that will be used in the paper to represent the prior uncertainty on the space of regression models.

Setting the constant c
The proposed prior in (12) depends on the constant c, which can be interpreted as the penalty factor on the number of covariates included. So, the question is how to calibrate c, noting that it controls the rate at which the prior decreases as the model size increases, as well as the way the prior behaves. Here we discuss some ideas, intended to help the analyst, but we do not claim to be exhaustive. Our intention is to provide some insights on how the choice of c can be carried out on the basis, for example, whether prior information about the covariates to be included in the model is available, or if multiplicity correction is a matter of concern. There are fundamentally three options; • c is fixed to a specific value. This can be either on the basis of any available initial knowledge about the number of covariates to be included in the model or in agreement to some criterion.
• c is a function of d.
• Adopting a hierarchical approach, a prior is assigned on c. Here as well it is possible to reflect any available prior information in the hierarchical structure or opt for a noninformative choice.
Fixed c If the constant c is fixed to a specific value which does not depend on the total number of covariates in the problem (see the next paragraph below), there is no multiplicity correction unless this reflects some prior knowledge (Scott and Berger, 2010). Although one may agree that multiplicity correction is one way to define model prior probabilities in an objective sense, it is argued that this is not a necessary condition for a model prior to satisfy, in particular if the problem of interest is related to prediction. Furthermore, this is not an issue should one believe that for variable-selection problems the approach should be in line with the Bayesian framework of having prior and posterior probability representing, respectively, prior and posterior uncertainty.
To understand how the constant c acts like a penalty factor, we note that as c → 0, p(M γ γ γ ) → 1/{2 d } for all models, yielding the uniform prior. As c gets larger, the prior assigns more and more mass to sparse models, and the rate at which the prior drops to zero increases.
The noninformative criterion we propose to set c is based on maximising the expected loss with respect to c. From equation (12), let K = |γ|; then the prior on model size K = k given c is which, normalised, gives Then, the expectation of K given c > 0 is and the expected loss, in terms of c, is given by We note that (14) goes to 0 for c → 0 and for c → ∞. As such, we can maximise the expected loss in terms of c. Differentiating d dc In our simulation study, the choice is c ≈ 1.2785 and the sensitivity raised by c is numerically examined using real data sets in Section 5. We noted the conservative aspect that the above choice induces a prior inclusion probability of ω = 0.22, as it can be seen from equation (15) below.
Function of d The idea is to identify some specific functions of the total number of covariates under examination that have desirable properties. Having then c dependent on d, the prior corrects for multiplicity (Scott and Berger, 2010). First, we note that it is advisable to have a prior capable to produce sensible results even for large values of d. This is dictated to the obvious fact that modern regression models can easily include thousands (and more) covariates. One choice could simply be to set c = d; however, even for moderate values of d, the prior will exhibit an extremely fast decrease to zero with the consequence of assigning most of the mass to the very sparse models. Another possible choice would be to set c = d −1 . While this function would allow to avoid the previous undesirable behaviour, the consequence is that the prior will rapidly converge to a uniform prior, with all the negative caveats discussed in Scott and Berger (2010). We believe that a sensible choice is to have c = log(d). In a scenario of sparsity, most of the considered covariates will bring little (if not zero) information; as such, a desirable property of the prior would be to not "implode" as d grows, which is the case of, for example, the above choice of c = d. At the same time, it would be desirable to have a prior that it is sensible to minimal changes for small d. A process that exhibit this behaviour is the logarithmic process, which shows rapid growth when it is small and slow growth when it is large. Hence, having c = log(d) appears to be a sensible choice, and results for simulated data in Section 4 appear to provide empirical support for this choice.
Hierarchical approach The third approach to calibrate c consists in assigning a prior distribution to c; thus, obtaining for a suitable density π(c). Although setting, for example, c ∼ Ga(a, b), for some a, b > 0, represents a sensible choice, the resulting prior would be analytically intractable and, moreover, the calibration of the hyperparameters a and b is not straightforward. A more interpretable approach is as follows. First, we note that c>0, k = 0, 1, . . . , d.
The hyperparameters p and q can be chosen to reflect prior information about ω (and hence about c), if available. A noninformative option would be to assign a uniform prior to ω, corresponding to a Generalised Beta distribution with p = q = 1, giving π(ω) = 2, for ω ∈ (0, 1/2). The prior of the regression model, from (18) above, M γ γ γ becomes Should one be interested in expressing the prior distribution with respect to the number of covariates included in the regression model, K = k, one has to consider with ω = (1 + e c ) −1 . We have, by assigning a Generalised Beta distribution to ω, and, for p = q = 1, that is the noninformative choice, Figure 1 compares the proposed prior with c = 1.2785, c = log(d) and the hierarchical loss-based to the Scott and Berger prior for d = 30. In other words, it compares noninformative choices of P (k). Whilst Scott and Berger prior has a symmetrical behaviour, the proposed prior assigns more mass to the more simple models than to the more complex ones, as expected from expression (12). We note how the different choice of c impacts the rate at which the prior mass decreases as the model becomes more complex. In particular, for c = log(d) the prior assigns the highest mass to the null model with a quick drop in the prior probability for already moderate values of the number of covariates. While the choice of setting c = 1.2785 allows for a more distributed prior mass among the sparse model. The hierarchical approach, as it can be seen from the plot, yields a prior distribution that is relatively high for values of the number of covariates smaller than d/2, to decrease towards zero afterwards. It is the above characteristics of the proposed prior of assigning more pass to the lower region of the parameter space (although with different behaviours) that makes the loss-based prior more suitable in scenarios where preference is put on the more sparse models, when compared to the Scott and Berger prior. 7 (with d = 30). The parameters, numerically obtained, of the Generalised Beta are, respectively, p = 1, 6, 8, 8 and p = 14, 23, 16, 9.

Simulation study
In this section we present the results of simulation studies designed to assess the performance of the proposed prior and to compare it to the Scott and Berger's prior. We consider different scenarios, in terms of number of covariates, sample size as well as whether prior information is available or not.
It is well known that a variable selection problem is driven by both the choice of the model prior and of the model-specific parameter prior. However, here the interest is in the effects on variable selection determined by the prior probability on the space of models. As such, the simulation exercise described has the purpose to analyse the frequentist properties of the posterior distribution on the model size and, to minimise any possible effects of the model-specific parameters prior, we choose to use the robust prior for the parameter of the regression.
In the first simulation study, the performances of the priors for various scenarios are compared considering the posterior means squared error from the mean (MSE), the coverage of the posterior 95% credible interval and the frequency rate for identifying the true model by the HPM. The detailed results are reported in Tables in Appendix B (included in the Supplementary Material), while graphical analysis of the MSE and the coverage will be presented in the current Section. The second simulation study is limited to relatively sparse models and it consider the case where prior information, in terms of the mean number of covariates the model should include, is available and it is reflected in the choice of prior hyperparameter(s). The simulation considers correct prior information as well as inaccurate prior knowledge.

Non-informative simulation
The study involves 250 experiments and each experiment uses four data sets, one for each d = 5, 10, 15, 30, and repeated for n = 50 and n = 100. The procedure of each experiment follows; • Generate a design matrix X of size n × d where each element is an independent realisation of a standard normal distribution; • Generate a binary vector γ γ γ from a sequence of d independent Bernoulli experiments with probability of success equal to ω; • From the robust prior in (5), generate the vector of coefficients β β β γ γ γ ; • Generate the response vector from the regression model in (1), considering φ = 1; • Using the above values of the design matrix and the vector of responses, compute the necessary quantities, including the marginal likelihoods, the model posteriors and the model size posterior distribution. The last step of the above procedure has been performed under the Scott and Berger prior and the loss-based prior with the three proposed methods to calibrate c, that is setting c = 1.2785 (so to maximise the expected loss), c = log(d) and the hierarchical approach as described in the previous section, where the parameters of the Generalised Beta have been both set to one. These choices for c are made for the case in which no prior information about the true model size is available, but sparsity is expected.

546A Loss-Based Prior for Variable Selection in Linear Regression Methods
The simulation result using the loss-based prior and Scott and Berger prior are summarized in Appendix B (included in the Supplementary Material). Five values of ω are considered to examine small to large model sizes. For each model prior, the coverage of the 95% credible interval of the posterior (Coverage), the mean squared error of posterior mean (MSE Mean) and frequency rate of the HPM equals to the true model (Freq. True) from 250 experiments are estimated. Finally, we note that for d = 30 we have used a Gibbs search to explore the space of models resulting in relatively low frequencies for the true model, as well as a larger variability in the different performances of the priors.
The coverage (Figures 3 and 4) under the Scott and Berger prior appears to be the most stable across sample size and model size, although it is mostly above the nominal value of 95%. The prior we propose appears to have a relatively low coverage when ω = 0.75; this is particularly obvious for the prior with c = log(d) set up, and it is due to the fact that the distribution goes to zero (for large k) much faster than the other two loss-based options. However, the loss-based prior shows an overall performance in the coverage closer to the nominal value than the Scott and Berger's. For n = 100, as one would expect, the differences between the priors becomes smaller in comparison to the case n = 50; again, we notice that the loss-based prior with c = log(d) has the worst performance when ω = 0.75.
Considering the HPM when it refers to the true model, we note that all the priors exhibit a similar pattern (Appendix B in the Supplementary Material). In fact, for any value of d, the frequency the true model is identified decreases as the average model size (i.e. ω) increases. For n = 50 the frequencies between the priors have more variability when compared to the case n = 100. Furthermore, more variability in the performance is observed for d = 30; as in this case the inference is performed through a random exploration of the model space (i.e. Gibbs), a higher degree of uncertainty is included in the process by the fact that the model space itself is very large.
Comparing the MSE in Figure 5 and Figure 6, we note the following. First, as expected, the MSE is larger for n = 50 then for n = 100 as the information in the data increases. For d = 5, that is for a small model space, the priors tend to be quite similar, overall. We note that the Scott and Berger prior tends to perform better for relatively large models, while the loss-based priors with c = 1.2785 and c = log(d) have smaller MSE for ω = 0.05. The loss-based prior with a hierarchical approach has the largest MSE for relatively smaller models, and this behaviour is consistent for any n and any d. In fact, its behaviour of spreading most of its mass evenly in the lower part of the parameter space of k, renders it weak in dealing with models with a relatively small number of covariates. When d is either 10 or 15, we note a similar pattern in the remaining priors (the Scott and Berger's and the loss-based with c = 1.2785 and c = log(d)) as to when d = 5. Differences are a bit more accentuated, in particular for small sample sizes, but the overall performances are quite stable. As we have observed for the frequency of "guessing" the true model, when d = 30 the random search contributes in increasing the variability of the differences. For example, the Scott and Berger prior does not have the smallest MSE for ω = 0.75.

548A Loss-Based Prior for Variable Selection in Linear Regression Methods
Undoubtedly, the Scott and Berger prior will have a higher degree of efficiency when the true model is the full model (or a true model that contains a very large proportion of possible covariates). In fact, if one suspects that the true model is quite large, possibly this is the prior to employ. On the other hand, when the model tends to be sparse (or when it is thought that this is the case), the loss-based approach appears to give better results; at least, when we calibrate c = 1.2785 or c = log(d). The latter option might be preferable should one be concerned with multiplicity correction. The hierarchical approach here explored, when non information is included in the prior, is the less preferable, as it has been shown by the summary statistics considered.

Informative simulation
In this simulation study we analyse the performance of the loss-based prior when prior information about the true size of the model is available and we compare its performance with the Beta-Binomial prior with same mean and variance. For the coefficients, we have employed the 'robust' prior in (5). We have simulated 200 data sets of different sizes, n = 50, 100, with d = 5, 10, 15, covariates. We have started by considering the vector of coefficients β β β = (0.5, −0.5, 0, 0, −1) for d = 5 and added 5 and 10 extra null coefficients for the simulations with, respectively d = 10 and d = 15. As all scenarios yielded the same conclusion that the priors perform in the same manner, we show the details of the more complex ones only, that is for n = 50, 100 and d = 15. We proceeded as follows.
We assume to have prior information about the mean number of covariates that the true model includes, say m, and we look at scenarios where this information is correct, i.e. m = 3, and where it is inaccurate, i.e. m = 1, 5, 7. With this piece of information, we obtain the value of c for the loss-based prior in (12) by setting With the obtained value of c, we then derive the variance of the loss-based prior by applying For example, for d = 5 and m = 1, we have c = 2.64 and Var(K) = 0.31. We then obtain the values of a and b of the Beta-Binomial prior by equating the expectation and the variance of the distribution to m and Var(K), respectively, and solve with respect to the two parameters. Table 1 shows the values associated with the simulation study for d = 15 and m = 1, 3, 5, 7. The results for n = 30 and n = 100 are reported in Appendix C (included in the Supplementary Material) in graphical form. We note that the loss-based prior and Beta-Binomial prior, when they have same means and same variances, is almost identical (within the same scenario and for the same coefficient). As one would expect, the posterior inclusion probability reflects more accurately the true nature of a coefficient when n is relatively large; or when the true value is relatively different from zero (i.e. for β 5 ). If we consider the priors performance on the basis of the accuracy of the prior information m, we see that the posterior inclusion probabilities of the non-null coefficients is better for a large (although inaccurate) m in the case of n = 50. This is a consequence of having a prior that puts more mass on relatively large models; however, there is also a larger inclusion of null coefficients. For n = 100 the above effect is drastically reduced, in the sense that for m = 3 (the true value), the inclusion posterior probabilities, overall, reflect the true status of the regression model.
The Figures in Appendix C (refer to the Supplementary Material) contain also the results when the hierarchical loss-based prior is used (right violin plots). The parameters p and q of the Generalised Beta density have been chosen so to have mean m and variance similar to the one in Table 1 (as mentioned at the end of Section 3 above). We note that, for n = 50, the hierarchical loss-based prior has similar performances to the other priors when the coefficient is either 0 or -1 (i.e. β 3 to β 15 ). When β = ±0.5, i.e. β 1 and β 2 , its performance appear to be better when the prior information about the mean is either accurate or larger than the true one. In fact, we note that the means of the posterior inclusion probabilities are above 0.5. For m = 1, although the posterior inclusion probability yielded (in mean) is below the 0.5 threshold, it has a larger value when any of the other two priors is used. For n = 100, the above differences are still noticeable, but with the magnitude that is diminished by the increase in information in the data.
A remark is that the loss-based prior has only one parameter that has to be calibrated. This allows a single piece of information (e.g. the mean) to be easily reflected in the prior, while the Beta-Binomial prior and the hierarchical loss-based would need to fix one of the to parameters "freely". It is true that two parameters would allow to include an extra piece of information, such as variability, but in practice this information is not known or, at least to a practitioner, it is not easy to define.

Illustrative examples with real data sets
In this section we investigate the properties of objective model priors for variable selection in real data sets. The first considered data set is the Hald data (Woods et al., 1932), which have been extensively used in the literature (see Kubinyi (1996) and Liang et al. (2008), for example). The second data set considered concerns with the study of gene expression data in colon cancer patients (Calon et al., 2012). For both examples we compare the Scott and Berger prior with the three approaches for the loss-based prior discussed in Section 3.2.

Hald data
The Hald data set contains n = 13 observations with d = 4 covariates, and it concerns an engineering application to study the cement composition (Woods et al., 1932). In particular, the study considers the effect on the heat evolved per gram of cement (in calories) by the amount of tricalcium aluminate, the amount of tricalcium silicate, the amount of tricalcium aluminio ferrite and the amount of dicalcium silicate.
The summary statistics of the model size posterior distributions are presented in Table 2. The corresponding histograms of the posterior distributions are represented in Figure 7. The loss-based priors appear to yield similar posteriors for the number of covariates. In fact, the posterior distributions have very similar statistics and histograms. The Scott and Berger prior is more spread with a slightly higher mean. Although the above differences, all priors basically point in the direction of the same regression model for the Hald data set.
The above conclusion is supported by the information in Table 3, where we note that the MPM is the same under each prior with posterior inclusion probabilities that clearly suggest the inclusion of both the Tricalciums and the exclusion of the remaining two covariates. In the table we have also reported the posterior probabilities associated to the 552A   highest posterior density (HPD) interval. Again, the conclusions are in direction on the above model, but we note that the loss-based prior (under each method) yields a posterior probability definitely larger than the once yielded by the Scott and Berger prior.  Table 3: Posterior inclusion probabilities for the Hald data set. The covariates with a posterior inclusion probability greater than 1/2 are highlighted in bold, and a dot notation represents the covariate included in the highest posterior probability model. The prior compared are: S&B (Scott and Berger), LB-ml (loss-based with c = 1.2785, to maximise the expected loss), LB-lg (loss-based with c = log(d)) and LB-gb (hierarchical loss-based using the Generalised Beta GB(1,1)). The table includes the posterior probability of the HPM under each prior.

Large data set analysis
Data sets with a large number of covariates are, nowadays, widespread. It is therefore important to illustrate how the proposed method deals with a practical problem with a large size, in terms of covariates. We illustrate the prior performance with the human micro-array gene expression data in colon cancer patients. This data set was originally discussed in Calon et al. (2012) and it consists of d = 172 predictors for a total of n = 262 patients. The aim is to identify which genes have an effect on the expression of transforming growth factor-beta (TGFB). Although the whole data set consists of 10, 172 genes, we limit the dimension as we are working under the assumption that n > d, and the first 172 genes are the key ones for a preliminary analysis (Rossell and Telesca, 2017). The analysis has been performed by running 10000 simulations, with a burn-in period of 500, using the Gibbs-search mechanism implemented into the 'BayeVarSel' R-package under the g-Zellner prior with g = d 2 . Posterior statistics are summarised in Table 4, while Table 5 shows the probeset identifiers (which can be associated to genes ESM1, GAS1, HIC1, CILP and IGFBP3) contained in the HPMs under each prior distribution.
When we consider a large data set we note important differences in the priors. Both Scott and Berger and the loss-based prior with c = log(d) give similar results. In particular, in Table 4 we see that the posterior statistics are virtually the same and, from Table 5, that they both identify as the "best" model the one with four probesets. When we consider the loss-based prior with a value of c chosen so that the expected loss is maximised, the posterior statistics appear to suggest a slightly larger model than the previous one, which is supported by the fact that the extra probeset '212143 s at' is included in the model; although the inclusion posterior probability is only 0.51. Differently, when the loss-based hierarchical prior is adopted, the inferential process results in what is a different outcome; in fact, the posterior statistics show a wider posterior distribution for the number of covariates and probeset '212143 s at' is included but with a larger posterior probability (0.75) compared to the above one.   Table 5: Posterior inclusion probabilities for the gene expression data set. The covariates with a posterior inclusion probability greater than 1/2 are highlighted in bold, and a dot notation represents the covariate included in the highest posterior probability model. The prior compared are: S&B (Scott and Berger), LB-ml (loss-based with c = 1.2785, to maximise the expected loss), LB-lg (loss-based with c = log(d)) and LB-gb (hierarchical loss-based using the Generalised Beta GB(1,1)). The table includes the posterior probability of the HPM under each prior.

Discussion
This paper introduces a novel prior distribution for the model space in variable selection for linear regression. The prior is based on the idea that, if the "wrong" model is chosen, we incur in a cumulative loss with two components: one represents the loss in information and one related to the complexity of the model expressed by its size. The proposed prior, in its general form, exhibits an exponential decay which depend on the number of covariates included in the model and that can be controlled by the calibration of a constant c. It is therefore possible to reflect any prior information into the prior by setting the constant accordingly. For example, in Section 4.2 we discuss how the constant c can be set up so to reflect a prior expected number of covariates that should be included in the regression model. We also discuss how prior information can be included through a hierarchical approach, in particular by means of a Generalised Beta hyperprior.
It is also possible to represent minimal information in the prior by choosing particular values of the parameters of the prior. In the paper, besides showing through simulation how choices of c perform, we have discussed some general guidelines on how it can be fixed, considering or not considering correction for multiplicity, and how c can be either directly fixed or modelled through an appropriate prior density in a hierarchical set up.
The simulation studies are carried out to show frequentist performance of the proposed model prior relative with selective choices on c and they are compared to the Scott and Berger prior (Scott and Berger, 2010), when the true model is relatively sparse. In the case where prior information about the model size is available, we show how the constant c can be easily calibrated so to reflect, for example, prior information about the mean number of covariates one believes should be included in the model.
When it comes to real data analysis, we note a fair closeness of the results obtained by using the proposed prior with the one of Scott and Berger's when the size of the problem is small (Hald data). Both MPM and HPM represent the same regression model under each prior, although the loss-based priors result in a higher posterior probability for the HPM than Scott and Berger's.