Projective Inference in High-dimensional Problems: Prediction and Feature Selection

This paper discusses predictive inference and feature selection for generalized linear models with scarce but high-dimensional data. We argue that in many cases one can benefit from a decision-theoretically justified two-stage approach: first, construct a possibly non-sparse model that predicts well, and then find a minimal subset of features that characterize its predictions. The model built in the first step is referred to as the \emph{reference model} and the operation in the second step as predictive \emph{projection}. The key characteristic of this approach is that it finds an excellent tradeoff between sparsity and predictive accuracy, and the gain comes from utilizing all available information, including the prior and the information carried by the left-out features. We review several methods that follow this principle and provide novel methodological contributions. We present a new projection technique that unifies two existing techniques and is both accurate and fast to compute. We also propose a way of evaluating the feature selection process using fast leave-one-out cross-validation that allows for easy and intuitive model size selection. Furthermore, we prove a theorem that helps to understand the conditions under which the projective approach can be beneficial. The benefits are illustrated via several simulated and real-world examples.


Introduction
Predictive inference and feature selection for generalized linear models (GLMs) in problems with scarce data but a high-dimensional feature space (a regime known as "small n, large p" 1) remain topics of active research. Often, albeit not always, the goals are twofold: the desire is to find a model that predicts unseen data well but utilizes only a small subset of the features, thereby facilitating interpretation and making the model more convenient to use at prediction time.
A vast variety of different approaches have been proposed. Frequentist approaches typically formulate an estimator with a penalty that enforces sparsity in the solution (e.g., Breiman, 1995; Tibshirani, 1996; Fan and Li, 2001; Zou and Hastie, 2005; Candes and Tao, 2007). A useful overview has been written by Hastie et al. (2015). Among Bayesians, the most common approach is to use a sparsifying prior that favors solutions with a small number of active predictors (e.g., George and McCulloch, 1993; Raftery et al., 1997; Ishwaran and Rao, 2005; Johnson and Rossell, 2012; Carvalho et al., 2010). These approaches do not automatically produce truly sparse solutions, since there is always a nonzero probability of each feature being included in the model, but sparse models can be obtained, for instance, by removing features whose estimated posterior effect falls below a certain threshold (Barbieri and Berger, 2004; Ishwaran and Rao, 2005; Narisetty and He, 2014).
All these approaches attempt to solve the two problems, prediction and feature selection, simultaneously. In this paper we argue that in many situations one can gain if these problems are solved in two stages: first finding a model that predicts well (without caring about true sparsity), and then finding a minimal subset of features that provides similar predictions as this model, which we shall call the reference model. This strategy not only resolves many issues that one encounters in traditionally used Bayesian approaches (as we will discuss in Sec. 2) but has also shown empirically very good performance compared to many other methods, with a good tradeoff between sparsity and predictive accuracy (Piironen and Vehtari, 2017a). Our discussion is mainly from the Bayesian viewpoint but aims to provide several insights also for non-Bayesian readers, since the reference model approach is not intrinsically limited to the Bayesian paradigm.
A piece of pioneering work in this line was carried out by Lindley (1968), who considered prediction in the linear regression model when some of the features are unavailable at prediction time. A related but slightly different approach was proposed by Goutis and Robert (1998) and Dupuis and Robert (2003), who introduced the concept of projecting the posterior information of the reference model onto smaller submodels, although they were mainly interested in feature selection and less in predicting with the submodels. A few related papers are due to Nott and Leng (2010) and Tran et al. (2012), who introduced variants of the original projection of Goutis, Dupuis and Robert. These ideas are also very closely related to the frequentist technique known as "preconditioning" for feature selection (Paul et al., 2008). Also the approach of Hahn and Carvalho (2015) utilizes essentially the same idea but formulates the projection in a slightly different way. It is worth pointing out that the same conceptual idea of replacing a large complex model with a smaller one has also been used successfully in the neural networks literature, where it is known as "model compression" or "knowledge distillation", although there the interest is in reduced memory cost and faster out-of-sample predictions rather than feature selection (Bucilǎ et al., 2006; Hinton et al., 2015).
In general we assume that the phenomena we are modeling are so complex that the true model is not included in the list of models under consideration. Hence we adopt the M-open view, although the projective approach partially uses also the M-completed view. 2 More specifically, in the first stage, where we attempt to construct a model that predicts well, we assume the M-open case. If we are able to construct a sensible model that passes model assessment and checking (see, e.g., Gelman et al., 2013), we can use that model as the reference model in the M-completed setting. We assume the reference model is our best description of the future data, and in the M-completed setting the predictive performances of the other models are evaluated with respect to that reference model. The use of a reference model reduces the variance in the model selection in the same way as the use of model assumptions reduces the uncertainty in ordinary data modeling. Finally, as we do select a single model, we estimate the effect of the selection process in M-open style using cross-validation (see Section 5).

Our contributions
This paper makes several contributions, which are summarized as follows:
• We review the aforementioned projection techniques under unified notation, illustrate their differences in detail, and give recommendations about the preferred approach depending on the problem at hand.
• We develop a new type of projection-called clustered projection-that can be considered as a unification of the approach of Goutis, Dupuis and Robert and that of Tran et al., and show that it gives a good balance between speed and accuracy.
• We propose a new efficient method for validating the selection process using approximate leave-one-out (LOO) cross-validation. This technique can be used to assess the predictive accuracy of the submodels which allows for intuitive model size selection.
• We discuss the typical difficulties encountered with the traditional Bayesian approaches via small examples and show how the projective approach yields much more satisfactory results.
Since an extensive comparison showing the superiority of the projection (in terms of sparsityaccuracy tradeoff) to many other Bayesian model selection strategies over a variety of data sets has already been carried out earlier (Piironen and Vehtari, 2017a), here we focus only on some of the most commonly used techniques and illustrate via small examples why they are problematic.
• We discuss in detail the connection of the projection to the popular Lasso estimator (Tibshirani, 1996), together with several empirical results that demonstrate the benefit of the proposed approach in the "small n, large p" setting.
• We prove a theorem that, to the best of our knowledge, for the first time gives a theoretical argument for why, and under which conditions, the use of a reference model can be beneficial for parameter learning in linear models.
• We provide an R software package projpred that implements all the discussed methods. The package is freely available and makes the method easily accessible to a wide audience. 3

Why does a reference model improve feature selection?
We begin with a simple example that motivates why the use of a reference model can be helpful for feature selection. Although the details are different, this example is greatly inspired by the one presented by Paul et al. (2008). Assume we have collected n measurements of p features x_j, j = 1, . . . , p, along with measurements of some target variable y. Assume also that the data are generated according to the following mechanism:

f ∼ N(0, 1),
y | f ∼ N(f, 1),
x_j | f ∼ N(√ρ f, 1 − ρ), j = 1, . . . , p_rel,        (1)
x_j ∼ N(0, 1), j = p_rel + 1, . . . , p.

The target variable values y are noisy observations of the latent function values f, which are drawn randomly from a standard Gaussian distribution. The first p_rel features x_j are also noisy observations of the latent function f, which makes them correlated and, on average, equally predictive about y. The multiplier √ρ and the noise variance 1 − ρ are chosen so that the marginal variance of each x_j is 1 and the pairwise correlations between the first p_rel features are all equal to ρ. The rest of the features are drawn randomly from a standard normal distribution and are thus uncorrelated with each other and irrelevant for predicting y.

Suppose our goal is to assess how predictive each of the features is about the target variable. A simple strategy would be to compute the sample correlation R(x_j, y) between each feature and the target variable and then rank the features based on the absolute values |R(x_j, y)|. Since the features are related to the target variable via the latent f, clearly our task would be easier if we had access to the noiseless values f instead of the noisy ones y, since the additional noise weakens the correlations, that is, |Cor(x_j, y)| < |Cor(x_j, f)| for j = 1, . . . , p_rel. In practice we do not observe f directly, but intuitively, if we could build a model whose output f* is fairly close to the true f, we might expect to benefit from making the assessment based on the sample correlations R(x_j, f*) instead of R(x_j, y). Figure 1 illustrates this idea.
The left graph shows the absolute sample correlations |R(x j , y)| versus |R(x j , f )| for one data realization from (1) with p = 500, p rel = 150, n = 30 and ρ = 0.5. The relevant features (red dots) are much better separated from the irrelevant ones (gray dots) when we consider their correlation with f instead of y. The right graph demonstrates that this holds also when we replace the unknown f with predictions f * of a reference model we can actually compute. Here the reference fit is obtained by Bayesian linear regression of y on the first three supervised principal components of all the features (the procedure is discussed in detail in Sec. 6). Figure 2 shows that this pattern holds for a wide range of values for ρ and p rel . Parameter ρ describes how strongly the relevant features are predictive about y, so when ρ is close to 1, they all are almost perfect copies of f and therefore easy to distinguish from the noise features. On the other hand when ρ gets smaller, the predictive power of the relevant features decreases and hence they are more difficult to identify. It is quite remarkable that above ρ = 0.4 the reference model approach gives nearly oracle results.
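The screening comparison behind Figure 1 is easy to reproduce. The sketch below (plain NumPy; the variable names are ours, not the paper's) generates one data set from mechanism (1) and contrasts screening against the noisy target y with screening against the latent f, which here stands in for an ideal reference model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, p_rel, rho = 30, 500, 150, 0.5

# Data-generating mechanism (1): latent f, noisy target y,
# p_rel relevant features correlated with f, the rest pure noise.
f = rng.normal(0.0, 1.0, size=n)
y = f + rng.normal(0.0, 1.0, size=n)
X = np.empty((n, p))
X[:, :p_rel] = np.sqrt(rho) * f[:, None] + rng.normal(0.0, np.sqrt(1 - rho), size=(n, p_rel))
X[:, p_rel:] = rng.normal(0.0, 1.0, size=(n, p - p_rel))

def abs_corr(X, target):
    # absolute sample correlation of each column of X with the target
    Xc = X - X.mean(axis=0)
    tc = target - target.mean()
    return np.abs(Xc.T @ tc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(tc))

r_y = abs_corr(X, y)   # screening against the noisy target
r_f = abs_corr(X, f)   # screening against the latent f (oracle reference)

# Separation between relevant and irrelevant features under each screening:
gap_y = r_y[:p_rel].mean() - r_y[p_rel:].mean()
gap_f = r_f[:p_rel].mean() - r_f[p_rel:].mean()
print(gap_y, gap_f)
```

In practice f is unknown, so one would substitute the fit f* of an actual reference model, as done in the figure; the point is only that correlating against a denoised signal widens the gap between relevant and irrelevant features.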

Note on the terminology
To avoid confusion, it is useful to distinguish between two different problems, both of which could be considered "feature selection":
1. Find a minimal subset of features that yields a good predictive model for y, so that adding more features does not considerably improve predictive accuracy.
2. Identify all features that are predictive about (that is, statistically related to) the target variable y.
In the remainder of this paper, we shall focus on the first problem. The latter problem-which is often a considerably more difficult one-is usually referred to as multiple hypothesis testing, and different means are more suitable for solving that. Still, as the previous example illustrates (Sec. 1.2), we expect the reference model approach to be beneficial also there. We shall touch upon this issue in the final discussion (Sec. 9.2).

Traditional Bayesian approaches
This section briefly reviews some of the most common Bayesian approaches for inference with a large number of features and highlights their main difficulties.

Sparsifying priors
Consider the standard Gaussian linear regression model

y | x, β, σ² ∼ N(βᵀx, σ²),        (2)

where x is the p-dimensional vector of features, β contains the corresponding regression coefficients and σ² is the noise variance. A very popular Bayesian approach for assessing the relevance of the different features is to assign a sparsifying prior on each β_j, and then perform the relevance assessment based on the marginal distribution of each β_j (see Sec. 2.2). A popular prior choice is the spike-and-slab, which is often written as a mixture of two Gaussians

β_j | λ_j, c, ε ∼ λ_j N(0, c²) + (1 − λ_j) N(0, ε²),
λ_j | π ∼ Ber(π),        (3)

where ε ≪ c, and the indicator variable λ_j ∈ {0, 1} denotes whether the coefficient β_j is close to zero (comes from the "spike", λ_j = 0) or nonzero (comes from the "slab", λ_j = 1). The width of the spike ε is either taken to be exactly zero or set to a small positive value (George and McCulloch, 1993; Ishwaran and Rao, 2005). The prior inclusion probability π is either fixed (typically to π = 0.5) or given a hyperprior such as π ∼ U(0, 1) (Ishwaran and Rao, 2005). In some variants, the Gaussians are replaced by more heavy-tailed distributions, such as Laplacians (Johnstone and Silverman, 2004).
A popular alternative to the spike-and-slab is to formulate the prior for the β_j as a continuous mixture of Gaussians. This approach can be computationally more appealing and can avoid some issues that are due to sensitivity to the choices of ε, c and π. Several such priors have been proposed (e.g., Carvalho et al., 2010; Armagan et al., 2011; Bhattacharya et al., 2015; Bhadra et al., 2017), but the most popular one is probably the horseshoe

β_j | λ_j, τ ∼ N(0, τ²λ_j²),
λ_j ∼ C⁺(0, 1),        (4)

which has been shown to possess several attractive properties and has enjoyed great empirical success (Carvalho et al., 2009; Polson and Scott, 2011; van der Pas et al., 2014). The intuition is that the global scale τ drives all the coefficients toward zero, while the thick Cauchy tails of the local scales λ_j allow some of the coefficients to escape the shrinkage. Piironen and Vehtari (2017c) proposed an extension of the formulation (4), called the regularized horseshoe

β_j | λ_j, τ, c ∼ N(0, τ²λ̃_j²),  λ̃_j² = c²λ_j² / (c² + τ²λ_j²),
λ_j ∼ C⁺(0, 1),        (5)
c² ∼ Inv-Gamma(ν/2, νs²/2),

which introduces an additional regularization parameter c that brings the characteristics of the horseshoe even closer to those of the spike-and-slab (3). The idea is that unlike in the original horseshoe, where the largest coefficients are only very weakly penalized (the horseshoe has Cauchy tails), here they face a regularization equivalent to a Student-t slab with scale s and ν degrees of freedom. For a fixed but finite slab width c = s (obtained by letting ν → ∞), the prior is operationally similar to the spike-and-slab (3) with the same c, whereas the original horseshoe (4) (obtained by letting also s → ∞) resembles the spike-and-slab with infinite slab width c → ∞ (see Piironen and Vehtari, 2017c, for the derivations, more detailed discussion and illustrations). This additional regularization is useful if the parameters are weakly identified (e.g., coefficients in separable logistic regression) and often robustifies and speeds up the Markov chain Monte Carlo (MCMC) posterior inference.
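To see what the extra parameter c does, one can sample coefficients from the original and regularized priors and compare their tails. The snippet below is a rough sketch of formulation (5) with a fixed slab width c (i.e., the ν → ∞ limit), not the full hierarchical model:

```python
import numpy as np

rng = np.random.default_rng(1)
tau, c = 0.1, 2.0  # global scale and (fixed) slab width

# Local scales: heavy-tailed half-Cauchy draws, as in the horseshoe
lam = np.abs(rng.standard_cauchy(100000))

# Regularized horseshoe: effective local scale is softly capped so that
# tau * lam_tilde never exceeds the slab width c
lam_tilde_sq = c**2 * lam**2 / (c**2 + tau**2 * lam**2)

beta_hs = tau * lam * rng.normal(size=lam.shape)                     # original horseshoe
beta_rhs = tau * np.sqrt(lam_tilde_sq) * rng.normal(size=lam.shape)  # regularized

print(np.max(np.abs(beta_rhs)), np.max(np.abs(beta_hs)))
```

The effective scale τλ̃_j is bounded by c, so extreme draws are softly truncated to the slab width, while small coefficients are shrunk essentially as under the original horseshoe.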
It is possible to place a prior on the global parameter τ based on sparsity assumptions, analogous to the prior on π in the spike-and-slab (3). Under certain assumptions, Piironen and Vehtari (2017b,c) showed that to concentrate prior mass onto solutions where p₀ coefficients are far from zero, most of the prior mass for τ should be concentrated near the reference value

τ₀ = p₀ / (p − p₀) · σ / √n.        (6)

A recommended weakly informative prior is then τ | σ ∼ C⁺(0, τ₀²), which we shall also use throughout this paper unless otherwise stated.
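As a worked example of the reference value (6), assuming the formula as reconstructed above:

```python
import math

def tau0(p0, p, sigma, n):
    """Reference value (6) for the horseshoe global scale: prior guess of
    p0 nonzero coefficients out of p, noise std sigma, n observations."""
    return p0 / (p - p0) * sigma / math.sqrt(n)

# e.g. expecting ~5 relevant features out of 500 with unit noise and n = 100:
print(tau0(5, 500, 1.0, 100))  # a small value, pulling most coefficients toward zero
```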

Bayes factors and marginal posterior relevance assessment
It should be made explicit that neither the spike-and-slab (3) nor the (regularized) horseshoe (5) performs actual feature selection in the sense that some of the variables would have an exactly zero coefficient with probability one, which is the case for many of the non-Bayesian penalized estimators (see Sec. 4). Although often overlooked, the actual selection problem can remain highly non-trivial even after successfully fitting the model with a sparsifying prior.
In the spike-and-slab literature, the actual selection is most often carried out either by selecting the most probable feature combination (that is, using Bayes factors) or by selecting those features with posterior inclusion probability above some threshold, typically 0.5, although several thresholding rules have been proposed (Ishwaran and Rao, 2005; Narisetty and He, 2014). The selection based on posterior inclusion probabilities is known to yield a submodel which minimizes the expected squared predictive error under some fairly strict and unrealistic assumptions, most notably that the model and prior are correct and that the features are orthogonal (Barbieri and Berger, 2004). An analogous decision rule based on the posterior estimates of the so-called shrinkage factors could also be devised for the horseshoe (Carvalho et al., 2010). This is essentially equivalent to simply investigating the marginal posteriors of the regression coefficients and then choosing those features whose coefficient posterior mass is significantly away from zero at some pre-defined credible level.
Unfortunately, both the Bayes factors and the marginal relevance assessment have difficulties that make them unsatisfactory in our opinion. First of all, posterior inference via MCMC for the multimodal posterior resulting from one of the sparsifying priors can be a great challenge in high-dimensional feature spaces. Even when posterior inference is not a problem, the prior sensitivity of the Bayes factors has long been known (see, e.g., Jeffreys, 1961; Kass and Raftery, 1995), and the approach does not lend itself to the continuous shrinkage priors. In addition, for a large number of features p the Bayes factors typically have high Monte Carlo errors, due to the fact that only a vanishingly small proportion of the 2^p models is visited during MCMC, and almost all models are not visited at all. The relevance assessment based on the marginal posteriors, on the other hand, can produce unintuitive results in the case of correlated features, since it can happen that the marginals of two or more coefficients overlap with zero while the joint distribution is clearly distinguished from zero (see Sec. 2.3). Another major issue is that neither of these approaches provides a satisfactory answer to how to perform post-selection inference for the selected model, in particular, how to make inference and predictions after the selection conditional on all the information available. This also makes it problematic to perform a tradeoff analysis between the number of included variables and the model accuracy (that is, how much predictive accuracy would be gained or lost if one or more features were included or excluded). For an example of how the projective approach can improve predictions with the selected model even when marginal posterior probabilities are used for selecting the features, see Figure 6 in Piironen and Vehtari (2017a).

An illustrative example
We illustrate the difficulties with the marginal relevance assessment discussed in Section 2.2 using data similar to the introductory example, see Equation (1). We generated one data realization with n = 50 observations for three different numbers of features, p = 4, p = 10 and p = 50, each using ρ = 0.8 and p_rel = p/2, so in each case the first half of the features was truly relevant. For illustration purposes, we did this by first generating the data for p = 4 and then adding the appropriate number of relevant and irrelevant features for the cases p = 10 and p = 50. This way, the realized values of the first two relevant features x_1 and x_2 and the target variable y did not vary between the three data sets, which lets us illustrate how the total number of features p affects the relevance assessment of these two features.
A Bayesian linear regression model was fitted to these data with three different priors on the regression coefficients:
• Gaussian: β_j | τ ∼ N(0, τ²) with τ ∼ C⁺(0, 1)
• Regularized horseshoe (RHS) with p₀ = 1, ν = 4, s² = 1 (see Eq. (5) and (6))
• Spike-and-slab (SS) 4 with π ∼ U(0, 1)

Figure 3 visualizes the posterior medians and credible intervals for the regression coefficients under the Gaussian and RHS priors, along with the marginal posterior inclusion probabilities for the different features obtained from the SS posterior. With only p = 4 features and the Gaussian prior, both x_1 and x_2 are detected to be relevant, as the marginal posteriors of β_1 and β_2 are distinguished from zero. As the number of features grows, the marginals become more overlapping with zero, and with p = 50 the marginals of all the relevant features substantially overlap with zero. The same applies to the RHS prior; in fact, its marginals appear to concentrate around zero even faster than under the Gaussian prior. Also for the SS prior, the marginal inclusion probabilities generally decrease for all the relevant features as the dimensionality grows, and for p = 50 only one of them just barely has probability over 0.5. Notice how the marginals of the coefficients of the relevant variables are not substantially different from those of the irrelevant ones when p = 50, regardless of the prior.
The reason for this behaviour is quite simple: as the number of features carrying similar information grows, the coefficients of most of the relevant features can be set to zero as long as one (or a few) of them obtains a nonzero coefficient. In other words, none of the features is so precious that it could not be removed, and therefore the marginals of all the features become more overlapping with zero.

Figure 3: For the Gaussian and RHS priors, the graphs show the posterior median (dots) with 50% and 90% credible intervals (thick and slim lines, respectively) for the regression coefficients β_j. For the SS prior, the graphs show the posterior inclusion probabilities for each variable. As the dimensionality p increases, all the marginals start to overlap with zero, and the SS posterior inclusion probabilities get smaller.

Figure 4 further illustrates what happens to the posterior of β_1 and β_2 when the dimensionality changes. For p = 4, where x_1 and x_2 are the only relevant features, the posterior dependency between their coefficients is very strong; if one of the coefficients is set to zero, then the other one must be large. As the number of features p grows, the posterior dependency between β_1 and β_2 becomes weaker; when there are many features that carry similar information as x_1 and x_2, both coefficients can be set to zero because there are many substitutes. The results for p = 50 really summarize why the marginals and the pairwise posterior plots can be very challenging to interpret and even misleading: x_1 and x_2 have a correlation of ρ = 0.8 and their correlations with y both exceed 0.6, 5 yet there is no apparent posterior dependency and both marginals clearly overlap zero! Figure 5 confirms that these observations are not due to cherry-picking a specific data set: for each of the three priors, the relative proportion of data sets where at least one feature is found to be significant goes down as p increases.
With Gaussian and RHS priors this probability is already fairly close to zero with p = 50, and even with SS we fail to find any relevant features in about half of the data sets. The exact proportions are naturally dependent on the selected thresholding rules (posterior probability of 0.5 for SS, and credible level 0.95 for Gaussian and RHS), but these do not affect the main conclusions.

Figure 4: In each graph, the observed data for x_1, x_2 and y are exactly the same; only the prior and the total number of features p vary. Notice how the marginal posteriors are always more overlapping with zero than the joint posterior. As the dimensionality increases (in particular, when the number of features correlating with x_1 and x_2 increases), the joint posterior becomes closer to the product of the two marginals and more overlapping with zero.

Figure 5: Simulated example: relative proportion of data sets where at least one feature is found to be significant. With the SS prior a feature is considered significant if its posterior inclusion probability exceeds 0.5, and with the Gaussian and RHS priors if its coefficient is either positive or negative with posterior probability 0.95 or more. The results are computed from 50 randomly generated data sets generated according to (1) with n = 50, ρ = 0.8 and p_rel = p/2. Vertical bars denote one standard error intervals.

Why not use cross-validation for selecting the feature combination?
Cross-validation (CV) and information criteria (IC) are widely used generic methods for estimating the predictive performance of essentially any learning algorithm. One might wonder why not use them also for feature selection. While it is certainly true that, for example, cross-validation can be a robust and convenient method for comparing a few competing models, in feature selection the number of model comparisons quickly becomes impractically large even for a relatively small number of candidate features. The computational burden of fitting a large number of models becomes an obvious problem, especially if a Bayesian approach with MCMC is used for inference. Another complication is that the prior needs to be specified separately for each model. In the projective approach (Sec. 3), by contrast, the prior must be specified only for the reference model, and projecting the reference model posterior onto the submodels is usually vastly faster than running MCMC for the submodels.
Another problem that is not always so well understood is that when many models are compared using cross-validation, the selection process is liable to overfitting, which can lead to the selection of a non-optimal model due to the relatively high variance of the cross-validation estimates. We have discussed this in detail in our earlier work (Piironen and Vehtari, 2017a), where we also show that the projective approach is considerably more resilient to this phenomenon. The selection-induced bias has also been discussed by other authors; see for example Ambroise and McLachlan (2002), Reunanen (2003) and Cawley and Talbot (2010).

Predictive projection
This section discusses the projective approach in detail. We start by describing the projective idea in general, and then discuss the exponential family models and GLMs as special cases.

Remarks on notation
We shall denote the training data by D. The 'tilde' notation is used to denote future measurements; for example, the symbol ỹ denotes an unseen measurement of y. To simplify notation, we use ỹ_i to denote a new observation at the ith observed feature values x_i, which allows us to drop the conditioning on x_i from the conditional distributions. Notice, though, that ỹ_i is in general different from the observed y_i.

General idea
In generic terms, posterior projection refers to a procedure of replacing the posterior distribution p(θ* | D) of the reference model with a simpler distribution q⊥(θ) that is restricted in some way. For example, in the feature selection context for GLMs, this would mean constraining some of the regression coefficients to be exactly zero. In general the domain of the projected parameters θ ∈ Θ can, and typically will, be different from the domain of the reference model parameters θ* ∈ Θ*. For this reason, it is not meaningful to define the projection directly via the discrepancy between p(θ* | D) and q⊥(θ). Instead, a natural approach is to define it via the discrepancy between the induced predictive distributions

δ(q⊥) = E_θ*[ KL( p(ỹ | θ*) ‖ q⊥(ỹ) ) ] = const − E_θ*{ E_ỹ|θ*[ log E_θ[ p(ỹ | θ) ] ] },        (7)

where q⊥(ỹ) = E_θ[ p(ỹ | θ) ] is the predictive distribution induced by q⊥(θ). Here E_θ*( · ), E_ỹ|θ*( · ) and E_θ( · ) denote expectations over p(θ* | D), p(ỹ | θ*) and q⊥(θ), respectively. The optimal projection of the posterior p(θ* | D) from parameter space Θ* to Θ, in terms of minimal predictive loss, would then be the distribution q⊥(θ) that minimizes the functional (7). In practice, minimizing this is difficult even for relatively simple models and projected posteriors q⊥(θ), due to the many expectations, but expression (7) serves as the ideal when re-formulating the projection in a more tractable way. Below we define three different projections.

Practical projection techniques
Draw-by-draw

Instead of trying to minimize the functional (7) assuming some parametric form for q⊥(θ), we can obtain an easier optimization problem by formulating the projection as a pointwise mapping from a given θ* ∈ Θ* to θ⊥ ∈ Θ as

θ⊥ = arg max_{θ∈Θ} E_ỹ|θ*[ log p(ỹ | θ) ].        (8)

For models where the predictions are conditioned on some set of observed predictors x̃, one takes the average of (8) over the distribution of the predictors. As the distribution of the future predictors p(x̃) is typically not available, the expectations over it are most conveniently approximated by a sample mean over the observed {x_i}, i = 1, . . . , n. This results in the projection equation

θ⊥ = arg max_{θ∈Θ} (1/n) Σ_{i=1}^n E_ỹ_i|θ*[ log p(ỹ_i | θ) ],        (9)

which is the original formulation of Goutis and Robert (1998) and Dupuis and Robert (2003) (they used minimization of the KL divergence in their formulation, but this is equivalent to maximizing the expected log-likelihood in Eq. (9)). Given draws {θ*^s}, s = 1, . . . , S, from the posterior p(θ* | D), we can project each of them separately via (9) to obtain the corresponding draws {θ⊥^s} in the projection space Θ. These can be thought of as draws from a projected posterior distribution q⊥(θ) (although this may not be available analytically), and hence they are used exactly as we would use posterior draws for that particular submodel. The appealing property of the draw-by-draw projection is that it is computationally feasible for many commonly used models such as GLMs, because the optimization problem has the same form as the problem of finding the maximum likelihood parameter values (see Sec. 3.5). The introduced projection error, or loss, is then defined as the average loss over the draws

δ = (1/S) Σ_{s=1}^S (1/n) Σ_{i=1}^n KL( p(ỹ_i | θ*^s) ‖ p(ỹ_i | θ⊥^s) ).        (10)

Single point (one cluster)

The draw-by-draw projection (above) maps each parameter value θ* into a corresponding value θ⊥ in the projection space. The single point projection (which is a special case of the clustered projection that we will introduce in a moment) instead maps the whole posterior p(θ* | D) into a single value θ⊥.
This can be obtained from (7) by assuming that q⊥(θ) is a point mass at θ ∈ Θ, taking the expectation over the predictors x̃, and then optimizing the expression with respect to θ:

θ⊥ = arg max_{θ∈Θ} (1/n) Σ_{i=1}^n E_ỹ_i[ log p(ỹ_i | θ) ].        (11)

This is the formulation of Tran et al. (2012). Notice that (11) is otherwise the same as (9) except that here the expectation is computed over the posterior predictive distribution of the reference model, that is, p(ỹ_i | D) = E_θ*[ p(ỹ_i | θ*) ]. In practice the expectation E_ỹ_i( · ) is approximated using the posterior draws. Equation (11) can be used to compute optimal point estimates in the projection space. Also, when Θ = Θ*, it computes the optimal predictive point estimate in the original parameter space (for a related approach, see Bernardo and Juárez, 2003). It is worth noticing that in general the result is often different from the usual point estimates, such as the posterior mean or median. The benefit of the single point projection over the draw-by-draw projection is that it is much lighter computationally. For instance, for GLMs (Sec. 3.5), solving (11) has the same computational complexity as solving (9), and since the latter must be solved separately for each of the S posterior draws, the single point projection essentially reduces the computations by a factor of S. Another benefit of formulation (11) is that it allows convenient search techniques, such as the Lasso-type L1 penalty, to be used for finding good submodels (Tran et al., 2012). We will discuss this more closely in Section 4. The drawback is that it can be somewhat less accurate than the draw-by-draw projection, meaning that the predictive accuracy of the submodel can be compromised. To address this point, we introduce the clustered projection below.
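For the Gaussian observation model, the projections above reduce to least squares on the reference model's fit: maximizing the expected log-likelihood gives β⊥ as the least-squares fit of the reference fit Xβ* on the submodel's columns, with the discarded part of the fit absorbed into the projected noise variance. A sketch under this assumption (the function and variable names are ours):

```python
import numpy as np

def project_draw(X, S, beta_star, sigma2_star):
    """KL-project one reference draw (beta_star, sigma2_star) of a Gaussian
    linear model onto the feature subset S. For the Gaussian case this is
    least squares of the reference fit on the submodel's columns."""
    f_star = X @ beta_star                  # the reference model's fit
    XS = X[:, S]
    beta_perp, *_ = np.linalg.lstsq(XS, f_star, rcond=None)
    # the projected noise variance absorbs the unexplained part of the fit
    sigma2_perp = sigma2_star + np.mean((f_star - XS @ beta_perp) ** 2)
    return beta_perp, sigma2_perp

rng = np.random.default_rng(2)
n, p = 40, 6
X = rng.normal(size=(n, p))
draws = [(rng.normal(size=p), 1.0) for _ in range(100)]  # stand-in posterior draws
S = [0, 2]                                               # a candidate feature subset
projected = [project_draw(X, S, b, s2) for b, s2 in draws]
```

Calling project_draw once per posterior draw corresponds to the draw-by-draw projection (9); calling it once on the posterior-mean fit corresponds to the single point projection (11).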

Clustered
The clustered projection is our novel approach that can be thought of as a unification of the draw-by-draw and single point projections. In this approach one clusters the posterior draws $\{\theta_*^s\}_{s=1}^S$ of the reference model into $C$ clusters $\{\theta_*^s : s \in I_c\}$, $c = 1,\dots,C$, and then performs a single point projection within each cluster. Here $I_1,\dots,I_C$ denote the index sets that indicate which draw belongs to which cluster (we discuss in a moment how to come up with such a division). The projection for the $c$th cluster then becomes
\[ \theta_\perp^c = \arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \mathrm{E}_{\tilde{y}_i\mid I_c}\!\left[\log p(\tilde{y}_i\mid\theta)\right], \tag{12} \]
where $\mathrm{E}_{\tilde{y}_i\mid I_c}(\,\cdot\,)$ denotes the predictive distribution of the reference model computed over the posterior draws in cluster $I_c$. In other words,
\[ \mathrm{E}_{\tilde{y}_i\mid I_c}\!\left[ h(\tilde{y}_i) \right] = \frac{1}{|I_c|} \sum_{s\in I_c} \mathrm{E}_{\tilde{y}_i\mid\theta_*^s}\!\left[ h(\tilde{y}_i) \right] \]
for any function $h(\tilde{y}_i)$. 6 Solving (12) for each of the $C$ clusters yields a set of projected parameters $\{\theta_\perp^c\}_{c=1}^C$.
6. Here we are slightly abusing the notation by using the symbol E( · ) to denote sample mean computed over a finite number of posterior draws, but we do this to simplify the notation.
Each of these is given a weight $\omega_c$ proportional to the number of draws in that cluster, $\omega_c = |I_c|/S$, and these weights are taken into account when computing expectations over the projected posterior. For example, the projected predictive density at a future $\tilde{y}$ is then given by
\[ q_\perp(\tilde{y}) = \sum_{c=1}^C \omega_c\, p(\tilde{y}\mid\theta_\perp^c). \]
More generally, the expectation of an arbitrary function $h(\theta_\perp)$ over the projected posterior is calculated as $\sum_{c=1}^C \omega_c\, h(\theta_\perp^c)$. A simple but generic and effective approach is to cluster the draws $\{\theta_*^s\}_{s=1}^S$ based on the expected values they impose for $\tilde{y}$ in the unconstrained (latent) space. That is, if $f_s = g(\mathrm{E}(\tilde{y}\mid\theta_*^s))$, where $\tilde{y} = (\tilde{y}_1,\dots,\tilde{y}_n)$ and $g(\cdot)$ denotes the link function, we cluster the vectors $\{f_s\}_{s=1}^S$. This approach is convenient since it makes the clustering independent of the dimensionality of the parameter space of the reference model, and since in practice the projection requires only the vectors $f_s$ (see Sec. 3.4 and 3.5), we can perform the clustering with access only to the predictions of the reference model (without access to the actual parameter values). As the clustering algorithm, we use k-means. An alternative approach would be to optimize the locations of the projected parameters $\{\theta_\perp^c\}_{c=1}^C$ jointly using, for example, the method of Snelson and Ghahramani (2005), but this is computationally much more expensive.
Both the draw-by-draw (9) and the single point projection (11) are obtained as special cases of the clustered projection (12). The draw-by-draw approach is obtained by setting the number of clusters C equal to the number of posterior draws C = S and assigning each posterior draw into its own cluster. The single point projection is obtained by setting C = 1 and assigning all draws into the same cluster. The benefit of the clustered projection is that it improves the accuracy compared to the single point (one cluster) projection already with a small number of clusters, and thereby gives a good tradeoff between speed and accuracy. We will illustrate this with an example in Sec. 7.1.
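The clustered projection can be sketched as follows for the Gaussian linear case of Section 3.5, using a minimal hand-written k-means (Lloyd's algorithm) on stand-in fitted vectors $f_s$; the data, fits, and cluster count are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_sub, S, C = 40, 2, 200, 10

X = rng.normal(size=(n, p_sub))           # submodel feature matrix
F = rng.normal(size=(S, n)) + 2.0         # stand-in fits f_s of the reference model

# Minimal k-means (Lloyd's algorithm) on the fitted vectors f_s
centers = F[rng.choice(S, C, replace=False)]
for _ in range(25):
    d = ((F[:, None, :] - centers[None]) ** 2).sum(-1)   # (S, C) distances
    labels = d.argmin(axis=1)
    for c in range(C):
        if (labels == c).any():
            centers[c] = F[labels == c].mean(axis=0)

# Per-cluster single point projection (Eq. 12) with weights w_c = |I_c| / S
betas, weights = [], []
for c in range(C):
    idx = labels == c
    if not idx.any():
        continue                           # skip empty clusters
    mu_c = F[idx].mean(axis=0)             # cluster-wise predictive mean
    betas.append(np.linalg.lstsq(X, mu_c, rcond=None)[0])
    weights.append(idx.mean())

# Projected-posterior expectation of h(theta) is the weighted cluster average
beta_mean = np.average(betas, axis=0, weights=weights)
```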

Exponential family models
Assuming the observation model for $y_i$ belongs to the exponential family with canonical parameter $\xi_i$ and dispersion $\phi$, the log-likelihood has the form (McCullagh and Nelder, 1989, ch. 2)
\[ L_i = \frac{y_i \xi_i - B(\xi_i)}{A(\phi)} + H(y_i, \phi) \tag{14} \]
for some specific functions $A(\cdot)$, $B(\cdot)$ and $H(\cdot)$. Here the natural parameter is a function of the model parameters, $\xi_i = \xi_i(\theta)$. The maximum likelihood solution for the parameters $\theta$ reduces to
\[ \hat{\theta} = \arg\max_{\theta} \sum_{i=1}^n \left( y_i\, \xi_i(\theta) - B(\xi_i(\theta)) \right), \tag{15} \]
which does not depend on the value of the dispersion $\phi$ (the function $A(\phi)$ is assumed to be strictly positive). Let $\tilde{y}_i$ denote a new measurement at the $i$th observed feature values $x_i$. Now, if we denote the expected value of $\tilde{y}_i$ over some reference distribution as $\mu_i^* = \mathrm{E}(\tilde{y}_i)$, we can write the draw-by-draw, single point and clustered projections (Eq. (9), (11) and (12)) all as
\[ \theta_\perp = \arg\max_{\theta} \sum_{i=1}^n \left( \mu_i^*\, \xi_i(\theta) - B(\xi_i(\theta)) \right). \tag{16} \]
Thus when the observation model of the submodel is in the exponential family, the projection of the model parameters $\theta$ is equivalent to finding the maximum likelihood solution with the observed targets $y = (y_1,\dots,y_n)$ replaced by their expected values $\mu^* = (\mu_1^*,\dots,\mu_n^*)$ as predicted by the reference model. The projection can therefore be considered as "fitting to the fit" of the reference model. As discussed in Section 3.3, in the draw-by-draw projection these fitted values $\mu_i^*$ are computed separately for each posterior draw of the reference model, in the clustered projection separately for each cluster, and in the one cluster (single point) projection over the whole posterior with the parameters $\theta_*$ integrated out. Notice also that the projection of the parameters $\theta$ does not depend on the value of the dispersion parameter $\phi$.
It is worth emphasizing that this result assumes only that the observation model of the reduced model belongs to the exponential family. In particular, we are not making any assumptions about the observation model of the reference model (which need not belong to the exponential family), about the functional form of $\xi(\theta)$, or about how the reference fit $\mu^*$ is formed. In principle this means that the projection could be applied to a wide class of learning algorithms simply by plugging the fit of the reference model in place of the observed targets $y_i$ in maximum likelihood estimation. In practice, though, this does not work for nonparametric models such as Gaussian processes, where the parameters are the values $\xi_i$ themselves without further assumptions.
After computing the projected values for the model parameters $\theta$ (Eq. (16)), the dispersion $\phi$ is computed from
\[ \phi_\perp = \arg\max_{\phi} \frac{1}{n}\sum_{i=1}^n \mathrm{E}_{\tilde{y}_i}\!\left[ \log p(\tilde{y}_i \mid \theta_\perp, \phi) \right], \tag{17} \]
where $\theta_\perp$ denotes the projected model parameters obtained from (16). Again, in the draw-by-draw and clustered projections, the expectation in Equation (17) is computed separately for each draw or cluster, and in the single point projection by integrating over the whole posterior.

Generalized linear models
GLMs have their observation model in the exponential family, and thus the discussion of Section 3.4 applies. Let us first consider the projection onto a linear Gaussian model with feature matrix $X$, where the parameters are the regression coefficients $\beta$ and the dispersion is the noise variance $\sigma^2$. For simplicity, let us assume also that the reference model is a linear Gaussian model with feature matrix $Z$ and parameters $(\beta_*, \sigma_*^2)$, and that we have drawn a posterior sample $\{\beta_*^s, \sigma_{*,s}^2\}_{s=1}^S$. In terms of Equation (14), we have $\xi_i = x_i^\mathsf{T}\beta$, $B(\xi_i) = \xi_i^2/2$ and $A(\phi) = \phi = \sigma^2$. Consider now the clustered projection with $C$ clusters. As discussed in Section 3.4, the projection solution for $\beta$ within each cluster is obtained by plugging the fit of the reference model in place of $y$ into the familiar maximum likelihood solution
\[ \beta_\perp^c = (X^\mathsf{T} X)^{-1} X^\mathsf{T} \mu_*^c, \tag{18} \]
where $\mu_*^c = \frac{1}{|I_c|}\sum_{s\in I_c} Z\beta_*^s$ denotes the prediction within the $c$th cluster. In the single point projection ($C = 1$) this reduces to $\mu_* = \frac{1}{S}\sum_{s=1}^S Z\beta_*^s$, whereas in the draw-by-draw projection ($C = S$) we have $\mu_*^s = Z\beta_*^s$. After plugging (18) into (17), it is straightforward to show that the projection of the noise variance becomes
\[ \sigma_{\perp,c}^2 = \frac{1}{n}\sum_{i=1}^n \left( V_i^c + \left( \mu_{*,i}^c - x_i^\mathsf{T}\beta_\perp^c \right)^2 \right), \tag{19} \]
where $V_i^c$ denotes the predictive variance of $\tilde{y}_i$ in the reference model within the $c$th cluster. This is given by
\[ V_i^c = \frac{1}{|I_c|}\sum_{s\in I_c} \sigma_{*,s}^2 + \mathrm{V}_{s\in I_c}\!\left[ z_i^\mathsf{T}\beta_*^s \right], \tag{20} \]
where $\mathrm{V}_{s\in I_c}[\,\cdot\,]$ denotes the sample variance over indices $s \in I_c$. Result (19) has a natural interpretation: the projected noise variance is the average predictive variance of the reference model plus the mismatch between the projected and the reference model. Therefore any systematic variation in the data captured by the reference model but not by the reduced model is added to the unstructured noise term of the reduced model. Notice also that the predictive uncertainty of the projected model can never be smaller than that of the reference model, which shows why the projection guards against overfitting in the submodels. Above we assumed that also the reference model is linear with Gaussian noise.
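A minimal numpy sketch of the single point (one cluster) case of these formulas follows, with synthetic stand-in posterior draws; it also checks the property noted above, namely that the projected noise variance is never smaller than the average predictive variance of the reference model:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p_ref, p_sub, S = 50, 6, 3, 400

Z = rng.normal(size=(n, p_ref))           # reference feature matrix
X = Z[:, :p_sub]                          # submodel feature matrix
beta_star = rng.normal(size=(S, p_ref))   # stand-in draws of beta_*
sigma2_star = 0.5 + rng.random(S)         # stand-in draws of the noise variance

F = beta_star @ Z.T                       # draw-wise fits Z beta_*^s, shape (S, n)

# Single point projection, in the spirit of Eq. (18)-(20)
mu = F.mean(axis=0)                       # reference predictive mean
V = sigma2_star.mean() + F.var(axis=0)    # predictive variance per point (Eq. 20)
beta_proj = np.linalg.lstsq(X, mu, rcond=None)[0]             # Eq. (18)
sigma2_proj = np.mean(V + (mu - X @ beta_proj) ** 2)          # Eq. (19)

# Projected predictive uncertainty cannot fall below that of the reference
assert sigma2_proj >= V.mean()
```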
As already pointed out in Section 3.4, we emphasize that Equations (18) and (19) hold even without these assumptions. For instance, $\mu_*^c$ could come from an arbitrary model, such as a Gaussian process (GP), a neural network or some complex simulation model, and in the projection we investigate how much accuracy is sacrificed by replacing it with a linear model. Even when the reference model does not account for uncertainty in $\mu_*$, that is, when no clustering can be made, the single point projection is always available for the reference fit $\mu_*$. Also, the reference model noise could be non-Gaussian (Student-$t$, for instance), but we could still project this model onto a Gaussian noise model.
When the observation model of the projected model is non-Gaussian or when the link is non-identity, the maximum likelihood solution is not available analytically, and therefore no closed form solutions for the projected regression coefficients or dispersion parameters exist. For solving the regression coefficients, the standard approach then is the iteratively reweighted least squares (IRLS) algorithm, where each of the log-likelihood terms $L_i$ is replaced by a pseudo Gaussian observation whose mean and variance are determined either by a second order Taylor series expansion of $L_i$ (e.g., Gelman et al., 2013, ch. 16.2) or by a linear approximation to the link function (McCullagh and Nelder, 1989, ch. 2.5) at the current iterate (with canonical link functions the two approaches are equivalent). The process is then iterated until convergence. Given the solution for the regression coefficients, one can then plug it into Equation (17) and solve for the corresponding value of the dispersion (which might also require an iterative procedure).

Search strategies
Due to the combinatorial explosion, even for a relatively small number of features it is impossible to go through all the combinations when searching for the optimal reduced model of a given size. Therefore one has to rely on approximate search heuristics for exploring promising submodels. Probably the simplest alternative is a forward stepwise search. This procedure starts from the model with only the intercept term and sequentially adds the feature that decreases the projection error the most. Forward search can be used together with any of the three projection techniques presented in Section 3.3 and often works well, but it can be computationally expensive for a large number of features.
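The forward search can be sketched as follows for a Gaussian submodel, where the projection reduces to least squares on the reference fit (Sec. 3.5) and the mean squared projection residual serves as a proxy for the projection loss; the data and the stand-in reference fit `mu_star` are hypothetical:

```python
import numpy as np

def projection_loss(X_sub, mu):
    """Squared-error proxy for the projection loss of a Gaussian submodel."""
    beta = np.linalg.lstsq(X_sub, mu, rcond=None)[0]
    return ((mu - X_sub @ beta) ** 2).mean()

rng = np.random.default_rng(8)
n, p = 60, 8
X = rng.normal(size=(n, p))
mu_star = X[:, 2] * 2.0 + X[:, 5]          # stand-in reference fit

# Forward stepwise search: greedily add the feature that reduces
# the projection loss the most
chosen, remaining = [], list(range(p))
for _ in range(3):
    losses = [projection_loss(X[:, chosen + [j]], mu_star) for j in remaining]
    best = remaining[int(np.argmin(losses))]
    chosen.append(best)
    remaining.remove(best)
```

With the stand-in fit built from features 2 and 5, the search should pick those two first, illustrating that the greedy criterion tracks the projection loss rather than the observed targets.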
In the case of the single point projection (11), a viable alternative is to use either a Lasso-type $L_1$-penalization (Tibshirani, 1996) or the more general elastic net penalty (Zou and Hastie, 2005), which contains the $L_1$-penalty as a special case. The single point projection for GLMs with the elastic net penalty can be written as
\[ \beta_\perp^\lambda = \arg\min_{\beta} \, -\frac{1}{n}\sum_{i=1}^n \mathrm{E}_{\tilde{y}_i}\!\left[ \log p(\tilde{y}_i \mid \beta) \right] + \lambda \sum_{j=1}^p \left( \alpha |\beta_j| + \frac{1-\alpha}{2}\beta_j^2 \right). \tag{21} \]
Here the first term is the negative of the expected log-likelihood of the submodel with coefficient vector $\beta$ over the predictive distribution of the reference model, and $\alpha$ is the elastic net mixing parameter that bridges the gap between Lasso ($\alpha = 1$) and ridge ($\alpha = 0$). Solving this for $\alpha > 0$ over a grid of values for $\lambda$ yields a sequence of models with a varying number of nonzero regression coefficients, which can then be used to order the features, for instance by recording the order in which their coefficients become nonzero as $\lambda$ is decreased 7 . It is known that in the case of correlated predictors, Lasso tends to select only one or a few of them, discarding the others, while the elastic net with $0 < \alpha < 1$ tends to select correlated predictors in groups (Hastie et al., 2015). Often $\alpha$ is treated as a higher level parameter and is selected on more subjective grounds. One of the key advantages of the elastic net over the forward stepwise search is that it is computationally very efficient. In particular, the coordinate descent algorithm of Friedman et al. (2010), which exploits warm starts, can often compute the solution path over the entire $\lambda$ grid in a time comparable to a single IRLS fit for a fixed variable combination. We shall not discuss the algorithm but instead refer to the original paper for more information. However, we do emphasize that unlike in the penalized GLM literature, we use the penalization only to find promising submodels, not to regularize their fit after selection.
In other words, after we have solved problem (21) for a grid of values λ, we order the features from the most relevant to the least relevant, and find the projected parameter values (or projected posteriors) of the submodels without any penalization, or using only a small L 2 -regularization to improve numerical stability. This is because the projection conditions on the information in the reference model and is therefore much more resilient to overfitting than maximum likelihood estimation for the parameters after selection. See Section 7.1 for an illustration of this point, and Section 7.3 for a demonstration of how the predictive accuracy can greatly benefit from not using the penalization for the submodels after selection.
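This two-stage workflow (penalized search, then unpenalized projection) can be sketched in numpy as follows. This is not the glmnet/projpred implementation: it uses a small hand-rolled coordinate-descent Lasso ($\alpha = 1$, cold starts) on a stand-in reference fit `mu_star` to order the features along a decreasing $\lambda$ path, and then projects onto the selected subset without any penalization; all data and the grid are hypothetical:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1,
    assuming the columns of X are standardized."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # partial residual
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

rng = np.random.default_rng(3)
n, p = 60, 10
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)
mu_star = X[:, 0] * 2.0 + X[:, 1]                   # stand-in reference fit

# Order features by when their coefficient first becomes nonzero on the path
order, seen = [], set()
for lam in np.geomspace(2.0, 1e-3, 30):
    b = lasso_cd(X, mu_star, lam)
    for j in np.flatnonzero(b):
        if j not in seen:
            seen.add(j)
            order.append(j)

# Project onto the selected submodel WITHOUT penalization
k = 2
sel = order[:k]
beta_sub = np.linalg.lstsq(X[:, sel], mu_star, rcond=None)[0]
```

Because `mu_star` is driven by the first two features, they should enter the path first; the final fit on the selected subset is deliberately unregularized, mirroring the point made above.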
7. Notice that this is not necessarily the same order in which the coefficients go to zero as the penalty term $\lambda$ is increased. This is because a coefficient that is nonzero can go back to zero as $\lambda$ is reduced, but most of the time the two orderings are the same.

In addition to Lasso and elastic net, there is a wide literature on different penalties for (generalized) linear models that are used to induce sparsity in the solution, and that could therefore be
used as search heuristics to find promising submodels for the projection. One such method is the adaptive Lasso (Zou, 2006), which is obtained from (21) by introducing penalty factors $\gamma_j$ that result in different penalization for different variables, $\lambda_j = \gamma_j \lambda$, $j = 1,\dots,p$. Plugging the local penalties into the regularization term in (21), the regularizer becomes
\[ \sum_{j=1}^p \gamma_j \lambda \left( \alpha |\beta_j| + \frac{1-\alpha}{2}\beta_j^2 \right). \]
Using pilot estimates $\hat{\beta}$ for the coefficients (which can be the univariate regression coefficients, for example) and setting $\gamma_j = 1/|\hat{\beta}_j|^\nu$ for some $\nu > 0$, the adaptive Lasso reduces the excessive shrinkage of the relevant coefficients and recovers the true model under more general conditions than does the Lasso. The adaptive Lasso can also be used to encode preferences for different variables, for instance due to varying measurement costs. In the projection context, Tran et al. (2012) proposed to set $\hat{\beta}$ to the posterior mode of the reference model (assuming it is also a GLM), whereas Hahn and Carvalho (2015) proposed to use the posterior mean (the two choices are in general different for GLMs with non-Gaussian priors for the reference model). Our approach differs from these in that we set $\gamma_j = 1$ for each feature in the selection phase but then relax the penalty completely ($\gamma_j = 0$) after the feature selection is done. We also utilize the clustered or draw-by-draw projection after selection when appropriate (see Sec. 7.1). Another difference to the approach of Hahn and Carvalho is that they used squared error instead of the KL-divergence to measure the discrepancy to the reference model. Nott and Leng (2010) also used $L_1$-penalization but for the draw-by-draw projection. In their method the different draws can generally project onto different feature combinations even for fixed $\lambda$, and thus this approach does not perform feature selection in the sense we are interested in.
Another useful search heuristic is the group Lasso penalty (see, e.g., Hastie et al., 2015) which allows selecting features in groups meaning that all features in the same group are either selected or discarded simultaneously. Other sparsity enforcing penalties that could be used as search heuristics include the smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001) and the Dantzig selector (Candes and Tao, 2007), but we shall not discuss these further.

Validation and decision rules for model size selection
Although we can find the optimal reduced model for a given model complexity by selecting the model with the minimal projection loss, making the decision about the appropriate model complexity using the KL-divergences is often difficult. Firstly, it is not easy to assess how much predictive accuracy is lost for a given amount of projection loss introduced. Secondly, since the reference model is never perfect in practice, it is possible to find a submodel with nonzero projection loss that gives as good predictions as the reference model (Piironen and Vehtari, 2017a). Therefore a natural way of deciding the model complexity is to validate the predictive utility of both the reference model and the candidate reduced models on a validation set using a metric that is easy to interpret, and then make the decision based on these validation results. A generic and useful utility function is the mean log predictive density (MLPD) over the validation points (see, e.g., Vehtari and Ojanen, 2012), which has the advantage that it measures not only the point predictions but also how well the predictive uncertainties are calibrated. Various other utility and loss functions could also be used, such as the mean squared error (MSE), or classification accuracy in classification problems, which are often easier to interpret.
If plenty of data are available and computation time is an issue, this assessment can be done on hold-out data. However, when data are scarce, more accurate assessment can be obtained using either leave-one-out (LOO) or K-fold cross-validation, which we shall discuss next.

K-fold cross-validation
In K-fold cross-validation, both the reference model fitting and the selection are performed $K$ times, each time computing the utilities on the corresponding validation set (Peltola et al., 2014). This gives us the cross-validated pointwise utilities $u_k^{(i)}$ for a given model complexity $k$ (number of features) at each datapoint $i$. For instance, with log predictive density as the utility function, $u_k^{(i)}$ is the log predictive density of the submodel with $k$ features evaluated at the left out $y_i$. These can then be used to make the final decision about the appropriate level of complexity. Our approach is to estimate the utility of each model size $k$ relative to the reference model, that is, $\Delta U_k = U_k - U_*$, where $U_k$ and $U_*$ denote the true (unknown) utilities for the reduced and the reference model, respectively. The point estimate and the standard error for the relative utility $\Delta U_k$ in such a pairwise comparison are given by
\[ \Delta \bar{U}_k = \frac{1}{n}\sum_{i=1}^n \left( u_k^{(i)} - u_*^{(i)} \right), \tag{22} \]
\[ s_k = \sqrt{ \frac{1}{n}\, \mathrm{V}_{i=1}^{n}\!\left[ u_k^{(i)} - u_*^{(i)} \right] }, \tag{23} \]
where $\mathrm{V}_{i=1}^{n}[\,\cdot\,]$ denotes the sample variance. Given the point estimate and its standard error, it is easy to construct desired confidence intervals for $\Delta U_k$. A natural choice is then to choose the simplest model that has an acceptable difference relative to the reference model with some confidence (Piironen and Vehtari, 2017a).
A simple choice is to select the smallest model for which the utility estimate is no more than one standard error away from that of the reference model, that is, the smallest $k$ that satisfies $\Delta\bar{U}_k + s_k \geq 0$, meaning that (in the borderline case) the submodel is no worse than the reference model with probability of approximately $\alpha = 0.16$. This approach has the drawback that such a model is not guaranteed to be found if the submodels all introduce a considerable loss in utility. Instead, one could compare the utilities relative to the best submodel found, that is, replace $u_*^{(i)}$ in Equation (22) by $u_{k_\mathrm{best}}^{(i)}$, where $k_\mathrm{best} = \arg\max_k \Delta\bar{U}_k$. Based on the experiments in Section 7.4, the two choices perform quite similarly, the latter tending to select less parsimonious models but also with slightly better predictive accuracy. Depending on the application, one might be willing to sacrifice more utility in order to simplify the model even further, and the decision about the appropriate model size could naturally be made on more subjective grounds also.
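The pairwise comparison (Eq. (22)-(23)) and the one-standard-error rule can be sketched as follows, with the cross-validated pointwise log predictive densities replaced by synthetic stand-ins (the utilities of real submodels would come from the K-fold or LOO procedure above):

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 100, 8                               # datapoints, candidate model sizes

# Stand-in pointwise log predictive densities: u[k, i] for the k-feature
# submodel, u_ref[i] for the reference model (smaller models do worse here)
u_ref = rng.normal(-1.0, 0.5, size=n)
u = u_ref - np.linspace(1.0, -0.1, K)[:, None] + rng.normal(0, 0.3, size=(K, n))

diff = u - u_ref                            # pointwise differences per model size
dU = diff.mean(axis=1)                      # point estimates, Eq. (22)
se = diff.std(axis=1, ddof=1) / np.sqrt(n)  # standard errors, Eq. (23)

# One-standard-error rule: smallest k with dU_k + s_k >= 0
k_sel = int(np.flatnonzero(dU + se >= 0)[0])
```

Note that the pairwise differencing removes the variation shared by the submodel and the reference model, which is why the standard errors of the relative utilities are much smaller than those of the absolute utilities.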

PARETO SMOOTHED IMPORTANCE SAMPLING
The drawback in the K-fold cross-validation is that it requires fitting the reference model K times.
Here we propose an alternative approach: approximate leave-one-out (LOO) validation using Pareto smoothed importance sampling (PSIS), which avoids the repeated fitting of the reference model. In (PS)IS-LOO, the posterior draws can be treated as draws from the LOO posteriors given the importance weights. The weight for draw $\theta_*^s$ after leaving the $i$th observation out, $w_s^{(i)}$, is given by
\[ w_s^{(i)} \propto \frac{1}{p(y_i \mid \theta_*^s)}. \]
These raw weights are then regularized using Pareto smoothing to stabilize the LOO estimates in case the importance weight distribution has a thick tail. It is then easy to approximate the desired quantities for the LOO folds using these weights. For instance, in the clustered projection for the Gaussian linear model we need the predictive means $\mu_*^c$ and variances $(V_1^c,\dots,V_n^c)$ from the reference model for each cluster $c$. If the reference model is also linear with Gaussian noise, using the notation from Section 3.5, the predictive means at the observed inputs for the $i$th LOO fold are given by
\[ \mu_{*,j}^{c,(i)} = \sum_{s\in I_c} w_s^{(i)}\, z_j^\mathsf{T}\beta_*^s, \tag{24} \]
where the weights are assumed to be normalized, $\sum_{s\in I_c} w_s^{(i)} = 1$. Correspondingly, the predictive variance at point $j$ for the $i$th LOO fold is given by
\[ V_j^{c,(i)} = \sum_{s\in I_c} w_s^{(i)}\, \sigma_{*,s}^2 + \mathrm{V}_{s\in I_c}\!\left[ z_j^\mathsf{T}\beta_*^s,\, w_s^{(i)} \right], \tag{25} \]
where $\mathrm{V}_{s\in I_c}[\,\cdot\,, v_s]$ denotes the weighted sample variance over indices $s \in I_c$ with weights $v_s$. Equation (25) is merely the weighted version of formula (20). The feature selection and the projection onto the submodels along the search path are then carried out for each LOO fold exactly as in the K-fold case.
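A rough numpy sketch of the importance weighting for a Gaussian model follows. For simplicity it uses plain truncated importance sampling (truncating at $\sqrt{S}$ times the mean weight) in place of full Pareto smoothing, so it only approximates the PSIS procedure described above, and the posterior draws are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(5)
n, S = 30, 1000
y = rng.normal(size=n)                       # observed targets
mu_draws = rng.normal(0, 0.2, size=(S, n))   # stand-in draws of predictive means
sigma = 1.0                                  # fixed noise sd for simplicity

# Raw LOO importance weights: w_s^(i) proportional to 1 / p(y_i | theta_*^s)
logp = -0.5 * ((y - mu_draws) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
logw = -logp                                 # shape (S, n), one column per LOO fold

# Truncated IS as a simple stabilization in place of full Pareto smoothing
w = np.exp(logw - logw.max(axis=0))
w = np.minimum(w, np.sqrt(S) * w.mean(axis=0))
w /= w.sum(axis=0)                           # normalize per LOO fold

# Weighted LOO predictive mean of the reference model at each point (cf. Eq. 24)
mu_loo = (w * mu_draws).sum(axis=0)
```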
Exactly the same decision rules as with the K-fold validation can be used to decide the appropriate model size; the LOO method simply gives an alternative procedure for computing the pointwise utilities $u_k^{(i)}$ and $u_*^{(i)}$ for the reduced and reference models, respectively. PSIS has the additional benefit that it gives us the Pareto $\hat{k}$-diagnostic for each LOO fold, describing the accuracy of the importance sampling approximation. Based on theoretical and empirical considerations, values $\hat{k} \leq 0.7$ indicate a reliable approximation. Larger values indicate that the calculated utilities $u_k^{(i)}$ and $u_*^{(i)}$ for such an observation $i$ have high variance and can be biased (optimistic). In Section 7.2 we demonstrate empirically that even when a few $\hat{k}$-values exceed this threshold, the relative utility estimate (22) can be nearly unbiased, since the biases in $u_k^{(i)}$ and $u_*^{(i)}$ tend to cancel out in the subtraction.

SUBSAMPLING
Although the (PS)IS-LOO validation avoids the repeated fitting of the reference model, the computation can still get quite involved for large data sets if the selection and projection onto the reduced models are repeated $n$ times. In such situations it is usually advisable to resort to K-fold validation (or hold-out), but an alternative approach is to compute only a subset of the LOO folds. Selecting a random subsample of $m < n$ datapoints gives an unbiased estimate of the submodel utilities, but with higher variance than would be obtained by computing all $n$ LOO folds. This method is analogous to the hold-out method, with the difference that the full model is learned using all the data. Since we expect the whole projective idea to be most advantageous when $n$ is small (and hence uncertainties are high), we do not focus on large data sets but provide some further ideas for reducing the variance of the subsampled LOO estimate in Appendix A.

Importance of validating the search
In order to reduce computations, it might be tempting to perform the reference model fitting and feature selection only once using all the available data, and then simply use LOO or K-fold CV to estimate the performance of the found submodels. We strongly advise against this strategy, as it is known to produce biased performance estimates, and the bias can be substantial especially for small $n$ and a large number of features (see Piironen and Vehtari, 2017a, for illustrations). To avoid the selection induced bias, it is important that the same data are never used simultaneously for selection and assessment, meaning that the selection must be performed separately for each of the cross-validation folds regardless of the feature selection method. Section 7.2 shows an example of the resulting bias when the selection process is not taken into account in the model assessment.

On the construction of the reference model
How to construct a good reference model is naturally a central issue in the whole projective approach. It should be clear that this is essentially an open-ended question with no definite answer; for each problem there are endless choices. For simple linear and logistic regressions with a moderate number of features (say, less than a hundred), we recommend using all the features with a sparsifying prior, which can work better than a non-sparse prior such as a Gaussian. If one is uncertain about the prior, the recommended strategy is to try different choices and compare the resulting fits with cross-validation.
In high-dimensional problems, say with hundreds of features or more, a fully Bayesian approach can still provide a good fit but can also prove computationally expensive (Piironen and Vehtari, 2017c). Using feature screening, dimension reduction, or a combination of the two can be very successful in alleviating the computational burden without sacrificing predictive accuracy (Neal and Zhang, 2006; Fan and Lv, 2008; Piironen and Vehtari, 2018). In our experience this is true especially for data sets that have plenty of features, many of which are correlated with each other and predictive of the target variable. Microarray data sets (Sec. 7.4) are typical examples that fall into this category.
For these problems, a simple but useful recipe combining feature screening and dimension reduction is known as supervised principal components (SPCs) (Bair et al., 2006), which works as follows. First, the univariate correlations $R(x_j, y)$ between each feature $x_j$ and the target $y$ are computed, and only the features with $|R(x_j, y)|$ above some threshold $\gamma$ are retained. This yields a reduced feature matrix $X_\gamma$, from which one then computes the first $n_c$ principal components $(z_1,\dots,z_{n_c})$ and uses these as the predictors in the reference model. The advantage over unsupervised principal components is that the screening step filters out variation in the original features that is unrelated to $y$, and therefore the predictive power is typically more heavily loaded on the first few components. In the experiments of this paper, the screening threshold $\gamma$ is selected using fivefold cross-validation from a coarse grid of $n_\gamma = 7$ values evenly spaced between $\gamma_\mathrm{min}$ and $\gamma_\mathrm{max}$, where $\gamma_\mathrm{min}$ is the largest $\gamma$ such that none of the features are discarded and $\gamma_\mathrm{max}$ is the smallest $\gamma$ such that only one feature survives the screening. Furthermore, we use $n_c = 3$ SPCs with a Gaussian prior $\mathrm{N}(0, \tau^2)$ for the regression coefficients and hyperprior $\tau \sim t_4^+(0, s_\mathrm{max}^{-2})$, where $s_\mathrm{max}$ denotes the standard deviation of the largest principal component (this is done only to make the prior roughly the same regardless of the scale of the SPCs).
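The screening and projection-to-components steps of the SPC recipe can be sketched in numpy as follows (the threshold `gamma` is fixed here rather than cross-validated, and the data are synthetic stand-ins):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, n_c = 80, 500, 3

X = rng.normal(size=(n, p))
y = X[:, :20].sum(axis=1) + rng.normal(size=n)   # only the first 20 features matter

# Screening step: keep features whose |correlation with y| exceeds gamma
Xs = (X - X.mean(0)) / X.std(0)
ys = (y - y.mean()) / y.std()
R = Xs.T @ ys / n                                 # univariate correlations
gamma = 0.2                                       # fixed threshold for illustration
keep = np.abs(R) > gamma
X_gamma = Xs[:, keep]                             # reduced feature matrix

# First n_c principal components of the screened feature matrix
U, s, Vt = np.linalg.svd(X_gamma, full_matrices=False)
Z = U[:, :n_c] * s[:n_c]                          # supervised principal components
```

The matrix `Z` would then serve as the predictor matrix of the reference model; in the actual experiments `gamma` is chosen by fivefold cross-validation over the grid described above.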
We emphasize that we do not argue that this gives a foolproof method for constructing a good reference model. Rather, the purpose is to demonstrate that even with such a simple, easy-to-implement and computationally light method it is possible to come up with a reference model that gives good results and improves feature selection in many cases. Indeed, in our earlier work we found that the optimal method is in general data set dependent, and in some cases better results can be obtained by other choices, such as the iterative version of the above algorithm (Piironen and Vehtari, 2018). Again, cross-validation and posterior predictive checks should be used to guide the selection of the reference model (Gelman et al., 2013). A generic strategy for improving the prediction accuracy is also to average over several models, using either stacking or (pseudo) Bayesian model averaging (Yao et al., 2018), or boosting or bagging in a non-Bayesian context (see, e.g., Hastie et al., 2009).

Experiments
This section presents several examples of the projective method. We first demonstrate the basic usage of the different projection techniques and the new LOO validation for model size selection, and then compare the projective approach to the elastic net family of estimators. For fitting the Bayesian reference models we use Stan (Stan Development Team, 2018), with the convenient interfaces to GLMs provided by the R packages rstanarm (Goodrich et al., 2018) and brms (Bürkner, 2017). All the projections are computed using our R package projpred. The results for the elastic net family methods are computed using the R package glmnet (Friedman et al., 2010).

Illustration of different projections
This section illustrates the differences between the three projection techniques introduced in Section 3.3. Consider the following synthetic binary classification data. For instances belonging to the first class, the first three features are drawn from independent Gaussians with mean 1 and scale 0.5, whereas for observations from the second class the mean and scale of these features are −1 and 0.5, respectively. In addition, the data have 27 additional noise features drawn from independent standard Gaussians, so the data have 30 features altogether (out of which only the first three are predictive of the class label). We generated one data realization with $n = 50$ observations and fitted a Bayesian logistic regression model to these data using the RHS prior (5) with hyperparameter choices $p_0 = 1$, $s^2 = 1$ and $\nu = 4$. This serves as our reference model. Figure 6 illustrates the posterior projection onto the first two features for the three different projections: draw-by-draw (left column), single point (middle column) and 10 clusters (right column). We observe that even with the single point projection, the predictive probabilities are very close to those of the draw-by-draw projection (see top and middle rows), and projecting 10 clusters gives predictions indistinguishable from the draw-by-draw projection for all practical purposes. This result is insightful, as one might think that the single point projection would be substantially inferior because it computes only point estimates for the projected model. The key insight is that these point estimates are computed so as to take into account the uncertainty in the parameters of the full model. Therefore the resulting predictive distribution is much closer to that of the full model than what would be obtained by projecting only the posterior mean of the full model or by computing the maximum likelihood estimates for the submodel (which in this case do not even exist because the classes are separable).
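As a concrete sketch, the synthetic data described above can be generated with numpy as follows (the seed and variable names are hypothetical, so this will not reproduce the exact data realization used in the paper):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p_signal, p_noise = 50, 3, 27

y = rng.integers(0, 2, size=n)                    # class labels 0 / 1
# First three features: mean +1 (scale 0.5) for class 0, mean -1 for class 1
means = np.where(y[:, None] == 0, 1.0, -1.0)
X_signal = rng.normal(means, 0.5, size=(n, p_signal))
X_noise = rng.normal(size=(n, p_noise))           # 27 irrelevant standard-normal features
X = np.hstack([X_signal, X_noise])                # n x 30 feature matrix
```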
Figure 6 (caption): Top row shows the observed data and the contours (from 0.1 to 0.9) of the predictive probability for $\tilde{y} = 1$, whereas the middle row shows the predictive probabilities at the observed input locations (vertical axis denoting the result for the full model with all features, and horizontal axis the projection of the corresponding column). Bottom row shows the projected regression coefficients (black dots) as well as the draws from the full posterior (gray crosses). In the bottom row plots, the dot sizes denote the relative weights (the dot sizes between different columns are not comparable).

Another important point is that even for the draw-by-draw method the projected posterior is in general different from the marginal posterior of those parameters in the full model (see the bottom left plot in Fig. 6). In particular, the projected posterior has vanishingly little mass near the origin $\beta_1 = \beta_2 = 0$, although the full posterior has substantial mass there. This makes sense: after
removing feature $x_3$, which is predictive and highly correlated with $x_1$ and $x_2$, the coefficients of $x_1$ and $x_2$ cannot both be set to zero; otherwise the predictions would be seriously affected. As discussed earlier, the benefit of the clustered projection compared to the draw-by-draw projection is its speed; projecting only $C$ clusters cuts the computations down to a fraction $C/S$ of the draw-by-draw cost, where $S$ is the number of draws that would be projected in the draw-by-draw projection. The computational savings can be huge when projections need to be computed onto many models, such as with the LOO validation. For instance, for this data set, computing the projections of each of the $n = 50$ LOO posteriors for all model sizes up to 30 features in a naive fashion would require a total of 1500 projections, each of which takes around a second or two depending on the hardware. Thereby with the clustered projection we can reduce the computation time from the order of 25-50 minutes to about 4-8 seconds 8 . The additional benefit of the single point projection is that it can be combined with the sparsity enforcing penalty functions (Sec. 4), which allows for fast searching for promising submodels.
For these reasons, our preferred choice is to use the single-cluster projection in the selection phase, and a small number of clusters (such as 5-10) when making predictions with the submodels, especially if many submodels need to be considered. Still, we find the draw-by-draw projection most convenient for visualizing the projected posterior distributions, for instance when credible intervals or regions are of interest. It also serves as a useful yardstick for checking and confirming the accuracy of the clustered projection.
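In the Gaussian linear case the projection of a single coefficient draw has the closed form β_⊥ = (X_S^T X_S)^{-1} X_S^T X β, which makes the contrast between the draw-by-draw and single-cluster projections easy to see. Below is a minimal numpy sketch with synthetic stand-in draws (all names hypothetical); because this projection is linear, projecting the posterior mean (one cluster) reproduces the mean of the draw-by-draw projections exactly, which is why the single-cluster projection is a cheap but faithful summary in the selection phase:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, S = 40, 5, 1000
X = rng.standard_normal((n, p))
draws = rng.standard_normal((S, p))   # stand-in posterior draws of beta

sub = [0, 1]                          # selected feature subset
Xs = X[:, sub]
# Gaussian-case projection operator: beta_perp = (Xs^T Xs)^{-1} Xs^T X beta
A = np.linalg.solve(Xs.T @ Xs, Xs.T @ X)

proj_draws = draws @ A.T              # draw-by-draw: one projection per draw
proj_mean = A @ draws.mean(axis=0)    # single cluster: project the mean only

# a linear projection commutes with averaging, so the single-cluster
# projection recovers the mean of the draw-by-draw projections
assert np.allclose(proj_draws.mean(axis=0), proj_mean)
```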

Simulated example revisited with projection and LOO
We shall now revisit the simulated example of Section 2.3 and illustrate the steps of projective selection as well as our new LOO validation technique. The first step is to decide on the reference model, which in practice we would do by assessing the fits of the candidate models using cross-validation and posterior predictive checks. The sums of the LOO log predictive densities for the Gaussian and RHS priors are −76.8 and −77.6 with standard errors 6.8 and 6.2, respectively, so there is no significant difference in predictive fit between these models (this holds also if we make the comparison in a pairwise fashion, as in Eq. (22) and (23)). The R package spikeslab does not provide the posterior draws for the regression coefficients, so we cannot compute LOO for the SS prior and ignore it for now.
Suppose we select the model with the RHS prior as our reference model (the results for the Gaussian prior are shown in Appendix B). We then run the projective feature selection with the L_1-search and assess the accuracy of the submodels using the LOO approach (Sec. 5.2). The MLPDs for the submodels relative to the reference model are shown in the bottom left subplot of Figure 7 (blue curve). The one-standard-error rule (Sec. 5.1) would suggest selecting one feature, which results in only a small loss in accuracy on test data (black curve) compared to the reference model. The top left subplot shows the MLPD on the actual scale, which demonstrates how much larger the uncertainty is about the actual MLPD than about the relative MLPD.
The right column of Figure 7 shows the average LOO curves for both MLPD and relative MLPD over 200 data replications. These graphs demonstrate that the actual LOO values for the submodels are slightly biased (optimistic). This is due to a small bias in PSIS-LOO for the reference model, which is also diagnosed by a few k̂-values exceeding 0.7 in most data realizations. Notice, though, that the results for the relative MLPD are still essentially unbiased for submodels with performance close to the reference model, because the bias cancels out in the subtraction (22) (see Sec. 5.2). In other words, even if we have only a biased estimate of the reference model utility, we can still get a good indication of whether a submodel's performance is close to that of the reference model.
Right column subplots of Figure 7 also show the expected LOO results if we do not take into account the selection induced bias but perform the selection only once (not separately for the n folds) and then compute LOO for the submodels (see Sec. 5.3). The selection induced bias is clear although only moderate in this particular example. For assessing the submodel accuracies, LOO validation is very useful in this particular example for a few reasons. First, PSIS-LOO works well for the full model (only a few k̂-values above 0.7 in most data realizations). Second, the number of features is only moderate, so the feature selection is fast. Third, the number of observations is small, so the number of selection paths we need to compute is also small. Consequently, the whole validation process takes only a few seconds, which is much less than a single model fit with the horseshoe prior (around half a minute on a standard laptop), so the computational savings compared to K-fold cross-validation are clear.

8. In a careful implementation the difference would not be quite as dramatic, since some of the submodels would be visited in many of the n = 50 folds and their projections would not need to be recomputed every time, but this example still gives a good idea of the computational gain.

Figure 7 (caption fragment): ... one standard error intervals on independent test data of 1000 points (black) and using LOO (blue) for the selected and projected submodels. The data has p = 50 features and the reference fit is the linear model with the RHS prior (the same as in the right middle subplot of Figure 3). Right column: The same, but results are averaged over 200 data realizations. The orange curves show the LOO for the submodels if the feature selection is done only once, and not separately for each of the n folds. The difference from the blue curve comes from the selection induced bias.
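The bias-cancellation argument can be made concrete with a toy computation: the relative MLPD is formed from pointwise differences, so any bias shared by the reference and submodel LOO estimates drops out of the comparison. A small numpy sketch with synthetic stand-ins for the pointwise LOO log densities (nothing here comes from the actual example):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# synthetic stand-ins for pointwise LOO log predictive densities
# (in practice these would come from PSIS-LOO)
lpd_ref = rng.normal(-1.0, 0.3, size=n)                 # reference model
lpd_sub = lpd_ref - 0.05 + rng.normal(0, 0.02, size=n)  # slightly worse submodel
bias = 0.2                                              # bias shared by both estimates

# relative MLPD: computed from pointwise differences, so the shared bias
# cancels exactly in the subtraction
d = (lpd_sub + bias) - (lpd_ref + bias)
delta_mlpd = d.mean()                  # close to the true -0.05 gap
se_delta = d.std(ddof=1) / np.sqrt(n)  # its standard error
```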

The benefit of using a reference model
This section demonstrates the benefits of a reference model for feature selection and parameter estimation in the submodels. We again utilize simulated data generated by mechanism (1), and consider both regression, with the original y regressed on (x_1, . . . , x_p), and binary classification, with the target variable defined as the indicator y_class = 1(y > 0).
We used a setup with n = 50 training observations and p = 500 features, out of which the first p_rel = 50 were relevant, and report average results over 50 data realizations for ρ-values of 0.3, 0.5 and 0.8.⁹ The reference model is fitted using SPCs as discussed in Section 6. We tested four different strategies for selecting features and making predictions with the selected subsets of features:

1. Lasso: Sort the variables from the most relevant to least relevant according to the order in which they enter the model as the regularization coefficient λ is decreased. For a given number of features, the submodel coefficients are computed using the smallest λ for which other variables do not enter the model. In the regression problems, the noise variance σ² is estimated as proposed by Reid et al. (2016), that is, by dividing the sum of the squared residuals by n − p_act, where p_act denotes the number of active features in the submodel.

2. Lasso, relaxed: Same as Lasso, but after sorting the variables, the submodel coefficients and predictions are computed without any regularization (which affects also the estimated noise variance in regression).¹⁰

3. L_1-projection: L_1-penalized projection (21), varying λ similarly as in Lasso. In regression, the projected noise variance is computed according to Equation (19) (where C = 1).

4. L_1-projection, relaxed: Same as L_1-projection, but after sorting the variables, the submodel coefficients are projected without any regularization (which affects also the projected noise variance in regression).
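To make strategy 4 concrete in the Gaussian regression case: the relaxed projection fits the selected features to the reference fit µ* by ordinary least squares, and the projected noise variance adds to the reference noise variance whatever part of µ* the submodel leaves unexplained (cf. Equation (19) with C = 1). A minimal numpy sketch under these assumptions; the function name and the stand-in reference fit are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 20
X = rng.standard_normal((n, p))
beta_true = np.r_[np.ones(3), np.zeros(p - 3)]
y = X @ beta_true + rng.normal(0, 0.5, size=n)

# stand-in reference fit: predictions mu* and noise variance sigma*^2
mu_star = X @ beta_true
sigma2_star = 0.25

def relaxed_projection(X, mu_star, sigma2_star, subset):
    """Project the reference fit onto a feature subset with no penalty."""
    Xs = X[:, subset]
    beta_perp = np.linalg.lstsq(Xs, mu_star, rcond=None)[0]
    resid = mu_star - Xs @ beta_perp
    # projected noise variance: reference noise plus unexplained reference fit
    sigma2_perp = sigma2_star + resid @ resid / len(mu_star)
    return beta_perp, sigma2_perp

_, s2_small = relaxed_projection(X, mu_star, sigma2_star, [0, 1])
_, s2_full = relaxed_projection(X, mu_star, sigma2_star, list(range(p)))
# adding features moves the projected submodel toward the reference model
assert s2_full <= s2_small
```

Note that the projected noise variance never drops below the reference model's noise level; adding features only shrinks the unexplained term, which is the mechanism behind the observation that the projection does not overfit as the submodel grows.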
Notice that all these methods utilize only point estimates for the model parameters in the submodels; the difference lies only in how these are computed. Figure 8 shows the regression MLPD and MSE on independent test data, as well as the projected noise standard deviation, for different submodel sizes. The blue curves demonstrate the benefit of relaxation for the L_1-projection: both variants eventually achieve the performance of the reference model, but without relaxation this requires many more features. The reason is the inherent tradeoff between shrinkage and selection: in order to force most of the regression coefficients to zero, the regularization coefficient λ must be made large, but this also overshrinks the nonzero coefficients. Projecting without any penalization after selection therefore achieves a greatly improved tradeoff between accuracy and model complexity. Notice in particular that here no regularization is needed to avoid overfitting in the projection; as more features are added, the projected submodels simply get closer to the reference model.
However, the picture is quite different when the parameter estimates are computed based on the observed data without utilizing the reference model (Fig. 8, orange curves). The relaxation improves the fit in terms of MSE for submodels with only a few features but results in overfitting for larger models. In terms of MLPD the relaxed Lasso performs worst overall, indicating badly calibrated predictive distributions, which is mostly due to underestimation of the noise variance (bottom row) for most model sizes. The projection methods, on the other hand, show very good calibration of the predictive uncertainties, which is evident from the superior MLPD and noise variance estimation for most model sizes. Overall, the projection approach shows a bigger edge for ρ = 0.3 and ρ = 0.5, where the individual features are less predictive.

9. We also considered varying values for p_rel, but the conclusions are not sensitive to the selected value p_rel = 50.

10. We are aware that the term 'relaxed Lasso' has been used to denote a more general method where, after feature selection, the coefficients are computed with a small but nonzero L_1-penalty (Meinshausen, 2007). The complete relaxation (i.e., zero penalty after selection) was referred to as the 'Lasso-OLS hybrid' by Efron et al. (2004).

Figure 8 (caption fragment): ... (1)). Errorbars indicate one standard error intervals and black dashed lines the reference model result. In the bottom row plots the gray line denotes the true noise standard deviation.

Figure 9 shows the analogous results for the classification data. The conclusions are very similar to those drawn from the regression example. Here the relaxed Lasso overfits even more severely; although the classification accuracy is similar to the Lasso, the MLPD is very low, indicating bad calibration of the predicted class probabilities.
Again, the edge for the projection is more pronounced for ρ = 0.3 and ρ = 0.5, but we observe that for these cases even the relaxed projection struggles to achieve the same MLPD as the reference model, and for larger numbers of features (15-25) the penalized projection achieves slightly better results. This is due to a small instability of the projection in data sets where some of the reference class probabilities are close to 0 and 1, and it shows that even the projection, although very resilient, is not entirely immune to overfitting.

Real world benchmarks
This section shows how the projection compares in high-dimensional real world problems. We use microarray data sets,¹¹ some of which have been used as benchmarks by several authors (Li et al., 2002; Lee et al., 2003; Hernández-Lobato et al., 2010). All data sets deal with binary classification, and the numbers of features and observations can be found in Table 1. Again, as a reference model we use the one described in Section 6 and call it here 'Bayes SPC'. For the projection method, we used the L_1-search and made the submodel predictions using the five-cluster projection. Notice that although the reference model employed only a reduced set of features (those that survived the screening), the projective selection considered all the features (no pre-selection). The number of features was decided based on fivefold cross-validation. To investigate the effect of the model size selection heuristic discussed in Section 5.1, we report results for the smallest number of features whose cross-validated MLPD is within one standard error of the reference model or of the best submodel ('Proj-best-1se'). We also report results that are otherwise exactly the same as the two above but utilize a little bit of ridge regularization (with λ = 0.1) in the submodel projections, which was observed to improve the numerical stability in cases where some of the reference model class probabilities are close to 0 and 1.
For comparison, we computed results for the Lasso, the elastic net (with α = 0.7 and α = 0.3) and ridge regression. To investigate their sensitivity to the selection of the regularization parameter λ, we report results for two choices: λ_opt denotes the value that minimizes the tenfold CV error, whereas λ_1se (the default in glmnet) denotes the largest λ whose CV error is within one standard error of that of λ_opt. To avoid any possible biases in the comparisons, the out-of-sample predictive accuracies for all the methods were assessed using an outer tenfold cross-validation. That is, the reference model, the projected submodels and the baseline methods were computed ten times, each time leaving one tenth of the data out and then validating the found models on this left-out portion.

11. All except the Ovarian data are available at http://featureselection.asu.edu/datasets.php.
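The λ_1se rule is simple to state in code. A small hedged sketch (function and variable names hypothetical, toy CV curve made up for illustration):

```python
import numpy as np

def select_lambda(lambdas, cv_err, cv_se):
    """Return (lambda_opt, lambda_1se): the CV-error minimizer, and the
    largest lambda whose CV error is within one SE of that minimum."""
    lambdas, cv_err, cv_se = map(np.asarray, (lambdas, cv_err, cv_se))
    i_opt = int(np.argmin(cv_err))
    within = cv_err <= cv_err[i_opt] + cv_se[i_opt]
    return lambdas[i_opt], lambdas[within].max()

# toy CV curve along a decreasing regularization path
lams = np.array([1.0, 0.5, 0.25, 0.1, 0.05])
err = np.array([0.90, 0.54, 0.50, 0.52, 0.58])
se = np.array([0.05, 0.05, 0.05, 0.05, 0.05])
lam_opt, lam_1se = select_lambda(lams, err, se)   # 0.25 and 0.5
```

The λ_1se choice trades a small increase in CV error for a sparser model, which is why it typically selects far fewer features than λ_opt.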
The MLPD and classification accuracies from the outer cross-validation are shown in the first two rows of Figure 10. Overall the differences between the methods are fairly small compared to the standard errors in the estimates. In terms of MLPD, the reference model Bayes SPC gives somewhat better results than the Lasso, elastic net and ridge with λ_1se, but all these give similar results when λ_opt is used, and in fact ridge gives slightly better results for the Leukemia data. All projections have statistically indistinguishable MLPD compared to Bayes SPC, but model size selection with 'best-1se' performs slightly better in terms of classification accuracy. Adding a little regularization does not hurt predictive accuracy, and we noticed that it makes the projection numerically more stable in cases where the reference class probabilities are close to 0 and 1.
The bottom row of Figure 10 shows the number of selected features for each method. The projection methods produce by far the most parsimonious models (notice the log scale). The only data set where the Lasso (with λ_1se) selects fewer variables is Leukemia, but there it also yields inferior results in terms of MLPD. This is fully in line with the results shown in Figures 8 and 9: the projection finds a very good tradeoff between sparsity and accuracy. Since an accurate comparison on the log scale is somewhat cumbersome, we also report the exact numbers of selected features in Table 2 (Appendix B).
The computation times are shown in Table 1. After forming the reference model, the projection is computationally only slightly more expensive than the Lasso; the increase comes from the relaxed projections (the predictions are computed without the L_1-penalty). Although not as highly optimized as glmnet, our software is reasonably fast even for the largest problems. Forming the reference model (Bayes SPC) is computationally the most expensive part, though still very affordable considering that all the computations (reference model construction and projection) for the largest number of features can be done in about two minutes. This demonstrates that the projection can be computationally very feasible while yielding improved results over the standard approaches, as shown in Figure 10.

Theoretical results
In this section we present a theorem that helps us understand when the reference model can be helpful for parameter learning in linear submodels. Here we only state the results; the proofs can be found in Appendix C.
Let X = (x_1^T, . . . , x_n^T) ∈ R^{n×p} be the design matrix and y = (y_1, . . . , y_n) ∈ R^n the target measurements. Assume the target measurements decompose as y_i = µ(x_i) + ε_i, where µ(x) is the true expected value of y given x, µ(x) = E(y | x), and the ε_i are i.i.d. random variables, independent of x, with zero mean and finite variance σ², denoting the variation in y that cannot be explained by x. It should be emphasized that although we assume ε denotes random error independent of x, it may contain systematic variation related to some other (unobserved) features not included in x, and hence the magnitude of ε should be interpreted as the irremovable error for this particular set of features x. In vector notation, y decomposes as y = µ + ε, where µ = (µ(x_1), . . . , µ(x_n)) and ε = (ε_1, . . . , ε_n). Furthermore, in what follows we shall use the shorthand notation ||v||²_M = v^T M v, where v is a vector and M a positive definite matrix.

Table 1 (caption fragment): In all cases the time contains the cross-validation of the tuning parameters and/or the model size. The first result for the Lasso is computed using our software (projpred), whereas the second result (and that of ridge) is computed using the R package glmnet, which is more highly optimized.
Consider two methods of estimating the regression coefficients when regressing y on X, namely

β̂ = (X^T X)^{-1} X^T y    and    β^⊥ = (X^T X)^{-1} X^T µ*.

Here β̂ is the familiar least squares estimate, and β^⊥ a projection of an arbitrary reference fit µ* ∈ R^n. Let us then define the expected prediction error for any vector of coefficients β as

Δ(β) = E_ỹ [ (1/n) ||Xβ − ỹ||² ].

Notice that although we consider here the predictions at the observed input locations X, the expectation is with respect to a set of new measurements ỹ = µ + ε̃, where ε̃ is a vector of new noise terms ε̃_1, . . . , ε̃_n. The gain from using β^⊥ instead of β̂ is defined as the reduction in the expected prediction error,

G = Δ(β̂) − Δ(β^⊥).

We have the following lemma.

Lemma 1  For the estimates above, the gain satisfies

G = (1/n) ( ||y − µ||²_P − ||µ* − µ||²_P ),

where P = X(X^T X)^{-1} X^T denotes the projection matrix onto the column space of X.
Since both ||y − µ||²_P and ||µ* − µ||²_P are non-negative, the interpretation of Lemma 1 is that for linear submodels, one can expect to gain (that is, G ≥ 0) from using a reference model whenever the reference fit µ* is closer to the best possible prediction µ (with features x) than the observed noisy target values y are (with the norms taken with respect to the projection matrix P). This makes perfect sense: if we fit our model to the pseudo-data µ* instead of the actual data y, we expect to do better if the pseudo-data are closer to the true underlying conditional mean µ(x) = E(y | x), that is, less noisy than the actual data. Notice that the lemma makes no assumptions about the form of the true underlying mean µ(x) that captures the relationship between y and x. In particular, µ(x) need not be linear in x, nor even smooth or continuous. Neither does the lemma assume anything about how the reference fit µ* is constructed.
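The identity in Lemma 1 is easy to verify numerically: from the definition above, Δ(β) = (1/n)||Xβ − µ||² + σ², so the gain can be computed directly and compared with the difference of the P-weighted norms. A small numpy check with a deliberately nonlinear true mean (all quantities synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 4
X = rng.standard_normal((n, p))
mu = np.sin(X[:, 0]) + X[:, 1] ** 2        # true mean; deliberately nonlinear
sigma = 0.7
y = mu + rng.normal(0, sigma, size=n)
mu_star = mu + rng.normal(0, 0.2, size=n)  # reference fit closer to mu than y is

P = X @ np.linalg.solve(X.T @ X, X.T)      # projection onto col. space of X

def Delta(fit):
    # expected prediction error of predictions `fit` = X beta:
    # E (1/n)||X beta - y~||^2 = (1/n)||X beta - mu||^2 + sigma^2
    return ((fit - mu) ** 2).sum() / n + sigma ** 2

G = Delta(P @ y) - Delta(P @ mu_star)      # gain of beta_perp over beta_hat
lemma = ((y - mu) @ P @ (y - mu) - (mu_star - mu) @ P @ (mu_star - mu)) / n
assert np.isclose(G, lemma)                # the two agree exactly
```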
Let us now assume the differences e* = µ* − µ are random variables with mean b and covariance K; these describe the bias and variance of the reference fit.¹² We have the following theorem.

Theorem 2  Assume the terms e* = µ* − µ have mean b ∈ R^n and covariance K ∈ R^{n×n}. Then the expected gain can be written as

E(G) = (1/n) ( pσ² − ||b||²_P − Tr(PK) ).

12. Here we mean bias and variance both due to fitting the reference model to a finite data set and due to having unobserved features. For example, even if our reference model were a completely deterministic function of x and some other features z, its value at a particular location x_i would still be random, as it depends on the realized value of z.
Theorem 2 further decomposes the reference model error into bias and variance. The term Tr(PK) is difficult to interpret without further assumptions, but the theorem can be understood more easily through the following immediate corollary.
Corollary 3  Assume the reference model errors are uncorrelated with a common variance, that is, K = σ²_{µ*} I. Then the expected gain E(G) simplifies to

E(G) = (1/n) ( p (σ² − σ²_{µ*}) − ||b||²_P ).

This corollary states that with an unbiased reference model (b = 0) we can expect to gain when the variance of the residuals µ* − µ is smaller than the variance of y − µ. Furthermore, the gain increases with the dimensionality of the projection space p, but on the other hand goes to zero when n → ∞. This is also in perfect accordance with the empirical results, for instance those shown in the middle row of Figure 8: there the difference in predictive MSE between the relaxed Lasso and the projection is small up to about p = 2, but then starts to increase gradually. On the other hand, we know that the least squares fit gives the optimal coefficients in the limit n → ∞, and hence we do not expect to gain anything from a reference model in that case.

The above analysis assumes the future predictions are made at the observed input locations. Usually this is not quite a realistic assumption, but it still gives us some idea of when the reference model could be useful. Extending the result to different x̃ would require assumptions about the functional form of µ(x). Furthermore, here we used the squared error as the loss function because it allows for a tractable analysis, but we expect one of the major advantages of the projection to be that it preserves the predictive uncertainties well, which is not measured by the squared error. Finally, this analysis considers only parameter learning in the submodels and says little about when the reference model can improve the selection of a better feature combination. In principle it is possible to improve the selection even when the reference model is biased, as long as the bias is "in the right direction", for instance so that it favors certain features over others. We have discussed this in a bit more detail in our earlier work; see Section 3 in Piironen and Vehtari (2017a).
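The unbiased case of Corollary 3 can be sanity-checked by Monte Carlo: with b = 0 and K = σ²_{µ*} I, the expected gain works out to (p/n)(σ² − σ²_{µ*}), using E[(y − µ)^T P (y − µ)] = σ² Tr(P) = pσ². A small numpy simulation under Gaussian errors (all settings synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 6
X = rng.standard_normal((n, p))
P = X @ np.linalg.solve(X.T @ X, X.T)  # projection matrix, Tr(P) = p

sigma, sigma_star = 1.0, 0.3           # reference fit less noisy than y
reps = 20000
E = rng.normal(0, sigma, (reps, n))            # y - mu realizations
Estar = rng.normal(0, sigma_star, (reps, n))   # mu* - mu realizations (b = 0)

# gain per replication (Lemma 1 form), then averaged over replications
G = (np.einsum('ri,ij,rj->r', E, P, E)
     - np.einsum('ri,ij,rj->r', Estar, P, Estar)) / n

expected = p / n * (sigma ** 2 - sigma_star ** 2)   # Corollary 3 with b = 0
```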
The empirical evidence for the improved selection is convincing (Sec. 1.2), but we are currently not aware of any theoretical analysis of this topic.

Discussion
Below we provide some final remarks, together with recommendations for practitioners and possible directions for future research.

Projection versus Lasso and related methods
Although our results indicate that with a reasonable reference model the relaxed L 1 -projection outperforms Lasso in terms of tradeoff between sparsity and accuracy, there is no question that Lasso and the whole elastic net family are useful methods. These methods are fast to fit and often provide good results without any hand tuning, so they are very useful for getting a good predictive baseline for almost any problem.
In addition to the better tradeoff between sparsity and accuracy, the projection has the upside of providing a principled approach to handling parameters other than the regression coefficients.
For instance, projecting the noise variance in linear regression is trivial and an analytical solution exists (see Sec. 3.5), whereas for the Lasso this is considered a difficult problem (Reid et al., 2016). This property is very beneficial in all settings where additional dispersion parameters need to be estimated. Such situations come up frequently, for example in survival analysis with parametric observation models (Peltola et al., 2014). Another benefit is that the projection gives a principled answer to how to make predictions when some of the features we would like to use are unavailable at prediction time (missing data) (Lindley, 1968).

Multiple hypothesis testing
In this paper we have focused on selecting a minimal subset of features that is sufficient for achieving accurate predictions. As pointed out in Section 1.3, this is a different problem from what is known as multiple hypothesis testing, where one is less interested in predictive accuracy and the desire is instead to identify all the features that are statistically related to the target variable. For instance, in gene expression studies it is conventional to try to identify genes whose expressions differ between two groups (say, between control and cancer patients). This is often done by computing some statistic for each feature (say, a two-sample t-statistic) and then, based on these, trying to distinguish between significant and non-significant features using either frequentist or Bayesian approaches. For a nice overview, see Efron (2010), and for more recent Bayesian accounts, see Bhattacharya et al. (2015), Bhadra et al. (2017) and van der Pas et al. (2017). The empirical evidence indicates that the reference model approach could be highly useful also in this problem setting, since it tends to help rank the truly relevant features before the irrelevant ones (Sec. 1.2). Still, it is an open question which approach to use for detecting the truly relevant features, but this is a research area of its own.

Future directions
There are several natural directions in which to continue this work. An obvious topic for generalized linear models would be implementing the projection for multiclass classification, which has not been done yet but would be straightforward. Another useful noise model would be the Student-t, which finds applications in data sets with outliers. This likelihood is not log-concave, but the minimization of the KL-divergence could be implemented using the EM approach that can be used to find the maximum likelihood solution for the regression coefficients. Other relatively straightforward extensions would be hierarchical GLMs and generalized additive models (GAMs), which provide more flexibility but for which the projection should be implementable using the methodology presented in Section 3.
Ultimately we would like to extend the projection framework to nonlinear models such as Gaussian processes (GPs). This topic was tentatively pursued by Piironen and Vehtari (2016) with promising results, but that approach is computationally prohibitive for large data sets. It may be possible to formulate the projection in a more clever way by borrowing ideas from the stochastic variational inference algorithms that have proven useful for GP learning with large data sets. Another difficulty with GPs is that, due to their flexibility, minimizing the KL-divergence locally at the training inputs does not necessarily guarantee a small divergence elsewhere.
We also believe the projective approach could find more applications in improving interpretability and transparency of complex black box models such as deep neural networks, an idea that was explored recently by Ribeiro et al. (2016) and Peltola (2018). For example, in image classification one could approximate the prediction surface of the complex classifier with a linear model in the vicinity of a misclassified image to figure out which of the pixels had high weight in making the decision.

Acknowledgments
We thank Michael Riis Andersen for helpful discussions.

A. Subsampling LOO
As discussed in Section 5.2.2, one approach to perform validation for large n is to use only a subset of m < n points in LOO validation. This gives an unbiased utility estimate for each submodel but with higher variance than if we used all n LOOs.
The variance can be reduced by semi-random subsampling. The PSIS procedure does not only smooth the importance weights but also gives an indication of the tail thickness of the importance weight distribution, measured by the k̂-value for each of the n data points. The larger the k̂-value, the fatter the tail, indicating a more influential data point. In particular, if k̂ ≤ 0.5, the raw importance weights have finite variance, the central limit theorem kicks in, and the Monte Carlo error decreases quickly. For 0.5 < k̂ ≤ 1 the raw weights have infinite variance, but the generalized central limit theorem holds and the importance sampling estimate is asymptotically consistent. However, the pre-asymptotic behavior is such that for k̂ > 0.7 infeasibly high sample sizes would be needed for reasonable error rates. For k̂ < 0.7 the Pareto smoothing regularizes the estimate so that it has finite variance at the cost of a small bias, and empirical results indicate a practically useful convergence rate (see Vehtari et al., 2017a, for a more thorough discussion). We divide the data points into three strata: those with k̂ < 0.5 (good), those with 0.5 < k̂ < 0.7 (OK), and those with k̂ > 0.7 (bad). Denote the sizes of these strata by n_1, n_2 and n_3, respectively. We then randomly draw min(m/3, n_j) points without replacement from each stratum j = 1, 2, 3. If the number of drawn points is less than m, the rest of the points are drawn randomly from the remaining points so that m points are selected in total. Let m_j, j = 1, 2, 3, denote the number of points drawn from each stratum. To account for the fact that the sizes of the strata are different, the points are weighted according to the stratum sizes when computing the utility estimates. Jumping straight to the result, the point estimate and its standard error for the quantity ΔU_k (analogous to Equations (22) and (23)) are given by

ΔÛ_k = n Σ_{i=1}^n v_i Δu_{k,i},    (28)

se(ΔÛ_k) = √( n V_{i=1}^n [ Δu_{k,i}, v_i ] ),    (29)

where V_{i=1}^n [ ·, v_i ] denotes the weighted sample variance with weights v_i.
Each of the m_1 points in the 'good' category is given weight n_1/(n m_1), and the weights for the points in the 'OK' and 'bad' categories are n_2/(n m_2) and n_3/(n m_3), respectively. The points that did not get selected into the subsample are assigned zero weight, v_i = 0.
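The stratified subsampling scheme above can be sketched as follows; the function name is hypothetical and the k̂-thresholds follow the three strata defined in the text:

```python
import numpy as np

def stratified_loo_subsample(khat, m, rng):
    """Pick m of n points for subsampled LOO, stratified by the PSIS khat
    diagnostic, and return (selected indices, per-point weights v_i)."""
    n = len(khat)
    strata = [np.flatnonzero(khat < 0.5),                     # 'good'
              np.flatnonzero((khat >= 0.5) & (khat <= 0.7)),  # 'ok'
              np.flatnonzero(khat > 0.7)]                     # 'bad'
    chosen = []
    for idx in strata:
        take = min(m // 3, len(idx))
        if take:
            chosen.extend(rng.choice(idx, size=take, replace=False))
    if len(chosen) < m:   # top up so that m points are selected in total
        rest = np.setdiff1d(np.arange(n), chosen)
        chosen.extend(rng.choice(rest, size=m - len(chosen), replace=False))
    chosen = np.asarray(chosen)
    v = np.zeros(n)
    for idx in strata:    # weight n_j / (n m_j) within stratum j
        sel = np.intersect1d(chosen, idx)
        if len(sel):
            v[sel] = len(idx) / (n * len(sel))
    return chosen, v

khat = np.r_[np.full(6, 0.2), np.full(4, 0.6), np.full(2, 0.9)]
chosen, v = stratified_loo_subsample(khat, m=6, rng=np.random.default_rng(5))
# with every stratum represented, the weights of the selected points sum to
# one, as required by the weighted utility estimate
```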
The standard error (29) will be higher than (23), but often a fairly small number of points, such as m = 200, is enough for obtaining reasonably accurate estimates of the difference between the reference model and the submodels. This is because the submodels become closer to the reference model as more features are added, and therefore the differences between their pointwise utilities become ever smaller; hence the standard errors for the models with good performance can become small already for a small number of LOOs. It is worth noting, though, that even when the standard error for the utility difference is small, the standard error for the actual utility U_k can be substantial for small m. If all LOOs are computed for the reference model, it is possible to combine the standard errors for U* and U_k to get a smaller standard error for U_k, but we shall not discuss this further here.

Figure 11 (caption fragment): ... Figure 7 but using the full model with Gaussian prior as the reference model.
B. Extra experimental results

B.1 Toy example

Figure 11 shows results analogous to Figure 7 but using the full model with the Gaussian prior as the reference model. The results are otherwise similar to those in Figure 7, but here the submodels with 3 to 14 features achieve slightly better generalization performance than the reference model. This is simply because the Gaussian prior is not the optimal choice in this particular case, since it does not help to shrink the coefficients of the truly irrelevant features, and hence we can gain by removing those features. This does not mean that the Gaussian prior would always be inappropriate even for very high-dimensional problems; see for instance the results for the microarray data sets in Section 7.4, where ridge regression performs very well (it corresponds to the maximum a posteriori solution with the Gaussian prior).

Table 2 (caption): Average number of features selected by the different methods in the microarray examples over the ten outer cross-validation folds. The last column denotes the average over all data sets. The projection methods select by far the most parsimonious models. The sparsest other method with similar predictive accuracy (Lasso with λ_best, see Fig. 10) selects on average over twice as many features as the densest projection (Proj best-1se-reg).

B.2 Real world benchmarks
C. Proofs of the theoretical results