An Ensemble EM Algorithm for Bayesian Variable Selection

We study the Bayesian approach to variable selection in the context of linear regression. Motivated by a recent work by Rockova and George (2014), we propose an EM algorithm that returns the MAP estimate of the set of relevant variables. Due to its particular updating scheme, our algorithm can be implemented efficiently without inverting a large matrix in each iteration and therefore can scale up with big data. We also show that the MAP estimate returned by our EM algorithm achieves variable selection consistency even when $p$ diverges with $n$. In practice, our algorithm could get stuck with local modes, a common problem with EM algorithms. To address this issue, we propose an ensemble EM algorithm, in which we repeatedly apply the EM algorithm on a subset of the samples with a subset of the covariates, and then aggregate the variable selection results across those bootstrap replicates. Empirical studies have demonstrated the superior performance of the ensemble EM algorithm.


Introduction
Consider a simple linear regression model with Gaussian noise: where y = (y 1 , . . ., y n ) T is the n × 1 response, e = (e 1 , . . ., e n ) T is a vector of iid Gaussian random variables with mean 0 and variance σ 2 , and X is the n × p design matrix.The unknown parameters are the regression parameter β = (β 1 , . . ., β p ) T and the error variance σ 2 .In many real applications such as bioinformatics and image analysis, where linear regression models have been routinely used, the number of potential predictors (i.e., p) is large but only a small fraction of them is believed to be relevant.Therefore the linear model ( 1) is often assumed to be "sparse" in the sense that most of the coefficients β j 's are zero.Estimating the set of relevant variables, S = {j : β j = 0}, is an important problem in modern statistical analysis.
The Bayesian approach to variable selection is conceptually simple and straightforward.
First introduce a p-dimensional binary vector γ = (γ 1 , . . ., γ p ) T to index all the 2 p submodels, where γ j = 1 if the jth variable is included in this model and 0 if excluded.Usually γ j 's are modeled by independent Bernoulli distributions.Given γ, a popular prior choice for β is the "spike and slab" prior (Mitchell and Beauchamp, 1988): where δ 0 (•) is the Kronecker delta function corresponding to the density function of a point mass at 0 and g is a continuous density function.After specifying priors on all the unknowns, one needs to calculate the posterior distribution.Most algorithms for Bayesian variable selection rely on MCMC such as Gibbs or Metropolis Hasting to obtain the posterior distribution; for a review on recent developments in this area, see O'Hara and Sillanpää (2009).MCMC algorithms, however, are insufficient to meet the growing demand on scalability from real applications.Since the primary goal is variable selection, we focus on efficient algorithms that return the MAP estimate of γ, as an alternative to these MCMC-based sampling methods that return the whole posterior distribution on all the unknown parameters.
Recently, Ročková and George (2014) proposed a simple, elegant EM algorithm for Bayesian variable selection.They adopted a continuous version of the "spike and slab" prior-the spike component in ( 2) is replaced by a normal distribution with a small variance (George and McCulloch, 1993), and proposed an EM algorithm to obtain the MAP estimate of the regression coefficient β.The MAP estimate βMAP , however, is not sparse, and an additional thresholding step is needed to estimate γ.
In this paper, we develop an EM algorithm that directly returns the MAP estimate of γ, so no further thresholding is needed.We adopt the same continuous "spike and slab" prior.Different from the algorithm by Ročková and George (2014) that returns βMAP by treating γ as latent, our algorithm returns the MAP estimate of the model index, γMAP , by treating β as latent.The special structure of our EM algorithm allows us to use a computational trick to avoid inverting a big matrix at each iteration, which seems unavoidable in the algorithm by Ročková and George (2014).Further we can show that the γMAP achieves asymptotic consistency even when p diverges to infinity with the sample size n.
Although shown to achieve selection consistency, in practice, our EM algorithm could get stuck at a local mode due to the large discrete space in which γ lies.Borrowing the idea of bagging, we propose an ensemble version of our EM algorithm (which we call BBEM): apply the algorithm on multiple Bayesian bootstrap (BB) copies of the data, and then aggregate the variable selection results.Bayesian bootstrap for variable selection was explored before by Clyde and Lee (2001) for the purpose of prediction, where models built on different bootstrap copies are combined to predict the response.But the focus of our approach is to summarize the evidence for variable relevance from multiple BB copies, which is similar in nature to several frequentist ensemble methods for variable selection, such as the AIC ensemble (Zhu and Chipman, 2006), stability selection (Meinshausen and Bühlmann, 2010), and random Lasso (Wang et al., 2011).
The remaining of the paper is organized as follows.Section 2 describes the EM algorithm in detail, Section 3 presents the asymptotic results, and Section 4 describes the BBEM algorithm.Empirical studies are presented in Section 5 and conclusions and remarks in Section 6.
2 The EM Algorithm

Prior Specification
We adopt the continuous version of "spike and slab" prior for β, i.e. a mixture of two normal components with mean zero and different variances: where v 1 > v 0 > 0. Alternatively, we can write the prior on β as where For the remaining parameters, we specify independent Bernoulli priors on elements of γ, and conjugate priors like Beta and Inverse Gamma on θ and σ 2 , respectively: For hyper-parameters (a 0 , b 0 , ν, λ), we suggest the following non-informative choices unless prior knowledge is available: The choice for v 0 and v 1 will be discussed later.

The Algorithm
With the Gaussian model and prior distributions specified above, we can write down the full posterior distribution: Treating β as the latent variable, we derive an EM algorithm that returns the MAP estimation of parameters Θ = (γ, σ 2 , θ), whereas the roles of β and γ are switched in Ročková and George (2014).

E Step
The objective function Q at the (t + 1)-th iteration in an EM algorithm is defined as the integrated logarithm of the full posterior with respect to β given y and the parameter values from the previous iteration Θ (t) = (γ (t) , σ 2 (t) , θ (t) ), i.e., where It is easy to show that β follows a Normal distribution with mean m and covariance matrix σ 2 (t) V, given Θ (t) and y, where Then the two expectation terms in (5) can be expressed as:

M Step
We sequentially update parameters (γ, θ, σ) to maximize the objective function Q.

Stopping Rule
The EM algorithm alternates between the E-step and M-step until convergence.A natural stopping criterion is to check whether the change of the objective function Q is small.To reduce the computation cost for evaluating the Q function, we adopt a different stopping rule as our main focus is γ: we stop our algorithm when the estimate γ (t) stays the same for k 0 iterations.In practice, we suggest to set k 0 = 3.The pseudo code of this EM algorithm is summarized in Algorithm 1.

Computation Cost
At each E-step, updating the posterior of β given other parameters in (6) requires inverting a p × p matrix which is the major computational burden of this algorithm.When p > n, we can use the Sherman-Morrison-Woodbury formula to compute the inverse of an n × n matrix.
So the computation cost at each iteration is of order O(min(n, p) 3 ).It is, however, still time-consuming when both n and p are large.
Note that the only thing that changes in (13) from iteration to iteration is D γ (t) , a diagonal matrix depending on the binary vector γ (t) .From our experience, only a small fraction of γ (t) j 's are changed at each iteration after the first a couple of iterations.So the idea is to use the following recursive formula to compute V (t) : where ) is a diagonal matrix with the j-th diagonal entry being non-zero only if the inclusion/exclusion status, i.e., the value of γ j , is changed from the last iteration.
Let l denote the number of variables whose γ j values are changed from iteration (t − 1) to is a rank l matrix.We can apply the Woodbury formula on ( 14) to reduce the computation complexity from O(min(n, p) 3 ) to O(l 3 ).
For example, without loss of generality, suppose only the first l covariates have their γ j values changed.Then, we can write j − 1) l j=1 and U consists of the first l columns from I p .Applying the Woodbury formula, we have

Asympototic Consistency
In this section, we study the asymptotic property of γn , the MAP estimate of model index returned by our EM algorithm.Assume the data y n are generated from a Gaussian regression model: Here we consider a triangular array set up: the dimension p = p n diverges with n and the true coefficients β * n also vary with n.Suppose the true model is indexed by γ * n , where We show that our EM algorithm has the following selection consistency property: First we list some regularity conditions needed in our proof.Let λ min (A) denote the smallest eigenvalue of matrix A. We assume where M is a positive constant, and (a 0 , b 0 , ν, λ) are the hyper-parameters from the Beta and InvGamma priors.
Assumption (A1) controls the collinearity among covariates; in the traditional asymptotic setting where p is fixed, we have η 1 = 1.Assumption (A2) controls the sparsity (in terms of L 2 norm) of the true regression coefficient vector.Assumption (A3) requires that the minimal non-zero coefficient cannot go to zero at a rate faster than 1/ √ n; in the traditional asymptotic setting where β * is fixed, we have η 3 = 0. Assumption (A4) is purely technical, which ensures that θn and σ2 n are bounded.In fact we could fix θn and σ2 n to be any constant, which does not affect the proof.In our simulation studies, we still recommend (4) as the choice for hyper-parameters unless p is large.

The BBEM Algorithm
A common issue with EM algorithms is that they could be trapped at a local maximum.
There are some standard remedies available for dealing with this issue, for instance, trying a set of different initial values or utilizing some more advanced optimization procedures at the M-step.Since our EM algorithm is searching for the optimal γ over a big discrete space, all p-dimensional binary vectors, these remedies are less useful when p is large.
When doing optimization with γ, a discrete vector, the resulting solution is often not stable, i.e., has a large variance.Bagging is an easy but powerful method (Breiman, 1996) for variance reduction, which applies the same algorithm on multiple bootstrap copies of the data, and then aggregates the results.We proposed the following ensemble EM algorithm, in which we repeatedly run the EM variable selection algorithm, Algorithm 1 from Section 2.2, on Bayesian bootstrap replicates.
The original bootstrap repeatedly draws samples from the original data set {(x i , y i )} n i=1 with replacement, i.e., each observation (x i , y i ) is sampled with probability 1/n.In Bayesian bootstrap (Rubin, 1981), instead of sampling a subset of the data, we assign a random weight w i to the i-th observation and then fit a weighted least squares regression model on the whole data set.In particular, following Rubin (1981), we generate the weights w = (w 1 , . . ., w n ) from a n-category Dirichlet distribution: When applying Algorithm 1 on a weighted linear regression model, all the updating equations stay the same, except equation ( 6) for the posterior of β, which should be changed to: Eq (7), the expectation of the weighted residual sum of squares, should also be changed accordingly: It is well-known that in order to make the aggregation work, we should control the correlation among estimates from bootstrap replicates.For example, in random forest (Breiman, 2001), the number of variables used for choosing the optimal split of a tree is restricted to a subset of the variables, instead of using all p variables.A similar idea was implemented in Random Lasso (Wang et al., 2011), an ensemble algorithm for variable selection.In the same spirit, we apply the EM algorithm only on a subset of the variables at each Bayesian bootstrap iteration.A naive way is to randomly pick a subset from the p variables.This, however, will be inefficient when p is large and the true model is sparse, since it is likely most random subsets will not contain any relevant variables.So we employ a biased sampling procedure: sample the p variables based on a weight vector π that is defined as that is, variables are sampled based on their marginal effect in a simple linear regression.
The ensemble EM algorithm operates as follows.First we sample a random set of L variables according to the probability vector π, and draw a n × 1 bootstrap weight vector w from ( 16).Let X be the new data matrix with the L columns.Then apply the EM algorithm on X with weight w.Let γ k denote the model returned by the k-th Bayesian bootstrap iteration, where the j-th element of γ k is 1 if the j-th variable is selected and zero otherwise; of course, the j-th element is zero if the j-th variable is not included in the initial L variables.Define the final variable selection frequency for the p variables as We can report the final variable selection result by thresholding φ j 's at some fixed number, for example, a half.Or we can produce a path-plot of φ as v 0 varies, which could be a useful tool to investigate the importance of each variable.We illustrate this in our simulation study in Section 5.
As for the computational cost, the inversion of the L × L matrix in ( 17) is a big improvement compared with that of a p × p matrix, while it can be further simplified through the fast computing trick in Section 2.3.We call this algorithm, BBEM, which is summarized in Algorithm 2.

Empirical Study
In this section, we first compare the proposed EM algorithm (Algorithm 1) with other popular methods on a widely used benchmark data set.Then we compare BBEM (Algorithm 2) with other methods on two more challenging data sets of larger dimensions.
Finally, we applied BBEM on a restaurant revenue data from a Kaggle competition, and showed that our algorithm outperforms the benchmark from random forest.
For the hyper-parameters v 0 and v 1 , we set v 1 = 100 as fixed and tune an appropriate value for v 0 either based on 5-fold cross-validation or BIC.For the initial value for θ, we suggest to use 1/2 for ordinary problems, but √ n/p for large-p problems.The initial value of σ 2 is set as 1.In addition, there are two bootstrap parameters: the total number of replicates K and the number of variables used in each bootstrap L. For efficiency, the number of variables in each bootstrap replicate should not exceed the sample size n.We use K = 100, and L = n/2 = 50 if p is large and L = p is p is small.

A widely used benchmark
First we apply our EM algorithm on a widely used benchmark data set (Tibshirani, 1996), which has p = 8 variables, each from a standard normal distribution with pairwise correlation ρ(x i , x j ) = 0.5 |i−j| .The response variable is generated from where ∼ N(0, σ 2 ).
Following Fan and Li (2001), we repeat the experiment 100 times under two scenarios: (1) n = 40, σ = 3 and (2) n = 60, σ = 1.The result is shown in Table 1, which reports the average number of zero-coefficients (i.e., no selection) among signal variables (x 1 , x 2 , x 5 ) and among noise variables, respectively.The results for SCAD1 (tuning parameter selected by cross-validation), SCAD2 (tuning parameter fixed) and LASSO are taken from Fan and Li (2001).In the first "small sample-size high noise" scenario, our EM algorithm has the highest number of zero-coefficients among noise variables, i.e., the lowest type I error.
The average number of signal variables missed by EM is slightly higher than SCAD1 (where the tuning parameter is chosen by cross-validation) but less than SCAD2 (where the tuning parameter is pre-fixed).But overall, our EM algorithm and the two SCAD methods perform the best.In the second "large sample-size low noise" scenario, no signal variables are missed by any method, but EM has the lowest type I error.
Following Wang et al. (2011) and Xin and Zhu (2012), we repeat the experiment 100 times with the same sample size n = 50 but two different noise levels: low noise level (σ = 3) and high noise level (σ = 6).Table 2 reports the minimum, median, maximum of being selected out of 100 simulations for the signal and the noise variables, respectively.Both Lasso and random Lasso have a higher chance of selecting the signal variables, but at the price of mistakenly including many noise variables.Overall, our EM algorithm performs the best, along with PGA and stability selection, two frequentist ensemble methods for variable selection.

A highly-correlated data
Next we demonstrate our two algorithms on a highly-correlated example from Wang et al. (2011).The data has p = 40 variables and the response y is generated from (j=3,4,6,7,8) x The results other than EM (Alg 1) are from Fan and Li (2001).
where ∼ N(0, σ 2 ) and σ = 6.Each x i is generated from a standard normal with the following correlation structure among the first six signal variables: the signal variables are divided into two groups, V 1 = {x 1 , x 2 , x 3 } and V 2 = {x 4 , x 5 , x 6 }; the within group correlation is 0.9 and the between-group correlation is 0.
We repeat the simulation 100 times with n = 50 and n = 100, and the results are summarized in Table 3.For this example, due to the high correlation among features we expect ensemble methods to perform better.Indeed, BBEM has the best performance in terms of selecting true signal variables while controlling the error of including noise variables.The performance of the EM algorithm, although not the best, is also comparable with other top ensemble methods like random Lasso from Wang et al. (2011), and T2E and PGA from Xin and Zhu (2012).
For illustration purpose, we apply BBEM on a data set with n = 50 and v 0 varying from 10 −4 to 1. Figure 1 shows the path-plot of the selection frequency from BBEM.
There is clearly a gap between the signal variables and the noise ones.For a range of v 0 , from 0.001 to 0.02, BBEM can successfully select the six true variables {x 1 , x 2 , . . ., x 6 } if we threshold the selection frequency φ j at 0.5.

A Large-p small-n example
Finally we apply BBEM on a large-p small-n example from Ročková and George (2014), where p = 1000 and n = 100.Each of the p features is generated from a standard normal with pairwise correlation to be 0.6 |i−j| and the response y is generated from the following linear model: where ∼ N(0, 3).For this large p example, we set the parameters in the BBEM algorithm as follows:  this example where the true model is known to be sparse, we choose to tune v 0 via BIC.
For illustration purpose, we also include BBEM with a fixed tuning parameter v 0 = 0.03 in the comparison group.We compare BBEM with the EMVS algorithm from Ročková and George (2014), which is implemented by us using the annealing technique for β's initialization, and fixed v 0 = 0.5, v 1 = 1000 as suggested in Ročková and George (2014).
Table 5.3 reports the average number of signal and noise variables being selected over 100 iterations for each method.BBEM with BIC tuning performs the best: it selects 2.99 signal variables out of 3 on average (i.e., only miss one variable, the weakest signal x 1 , once in all 100 iterations) and meanwhile has the smallest type I error.The BBEM algorithm with a fixed tuning parameter has a similar result as EMVS but is much faster.The computation advantage for BBEM comes from two aspects: the computation trick that reduces the computation cost on matrix inversion and the sub-sampling step in Bayesian bootstrap which allows us to deal with just a subset of variables of size smaller than p.

A real example
For TFI, a company that owns some of the world's most well-known brands like Burger If a wrong location is chosen, likely the restaurant will soon be closed and all the initial investment will be lost.TFI hosted a prediction competition on Kaggle We first transform the "Open Date" to a numeric feature called "Year Since 1900" and merge the "City" column into the "City Group" column which now contains four categories: Istanbul, Izmir, Ankara, and others (small cities).Then we crate dummy variables for the categorical features like "City Group" and "Restaurant Type" and keep all the obfuscated numeric columns P1-P37.The final training set has 43 features and 137 samples.
After standardizing the data, we fix v 1 at 100 and tune v 0 from 10 −4.5 to 10 −0.5 for the BBEM algorithm, where each bootstrap sample uses L = 15 variables, and the total number of replicates is K = 300.The path-plot of selection frequency for important features is shown in Figure 5.4.It is not surprising that "City Group", "Years Since 1900" and "Restaurant Type" are important predictors for the revenue.Quite a few obfuscated features are also selected as important predictors.Although we do not know their meanings, they should provide valuable information for TFI to choose their next restaurant's location.
Since the evaluation metric for this specific competition is based on the rooted mean square error (RMSE), we use the same metric in our 5-fold cross-validation.We tuned v 0 from the set {0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01}, and found v 0 = 0. (RMSE=1998014.94)provided by Kaggle2 .It is impressive for BBEM to outperform random forest considering that BBEM does not use any nonlinear features but random forest does.

Further Discussion
Variable selection is an important problem in modern statistics.In this paper, we study the Bayesian approach to variable selection in the context of multiple linear regression.We proposed an EM algorithm that returns the MAP estimate of the set of relevant variables.
The algorithm can be operated very efficiently and therefore can scale up with big data.In addition, we have shown that the MAP estimate from our algorithm provides a consistent estimator of the true variable set even when the model dimension diverges with the sample size.Further, we propose an ensemble version of our EM algorithm based on Bayesian bootstrap, which, as demonstrated via real and simulated examples, can substantially increase accuracy while maintaining the computation efficiency.
Although we restrict our discussion for the linear model, the two algorithm we proposed can be easily extended to other generalized linear models by using latent variables (Polson et al., 2013), an interesting topic for future research.
Appendix: Proof of theorem 3.1 Proof.Recall the EM algorithm returns where the threshold and the conditional second moment of β j is equal to m Here we represent the posterior mean of β as three separate terms: the true coefficient vector β * n , the bias term b n and the random error term W n .So the event {γ n = γ * n } is equivalent to First we prove the following results that quantify m 2 j and V jj .
(R1) V jj is upper bounded by the largest eigenvalue of V, where for two sequences {a n } and {b n }, we write a n ≺ b n if a n /b n → 0.
The matrix L 2 norm is defined as A 2 = sup v =1 Av 2 , which is equal to its largest eigenvalue (singular value) when A is symmetric (non-symmetric).
(R3) Note that W n is not a Gaussian random vector due to the dependence between D and e n , but it can be rewritten as where A = X T n X n + D −1 −1 X T n X n and Wn = (X T n X n ) −1 X T n e n .Since A is a matrix with norm bounded by 1, we have (R4) Wn = (X T n X n ) −1 X T n e n is a Gaussian random vector with covariance σ 2 (X T n X n ) −1 and mean 0. So the variance for W nj is upper bounded by σ 2 λ −1 n1 .
Recall the tail bound for Gaussian variables: for any Z ∼ N(0, τ 2 ), (i.e., no selection) out of 100 simulations for each type of variables (Signal and Noise) are shown.The results other than EM and BBEM are from Xin and Zhu (2012).L = n/2 = 50 and the total number of replicates K = 100.It is well known that crossvalidation based on prediction accuracy tends to include more noise variables.So, for

Figure 1 :
Figure 1: Highly-correlated data n = 50.A path-plot of the average selection frequency when v 0 varies in the logarithm scale of base 10.Top 6 lines represent the true variables x 1:6 and the bottom 3 lines represent the maximum, median and minimum among the noise variables x 7:40 .
King and Arby's, decisions on where to open new restaurants are crucial.It usually takes a big investment of both time and capital at the beginning to set up a new restaurant.

Figure 2 :
Figure 2: Restaurant data.The path plot of selection frequency when v 0 varies in the logarithm scale of base 10.Only a subset of variables with high selection frequencies are displayed.

Table 1 :
A widely used benchmark.The average number of zero-coefficients (i.e., no selection) out of 100 simulations for each types of variable (Signal or Noise) are shown.

Table 2 :
Xin and Zhu (2012)hmark.The min, median, max number of being selected out of 100 simulations for each types of variable (Signal or Noise) are shown.The results other than EM (Alg 1) are fromXin and Zhu (2012).

Table 3 :
the number of variables used in each bootstrap iteration A highly-correlated data.The min, median, max number of times being selected

Table 4 :
Ročková and George (2014)uild a mathematical model to predict the revenue of a restaurant based on a set of demographic, real estate, and commercial information.The data contains 137 restaurants in the training set and 1000 restaurants in the test set.Features include the Open Date, A large-p small-n example.The table shows the average number of signal and noise variables being selected out of 100 iterations.In BBEM, v 0 is either chosen by BIC or fixed at 0.03.EMVS is the algorithm proposed byRočková and George (2014).