Theory of Optimal Bayesian Feature Filtering

Optimal Bayesian feature filtering (OBF) is a supervised screening method designed for biomarker discovery. In this article, we prove two major theoretical properties of OBF. First, optimal Bayesian feature selection under a general family of Bayesian models reduces to filtering if and only if the underlying Bayesian model assumes all features are mutually independent. Therefore, OBF is optimal if and only if one assumes all features are mutually independent, and OBF is the only filter method that is optimal under at least one model in the general Bayesian framework. Second, OBF under independent Gaussian models is consistent under very mild conditions, including cases where the data is non-Gaussian with correlated features. This result provides conditions where OBF is guaranteed to identify the correct feature set given enough data, and it justifies the use of OBF in non-design settings where its assumptions are invalid.


Introduction
Biomarker discovery entails mining a small-sample high-dimensional dataset for a list of features that represent potentially interesting molecular biomarkers. The hope is that the reported features might direct future studies (Feng et al., 2004) that ultimately lead to new diagnostic or prognostic tests, better treatment recommendations, or a better understanding of the regulatory mechanisms underlying the biological phenomena or disease under study (Rifai et al., 2006; Ramachandran et al., 2008).
Unfortunately, discovering reliable and reproducible biomarkers has proven to be difficult. One reason is that the algorithms employed (see Saeys et al. (2007) and Ang et al. (2016) for reviews on common methods) are typically not well suited for the biomarker discovery problem. Univariate filter methods often exhibit quirks depending on the scoring function employed (Lazar et al., 2012). For example, the popular t-test cannot detect markers based on large differences in variance alone (Foroughi pour and Dalton, 2018b), even though such markers may have an important role to play in the disease under study or help uncover previously unknown subclasses of the disease. Multivariate methods may seem to have an advantage over filters because they can account for correlations; however, rather than use this correlation information to identify all markers that may be of interest, they tend to avoid selecting redundant markers or reward selecting smaller feature sets to simplify model construction or avoid overfitting (Sima and Dougherty, 2008; Awada et al., 2012; Ang et al., 2016; Li et al., 2017). This effect is so catastrophic for biomarker discovery that univariate methods often far outperform multivariate methods (Sima and Dougherty, 2006, 2008; Foroughi pour and Dalton, 2018b).
Here we examine optimal Bayesian feature filtering (OBF), a supervised univariate filter method designed from the ground up for exploratory biomarker discovery (Foroughi pour and Dalton, 2015). OBF assumes a finite number of classes (e.g., patients given drug A versus drug B). Under its assumed model, OBF optimally detects and ranks the set of all features with distributional differences between the classes. It has been shown through simulations that OBF has competitive and robust performance across Bayesian models with block-diagonal covariances, and that it enjoys particularly excellent performance when markers are individually strong with low correlations (Foroughi pour and Dalton, 2017d, 2018a). Foroughi pour and Dalton (2018b) also examined the performance of OBF when its modeling assumptions (e.g., independence, priors, and Gaussianity) are violated, provided guidance on choosing inputs and objective criteria for robust performance, and demonstrated that OBF enjoys low computation cost.
Under Gaussian models with certain non-informative priors, OBF reduces to testing each feature separately using the test statistic studied by Zhang et al. (2012) for the equality of two Gaussian populations. OBF does not use classification or regression in any part of its framework. While variable selection methods based on classification or regression (for instance LASSO) are useful for designing predictive models, like most multivariate methods they are typically not suitable for biomarker discovery because their objective is model construction. Small sample sizes worsen the overfitting problem, often resulting in small feature sets. If classification is involved, error estimation bias and variance result in poor selection performance (Sima and Dougherty, 2006).
Bayesian variable selection methods, like Bayesian LASSO, the Bayesian extension to group LASSO by Xu and Ghosh (2015), and works by Lee et al. (2003) and Baragatti (2011) based on generalized linear models (GLMs), suffer from similar problems. Whereas OBF places priors directly on the underlying data generation model, most priors for Bayesian variable selection, for example spike-and-slab priors (Mitchell and Beauchamp, 1988; Madigan and Raftery, 1994; George and McCulloch, 1997; Ishwaran and Rao, 2005), place uncertainty on the classification or regression parameters, which are difficult to justify, interpret and validate in practice. Multicollinearity can be assuaged by grouping genes, but the methods of Rockova and Lesaffre (2014) and Xu and Ghosh (2015) assume grouping information is known a priori, which is infeasible in exploratory analysis. Also, in contrast with OBF, Bayesian methods often rely on computationally intensive methods like Markov chain Monte Carlo (MCMC) sampling or variational inference (Carbonetto and Stephens, 2012).
Shared kernel Bayesian screening (SKBS) is an interesting approach that assumes all feature distributions belong to a family of mixture models with K components, and the objective is to test whether the classes have different weights in the mixture distribution. Whereas OBF treats each individual feature separately, SKBS uses the same K dictionary mixture components for all features and allows only the kernel weights to vary. When sample size is small we observed better performance using small K, but in this case the data may not be properly modeled for all features and the detected mixture components lose interpretability. SKBS also uses MCMC, making it more computationally expensive than OBF. Bayesian non-parametric methods, for example those based on Dirichlet or Pitman-Yor processes, have also gained popularity in bioinformatics for classification, inferring gene networks, clustering, and detecting chromosomal aberrations (Shahbaba and Neal, 2009; Libbrecht and Noble, 2015; Mitra and Müller, 2015; Ni et al., 2017). Spike-and-slab Dirichlet processes avoid the need to specify the number of mixtures; however, it is still difficult to specify and justify the base distribution and priors in practice. While our focus here is on the supervised case, many works like that of Cui and Cui (2012) focus on the unsupervised case. Computation is also a key concern; Cui and Cui use Bayesian expectation-maximization, which is more demanding than OBF. Holmes et al. (2015) presents a supervised method based on Pólya trees; however, the model may require larger samples than are available in a typical exploratory analysis and may be sensitive to imbalanced samples.
Our main contributions are two-fold: (1) we prove optimal Bayesian feature selection under a general family of Bayesian models reduces to filtering (e.g., OBF) if and only if the underlying Bayesian model assumes all features are independent, and (2) we prove OBF under independent Gaussian models is a consistent estimator of the feature set we wish to select under mild conditions, including cases where the data is non-Gaussian with correlated features. Contribution (1) has two practical implications: OBF is the only filter method for which there exists a model in the general Bayesian framework where it is optimal, and OBF is optimal if and only if one assumes all features are independent. Contribution (2) is of enormous importance, since it provides conditions where OBF is guaranteed to identify the correct feature set given enough data, and it justifies the use of OBF in non-design settings where its assumptions are invalid.
We review the Bayesian model in Section 2 and optimal set selection in Section 3. In Section 4 we discuss OBF and present contribution (1) in Theorem 1, and in Section 5 we examine consistency and present contribution (2) in Theorems 2 and 3. We provide a demonstration on synthetic microarray data in Section 6, and conclude in Section 7. We provide a demonstration on real colon cancer microarray data in Sections S2 and S3 of Supplementary Material A.

Bayesian Model
In Section 2.1, we describe the general three-level Bayesian model originally proposed in Dalton (2013). In Sections 2.2 and 2.3 we cover the independent case and the independent Gaussian case, respectively, which were originally presented in Foroughi pour and Dalton (2015). Although not covered here, an independent categorical model and several models with correlations in the general Bayesian framework have been proposed (Dalton, 2013; Foroughi pour and Dalton, 2016a, 2017c,d).

General Bayesian Model
Consider a feature selection problem in which we are to identify all features that have distinct distributions between two classes, y = 0 and y = 1. Although we consider binary labels here, the multiclass case is similar and has been characterized in Foroughi pour and Dalton (2017b). Let F be a set of feature indices, let each feature f ∈ F be associated with a space, X_f, and let X = ∏_{f∈F} X_f be the feature space. Typically, X_f = ℝ for all f. We call features that we wish to select, e.g., those with distributional differences between classes, "good features." When viewed as a random quantity, we denote this set by Ḡ, and we denote a realization of this random set by G. Likewise, we call features that we wish not to select "bad features," and denote them by B̄ = F \ Ḡ when random and B = F \ G when fixed, where "\" is the set difference. Conditioning on events like {Ḡ = G} or {f ∈ Ḡ} does not mean the set of good features is deterministic. Rather, this should be interpreted merely as a hypothesis that these events hold for the current G ⊆ F or f ∈ F under consideration. Furthermore, since Ḡ = G if and only if B̄ = B, conditioning on the event {Ḡ = G} is equivalent to conditioning on the event {B̄ = B}. We denote conditioning on these events by "|G" or "|B", and use these notations interchangeably throughout.
We denote a prior on Ḡ across the power set of F by p(G) = P(Ḡ = G). Given Ḡ = G, let θ^G_y denote data generation parameters of class y ∈ {0, 1} features in G, let θ^B denote data generation parameters of features in B, and let θ = {θ^G_0, θ^G_1, θ^B} be the set of all data generation parameters, with corresponding parameter spaces Θ^G_0, Θ^G_1 and Θ^B. We denote a prior on θ by p(θ|G), and assume θ^G_0, θ^G_1 and θ^B are conditionally mutually independent, i.e.,

p(θ|G) = p(θ^G_0|G) p(θ^G_1|G) p(θ^B|G). (2.1)

We assume feature selection is aided by the observation of feature-label pairs, and we denote the complete dataset, including features and labels, by S. Though we assume the data is complete here, the missing data problem has been studied for special cases of this model in Foroughi pour and Dalton (2016b). Let x ∈ X be a feature vector, and let x^G and x^B denote the elements of x that correspond to features in G and B, respectively. Given Ḡ = G, parameter θ and class y, we also assume x^G and x^B are independent:

p(x|y, θ, G) = p(x^G|y, θ^G_y, G) p(x^B|θ^B, G). (2.2)

Assume the data is comprised of n points with n_y points in class y, that the label of each point is determined by a process independent of θ and G, and that, conditioned on the labels, sample points are independent with points belonging to the same class identically distributed. These assumptions hold under many sampling strategies, for instance random and separate sampling. Let S^G_y and S^B be the parts of the data corresponding to features in G from class y and features in B from both classes, respectively. Due to independence between x^G and x^B and independence between sample points,

p(S|θ, G) ∝ p(S^G_0|θ^G_0, G) p(S^G_1|θ^G_1, G) p(S^B|θ^B, G), (2.3)

where the proportionality constant depends on the distribution of n_y for the given sampling strategy. Thus, S^G_0, S^G_1 and S^B are mutually independent given θ and G. Further, from (2.1) and (2.3), they are also independent given only G, that is,

p(S|G) ∝ p(S^G_0|G) p(S^G_1|G) p(S^B|G), (2.4)

where for y ∈ {0, 1},

p(S^G_y|G) = ∫ p(S^G_y|θ^G_y, G) p(θ^G_y|G) dθ^G_y and p(S^B|G) = ∫ p(S^B|θ^B, G) p(θ^B|G) dθ^B. (2.5)

Let p(G|S) = P(Ḡ = G|S) be the posterior probability that the set G is precisely the set of good features, given our observation of the data.
Applying Bayes' rule and (2.4),

p(G|S) ∝ p(G) p(S^G_0|G) p(S^G_1|G) p(S^B|G). (2.6)

The marginal prior and posterior probabilities that an individual feature, f ∈ F, is in Ḡ are denoted by

π(f) = P(f ∈ Ḡ) = Σ_{G: f∈G} p(G) and π*(f) = P(f ∈ Ḡ|S) = Σ_{G: f∈G} p(G|S), (2.7)

respectively. Note that P(f ∈ B̄) = 1 − π(f) and P(f ∈ B̄|S) = 1 − π*(f). Also,

E(|Ḡ|) = E(Σ_{f∈F} I(f ∈ Ḡ)) = Σ_{f∈F} π(f), (2.8)

where |·| denotes cardinality for sets, and I(q) is the indicator function, equal to 1 if q holds and 0 otherwise. Similarly, E(|Ḡ| | S) = Σ_{f∈F} π*(f). The expected number of good features, before and after observing data, may thus be found from π and π*, respectively.
In biomarker discovery, previously known biomarkers can be integrated into the prior to aid the discovery of new biomarkers (Foroughi pour and Dalton, 2017a). When prior knowledge is not available, improper priors for p(θ|G) may be needed, and the above derivations become invalid. To circumvent this problem we: (1) require p(θ|G) to be such that the integrals in (2.5) are positive and finite, (2) require p(G) to be proper, and (3) take (2.5), (2.6) and (2.7) as definitions, with the proportionality constant in (2.6) defined such that Σ_{G⊆F} p(G|S) = 1. Although improper priors are controversial, see for example the marginalization paradoxes described by Dawid et al. (1973), counterexamples discussed by Jaynes (2003), and discussions on the Jeffreys-Lindley paradox by Robert (1993, 2014), this guarantees the posterior p(G|S) and marginal posterior π*(f) under improper priors are normalizable to valid densities and have definitions consistent with proper priors. See Sections S5 and S6 of Supplementary Material A for further discussions on improper priors.

Independent Bayesian Model
Assume a prior p(G) on Ḡ under which the events {f ∈ Ḡ} are mutually independent. Then,

p(G) = ∏_{g∈G} π(g) ∏_{b∈B} (1 − π(b)). (2.9)

To completely characterize this prior, note that one need only specify π(f) for all f ∈ F. Further, if π(f) = p is constant for all f ∈ F, then |Ḡ| is binomial(|F|, p).
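As a quick numerical check of (2.9), the following sketch (hypothetical feature names and a constant π(f) = p of our own choosing) enumerates the power set of F and verifies that the prior is proper and that |Ḡ| is binomial(|F|, p):

```python
import itertools
from math import comb, isclose

# Independent prior with constant pi(f) = p for every feature (eq. 2.9):
#   p(G) = p^|G| * (1 - p)^(|F| - |G|)
p = 0.3
F = ["f1", "f2", "f3", "f4"]  # hypothetical feature indices

prior = {
    frozenset(G): p ** len(G) * (1 - p) ** (len(F) - len(G))
    for r in range(len(F) + 1)
    for G in itertools.combinations(F, r)
}

# The prior is proper: it sums to 1 over the power set of F.
assert isclose(sum(prior.values()), 1.0)

# |G| follows a binomial(|F|, p) distribution.
for k in range(len(F) + 1):
    mass = sum(q for G, q in prior.items() if len(G) == k)
    assert isclose(mass, comb(len(F), k) * p ** k * (1 - p) ** (len(F) - k))
```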
For every f ∈ F we assign three parameters, θ^f_0, θ^f_1 and θ^f, with parameter spaces Θ^f_0, Θ^f_1 and Θ^f and densities p(θ^f_0), p(θ^f_1) and p(θ^f), respectively. Let θ^G_y = {θ^f_y : f ∈ G} and θ^B = {θ^f : f ∈ B}, and assume parameters of individual features are mutually independent given Ḡ = G. Then (2.1) becomes p(θ|G) = ∏_{g∈G} p(θ^g_0) p(θ^g_1) ∏_{b∈B} p(θ^b). Finally, we assume features are mutually independent given Ḡ = G, θ and y, thus the joint density in (2.2) is now of the form p(x|y, θ, G) = ∏_{g∈G} p(x^g|θ^g_y) ∏_{b∈B} p(x^b|θ^b), where p(x^g|θ^g_y) and p(x^b|θ^b) are the marginals of good and bad features, respectively. As in (2.6), one can show

p(G|S) ∝ p(G) ∏_{g∈G} p(S^g_0|g ∈ Ḡ) p(S^g_1|g ∈ Ḡ) ∏_{b∈B} p(S^b|b ∈ B̄), (2.10)

where, as in (2.5),

p(S^f_y|f ∈ Ḡ) = ∫ p(S^f_y|θ^f_y) p(θ^f_y) dθ^f_y and p(S^f|f ∈ B̄) = ∫ p(S^f|θ^f) p(θ^f) dθ^f. (2.11)

Dividing the right-hand side of (2.10) by the constant ∏_{f∈F} (1 − π(f)) p(S^f|f ∈ B̄),

p(G|S) ∝ ∏_{g∈G} h(g), (2.12)

where for all f ∈ F, we define

h(f) = [π(f) p(S^f_0|f ∈ Ḡ) p(S^f_1|f ∈ Ḡ)] / [(1 − π(f)) p(S^f|f ∈ B̄)]. (2.13)

It follows that

π*(f) = h(f)/(1 + h(f)), (2.14)

and that the posterior factorizes as

p(G|S) = ∏_{g∈G} π*(g) ∏_{b∈B} (1 − π*(b)). (2.15)
When p(θ^f_y) or p(θ^f) are improper, we require π(f) to be proper, we require the integrals in (2.11) to be positive and finite and take these equations as definitions, and we define π*(f) = h(f)/(1 + h(f)) as in (2.14), where h(f) is defined in (2.13).
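The factorization in (2.12)–(2.15) can be checked numerically. The sketch below (arbitrary made-up h(f) values) verifies that the brute-force marginal posterior Σ_{G∋f} p(G|S) matches the closed form h(f)/(1 + h(f)), and that the normalized posterior equals the product form in (2.15):

```python
import itertools
import math

# Hypothetical per-feature evidence ratios h(f) from (2.13); any positive values work.
h = {"f1": 4.0, "f2": 0.25, "f3": 1.0}
F = list(h)

# Unnormalized posterior over all subsets G of F: p(G|S) ∝ prod_{g in G} h(g)  (eq. 2.12)
subsets = [frozenset(G) for r in range(len(F) + 1) for G in itertools.combinations(F, r)]
weights = {G: math.prod(h[g] for g in G) for G in subsets}
Z = sum(weights.values())
post = {G: w / Z for G, w in weights.items()}

# Marginal posteriors two ways: brute-force sum over sets vs. pi*(f) = h(f)/(1+h(f))  (eq. 2.14)
pi_star = {f: h[f] / (1 + h[f]) for f in F}
for f in F:
    brute = sum(p for G, p in post.items() if f in G)
    assert abs(brute - pi_star[f]) < 1e-12

# The full posterior factorizes as prod pi*(g) * prod (1 - pi*(b))  (eq. 2.15)
for G in subsets:
    product = math.prod(pi_star[f] if f in G else 1 - pi_star[f] for f in F)
    assert abs(post[G] - product) < 1e-12
```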

Independent Gaussian Model
Now suppose all features are Gaussian with conjugate priors. If f ∈ Ḡ then θ^f_y = [µ^f_y, σ^f_y], where µ^f_y and σ^f_y are the mean and variance of x^f in class y, respectively. Similarly, if f ∈ B̄, then θ^f = [µ^f, σ^f], where µ^f and σ^f are the mean and variance of x^f. To simplify notation, we drop the conventional square in variances, σ^f_y and σ^f.
For good features we assume a conjugate prior of the form p(θ^f_y) = p(σ^f_y) p(µ^f_y|σ^f_y), where s^f_y, κ^f_y, m^f_y and ν^f_y are real-valued hyper-parameters. For a proper prior we require s^f_y, κ^f_y, ν^f_y > 0, in which case the posterior has the same conjugate form with updated hyper-parameters, where μ̂^f_y and σ̂^f_y are the sample mean and unbiased sample variance, respectively, of feature f points in class y (Murphy, 2007). Note that p(S^f_y|f ∈ Ḡ) is the normalization constant in finding the posterior, p(θ^f_y|S^f_y), from the prior times likelihood, p(θ^f_y) p(S^f_y|θ^f_y). (2.16) Moving on to bad features, we assume p(θ^f) = p(σ^f) p(µ^f|σ^f), where we are given real-valued hyper-parameters s^f, κ^f, m^f, and ν^f, and where μ̂^f and σ̂^f are the sample mean and variance, respectively, of feature f points in both classes (Murphy, 2007). As in (2.16), (2.17) gives p(S^f|f ∈ B̄). Plugging (2.16) and (2.17) in (2.13) yields closed-form expressions for h(f). Under improper priors we require π(f) to be proper, and to ensure (2.16) and (2.17) are positive and finite we impose constraints on the hyper-parameters. In addition, L_f > 0 becomes a separate parameter specified by the user. All theorems in this work hold under these improper priors, and set selection under proper and improper priors for the independent Gaussian case has been studied extensively in Foroughi pour and Dalton (2018b). Following DeGroot (1970) and Akaike (1980), in Section S5 of Supplementary Material A we also show that π*(f) from these improper priors is equivalent to a limit of π*(f) from a sequence of proper priors.
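To make the Gaussian case concrete, here is a minimal sketch of the per-feature score. It assumes the standard normal-inverse-gamma conjugate marginal likelihood (Murphy, 2007); the hyper-parameter names m0, k0, a0, b0 and their default values are our own illustrative choices, not the paper's parameterization. The score π*(f) = h(f)/(1 + h(f)) compares class-split against pooled marginal likelihoods, as in (2.13) and (2.14):

```python
import math

def log_marglik(x, m0=0.0, k0=1.0, a0=1.0, b0=1.0):
    """Log marginal likelihood of data x under a Gaussian likelihood with a
    normal-inverse-gamma prior (Murphy, 2007):
        sigma2 ~ InvGamma(a0, b0),   mu | sigma2 ~ N(m0, sigma2 / k0).
    Hyper-parameter names and defaults are illustrative assumptions."""
    n = len(x)
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)  # sum of squared deviations
    kn = k0 + n
    an = a0 + n / 2.0
    bn = b0 + 0.5 * ss + k0 * n * (xbar - m0) ** 2 / (2.0 * kn)
    return (-0.5 * n * math.log(2 * math.pi)
            + 0.5 * math.log(k0 / kn)
            + a0 * math.log(b0) - an * math.log(bn)
            + math.lgamma(an) - math.lgamma(a0))

def obf_posterior(x0, x1, prior=0.5):
    """pi*(f) = h(f) / (1 + h(f)) as in (2.14), where h(f) weighs the
    class-split marginal likelihoods against the pooled one, as in (2.13)."""
    log_h = (math.log(prior) - math.log(1.0 - prior)
             + log_marglik(x0) + log_marglik(x1)
             - log_marglik(x0 + x1))
    return 1.0 / (1.0 + math.exp(-log_h))
```

For a feature whose class-conditional samples are well separated, `obf_posterior` returns a value near 1; when the two classes look identical, the pooled model dominates and the score typically falls below the MNC threshold of 0.5.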

Optimal Bayesian Feature Selection
We define five criteria for optimal Bayesian feature selection under the general Bayesian model: (1) the maximum a posteriori (MAP) criterion selects the feature set having the highest posterior probability of being the good feature set, (2) constrained MAP (CMAP) uses the MAP objective but considers only feature sets of a given size, (3) the minimal risk (MR) criterion minimizes a notion of risk, with the maximum number correct (MNC) rule being a special case that minimizes the number of mislabeled features, (4) constrained MNC (CMNC) uses the MNC objective but considers only feature sets of a given size, and (5) the Neyman-Pearson (NP) criterion maximizes the expected number of good features selected given a limited expected number of bad features selected. MAP was originally presented in Dalton (2013), while MNC and an early form of CMNC constrained to selecting two features (2MNC) were originally presented in Foroughi pour and Dalton (2014); all of the other criteria are new.

Maximum a Posteriori
The MAP feature set is the set having maximum posterior probability:

G_MAP = arg max_{G⊆F} p(G|S). (3.1)

We also define the CMAP feature set to be the MAP feature set under the constraint of selecting exactly D features for some user-specified constant D:

G_CMAP = arg max_{G⊆F: |G|=D} p(G|S). (3.2)

Let ℓ(G, Ḡ) be a loss function in selecting G when Ḡ is the true set of good features, and let E(ℓ(G, Ḡ)|S) be the risk in selecting G. It can be shown that the MAP feature set minimizes risk under a zero-one loss function that assigns ℓ(G, Ḡ) = 0 when Ḡ = G and ℓ(G, Ḡ) = 1 when Ḡ ≠ G. Therefore, one drawback of the MAP objective is that it assigns the same loss to all incorrect feature sets, regardless of how many features are labeled incorrectly. This is remedied by the MR objective, described in the next section.

Minimal Risk
Consider the family of objective criteria with ℓ(G, Ḡ) of the form:

ℓ(G, Ḡ) = λ_GG |G ∩ Ḡ| + λ_GB |G ∩ B̄| + λ_BG |B ∩ Ḡ| + λ_BB |B ∩ B̄|, (3.3)

where λ_GG, λ_GB, λ_BG, and λ_BB are constants such that λ_GB ≥ λ_BB and λ_BG ≥ λ_GG. The MR feature set is defined as:

G_MR = arg min_{G⊆F} E(ℓ(G, Ḡ)|S). (3.4)

Observe that,

E(|G ∩ Ḡ| | S) = Σ_{g∈G} π*(g), (3.5)

E(|G ∩ B̄| | S) = Σ_{g∈G} (1 − π*(g)). (3.6)

Similarly, E(|B ∩ Ḡ| | S) = Σ_{b∈B} π*(b) and E(|B ∩ B̄| | S) = Σ_{b∈B} (1 − π*(b)). Thus,

E(ℓ(G, Ḡ)|S) = Σ_{g∈G} [λ_GG π*(g) + λ_GB (1 − π*(g))] + Σ_{b∈B} [λ_BG π*(b) + λ_BB (1 − π*(b))], (3.7)

which is minimized by

G_MR = {f ∈ F : π*(f) > T}, where T = (λ_GB − λ_BB) / (λ_GB − λ_BB + λ_BG − λ_GG). (3.8)

In other words, the MR objective ranks features by their marginal posterior probability of being in Ḡ, and selects those with probabilities exceeding a given threshold.
When λ_GG = λ_BB = 0 and λ_GB = λ_BG = 1, the MR cost function minimizes the expectation of the number of mislabeled features, |G ∩ B̄| + |B ∩ Ḡ|, or equivalently, maximizes the expectation of the number of correctly labeled features, c(G, Ḡ) = |G ∩ Ḡ| + |B ∩ B̄|. This results in the MNC objective:

G_MNC = {f ∈ F : π*(f) > 0.5}. (3.9)

MNC thus selects features with a posterior probability of being in Ḡ greater than 0.5.
CMR minimizes risk under the constraint of selecting exactly D features:

G_CMR = arg min_{G⊆F: |G|=D} E(ℓ(G, Ḡ)|S). (3.10)

Following a procedure similar to that used to derive (3.8), observe that minimizing (3.10) is equivalent to maximizing Σ_{g∈G} π*(g) over sets of size D. Thus, G_CMR ranks π*(f) and selects the D features with highest rank. Since the λ's need not be specified, we also call this criterion CMNC.

Neyman-Pearson
Viewing the number of correctly identified good features, |G ∩ Ḡ|, as the number of true positives, and the number of incorrectly identified bad features, |G ∩ B̄|, as the number of false positives, the NP objective maximizes the expected number of true positives while bounding the expected number of false positives by a constant 0 ≤ α ≤ E(|B̄| | S):

G_NP = arg max_{G⊆F} E(|G ∩ Ḡ| | S) subject to E(|G ∩ B̄| | S) ≤ α. (3.12)

From (3.5) and (3.6), we have that

G_NP = arg max_{G⊆F} Σ_{g∈G} π*(g) subject to Σ_{g∈G} (1 − π*(g)) ≤ α. (3.13)

This is solved by ranking π*(f) and iteratively adding features with highest rank to G_NP until adding a new feature would violate the constraint. NP is closely related to MR and CMNC in that all of these methods rank features using the same scoring function, π*(f). However, they use different score cutoffs; in MR the cutoff is a constant threshold, in CMNC the cutoff forces a certain set size, and in NP the cutoff depends on the values of the π*(f). For a selection rule G_k with free parameter k, plotting the pair (E(|G_k ∩ B̄| | S), E(|G_k ∩ Ḡ| | S)) in the [0, E(|B̄| | S)] × [0, E(|Ḡ| | S)] space under various k results in a curve analogous to a receiver operating characteristic (ROC) curve. The ROC curves for MR (varying T), CMNC (varying D) and NP (varying α) are all generated by k = 0, 1, . . . , |F|, where the π*_{(k)} are the π*(f) ordered from largest to smallest.
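The three thresholding schemes can be sketched directly from ranked scores. In this illustration (made-up feature names and π* values of our own choosing), `select_mr` applies a constant threshold as in (3.8), `select_cmnc` picks the top D, and `select_np` greedily solves (3.13):

```python
def select_mr(pi_star, T):
    """MR / MNC: keep every feature whose marginal posterior exceeds T (eq. 3.8)."""
    return {f for f, p in pi_star.items() if p > T}

def select_cmnc(pi_star, D):
    """CMNC: keep the D features with the highest marginal posteriors."""
    ranked = sorted(pi_star, key=pi_star.get, reverse=True)
    return set(ranked[:D])

def select_np(pi_star, alpha):
    """NP: add features in decreasing order of pi* while the expected number
    of false positives, the running sum of (1 - pi*), stays within alpha (eq. 3.13)."""
    selected, fp = set(), 0.0
    for f in sorted(pi_star, key=pi_star.get, reverse=True):
        if fp + (1 - pi_star[f]) > alpha:
            break
        selected.add(f)
        fp += 1 - pi_star[f]
    return selected

# Hypothetical marginal posteriors pi*(f) for four features.
scores = {"f1": 0.95, "f2": 0.80, "f3": 0.40, "f4": 0.10}
```

With these scores, MR at T = 0.5 and NP at α = 0.3 both return {f1, f2}, while CMNC with D = 3 forces f3 in as well, illustrating how the three rules share one ranking but differ in where they cut it.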

Optimal Bayesian Feature Filtering
In the general Bayesian model, MAP and CMAP require finding p(G|S) for all G ⊆ F, which is computationally prohibitive when |F| is large. Although MR (and thus MNC), CMNC and NP always reduce to ranking features by π*(f) with various methods of thresholding, finding π*(f) also requires evaluating p(G|S) for all G ⊆ F. In this section, we discuss how this problem is circumvented under independent Bayesian models.
Under independent Bayesian models, any method that ranks features by π*(f) (or equivalently h(f)) and selects top-ranking features is considered an OBF rule. While MAP and CMAP generally do not reduce to ranking π*(f), in independent Bayesian models MAP reduces to MNC and CMAP reduces to CMNC by (2.15) and (3.1), thus all selection criteria covered in Section 3 reduce to OBF rules. Furthermore, since π*(f) can be found separately for each feature under independent Bayesian models via (2.14) (for instance using (2.18) or (2.19) in the Gaussian case), all OBF rules reduce to filtering. The fact that optimal Bayesian feature selection reduces to filtering under independent models is not surprising, in light of similar results for Bayesian multiple comparison rules (Müller et al., 2006). By assuming independence we lose the ability to take advantage of correlations, but we greatly simplify optimal selection. Define a univariate filter on F to be a feature selection rule that ranks features by a scoring function h(f, S^f), which is a function of only the feature index f and the portion of the labeled data corresponding to f, and selects top-ranking features using some score thresholding rule, which is based on only the set of feature scores. t-tests with Benjamini and Hochberg (1995) multiple testing correction are univariate filters. Define a simple univariate filter on F to be a univariate filter that uses a constant threshold, i.e., a feature selection rule that reduces to the form G = {f ∈ F : h(f, S^f) > T}, where T is a constant. t-tests without multiple testing correction are simple univariate filters. By the following theorem, not only does optimal selection reduce to OBF under independent models, but optimal selection reduces to simple univariate filtering only under independent models, and the resulting filter must be equivalent to an OBF rule.
where π*(f|M), given in (2.14), depends only on f and S^f (note that S^f is comprised of S^f_0 and S^f_1, along with the labels). Thus, MR reduces to a simple univariate filter on F under both M and M′ for all T.
This contradicts the premise that MR is a simple univariate filter. Thus, for all triplets S•, S• and f such that , which is in general a function of S, cannot be written as a function of only S^{f_0}. Then there exists a pair of samples S• and S• such . By contradiction, P(f_0 ∈ Ḡ|S, M) can be written as a function of only S^{f_0}. Since f_0 is arbitrary, we must have that the marginal posterior for each feature can be expressed as π We now construct an independent Bayesian model, M′. The idea is to create auxiliary random variables for each f ∈ F that are independent from other features and yet sufficient to describe π*(f|M). Define P(f ∈ Ḡ|M′) = P(f ∈ Ḡ|M), define the data generation parameters φ^g_y = {H̄, θ^{H̄∪{g}}_y} for each g ∈ F, and define priors on a realization of H ⊆ F\{g} and θ^{H∪{g}}_y ∈ Θ^{H∪{g}}_y from M by, Similarly, for all b ∈ F, define φ^b = {H̄, θ^{H̄∪{b}}}, and define priors on H ⊆ F\{b} and θ^{H∪{b}} ∈ Θ^{H∪{b}} from M by, In this way, for each feature f ∈ F we merge the identity of features excluding f with the data generation parameters. Finally, we define the distributions p(x^g|φ^g_y, M′) = p(x^g|θ , and comparing p(S^g_y|g ∈ Ḡ, M′) and p(S^b|b ∈ B̄, M′) with counterparts in M, we have π*(f|M′) = π*(f|M).

Consistency
A key property of any estimator is consistency: as data are collected, will the estimator converge to the quantity it is to estimate? We are now interested in frequentist asymptotics, that is, the behavior of an estimator under a fixed set of good features, Ḡ, a fixed set of parameters, θ̄, and the corresponding sampling distribution.
Let S∞ denote a countably infinite labeled dataset, and let S_n denote the first n observations. In general, a sequence of estimators, θ̂_n(S_n) for n ≥ 1, of a parameter, θ̄, is said to be strongly consistent at θ̄ if

P( lim_{n→∞} d(θ̂_n(S_n), θ̄) = 0 ) = 1,

where convergence is understood with respect to a distance metric d, and this probability is taken with respect to the infinite sampling distribution on S∞ under some true data generation parameter, θ̄. For feature selection, we will use d(Ḡ, G) = I(Ḡ ≠ G). Under this metric, G_n → Ḡ if and only if G_n = Ḡ for all but finitely many n. The following theorem addresses the convergence of MR, CMNC and NP under any sequence of posteriors, p(G|S_n). The posteriors may be based on any general Bayesian model.
Theorem 2. Suppose that lim_{n→∞} p(Ḡ|S_n) = 1. Then G_MR → Ḡ for all 0 < T < 1, G_NP → Ḡ for all 0 < α < 1, and G_CMNC → Ḡ if D = |Ḡ|.

Proof. By (2.7), lim_{n→∞} p(Ḡ|S_n) = 1 implies π*(g) → 1 and π*(b) → 0 for all g ∈ Ḡ and b ∈ B̄. The consistency of MR and NP follows immediately for the ranges of T and α specified, and the consistency of CMNC follows if D = |Ḡ|.
By Theorem 2, if p(G|S_n) converges almost surely (a.s.), i.e., with probability 1, to a point mass at Ḡ, then MR (and thus MNC) and NP are strongly consistent, and CMNC is strongly consistent when constrained to select the correct number of features. In Section 5.1 we prove that p(G|S_n) converges almost surely for independent Gaussian models under very mild conditions; the data need not be independent or Gaussian.
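As an empirical illustration of this convergence claim (our own sketch, not an experiment from the paper), the following simulation scores one "good" and one "bad" feature on uniform, hence non-Gaussian, data using a normal-inverse-gamma marginal likelihood (Murphy, 2007); the hyper-parameter names and defaults are illustrative assumptions:

```python
import math
import random

def log_marglik(x, m0=0.0, k0=1.0, a0=1.0, b0=1.0):
    # Log marginal likelihood under an assumed normal-inverse-gamma prior (Murphy, 2007).
    n = len(x)
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    kn, an = k0 + n, a0 + n / 2.0
    bn = b0 + 0.5 * ss + k0 * n * (xbar - m0) ** 2 / (2.0 * kn)
    return (-0.5 * n * math.log(2 * math.pi) + 0.5 * math.log(k0 / kn)
            + a0 * math.log(b0) - an * math.log(bn)
            + math.lgamma(an) - math.lgamma(a0))

def pi_star(x0, x1, prior=0.5):
    # Marginal posterior that the feature is good: h/(1+h) with h from class-split
    # versus pooled marginal likelihoods.
    log_h = (math.log(prior / (1 - prior))
             + log_marglik(x0) + log_marglik(x1) - log_marglik(x0 + x1))
    return 1.0 / (1.0 + math.exp(-log_h))

random.seed(0)
n = 200  # points per class
# Good feature: uniform noise with a mean shift of 2 in class 1 (non-Gaussian data).
good0 = [random.uniform(0, 1) for _ in range(n)]
good1 = [2 + random.uniform(0, 1) for _ in range(n)]
# Bad feature: identical uniform distribution in both classes.
bad0 = [random.uniform(0, 1) for _ in range(n)]
bad1 = [random.uniform(0, 1) for _ in range(n)]

print(pi_star(good0, good1))  # approaches 1 as n grows
print(pi_star(bad0, bad1))    # approaches 0 as n grows
```

Re-running with increasing n shows the good feature's score climbing toward 1 and the bad feature's score drifting toward 0, consistent with the a.s. convergence established in Section 5.1 even though the model's Gaussianity assumption is violated.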

Convergence of p(G|S n ) Under Independent Gaussian Models
For fixed Ḡ, let F^Ḡ_∞ be the infinite sampling distribution on S∞. For fixed S_n, define ρ = n_0/n, c^f_y = s^{f*}_y/(n_y − 1) for all f ∈ F and y = 0, 1, and c^f = s^{f*}/(n − 1) for all f ∈ F. Throughout this section, we assume p(G|S_n) is calculated under an independent Gaussian model with proper or improper priors on p(θ^f_y) and p(θ^f), and (in a slight generalization) allow p(G) to be arbitrary. Allowing p(G) to be arbitrary, equations analogous to (2.12) and (2.18) are straightforward to derive. We write L_f as a function of n_0 and n_1 to emphasize that it may be allowed to change depending on the sample size. We assume all other inputs and hyper-parameters of the independent Gaussian model are constant across all sample sizes.
Definition 1. Ḡ is an independent unambiguous set of good features if, for each g ∈ Ḡ, µ^g_y and σ^g_y exist and are finite such that either µ^g_0 ≠ µ^g_1 or σ^g_0 ≠ σ^g_1, and for each b ∈ B̄, µ^b_y and σ^b_y exist and are finite such that µ^b_0 = µ^b_1 and σ^b_0 = σ^b_1.

Definition 2. S∞ is called a balanced sample if the labels of the sample points are such that lim inf_{n→∞} ρ > 0 and lim sup_{n→∞} ρ < 1, and, conditioned on the labels, sample points are independent with points belonging to the same class identically distributed.
The following theorem proves our desired result. Three lemmas used in the proof are provided in Section S1 of Supplementary Material A. The conditions assumed by the theorem are very mild. Condition (i) is based on Definition 1 and essentially says that Ḡ is really the feature set we wish to select, i.e., good features must truly have different means or variances between the classes, and bad features must truly have the same means and the same variances between the classes. Conditions (i) and (ii) require certain moments to exist, but there is no requirement for the data to be Gaussian or for features to be independent from each other. Condition (iii) is based on Definition 2 and addresses the sampling strategy; the assumptions are similar to those made by most finite sample data generation models for classification, with an additional requirement on the infinite sample that the proportion of points observed in either class must not converge to zero. Conditions (iv) and (v) place constraints on the inputs to OBF. Condition (iv) requires that OBF assign a non-zero prior probability to the feature set we ultimately wish to select, which is easily achieved by setting 0 < π(f) < 1 for all f ∈ F. Condition (v) is based on Definition 3 and addresses the possibility that one might input a different L_f for an improper prior to OBF depending on sample size. Condition (v) is always satisfied with p = 0 under proper priors, and under improper priors with L_f set to a positive constant across all sample sizes. By Theorems 2 and 3, under these conditions and posteriors computed based on the independent Gaussian model, we have that MR (and thus MNC and MAP) is strongly consistent, and CMNC (and thus CMAP) is strongly consistent when constrained to select the correct number of features.
The proof of Theorem 3 also characterizes the rate of convergence of the posterior. Under the conditions stated in the theorem, there exist R > 1 and N > 0 such that h(g) > R^n (a.s.) for all n > N and all good features g ∈ Ḡ. Equivalently, there exist 0 < r < 1 and N > 0 such that π*(g) > 1 − r^n (a.s.) for all n > N and all g ∈ Ḡ; thus the marginal posterior of good features converges to 1 at least exponentially (a.s.). Further, there exist c, N > 0 such that h(b) < n^{−c} (a.s.) for all n > N and all bad features b ∈ B̄. Equivalently, there exist c, N > 0 such that π*(b) < n^{−c} (a.s.) for all n > N and all b ∈ B̄; thus the marginal posterior of bad features converges to 0 at least polynomially (a.s.). Extending these facts to the full posterior on feature sets, there exist 0 < r < 1 and N > 0 such that p(G|S_n) < r^n (a.s.) for all n > N and all G missing at least one feature in Ḡ, and there exist c, N > 0 such that p(G|S_n) < n^{−c} (a.s.) for all n > N and all G ≠ Ḡ. More discussion on rates of convergence is provided in Section S6 of Supplementary Material A.
Theorem 3. Suppose the following are true: (i) Ḡ is an independent unambiguous set of good features, (ii) fourth order moments exist and are finite for all features b ∈ B̄, (iii) S∞ is a balanced sample, (iv) p(Ḡ) > 0, and (v) p(θ|G) is semi-proper. Then lim_{n→∞} p(Ḡ|S_n) = 1 (a.s.).

Proof. Let G ≠ Ḡ. If p(G) = 0, then (5.8) holds trivially. Thus, assume p(G) ≠ 0. Since p(θ|G) is semi-proper, by Lemma S1 in Supplementary Material A, there exist L_1 > 0 and q > 0 such that, as n → ∞ (a.s.), the asymptotic equivalence in (5.10) holds, where ∼ denotes asymptotic equivalence. Therefore, it suffices to show that for each g ∈ B ∩ Ḡ and each b ∈ G ∩ B̄ we have (5.11) and (5.12), respectively. First, we prove (5.11). Let g ∈ B ∩ Ḡ. Consider a fixed sample in which μ̂^g_y converges to µ^g_y and σ̂^g_y converges to σ^g_y for y = 0, 1. Since sample points in a class are independent and identically distributed with finite first and second order moments, this event occurs almost surely by the strong law of large numbers. By Lemma S2 in Supplementary Material A, there exist ε > 0 and L_2 > 0 such that (5.13) holds for n large enough. Since the limit of the right-hand side is zero, so is that of the left-hand side. Now we prove (5.12). Let b ∈ G ∩ B̄. Observe (5.14). Consider a fixed sample in which μ̂^b_y and μ̂^b are bounded and σ̂^b_y and σ̂^b converge to σ^b, which occurs almost surely. There exists L_4 > 0 such that (5.15) holds for n large enough. Similarly, there exist L_50, L_51 > 0 such that (5.16) holds for n large enough. From (5.15) and (5.16) we conclude there exists L_6 > 0 such that (5.17) holds for n large enough. Furthermore, as c^b and c^b_y converge, there exists L_7 > 0 such that (5.18) holds for n large enough. Therefore, for n large enough we may write (5.19). The following property of sample variance holds, provided that sample moments exist: (5.20). Let us consider the sample mean term in (5.20).
Recall that (5.15) through (5.19) and (5.21) hold when the sample means are bounded and the sample variances converge to σ^b. We now consider the rate of convergence of the means and variances. Suppose x_i, i = 1, . . . , n_0, are the values of feature b for points in class 0. Observe that the (x_i − μ^b)/√σ^b are independent random variables with zero mean and unit variance. By the law of the iterated logarithm (Kolmogorov, 1929), Hence for n large enough, Similarly, for n large enough, By the triangle inequality, for n large enough, Note that for all 0 < ρ < 1, Combining (5.21), (5.26), and (5.27), we see that for n large enough, Now, consider the variance terms in (5.20). We have another property of the sample variance: Under balanced sampling, ρn increases with n. Note that lim_{n→∞} ρn/(ρn − 1) = 1, thus ρn/(ρn − 1) < 2 for n large enough (a.s.). Also, 1/(ρn − 1) < (log log n)/(ρn) for n large enough (a.s.). In addition, we can use (5.24) to bound ( Since we assume fourth order (and thus lower order) moments of features in B̄ are finite, the variance of (x_i − μ^b)^2/σ^b is finite, and we call this variance K_0. Again applying the law of the iterated logarithm, Combining (5.29), (5.30), and (5.31) we conclude that for n large enough, Similarly, we can show there exists K_1 > 0 such that for n large enough, Now, observe that (5.34) Using (5.32) and (5.33), we can show that for n large enough, By Lemma S3 in Supplementary Material A, there exists r > 0 such that for all t ∈ (0, 1) and x ∈ (1 − r, 1 + r), (5.36) Using (5.34), (5.35), and (5.36), we see that for n large enough, where in the last inequality we have used (5.27). Combining (5.20), (5.28), and (5.37) we see that for n large enough, where in the last inequality we have used the fact that (1 + t/x)^x < e^t for all x, t > 0.
Since the limit of the right-hand side is 0 whenever q > 0, so is that of the left-hand side. Combining (5.19) and (5.38) we see that (5.12) holds almost surely.
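For reference, the law of the iterated logarithm invoked in the rate argument states that, for i.i.d. z_i with zero mean and unit variance,

```latex
\limsup_{n \to \infty} \frac{\left| \sum_{i=1}^{n} z_i \right|}{\sqrt{2\, n \log \log n}} = 1 \quad \text{(a.s.)}
```

so, taking z_i = (x_i − μ^b)/√(σ^b), the deviation of the sample mean of a bad feature from its true mean is O(√(log log n / n)) almost surely, which is the rate fed into the mean and variance bounds above.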

Performance and Consistency on Synthetic Data
Here we implement OBF and several other feature selection methods on synthetically generated microarray data. An application on real colon cancer microarray data is provided in Sections S2 and S3 of Supplementary Material A. Although OBF assumes all features are independent with Gaussian class-conditional distributions, the data generation model employed violates these assumptions by generating correlated and non-Gaussian features. Remarkably, OBF is still theoretically consistent by Theorems 2 and 3. Since the main contributions of this paper are theoretical, and numerous extensive simulation studies have already shown that OBF has competitive and robust performance (Foroughi pour and Dalton, 2017d, 2018a), our primary objective in this section is simply to observe whether OBF is indeed consistent, i.e., whether it eventually selects the correct feature set as sample size grows. Our secondary objective is to provide new examples showing that OBF enjoys competitive performance, running time, and memory consumption compared with popular Bayesian and non-Bayesian feature selection algorithms, including several methods that OBF has not been compared with before. The data is generated using a variant of the "synergetic" model originally proposed in Hua et al. (2009). For a fixed sample size n, in each iteration we assign an equal number of points to classes 0 and 1 (n is always even). We generate |F| = 20,000 features, including a random assignment of 20 global markers, 80 heterogeneous markers, 11,900 low-variance non-markers and 8,000 high-variance non-markers. Markers have distinct class-conditional distributions, non-markers have identical distributions in both classes, and heterogeneous markers and high-variance non-markers account for unknown subclasses in the data. Global markers, heterogeneous markers, and low-variance non-markers are randomly partitioned into blocks of size k = 5.
All features within a block are correlated, while all blocks of markers, all blocks of low-variance non-markers, and all high-variance non-markers are independent from each other. All features are also randomly assigned to one of four groups, i = 0, 1, 2, 3, such that each group contains one block of global markers, four blocks of heterogeneous markers, 595 blocks of low-variance non-markers and 2,000 high-variance non-markers.
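The block-correlation construction above can be sketched as follows. This is a minimal illustration, not the authors' generator: the function name, the equicorrelation ρ = 0.6, the mean shift, and the block count are our own illustrative choices.

```python
import numpy as np

def synthetic_block_data(n, n_blocks=4, k=5, rho=0.6, delta=1.0, seed=0):
    """Sketch of block-correlated two-class data: features within a block
    share pairwise correlation rho; marker blocks differ in mean by delta."""
    rng = np.random.default_rng(seed)
    # Equicorrelation covariance matrix for one block of k features.
    cov = rho * np.ones((k, k)) + (1.0 - rho) * np.eye(k)
    X, y = [], np.repeat([0, 1], n // 2)
    for _ in range(n_blocks):
        block = np.vstack([
            rng.multivariate_normal(np.zeros(k), cov, n // 2),          # class 0
            rng.multivariate_normal(delta * np.ones(k), cov, n // 2),   # class 1
        ])
        X.append(block)
    return np.hstack(X), y

X, y = synthetic_block_data(100)
print(X.shape)  # (100, 20): 4 blocks of k = 5 correlated features
```

Blocks are generated independently of each other, matching the independence between blocks described above.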
In addition to OBF, we implement: Welch's t-test (t-test), a moderated t-test from the limma package in R (Smyth, 2004) (Moderated t-test), the Bhattacharyya distance between Gaussian distributions with sample means and variances computed from each class (BD), the mutual information between features and class labels computed from a non-parametric entropy estimator based on sample spacings of order m = 1 (Beirlant et al., 1997) (MI), and a bolstered error estimate (Braga-Neto and Dougherty, 2004) under nearest mean classification (NMC). In each case, we output the D = 100 top ranked features. Note that these methods are all univariate filters.
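As an illustration of how a univariate filter such as BD scores each feature independently, the following sketch uses the closed-form Bhattacharyya distance between two univariate Gaussians fit by sample moments; the toy data and the choice of D are ours, not from the paper.

```python
import numpy as np

def bhattacharyya_gauss(x0, x1):
    """Bhattacharyya distance between Gaussians fit to two samples;
    it grows with differences in means and/or variances."""
    m0, m1 = x0.mean(), x1.mean()
    v0, v1 = x0.var(ddof=1), x1.var(ddof=1)
    return (0.25 * (m0 - m1) ** 2 / (v0 + v1)
            + 0.5 * np.log((v0 + v1) / (2.0 * np.sqrt(v0 * v1))))

def top_d_filter(X0, X1, D):
    """Score each feature independently and keep the D top-ranked."""
    scores = np.array([bhattacharyya_gauss(X0[:, j], X1[:, j])
                       for j in range(X0.shape[1])])
    return np.argsort(scores)[::-1][:D]

rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, (50, 6))
X1 = rng.normal(0.0, 1.0, (50, 6))
X1[:, 2] += 2.0   # mean-shift marker
X1[:, 4] *= 3.0   # variance-change marker, invisible to a t-test
top2 = top_d_filter(X0, X1, 2)
print(top2)
```

Note that the variance-only marker (index 4) is ranked alongside the mean-shift marker (index 2), which is the property the text attributes to OBF and BD.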
Finally, we implement three types of Bayesian variable selection methods: the univariate filter method SKBS, a regression method using a slab-and-spike prior and probit link (Lee et al., 2003) (BPM), and a regression method by Makalic and Schmidt (2016) (BayesReg). Due to the high computation cost of these methods, we run each on the top 300 features as ranked by BD, rather than on the full set of 20,000 features. We implement SKBS with K = 2 to K = 7 Gaussian mixture kernels. We observed best performance with K = 2 and report on only this case. Following the SKBS procedure, we find the marginal posterior probability of each feature having distributional differences using a Gibbs sampler with a burn-in period of 1,000 steps and a sampling period of 5,000 steps. We report the D = 100 features having largest marginal posteriors with ties broken by BD (CMNC-SKBS), and the set of all features with marginal posteriors greater than T = 0.9 (MR-SKBS). We also implemented T = 0.5 (the threshold of MNC) and T = 0.75, but observed best performance with T = 0.9. We implement BPM using default settings in the published code, except we initialize the MCMC chain with the top D = 100 features ranked by BD, forgo the burn-in period, and directly generate 5,000 samples. Similar to CMNC-SKBS, we then report the D = 100 features having largest marginal posteriors with ties broken by BD. We implement four variations of BayesReg using default settings in the published MATLAB code. Each variant corresponds to one combination of prior (L_1 or horseshoe) and link function (linear or logit). BayesReg outputs a t-statistic, and for each variant of BayesReg we report the D = 100 features with largest absolute t-statistic.
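The two reporting rules used above (top-D with ties broken by a secondary score, as in CMNC-SKBS, versus a posterior threshold T, as in MR-SKBS) can be sketched as follows, with illustrative posteriors rather than actual Gibbs output:

```python
import numpy as np

def select_top_d(post, tiebreak, D):
    """CMNC-style rule: report the D features with largest marginal
    posteriors, breaking ties by a secondary score (e.g. BD)."""
    order = np.lexsort((-tiebreak, -post))  # last key is the primary sort key
    return order[:D]

def select_by_threshold(post, T):
    """MR-style rule: report every feature whose marginal posterior
    exceeds the threshold T."""
    return np.flatnonzero(post > T)

post = np.array([0.95, 0.40, 0.95, 0.10, 0.99])  # illustrative posteriors
bd   = np.array([0.20, 0.05, 0.30, 0.01, 0.50])  # illustrative tie-break scores
print(select_top_d(post, bd, 3))      # the 0.95 tie is broken in favor of index 2
print(select_by_threshold(post, 0.9))
```

The threshold rule outputs a variable number of features, while the top-D rule always outputs exactly D, which is the distinction the performance discussion below turns on.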
This procedure is iterated 600 times for each n, where n increases from 50 to 1,000 in steps of 50. For each algorithm, reported features are labeled markers and unreported features are labeled non-markers. Figure 1(a) shows the average number of correctly labeled features over iterations with respect to n. For each n, Regularized-best reports the best performance observed among all 84 regularized regression methods, and BayesReg-best reports the best performance among all four BayesReg methods.
In general, the best performing algorithm is MNC-OBF-PP, which is followed by MNC-OBF-JP, CMNC-OBF-PP, then CMNC-OBF-JP and BD. OBF and BD perform well because they can detect differences between both means and variances (Foroughi pour and Dalton, 2018b). Observe that PP outperforms JP. In general, an informed prior like PP can have better performance than a non-informative prior like JP when assumptions are accurate, but may be less robust when assumptions are inaccurate. Also observe that MNC outperforms CMNC. In general, MNC outperforms CMNC when the sample size is small, and CMNC slightly outperforms MNC when the sample size is large. It may seem counterintuitive for MNC to outperform CMNC, since CMNC is directly informed with the true number of markers to select (via D) and MNC is not. However, MNC is given some information about the number of markers through π(f); recall that the expected number of good features given π(f) can be found in (2.8).
In addition, MNC outputs a variable number of features, and under small samples it can be beneficial to output a smaller feature set to avoid selecting features that one is uncertain about. Also note that CMNC-OBF-JP and BD make similar assumptions, and typically have very similar performance, as seen here.
Regularized-best and MR-SKBS appear to perform very well under small samples; however, these methods are the only ones besides MNC-OBF-PP and MNC-OBF-JP that output a variable number of features, and they perform very close to the trivial algorithm that outputs no features (which always labels 19,900 features correctly). CMNC-SKBS also performs fairly well under small samples, but drops below BPM at around n = 300 and below Moderated t-test at around n = 650. This may be due to insufficient sampling iterations of the Gibbs sampler, or an issue with selecting the number of kernels. Since SKBS models mixtures of Gaussians, it can detect differences between means and variances like OBF and BD, and potentially differences between higher order moments, but performance may be sensitive to the number of kernels used.
Under large samples, BD is followed by BPM, t-test, CMNC-SKBS, then NMC. BPM appears to perform close to BD under large samples because its MCMC chain is initialized with BD. Unlike OBF and BD, t-test and NMC struggle to detect features with similar means but different variances, which usually results in some loss in performance relative to BD, with NMC performing worse than t-test (Foroughi pour and Dalton, 2018b). BayesReg-best has comparable performance to Regularized-best and MR-SKBS, while MI has the poorest performance across all sample sizes. Although MI does not perform well here, as a non-parametric method it can detect any distributional differences, and it has been observed that MI can shine under large differences in skewness (Foroughi pour and Dalton, 2018b). All univariate filters (OBF, t-test, Moderated t-test, BD, MI and NMC) do not account for correlations between features, while all regression-based methods we implemented (Regularized-best, BPM and BayesReg) do account for correlations. Regression-based methods do not perform particularly well, except Regularized-best under small samples (where it tends to output very few features) and BPM under large samples (where performance tracks BD because the MCMC chain is initialized with BD). As discussed in Section 1, classification and regression based methods tend to miss weak features in the presence of strong features, and miss strong features that are correlated with other stronger features, because these features are not very useful in improving the predictive capacity of the model. See Section S4 of Supplementary Material A for more discussion on this. Table 1 lists the average running time and maximum memory requirement of several methods for n = 200 over 10 iterations. MNC-OBF-PP, CMNC-OBF-PP, MNC-OBF-JP and CMNC-OBF-JP have similar computation cost and are reported in the table as "OBF."
t-test and Moderated t-test have comparable computation cost and are averaged together in the table and reported as "t-test." "Regularized-best" reports the average computation cost over all 84 regularized regression models. We implement SKBS with K = 2 kernels, SKBS with K = 7 kernels, BPM, and the four earlier variants of BayesReg after filtering out all but the top 300 features with BD, and again after filtering out all but the top 5,000 features. "BayesReg-best" reports the average computation cost of all four variants of BayesReg. OBF is not only the best performing, but also stands among the fastest methods with low memory requirements. SKBS, BPM and BayesReg all have running times that are orders of magnitude higher than that of OBF and require several times the amount of memory, particularly when run on a larger number of features. Our code is vectorized, which tends to reduce running time at the cost of higher memory consumption.
We conclude this section with a simulation similar to that of Fig. 1(a), except we do not implement computationally intensive methods and we let sample size increase from 100 to 5,000 in steps of 100. Figure 1(b) plots the average number of correctly labeled features with respect to sample size. The curves for BD, t-test, Moderated t-test, and all methods based on OBF appear to converge to 20,000, which suggests that these methods are consistent under the current data model. It is also interesting that t-test becomes more competitive for very large sample sizes.

Conclusion
OBF should not be used in applications where the objective is dimensionality reduction to design a simpler model or avoid overfitting. Rather, it is designed for applications where all features that exhibit distributional differences between the classes should be ranked and reported. That being said, as a filter method, OBF cannot identify a feature that is itself indistinguishable between the classes, while being highly correlated with other features that do have distributional differences. Such features are of interest in biomarker discovery because: (1) they might be paired with other biomarkers to develop better tests for the biological condition of interest, and (2) strong correlations between genes or gene products suggest possible links in the underlying biological mechanisms, and understanding these links is an important part of the discovery process. Therefore, a major thrust of our future work is in developing models and methods that can take advantage of correlations. A few suboptimal methods have been proposed in prior works (Foroughi pour and Dalton, 2014, 2016a, 2017d, 2018a); however, more work is needed in identifying conditions under which these algorithms are consistent, and in understanding performance and robustness properties of these algorithms. Finally, note that the OBF framework makes it possible to conduct a Bayesian error analysis for feature selection, much like Bayesian error estimation in classification (Dalton and Dougherty, 2011a,b). For instance, one may find the probability p(G|S) = P(Ḡ = G|S) in (2.15) or the expected loss between G and Ḡ in (3.7) for an arbitrary feature set G, or find the ROC curve defined in (3.14) for an arbitrary feature selection rule. We plan to study Bayesian error analysis under the OBF framework in future work, and to develop and study methods of error analysis that also take into account correlations.

S1 Proof of Lemmas
Here we provide the three lemmas used by Theorem 3 of the main manuscript.
Lemma S2. Let g ∈ F be such that either μ_0^g ≠ μ_1^g or σ_0^g ≠ σ_1^g. In addition, let S_∞ be a fixed and balanced sample in which μ̂_y^g converges to μ_y^g and σ̂_y^g converges to σ_y^g. Then there exist ε > 0 and c > 0 such that for n large enough,

Proof. It suffices to show there exist ε, c > 0 such that (S1.9) and

(c_0^g)^{κ_0^g} (c_1^g)^{κ_1^g} / (c^g)^{κ^g} < c    (S1.10)

hold for all n large enough. Observe that

c_y^g = σ̂_y^g + s_y^g/(n_y − 1) + ν_y^g n_y (μ̂_y^g − m_y^g)^2 / ((ν_y^g + n_y)(n_y − 1)).    (S1.11)

The first term converges to σ_y^g, and the second and third terms converge to 0; thus c_y^g → σ_y^g. Similarly, for c^g we have (S1.12), in which the second term converges to 0 and, since μ̂^g = ρμ̂_0^g + (1 − ρ)μ̂_1^g, μ̂^g is bounded and the third term converges to 0. Also, the following property holds: (S1.13). Since σ̂_y^g converges, it is bounded. Hence lim (1 − ρ)σ̂_0^g/(n − 1) = 0 and lim ρσ̂_1^g/(n − 1) = 0. Combining all of this and applying properties of lim sup and lim inf, we have (S1.14).

We first show that (S1.9) holds if σ_0^g ≠ σ_1^g. Note that (S1.15), where ρ_1 = lim inf ρ and ρ_2 = lim sup ρ. Since we are maximizing a continuous function with respect to r, the maximum is attained at some r* ∈ [ρ_1, ρ_2]. Observe that for all 0 < r < 1 (in particular r*) and x, y > 0, rx + (1 − r)y ≥ x^r y^{1−r}, with equality if and only if x = y; this is shown by taking the logarithm of both sides and noting the concavity of the logarithm. Thus, the right-hand side of (S1.15) is strictly less than 1. Now suppose σ^g ≡ σ_0^g = σ_1^g and μ_0^g ≠ μ_1^g. We have,
The right-hand side is again strictly less than 1; therefore, (S1.9) also holds in this case for some ε > 0 and n large enough. Now for (S1.10), the required bound follows from (S1.13); thus (S1.10) holds for some c > 0 and n large enough.
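The weighted AM–GM inequality invoked in the proof of Lemma S2 can be verified in one line from the concavity of the logarithm: for 0 < r < 1 and x, y > 0,

```latex
\log\bigl(r x + (1-r) y\bigr) \;\ge\; r \log x + (1-r) \log y
\quad\Longrightarrow\quad
r x + (1-r) y \;\ge\; x^{r} y^{1-r},
```

with equality if and only if x = y, since the logarithm is strictly concave.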

S2 Application Using Colon Cancer Microarray Data
Here we apply t-test, BD, CMNC-OBF-JP, BPM, SKBS with K = 2, the Bayesian regression model of Makalic and Schmidt (2016) with logit link and L_1 penalty, denoted BayesReg(logit,L_1), and GLMs with logit link and L_1 and elastic net penalties to a colon cancer microarray dataset (Smith et al., 2010; Freeman et al., 2012) deposited on Gene Expression Omnibus (GEO) (Edgar et al., 2002) with accession number GSE17538, comprised of subseries GSE17536 and GSE17537. This dataset contains gene expression levels of 238 patients in different stages of colon cancer. We assign 28 stage 1 and adenoma patients to class 0 and the remaining 210 patients in stages 2-4 to class 1. The data has been normalized with Bioconductor's affy package using default settings. Our objective is to identify features that discriminate between early and late stages of cancer. This dataset uses the GPL570 platform containing 54,765 probes. Probes not mapping to any genes are removed, leaving 42,450 probes. These probes map to 21,049 distinct genes or gene families. Here we perform feature selection at the probe level. Afterwards, for each gene we identify the highest-ranking probe that maps to the gene, and we rank genes based on their associated probes. This downstream analysis will be explained in more detail later. Figure S1 plots the histogram of all patients and all probes used for feature selection.
We first study the application of penalized GLMs with logit link on this dataset. We use MATLAB's built-in lassoglm function to implement a logit model with LASSO and elastic net penalties. The elastic net penalty uses α = 0.5. We use two methods for model selection under the penalized GLMs: (1) 10-fold cross-validation (with 10 Monte Carlo repetitions) to select the penalty resulting in the highest prediction accuracy on the test data, and (2) the stability criterion of Meinshausen and Bühlmann (2010) to obtain a "marginal inclusion posterior" for each feature. The running time of penalized regression under these model selection schemes is discussed in Section S3.
For cross-validation, we consider the penalty family λ = 0.005, 0.01, . . . , 0.2. We observed that larger values of λ output very small feature sets and suffer large prediction error. Although penalty terms smaller than 0.005 output reasonably large feature sets, we again observed large prediction error; hence the interval [0.005, 0.2] was chosen. 10-fold cross-validation selects λ = 0.04 and λ = 0.055 for GLMs with LASSO and elastic net penalties, assigning non-zero regression coefficients to 36 and 83 probes, respectively. All selected probes for LASSO map to distinct genes, and the selected probes for elastic net map to 76 distinct genes. We also use these GLMs with fixed λ = 0.01 and use all of the data for training. This time, 84 and 177 probes mapping to 81 and 165 distinct genes are selected by the LASSO and elastic net penalties, respectively.
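A rough sketch of penalty selection by cross-validated accuracy follows. We substitute a simple proximal-gradient (ISTA) fit for MATLAB's lassoglm, and the data, penalty grid, fold count, and step sizes are illustrative only:

```python
import numpy as np

def lasso_logit_ista(X, y, lam, n_steps=500, lr=0.1):
    """Proximal gradient (ISTA) for L1-penalized logistic regression;
    a stand-in for a GLM-with-LASSO fit such as lassoglm."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w = w - lr * (X.T @ (p - y)) / len(y)          # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft threshold
    return w

def cv_select_penalty(X, y, lambdas, folds=10, seed=0):
    """Pick the penalty with the best cross-validated accuracy."""
    rng = np.random.default_rng(seed)
    splits = np.array_split(rng.permutation(len(y)), folds)
    best_lam, best_acc = None, -1.0
    for lam in lambdas:
        correct = 0
        for f in range(folds):
            test = splits[f]
            train = np.concatenate([splits[i] for i in range(folds) if i != f])
            w = lasso_logit_ista(X[train], y[train], lam)
            correct += np.sum((X[test] @ w > 0) == (y[test] == 1))
        acc = correct / len(y)
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(float)
lam = cv_select_penalty(X, y, lambdas=[0.01, 0.1, 1.0], folds=5)
print(lam)
```

With a heavy penalty all coefficients are thresholded to zero and accuracy collapses to chance, so the grid search settles on one of the lighter penalties.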
We now use the stability selection metric of Meinshausen and Bühlmann (2010). We iterate 100 times, as suggested in Meinshausen and Bühlmann (2010), and in each iteration we randomly subsample 90% of the points in each class. Although Meinshausen and Bühlmann (2010) suggest using half of the training data in each iteration to save on computation cost, we used 90% of the training data as suggested in He and Yu (2010) for small-sample high-dimensional biomarker discovery applications. We then compute max_λ Π̂_k^λ, the maximum probability of a feature (probe) k being selected by a regression model over all penalty terms, as described in Meinshausen and Bühlmann (2010). We then compare this probability with thresholds π_thr = 0.1, 0.25, 0.5, 0.75, 0.9. For the LASSO penalty these thresholds select 195, 98, 47, 19, and 7 probes mapping to 186, 93, 46, 19 and 7 different genes, respectively. For the elastic net penalty these thresholds select 471, 258, 133, 75 and 36 probes mapping to 440, 244, 122, 67, and 31 genes, respectively. Equation (9) in Meinshausen and Bühlmann (2010) provides an upper bound on the expected number of false discoveries for each selected set. Using this equation to bound FDR by 5%, LASSO and elastic net both use π_thr = 0.51 to select 47 and 133 probes mapping to 46 and 122 genes, respectively. However, with this threshold the upper bounds on FDR are 7.1 × 10^−4 and 8.9 × 10^−4, respectively. The bound of Meinshausen and Bühlmann (2010) is not applicable when π_thr < 0.5.
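The stability criterion can be sketched as follows. This simplified version subsamples rows without class stratification and uses a plain LASSO (squared loss) as the base selector, so it is illustrative rather than a reimplementation of the published procedure:

```python
import numpy as np

def lasso_ista(X, y, lam, n_steps=300, lr=0.01):
    """Plain ISTA for the LASSO (squared loss plus L1 penalty)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - lr * (X.T @ (X @ w - y)) / len(y)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

def stability_selection(X, y, lambdas, n_iter=50, frac=0.9, seed=0):
    """Selection frequency of each feature over random subsamples,
    maximized over the penalty grid: the quantity compared to pi_thr."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((len(lambdas), p))
    for _ in range(n_iter):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        for i, lam in enumerate(lambdas):
            w = lasso_ista(X[idx], y[idx], lam)
            freq[i] += np.abs(w) > 1e-8
    return (freq / n_iter).max(axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 10))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=80)  # only feature 0 is relevant
pi_hat = stability_selection(X, y, lambdas=[0.05, 0.5])
print(pi_hat[0])  # the relevant feature is selected in (nearly) every subsample
```

Features that are selected only sporadically across subsamples receive a low score and are dropped by any reasonable threshold, which is the robustness argument behind the criterion.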
Both cross-validation and the stability criterion output a relatively small number of features as markers. While many of these genes are high-profile biomarkers, we observe that many important biomarkers are missed by all GLMs. For example, all GLMs (including all λ's and π_thr's considered above) miss TTN, TP53, FBXW7, and CCNE1, which are verified colon cancer biomarkers (Fearon, 2011; Network, 2012), except for elastic net with π_thr = 0.1, which selects TP53. We will later see that these verified biomarkers rank high by CMNC-OBF-JP, BD, and BayesReg(logit,L_1). TP53 also ranks high by t-test. This is typical of many algorithms based on regression performance; different methods might select different feature sets (Saeys et al., 2007), and few features might be declared as biomarkers (Sima and Dougherty, 2006, 2008).

Now we study the application of other selection algorithms on this dataset. We use t-test, BD, CMNC-OBF-JP, BPM, SKBS, and BayesReg(logit,L_1) to rank genes. BPM, SKBS, and BayesReg(logit,L_1) are computationally intensive. Therefore, we implement a first phase of filtration by BD similar to Section 6 of the main manuscript. However, to avoid missing important biomarkers we use the top 5,000 BD probes, mapping to 4,017 different genes. CMNC-OBF-JP is implemented as in Section 6 of the main manuscript, except we set π(f) = 0.01 for all features f to obtain π*(f). We use π_S^*(f) to denote the SKBS marginal posterior of a feature f ∈ F. After probe ranks are obtained, among probes that have exactly the same associated gene list, we only keep the probe with the highest rank. Afterwards, we rank genes; thereby, a gene whose associated probe ranks higher in the probe ranking also ranks higher in the gene list. Table S1 contains a list of genes in this colon cancer dataset with reported or suggested involvement in colon cancer (Fearon, 2011; Network, 2012).
While Fearon (2011) and Network (2012) also contain other verified biomarkers, none of the probes in the GPL570 platform (the platform used in this dataset) map to these biomarkers. Hence, they were removed from Table S1. For each gene we also list the rank assigned by each method. SKBS, BPM and BayesReg(logit,L_1) only rank the top 15 genes in the table; the remaining genes do not pass the first phase of filtration by BD. Table S2 lists the number of these known markers that rank below N = 1,000, 2,000, and 5,000, and the associated p-values, for each method. These p-values are based on an over-representation test using a hypergeometric distribution. As these tables suggest, CMNC-OBF-JP and BD seem to provide a better feature ranking. Recall that SKBS, BPM, and BayesReg(logit,L_1) use a first phase of filtration using the top 5,000 probes selected by BD, which may affect their performance. SKBS assumes K = 2, which may be insufficient to model the data; indeed, a feature with strong differences might get a low π_S^*(f) when the kernels do not properly describe its distribution. While the authors of SKBS use cross-validation to select the number of kernels, (a) it is computationally intensive, (b) given the very small sample size in class 0, partitioning the data might affect performance, and (c) cross-validation may suggest a large K. In the synthetic simulations of Section 6 of the main manuscript, we observed that large values of K may overestimate π_S^*(f), which is not desirable. In Fig. S2, the x-axis is the number of selected features and the y-axis is the ratio of the number of selected markers over the total number of markers listed in Table S1. These curves help visualize Tables S1 and S2, and illustrate how OBF and BD provide better feature rankings.
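The over-representation p-value used here is the upper tail of a hypergeometric distribution. A self-contained sketch follows; the counts below are illustrative, not those of Table S2:

```python
from math import comb

def overrep_pvalue(M, K, N, k):
    """P(X >= k) for X ~ Hypergeometric: M genes total, K known markers,
    N top-ranked genes drawn, k known markers found among the top N."""
    denom = comb(M, N)
    return sum(comb(K, i) * comb(M - K, N - i)
               for i in range(k, min(K, N) + 1)) / denom

# Illustrative numbers: 21049 genes, 30 known markers,
# 12 of them ranked in the top 1000.
p = overrep_pvalue(21049, 30, 1000, 12)
print(p < 1e-6)  # far more markers in the top 1000 than chance predicts
```

Under the null, only about 30 × 1000/21049 ≈ 1.4 known markers would be expected in the top 1,000, so observing 12 yields a very small p-value.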
Tables S3 and S4 list the top 20 genes selected by CMNC-OBF-JP and t-test, respectively, as well as their associated π*(f), π_S^*(f), and t-test p-values. No FDR correction is done for t-test. As these two tables suggest, genes that rank high by t-test typically have large π*(f), but the converse is not true, i.e., top genes from OBF might have large p-values even when no FDR correction is performed. SKBS performs similarly to OBF and assigns large posteriors to top genes of OBF and t-test. Many of the top OBF genes are also verified colon cancer biomarkers. For example, GAGE genes are over-expressed in a subpopulation of colon cancer patients, and may promote cancer progression (Gjerstorff et al., 2015; Scanlan et al., 2004). Furthermore, CPNE4 (Shin et al., 2009), EPHA7 (Wang et al., 2005; Kim et al., 2010), LOC286297 (Hu et al., 2017), SLC2A2, also known as GLUT2 (Lambert et al., 2002), and MAGEA4 (Yamada et al., 2013) have also been shown or suggested to be involved in colon cancer. Our methodology favors selecting genes that discriminate between early and late stages of disease, thus some genes known to have strong links to colon cancer across all stages, for instance KRAS, may not necessarily rank high.
Histograms of several top OBF genes are provided in Fig. S3. They indeed have distributional differences between the two classes. GAGE genes are over-expressed in a small subpopulation of late-stage colon cancer patients (Figs. S3(a) and S3(b)), which is in line with the literature (Gjerstorff et al., 2015; Scanlan et al., 2004). Furthermore, we observe GPM6A is also over-expressed in some late-stage patients by comparing Figs. S3(c) and S3(d). While the expression of GPM6A is never above 5 among early-stage cancer patients, among 28.1% of late-stage patients its expression is above 5. GPM6A has been suggested to be a biomarker in colon and lung cancers (Camps et al., 2009; Hasan et al., 2015). SLC14A1 is an interesting biomarker. It has been suggested to be involved in colon cancer (van Erk et al., 2005; Popovici et al., 2012), bladder cancer (Garcia-Closas et al., 2011), and prostate cancer (Stamey et al., 2001). However, all of these studies suggest SLC14A1 is under-expressed in cancer. Comparing Figs. S3(e) and S3(f) we observe SLC14A1 is over-expressed in a subpopulation of late-stage colon cancer patients and under-expressed in another subpopulation. The probability that the expression of SLC14A1 is below 4 is 17.9% in early-stage patients and 22.4% among late-stage patients. In addition, the probability that the expression of SLC14A1 is above 5 is 0% and 8.6% among early- and late-stage patients, respectively. While we observe SLC14A1 is under-expressed in a subpopulation in accordance with the literature, we found no references justifying over-expression of SLC14A1 in another subpopulation. This is an interesting pattern observed in this dataset motivating further investigation. The net effect of all observed data is a slight under-expression of SLC14A1. The sample mean of early-stage and late-stage patients is 4.15 and 4.36, respectively. Hence, methods that only look at sample means, such as t-test, might miss this interesting gene.
Finally, comparing Figs. S3(g) and S3(h) we observe that MMP8 is over-expressed in a subpopulation of late-stage cancer patients, as suggested in De Sousa et al. (2013).
As the histograms suggest, these biomarkers are not Gaussian; however, OBF has been successful at identifying probes with strong distributional differences. These results suggest that OBF under JP might enjoy robustness with respect to its modeling assumptions. A detailed discussion of robustness and performance properties of OBF is provided in Foroughi pour and Dalton (2018b), and a discussion of potential robustness properties of hierarchical Bayesian models with non-informative priors is available in the literature.
We perform an enrichment analysis of the top 2,000 CMNC-OBF-JP genes using an over-representation test in PANTHER (Mi and Thomas, 2009; Mi et al., 2017). PANTHER recognized 321 genes in the gene list, and tests over 163 pathways. The top 10 pathways are listed in Table S5. Bounding FDR by 0.05, the top 4 pathways are significant. Meanwhile, many of the top PANTHER pathways, for instance, the cadherin signaling pathway (Avizienyte et al., 2002; Peña et al., 2005), the WNT signaling pathway (Bienz and Clevers, 2000), the plasminogen activating cascade (Ganesh et al., 1994; Baker and Leaper, 2003), and blood coagulation (Wojtukiewicz et al., 1989), have been shown to be involved or affected in colon cancer.

Table S1: Genes with known involvement in colon cancer and their rankings. [Excerpt; each gene is followed by its ranks under three of the methods, and dashes in the remaining columns indicate genes that did not pass the first phase of BD filtration for SKBS, BPM, and BayesReg(logit,L_1): BAX (4720, 5048, 2166), TGFBR2 (6130, 5796, 6804), BRAF (6159, 5915, 2680), CTNNB1 (6945, 7162, 14361), APC (6951, 6683, 8829), MYO1B (7919, 8008, 17026), MSH6 (7999, 8007, 3834), NRAS (8054, 7870, 8141), CDK8 (8454, 8273, 4790), CASP8 (8877, 8916, 3823), MAP7 (8939, 8795, 4847), PTPN12 (9056, 8933, 4631), ACVR2A (9130, 9057, 14557), FAM123B (10163, 10183, 10360), MIER3 (11637, 11911, 10526), MLH3 (14408, 14406, 8321), KIAA1804 (15606, 16195, 17573), SMAD2 (16641, 16776, 13170), MSH3 (17890, 17353, 13725), TCERG1 (18664, 19121, 16361).]

Table S4: Top 20 genes selected by t-test and their statistics

S3 Running Time of Penalized Regression

10-fold cross-validation partitions the data into 10 folds, uses 9 folds for training and the remaining fold for testing, and considers all 10 combinations of leaving 1 fold out for testing. This process is performed several times; we implement 10 Monte Carlo repetitions in the colon cancer example. For a penalized regression model, this is done for all candidate penalty terms, and the penalty that results in best overall prediction is selected. Let T be the average time to train the penalized regression model for one penalty term. Note that 10-fold cross-validation uses 90% of data for training.
The computational complexity of regularized regression models depends on the training algorithm as well as the data (sample size and number of predictors). In general, it can be O(n) or O(n^2) depending on the settings used. A more detailed discussion on the computational complexity of penalized GLMs can be found in Minka (2003) and Hastie et al. (2015). Assuming linear and quadratic complexity with respect to sample size, the running time of cross-validation for one penalty value is approximately 10 × 0.9 × M × T = 9MT and 10 × 0.9^2 × M × T = 8.1MT, respectively, where M is the number of Monte Carlo repetitions. Either way, the running time of cross-validation is > 8MT. Assuming L candidate penalties, we expect the total running time of cross-validation to be more than 8LMT. While T might be reasonable, 8LMT might be large. Now we discuss the computational complexity of the stability selection method of Meinshausen and Bühlmann (2010). Recall that we generate 100 subsamples of the data, where each subsample uses 90% of the points in each class. Following a similar argument to cross-validation, the running time of stability selection should be more than 80LT.
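The running-time accounting above can be captured in a few lines (T is in arbitrary units; "quadratic" refers to the assumed complexity in training-set size):

```python
def cv_runtime(L, M, T, quadratic=False):
    """Rough running-time model for 10-fold CV over L penalties with M
    Monte Carlo repetitions: 10 folds x M repeats x L penalties, each fit
    on 90% of the data, scaled by the assumed training-size exponent."""
    scale = 0.9 ** 2 if quadratic else 0.9
    return 10 * scale * M * L * T

def stability_runtime(L, T, n_subsamples=100, quadratic=False):
    """Same accounting for stability selection: n_subsamples fits per
    penalty, each on 90% of the data."""
    scale = 0.9 ** 2 if quadratic else 0.9
    return n_subsamples * scale * L * T

# With L = 40 penalties, M = 10 repetitions, and T = 1 (arbitrary units):
print(cv_runtime(40, 10, 1.0))     # 9MT per penalty -> 9 * 10 * 40 = 3600.0
print(stability_runtime(40, 1.0))  # 90T per penalty -> 90 * 40 = 3600.0
```

Both totals scale linearly in the number of candidate penalties L, which is why the text warns that even a reasonable per-fit time T can yield a large overall cost.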
For the colon cancer data, using the penalty term λ = 0.01, LASSO and elastic net took about 25.8 and 29.2 seconds to run, respectively, and for λ = 0.1 LASSO and

Figure S2: Proportion of selected known colon cancer markers listed in Table S1

S4 Discussion on Regression and Classification Objectives for Feature Selection
Variable selection has been extensively studied for the purpose of regression (Xu and Ghosh, 2015; Baragatti, 2011), and has been discussed in detail in the main manuscript. Here we provide several examples of the types of features such methods can detect, what can be expected of them in small-sample high-dimensional biomarker discovery settings, and the observations made in the bioinformatics community regarding the applicability of such models for biomarker discovery. In particular, we focus on why regression and classification based objectives may not be suitable for biomarker discovery applications, where one desires to find all features with distributional differences.
A classical method such as the t-test can detect differences in means, but cannot detect differences in variances. For example, if a feature has densities similar to Fig. S4(a) in the two classes, the t-test cannot detect it. The SLC14A1 gene studied in the real data example is an example of such a gene. In contrast, methods based on regression and classification objectives can detect this mode of distributional difference. Furthermore, they can detect other second order distributional differences, such as the one in Fig. S4(b), where each color denotes the joint distribution between two features in a given class. Neither the t-test nor OBF can detect such biomarkers, but feature selection methods proposed in the authors' previous work, such as 2MNC-Robust and POFAC (Foroughi pour and Dalton, 2018a, 2017), are specifically designed to address this issue. 2MNC-Robust and POFAC are more computationally intensive than OBF, but typically not as intensive as methods based on regression and classification. In the main manuscript we have focused on consistency properties of OBF due to its simplicity and its closed form solution for π * (f ). This also lays the foundation for consistency analysis of POFAC and 2MNC-Robust, which will be discussed in future work.
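The first point can be illustrated with a minimal sketch on synthetic data mimicking Fig. S4(a): two classes with equal means but different variances leave the t statistic near zero while the variance ratio is large. The helper `welch_t` and the data are our own illustrative constructions:

```python
import random
import statistics

random.seed(0)
# Synthetic feature: equal means, unequal variances across the two classes.
class0 = [random.gauss(0.0, 1.0) for _ in range(200)]
class1 = [random.gauss(0.0, 3.0) for _ in range(200)]

def welch_t(x, y):
    """Welch's t statistic for a difference in means."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / (vx / len(x) + vy / len(y)) ** 0.5

print(abs(welch_t(class0, class1)))                               # small: no mean shift to detect
print(statistics.variance(class1) / statistics.variance(class0))  # large: clear variance difference
```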
Suppose two features have class-conditioned densities similar to Fig. S4(c). Both the t-test and OBF can easily detect these two biomarkers, thanks to the differences in their means, but objectives based on regression and classification performance struggle. The two features are highly correlated, and the dependencies are similar in both classes, which is a typical co-expression setting in "omics" studies. In this case, given the value of one feature, say feature 1, one can easily and accurately estimate the value of the other (feature 2 in this example). Therefore, a classifier considering both features does not improve performance much relative to a classifier using only one of the two. Now, given some fixed observed data, feature selection schemes based on different classification rules, or even simply different permutations of the same data, may end up selecting different features, e.g., one pipeline may report only feature 1 while another may report only feature 2. Furthermore, suppose another study is performed and another sample is observed. Due to many factors, for instance noise and experimental conditions, a classifier might select feature 1 given the first sample, and feature 2 given the second. As a result, reported feature sets are not reproducible. These are undesirable properties of classification based feature selection algorithms in biomarker discovery applications, which have been discussed in many reviews (e.g., Saeys et al., 2007). Furthermore, (a) the peaking phenomenon may result in reporting very few biomarkers (Sima and Dougherty, 2008), and (b) classification error estimates may suggest sets far from the set with minimal Bayes error (Sima and Dougherty, 2006). In the real data example, we observed that GLMs with LASSO and elastic net penalties miss many important biomarkers, although they may achieve good prediction performance.
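The redundancy argument can be checked numerically. The sketch below uses synthetic data and our own hypothetical helpers (`sample`, `nearest_centroid_error`), with a nearest-centroid rule standing in for a generic classifier: adding the second, highly correlated feature barely changes the error.

```python
import random

random.seed(2)

def sample(cls):
    # Mimicking Fig. S4(c): feature 1 carries the class mean shift, and
    # feature 2 is feature 1 plus small noise, so the two features are
    # highly correlated with the same dependence structure in both classes.
    f1 = random.gauss(2.0 if cls == 1 else 0.0, 1.0)
    f2 = f1 + random.gauss(0.0, 0.2)
    return [f1, f2], cls

data = [sample(c) for _ in range(5000) for c in (0, 1)]

def nearest_centroid_error(data, idx):
    # Resubstitution error of a nearest-centroid classifier restricted to
    # the feature indices in idx.
    cents = {}
    for c in (0, 1):
        pts = [x for x, y in data if y == c]
        cents[c] = [sum(x[i] for x in pts) / len(pts) for i in idx]
    errors = 0
    for x, y in data:
        d = {c: sum((x[i] - cents[c][j]) ** 2 for j, i in enumerate(idx)) for c in (0, 1)}
        errors += min(d, key=d.get) != y
    return errors / len(data)

e1 = nearest_centroid_error(data, [0])      # feature 1 alone
e12 = nearest_centroid_error(data, [0, 1])  # both features
print(e1, e12)  # nearly identical error rates
```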

S5 Justification for Using Improper Priors
Here we show that π * (f ) under an improper prior is the limit of a sequence of proper priors, which are constructed by truncating the improper prior. Recall that for each feature f ∈ F , p(θ f 0 ), p(θ f 1 ), and p(θ f ) are the priors on θ f 0 assuming f ∈ Ḡ, on θ f 1 assuming f ∈ Ḡ, and on θ f assuming f ∈ B, respectively. Note that p(θ f y ) is described by hyperparameters s f y , κ f y , ν f y , and m f y , and relative weights A f y and B f y . Similarly, p(θ f ) is described by hyperparameters s f , κ f , ν f , and m f , and relative weights A f and B f . We assume these hyperparameters result in improper normal-inverse-Wishart priors, i.e., their integrals are infinite. In particular, we assume the integrals of p(σ f y ), p(µ f y |σ f y ), p(σ f ), and p(µ f |σ f ) are all infinite. A similar demonstration can be made for improper priors where any combination of p(σ f y ), p(µ f y |σ f y ), p(σ f ), and p(µ f |σ f ) is improper.
We first describe the truncation process of p(θ f y ); the truncation process of p(θ f ) is similar. Recall that θ f y = [µ f y , σ f y ]. Consider the proper prior p Ky,My (θ f y ) defined in (S5.1) and (S5.2), where K y is a positive constant and M y (σ f y ) is a positive function of σ f y . In general, K y and M y (σ f y ) may depend on the feature f as well, i.e., we should write K f y and M f y (σ f y ); however, we drop the superscript f here to avoid cluttered notation. Let U f y be the normalization constant of p Ky,My (σ f y ), and V f y (σ f y ) be the normalization constant of p Ky,My (µ f y |σ f y ). In general, V f y may depend on σ f y , and we have explicitly included this dependence for completeness; however, we will later choose M y (σ f y ) such that V f y does not depend on σ f y . Now consider the proper prior p K,M (θ f ) defined in (S5.3) and (S5.4). Again, M may depend on σ f , and we use K and M (σ f ) instead of K f and M f (σ f ) to avoid cluttered notation. Let U f be the normalization constant of p K,M (σ f ), and V f (σ f ) be the normalization constant of p K,M (µ f |σ f ).
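As a simple numerical illustration of the truncation idea (using the improper prior p(σ) ∝ 1/σ rather than the paper's normal-inverse-Wishart family, and our own hypothetical truncation region [1/K, K]): truncating to a finite region yields a proper prior whose normalization constant diverges as the region grows.

```python
import math

def normalizer(K):
    # Integral of the un-normalized truncated prior p(s) = 1/s over [1/K, K]:
    # log(K) - log(1/K) = 2 * log(K).
    return math.log(K) - math.log(1.0 / K)

# The truncated prior is proper for every finite K, but its normalization
# constant grows without bound as the truncation radius increases.
for K in (10.0, 1e3, 1e6):
    print(K, normalizer(K))
```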
Let A(r) : [0, ∞) → [0, ∞) be a strictly increasing function such that A(0) = 0 and lim r→∞ A(r) = ∞. Now, for each r > 0, choose K y , M y (σ f y ), K, and M (σ f ) according to (S5.5) through (S5.8). Note that M y (σ f y ) and M (σ f ) are chosen such that V f y and V f do not depend on σ f y and σ f , respectively. Satisfying (S5.5) through (S5.8) is doable since p(σ f y ), p(µ f y |σ f y ), p(σ f ), and p(µ f |σ f ) are all improper. For each r > 0, let h r (f ) be defined as in (S5.9). In order to compute the integrals in (S5.9), we first integrate with respect to the means, and then the variances, using (S5.1), (S5.2), (S5.3), and (S5.4) to explicitly write out the priors p Ky,My (θ f y ) and p K,M (θ f ). Observe that A(r) → ∞ as r → ∞, and hence K y , M y , K, M → ∞ as r → ∞. If the corresponding limits exist, then using the Monotone Convergence Theorem (MCT) we have that h r (f ) converges to h(f ) as r → ∞, where h(f ) is computed using (2.18) of the main manuscript. Here, we have chosen the radii of the truncated priors so that h r (f ) converges to the desired value as r → ∞. This is not the only solution; indeed, any functional form for the normalization constants U f y , V f y , U f and V f with the appropriate limiting behavior would guarantee that (S5.11) holds. In addition, under improper priors L f is an arbitrary constant set by the user, and for any L f there exists a sequence of proper priors for which (S5.11) holds. It is even possible to design the radii of the truncated priors such that h r (f ) → 0 (π * (f ) → 0) or h r (f ) → ∞ (π * (f ) → 1) as r → ∞ for a fixed sample, S. While the current analysis does not help in justifying the use of any particular value for L f , it does give an interpretation for L f . In particular, from (S5.5) through (S5.8) we see that the constants A f y , B f y , A f and B f that make up L f control the relative normalization constants needed in the truncated improper priors, which in turn control the relative rate that the radii of the truncated priors increase.
These radii must increase at rates such that the ratio of the area under the un-normalized truncated p K,M (θ f ) to the product of the areas under the un-normalized truncated p K0,M0 (θ f 0 ) and p K1,M1 (θ f 1 ) equals (or converges to) the constant L f .

S6 A Note on the Jeffreys-Lindley Paradox
The Jeffreys-Lindley paradox is encountered when Bayesian methods using improper priors and frequentist approaches yield statistics that motivate different actions (accepting or rejecting the null hypothesis) for some observed data. This is a major concern for using improper priors in practice, and has been an active topic of debate for the past three decades (Robert, 1993, 2014; Berger and Sellke, 1987; Spanos, 2013). Following a classical example from Robert (1993) and Berger and Sellke (1987) for illustration of this paradox, we consider testing the mean of a Gaussian population with unit variance. Assuming the population follows the density N (θ, 1), the null (H 0 ) is θ = θ 0 and the alternative (H 1 ) is θ ≠ θ 0 . Suppose the prior probability of H 0 is p 0 , and consider a Gaussian prior on θ under H 1 , i.e., p(θ) = N (0, σ), where σ is the variance of the Gaussian prior. For θ 0 = 0, and given an observation x from the population, the posterior probability of H 0 is given in Robert (1993), and goes to one as σ goes to infinity for any given x and p 0 . As σ goes to infinity, the prior on θ under H 1 assigns less weight to each fixed neighborhood of θ 0 . Thereby, the alternative hypothesis has smaller posterior probability for each fixed θ 0 and x as σ goes to infinity. Larger values of σ correspond to less-informative priors; thus a noninformative setup seems to produce meaningless results here. Further discussion of this paradox for point null hypothesis tests, and possible remedies, is provided in Robert (1993) and Robert (2014). One approach to this problem is to avoid non-informative priors. This may be feasible if, for example, θ 0 has been chosen by the experimenter because it has some special meaning for the problem at hand. Alternatively, one may allow p 0 to depend on σ so that the posterior on H 0 does not converge to extreme values as σ goes to infinity.
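This limit is easy to check numerically. In the sketch below (function names are ours), the Bayes factor follows from the marginal x ~ N(0, 1 + σ) under H 1 when x | θ ~ N(θ, 1) and θ ~ N(0, σ):

```python
import math

def posterior_h0(x, p0, sigma):
    # Bayes factor B01 = N(x; 0, 1) / N(x; 0, 1 + sigma), since the marginal
    # of x under H1 is N(0, 1 + sigma).
    b01 = math.sqrt(1.0 + sigma) * math.exp(-x * x * sigma / (2.0 * (1.0 + sigma)))
    # Posterior of H0: p0 * B01 / (p0 * B01 + 1 - p0).
    return 1.0 / (1.0 + (1.0 - p0) / (p0 * b01))

# Even for x = 2 (significant at the 5% level for a frequentist z-test),
# the posterior of H0 approaches 1 as the prior variance sigma grows.
for sigma in (1.0, 1e2, 1e4, 1e6):
    print(sigma, posterior_h0(2.0, 0.5, sigma))
```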
In general, one must take care when using non-informative priors, especially when setting normalization constants associated with improper prior densities. However, there is currently no universal agreement on precisely how this should be done.
Here we consider sequences of proper priors with increasing variances for our feature selection problem, and study how they affect π * (f ). In other words, we study how the choice of p(θ f 0 ), p(θ f 1 ), and p(θ f ) affects π * (f ) as we make them less informative, i.e., increase their variance and make them flatter. Suppose p(θ f y ) follows a proper normal-inverse-Wishart prior with hyperparameters s f y , κ f y , m f y , and ν f y , and p(θ f ) follows a proper normal-inverse-Wishart prior with hyperparameters s f , κ f , m f , and ν f . Furthermore, fix a sample S with n 0 , n 1 > 1, and recall the expression for h(f ) in (2.18) of the main manuscript. Assume s f y , κ f y , m f y , s f , κ f , and m f are all fixed, and vary ν f y and ν f . Observe that if ν f 0 = ν f 1 = ν f = ν, then lim ν→0 h(f ) = 0 and hence lim ν→0 π * (f ) = 0.
Depending on how ν f 0 , ν f 1 and ν f go to zero relative to each other, different behaviors may occur. Such behaviors of likelihood ratios are discussed in Villa and Walker (2017). This is in contrast to the classical example provided above, where no matter how we increase the variance of the prior on θ, the posterior probability of H 0 goes to 1. Here, in order to avoid h(f ) → 0 or ∞, we should select the ν's such that ν f 0 ν f 1 /ν f (which is a component of L f ) approaches a positive constant in the limit. Assuming f is a good feature we have two degrees of freedom for choosing the hyperparameters, i.e., ν f 0 and ν f 1 , and assuming f is bad we have only one hyperparameter to tune, i.e., ν f . Heuristically speaking, since under the assumption that f is a good feature we have two degrees of freedom, and under the assumption that f is a bad feature we have only one, we should properly select ν f relative to ν f 0 and ν f 1 to promote the same amount of "uncertainty" in the priors and avoid π * (f ) → 0 or 1. Note that for ν f 0 = ν f 1 = ν f = 0 we do not need to specify m f 0 , m f 1 , and m f . Now consider the case where κ f y and κ f go to zero, and s f y , s f , m f y , m f , ν f y , and ν f are fixed. In this case, to avoid h(f ) converging to zero or infinity, we require a sequence of κ's such that Γ(0.5κ f )/(Γ(0.5κ f 0 )Γ(0.5κ f 1 )) converges to a positive constant in the limit. It is well known (e.g., via the Laurent expansion at zero) that Γ(x) is asymptotically equal to 1/x as x goes to zero. Thus, we equivalently require κ f 0 κ f 1 /κ f to converge to a positive constant. This is similar to the situation above where we let the ν's go to zero. For instance, setting κ f 0 = c 0 κ, κ f 1 = c 1 κ, and κ f = c b κ for some c 0 , c 1 , c b ∈ (0, ∞) and letting κ go to zero yields h(f ) → 0 or ∞ for the observed sample S.
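The Γ asymptotics can be verified numerically. In this minimal sketch, `gamma_ratio` and the scaling κ f = κ f 0 κ f 1 are our illustrative choices: xΓ(x) → 1 as x → 0, and choosing κ f proportional to κ f 0 κ f 1 keeps the gamma ratio stable in the limit.

```python
import math

# Gamma(x) ~ 1/x as x -> 0, so x * Gamma(x) -> 1.
for x in (1e-1, 1e-3, 1e-5):
    print(x, x * math.gamma(x))

def gamma_ratio(k0, k1, kf):
    # The factor Gamma(0.5*kf) / (Gamma(0.5*k0) * Gamma(0.5*k1)) from the text;
    # near zero it behaves like 0.5 * k0 * k1 / kf.
    return math.gamma(0.5 * kf) / (math.gamma(0.5 * k0) * math.gamma(0.5 * k1))

# With kf = k0 * k1 (so k0 * k1 / kf = 1), the ratio approaches 0.5:
for k in (1e-2, 1e-3, 1e-4):
    print(k, gamma_ratio(k, k, k * k))
```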
In Section S5 we considered sequences of proper priors built by truncating improper priors. Similarly, here we have considered sequences of proper priors where we let the ν's and κ's go to zero. In all cases, π * (f ) converges to a positive constant only when the sequences of parameters being tweaked (the radii of truncated priors, the ν's, and the κ's) are chosen carefully in combination. The critical issue always boils down to how L f should be selected. While these analyses help with setting L f in practice, currently it must be chosen subjectively by the user under improper priors, and the choice of L f remains a topic for future work. The consistency proof in Section 5 of the main manuscript offers some reassurance that the data will eventually win out if L f is selected poorly. Perhaps a natural choice for a non-informative prior is to set all of the constants c 0 , c 1 , and c b above to one for both the ν and κ sequences, which results in L f = (2π) −0.5 (canceling the (2π) 0.5 in (2.18) of the main manuscript). This is close to the value L f = 0.1 used in the simulations for Section 6 of the main manuscript and in the real data analysis performed in Section S2.
Now assume L f is such that the non-informative prior is semi-proper by Definition 3 of the main manuscript. Using Lemma S1 in Section S1 and (2.18) of the main manuscript, we obtain a bound on h(f ) under JP that holds for n large enough for some c > 0. Assuming f is an independent unambiguous bad feature, using (5.16) and (5.35) of the main manuscript, we have for n large enough, with probability 1, that h(f ) < L b (log n) c /n for some L b , c > 0. Hence, for a bad feature we expect h(f ) (which is always positive) to decay at least as fast as (log n) c /n for some c > 0 as n goes to infinity. For independent unambiguous good features, by Lemma S2 in Section S1 we have the lower bound on h(f ) in (S6.3), where L g > 0 and R > 1. The right-hand side of (S6.3) grows at least exponentially fast. Therefore, reasonable values of L f should give satisfactory performance. Note that extremely large values of L f result in large π * (f )'s and hence more false alarms under the MR objective, and extremely small values of L f result in missing more features under MR.
Another reasonable choice for L f is √2π n −1 , which is again semi-proper with p = −1. For a bad feature f , using (S6.3) and (S6.4), we have for n large enough, with probability 1, that π * (f ) < (log n) c /n 2 for some c > 0. Thus, not only does π * (f ) → 0 with probability 1 as the sample size increases when f is bad, but also Σ ∞ n=1 π * (f ) < ∞. Note that L f = √2π n −1 has the slowest polynomial decay with respect to n (assuming the power of n is an integer) while satisfying Σ ∞ n=1 π * (f ) < ∞ for bad features. Furthermore, h(f ) still grows exponentially fast for good features.
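The summability distinction can be checked numerically (taking c = 1 for illustration): partial sums of (log n)/n^2 stabilize, while those of (log n)/n keep growing.

```python
import math

def partial_sum(p, N):
    # Partial sum of (log n) / n**p over n = 2..N.
    return sum(math.log(n) / n ** p for n in range(2, N + 1))

# p = 2: the series converges, so successive partial sums barely move.
# p = 1: the series diverges like (log N)^2 / 2.
print(partial_sum(2, 10**4), partial_sum(2, 10**5))
print(partial_sum(1, 10**4), partial_sum(1, 10**5))
```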
Finally, when CMNC is used and L f is independent of the feature index f , the exact value of L f does not matter; however, the choice of D, i.e., the number of features to select, is then determined by the user. It is desirable to use improper priors when little or no reliable information is available, or when a proper normal-inverse-Wishart prior might not adequately describe the prior information. The robust performance of OBF under improper priors is studied in detail in Foroughi pour and Dalton (2018b).
We close by showing that OBF-JP provides a feature ranking equivalent to that of a frequentist statistic used to test equality of two Gaussian populations. Under JP,

h(f ) = [π(f )/(1 − π(f ))] L f [2π(n 0 + n 1 )/(n 0 n 1 )] 0.5 × [Γ(0.5n 0 )Γ(0.5n 1 )/Γ(0.5(n 0 + n 1 ))] × (n − 1) 0.5n /[(n 0 − 1) 0.5n0 (n 1 − 1) 0.5n1 ] × · · ·

Posing the feature selection problem as a hypothesis test for feature f ∈ F , under the null (H 0 ) we have θ f 0 = θ f 1 , i.e., f is a bad feature, and under the alternative (H 1 ) we have θ f 0 ≠ θ f 1 , i.e., f is a good feature. Under this setup, the test is not a point null hypothesis test, as θ f 0 and θ f 1 are not fixed to a specific value; however, whatever they are, they must be equal under the null. A classical hypothesis test for comparing two Gaussian populations suggests using the test statistic λ n0,n1 (f ) given in (S6.8), where σ̂ f y and σ̂ f are biased variance estimates of f in class y and in both classes considered together, which divide the sum of squares by n y and n, respectively. Given λ n0,n1 (f ), p-values can be found using the method of Zhang et al. (2012). Therefore, if π(f ) and L f do not depend on the feature index f , using (S6.6), (S6.8), and (S6.9) we see h(f ) ∝ 1/λ n0,n1 (f ), and hence OBF-JP and the p-values computed using λ n0,n1 (f ) provide the same feature ranking for a given sample S.
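The final ranking claim is straightforward to sketch. The λ values and proportionality constant below are hypothetical: if h(f ) = c/λ(f ) for a positive constant c, sorting by decreasing h is identical to sorting by increasing λ.

```python
# Hypothetical lambda statistics for three features.
lam = {"f1": 0.8, "f2": 0.1, "f3": 0.4}
c = 2.5  # any positive proportionality constant

h = {f: c / v for f, v in lam.items()}      # h(f) proportional to 1/lambda(f)
by_h = sorted(h, key=h.get, reverse=True)   # rank by decreasing h
by_lam = sorted(lam, key=lam.get)           # rank by increasing lambda
print(by_h, by_lam)  # identical orderings
```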