On sufficient variable screening using log odds ratio filter

Abstract: For ultrahigh-dimensional data, variable screening is an important step to reduce the scale of the problem and, hence, to improve estimation accuracy and efficiency. In this paper, we propose a new dependence measure, called the log odds ratio statistic, to be used under the sufficient variable screening framework. The sufficient variable screening approach ensures the sufficiency of the selected input features in modeling the regression function and is an enhancement of existing marginal screening methods. In addition, we propose an ensemble variable screening approach that combines the proposed fused log odds ratio filter with the fused Kolmogorov filter to achieve superior performance by taking advantage of both filters. We establish the sure screening properties of the fused log odds ratio filter for both marginal variable screening and sufficient variable screening. Extensive simulations and a real data analysis are provided to demonstrate the usefulness of the proposed log odds ratio filter and the sufficient variable screening procedure.


Introduction
Ultrahigh-dimensional data have emerged recently in many areas of modern scientific research, including microarray, genomic, proteomic, brain imaging, and genetic data. There are known challenges, such as scalability, noise accumulation, high collinearity, and spurious correlation, in analyzing ultrahigh-dimensional data [10,7,12]. To improve scalability and reduce noise accumulation, one possible approach is to first reduce the dimensionality of the feature space from a very large scale to a moderate one using a screening procedure, and then implement learning algorithms or make inferences based on the much reduced feature space. By doing so, one not only can drastically speed up the computation, but also can significantly improve the estimation accuracy when the dimensionality of the data is ultrahigh.
Under the assumption that only a small number of variables, usually referred to as active features, among all observed input features contribute to the response variable, [10] propose the sure independence screening (SIS) method to identify a subset of features that contains the active features. The SIS method is based on the marginal Pearson correlation between an individual feature and the response variable and is designed for the linear regression model, under which both the response variable and the input features follow the Gaussian distribution. Along this direction, many model-based screening procedures have been proposed in recent years under different parametric, semi-parametric, or nonparametric assumptions [see e.g., 12,8,18,2,11,20,24]. Nevertheless, specifying a correct model for ultrahigh-dimensional data remains a challenging task. To tackle this problem, several model-free sure screening procedures have been developed [see e.g., 30,19,23,1,15,21,22,16,6,17] so that the sure screening property can be achieved under much weaker assumptions on the regression function.
While sure screening methods are useful in analyzing ultrahigh-dimensional data, they have some known limitations. First, most screening methods rely on the marginal dependence between input features and the response variable. The marginal screening methods work well only if the noise features are weakly associated with the active features. To deal with strong correlations among the features for model-based screening, [10] suggest an iterative screening and model fitting procedure whose usefulness has been demonstrated empirically [10,12,8], but theoretical justifications for this approach are missing. Another limitation of many screening methods is that they are proposed either based on a specific model or under certain parametric assumptions on the features. Lastly, many screening methods are not invariant to monotone transformations of the features; that is, the screening results differ with or without making monotone transformations on the features [e.g., 19]. Recent studies have found the Kolmogorov-Smirnov test statistic useful for variable screening. As proposed by [21], variable screening using the Kolmogorov filter is fully nonparametric and is invariant under monotone transformations. In addition, the fused Kolmogorov filter [22,16] is shown to be an effective variable screening method when the input features and the response variable are either discrete or continuous.
If we denote the response variable by Y and the p-dimensional input features by X = (X_1, X_2, ..., X_p)^T ∈ R^p, then in ultrahigh-dimensional problems where p is very large relative to the sample size, sure screening methods aim to identify a subset of features X_D that contains the true active set of features X_A (i.e., A ⊆ D). Hence, sure screening procedures aim to identify the majority of the features in A^c, the complement of the index set A. In contrast, variable selection procedures more ambitiously try to recover A exactly [26,9,31]. From a different perspective, [28] introduce the concept of sufficient variable selection to deal with the "large p, small n" problem, where n is the sample size of the observed data. Let B be a p × q matrix with q ≤ p, where the columns of B are p-dimensional unit vectors e_k whose k-th element is 1. The subspace spanned by the columns of B is called a variable selection space if Y ⊥ X | B^T X, where ⊥ denotes statistical independence. The intersection of all such variable selection spaces, if it exists, is called the central variable selection space and is denoted by S^V_{Y|X}. It can be shown that the central variable selection space exists under mild conditions [3,29]. We assume the existence of S^V_{Y|X} in this paper. It can be seen that the set of features involved in S^V_{Y|X} is equivalent to X_A. Hence, a sufficient variable selection procedure is equivalent to identifying A such that Y ⊥ X_{A^c} | X_A. [27] note that the marginal screening methods identify features in A^c by evaluating the marginal independence Y ⊥ X_{A^c} instead of the conditional independence Y ⊥ X_{A^c} | X_A.
Compared with existing marginal screening methods, sufficient screening is particularly useful in at least two situations: (i) when the correlations among features are relatively strong; or (ii) when some active features are weakly dependent on the response marginally but strongly associated with the response conditional on some other correlated features. To achieve sufficient feature screening, [27] propose a variable screening framework based on conditional independence using distance correlation [25] and the Hilbert-Schmidt Independence Criterion [14]. However, [27]'s algorithm requires a dependence measure that can evaluate the association between two random vectors. As a consequence, some classic dependence measures that are defined only between two univariate random variables, such as Pearson's correlation or the Kolmogorov-Smirnov statistic [21], cannot be used under their framework. In addition, the distance correlation and the Hilbert-Schmidt Independence Criterion are not invariant under monotone transformations.
In this paper, we establish a new sufficient feature screening framework that is suitable for dependence measures defined only for univariate random variables. Our approach is therefore a generalization of the sufficient screening framework that can incorporate any dependence measure without a constraint on the dimension of the random variables (i.e., multivariate vs. univariate). In addition, we propose an ensemble algorithm to further improve the sufficient screening performance using different dependence measures; the improvement brought by the ensemble approach is illustrated by extensive simulation studies. We also propose a new dependence measure, which we call the log odds ratio statistic, to assess the statistical association between two random variables. We show that the proposed log odds ratio statistic can be used for variable screening and that the log odds ratio filter is fully nonparametric and model-free. It is also invariant under monotone transformations of the features. More importantly, it outperforms the fused Kolmogorov filter [22] in situations where the conditional cumulative distribution functions (c.d.f.s) F(y|X_j = x_1) and F(y|X_j = x_2) are close to each other for all pairs (x_1, x_2), especially when both are close to 0 or 1. By definition, the log odds ratio filter can be applied to data where the response variable and the input features are either discrete or continuous. Owing to their advantages in different situations, the proposed fused log odds ratio filter can be combined with the fused Kolmogorov filter, as complements to each other, to achieve better performance under an ensemble approach. We show that the fused log odds ratio filter enjoys sure screening properties for both marginal screening and sufficient variable screening.
The rest of this paper is organized as follows. In Section 2, we introduce a sufficient variable screening framework which can be adopted by any dependence measure defined between two univariate random variables. Based on a new dependence measure, we propose to use the fused log odds ratio filter for sufficient variable screening in Section 3. Sure screening properties of the fused log odds ratio filter are established in Section 4. Section 5 contains simulation studies and a real data application. We conclude with discussions in Section 6. Additional remarks and technical proofs are included in the appendix.

Framework
For ultrahigh-dimensional data with p ≫ n, the sparsity assumption states that only a small subset of X is associated with Y. Denoting this active set by A, [28] formulate the sparsity assumption equivalently as Y ⊥ X_{A^c} | X_A and define the central variable selection subspace S^V_{Y|X}, which involves the features in X_A. Through this definition, [28] discuss the existence and uniqueness of the central variable selection space. The conditional independence Y ⊥ X_{A^c} | X_A indicates that if we can identify X_A, we can eliminate X_{A^c} and achieve the goal of variable screening without losing any regression information. Motivated by this conditional independence, [27] propose a sufficient variable screening method using the following lemma.

Lemma 1.
[Proposition 1 of [27]] Let X_1, X_2 be arbitrary random vectors and let Y be a random variable. Then either one of the two conditions: (i) (Y, X_2) ⊥ X_1; or (ii) X_1 ⊥ X_2 | Y and X_1 ⊥ Y, implies (iii) Y ⊥ X_1 | X_2.

This lemma sheds light on sufficient variable screening. Statement (iii) implies that F(Y | X_1, X_2) = F(Y | X_2). Hence, letting X = (X_1^T, X_2^T)^T, if we iteratively eliminate X_1 at each step, treat X_2 as the new X, and repeat the process until no additional variable can be eliminated, we obtain a set of variables that contains X_A. Although statement (iii) is the ultimate goal of sufficient variable screening, it is difficult to measure the conditional independence directly because we do not know X_2 in advance. Lemma 1 enables us to validate statement (iii) by validating statement (i) or statement (ii). To this end, [27] propose two sufficient variable screening approaches based on statements (i) and (ii), which they call one-stage and two-stage sufficient variable selection, respectively. To test the conditional independence X_1 ⊥ X_2 | Y in statement (ii), [27] adopt a slicing approach that discretizes the values of Y.
In general, the sufficient variable screening framework developed under statements (i) or (ii) of Lemma 1 is only applicable to dependence measures defined for measuring the association between two random vectors. In particular, [27] use distance correlation [DC, 25] and the Hilbert-Schmidt Independence Criterion (HSIC) [14] in their study. When a measure is defined only for the dependence between two univariate random variables, e.g., the Kolmogorov statistic [21,22], we have to modify the framework so that such a measure can be used. To achieve this goal, we make separate observations from statements (i) and (ii) of Lemma 1. Statement (i) implies that Y ⊥ X_1 and X_2 ⊥ X_1. For arbitrary features X_α ∈ X_1 and X_β ∈ X_2, if Y ⊥ X_α and X_β ⊥ X_α, then (Y, X_β) ⊥ X_α and, hence, Y ⊥ X_α | X_β. On the other hand, statement (ii) suggests that if X_α ⊥ Y and X_α ⊥ X_β | Y, then Y ⊥ X_α | X_β. These relationships involve only the univariate random variables X_α, X_β, and Y, and hence inspire the following sufficient variable screening algorithms using dependence measures defined only between univariate random variables. It is noteworthy that Lemma 1 reveals the fundamental difference between sufficient variable screening methods and marginal screening methods. While traditional marginal screening methods [e.g., 10,8,22] focus on the marginal independence Y ⊥ X_α, which is the second part of statement (ii) in Lemma 1, sufficient variable screening methods directly target the conditional independence in statement (iii). To improve the performance of marginal screening, [10] propose an iterative procedure that computes the residuals from regressing the response Y on the selected variables. The residuals are then treated as a new response variable to iteratively screen the unselected variables and capture important variables missed in the previous step.
Since the residuals are obtained based on the previously selected variables, to some extent, it uses the conditional information to avoid missing important variables.

Algorithms
To make the proposed framework general, let I(X, Y) denote an arbitrary index that measures the statistical dependence between two univariate random variables X and Y. We first note that using either statement (i) or statement (ii) of Lemma 1 requires evaluating the marginal independence Y ⊥ X_α by computing I(Y, X_α), which conforms to the usual marginal screening methods. To complete the route of statement (i) of Lemma 1, we additionally need to assess the marginal independence X_β ⊥ X_α by computing I(X_β, X_α). On the other hand, to follow the path of statement (ii) of Lemma 1, we need to evaluate the conditional independence X_α ⊥ X_β | Y in addition to the marginal independence Y ⊥ X_α. The assessment of this conditional independence is not trivial, and we propose a slicing approach to overcome the challenge of computing E[I(X_α, X_β) | Y]. Define a general partition of the real line

H = {[s_0, s_1), [s_1, s_2), ..., [s_{H-1}, s_H)},    (2.1)

where s_0 = −∞ and s_H = ∞. We slightly abuse notation and express all intervals as [s_{h−1}, s_h), noting that the first interval (s_0, s_1) is open. The partition H in (2.1) is arbitrary on the real line and can be used for discretizing any continuous random variable. Let H_y be a partition for Y with H_y slices [s_{h_y−1}, s_{h_y}) for h_y = 1, ..., H_y. We define Ỹ = h_y if and only if Y ∈ [s_{h_y−1}, s_{h_y}). If Y is already discrete, we simply set Ỹ = Y. With the discrete Ỹ, we can approximate E[I(X_α, X_β) | Y] by (1/H_y) Σ_{h_y=1}^{H_y} I^{(h_y)}(X_α, X_β), where I^{(h_y)}(X_α, X_β) is computed within the h_y-th slice of Y. We denote the sample estimates of the pivotal quantities by Î(Y, X_α) and Î(X_β, X_α), respectively. The algorithm of sufficient variable screening is as follows.
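The slicing approximation of E[I(X_α, X_β) | Y] described above can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: it uses |Pearson correlation| as a stand-in for a generic dependence index I(·,·), and the function name is ours.

```python
import numpy as np

def sliced_conditional_dependence(x_a, x_b, y, n_slices=2, measure=None):
    """Approximate E[I(x_a, x_b) | Y] by averaging a dependence
    measure over quantile slices of y, as in partition (2.1)."""
    if measure is None:
        # |Pearson correlation| as a simple stand-in for I(., .)
        measure = lambda u, v: abs(np.corrcoef(u, v)[0, 1])
    # end-points s_h set to the h/H_y-th sample quantiles of y
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_slices + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    vals = []
    for h in range(n_slices):
        mask = (y >= edges[h]) & (y < edges[h + 1])
        if mask.sum() > 2:  # need a few points to estimate dependence
            vals.append(measure(x_a[mask], x_b[mask]))
    return float(np.mean(vals))
```

Any univariate dependence measure can be passed in through `measure`, which is what makes the framework applicable to statistics such as the Kolmogorov or LogOR statistic.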
Sufficient Variable Screening (SVS):
1. Compute û_α = Î(Y, X_α) for α = 1, ..., p, and let Â_1 be the index set of the d_1 features with the largest û_α values.
2a. For each β ∈ Â_1^c, compute û_β = max_{α ∈ Â_1} Î(X_β, X_α), and let Â_2 be the index set of the d_2 features with the largest û_β values.
2b. Or alternatively, use partition (2.1) to slice Y into H_y non-overlapping slices and obtain its discrete surrogate Ỹ. Then, for each β ∈ Â_1^c, compute v̂_β = max_{α ∈ Â_1} (1/H_y) Σ_{h_y=1}^{H_y} Î^{(h_y)}(X_β, X_α), and let Â_2 be the index set of the d_2 features with the largest v̂_β values.
3. The final estimate of the active feature set is Â = Â_1 ∪ Â_2.

In the above algorithm, Steps 2a and 2b follow the separate paths led by statement (i) and statement (ii) of Lemma 1 to achieve sufficient variable screening. We call the SVS methods using each approach the SVS-I and SVS-II procedures, respectively. While the traditional marginal screening methods perform only Step 1, assessing the marginal dependence and estimating the active feature set directly by Â_1, the additional step (2a or 2b) ensures the sufficiency of the selected features.
From a practical point of view, for an observed sample, we need to determine how to partition the response variable Y and choose appropriate values of d_1 and d_2. In our implementation, we set the end-points s_{h_y} in (2.1) to be the h_y/H_y-th sample quantiles of Y and set H_y = 2. [19] suggest using d_n = n/log n as the size of Â, where n is the sample size, and we follow the suggestion of [27] to set d_1 = 0.95 d_n and d_2 = 0.05 d_n. Our simulations indicate that these settings generally perform well.
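The SVS-I path (Steps 1 and 2a) can be sketched as follows. This is a hypothetical minimal implementation, not the authors' code; it again assumes |Pearson correlation| as the dependence index I(·,·), and any other univariate measure could be plugged in.

```python
import numpy as np

def svs_one_stage(X, y, d1, d2, measure=None):
    """Sketch of the SVS-I procedure (statement (i) of Lemma 1).
    Step 1: marginal screening; Step 2a: recover features that are
    strongly dependent on an already-selected feature."""
    if measure is None:
        measure = lambda u, v: abs(np.corrcoef(u, v)[0, 1])
    n, p = X.shape
    # Step 1: marginal dependence with the response
    u = np.array([measure(X[:, j], y) for j in range(p)])
    A1 = set(np.argsort(u)[::-1][:d1])
    # Step 2a: u_beta = max_{alpha in A1} I(X_beta, X_alpha)
    rest = [j for j in range(p) if j not in A1]
    v = np.array([max(measure(X[:, b], X[:, a]) for a in A1) for b in rest])
    A2 = set(np.array(rest)[np.argsort(v)[::-1][:d2]])
    # Step 3: the final estimate is the union of the two sets
    return sorted(A1 | A2)
```

In practice d1 and d2 would be set to 0.95 d_n and 0.05 d_n with d_n = n/log n, as described above.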

Ensemble
The proposed SVS framework can be implemented using any index I(X, Y ), such as Pearson correlation, distance correlation [25], Hilbert-Schmidt Independence Criterion [14], and the Kolmogorov statistic [21,22] among others.
However, different dependence measures have their own advantages in different situations, and it is difficult to know in advance which measure is preferable for a specific data set. Therefore, we further propose an ensemble SVS approach that combines different dependence measures under the SVS framework. Let I_m(X, Y), m = 1, ..., M, be M different dependence measures under consideration. The algorithm of ensemble sufficient variable screening is as follows.

Ensemble Sufficient Variable Screening (ESVS):
1. For each measure I_m, compute r̂^m_α = Î_m(Y, X_α) for α = 1, ..., p. Let ϕ(r̂^m_α) be the rank of r̂^m_α among all p features such that r̂^m_α = r̂^m_{(ϕ(r̂^m_α))} (i.e., a relatively larger r̂^m_α corresponds to a higher rank ϕ(r̂^m_α)). Then define a combined rank ϕ(r̂*_α) for each X_α, and obtain an estimated index set Â_1 as the set of X_α's with the d_1 largest values of ϕ(r̂*_α).
2. Apply Step 2a or Step 2b of the SVS algorithm under each measure I_m, combine the resulting ranks analogously into ϕ(v̂*_β), and obtain an estimated index set Â_2 as the set of X_β's with the d_2 largest values of ϕ(v̂*_β).

The final estimate of the active feature set is Â = Â_1 ∪ Â_2.
We call the methods using Step 2a and Step 2b in the ESVS algorithm the ESVS-I and ESVS-II procedures, respectively.
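One way to realize the rank-combination step of ESVS is sketched below. The aggregation rule here (summing per-measure ranks, with a larger statistic receiving a higher rank) is our assumption for illustration; the paper's exact combination rule may differ.

```python
import numpy as np

def ensemble_rank(stats):
    """Combine several screening statistics by rank aggregation.
    `stats` is an (M, p) array: M dependence measures over p features.
    NOTE: summing per-measure ranks is one plausible choice of
    combination rule, assumed here for illustration."""
    M, p = stats.shape
    ranks = np.zeros((M, p))
    for m in range(M):
        # rank 1 = smallest statistic, rank p = largest
        ranks[m, np.argsort(stats[m])] = np.arange(1, p + 1)
    return ranks.sum(axis=0)

# features are then screened by keeping the d largest combined ranks
```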

Log odds ratio filter
While many existing dependence measures can be used for the proposed SVS framework, we propose a new measure called the log odds ratio statistic, which is fully nonparametric and invariant under monotone transformation, to be used under the SVS framework.

Motivation
Before presenting the proposed log odds ratio filter for sufficient variable screening, we briefly review the fused Kolmogorov filter [22]. Based on the fact that a random variable X_j is independent of Y if and only if the conditional distributions of X_j | Y = y are identical for all y, [22] propose using

K_j = sup_{y_1, y_2} sup_x |F_{X_j|Y}(x | y_1) − F_{X_j|Y}(x | y_2)|    (3.1)

to measure the dependence between X_j and Y, where F_{X_j|Y} is the conditional c.d.f. of X_j given Y. To facilitate empirical estimation, [22] consider a sliced version of (3.1) using the partition defined in (2.1). Given a partition H_y, define a discrete random variable Ỹ ∈ {1, ..., H_y} such that Ỹ = h_y if and only if Y is in the h_y-th slice. Then [22] let

K_j(H_y) = max_{h_1, h_2} sup_x |F_{X_j|Ỹ}(x | h_1) − F_{X_j|Ỹ}(x | h_2)|    (3.2)

and show that X_j is independent of Y if and only if K_j(H_y) = 0 for all possible choices of H_y. Since K_j(H_y) depends on a particular choice of partition H_y, motivated by a fusion idea [5], [22] define the fused Kolmogorov filter as

K*_j = Σ_{l=1}^N K_j(H^l_y),    (3.3)

where the l-th partition H^l_y contains H^l_y intervals. [22] show that the sample estimate K̂*_j of (3.3) can be effectively used for marginal variable screening as the fused Kolmogorov filter.
While it is useful in variable screening, the fused Kolmogorov filter can hardly identify informative features when F_{X_j|Y}(x | Y = y_1) and F_{X_j|Y}(x | Y = y_2) are essentially different but numerically very similar, especially when both are close to 0 or 1. Consider two scenarios: in the first, F_{X_j|Y}(x | Y = y_1) = 0.01 and F_{X_j|Y}(x | Y = y_2) = 0.001; in the second, F_{X_j|Y}(x | Y = y_1) = 0.41 and F_{X_j|Y}(x | Y = y_2) = 0.401. Although both differences are 0.009, the difference in the first scenario is much more noteworthy because F_{X_j|Y}(x | Y = y_1) is 10 times larger than F_{X_j|Y}(x | Y = y_2). Therefore, in order to capture the important variables in such a case, we propose to measure the statistical dependence of X_j and Y by the difference between log[F/(1 − F)] values (i.e., the log odds) rather than the difference between F_{X_j|Y}(x | Y = y_1) and F_{X_j|Y}(x | Y = y_2).
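The numerical contrast between the two scenarios can be checked directly. The sketch below follows the example in the text: both c.d.f. pairs differ by 0.009, yet their log odds differ by roughly two orders of magnitude.

```python
import math

def log_odds(f):
    """log[F / (1 - F)] for a c.d.f. value F in (0, 1)."""
    return math.log(f / (1.0 - f))

# Scenario 1: both conditional c.d.f. values near 0
d_tail = abs(log_odds(0.01) - log_odds(0.001))   # about 2.31
# Scenario 2: both values near the middle of (0, 1)
d_mid = abs(log_odds(0.41) - log_odds(0.401))    # about 0.037
```

The raw c.d.f. differences are identical (0.009), but the log-odds scale amplifies the tail difference, which is exactly the behavior the LogOR filter exploits.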

Proposed methodology
We define the log odds ratio statistic as

R_{Y|X_j} = sup_{x_1, x_2} sup_y | log[ F(y | X_j = x_1) / {1 − F(y | X_j = x_1)} ] − log[ F(y | X_j = x_2) / {1 − F(y | X_j = x_2)} ] |.    (3.4)
In order to avoid singularity points at F = 0 and F = 1, we truncate the conditional c.d.f. to lie within [τ, 1 − τ] for a small constant τ > 0. In general, the magnitude of τ should be small (e.g., 10^{−5}), and it is related to our regularity condition (C2a), which is needed to ensure the sure screening property. The conditional c.d.f. in (3.4) is based on the conditional distribution of Y | X, whereas the Kolmogorov statistic (3.1) is in the form of X | Y. If Y and X_j are independent, then the difference in (3.4) is zero for any values x_1 and x_2. Each term in (3.4) is the difference between two log odds, which is equivalent to a log odds ratio. Hence, we call the variable screening procedure based on (3.4) the log odds ratio filter (LogOR filter hereafter).
For a binary response Y, [12] propose a maximum marginal likelihood screening method under the logistic regression model. Specifically, for each X_j, they consider the model log[Pr(Y = 1 | X_j) / Pr(Y = 0 | X_j)] = β_{0j} + β_j X_j and rank variables based on the maximum likelihood estimate β̂^M_j of β_j. [12] establish the sure screening property of this approach under certain conditions. Under the logistic regression model, it is well known that the interpretation of β_j is related to the log odds ratio between different values of X_j. However, the LogOR filter (3.4) is more general than the maximum marginal likelihood screening method because the log odds ratio statistic works not only for a binary response but also for a continuous response, and it is completely model-free.
For continuous X_j, we can follow a slicing approach similar to [22] to ease estimation. Given a specific partition H_{x_j} on X_j as defined in (2.1), define a discrete random variable X̃_j ∈ {1, ..., H_{x_j}} such that X̃_j = h if and only if X_j is in the h-th slice. Accordingly, we define

R_{Y|X_j}(H_{x_j}) = max_{h_1, h_2} sup_y | log[ F(y | X̃_j = h_1) / {1 − F(y | X̃_j = h_1)} ] − log[ F(y | X̃_j = h_2) / {1 − F(y | X̃_j = h_2)} ] |.    (3.5)

The following proposition demonstrates some characteristics of R_{Y|X_j}(H_{x_j}).

Proposition 1.
For R_{Y|X_j} defined in (3.4) and R_{Y|X_j}(H_{x_j}) defined in (3.5), the following statements are true.
The proof of Proposition 1 is included in the appendix. Proposition 1 indicates that, for continuous X j , R Y |Xj (H xj ) serves as a good surrogate of R Y |Xj to capture the dependence between X j and Y .
With an observed random sample (X_i, Y_i), i = 1, ..., n, where X_i = (X_{i1}, ..., X_{ip})^T, an estimate of (3.5) can be obtained as

R̂_{Y|X_j}(H_{x_j}) = max_{h_1, h_2} sup_y | log[ F̂(y | X̃_j = h_1) / {1 − F̂(y | X̃_j = h_1)} ] − log[ F̂(y | X̃_j = h_2) / {1 − F̂(y | X̃_j = h_2)} ] |,    (3.6)

where F̂(y | X̃_j = h) is the empirical conditional c.d.f. computed from the observations within the h-th slice, n_h is the number of observations within the h-th slice, and X̃_{ij} = h if X_{ij} is in the h-th slice. When X_j takes finitely many discrete values, we can let X̃_j = X_j without using a partition. When X_j takes infinitely many discrete values, such as following a Poisson distribution, or X_j is continuous, we can use the h/H_{x_j}-th sample quantiles in (2.1) to define the end-points of the partition. To search for the supremum over y, for any given h_1 and h_2, we can search over a set of grid points defined on the support of Y, Ξ_y = {y_i : −∞ < y_1 < y_2 < ··· < y_{k−1} < y_k < ∞}, to find the value y* ∈ Ξ_y at which (3.6) is maximized. Note that we can also use the partition H_y to define the grid points for Ξ_y, and this approach works well in our simulation studies.
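A compact sample version of the sliced statistic, in the spirit of (3.5) and (3.6), can be sketched as follows. This is our own simplified implementation, not the authors' code: it slices X_j by sample quantiles, evaluates empirical conditional c.d.f.s of Y on a small quantile grid Ξ_y, and truncates them to [τ, 1 − τ] as described above.

```python
import numpy as np

def logor_statistic(x, y, n_slices=3, tau=1e-5, grid=None):
    """Sample log odds ratio statistic between a feature x and a
    response y (a sketch of (3.5)/(3.6), with quantile slicing of x)."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_slices + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    if grid is None:
        grid = np.quantile(y, [0.25, 0.5, 0.75])  # grid points Xi_y
    stat = 0.0
    for h1 in range(n_slices):
        m1 = (x >= edges[h1]) & (x < edges[h1 + 1])
        for h2 in range(h1 + 1, n_slices):
            m2 = (x >= edges[h2]) & (x < edges[h2 + 1])
            for yv in grid:
                # empirical conditional c.d.f.s, truncated to [tau, 1-tau]
                f1 = np.clip(np.mean(y[m1] <= yv), tau, 1.0 - tau)
                f2 = np.clip(np.mean(y[m2] <= yv), tau, 1.0 - tau)
                diff = abs(np.log(f1 / (1 - f1)) - np.log(f2 / (1 - f2)))
                stat = max(stat, float(diff))
    return stat
```

A strongly dependent feature shifts the conditional c.d.f. of Y drastically across slices of x, producing a large log-odds gap, while an independent feature yields only sampling noise.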
To further improve the stability and accuracy of variable screening, we can define a fused LogOR filter as

R*_{Y|X_j} = Σ_{l=1}^N R̂_{Y|X_j}(H^l_{x_j}),    (3.8)

which is an ensemble over N different partitions H^l_{x_j}, each containing H^l_{x_j} intervals. The population version of (3.8) is defined analogously with R_{Y|X_j}(H^l_{x_j}) in place of its sample estimate. To obtain different partitions for the fusion step, we consider multiple uniform slicing schemes under which H^l_{x_j} has H^l_{x_j} slices based on the sample quantiles of X_j for 1 ≤ l ≤ N.
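The fusion step itself is a one-liner once a sliced statistic is available. In the sketch below, `stat_fn` stands for any function computing a sliced screening statistic for a given number of slices; this interface is an assumption made for illustration.

```python
import math

def fused(stat_fn, x, y, n):
    """Ensemble a sliced screening statistic over uniform slicing
    schemes with H = 3, ..., floor(log n) slices (one common choice;
    the number of schemes N is implied by this range)."""
    h_max = max(3, int(math.floor(math.log(n))))
    return sum(stat_fn(x, y, n_slices=h) for h in range(3, h_max + 1))
```

Summing over several slicing schemes makes the screening statistic less sensitive to any single, possibly unlucky, choice of partition.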

The fused log odds ratio filter for sufficient variable screening
In the marginal variable screening step, we denote the true active marginal feature set by A_1. Similar to many existing measures, the fused LogOR filter can readily be used for marginal variable screening by selecting the input features X_j with relatively large R*_{Y|X_j} values. In the SVS-I procedure, after obtaining the marginal screening set Â_1, we need to compute û_β = max_{α ∈ Â_1} Î(X_β, X_α) for every β ∈ Â_1^c to obtain Â_2; hence, to use the fused LogOR filter in this step, we compute û_β = max_{α ∈ Â_1} R*_{X_β|X_α}. On the other hand, in the SVS-II procedure, after obtaining the marginal screening set Â_1, we need to compute the sliced conditional statistic, where H_y is the number of slices used in the partition (2.1) for Y to define its discrete surrogate Ỹ. Therefore, in the sufficient screening step, we use V*_{X_β|X_α;Y} in (3.10) to incorporate the fused LogOR filter. While there are many choices for the partition H_y of Y, we use the h_y/H_y-th sample quantiles in (2.1) to define its end-points. The fused LogOR filter and the fused Kolmogorov filter have advantages in different situations, and our empirical experience indicates that it is best to combine them to achieve superior screening results; therefore, in practice we can apply the ensemble algorithm proposed in Section 2 to combine them. We consider an ensemble of the fused LogOR filter and the fused Kolmogorov filter because they share many common characteristics, such as being fully nonparametric, model-free, and invariant under monotone transformations of the response variable and input features. A detailed algorithm on how to combine the two filters is included in the appendix.

Theory
In this section, we will show that the fused LogOR filter (3.8) enjoys sure screening properties for both marginal screening and sufficient screening procedures.

Regularity conditions for marginal screening
Note that the fused LogOR filter is an ensemble over several R̂_{Y|X_j}(H^l_{x_j})'s, which depend on the empirical quantiles of X_j. If we knew the distribution of X_j, we could use an oracle uniform partition scheme such that the partition H^l_{o(x_j)} contains the intervals defined by the h/H^l_{x_j}-th theoretical quantiles of X_j, for h = 1, ..., H^l_{x_j}. In this situation, since the true distribution of X_j is assumed known, we denote the corresponding estimated log odds ratio statistic by R̂_{Y|X_j}(H^l_{o(x_j)}) and define

R^(o)_{Y|X_j} = Σ_{l=1}^N R̂_{Y|X_j}(H^l_{o(x_j)}).    (4.1)

We call (4.1) the oracle fused LogOR filter, following the terminology of [22]. We define the screening sets obtained using the oracle fused LogOR filter (4.1) and the fused LogOR filter (3.8) as Â^(o)_1 (4.2) and Â_1 (4.3), respectively, each collecting the d_n features with the largest filter values, for some predetermined d_n.
We should note that the definitions of the screening feature sets based on (4.2) and (4.3) are equivalent to the definitions commonly used in the literature [10,8,19], which select features whose filter values exceed a threshold of the form cn^{−ν} for some predetermined positive constants c and ν. [19] discuss the connection and equivalence of these definitions. In order to establish the sure screening properties for Â^(o)_1 and Â_1, we need to assume the following regularity conditions.
(C1) There exists a set D such that A_1 ⊂ D and the features in D are separated from those in D^c by a positive margin Δ_D in the population fused LogOR statistic.
(C2a) For some constants 0 < τ_1 ≤ τ_2 < 1 and for all y, j, and x_1, τ_1 ≤ F(y | X_j = x_1) ≤ τ_2.
(C2b) The sample quantiles of each X_j are sufficiently close to the corresponding population quantiles.

Condition (C1) is the key condition, commonly used in the literature, for establishing the sure screening property of marginal screening methods. It ensures that the important variables in the active set A_1 are also marginally more important than the noise variables. In the appendix, we provide a discussion of how Condition (C1) can be satisfied in general.
Condition (C2a) requires the conditional c.d.f. to be bounded away from 0 and 1 so that the log odds ratio statistic (3.4) is well defined. This condition can easily be met by setting a small truncation value τ as described in Section 3. Condition (C2b) requires the sample quantiles of the X_j's to be close enough to the population quantiles. Note that no other distributional or moment assumptions are needed to establish the marginal sure screening property of the fused LogOR filter. The proof of Theorem 1 is provided in the appendix. According to (4.4), the fused LogOR filter attains the same order as the oracle fused LogOR filter, indicating that the proposed slicing scheme using sample quantiles is as good as using the oracle information about the theoretical quantiles. In addition, Theorem 1 provides guidance on choosing the partition H^l_{x_j} because it requires H^l_{x_j} ≤ log n. Combining this requirement with suggestions in the literature [5,22], we set H^l_{x_j} = 3, ..., ⌊log n⌋ to ensure a sufficient sample size within each slice.

Theorem 1. Assume conditions (C1) and (C2). If H^l_{x_j} ≤ log n for all l and d_n ≥ |D|, then, for the screening sets (4.2) and (4.3), the sure screening bound (4.4) holds.
From this result, we can see that both the oracle fused LogOR filter and the fused LogOR filter possess the sure screening property with probability tending to 1 provided the signal gap Δ_D is not too small. For N ≤ log n, if there exists a constant 0 < κ < 1 such that Δ_D ≳ n^{−κ}, the above condition on Δ_D is equivalent to log p ≲ n^ξ for any ξ ∈ (0, 1 − 2κ). This requirement on the order of p is the same as that of many existing methods [e.g., 10]. However, the fused LogOR filter is fully nonparametric and, hence, more flexible.
It is worth noting that the constant η in (4.4) is independent of d_n. Therefore, as long as we choose a sufficiently large d_n such that d_n ≥ |D|, the sure screening property is guaranteed. In our simulations, we use the conventional value d_n = n/log n suggested by [19].

Regularity conditions for sufficient screening
In the sufficient screening step of the SVS-I procedure, the major difference is to replace R*_{Y|X_j} with R*_{X_k|X_j}. To establish the sure screening property of using R*_{X_k|X_j}, we need assumptions similar to conditions (C1) and (C2) based on the conditional c.d.f. F(X_k | X_j) instead of F(Y | X_j); we could then obtain results similar to Theorem 1. Given this similarity, we omit the discussion of this case and focus on the SVS-II procedure, which is more complicated because V*_{X_k|X_j;Y} in (3.10) involves partitions of both X_j and Y. The second step of the SVS-II procedure selects additional features that may be marginally independent of the response. We can define the oracle filter for (3.10) as V^(o)*_{X_k|X_j;Y} (4.5), of which the corresponding population parameter is V_{X_k|X_j;Y}. Note that the population index set for the sufficient variable screening step is A_2. We define the screening sets using V^(o)*_{X_k|X_j;Y} and V*_{X_k|X_j;Y} as (4.6) and (4.7), respectively.

In order to establish the sure screening property for Â_2, we need to assume the following regularity conditions.
(C1*) There exists a set D_2 such that A_2 ⊂ D_2 and the features in D_2 are separated from those in D_2^c by a positive margin Δ_{D*}.
(C2*) For some constants 0 < τ_1 ≤ τ_2 < 1, with τ* = min(τ_1, 1 − τ_2), and for all x_0, j, and x_1, the conditional c.d.f. F(x_0 | X_j = x_1) within each slice of Y lies in [τ_1, τ_2].
(C3*) For any ε > 0, if 1/H_y − ε ≤ Pr{Y ∈ [k_1, k_2)} ≤ 1/H_y + ε, then for any y_1, y_2 ∈ [k_1, k_2), the conditional dependencies R_{X_k|X_j|Y=y_1} and R_{X_k|X_j|Y=y_2} are close to each other.

Assumption (C1*) ensures that the proposed screening statistic separates the features in A_2 from the noise features. Condition (C2*) controls V_{X_k|X_j,h_y} and R*_{X_k|X_j,h_y} within each slice by imposing the conditions of Theorem 1 within each slice; hence, it generally holds if the conditions of Theorem 1 hold. Finally, condition (C3*) assumes that, when we slice Y for conditioning using partition H_y, for any given values y_1 and y_2 within a particular slice, the conditional dependencies are close to R_{X_k|X_j|Y} based on the original continuous Y. This condition is commonly used in the sufficient dimension reduction literature [13,4] when slicing is employed.

Sure screening property for sufficient screening
Theorem 2. Assume conditions (C1*)-(C3*). If H^l_{x_j} ≤ log n for all l and d_n ≥ |A_2|, then, for the screening sets (4.6) and (4.7), a sure screening bound analogous to (4.4) holds for some generic positive constant C.

The results of Theorem 2 indicate that the SVS step using the fused LogOR filter also enjoys the sure screening property if Δ_{D*} ≳ √( log n · log(p² N log n) / n ).

Numerical studies
In this section, we evaluate the performance of the proposed fused LogOR filter under sufficient variable screening framework through simulations and a real data example.

Simulations
In [22], the fused Kolmogorov filter has been compared with several other existing screening methods in the literature, including marginal correlation screening [10], nonparametric independence screening [8], distance correlation screening [19], rank correlation screening [18], empirical likelihood screening [2], and quantile-adaptive screening [15]. The fused Kolmogorov filter demonstrated superior performance over these methods due to its unique characteristics, such as being fully nonparametric, model-free, and invariant to monotone transformations. Our proposed fused LogOR filter shares the same advantages and is expected to be more sensitive to the tails of the conditional distribution, i.e., where F(Y|X_j) is close to 0 or 1. Hence, we focus on comparing the fused LogOR filter with the fused Kolmogorov filter [22] in our simulations. Our empirical results indicate that the fused Kolmogorov filter and the fused LogOR filter are advantageous in different situations, and neither is consistently better than the other. Hence, we also consider the ensemble filter, which combines the two filters following the ensemble procedure described in Section 2 to take advantage of both. We denote these methods by "K", "LogOR", and "Ens", respectively. Not only do we demonstrate the effectiveness of the fused LogOR filter for marginal variable screening, but we also demonstrate that the sufficient variable screening procedures improve the marginal screening results. In addition, we show that the ensemble of the fused Kolmogorov filter and the fused LogOR filter significantly improves the screening performance in cases where neither approach alone is able to identify all active features. For each of our simulated models, we repeat each experiment 200 times and report the following criteria to evaluate the variable screening results.
• P_i: the proportion of replicates in which an individual active predictor is selected.
• P_a: the proportion of replicates in which all active predictors are selected.
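These criteria can be computed directly from the per-replicate selection results; a minimal sketch (the data structures are our own, for illustration):

```python
def screening_criteria(selected, active):
    """Compute P_i for each active predictor and P_a over replicates.

    selected: list of sets, one per replicate, of selected variable indices.
    active:   set of truly active variable indices.
    """
    n_rep = len(selected)
    # P_i: fraction of replicates in which predictor j was selected
    p_i = {j: sum(j in s for s in selected) / n_rep for j in active}
    # P_a: fraction of replicates in which ALL active predictors were selected
    p_a = sum(active <= s for s in selected) / n_rep
    return p_i, p_a
```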
Note that the results are better when P_i and P_a are closer to 1. For each simulated model, we consider the settings n = 200, 400 and p = 500, 2000, 5000.

Example 1. Let X = (X_1, . . . , X_p)^T ∼ N(0, Σ), where Σ follows an autoregressive structure with elements σ_ij = 0.7^|i−j|, and let ε be a standard normal random variable. We consider the following settings of T_y(Y) and T(X):

The models in Example 1 are considered by [22] and are designed under strictly monotone univariate transformations. From the results presented in Table 1, we observe that both the fused Kolmogorov filter and the fused LogOR filter perform well for all models, with no obvious difference between the two. Therefore, the fused LogOR filter shares the advantages of the fused Kolmogorov filter in marginal screening.
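The autoregressive covariance used in Example 1 is straightforward to construct explicitly; a small sketch (the function name is ours):

```python
import numpy as np

def ar1_cov(p, rho=0.7):
    """AR(1)-structured covariance with entries sigma_ij = rho^|i-j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])
```

The predictors can then be drawn as, e.g., `rng.multivariate_normal(np.zeros(p), ar1_cov(p), size=n)`.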
Since the proposed fused LogOR filter is more robust for detecting active features that are associated with the response variable in the tails of the conditional distribution, in the next example we consider models with different conditional c.d.f.s F_{Y|X_j}(y|X_j).

Example 2. Let X_1 be a Bernoulli(0.5) random variable and X_2, . . . , X_p be i.i.d. standard normal random variables. Then Y = ε_1 if X_1 = 1 and Y = ε_2 if X_1 = 0. We consider the following settings for ε_1 and ε_2:

(M2a) ε_1 ∼ N(0, 1) and ε_2 ∼ t(1);
(M2b) ε_1 ∼ N(0, 1) and ε_2 ∼ 0.5N(0, 1) + 0.5N(−1, 3).

In this example, since X_1 is the only active predictor, we report only P_i in Table 2, from which we observe that the proposed fused LogOR filter significantly outperforms the fused Kolmogorov filter. In the following example, we consider a slightly more complicated case than Example 2 by introducing additional active predictors; the simulation results are reported in Table 3.
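A data generator for setting (M2a) can be sketched as follows (our own helper, not code from the paper). Under (M2a) the two conditional distributions of Y share the same center and differ mainly in the tails, which is exactly where the logit transform is most sensitive:

```python
import numpy as np

def simulate_m2a(n, p, rng):
    """Example 2 (M2a): X_1 ~ Bernoulli(0.5); X_2, ..., X_p i.i.d. N(0, 1);
    Y ~ N(0, 1) when X_1 = 1 and Y ~ t(1) when X_1 = 0, so X_1 changes
    only the tail behaviour of the conditional distribution of Y."""
    X = rng.normal(size=(n, p))
    X[:, 0] = rng.integers(0, 2, size=n)  # Bernoulli(0.5) first column
    eps1 = rng.normal(size=n)             # light-tailed noise
    eps2 = rng.standard_t(1, size=n)      # heavy-tailed (Cauchy) noise
    y = np.where(X[:, 0] == 1, eps1, eps2)
    return X, y
```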
From Table 3, we can see that neither the fused Kolmogorov filter nor the fused LogOR filter is able to identify both active predictors as a marginal screening procedure. Hence, we consider the ensemble of the two as well as the ensemble SVS procedures. The ensemble filter brings a substantial improvement, especially when the sample size is large enough. In this case, the SVS procedures perform similarly to the marginal screening procedure using the ensemble filter.

Example 4. Let X_1 be a Bernoulli(0.5) random variable, and let (X_2, . . . , X_p)^T follow a multivariate normal distribution with mean 0 and covariance matrix Σ = (σ_ij), i, j = 2, . . . , p, with σ_ii = 1; σ_i5 = σ_5i = ρ^κ for i ≠ 5; and σ_ij = ρ for i ≠ j, i ≠ 5, j ≠ 5. Consider a general model: where ε_1 ∼ N(0, 1) and ε_2 ∼ t(1). We consider the following settings: In all models of Example 4, the predictors X_2, . . . , X_p, except for X_5, are equally correlated with coefficient ρ, while X_5 has correlation ρ^κ with the other p − 2 predictors. In these models, X_1–X_4 are active predictors, and X_5 is also an active variable that is marginally independent of the response. The results of Example 4 are gathered in Tables 4–6, which demonstrate that the ESVS-I and ESVS-II procedures significantly improve on the marginal screening procedures.

In the next example, all predictors other than the active ones serve as noise predictors. We report the variable screening results in Table 7. From the results, all marginal screening methods have difficulty identifying X_7 as an active predictor, but the proposed SVS procedures significantly improve the chance of selecting it. We also observe that the fused LogOR filter is more competent than the fused Kolmogorov filter at selecting X_2 and X_5. We further examine how the variable screening step helps predict the response variable by fitting a generalized additive model (GAM) and a random forest (RF) model using the selected top 7 predictors.
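The covariance structure of Example 4 for (X_2, . . . , X_p) can be built as below (a sketch under the stated structure; the zero-based position of X_5 within this block is our own bookkeeping):

```python
import numpy as np

def example4_cov(p, rho, kappa):
    """Covariance of (X_2, ..., X_p) in Example 4: unit variances,
    equi-correlation rho off the diagonal, except that X_5 has
    correlation rho**kappa with every other predictor."""
    sigma = np.full((p - 1, p - 1), rho)
    np.fill_diagonal(sigma, 1.0)
    k = 3  # X_5 is the 4th entry of the block (X_2, ..., X_p)
    sigma[k, :] = rho ** kappa
    sigma[:, k] = rho ** kappa
    sigma[k, k] = 1.0
    return sigma
```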
We report the mean and standard deviation of the prediction mean squared error (PMSE) on the validation data based on 100 replications. As a reference, which can be treated as an oracle approach, we compute the mean and standard deviation of the PMSE using the original 7 variables over validation data consisting of 100 observations randomly selected from the 500 data points. The average PMSE using the original 7 variables is 0.2467 for GAM and 0.0418 for RF, with standard deviations of 0.035 and 0.017, respectively. From Table 8, we observe that, as a marginal screening method, the fused LogOR filter and the ensemble filter outperform the fused Kolmogorov filter by itself. Both SVS procedures further improve the prediction accuracy.
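The PMSE summary reported above amounts to the following computation over replicates (a generic sketch; the model fitting itself is omitted):

```python
import numpy as np

def pmse(y_true, y_pred):
    """Prediction mean squared error on one validation set."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

def summarize_pmse(pmse_values):
    """Mean and (sample) standard deviation of PMSE across replications."""
    v = np.asarray(pmse_values)
    return float(v.mean()), float(v.std(ddof=1))
```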

Discussions
In this paper, we propose a general sufficient variable screening framework that works with any dependence measure defined between two univariate random variables. Two separate sufficient variable screening procedures are proposed to overcome the limitations of marginal screening methods when active variables are marginally independent of the response variable. In addition, an ensemble approach is proposed to combine the advantages of different dependence measures to further boost the screening performance. Furthermore, a new dependence measure, the log odds ratio statistic, is proposed for variable screening; it enjoys the sure screening properties for both marginal and sufficient variable screening. The fused LogOR filter overcomes the challenge faced by the fused Kolmogorov filter when the conditional c.d.f. is close to 0 or 1. It has been demonstrated empirically that the ensemble of the fused LogOR filter and the fused Kolmogorov filter delivers superior screening results in most cases. Under the current ensemble framework, all candidate screening methods are treated equally in the ensemble step; obtaining an optimal weighting over the candidate screening methods is an interesting direction for future research.