Minimal σ-field for flexible sufficient dimension reduction∗

Sufficient Dimension Reduction (SDR) has become an important tool for mitigating the curse of dimensionality in high dimensional regression analysis. Recently, Flexible SDR (FSDR) has been proposed to extend SDR by finding lower dimensional projections of transformed explanatory variables. The dimensions of the projections, however, cannot fully represent the extent of data reduction FSDR can achieve. As a consequence, optimality and other theoretical properties of FSDR are currently not well understood. In this article, we propose to use the σ-field associated with the projections, together with their dimensions, to fully characterize FSDR, and we refer to this σ-field as the FSDR σ-field. We further introduce the concept of the minimal FSDR σ-field and consider FSDR projections with the minimal σ-field optimal. Under some mild conditions, we show that the minimal FSDR σ-field exists and attains the lowest dimensionality at the same time. To estimate the minimal FSDR σ-field, we propose a two-stage procedure called the Generalized Kernel Dimension Reduction (GKDR) method and partially establish its consistency under weak conditions. Extensive simulation experiments demonstrate that the GKDR method can effectively find the minimal FSDR σ-field and outperform other existing methods. An application of GKDR to a real life air pollution data set sheds new light on the connections between atmospheric conditions and air quality. MSC2020 subject classifications: Primary 62B05; secondary 62J02.


Introduction
Statistical analysis of high dimensional data is known to suffer from the curse of dimensionality, and dimension reduction methods are usually used to reduce the dimensionality so as to mitigate the curse. Li [17] proposed Sufficient Dimension Reduction (SDR) as an effective approach for reducing dimensionality in high dimensional regression analysis. Let Y be the response, X = (X_1, ..., X_p)^T the p-dimensional vector of explanatory variables, and B ∈ R^{p×d} a matrix of p rows and d columns. If Y and X are independent given B^T X, that is,

Y ⊥⊥ X | B^T X,    (1)

where ⊥⊥ denotes independence, then the column space of B, denoted by span(B), is called an SDR subspace. SDR subspaces are not unique. Under some mild conditions, the intersection of all SDR subspaces remains an SDR subspace, which is called the central subspace and denoted by S_{Y|X} [4]. After S_{Y|X} is obtained, subsequent analyses can be applied to P_{S_{Y|X}} X, the projection of X onto the central subspace S_{Y|X}. Another approach to coping with the curse of dimensionality in high dimensional regression analysis is to impose the assumption that the dependence of Y on X admits a lower dimensional configuration. The ACE algorithm [2] and the generalized additive model [12] are two methods following this approach. Both methods postulate the additive regression model

h(Y) = α + φ_1(X_1) + ... + φ_p(X_p) + ε,

where h, φ_1, ..., φ_p are unknown univariate functions, α is the intercept, and ε is the error term. The additive regression model can be considered an extension of the covariate transformations commonly used in linear regression analysis. It does not suffer from the curse of dimensionality and is more flexible in practice.
The idea of using transformed explanatory variables for linear regression can also be employed for sufficient dimension reduction. Wang and Zhu [23] proposed the flexible SDR model

Y ⊥⊥ X | B^T φ(X),    (2)
where B is a p × d matrix as before and φ(X) = (φ_1(X_1), ..., φ_p(X_p))^T. Here we refer to B^T φ(X) satisfying (2) as a sufficient predictor vector. Compared with the original SDR, the flexible SDR model provides more flexibility in modeling the relationship between Y and X and can lead to further dimensionality reduction. For example, consider the model Y = X_1^2 / (sin(X_2) + X_3^2 + 2) + ε, where X = (X_1, ..., X_10)^T and ε ⊥⊥ X. Under the original SDR, it can be shown that the central subspace is the three-dimensional subspace spanned by e_1, e_2 and e_3, where e_i is the column vector with the i-th entry being 1 and the others being 0 for i = 1, 2, 3. The projection of X onto the central subspace yields a three-dimensional space, which is the sample space for (X_1, X_2, X_3)^T. Under the flexible SDR, however, we can set φ(X) = (X_1, sin(X_2), X_3^2, X_4, ..., X_10)^T and B = (e_1, e_2 + e_3), and then Y ⊥⊥ X | B^T φ(X). B^T φ(X) can be considered a two-dimensional sufficient predictor vector, and we can rely on B^T φ(X) instead of the original X to explore the relationship between Y and X. Therefore, the dimensionality can be reduced to two after flexible SDR is performed. For the example discussed above, we can find another set of transformations φ̃(X) = (X_1^2, sin(X_2), X_3^2, X_4, ..., X_10)^T, which also satisfies Y ⊥⊥ X | B^T φ̃(X).
Therefore, both B^T φ(X) and B^T φ̃(X) are sufficient predictor vectors. Wang and Zhu [23] used the dimensions of B to characterize flexible SDR. In this example, the dimensions of B cannot be used to discriminate between B^T φ(X) and B^T φ̃(X) because the two share exactly the same B. Notice that the only difference between B^T φ(X) and B^T φ̃(X) is that φ_1(X_1) = X_1 whereas φ̃_1(X_1) = X_1^2. From the model, Y depends on X_1 only through X_1^2, and the sign of X_1 is not important. Therefore, B^T φ̃(X) should be considered a further reduction from B^T φ(X). The reduction from B^T φ(X) to B^T φ̃(X) can in fact be best characterized by their respective σ-fields. Let σ_{B,φ} and σ_{B,φ̃} denote the σ-fields associated with B^T φ(X) and B^T φ̃(X), respectively. It can be shown that σ_{B,φ̃} is a proper sub-σ-field of σ_{B,φ}, that is, σ_{B,φ̃} ⊂ σ_{B,φ}. The size of a σ-field can reflect the amount of information the σ-field contains: the smaller the σ-field, the more concentrated the information it contains [1]. From the perspective of dimension reduction, therefore, B^T φ̃(X) should be preferred over B^T φ(X).
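To make the example concrete, the following sketch simulates the model and verifies numerically that the two-dimensional predictor B^T φ(X) = (X_1, sin(X_2) + X_3^2)^T carries all the information about Y; the sample size and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 10
X = rng.normal(size=(n, p))
eps = 0.1 * rng.normal(size=n)

# The example model: Y = X1^2 / (sin(X2) + X3^2 + 2) + eps
Y = X[:, 0] ** 2 / (np.sin(X[:, 1]) + X[:, 2] ** 2 + 2) + eps

# phi(X) = (X1, sin(X2), X3^2, X4, ..., X10)^T
phi = X.copy()
phi[:, 1] = np.sin(X[:, 1])
phi[:, 2] = X[:, 2] ** 2

# B = (e1, e2 + e3) gives the two-dimensional predictor B^T phi(X)
B = np.zeros((p, 2))
B[0, 0] = 1.0
B[1, 1] = B[2, 1] = 1.0
Z = phi @ B  # columns: X1 and sin(X2) + X3^2

# Up to the noise term, Y is an exact function of the two columns of Z
assert np.allclose(Y - eps, Z[:, 0] ** 2 / (Z[:, 1] + 2))
```

The same check with φ̃ (replacing the first column of phi by X_1^2) also passes, since Y − ε = (Z_1)/(Z_2 + 2) with Z_1 = X_1^2 in that parametrization.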
The use of σ-fields for facilitating SDR was first introduced by Lee et al. [16]. Let σ(X) denote the σ-field for X, and G any sub σ-field of σ(X). If

Y ⊥⊥ X | G,    (3)

then G is called an SDR σ-field for the dependence of Y on X. Under some mild conditions, the intersection of all SDR σ-fields still satisfies the conditional independence assertion (3), and is called the central σ-field [16]. SDR σ-fields retain sufficient information stored in X for predicting Y, and the central σ-field is the smallest SDR σ-field. The central σ-field has the advantage of achieving the largest possible data reduction. This advantage, however, comes with a trade-off: it no longer admits the concept of dimensionality. In the example discussed above, it can be shown that the central σ-field is the σ-field associated with X_1^2 / (sin(X_2) + X_3^2 + 2), and it is not clear how to properly define the dimensionality of this σ-field. Furthermore, finding the central σ-field is equivalent to directly performing nonparametric regression, and it does not produce a set of sufficient predictor variables as in the original SDR for subsequent data exploration and analysis such as visualization and model construction.
In this article, instead of using the dimensions of B, we propose to use the σ-field associated with B^T φ(X) to characterize flexible SDR, which can be equivalently restated as

Y ⊥⊥ X | σ_{B,φ},
where σ_{B,φ} is the σ-field associated with B^T φ(X). We refer to σ_{B,φ} as the Flexible SDR σ-field or FSDR σ-field, and call (B, φ) a generating pair of σ_{B,φ}. An FSDR σ-field can be considered a compromise between the original SDR [17] and the general SDR σ-field [16]. Unlike the general SDR σ-field, an FSDR σ-field is equipped with a generating pair B and φ, and thus preserves the concept of dimensionality and a set of sufficient predictor variables (i.e., B^T φ(X)). For convenience, we refer to the column rank of B as the dimension of the FSDR σ-field σ_{B,φ}. FSDR σ-fields are not unique, and neither are the generating pairs of a given FSDR σ-field. Different FSDR σ-fields represent different degrees of data reduction. Intuitively, the smallest FSDR σ-field should be considered optimal, because it achieves the largest data reduction under the framework of flexible SDR. We refer to the smallest FSDR σ-field as the minimal FSDR σ-field. The existence of the minimal FSDR σ-field is not obvious. In a properly defined class of FSDR σ-fields, we will show that the minimal FSDR σ-field exists and also has the smallest dimension; see Section 2. Therefore, the minimal FSDR σ-field should be the inference target under flexible SDR.
In the literature, a variety of methods have been proposed to perform estimation for SDR, including sliced inverse regression (SIR; [17]), sliced average variance estimation (SAVE; [5]), and many others. These methods mainly take the inverse regression approach and only utilize the first couple of moments of the conditional distribution of X given Y. They can consistently estimate the central subspace only under various conditions such as the linearity condition. There are some other estimation methods for SDR that do not follow the inverse regression approach. The Fourier method [29] and the Minimum Average Variance Estimation (MAVE) method [27] are two such methods. Also, there are some recent works that introduce kernel methods into sufficient dimension reduction, such as the Kernel Sliced Inverse Regression (KSIR) method [26] and its extension, the Kernel Additive Sliced Inverse Regression (KASIR) method [18], the Kernel Dimension Reduction (KDR) method [10], and the Gradient Based Kernel Dimension Reduction method [11]. Specifically, the KDR method characterizes the conditional independence assertion (1) through the conditional covariance operators between the Reproducing Kernel Hilbert Spaces (RKHS) of Y and X, respectively. The key advantage of the KDR method is that it no longer requires the linearity condition, and theoretically it can fully recover the central subspace when the dimension of the true central subspace is known. Because of this advantage, we extend the KDR method to perform inference for the minimal FSDR σ-field. However, it should be noted that inference for the minimal FSDR σ-field is not restricted to the KDR method and can also be based on other methods.
In this article, we prove that under some mild conditions, the minimal FSDR σ-field uniquely exists and has the smallest dimension. This result is presented in Section 2. We propose the Generalized Kernel Dimension Reduction (GKDR) method, an extension of the KDR method, to estimate the minimal FSDR σ-field in Section 3. The GKDR method is a two-stage approach. In the first stage, it estimates multiple FSDR σ-fields with the same dimension as the minimal FSDR σ-field through the minimization of a conditional covariance operator between RKHSs. To perform nonparametric estimation, B-spline functions are utilized. We prove that under some conditions, the GKDR Stage I estimator has some good consistency properties. In the second stage, the GKDR method estimates the minimal FSDR σ-field through the maximization of conditional entropy among all the FSDR σ-fields obtained in the first stage. In Section 4, we discuss some implementation details and provide numerical results for a number of simulation studies and a real data application of the proposed method. We conclude the article with some future research directions in Section 5. All the technical proofs of the theorems and propositions are presented in Appendix A.

Existence of the minimal FSDR σ-field
Let X and Y be the supports of X and Y, and P_X and P_Y the probability distributions of X and Y, respectively. Let P_XY be the joint probability distribution of X and Y. Suppose F_X and F_Y are the Borel σ-fields over X and Y, respectively. For x ∈ X and y ∈ Y, P_{Y|X}(·|x) and P_{X|Y}(·|y) denote the conditional probability distributions of Y given X = x and of X given Y = y, respectively. We first state two assumptions required for the existence of the minimal FSDR σ-field.

Assumption 1. The family of probability distributions {P_{X|Y}(·|y) : y ∈ Y} is dominated by a σ-finite measure.
Assumption 1 is adopted from Lee et al. [16]. Under this assumption, for any two SDR σ-fields G_1, G_2 ⊆ σ(X), their intersection remains an SDR σ-field. Let X_{(-i)} denote the sub-vector of X excluding the i-th entry X_i, and φ_{(-i)}(X_{(-i)}) the sub-vector of φ(X) excluding the i-th entry φ_i(X_i). Let X_{(-i|x_i)} denote the support of the conditional distribution of X_{(-i)} given X_i = x_i. If the support of a random vector is the whole Euclidean space R^p, a sphere, a hemisphere, or a cube, then this random vector satisfies Assumption 2. In the following proposition, we state that if a random vector X satisfies Assumption 2, then the components of X are not nonlinearly confounded with each other, that is, none of them is a function of some other components.
then both ζ_i and ψ_i are zero almost surely.
The proof of Proposition 1 can be found in Appendix A.1. The proofs of the remaining theorems throughout this article are also placed in Appendix A. Recall that σ_{B,φ} is the σ-field associated with B^T φ(X). In the rest of this article, we sometimes write it as σ(B^T φ(X)) for convenience. The matrix B plays an important role in the way the X_i's affect σ_{B,φ}. We next partition the index set {1, ..., p} into three subsets I_B, J_B, and K_B according to the rows of B. In general, the partition into I_B, J_B, and K_B determines how the σ-fields of the φ_i(X_i)'s are related to σ_{B,φ}. Consider the example discussed in Section 1, with B = (e_1, e_2 + e_3), φ_1(X_1) = X_1^2, φ_2(X_2) = sin(X_2), φ_3(X_3) = X_3^2, and φ_4(X_4) = · · · = φ_10(X_10) = 0. For this FSDR σ-field, the partition is I_B = {1}, J_B = {2, 3}, and K_B = {4, ..., 10}: the σ-field of φ_1(X_1) is contained in σ_{B,φ}, the σ-field of the linear combination sin(X_2) + X_3^2 is contained in σ_{B,φ}, and φ_4(X_4), ..., φ_10(X_10) are irrelevant to σ_{B,φ}. This property holds in general. The σ-fields of the φ_i(X_i)'s with i ∈ I_B are contained in σ_{B,φ}; the σ-fields of the linear combinations of the φ_j(X_j)'s with j ∈ J_B, but not the individual σ-fields of the φ_j(X_j)'s, are contained in σ_{B,φ}; and the σ-fields of the φ_k(X_k)'s with k ∈ K_B are irrelevant to σ_{B,φ}.
Suppose the cardinalities of I_B, J_B, and K_B are p_1, p_2, and p_3, respectively. Without loss of generality, we assume that the indices in I_B are smaller than those in J_B, and the indices in J_B are smaller than those in K_B. This is equivalent to the assumption that B has the standardized form

B = ( I_{p_1}   0
       0       B_J
       0        0  ),

where I_{p_1} is the p_1 × p_1 identity matrix, and for each column of B_J there exist at least two nonzero entries. Let X_{I_B} denote the subvector of X indexed by I_B, and φ_{I_B}(X_{I_B}) the subvector of φ(X) indexed by I_B. The same notational rules apply to the subvectors indexed by J_B and K_B. We next define a proper class of FSDR σ-fields, denoted by A. The class A is rich enough to contain most FSDR σ-fields. For example, if the transformations φ_1(X_1), ..., φ_p(X_p) are continuous and the conditional independence assertion (2) is satisfied, then σ_{B,φ} belongs to A. The purpose of requiring that the supports of the φ_j(X_j)'s are intervals is to exclude degenerate cases involving both continuous and discrete transformations. Let us consider a counterexample. Suppose φ_1(X_1) = e^{X_1}/(e^{X_1} + 1) and φ_2(X_2) = I(X_2 > 0), where I(·) is the indicator function. Here φ_2(X_2) only takes values in {0, 1}, so the support of φ_2(X_2) is not an interval. Let B_1 be the 2 × 2 identity matrix I_2, and B_2^T = (1, 1). It can be verified that both σ_{B_1,φ} and σ_{B_2,φ} are FSDR σ-fields and that they are identical. Note that the column rank of B_1 is two and the column rank of B_2 is one. Although the rank of B_1 is higher than that of B_2, the two FSDR σ-fields are the same. In general, when discrete transformations are involved, the dimension can be further reduced while the FSDR σ-field remains the same. In this article, we require that the support of φ_j(X_j) be an interval for j ∈ J_B to exclude degenerate cases such as the example discussed above. We now state the existence theorem of the minimal FSDR σ-field.
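The degeneracy in this counterexample can be checked numerically: because φ_1(X_1) lies strictly in (0, 1) and φ_2(X_2) is 0-1 valued, both coordinates can be recovered from their sum, so the one-dimensional predictor B_2^T φ(X) generates the same σ-field as the two-dimensional one. A small sketch (sample size illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
phi1 = np.exp(X[:, 0]) / (np.exp(X[:, 0]) + 1)   # strictly inside (0, 1)
phi2 = (X[:, 1] > 0).astype(float)               # takes values in {0, 1}
v = phi1 + phi2                                  # B_2^T phi(X), one-dimensional

# Since phi1 in (0, 1) and phi2 in {0, 1}, the integer and fractional parts
# of the sum recover the two coordinates, so the sigma-fields coincide.
phi2_rec = np.floor(v)
phi1_rec = v - phi2_rec
assert np.allclose(phi1_rec, phi1)
assert np.allclose(phi2_rec, phi2)
```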

Theorem 2. Under Assumptions 1 and 2, there exists a unique minimal FSDR σ-field in the class A, denoted by σ_{B*,φ*}, and it attains the smallest dimension among all FSDR σ-fields in A.
Furthermore, Theorem 2 characterizes the relationship between any two generating pairs of the minimal FSDR σ-field. Note that FSDR σ-fields in the class A have a partial order instead of a linear order, and Zorn's Lemma is required in the proof of Theorem 2. Theorem 2 implies that although the generating pairs (B*, φ*) are not unique, the column space span(B*) is unique. Furthermore, the univariate transformations φ*_i can be determined up to a one-to-one mapping for i ∈ I_{B*}, and the univariate transformations φ*_j can be determined up to a linear transformation for j ∈ J_{B*}. The σ-field σ_{B*,φ*} in Theorem 2 is special in that it is the smallest among all FSDR σ-fields σ_{B,φ}. In addition to σ_{B*,φ*}, the FSDR σ-fields with the same dimension as σ_{B*,φ*} are also of interest and will play a critical role in estimating σ_{B*,φ*}, as will be shown in Section 3.4. The generating pairs of these FSDR σ-fields possess properties similar to those of σ_{B*,φ*}, which we state in the following theorem.
Theorem 3 implies that for σ_{B*,φ*}, not only is the dimension the smallest, but the univariate transformations are also the coarsest; that is, the univariate transformations of σ_{B*,φ*} can be represented as functions of those of other FSDR σ-fields. We will utilize this property for estimating σ_{B*,φ*}, as discussed in Sections 3.1 and 3.4.

Estimation scheme for σ_{B*,φ*}
Suppose the flexible dimension reduction model (2) holds, and Assumptions 1 and 2 are satisfied. In Section 2, we proved that the minimal FSDR σ-field σ_{B*,φ*} exists and is unique; hereafter we always assume that d_0, the dimension of σ_{B*,φ*}, is known. In this section, we present the GKDR method for estimating σ_{B*,φ*}, as discussed in Section 1.
The GKDR method uses a two-stage approach. In the first stage, it estimates the generating pairs (B, φ) of the FSDR σ-fields with the same dimension as σ_{B*,φ*}. We present this procedure at the population level in Section 3.1, and then present its sample version and consistency result in Sections 3.2 and 3.3, respectively. In the second stage, the GKDR method incorporates conditional entropy to estimate one representative generating pair (B*, φ*) of the minimal FSDR σ-field, based on the generating pairs (B, φ) obtained in the first stage. This procedure is presented in Section 3.4.

Stage I of GKDR: Population level
For an integer d ≤ p, let S_d^p(R) be the Stiefel manifold

S_d^p(R) = { B ∈ R^{p×d} : B^T B = I_d }.
Let L̃_2(P_{X_i}) and L̃_2(P_X) be the families of normalized L_2 functions

L̃_2(P_{X_i}) = { f ∈ L_2(P_{X_i}) : E[f(X_i)] = 0, Var[f(X_i)] = 1 },
L̃_2(P_X) = { φ = (φ_1, ..., φ_p)^T : φ_i ∈ L̃_2(P_{X_i}) for i = 1, ..., p }.
We call such a pair a normalized generating pair of (B*, φ*). We define B_{d_0}^p, which is a subset of S_{d_0}^p(R), and Φ_0 and Φ_1, which are two subsets of L̃_2(P_X). From Theorem 2, we know that Θ_0 is the family of the normalized generating pairs of σ_{B*,φ*}, and from Theorem 3, we know that Θ_1 is the family of normalized generating pairs of the FSDR σ-fields which have the same dimension as σ_{B*,φ*}.
In the rest of this section, we focus on estimating Θ_1; the estimation of Θ_0 is deferred to Section 3.4. Let H_X be the RKHS on X generated by a positive definite kernel function k_X, and H_Y the RKHS on Y generated by another positive definite kernel function k_Y. The positive definite kernel functions k_X and k_Y are assumed to satisfy the following conditions.
Following Fukumizu et al. [10], we define the cross-covariance operator Σ_YX : H_X → H_Y as the unique bounded operator satisfying

⟨g, Σ_YX f⟩_{H_Y} = Cov[f(X), g(Y)]    (7)

for any f ∈ H_X and g ∈ H_Y. The variance operator Σ_XX : H_X → H_X is defined analogously through ⟨f', Σ_XX f⟩_{H_X} = Cov[f(X), f'(X)]. Note that Σ_XX^{-1} is an abuse of notation because in general Σ_XX^{-1} may not exist. The formal definition can be found in [10], and we omit it here for conciseness.
Let k_d(·, ·) be a positive definite kernel function on R^d such that for any B ∈ S_d^p(R) and φ ∈ L̃_2(P_X), the following condition holds.
Let H_{k_d} be the RKHS on R^d generated by the kernel function k_d. For a given normalized generating pair (B, φ), let k_X^{B,φ}(x, x') = k_d(B^T φ(x), B^T φ(x')), and let H_X^{B,φ} be the RKHS on X generated by the kernel function k_X^{B,φ}. It can be shown that for any f ∈ H_X^{B,φ}, one can always find a function g ∈ H_{k_d} such that f = g ∘ (B^T φ). The operators Σ_YX^{B,φ} and Σ_XX^{B,φ} can be defined similarly to Σ_YX and Σ_XX, respectively. We further define the conditional covariance operator

Σ_{YY|X}^{B,φ} = Σ_YY − Σ_YX^{B,φ} (Σ_XX^{B,φ})^{-1} Σ_XY^{B,φ}.

Recall that Θ_1 is the family of normalized generating pairs (B, φ) of the FSDR σ-fields with dimension d_0. In fact, Θ_1 can be characterized by the solution set of a minimization problem involving Σ_{YY|X}^{B,φ}, which will be presented in Theorem 4 below. Before that, we require an assumption that is equivalent to Assumption (A-2) proposed in [10]. This assumption guarantees that the RKHS is rich enough to approximate any L_2 function. Let H + R denote the direct sum of the RKHS H and the real number space R, and P_{B,φ} the probability distribution on X induced from the projection BB^T φ : X → X.
We need to introduce one more concept. Let (Ω, F) be a measurable space, and H an RKHS on Ω generated by a bounded kernel function k. H is said to be characteristic with respect to F if ∫ f dP = ∫ f dQ for any f ∈ H implies P = Q, where P and Q are probability measures on (Ω, F).

Theorem 4. Under Assumption 3, the two self-adjoint conditional covariance operators are ordered as

Σ_{YY|X}^{B,φ} ⪰ Σ_{YY|X}^{B*,φ*}.    (11)
Furthermore, if H_Y is assumed to be characteristic with respect to F_Y, then equality in (11) holds if and only if (B, φ) is a generating pair of an FSDR σ-field. Theorem 4 provides the theoretical underpinning for the GKDR method, which will be presented in the following section. Notably, the inequality (11) in Theorem 4 indicates that the conditional covariance operator Σ_{YY|X}^{B,φ} attains its minimum at the generating pairs of FSDR σ-fields. In general, Theorem 4 holds for any FSDR σ-field with dimension d no smaller than d_0. If d is greater than d_0, the minimizers correspond to FSDR σ-fields larger than the minimal one; setting d = d_0 leads to Θ_1, the generating pairs of the FSDR σ-fields with the same dimension as σ_{B*,φ*}. Since the conditional covariance operator is positive and self-adjoint, minimization of the conditional covariance operator can be done by minimizing its trace. Therefore, Θ_1 can be characterized as the solution set of the following minimization problem:

min_{B ∈ S_{d_0}^p(R), φ ∈ L̃_2(P_X)}  Tr[ Σ_{YY|X}^{B,φ} ].    (13)
Note that Θ_1 contains infinitely many generating pairs, so the minimization problem (13) also has infinitely many solutions. This property will be taken into account in the next section when we construct the estimator for Θ_1.

Stage I of GKDR: Sample level
In this subsection, we construct the estimator of Θ_1 using Theorem 4 and the minimization problem (13); we refer to it as the GKDR Stage I estimator.
Let P̂_XY = (1/n) Σ_{i=1}^n δ_{(X_i, Y_i)}(·) be the empirical distribution, where δ_a(·) is the Dirac measure with point mass at a. We define the empirical cross-covariance operator Σ̂_YX by evaluating the right hand side of (7) at the empirical distribution P̂_XY; the empirical variance operator Σ̂_XX is defined similarly. Let K_X denote the Gram matrix with entries k_X(X_i, X_j), and let G_X = (I_n − (1/n) 1 1^T) K_X (I_n − (1/n) 1 1^T) denote the centered Gram matrix, where 1 is the n-dimensional vector with all of its entries equal to 1. The Gram matrix K_Y and the centered Gram matrix G_Y with respect to the kernel function k_Y can be similarly defined. Following [10], we define the empirical conditional covariance operator

Σ̂_{YY|X}^{B,φ} = Σ̂_YY − Σ̂_YX^{B,φ} (Σ̂_XX^{B,φ} + ε_n I)^{-1} Σ̂_XY^{B,φ},

where ε_n I is the regularization term required for inverting the operator. Recall that Θ_1 is characterized by the solution set of the minimization problem (13). We perform optimization at the sample level by replacing the conditional covariance operator in (13) with its empirical version as follows.
It can be verified that the matrix representation of Σ̂_{YY|X}^{B,φ} involves the centered Gram matrices G_Y and G_X^{B,φ}, where G_X^{B,φ} is computed from the kernel k_X^{B,φ}. Therefore, replacing the operator by its matrix representation, Θ_1 can be estimated by minimizing

Tr[ G_Y (G_X^{B,φ} + n ε_n I_n)^{-1} ].    (21)

Notice that a leading factor of n is omitted, since a constant multiplier does not affect the minimization. We denote the objective function in (21) as GKDR(B, φ).
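A minimal sketch of evaluating this sample objective for a candidate predictor Z = B^T φ(X) is given below, assuming Gaussian kernels for both X and Y; the kernel choice, bandwidth, and regularization value are illustrative, not the paper's prescription.

```python
import numpy as np

def gaussian_gram(U, sigma=1.0):
    # Gram matrix K_ij = exp(-||u_i - u_j||^2 / (2 sigma^2))
    sq = np.sum(U ** 2, axis=1)[:, None] + np.sum(U ** 2, axis=1)[None, :] - 2 * U @ U.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma ** 2))

def center(K):
    # G = (I - 11^T/n) K (I - 11^T/n)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def gkdr_objective(Z, Y, eps_n=1e-3):
    # Tr[G_Y (G_X^{B,phi} + n eps_n I_n)^{-1}], with Z = B^T phi(X) at the sample
    n = Z.shape[0]
    GX = center(gaussian_gram(Z))
    GY = center(gaussian_gram(Y.reshape(-1, 1)))
    return np.trace(GY @ np.linalg.inv(GX + n * eps_n * np.eye(n)))

# A predictor that carries information about Y yields a smaller objective
# than an independent (permuted) predictor.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = np.sin(2 * x[:, 0]) + 0.1 * rng.normal(size=200)
assert gkdr_objective(x, y) < gkdr_objective(rng.permutation(x), y)
```

This mirrors the trace objective of kernel dimension reduction [10]; in the GKDR method the matrix G_X^{B,φ} additionally depends on the spline-parametrized transformations φ.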
In the objective function GKDR(B, φ), φ is a vector of univariate transformations φ_1, ..., φ_p, which need to be further parametrized. Here we adopt B-splines to approximate the univariate transformations. Without loss of generality, we assume that X_i ∈ [0, 1] for i = 1, ..., p. We consider natural cubic splines of order 4. For 1 ≤ i ≤ p, let S_{i,q} denote the q-dimensional spline space determined by the corresponding knot sequence, which we use to approximate the univariate transformation φ_i. Knot selection is an important issue whenever spline functions are used. In this article, we do not investigate it for our proposed method; instead we assume the knots are pre-specified using some conventional methods. In practice, we use knots placed at the sample quantiles, while for theoretical development we use equally spaced knots. Suppose B_i = (B_{i,1}, ..., B_{i,q}) is the vector of B-spline basis functions of S_{i,q} for i = 1, ..., p. Then for any φ_i ∈ S_{i,q}, there exists a unique column vector α ∈ R^q such that φ_i = B_i α. By restricting φ_i ∈ S_{i,q} and normalizing φ_i to have mean 0 and variance 1 for i = 1, ..., p, the minimization problem (21) is parametrized in terms of (B, α_1, ..., α_p); we refer to the parametrized problem as (22).
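The spline parametrization can be sketched as follows. For simplicity this uses ordinary clamped cubic B-splines (order 4) with interior knots at sample quantiles rather than imposing the natural-spline boundary constraints, and the basis size q is illustrative.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, q, degree=3):
    # n x q design matrix (B_{i,1}(x), ..., B_{i,q}(x)) of cubic B-splines
    # with clamped boundary knots and interior knots at sample quantiles.
    n_interior = q - degree - 1
    probs = np.linspace(0, 1, n_interior + 2)[1:-1]
    interior = np.quantile(x, probs)
    t = np.r_[[x.min()] * (degree + 1), interior, [x.max()] * (degree + 1)]
    return BSpline(t, np.eye(q), degree)(x)

def normalize(f):
    # center and scale a fitted transformation phi_i = B_i alpha_i
    return (f - f.mean()) / f.std()

rng = np.random.default_rng(0)
x = rng.uniform(size=500)
D = bspline_design(x, q=6)        # basis evaluated at the sample
alpha = rng.normal(size=6)
phi_hat = normalize(D @ alpha)    # one candidate transformation phi_i
```

Each candidate φ_i is thus a linear function of its coefficient vector α_i, so the GKDR objective becomes a function of (B, α_1, ..., α_p) only.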
Let (B̂_n, α̂_{n,1}, ..., α̂_{n,p}) denote the solution to (22), and let φ̂_{n,i} = B_i α̂_{n,i} and φ̂_n = (φ̂_{n,1}, ..., φ̂_{n,p})^T. For better visualization of the sufficient predictor B̂_n^T φ̂_n(X), we apply an orthogonal transformation P to B̂_n such that the new sufficient predictor (B̂_n P)^T φ̂_n(X) has a diagonal sample covariance matrix. This does not change the associated FSDR σ-field, since span(B̂_n) = span(B̂_n P). We redefine B̂_n as B̂_n P, and refer to (B̂_n, φ̂_n) as the GKDR Stage I estimator.
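The orthogonal rotation that diagonalizes the sample covariance of the predictors can be sketched via an eigendecomposition (the function name and matrix sizes are hypothetical):

```python
import numpy as np

def decorrelate(B_hat, Z):
    # Z holds the sufficient predictors B_hat^T phi_hat(X_i) as rows.
    # Rotate by the eigenvectors P of the sample covariance S = P diag(w) P^T,
    # so that Z P has a diagonal sample covariance while
    # span(B_hat P) = span(B_hat), leaving the FSDR sigma-field unchanged.
    S = np.cov(Z, rowvar=False)
    _, P = np.linalg.eigh(S)
    return B_hat @ P, Z @ P

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # correlated predictors
B_hat = rng.normal(size=(10, 3))
B_rot, Z_rot = decorrelate(B_hat, Z)
C = np.cov(Z_rot, rowvar=False)
assert np.allclose(C - np.diag(np.diag(C)), 0.0, atol=1e-10)
```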
Recall that Θ_1 contains infinitely many generating pairs, which can be characterized as the solutions to the minimization problem (13). It is not possible to estimate all the generating pairs in Θ_1 based on a finite sample; instead, only a finite number of them can be estimated. Further recall that Θ_0, which is a subset of Θ_1, is our ultimate inference target. Therefore, we need to estimate a sufficient number of generating pairs in Θ_1 at this stage, with the hope that the estimated generating pairs contain at least one pair that can serve as the estimator of a generating pair in Θ_0. There is another issue we need to address when solving (22). The objective function in (22) is generally nonconvex, so there may exist many local minima that are spurious in the sense that they do not lead to good estimators of any generating pairs.
In order to estimate a sufficient number of generating pairs in Θ 1 and avoid spurious local minima, we consider the idea of multiple initializations and propose the following estimation scheme. First we randomly choose N different initializations for the minimization problem (22) to get N solutions and their associated generating pairs. Then we select the N 1 generating pairs that attain the N 1 smallest values of the GKDR objective function. The N 1 selected generating pairs are retained as the GKDR Stage I estimators, whereas the others are considered spurious solutions and discarded. In our simulation experiments and real life data applications in Section 4, we set N and N 1 to be 20 and 10, respectively. LetΘ 1 denote the collection of the N 1 GKDR Stage I estimators. Based onΘ 1 , we will use conditional entropy to select a generating pair as the final estimated generating pair of the minimal σ-field σ B * ,φ * . This is the task that the second stage (i.e. Stage II) of the GKDR method will take on, which will be presented in Section 3.4.
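The multiple-initialization scheme is generic and can be sketched with any local optimizer; here scipy's `minimize` on a toy nonconvex objective stands in for the manifold gradient descent actually used (the objective, optimizer, and function names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def multi_start(objective, init_sampler, N=20, N1=10):
    # Run N random initializations and keep the N1 solutions with the
    # smallest objective values; the rest are treated as spurious local minima.
    fits = []
    for _ in range(N):
        res = minimize(objective, init_sampler())
        fits.append((res.fun, res.x))
    fits.sort(key=lambda t: t[0])
    return [x for _, x in fits[:N1]]

rng = np.random.default_rng(0)
f = lambda v: (v[0] ** 2 - 1.0) ** 2          # two global minima at +/- 1
sols = multi_start(f, lambda: rng.uniform(-2, 2, size=1))
assert len(sols) == 10
assert all(abs(abs(s[0]) - 1.0) < 1e-3 for s in sols)
```

As in the toy example, different initializations may converge to different (equally valid) minimizers, which is exactly why several Stage I estimates are retained for Stage II.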

Consistency of (B̂_n, φ̂_n)
In this subsection we prove that (B̂_n, φ̂_n) is consistent under some conditions. The space S_{d_0}^p(R) in which B lies is naturally equipped with the geodesic distance, which we denote by D. The formal definition of the geodesic distance can be found in [15], Chapter IV. The spaces L_2(P_{X_i}) and L_2(P_X), in which φ_i and φ lie respectively, are equipped with L_2-type distances defined as follows.
Before stating the main theorem, we need some technical assumptions.

Assumption 4. For any bounded continuous function g on ..., and x ∈ X.

Assumption 5. There exists a measurable function ξ : ...

Assumption 6. The GKDR objective function at the population level, Tr[Σ_{YY|X}^{B,φ}], is continuous with respect to B and φ equipped with the distances D_B and L_X^2, respectively.

Assumption 7. There exists a generating pair (B, φ) ∈ Θ_1 such that φ_i is twice continuously differentiable for i = 1, ..., p.

Assumptions 4 and 5 are generalizations of Assumptions (A-1) and (A-3) in [10], used there for proving the consistency of kernel dimension reduction. Assumption 6 is similar to Assumption A1 in [22]. Assumption 7 ensures that the univariate transformations φ_1, ..., φ_p can be approximated uniformly by spline functions with a certain precision.
Note that we cannot ensure that (B̂_n, φ̂_n) converges in probability to a fixed generating pair (B_0, φ_0) ∈ B_{d_0}^p × Φ_1 as n goes to infinity. However, as n goes to infinity, we can always find a sequence of generating pairs (B_n, φ_n) ∈ B_{d_0}^p × Φ_1 such that (B̂_n, φ̂_n) is arbitrarily close to (B_n, φ_n) in probability.

Stage II of GKDR and conditional entropy
In this subsection, we focus on estimating Θ_0 and obtaining the GKDR Stage II estimator.
Here we show that the conditional entropies of the φ*_i's in Φ_0 are no smaller than those of the φ_i's in Φ_1, so we can obtain Θ_0 by maximizing the conditional entropy over Θ_1. Following Cover and Thomas [6], for a univariate random variable X and a univariate transformation φ(X), we define the conditional entropy of X given φ(X), denoted H[X|φ(X)], as

H[X|φ(X)] = −E[ log p(X|φ(X)) ],
where p(X|φ(X)) denotes the conditional distribution of X given φ(X). It can be shown that the conditional entropy satisfies H[X|φ(X)] ≥ 0, with equality if and only if σ(X) = σ(φ(X)). For a random vector X and a generating pair (B, φ), the B-weighted conditional entropy of X given φ(X), denoted H_B[X|φ(X)], is defined as

H_B[X|φ(X)] = Σ_{i=1}^p w_i H[X_i|φ_i(X_i)],
where w_i = ||B_{i,·}||_2 is the L_2 norm of the i-th row of the matrix B. In fact, Θ_0 can be characterized as the solution set of a maximization problem of H_B[X|φ(X)] over Θ_1, which will be presented in Theorem 6. Before that, we require an assumption to ensure that the conditional entropy H_{B*}[X|φ*(X)] is invariant over Θ_0, that is, for any two generating pairs (B*, φ*) and (B̃, φ̃) in Θ_0, we have H_{B*}[X|φ*(X)] = H_{B̃}[X|φ̃(X)].

Assumption 8. There exists a generating pair (B*, φ*) ∈ Θ_0 such that the conditional distribution of X given φ*(X), denoted p_{X|φ*(X)}, is discrete.

Assumption 8 is satisfied if there exists a generating pair (B*, φ*) ∈ Θ_0 such that for any real number c and 1 ≤ i ≤ p, the level set of φ*_i, defined as L_c(φ*_i) = {x : φ*_i(x) = c}, is countable.

Theorem 6. Suppose X is a univariate random variable and ψ and φ are two univariate transformations. If σ(X) ⊇ σ(ψ(X)) ⊇ σ(φ(X)) and the conditional distribution p_{X|φ(X)} is discrete, then the conditional entropies satisfy the decomposition rule

H[X|φ(X)] = H[X|ψ(X)] + H[ψ(X)|φ(X)].
A direct conclusion of Theorem 6 is that, among all the generating pairs in Θ_1, the conditional entropy H_B[X|φ(X)] attains its maximum at the generating pairs (B*, φ*) in Θ_0. Therefore, we can obtain the generating pairs in Θ_0 by selecting those in Θ_1 for which H_B[X|φ(X)] is maximized. This is referred to as Stage II of GKDR at the population level.
To obtain the estimator of Θ_0, we choose the generating pair in Θ̂_1 that maximizes the empirical conditional entropy, which is evaluated by slicing the data. Let Ĥ_B̂[X|φ̂(X)] denote the empirical B-weighted conditional entropy of a generating pair (B̂, φ̂) ∈ Θ̂_1. Then the solution that maximizes Ĥ_B̂[X|φ̂(X)] over Θ̂_1 is defined as the GKDR Stage II estimator and is considered an approximation of one representative generating pair in Θ_0.
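One way to realize the sliced empirical conditional entropy for a single coordinate is sketched below: slices are formed from quantiles of φ_i(X_i), and the within-slice entropy of X_i is computed from histogram counts on a common set of bins. The slice and bin counts are illustrative, not the paper's exact choices.

```python
import numpy as np

def sliced_cond_entropy(x, phi_x, n_slices=10, n_bins=10):
    # Empirical H[X_i | phi_i(X_i)]: slice the range of phi_i(X_i) by its
    # quantiles, then average the histogram entropy of X_i within each slice.
    bin_edges = np.linspace(x.min(), x.max(), n_bins + 1)
    edges = np.quantile(phi_x, np.linspace(0, 1, n_slices + 1))
    sl = np.clip(np.searchsorted(edges, phi_x, side="right") - 1, 0, n_slices - 1)
    H = 0.0
    for s in range(n_slices):
        xs = x[sl == s]
        if xs.size == 0:
            continue
        counts, _ = np.histogram(xs, bins=bin_edges)
        p = counts[counts > 0] / counts.sum()
        H += (xs.size / x.size) * -(p * np.log(p)).sum()
    return H

# phi(x) = x generates sigma(X): conditional entropy 0.  phi(x) = x^2 loses
# the sign of x, so the conditional entropy is positive (log 2 here).
x = np.tile([-2.0, -1.0, 1.0, 2.0], 250)
H_id = sliced_cond_entropy(x, x, n_slices=4)
H_sq = sliced_cond_entropy(x, x ** 2, n_slices=2)
assert H_id < 1e-12
assert abs(H_sq - np.log(2)) < 1e-12
```

This matches the population-level intuition of Theorem 6: the coarser transformation x^2 leaves more residual entropy in X, which is exactly what Stage II maximizes.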

Low rank approximation
The GKDR method suffers from the same computational inefficiency as other kernel methods. Various methods have been proposed to address this issue in the literature, among which low rank matrix approximation has achieved much success. In general, the Gram matrices in kernel methods have fast decaying eigenvalues, so they can be approximated by matrices of much smaller rank m ≪ n. The Nyström method [25] is one of the widely used low rank matrix approximation approaches, and we incorporate it into the GKDR method to improve computational efficiency. The Nyström method works as follows. First, it chooses m columns as the pivots of the centered Gram matrix G_X^{B,φ}. Denote by G_{n,m} ∈ R^{n×m} the submatrix of G_X^{B,φ} whose columns are in the pivot set, and by G_{m,m} ∈ R^{m×m} the submatrix of G_X^{B,φ} whose rows and columns are both in the pivot set. Then G̃ = G_{n,m} G_{m,m}^{-1} G_{n,m}^T is the approximation of the original Gram matrix G_X^{B,φ}, reducing the computational complexity to O(m^2 n).
Furthermore, we use the incomplete Cholesky decomposition method [9] to choose the pivot set in this article. The method operates by iteratively choosing columns until the difference between G̃ and G^{B,φ}_X is small enough, i.e., ‖G^{B,φ}_X − G̃‖_1 ≤ tol, where tol is a pre-specified small constant. After replacing G^{B,φ}_X with G̃, the inverse term (G^{B,φ}_X + nε_n I_n)^{-1} in the objective function in (21) can be approximated by (G_{n,m} G_{m,m}^{-1} G_{n,m}^T + nε_n I_n)^{-1}, which can be computed efficiently via the Sherman-Morrison-Woodbury (SMW) formula, (A + UCV)^{-1} = A^{-1} − A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}. Hence the objective function in (21) reduces to the form in (29). Because the term involving inversion in (29) is an m×m matrix, the computational complexity is reduced from O(n³) to O(m³ + n²) for one evaluation of the objective function.
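The SMW-accelerated inverse can be sketched as follows. Taking A = nε_n I_n, U = G_{n,m}, C = G_{m,m}^{-1} and V = G_{n,m}^T in the SMW formula gives (G̃ + nε_n I_n)^{-1} = (1/(nε_n))[I_n − G_{n,m}(nε_n G_{m,m} + G_{n,m}^T G_{n,m})^{-1} G_{n,m}^T], so only an m×m system is solved. The function name and the regularization form nε_n I_n are assumptions of this sketch.

```python
import numpy as np

def smw_inverse(G_nm, G_mm, eps_n, n):
    """Compute (G_nm @ inv(G_mm) @ G_nm.T + n*eps_n*I_n)^{-1} via the
    Sherman-Morrison-Woodbury formula; only an m x m linear system is
    solved instead of inverting an n x n matrix."""
    c = n * eps_n
    m_mat = c * G_mm + G_nm.T @ G_nm        # m x m system matrix
    inner = np.linalg.solve(m_mat, G_nm.T)  # m x n solve
    return (np.eye(n) - G_nm @ inner) / c
```

A direct check against the brute-force n×n inverse confirms the identity on random inputs.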

Determination of dimension d 0
As mentioned previously, we have assumed that d_0, the dimension of the minimal FSDR σ-field, is known a priori, while in practice we need to determine an operable d_0. Note that if we misspecify d_0 as a larger integer d, Theorem 4 guarantees that the generating pair (B, φ) estimated by minimizing Σ^{B,φ}_{YY|X} is still associated with an FSDR σ-field, although this σ-field is larger than the minimal FSDR σ-field. Leveraging this property, we propose a heuristic method to determine the dimension d_0. We first use the χ² statistic method [17] to obtain an initial candidate dimension, denoted d_1.

Numerical implementation scheme
A scheme for deriving a representative generating pair of the minimal FSDR σ-field is shown below.
1. For i = 1, . . . , p, take the sample quantiles as knots and obtain the normalized natural cubic spline basis B_{i,q}.
2. Solve the following minimization problem via the gradient descent method on matrix manifolds [24], where G is the centered kernel matrix G^{B,φ}_X with φ_i(x_i) = B_{i,q}(x_i)α_i, G_{n,m} is the submatrix of G whose columns are in the pivot set, and G_{m,m} is the submatrix of G whose rows and columns are in the pivot set. The pivot set is determined by iteratively choosing columns until ‖G − G_{n,m}G_{m,m}^{-1}G_{n,m}^T‖_1 ≤ tol. ε_n is fixed at 0.001 and tol is fixed at a relatively small number, e.g., 10^{-8}.
3. Repeat the last step N times with random initializations and obtain N generating pairs (B̂^{(l)}, φ̂^{(l)}), l = 1, . . . , N. Choose the top N_1 pairs with the smallest GKDR objective function values, and let Θ̂_1 denote their collection. We set N = 20 and N_1 = 10 in practice.
4. Let (B̂_0, φ̂) be the pair that maximizes Ĥ(X|φ̂(X)) in Θ̂_1.
5. Perform the eigenvalue decomposition B̂_0^T φ̂(X)φ̂(X)^T B̂_0 = PDP^T, where P is an orthogonal matrix and D is a diagonal matrix. Let B̂ = B̂_0 P; the output (B̂, φ̂) is called the GKDR Stage II estimator.
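The final eigen-rotation step (step 5 above) can be sketched as follows: the rotation makes the components of the projected transformed predictors empirically uncorrelated. The function name is illustrative, and phi_X here denotes the p×n matrix of transformed observations.

```python
import numpy as np

def rotate_estimate(B0, phi_X):
    """Rotate a Stage II estimate B0 by the eigenvectors of
    B0^T phi(X) phi(X)^T B0, so that the projected transformed
    predictors become uncorrelated (the matrix becomes diagonal)."""
    M = B0.T @ phi_X @ phi_X.T @ B0   # d x d symmetric matrix
    w, P = np.linalg.eigh(M)          # eigen-decomposition M = P D P^T
    return B0 @ P                     # rotated estimate B = B0 P
```

After the rotation, B^T φ(X)φ(X)^T B equals the diagonal matrix D, which fixes the rotational indeterminacy of the estimate.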

Simulation results
In this subsection we evaluate the performance of the proposed GKDR method via simulation studies. We use the Gaussian radial basis function kernel k(x, y) = exp(−‖x − y‖²/(2σ²)). We consider five different models in the simulation studies. Model 1 is the example discussed in Section 1, Model 2 has lower dimensionality under the FSDR framework than under the original SDR framework, Model 3 has multiple FSDR σ-fields achieving the smallest dimensionality, and Model 4 considers a non-regression setting. In Models 1-4 the dimension of X is 10, while in Model 5 X is 30-dimensional. We perform simulations to evaluate the performance of dimension reduction. First, we use the Pearson correlation coefficient between the estimated transformations and the true transformations to measure the componentwise performance of dimension reduction. We compare the Pearson correlation coefficients of our proposed GKDR method with those of FDR [23]. The means and standard deviations of the Pearson correlation coefficients are shown in Tables 1-2. Further, we use the RV coefficient [20] between the estimated sufficient predictors and the true sufficient predictors under flexible SDR to measure the overall performance of dimension reduction. The RV coefficient is a multivariate generalization of the squared Pearson correlation coefficient and measures the similarity between two random vectors. We compare the RV coefficients of the GKDR method with those of FDR and three other popular dimension reduction methods, namely KDR [10], MAVE [27] and KSIR [26]. The means and standard deviations of the RV coefficients are shown in Table 3. For each model, we repeat the simulation study 100 times with 500 data points in each replicate.
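The RV coefficient used above admits a compact closed form, RV(X, Y) = tr(Σ_{XY}Σ_{YX}) / sqrt(tr(Σ_{XX}²) tr(Σ_{YY}²)); a minimal sketch using empirical cross-product matrices is given below (the function name is illustrative).

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two data matrices (n x p and n x q), a
    multivariate generalization of the squared Pearson correlation:
    tr(Sxy Syx) / sqrt(tr(Sxx^2) tr(Syy^2)) on column-centered data."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sxy = X.T @ Y
    Sxx = X.T @ X
    Syy = Y.T @ Y
    return np.trace(Sxy @ Sxy.T) / np.sqrt(
        np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))
```

The coefficient lies in [0, 1], equals 1 when Y is an orthogonal transformation of X, and is invariant to shifts and rotations, which makes it suitable for comparing sufficient predictors identified only up to such transformations.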
• Model 1, where p = 10, X ∼ (1/2)N(−1_10, (1/4)I_10) + (1/2)N(1_10, (1/4)I_10) is the ten-dimensional Gaussian-mixture explanatory vector, and ε follows N(0, 0.25) and is independent of X. Under the FSDR framework, we can construct a representative generating pair of the minimal FSDR σ-field as follows: φ*_1(X_1) = X_1², φ*_2(X_2) = sin(X_2), φ*_3(X_3) = X_3², φ*_i(X_i) = 0 for i = 4, . . . , 10, and B* = (e_1, e_2 + e_3). The sufficient predictor is (X_1², sin(X_2) + X_3²)^T, and the dimension of the minimal FSDR σ-field is 2. We also set the dimensionality to 2 for the other methods. Table 1 demonstrates that GKDR accurately recovers the univariate transformations and achieves much better performance in the estimation of the first and third components. This echoes the statement in Section 1 that the dimension of B is insufficient to characterize flexible SDR. In this example, we can construct another generating pair as follows: φ̃_1(X_1) = X_1, φ̃_2(X_2) = sin(X_2), φ̃_3(X_3) = X_3², φ̃_i(X_i) = 0 for i = 4, . . . , 10, and B̃ = (e_1, e_2 + e_3). The only difference between (B*, φ*) and (B̃, φ̃) is the transformation of the first component. The dimension of B̃ is still equal to 2, but σ_{B̃,φ̃} is larger than σ_{B*,φ*}, and FDR cannot discriminate between B*^T φ*(X) and B̃^T φ̃(X). Indeed, from Figure 1 we see that the φ̂_1 estimated by GKDR demonstrates higher correlation with X_1², whereas the φ̂_1 estimated by FDR demonstrates higher correlation with X_1. Furthermore, we compare the RV coefficients of the five methods. Table 3 shows that both GKDR and FDR work well, whereas KDR, MAVE and KSIR fail to identify the true sufficient predictor. This is expected, since KDR, MAVE and KSIR are methods under the linear SDR model and do not estimate the univariate transformations.
• Model 2, where p = 10, X ∼ N(0, I_10) is the ten-dimensional independent Gaussian explanatory vector, and ε follows N(0, 0.25) and is independent of X. Under the FSDR framework, we can construct a representative generating pair of the minimal FSDR σ-field as follows: φ*_1(X_1) = sin(3X_1), φ*_2(X_2) = 0.5(X_2 − 0.5)², φ*_i(X_i) = 0 for i = 3, . . . , 10, and B* = e_1 + e_2. In this example, the sufficient predictor is sin(3X_1) + 0.5(X_2 − 0.5)², and the dimension of the minimal FSDR σ-field is 1. We set the dimensionality to 1 for the other methods. From Tables 1 and 3 we see that GKDR and FDR accurately recover the univariate transformations and outperform the other three methods. Note that in this example there do not exist other FSDR σ-fields with the same dimension as the minimal FSDR σ-field, so FDR achieves a slightly larger RV coefficient than GKDR.
• Model 3, where p = 10, X ∼ N(0, 0.6·I_10 + 0.4·1_10 1_10^T) is the ten-dimensional Gaussian explanatory vector, and ε follows N(0, 0.25) and is independent of X. A representative generating pair of the minimal FSDR σ-field can be constructed as follows: φ*_1(X_1) = sin(3X_1), φ*_2(X_2) = (X_2 + 0.5)², φ*_i(X_i) = 0 for i = 3, . . . , 10, and B* = (e_1, e_2). In this example, the sufficient predictor is (sin(3X_1), (X_2 + 0.5)²)^T, and the dimension of the minimal FSDR σ-field is 2. We set the dimensionality to 2 for the other methods. Again from Tables 1 and 3, we see that GKDR and FDR accurately recover the univariate transformations and outperform the other three methods. In particular, GKDR achieves higher correlation and a smaller standard deviation than FDR when estimating the first univariate transformation in this example. The reason is the same as in Model 1: the dimension of B is insufficient to characterize flexible SDR. In this specific example, another generating pair of an FSDR σ-field with the same dimension as σ_{B*,φ*} can be constructed as follows: φ̃_i(X_i) = X_i for i = 1, 2, φ̃_i(X_i) = 0 for i = 3, . . . , 10, and B̃ = (e_1, e_2). The dimensions of B* and B̃ are both equal to 2, but σ_{B̃,φ̃} is larger than σ_{B*,φ*}, and FDR cannot discriminate between B*^T φ*(X) and B̃^T φ̃(X).
• Model 4, where p = 10, X ∼ N(0, I_10) is the ten-dimensional Gaussian explanatory vector, and ε follows N(0, 0.25) and is independent of X. A representative generating pair of the minimal FSDR σ-field can be constructed as follows: φ*_1(X_1) = (X_1 − 0.5)², φ*_i(X_i) = 0 for i = 2, . . . , 10, and B* = e_1. In this example, the sufficient predictor is (X_1 − 0.5)², and the dimension of the minimal FSDR σ-field is 1. We set the dimensionality to 1 for the other methods. From Tables 1 and 3, we see that GKDR accurately recovers the univariate transformation and outperforms the other methods. In particular, Table 1 demonstrates that GKDR achieves better performance in the estimation of the univariate transformation than FDR. In this specific example, another generating pair of an FSDR σ-field with the same dimension as σ_{B*,φ*} can be constructed as follows: φ̃_1(X_1) = X_1, φ̃_i(X_i) = 0 for i = 2, . . . , 10, and B̃ = e_1. The dimensions of B* and B̃ are both equal to 1, but σ_{B̃,φ̃} is larger than σ_{B*,φ*}, and FDR cannot discriminate between B*^T φ*(X) and B̃^T φ̃(X).
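As an illustration of the simulation designs above, a minimal sketch generating the Model 1 explanatory vector and its stated sufficient predictor is given below. The response link and error are not reproduced, since the model equation is not shown in this excerpt; the function name is illustrative.

```python
import numpy as np

def model1_predictors(n, rng=None):
    """Generate Model 1 data: X ~ 0.5 N(-1_10, 0.25 I_10) +
    0.5 N(1_10, 0.25 I_10), a ten-dimensional Gaussian mixture, together
    with the stated sufficient predictor (X_1^2, sin(X_2) + X_3^2)."""
    rng = np.random.default_rng(rng)
    signs = rng.choice([-1.0, 1.0], size=(n, 1))      # mixture component
    X = signs + 0.5 * rng.standard_normal((n, 10))    # sd 0.5 = sqrt(1/4)
    Z = np.column_stack([X[:, 0] ** 2,
                         np.sin(X[:, 1]) + X[:, 2] ** 2])
    return X, Z
```

Such generated pairs (X, Z) can be fed to any of the compared methods, with Z serving as the ground truth when computing Pearson and RV coefficients.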
In the simulation study above, we assumed that the true dimension d_0 of each model's minimal FSDR σ-field is known. Recall that in Section 4.2 we proposed a heuristic method for determining d_0 in practice. We further conducted a simulation study of the effectiveness of this heuristic method under Models 1-5. For each model, we generated 100 samples and applied the heuristic method to each sample with d_1 set to four. The proportions of times that the heuristic method selected the candidate dimensions are presented in Table 4. From the table, the heuristic method selected the true dimensions of Models 2 and 3 with proportions 98% and 70%, respectively. For Models 4 and 5, the heuristic method tends to over-estimate the true dimension. The total proportions of correctly estimating d_0 or over-estimating it by only one are 77% and 65% for Models 4 and 5, respectively. We argue that slight over-estimation in these two models is still acceptable, because the dimensions have been substantially reduced and the sufficient directions have been retained. The heuristic method under-estimated the true dimension of Model 1 with proportion 27%. This may not be acceptable in practice, because under-estimation leads to loss of information; this model clearly presented challenges for the heuristic method. Determining the true dimension of the minimal FSDR σ-field is an important and challenging problem, and more research on it is needed.

PM2.5 Data
We apply the GKDR method to the PM2.5 dataset originally analyzed in Liang et al. [19]. The dataset was collected to study the relationship between the PM2.5 concentration level and the meteorological conditions in five cities in China from January 1, 2010 to December 31, 2015, and it contains 52,584 hourly measured instances during that period. In our study, we choose the PM2.5 data for Beijing, where the air pollution is most severe, to perform the statistical analysis. Among the recorded attributes, IPREC is redundant in that it can be derived from PREC, CBWD is categorical with only five levels, and PREC is highly imbalanced with over 95% zero values. Therefore, we remove these three attributes in the subsequent analysis. The PM2.5 reading is taken as the response, and the remaining five meteorological attributes as the explanatory variables. We apply our method to this dataset in a season-by-season fashion and only report the results for the winter seasons, when the average PM2.5 concentration was the highest in each year. We smooth the original time series using eight-hour moving windows, randomly select two-thirds of the samples for training and the rest for testing, and standardize both the response and the explanatory variables in the training set so that they have mean 0 and standard deviation 1. The dimension of the minimal FSDR σ-field is determined to be two by the heuristic method in Section 4.2. Four knots are used to estimate the transformations. After applying the GKDR method, we obtain two sufficient predictors, denoted GKDR_1 and GKDR_2 respectively and given as follows. In the above, the φ̂_i's are the estimated univariate transformations of the explanatory variables for i = 1, . . . , 5. The estimated transformations are depicted in Figure 2. All five transformations are highly nonlinear and have absolute coefficients larger than 0.1 in at least one of the sufficient predictors.
Figure 3(a) shows the scatter plot of GKDR_2 versus GKDR_1. The boundary constraint observed in this plot can be largely explained by the imbalanced distribution of the explanatory variables. Figures 3(b) and (c) show the marginal scatter plots of Y versus GKDR_1 and GKDR_2, respectively. From Figure 3(b), a decreasing trend in the mean response can be observed as GKDR_1 increases. From Figures 3(b) and (c), we also see a larger variation in Y when GKDR_1 takes small values or GKDR_2 takes medium values. To further investigate how the sufficient predictors affect the response, we explore the 3-d plot of Y versus GKDR_1 and GKDR_2; see Figure 4(a). We observe not only a complicated nonlinear relationship between the mean response and the sufficient predictors, but also a dependence of the variance of the response on the sufficient predictors, i.e., heteroscedasticity. We first use the loess method to fit a nonparametric regression model between the response and the two sufficient predictors. To model the heteroscedasticity, we use the residual-based estimator proposed by Fan and Yao [8] to estimate the residual standard deviation. In particular, we use the loess method to fit a nonparametric regression model between the squared residuals and the two sufficient predictors, which gives a local smoothing estimator of the residual variance; the estimator of the residual standard deviation is obtained by taking its square root. Figures 4(a) and (b) show the 3-d plots of the estimated regression surface and the estimated residual standard deviation. These plots reveal highly nonlinear relationships of both the mean response and the residual standard deviation with the sufficient predictors.
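The residual-based variance estimation described above can be sketched as follows. For a self-contained illustration we use a simple Nadaraya-Watson kernel smoother in place of loess, and a single predictor in place of the paper's two sufficient predictors; both substitutions, and the function names, are assumptions of this sketch.

```python
import numpy as np

def ksmooth(x, y, h):
    """Nadaraya-Watson smoother with a Gaussian kernel, standing in for
    loess in this sketch."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

def residual_sd(x, y, h=0.3):
    """Residual-based estimate of the conditional standard deviation in
    the spirit of Fan and Yao: smooth y on x to get the mean, smooth the
    squared residuals on x to get the variance, then take the square
    root."""
    m_hat = ksmooth(x, y, h)                     # mean regression
    v_hat = ksmooth(x, (y - m_hat) ** 2, h)      # variance regression
    return np.sqrt(np.clip(v_hat, 0.0, None))
```

Dividing the residuals by this estimated standard deviation yields the scaled residuals used to check that the heteroscedasticity has been removed.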
Figure 4(c) shows the plot of the scaled residuals, defined as the original residuals divided by the estimated residual standard deviation, and indicates that the heteroscedasticity has been removed. Figure 4(d) shows the scatter plot of the observed response versus the fitted response; their correlation is 0.758. We use this fitted model to predict the PM2.5 concentration level in the testing set, and the correlation between the observed response and the predicted response is 0.766. These results reveal that the sufficient predictors estimated by the GKDR method are informative and shed new light on the impacts of the meteorological attributes on the PM2.5 concentration levels.
PM2.5 refers to atmospheric particulate matter (PM) with an aerodynamic diameter of less than 2.5 micrometers, and is reported to cause respiratory and cardiovascular diseases [13]. Humidity has been reported to be highly relevant to PM2.5 pollution during winter in Beijing: a high level of humidity leads to an abundance of water-soluble components such as inorganic ions, and further intensifies PM2.5 pollution episodes [3]. Our analysis also supports the impact of humidity on PM2.5 concentration. From Figure 2(b) we can see that when humidity is high, the value of the first sufficient predictor GKDR_1 tends to be lower, which results in a higher expected value of PM2.5 concentration, as shown in Figure 3(b). We also observe some interesting patterns for the other meteorological attributes. When temperature is relatively high or low, the value of the second sufficient predictor GKDR_2 tends to be higher or lower, respectively, which results in a lower expected value of PM2.5 concentration, as shown in Figure 3(c). Low PM2.5 concentration in extremely cold weather can be attributed to the strong cold air from the north or northwest with high wind speed [14]. When temperature takes a medium value and the demand for heating is high, the production intensity of coal-fired power plants is high, which leads to high emissions of atmospheric chemicals such as SO2 and NO2 and hence high PM2.5 concentration [28]. In contrast, when temperature is relatively high in winter and the demand for heating is reduced, the production intensity of coal-fired power plants decreases, which leads to low emissions of atmospheric chemicals and hence low PM2.5 concentration. This pattern of how temperature affects the PM2.5 concentration has not been reported in previous studies.

Discussion
In this article, we propose to use FSDR σ-fields to characterize flexible SDR and show that the minimal FSDR σ-field σ_{B*,φ*} exists under some mild conditions. To estimate σ_{B*,φ*}, we develop the GKDR method and demonstrate its effectiveness through extensive simulation experiments and real life applications. We believe that the proposed GKDR method can become a useful tool for sufficient dimension reduction in high dimensional regression analysis. There are three directions in which the current work can be further improved. First, we have established the consistency result for the GKDR Stage I estimator; however, the large sample properties of the GKDR Stage II estimator remain an open problem. Second, we have proposed a heuristic method to determine the dimension of the minimal FSDR σ-field; more systematic approaches with theoretical justifications would greatly enhance the GKDR method and thus need to be developed. Third, when the dimension of X increases, the computational cost of the current GKDR method can also quickly increase, so more efficient algorithms need to be developed. One approach to mitigating the computational complexity is to incorporate variable selection into the GKDR method. We are currently working on these three directions and hope to report results in the future.

⋂_{k=1}^{+∞} σ_{B^{(k)},φ^{(k)}} = σ({φ*_i(X_i)}_{i∈I}, B_J^{*T} φ*_J(X_J)) = σ_{B*,φ*}. Since the intersection of any two SDR σ-fields is still an SDR σ-field, as stated in the proof of Theorem 1 in [16], we have Y ⊥⊥ X | σ_{B*,φ*}. Therefore, any decreasing σ-field chain in the set A has its limit in A. Finally, applying Zorn's Lemma, we conclude that there exists a unique minimal σ-field in A.
Furthermore, suppose (B̃, φ̃) is another generating pair of σ_{B*,φ*}. Then we can construct two decreasing σ-field chains, which begin at σ_{B̃,φ̃} and σ_{B*,φ*} and converge to σ_{B*,φ*} and σ_{B̃,φ̃}, respectively. Using the properties of decreasing σ-field chains shown previously, we have span(B̃) = span(B*), σ(φ̃_i(X_i)) = σ(φ*_i(X_i)) for all i ∈ I_{B*}, and there exist constants u_j and v_j such that φ̃_j = u_j φ*_j + v_j for all j ∈ J_{B*}. This completes the proof of Theorem 2.