Learning sparse conditional distribution: An efficient kernel-based approach

This paper proposes a novel method to recover the sparse structure of the conditional distribution, which plays a crucial role in subsequent statistical analysis such as prediction, forecasting, conditional distribution estimation, and others. Unlike most existing methods, which often require an explicit model assumption or suffer from a heavy computational burden, the proposed method shows great advantage by making use of some desirable properties of the reproducing kernel Hilbert space (RKHS). It can be efficiently implemented by optimizing its dual form and is particularly attractive in dealing with large-scale datasets. The asymptotic consistencies of the proposed method are established under mild conditions. Its effectiveness is also supported by a variety of simulated examples and a real-life supermarket dataset from northern China.


Introduction
In modern statistical analysis, the study of the conditional distribution and related issues has attracted increasing attention [12,13,17,6,42], with many applications in economics and finance [5,46], in prediction and forecasting for time series analysis [10], and in inference on conditional distribution estimation [42]. Yet, as pointed out by [13], a conventional nonparametric estimator of the conditional distribution suffers from poor accuracy even when the number of collected covariates is relatively small. Hence, in high-dimensional data analysis, sparse modelling is needed, since it is generally believed that only a few covariates truly affect the conditional distribution. It is thus crucial, as a first step for subsequent statistical analysis, to identify the truly informative covariates acting on the conditional distribution.
This paper proposes a novel and efficient method to exactly identify all the covariates acting on the conditional distribution. The proposed method is motivated by two key observations: joint quantile regressions are an efficient tool to exploit the conditional distribution, and the gradient functions provide an appropriate definition of the informative covariates without explicit model specification. More importantly, unlike most existing learning-gradient methods, which estimate the gradients under a regularization framework [28,48,14] and thus suffer from a heavy computational burden, we notice that the derivative reproducing property in RKHS [52] provides an efficient alternative for fast computation of the gradient functions. Specifically, the proposed method first fits joint kernel-based quantile regressions, followed by fast computation of the corresponding gradients using the derivative reproducing property in RKHS and a hard-thresholding procedure. The proposed method can be efficiently implemented via dual optimization and is particularly attractive for large-scale cases. The asymptotic consistency of the proposed method is established under mild conditions without requiring any explicit model specification, and the numerical experiments illustrate the superior performance of the proposed method against some state-of-the-art competitors.
The major contributions of the proposed method are four-fold: (i) it avoids directly estimating the gradient functions and instead computes the estimated gradient functions efficiently by using the derivative reproducing property in RKHS, which significantly reduces the computational cost; (ii) it only requires standard quadratic optimization with linear constraints, which can be efficiently solved by existing packages in standard statistical software, and it can be adapted to big data with slight modification by using distributed platforms such as Apache Spark with Map-reduce steps; (iii) it is as efficient as screening procedures but does not require the strong marginal correlation condition; specifically, it can be regarded as a nonparametric joint screening method in the sense that each gradient function is computed given all the other covariates, and the computation can be done in a parallel fashion; (iv) with the help of functional operators in learning theory, the asymptotic selection consistency is established under mild conditions, which ensures that all the covariates acting on the conditional distribution can be exactly identified with probability tending to one.

Related works
In the literature, sparse learning methods have been proposed to recover the dependence relationship between the covariates and the conditional distribution under certain assumptions. One popular assumption is that the covariates act on the conditional distribution only through the conditional mean function. Various strategies have been developed under linear modelling, including sparsity-induced regularization [40,7,53,31,32], sure independence screening [8,45], and the knockoff filter [2]. Extended methods have also been developed under the nonparametric additive model [23,16,9] or in a reproducing kernel Hilbert space (RKHS) [28,48]. Yet these methods require explicit model assumptions that are difficult to check in practice, or incur an expensive computational cost that undermines their scalability. It is also worth pointing out that the aforementioned methods only focus on modelling the mean of the conditional distribution.
Beyond conditional mean regression, many methods have also been developed to detect a more general dependence between the covariates and the conditional distribution [47,25,15,14] through individual or composite quantile regression [20]. Specifically, [15] proposes a nonparametric screening method to retain all the covariates acting on the conditional quantile function at some given quantile level. It can be extended to retain all the covariates acting on the conditional distribution by estimating the marginal quantile utility at multiple quantile levels. Yet the screening-based method aims to retain a large number of covariates to guarantee the sure screening property, often many more than the number of truly informative covariates in sparse learning. [14] proposes a learning-gradient-based method to identify all the covariates acting on the conditional distribution, and establishes the asymptotic selection consistency. It learns the gradients of the conditional quantile functions at multiple quantile levels in an RKHS, and employs a functional group lasso penalty in the formulated regularization framework. It is also interesting to notice that [28] proposes a novel learning-gradients method, which adds an empirical functional penalty on the gradients to a standard kernel ridge regression in an RKHS, and it can be further extended to recover the sparse structure of the conditional distribution.

Paper organization
The rest of this paper is organized as follows. Section 2 provides the background and some key motivations, and Section 3 introduces the proposed method. The computational algorithm and the details of the tuning procedure are provided in Section 4. The asymptotic consistencies of the proposed method are established under mild conditions in Section 5. Section 6 reports the numerical experiments on simulated and real examples. A brief summary is provided in Section 7, and all the technical proofs are given in the Appendix.

Motivation
Given a random pair Z = (x, y) drawn from some unknown distribution ρ_{x,y}, with covariates x = (x_1, ..., x_p)^T ∈ X, where X ⊂ R^p is assumed to be a compact set, and response y ∈ R, it is of great interest to estimate the conditional distribution function F_y(x) = F(y | x), which has been widely studied in the literature [12,13,17,6,42]. Yet many existing methods are developed for the low-dimensional case and yield suboptimal performance when the number of covariates becomes relatively large [12]. In the high-dimensional case, it is generally believed that only a small number of covariates have effects on F_y(x), while the others are noise. Thus it is crucial to first identify all the truly informative covariates acting on F_y(x) for subsequent statistical analysis. To this end, we regard a covariate x_l as noise if

F_y(x) = F_y(x_{-l}) almost surely,   (2.1)

where x_{-l} denotes all the covariates except x_l. This criterion implies that a noise covariate x_l does not have any functional relationship with the conditional distribution function F_y(x). As pointed out by [21], F_y(x) and the conditional quantile function Q*_τ(x) are equivalent characterizations of the conditional distribution of y given x. Precisely, F_y(x) can be uniquely quantified by Q*_τ(x) via Q*_τ(x) = inf{y ∈ R : F(y | x) ≥ τ}, and vice versa. In the rest of this paper, we require Q*_τ ∈ H_K for any τ ∈ (0, 1), where H_K is a reproducing kernel Hilbert space (RKHS) induced by some pre-specified twice-differentiable kernel K(·, ·). Consequently, the criterion in (2.1) is equivalent to Q*_τ(x) = Q*_τ(x_{-l}) for any τ ∈ (0, 1), and thus we only need to check whether the covariate x_l has any functional relationship with Q*_τ(x) for any τ ∈ (0, 1). Particularly, it suffices to check whether the corresponding gradient functions satisfy

g*_{l,τ}(x) := ∂Q*_τ(x)/∂x_l = 0

almost surely for any x ∈ X and τ ∈ (0, 1). Then the importance of each covariate can be measured via its L_2-norm,

‖g*_{l,τ}‖²_2 = ∫_X (g*_{l,τ}(x))² dρ_x(x),

where ρ_x denotes the marginal distribution of x.
Then the true active set A*, which contains all the covariates acting on the conditional distribution, can be defined as

A* = { l : ‖g*_{l,τ}‖_2 > 0 for some τ ∈ (0, 1) }.

More importantly, we notice that the derivative reproducing property [52] in RKHS,

∂_l Q(x) = ⟨Q, ∂_l K_x⟩_K for any Q ∈ H_K,   (2.2)

where K_x(·) := K(x, ·) and ∂_l denotes the partial derivative with respect to x_l, ensures that once a consistent estimator of Q*_τ is obtained, its gradient functions g*_{l,τ} can be efficiently obtained by simple matrix multiplication.
These key facts motivate us to recover the sparse structure of the conditional distribution by a two-step learning algorithm: we first obtain some consistent estimator of Q*_τ, and then apply the derivative reproducing property in (2.2) to compute the empirical norm of each estimated gradient function and check whether it is substantially different from 0.
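To make the two-step idea concrete, the sketch below (Python with NumPy; the data and representer coefficients are arbitrary illustrative placeholders, not the paper's estimator) verifies that the analytic gradient of a kernel expansion Q(x) = Σ_i α_i K(x_i, x) under a Gaussian kernel matches a finite-difference approximation, which is exactly what the derivative reproducing property delivers in closed form:

```python
import numpy as np

def gauss_k(u, x, sigma):
    # Gaussian kernel K(u, x) = exp(-||u - x||^2 / (2 sigma^2))
    return np.exp(-np.sum((u - x) ** 2) / (2 * sigma ** 2))

def gauss_k_grad(u, x, sigma):
    # gradient of K(u, .) with respect to x: K(u, x) * (u - x) / sigma^2
    return gauss_k(u, x, sigma) * (u - x) / sigma ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # illustrative "training" points
alpha = rng.normal(size=5)       # illustrative representer coefficients
sigma, x0 = 1.0, rng.normal(size=3)

# analytic gradient of Q(x) = sum_i alpha_i K(x_i, x) at x0
grad = sum(a * gauss_k_grad(xi, x0, sigma) for a, xi in zip(alpha, X))

# central finite-difference check, coordinate by coordinate
eps, fd = 1e-6, np.empty(3)
for l in range(3):
    e = np.zeros(3); e[l] = eps
    qp = sum(a * gauss_k(xi, x0 + e, sigma) for a, xi in zip(alpha, X))
    qm = sum(a * gauss_k(xi, x0 - e, sigma) for a, xi in zip(alpha, X))
    fd[l] = (qp - qm) / (2 * eps)

assert np.allclose(grad, fd, atol=1e-5)
```

No numerical differentiation is needed in practice: differentiating the kernel once, symbolically, gives the gradient of any fitted expansion for free.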

Reproducing kernel
We briefly introduce some basic facts about RKHS; interested readers may refer to [30] for more details. Precisely, let K(·, ·) : X × X → R be a bounded, symmetric and positive semidefinite function, with sup_{x,x'∈X} K(x, x') < ∞. Then the RKHS H_K associated with the kernel K(·, ·) is the completion of the linear span of the functions {K_x(·) := K(x, ·) : x ∈ X} with inner product given by ⟨K_x, K_u⟩_K = K(x, u) for any x, u ∈ X. Note that the RKHS H_K is uniquely determined by the kernel K(·, ·). By Mercer's theorem [26], under some regularity conditions, the kernel function admits an eigen-expansion K(x, u) = Σ_{k=1}^∞ μ_k φ_k(x) φ_k(u), where μ_1 ≥ μ_2 ≥ ... ≥ 0 is a non-negative sequence of eigenvalues and {φ_k}_{k=1}^∞ are the associated eigenfunctions, taken to be orthonormal in L_2(X, ρ_x) = {Q : ∫_X Q(x)² dρ_x(x) < ∞}. Note that the above results require H_K ⊂ L_2(X, ρ_x), which is directly satisfied if sup_{x∈X} K(x, x) is assumed to be finite. It is interesting to notice that this finiteness requirement is equivalent to the boundedness of the kernel.
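As a quick numerical illustration of Mercer's theorem, the kernel matrix built from a Gaussian kernel on any finite sample is symmetric positive semidefinite, so its eigenvalues are non-negative up to floating-point error (a minimal sketch with simulated points):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))   # simulated sample points
sigma = 1.0

# Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / (2 * sigma ** 2))

evals = np.linalg.eigvalsh(K)
assert np.allclose(K, K.T)     # symmetric
assert evals.min() > -1e-8     # positive semidefinite, as Mercer requires
```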
More importantly, the reproducing property of RKHS, which states that ⟨Q, K_x⟩_K = Q(x) for any Q ∈ H_K, is critical in both theoretical analysis and computation. It is worth pointing out that the RKHS induced by a universal kernel, such as the Gaussian kernel, is a fairly large functional space in the sense that any continuous function can be approximated arbitrarily well, under the infinity norm, by some function in the induced RKHS [36].

Sparsity learning
Given a sample Z^n = {(x_i, y_i)}_{i=1}^n, whose elements are independent copies of Z = (x, y), we first solve the following optimization task:

min over Q_{τ_1}, ..., Q_{τ_m} ∈ H_K of  (1/m) Σ_{k=1}^m { (1/n) Σ_{i=1}^n ρ_{τ_k}(y_i − Q_{τ_k}(x_i)) + λ ‖Q_{τ_k}‖²_K },   (3.1)

where the first term is the sample version of the expected check loss, with ρ_τ(u) = u(τ − I(u < 0)) denoting the check loss function, τ_1, ..., τ_m denote m quantile levels, and λ denotes the parameter controlling the model complexity. In practice, τ_1, ..., τ_m can be chosen equidistantly from the interval (0, 1) to exploit the skeleton of the conditional distribution. Moreover, the number of quantile levels m and the covariate dimension p are both allowed to diverge with n. For simplicity, we suppress their dependence on n, still write m and p, and denote the cardinality of A* as |A*| = p_0 ≪ p. Note that the optimization task (3.1) resembles the joint quantile regression (JQR) [29] without imposing the (shape) non-crossing constraints, which facilitates the computation.
By the representer theorem [43], the solution of (3.1) must have the finite form

Q̂_{τ_k}(·) = Σ_{i=1}^n α̂_{τ_k,i} K(x_i, ·) = α̂_{τ_k}^T K_n(·),

where α̂_{τ_k} = (α̂_{τ_k,1}, ..., α̂_{τ_k,n})^T ∈ R^n denotes the vector of representer coefficients and K_n(·) = (K(x_1, ·), ..., K(x_n, ·))^T denotes the kernel vector, with (·)^T denoting transposition. It is worth pointing out that the representer theorem [43] converts the original optimization problem (3.1) over the infinite-dimensional functional space H_K into an optimization problem over the finite n-dimensional vector space of α_{τ_k}. Thus, solving the optimization task (3.1) is equivalent to solving

min over α_{τ_1}, ..., α_{τ_m} ∈ R^n of  (1/m) Σ_{k=1}^m { (1/n) Σ_{i=1}^n ρ_{τ_k}(y_i − K_i^T α_{τ_k}) + λ α_{τ_k}^T K α_{τ_k} },   (3.3)

where K = (K(x_i, x_j))_{i,j=1}^n ∈ R^{n×n} denotes the kernel matrix and K_i its i-th column. Once α̂_{τ_1}, ..., α̂_{τ_m} are obtained, the estimated gradient functions ĝ_{l,τ_k}, l = 1, ..., p, can be directly computed as

ĝ_{l,τ_k}(x) = α̂_{τ_k}^T ∂_l K_n(x),   (3.4)

where ∂_l K_n(x) = (∂_l K(x_1, x), ..., ∂_l K(x_n, x))^T is given once the kernel K(·, ·) is pre-specified. It is interesting to point out that we only need to solve (3.3) for α_{τ_k}, k = 1, ..., m, one time, and the estimated gradient functions can then be directly computed by using the derivative reproducing property in (3.4) without incurring any additional computational cost.
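A minimal Python sketch of the gradient step (3.4), assuming a Gaussian kernel with bandwidth σ and hypothetical fitted coefficients (random placeholders standing in for the solutions of (3.3)): the gradients at all sample points, the averaged squared empirical norms, and the hard-thresholding step are all plain matrix operations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m, sigma = 40, 6, 3, 1.0
X = rng.normal(size=(n, p))
# hypothetical representer coefficients, one n-vector per quantile level
alphas = rng.normal(size=(m, n)) / n

# pairwise differences and Gaussian kernel matrix K[i, j] = K(x_i, x_j)
diff = X[:, None, :] - X[None, :, :]          # (n, n, p)
K = np.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))

# dK[i, j, l] = d K(x_i, x) / d x_l at x = x_j = K(x_i, x_j)(x_il - x_jl)/sigma^2
dK = K[:, :, None] * diff / sigma ** 2        # (n, n, p)

# g_hat[k, j, l] = sum_i alpha_{k,i} * dK[i, j, l]: gradients at sample points
g_hat = np.einsum('ki,ijl->kjl', alphas, dK)

# squared empirical norms, averaged over sample points then quantile levels
norms = (g_hat ** 2).mean(axis=1).mean(axis=0)   # one value per covariate

# hard-thresholding step with an illustrative v_n
v_n = np.median(norms)
A_hat = np.flatnonzero(norms > v_n)
assert norms.shape == (p,)
```

Everything after the fit is vectorized linear algebra, which is the source of the computational savings emphasized above.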
In practice, since the marginal distribution ρ_x is usually unknown, we adopt the empirical norm as a reasonable substitute for the L_2-norm. Thus, the estimated informative set is defined as

Â_{v_n} = { l : (1/m) Σ_{k=1}^m ‖ĝ_{l,τ_k}‖²_n > v_n },

where ‖g‖²_n = (1/n) Σ_{i=1}^n g(x_i)² denotes the squared empirical norm and v_n represents some pre-specified thresholding value.
Note that the proposed method is computationally efficient in the sense that it only requires solving the convex optimization task (3.3), while the gradients can be analytically computed as in (3.4). Once the gradients are obtained, sparse learning reduces to comparing their empirical norms with a given thresholding value. Clearly, the selection performance of the proposed method relies on the choice of the thresholding value v_n, which can be appropriately determined through a stability-based selection criterion [38]. More details of the employed criterion are provided in Section 4.2.

Parallel and distributed computing
In this section, we illustrate that the proposed method is particularly attractive and useful for large-scale cases, where the dimension p or the sample size n is extremely large. When the dimension p is extremely large, with p > n, the proposed method only needs to estimate the n-dimensional representer coefficients α_{τ_k}, k = 1, ..., m, and the estimated gradient functions can be computed in a parallel fashion. Specifically, the proposed method only needs to estimate α_{τ_k}, k = 1, ..., m, by solving the optimization task (3.3), whose computational complexity depends only on n and m. Once the estimated coefficients α̂_{τ_k}, k = 1, ..., m, are obtained, the squared empirical norms of the estimated gradient functions can be directly computed as ‖ĝ_{l,τ_k}‖²_n = (1/n) Σ_{i=1}^n (α̂_{τ_k}^T ∂_l K_n(x_i))² for each l = 1, ..., p simultaneously. Finally, given a pre-specified v_n, the averaged empirical norm (1/m) Σ_{k=1}^m ‖ĝ_{l,τ_k}‖²_n is truncated by a hard-thresholding step, and thus the estimated informative set Â_{v_n} is obtained.
When the sample size n is extremely large, the proposed method can handle the large-n case by using the idea of the divide-and-conquer method [51] with slight modification. Specifically, the divide-and-conquer method usually assumes that the sample can be partitioned into several disjoint subsets, each stored on a local machine. A local estimator is obtained on each local machine; these local estimators are then communicated to a central processor, and a global estimator is synthesized by taking an average. Figure 1 illustrates the idea of the divide-and-conquer method. It is worth pointing out that the proposed method only requires communicating the p-dimensional vectors of empirical norms to the central processor, which is quite communication-efficient with the assistance of a distributed platform such as Apache Spark with Map-reduce steps. Readers may refer to [19] for a detailed introduction to Apache Spark. Without loss of generality, we assume that the sample Z^n = {(x_i, y_i)}_{i=1}^n is divided at random into J disjoint subsets Z^{n_1}, ..., Z^{n_J}, with J dividing n, and each subset Z^{n_j} contains n/J observations stored on a distributed local machine. On local machine j, the joint kernel-based quantile regressions (3.3) are fitted with the subset Z^{n_j} to estimate the local representer coefficients α̂^{(j)}_{τ_k} ∈ R^{|Z^{n_j}|}, k = 1, ..., m, where |Z^{n_j}| denotes the cardinality of the subset Z^{n_j}. Then the averaged squared empirical norms of the estimated gradient functions based on Z^{n_j} can be computed as ‖ĝ^{(j)}_l‖²_n = (1/m) Σ_{k=1}^m ‖ĝ^{(j)}_{l,τ_k}‖²_{n_j} for each l = 1, ..., p. Note that the above procedure can be done simultaneously on each distributed local machine. Once the estimated gradients on each local machine are obtained, we only need to communicate (‖ĝ^{(j)}_1‖²_n, ..., ‖ĝ^{(j)}_p‖²_n)^T ∈ R^p to the central processor, and then the empirical norm with respect to the whole sample Z^n can be computed as ‖ĝ_l‖²_n = (1/J) Σ_{j=1}^J ‖ĝ^{(j)}_l‖²_n for each l = 1, ..., p.
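The map-reduce pattern described above can be sketched as follows, with the local fit replaced by a hypothetical stub returning a machine's p-vector of averaged squared gradient norms (in a real deployment the stub would solve (3.3) on the machine's own subset and apply the gradient formula):

```python
import numpy as np

rng = np.random.default_rng(1)
p, J = 8, 4

def local_norms(machine_id):
    # stand-in for one machine's work ("map" step): fit (3.3) on its subset
    # and return the p-vector of averaged squared empirical gradient norms.
    # Here: random placeholder values for illustration only.
    local_rng = np.random.default_rng(machine_id)
    return local_rng.uniform(size=p)

# each machine produces its own p-vector; these could run in parallel
per_machine = np.array([local_norms(j) for j in range(J)])   # (J, p)

# central processor aggregates by averaging ("reduce" step)
global_norms = per_machine.mean(axis=0)

# hard-thresholding with an illustrative v_n yields the estimated set
v_n = 0.5
A_hat = set(np.flatnonzero(global_norms > v_n))
assert global_norms.shape == (p,)
```

Only J vectors of length p cross the network, matching the communication cost claimed in the text.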
Finally, given a pre-specified thresholding value v_n, the estimated informative set is Â_{v_n} = { l : ‖ĝ_l‖²_n > v_n }. As illustrated above, we employ the standard divide-and-conquer scheme to facilitate the estimation of the quantile functions in (3.1) and their gradient functions when the sample size is very large. It is interesting to notice that the numerical efficiency and performance can be further improved by employing the distributed scheme in [18], where the non-differentiable objective function is replaced by a smoothing approximation, leading to an analytic solution, so that the corresponding statistics can be computed locally and finally aggregated to complete the calculation of the derived estimator. Note that the emphasis of this paper is on developing an efficient method to learn the sparsity structure of the conditional distribution, and thus we leave the detailed investigation of more sophisticated distributed-computation schemes for future work.

Computational issues
The computational details for updating α_{τ_k}, k = 1, ..., m, in (3.3) and the stability-based tuning procedure for the optimal choice of the thresholding value v_n are provided in this section.

The dual optimization
Solving (3.3) is equivalent to solving its dual problem at each quantile level. Precisely, the dual optimization task can be derived straightforwardly by using Lagrange multipliers, and thus solving (3.3) is equivalent to solving a quadratic optimization with linear constraints,

max over α_{τ_k} ∈ R^n of  −(1/2) α_{τ_k}^T K α_{τ_k} + y^T α_{τ_k},  subject to 1_n^T α_{τ_k} = 0 and C(τ_k − 1) 1_n ≤ α_{τ_k} ≤ C τ_k 1_n,   (4.1)

for each k = 1, ..., m, where y = (y_1, ..., y_n)^T, C = 1/(nλ) denotes a tuning parameter, and 1_n = (1, ..., 1)^T ∈ R^n. Moreover, the optimization task (4.1) can be rewritten in the standard form

max over α_{τ_k} ∈ R^n of  −(1/2) α_{τ_k}^T K α_{τ_k} + y^T α_{τ_k},  subject to A^T α_{τ_k} ≥_1 B_k,   (4.2)

where A = (1_n, I_n, −I_n) ∈ R^{n×(2n+1)}, I_n = diag(1, ..., 1) ∈ R^{n×n}, ≥_1 means that the first constraint is an equality and the others are inequalities, and B_k = (0, C(τ_k − 1)1_n^T, −Cτ_k 1_n^T)^T ∈ R^{2n+1}. Note that the optimization task (4.2) can be efficiently solved by commonly used packages in standard statistical software, such as the "quadprog" package in R or the "qpsolvers" package in Python.
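A small sketch of the per-level dual in Python; for illustration it uses SciPy's general-purpose SLSQP solver rather than the quadratic-programming packages mentioned above, and the scaling of the objective and the value of C are schematic assumptions rather than the paper's exact parametrization:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, tau, C = 30, 0.5, 1.0
X = rng.normal(size=(n, 2))
y = X[:, 0] + 0.3 * rng.normal(size=n)

# Gaussian kernel matrix on the sample
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / 2)

def neg_dual(a):
    # minimize the negative dual objective: (1/2) a'Ka - a'y
    return 0.5 * a @ K @ a - a @ y

res = minimize(
    neg_dual, np.zeros(n), jac=lambda a: K @ a - y,
    method="SLSQP",
    bounds=[(C * (tau - 1), C * tau)] * n,          # box constraints
    constraints=[{"type": "eq", "fun": lambda a: a.sum()}],  # 1'a = 0
)
alpha = res.x
# solution satisfies the dual feasibility conditions
assert abs(alpha.sum()) < 1e-4
assert (alpha <= C * tau + 1e-6).all() and (alpha >= C * (tau - 1) - 1e-6).all()
```

For realistic n, a dedicated QP solver (quadprog, qpsolvers) exploiting the problem's structure would be both faster and more reliable than a generic nonlinear-programming routine.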
Note that the optimization task (4.2) can be solved in a parallel fashion across quantile levels, and thus the estimation procedure is computationally efficient and scalable. This computational efficiency is largely due to the fact that the (shape) non-crossing constraints [29,1] are not enforced in (3.1). With the non-crossing constraints enforced, one may expect the numerical performance to improve further, at the cost of increased computational burden. In the rest of this paper, we illustrate the proposed method by fitting JQR as introduced in Section 3.1 without enforcing the (shape) non-crossing constraints, which yields satisfactory numerical performance in all the numerical examples in Section 6.

Tuning procedure
To determine the thresholding parameter v_n, we employ the stability-based criterion [38] to search for its optimal value. The key idea is to measure the stability of the proposed method by randomly splitting the training sample into two parts and comparing the agreement between the two estimated active sets.
Specifically, given a value v_n, we randomly split the training sample Z^n into two parts Z^{n_1} and Z^{n_2}, and apply the proposed method to each part to obtain two estimated active sets Â_{1,v_n} and Â_{2,v_n}, respectively. The agreement between Â_{1,v_n} and Â_{2,v_n} is then measured by Cohen's kappa coefficient

κ(Â_{1,v_n}, Â_{2,v_n}) = (Pr(a) − Pr(e)) / (1 − Pr(e)),

where Pr(a) = (n_11 + n_22)/p and Pr(e) = ((n_11 + n_12)(n_11 + n_21) + (n_12 + n_22)(n_21 + n_22))/p², with n_11 = |Â_{1,v_n} ∩ Â_{2,v_n}|, n_12 = |Â_{1,v_n} ∩ Â^c_{2,v_n}|, n_21 = |Â^c_{1,v_n} ∩ Â_{2,v_n}|, n_22 = |Â^c_{1,v_n} ∩ Â^c_{2,v_n}|, and |·| denoting the set cardinality. The procedure is repeated B times and the estimated sparse-learning stability is measured as the average of the B kappa coefficients, denoted ŝ(Ψ_{v_n}). Finally, the thresholding parameter is set as

v_n = max{ v_n ∈ R_{≥0} : ŝ(Ψ_{v_n}) / max_{v_n} ŝ(Ψ_{v_n}) ≥ q },

where q ∈ (0, 1) is some given percentage and R_{≥0} denotes the set of non-negative real numbers.
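The kappa computation can be sketched directly from the counts above (a minimal Python version; `A1` and `A2` are the two estimated active sets, given as index sets out of p covariates):

```python
def kappa(A1, A2, p):
    # Cohen's kappa agreement between two selected subsets of {0, ..., p-1}
    n11 = len(A1 & A2)          # selected by both
    n12 = len(A1 - A2)          # selected by the first only
    n21 = len(A2 - A1)          # selected by the second only
    n22 = p - n11 - n12 - n21   # selected by neither
    pr_a = (n11 + n22) / p      # observed agreement
    pr_e = ((n11 + n12) * (n11 + n21)
            + (n12 + n22) * (n21 + n22)) / p ** 2  # chance agreement
    return (pr_a - pr_e) / (1 - pr_e)

# identical selections give perfect agreement
assert kappa({0, 1, 2}, {0, 1, 2}, 10) == 1.0
```

Averaging this quantity over B random splits, for each candidate v_n on the grid, yields the stability curve that the tuning rule thresholds at the fraction q.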

Asymptotic results
In this section, we establish the asymptotic results for the proposed method. For simplicity, we define Q*_τ = argmin_{Q_τ ∈ B_τ} ‖Q_τ‖²_K with B_τ = {Q_τ : Q_τ = argmin_{h_τ ∈ H_K} E(h_τ)} to ensure the uniqueness of the minimizer Q*_τ, and denote Q̃_τ = argmin_{Q_τ ∈ H_K} E(Q_τ) + λ‖Q_τ‖²_K. The following technical conditions are needed to establish the estimation consistency.

Condition 1. There exist some positive constants κ_1 and κ_2 such that for any l = 1, ..., p, sup_{x∈X} ‖K_x‖_K ≤ κ_1 and sup_{x∈X} ‖∂_l K_x‖_K ≤ κ_2.

Condition 2. There exist some positive constants c_1 and θ_1 such that max_{k=1,...,m} ‖Q̃_{τ_k} − Q*_{τ_k}‖_K = c_1 λ^{θ_1}.

Condition 1 imposes a boundedness condition on the kernel function as well as its gradient functions, which is commonly used in the statistical learning literature [28,50,14] and is satisfied by many popular kernels, including the Gaussian kernel, the Sobolev kernel, and the scaled linear kernel under the compact-support condition on X. Note that the compact-support condition is usually assumed in the machine learning literature [28,14,24] and is also regularly imposed for universality and Mercer's theorem. Recently, many efforts have been made to extend these results to the non-compact setting; interested readers may refer to [35,37,33]. Condition 2 controls the approximation error in the sense that lim_{λ→0} ‖Q̃_{τ_k} − Q*_{τ_k}‖_K = 0. Similar conditions quantifying the approximation error are commonly used in the statistical learning literature [28,49].
Now we turn to establish the asymptotic estimation consistency of the proposed method.

Theorem 1. Suppose Conditions 1-2 are satisfied. For any δ > 4(log n)
Theorem 1 provides a strong convergence result for the difference between Q̂_{τ_k} and Q̃_{τ_k}, which plays a crucial role in establishing the estimation consistency of the estimated gradient functions. Note that in the machine learning literature [34,28,22], it is usually assumed that |y| ≤ M for some positive constant M, for mathematical simplicity. When this bounded-response condition is imposed, the upper bound in Theorem 1 simplifies accordingly.

Theorem 2 establishes the convergence rate of the difference between the estimated and true gradient functions; this result is crucial for establishing the asymptotic selection consistency of the proposed method. Note that the convergence result allows p to diverge, but, as pointed out by [11], the dependency is generally difficult to quantify explicitly. Further conditions are needed to establish the selection consistency.

Condition 3. For some positive constants c_5 and ζ_1, the true gradient functions satisfy ‖g*_{l,τ} − g*_{l,τ'}‖_2 ≤ c_5 |τ − τ'|^{ζ_1} for any τ, τ' ∈ (0, 1) and l = 1, ..., p. Similar conditions can be found in [4] for the parametric case and [14] for the nonparametric case.

The first part of Condition 4 requires that the true gradient functions contain sufficient information about the truly informative covariates at some quantile level, which is commonly used in nonparametric modelling and is much tighter than in many existing nonparametric methods [16,48]. The second part of Condition 4 imposes a condition on the choice of quantile levels and is naturally satisfied for equidistant quantile levels, as well as for some other more sophisticated designs. Now we establish the asymptotic selection consistency of the proposed method. Theorem 3 shows that as the sample size n and the number of quantile levels m diverge, the selected informative set exactly recovers the true active set with probability tending to 1.
This result is particularly interesting given that it is established without any explicit model assumption and exactly identifies all the truly informative covariates acting on the conditional distribution, whatever the pattern of their effects.

Numerical experiments
In this section, we study the numerical performance of the proposed method, denoted as MF, and compare it against some state-of-the-art methods, including distance correlation learning (DC, [39]) and quantile-adaptive screening (QaSIS, [15]). Note that the original DC and QaSIS methods are designed to keep the first [n/ log n] covariates to achieve the sure screening property; we therefore further truncate them by a thresholding value to conduct sparse learning and report the truncated results. It is worth pointing out that the methods considered in [14] are computationally demanding and only work when the dimension is relatively small. Thus, we omit the numerical comparison with the methods considered in [14].
In all the simulated scenarios, we apply the Gaussian kernel K(x, u) = exp(−‖x − u‖²_2 / (2σ_n²)) for MF, where σ_n is set as the median of all pairwise distances among the training sample. As the choice of the thresholding value highly affects the performance of all the compared methods, we apply the stability-based selection criterion introduced in Section 4.2 to determine it. The maximization of the stability criterion is conducted via a grid search, with the grid set as {10^{−3+0.1s} : s = 0, ..., 60}.
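The median heuristic for σ_n can be sketched as follows (simulated stand-in data; only the upper triangle of the distance matrix is used so that each unordered pair is counted once):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))          # stand-in training sample

# all pairwise Euclidean distances
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
iu = np.triu_indices(len(X), k=1)     # strict upper triangle: each pair once
sigma_n = np.median(D[iu])            # the "median heuristic" bandwidth

def gauss_kernel(x, u, sigma=sigma_n):
    return np.exp(-((x - u) ** 2).sum() / (2 * sigma ** 2))

assert sigma_n > 0
assert gauss_kernel(X[0], X[0]) == 1.0
```

The heuristic puts a typical squared distance on the scale of σ_n², so the kernel is neither vanishingly small nor constant across the sample.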

Simulated examples
The numerical performance of MF and its competitors is evaluated under two simulated examples, which are also considered in [14]. In both simulated examples, the quantile levels are set as τ = (0.1, 0.25, 0.75, 0.9) for MF and τ = 0.75 for QaSIS.

Example 1: (Nonlinear model) We first generate the covariates from W_ij and U_i, which are independently drawn from U(−0.5, 0.5). The response is generated from a nonlinear additive mean function f* with components including f_4(u) = 0.1 sin(2πu) + 0.2 cos(2πu) + 0.3 sin²(2πu) + 0.4 cos³(2πu) + 0.5 sin³(2πu) and f_5(u) = sin(πu)/(2 − sin(πu)), and the errors ε_i are independently drawn from N(0, 1).

Example 2: (Heterogeneous model) The generating scheme is the same as in Example 1 except that W_ij and U_i are independently drawn from U(0, 1) and the error scale of the response y_i also depends on the covariates.

Clearly, only the mean of the conditional distribution relies on the covariates in Example 1, whereas in Example 2 the covariates act on the conditional distribution through both the mean function and the error term. For both examples, we consider the scenarios (n, p) = (200, 100), (400, 100), (400, 1000), (400, 4000), and correlation among the covariates is introduced by setting η = 0, 0.3, and 0.8 in each scenario. Specifically, when η = 0 the covariates are completely independent, whereas when η = 0.3 and 0.8 a correlation structure among the covariates is added. Each scenario is replicated 200 times and the averaged performance measures are summarized in Tables 1-2, where Size is the average number of selected informative covariates, TP is the number of truly informative covariates selected, FP is the number of truly non-informative covariates selected, and C, U, O are the counts of correct-fitting, under-fitting, and over-fitting, respectively.
As shown in Tables 1 and 2, MF outperforms both competitors in most scenarios. In Example 1, MF identifies all five truly informative covariates in most replications, whereas DC and QaSIS tend to miss some truly informative covariates. In Example 2, MF shows an even larger advantage over both competitors. The two competitors tend to miss x_3, which affects the response through the variance, while MF is still able to identify x_3 in most replications; it tends to overfit by including some noise covariates, which is much less severe than missing important ones. In both examples with η = 0.3 and 0.8, the added correlation structure increases the difficulty of identifying the informative covariates, and MF again outperforms its competitors in most scenarios. We also report the computational cost of MF under different scenarios in Table 3 to illustrate its remarkable computational efficiency, which supports our claim that MF is particularly useful for large-scale data. Note that all the simulations are run on a machine with an Intel Xeon 5117 CPU.

Real-data analysis
In this section, MF and its competitors are applied to a supermarket dataset [44] collected from a major supermarket located in northern China, consisting of daily sale records of p = 6,398 products on n = 464 days. The data cover almost all kinds of daily necessities; the response is the number of customers on each day, and the covariates are the daily sale volumes of the products. The supermarket wants to know which products' sale volumes are highly related to the number of customers, so that specific sale strategies based on those products can be designed to attract more customers. Both the response and covariates are pre-processed to have zero mean and unit variance. We apply MF with τ = (0.1, 0.25, 0.5, 0.75, 0.9), QaSIS with τ = 0.75, and DC to the supermarket data. Note that the truly informative covariates are unknown in real-data analysis, and thus we report the prediction performance of each method by randomly splitting the dataset into two parts, with 300 observations for training and the remainder for testing. We refit the kernel-based quantile regressions in (3.3) with the covariates selected by each method on the training set, and measure the prediction performance on the testing set. The splitting procedure is repeated 1000 times and the numerical performance is summarized in Table 4.
From Table 4, MF selects 22 products, whereas DC selects 20 products and QaSIS selects 15 products. The averaged prediction error of MF is smaller than those of DC and QaSIS, implying that these two methods may miss some important products. The products selected by MF are reported in Table 5.
Clearly, MF selects products such as "Totole", "Coca-Cola", "Eggs", "Rice", "Sweet noodles", "Trotters with sauce", and others. This result suggests that many customers are willing to buy these necessities, and thus special sale strategies based on the selected products may attract more customers. It is interesting to point out that the product "Korean side dishes" is also selected by MF, which is not surprising since the supermarket is located in northern China, whose food culture is somewhat similar to that of South Korea.

Summary
This paper proposes an efficient kernel-based method to recover the sparse structure of the conditional distribution, which can be regarded as a crucial step for subsequent statistical analysis. The proposed method focuses on identifying all the covariates acting on the conditional distribution by taking advantage of desirable properties of the RKHS. The proposed method is implemented efficiently via dual optimization and is particularly useful for large-scale cases. The asymptotic consistency of the proposed method is established without requiring any explicit model conditions, and the numerical experiments illustrate its superior performance against some state-of-the-art competitors.

Appendix: Technical proofs
Given the condition |y| ≤ M_n, we consider the functional space F_{M_n}. Note that F_{M_n} is fairly large in the sense that the minimizer of (3.1), denoted as Q̂ = (Q̂_{τ_1}, ..., Q̂_{τ_m}), is contained in F_{M_n} by the fact that max_{i=1,...,n} |y_i| ≤ M_n. Directly by Lemma 1, we can establish the following lemma.

Proof of Lemma 2:
Denote Z^n_i as a modified sample which is exactly the same as Z^n except that the i-th entry (x_i, y_i) is replaced by an independent copy (x_i', y_i'). By the triangle inequality, the condition that |y| ≤ M_n, the Cauchy–Schwarz inequality in RKHS, and Condition 1, the perturbation caused by replacing any single entry can be bounded uniformly over i = 1, ..., n. Then the desired result follows immediately.
where the first inequality follows from the symmetry of the Rademacher random variables, the second inequality follows from the Lipschitz continuity of the check loss, and the last inequality follows from the property of the Rademacher complexity and the condition that |y| ≤ M_n. Note that by the reproducing property Q_{τ_k}(x) = ⟨Q_{τ_k}, K_x⟩_K, the resulting bound is of order ((1 + κ_1 λ^{−1/2}) (log n) / (λ n^{1/2}))^{1/2}. Clearly, P(C_1) can be decomposed as P(C_1) = P(C_1 ∩ {|y| > log n}) + P(C_1 ∩ {|y| ≤ log n}) ≤ P(|y| > log n) + P(C_1 | |y| ≤ log n) =: P_1 + P_2.