High-dimensional sufficient dimension reduction through principal projections

Abstract: We develop in this work a new dimension reduction method for high-dimensional settings. The proposed procedure is based on a principal support vector machine framework where principal projections are used in order to overcome the non-invertibility of the covariance matrix. Using a series of equivalences we show that one can accurately recover the central subspace by projecting on a lower-dimensional subspace and then applying an ℓ1 penalization strategy to obtain sparse estimators of the sufficient directions. Based next on a desparsified estimator, we provide an inferential procedure for high-dimensional models that allows testing for the importance of variables in determining the sufficient direction. Theoretical properties of the methodology are illustrated and computational advantages are demonstrated with simulated and real data experiments.


Introduction
In an era where large and massive datasets are becoming the norm, one of the biggest challenges for researchers is to propose efficient procedures that address the 'small n, large p' problem, where n denotes the available sample size and p denotes the number of unknown parameters. The main difficulty with many of the existing methods is the need to estimate the inverse of a variance-covariance matrix Σ, which might not be invertible in this setting.
The Sufficient Dimension Reduction (SDR) framework contains a class of feature extraction procedures used mainly in regression and classification settings. These procedures are most frequently used as a first step, in order to reduce the dimensionality of a problem before any other statistical procedure is applied. They have been developed in a number of different directions and have become more popular in recent decades due to the reduced cost of data collection, which has led to an increase in the complexity of the data.
In a regression setting one has a response variable Y (which we assume univariate without loss of generality) and a p-dimensional predictor vector X. Our efforts in SDR are concentrated on finding D functions of the predictors which contain all the information on the conditional distribution of Y given X. In other words, we work under the linear conditional independence model

Y ⊥ X | B^T X,   (1.1)

where the matrix B is a p × D matrix of unknown parameters, with D ≪ p. Under (1.1) we extract linear functions of the predictors X. The space spanned by the column vectors of B is known as the Dimension Reduction Subspace (DRS). Since there exists a large class of matrices B that satisfy (1.1), there also exists a large number of DRSs that can be estimated. Almost always, one is interested in the one with minimum dimension D for which (1.1) is satisfied. This subspace is known as the Central Dimension Reduction Subspace (CDRS) and is denoted by S_{Y|X}. Although the CDRS does not always exist, the conditions for its existence are mild (Yin et al., 2008) and therefore, throughout this work we assume it exists. If the CDRS exists, then it is also unique. There is an abundance of methods used to estimate the CDRS under model (1.1) and a very short list of references includes Li (1991, 1992), Cook and Weisberg (1991), Li et al. (2005), Li and Wang (2007) and Zhu et al. (2010), among many others.
Recently, there has been an interest in extracting nonlinear features of the predictors and therefore, researchers introduced SDR under the nonlinear conditional independence model

Y ⊥ X | φ(X),   (1.2)

where φ : R^p → R^D. Model (1.2) is more general, as it allows for the extraction of both linear and nonlinear functions of the predictors. Again, there is an abundance of literature for SDR under model (1.2), including Cook (2007), Wu (2008), Fukumizu et al. (2009) and Li et al. (2011) among them. The latter introduced Support Vector Machines (SVM) in the SDR framework. This method provides a unified framework for linear and nonlinear feature extraction for SDR, among other advantages. The SVM-based SDR methodology has been extended recently by the works of Artemiou and Shu (2014), Artemiou and Dong (2016), Shin and Artemiou (2017), Randall et al. (2021) and Artemiou et al. (2021).
All the procedures mentioned above work only in low-dimensional settings where n ≫ p. The literature on how to achieve SDR in 'small n, large p' settings, more generally known as the high-dimensional low sample size (HDLSS) setting, is relatively thin. One of the first efforts to achieve dimension reduction without matrix inversion in the SDR framework was the work of Cook and Li (2002), where an iterative approach is proposed to avoid the use of the inverse matrix in the iterative Hessian transformations (IHT), a method which finds directions in the Central Mean Subspace, i.e. the space estimated under the model Y ⊥ E(Y|X) | B^T X, where E(Y|X) denotes the conditional expectation of Y given X. Recently, Lin et al. (2019) proposed the use of the Lasso with the SIR algorithm and Lin et al. (2018) proposed the use of Diagonal Thresholding with SIR. In this paper we propose the use of the Lasso in an SVM-based algorithm for SDR. As will be shown, our method utilizes principal projections of the covariance matrix Σ = var(X) to establish an equivalence relationship between high- and lower-dimensional SVM problems, after which an ℓ1-penalized procedure is applied to obtain sparse estimators of the underlying directions.
The paper is structured as follows: in Section 2 we introduce the background literature by revisiting the Principal Support Vector Machine (PSVM) framework and its estimation procedure. In Section 3 we motivate an ℓ1-regularized approach to address high-dimensional problems through a novel procedure coined 'LassoPSVM' and show that, by using principal component projections, one can avoid the use of the inverse of the covariance matrix. In Section 4 we present the computational algorithm used to obtain the LassoPSVM solution. In Section 5 we discuss theoretical properties of the proposed method, while in Section 6 we present a high-dimensional inferential procedure based on a desparsified estimator for LassoPSVM. In Section 7 we present numerical studies using both a controlled simulation example and real data to show the performance of the method. Finally, we close with a discussion of the method and possible future extensions in Section 8.

Principal support vector machine (PSVM)
The SDR approach we propose in this manuscript uses the multiple index model with additive error of the form

Y = g(β_1^T X, ..., β_D^T X) + ε,   (2.1)

with ε ∼ N(0, σ²) and g : R^D → R. The link function g is an unknown many-to-one function, the vectors β_1, ..., β_D are unknown vectors of coefficients and for simplicity we assume throughout that the dimension D is fixed and known. We concatenate the vectors β_1, ..., β_D to obtain the matrix of unknown coefficients as B = (β_1, ..., β_D). In this section we give a brief overview of the low-dimensional PSVM and in the next section we present our high-dimensional proposal. We assume without loss of generality that E(X) = 0. In the population version, PSVM minimizes the following objective function:

ψ^T Σ ψ + c E[1 − Ỹ(ψ^T X − t)]_+,   (2.2)

where Ỹ ∈ {−1, 1} is a discretized version of the response Y, Σ is the covariance matrix of the vector X, c is a positive regularization constant, t is a slack variable, ψ is a vector of unknown coefficients and a_+ = max{0, a}. It has been shown in Li et al. (2011) that under mild conditions, if the pair (ψ*, t*) minimizes the objective function (2.2) then ψ* ∈ S_{Y|X}, which motivates the SVM approach for low-dimensional problems.
Considering H − 1 different discretized versions of the response Y, dimension reduction is achieved by performing an eigenvalue decomposition of V defined as V = Σ_{h=1}^{H−1} ψ*_h (ψ*_h)^T, where ψ*_h are the minimizers of equation (2.2) for each h = 1, ..., H − 1, and selecting the eigenvectors corresponding to its largest D eigenvalues, due to the fact that the column space of V is the same as the column space of B, denoted as col(V) = col(B), in light of Proposition 1.

Proposition 1.
Under the specification of model (1.1) and assuming that E(X|B^T X) is a linear function of B^T X, one has col(V) = col(B).
The proof follows directly by applying Theorem 1 from Li et al. (2011), which showed that the minimizer satisfies ψ* ∈ S_{Y|X}. Thus, constructing the candidate matrix V using the minimizers ψ*_h (h = 1, ..., H − 1) ensures that the eigenvectors of V will span S_{Y|X}. Proposition 1 justifies why performing an eigenvalue decomposition of V is beneficial for SDR.
Estimation under (2.2) proceeds as follows. Let (Y_i, X_i^T)^T, i = 1, ..., n be an independent sample of n observations. Divide first the data into H equal-sized slices, where H is a fixed number of slices. Let q_h, h = 1, ..., H − 1 denote the cut-off points between the H slices in the range of Y. Let Y = (Y_1, ..., Y_n)^T and, for each h, create the discretized version of the vector Y, denoted as Ỹ^h, which has entries

Ỹ_i^h = I(Y_i > q_h) − I(Y_i ≤ q_h),

where I(·) denotes the indicator function attributing the value 1 if the condition is satisfied and 0 otherwise. Then, for each h, in the low-dimensional setting for PSVMs one minimizes the objective function:

min_{ψ, t, ξ^h} ψ^T Σ̂ ψ + (c/n) 1_n^T ξ^h, subject to ξ^h ≥ 0_n and ξ^h ≥ 1_n − Ỹ^h ∘ (Xψ − t_n),   (2.3)

where X = [X_1, X_2, ..., X_n]^T is an n × p predictor matrix and ∘ denotes the elementwise multiplication of two vectors. The vector ξ^h = (ξ^h_1, ..., ξ^h_n)^T, where ξ^h_i is the misclassification distance for each data point (set to 0 if the point is correctly classified) when q_h is used as a cut-off point, while t_n = (t, ..., t)^T ∈ R^n, 0_n = (0, ..., 0)^T ∈ R^n and 1_n = (1, ..., 1)^T ∈ R^n. Optimizing (2.3) gives rise to H − 1 minimizers ψ̂_h, h = 1, ..., H − 1, used next to create the candidate matrix V̂ = Σ_{h=1}^{H−1} ψ̂_h (ψ̂_h)^T.

To solve the quadratic programming problem in (2.3) one can use the Lagrangian approach. This is done by creating first the Lagrangian function L(ψ, t, ξ^h) defined as

L(ψ, t, ξ^h) = ψ^T Σ̂ ψ + (c/n) 1_n^T ξ^h − α^T (ξ^h − 1_n + Ỹ^h ∘ (Xψ − t_n)) − γ^T ξ^h,   (2.4)

where α = (α_1, ..., α_n)^T and γ = (γ_1, ..., γ_n)^T are vectors of Lagrangian multipliers. The Karush-Kuhn-Tucker (KKT) conditions imply that an optimal solution exists when the partial derivatives of (2.4) are set to 0. This implies two very important properties needed later in Section 3. The first is that the minimizer of the objective function specified by (2.3) is given by:

ψ̂ = (1/2) Σ̂^{-1} X^T (α ∘ Ỹ^h),   (2.5)

where Σ̂ is an estimator of Σ, and the second is that Σ_{i=1}^n α_i Ỹ_i^h = 0. Note that the same quadratic programme is solved for each cut-off point q_h, the only difference being that the values of Ỹ^h change (as they are a function of q_h).
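As an illustration, the slicing step described above can be sketched in a few lines (a minimal sketch; the function name and the use of empirical quantiles as cut-off points q_h are our own choices):

```python
import numpy as np

def discretize_response(y, H):
    """For H slices, compute the H - 1 cut-off points q_h and return the
    discretized responses with entries +1 if y_i > q_h and -1 otherwise."""
    # Empirical quantiles give (approximately) equal-sized slices
    cutoffs = np.quantile(y, [h / H for h in range(1, H)])
    return [np.where(y > q, 1.0, -1.0) for q in cutoffs]

y = np.array([0.1, 0.4, 0.7, 1.2, 1.9, 2.5])
y_tildes = discretize_response(y, H=3)  # H - 1 = 2 discretized responses
```

Each element of `y_tildes` then plays the role of Ỹ^h in one of the H − 1 quadratic programmes.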
Note also that ψ̂_h in (2.5) depends directly on the inverse of the estimator of the covariance matrix, Σ̂. For low-dimensional problems this generally does not pose problems, but for cases where p ≫ n this is problematic, as the inverse might not exist since the covariance matrix might not be of full rank. As will be illustrated in the following section, our proposed procedure eliminates completely the dependence on an estimator whose inverse does not exist.
For ease of exposition, we omit here the details on how one can use the dual problem to estimate α (details on the estimation are offered in Cortes and Vapnik, 1995, and Li et al., 2011, among others) and focus on the fact that once one has the vectors ψ̂_h for h = 1, ..., H − 1, one can perform an eigenvalue decomposition of the estimated candidate matrix V̂ = Σ_{h=1}^{H−1} ψ̂_h (ψ̂_h)^T in order to get the vectors that span the central subspace. This is done by selecting the first D eigenvectors v̂_1, ..., v̂_D associated with the largest D eigenvalues of V̂ as the vectors that span the CDRS.
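This final eigen-decomposition step can be sketched as follows (a minimal illustration with hypothetical minimizers; the function name is ours):

```python
import numpy as np

def candidate_matrix_directions(psi_hats, D):
    """Form V_hat = sum_h psi_h psi_h^T and return its D largest
    eigenvalues together with the associated eigenvectors, which
    estimate a basis of the central subspace."""
    V = sum(np.outer(psi, psi) for psi in psi_hats)
    vals, vecs = np.linalg.eigh(V)       # eigh returns ascending order
    order = np.argsort(vals)[::-1][:D]   # indices of the D largest
    return vals[order], vecs[:, order]

# toy minimizers, both (nearly) aligned with the first coordinate axis
psis = [np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0])]
vals, B_hat = candidate_matrix_directions(psis, D=1)
```

With these toy vectors, the leading eigenvector of V̂ is, as expected, essentially the first coordinate axis.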

Using principal projections for LassoPSVM
In this section we propose a procedure, coined 'LassoPSVM', that bypasses the need to invert the covariance matrix in order to obtain estimators in the high-dimensional setting.
Our procedure uses first the fact that one can construct an optimization problem whose solution is contained within the space spanned by the eigenvectors associated with the non-zero eigenvalues of Σ. As such, solving this new optimization problem is equivalent to solving the original problem in R^p. Due to the equivalence between the two problems, in a second step we use a principal component projection (Mardia et al., 1979; Zafeiriou et al., 2007) to solve the reduced problem in a lower-dimensional space and project back to the original dimension.
Theorem 1. Let Σ = var(X) be a p × p matrix of rank r < p and let A be the space spanned by the eigenvectors corresponding to the non-zero eigenvalues of Σ. The minimizer of the constrained objective function specified in (2.3) is equivalent to the minimizer of the same problem with ψ restricted to A.

Proof. As Σ is a real, symmetric and positive semidefinite matrix in R^{p×p}, its eigenvectors form an orthogonal basis of R^p. Therefore every vector ψ ∈ R^p can be written as a linear combination of the eigenvectors of Σ. Let A be the space spanned by the eigenvectors corresponding to the non-zero eigenvalues of Σ and A^⊥ the space spanned by the eigenvectors corresponding to the zero eigenvalues of Σ. Taking vectors u ∈ A and s ∈ A^⊥, one can write ψ ∈ R^p as ψ = u + s. To show the equivalence, note first that, since Σs = 0,

ψ^T Σ ψ = (u + s)^T Σ (u + s) = u^T Σ u.

Moreover, var(s^T X) = s^T Σ s = 0, which implies that under the projection on s all points x_i ∈ R^p project on the same point, and therefore s^T X_i = s^T X_j for any pair (X_i, X_j). Therefore, the constraints of (2.3) are equivalent to

ξ^h ≥ 0_n and ξ^h ≥ 1_n − Ỹ^h ∘ (Xu − t_n).

Using the above equivalence, one also has that the Lagrangian in (2.4) is equivalent to the following Lagrangian:

L(u, t, ξ^h) = u^T Σ̂ u + (c/n) 1_n^T ξ^h − α^T (ξ^h − 1_n + Ỹ^h ∘ (Xu − t_n)) − γ^T ξ^h,

which implies that the optimal hyperplane ψ depends only on u and not on s. It also implies that s can be any arbitrary vector from A^⊥. We have thus shown that the original problem formulated in (2.3) can be solved in the space A, rather than in the entire space R^p.
Assume now that r, the rank of Σ, is such that r ≥ D and r < n. We construct next the matrix P of dimension p × r. The columns of P are formed by the r eigenvectors of Σ corresponding to the non-zero eigenvalues. The columns of P form an orthonormal basis of A and therefore one can create the mapping m : R^r → A, which is a one-to-one mapping from R^r to A and where for any w ∈ R^r we have that m(w) = Pw = u. Using next the mapping m(w) and replacing u in Theorem 1 by Pw, one can show that the following corollary holds.
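The construction of P can be sketched as follows (a minimal sketch; the numerical tolerance used to declare an eigenvalue non-zero is our own choice):

```python
import numpy as np

def principal_projection(Sigma_hat, tol=1e-10):
    """Return the p x r matrix P whose columns are the eigenvectors of
    Sigma_hat associated with its non-zero eigenvalues, so that
    m(w) = P w maps R^r one-to-one onto the space A."""
    vals, vecs = np.linalg.eigh(Sigma_hat)
    keep = vals > tol        # eigenvalues treated as non-zero
    return vecs[:, keep]

# rank-deficient toy covariance: data of rank 2 living in R^3
X = np.random.default_rng(0).normal(size=(5, 2)) @ \
    np.random.default_rng(1).normal(size=(2, 3))
Sigma = np.cov(X, rowvar=False)
P = principal_projection(Sigma)   # p x r with r = 2 here
```

The columns of the returned P are orthonormal, as required for the mapping m to be one-to-one.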

Corollary 1. The minimizer of the objective function specified in (2.3) is equivalent to the minimizer of

min_{w, t, ξ^h} w^T (P^T Σ P) w + (c/n) 1_n^T ξ^h, subject to ξ^h ≥ 0_n and ξ^h ≥ 1_n − Ỹ^h ∘ (XPw − t_n).   (3.1)
Rather than optimizing (3.1), we propose to optimize

min_{w, t, ξ^h} w^T Σ̂^† w + (c/n) 1_n^T ξ^h,

which, coupled with the constraints in Corollary 1, leads to the Lagrangian

L(w, t, ξ^h) = w^T Σ̂^† w + (c/n) 1_n^T ξ^h − α^T (ξ^h − 1_n + Ỹ^h ∘ (XPw − t_n)) − γ^T ξ^h,

where α and γ are the Lagrangian multipliers and Σ̂^† is an estimator for the covariance matrix of the projected vectors XP. Using the KKT conditions as in Section 2, one can show that for each h the estimator for w is:

ŵ_h = (1/2) (Σ̂^†)^{-1} (XP)^T (α ∘ Ỹ^h).   (3.2)

In light of Theorem 1, LassoPSVM uses the candidate matrix V̂_u defined as

V̂_u = Σ_{h=1}^{H−1} ψ̂_h (ψ̂_h)^T = Σ_{h=1}^{H−1} P ŵ_h (P ŵ_h)^T,

where the second equality comes from using Corollary 1. Plugging into the expression for V̂_u the estimator ŵ_h defined in (3.2), one obtains

V̂_u = (1/4) P (Σ̂^†)^{-1} Σ_{h=1}^{H−1} (XP)^T (α ∘ Ỹ^h)(α ∘ Ỹ^h)^T XP (Σ̂^†)^{-1} P^T.   (3.3)

The advantage of the estimator expressed in (3.3) is the fact that Σ̂^{-1} is not needed, as it is replaced by the inverse of Σ̂^†, which always exists. We have thus removed the major hurdle encountered in the development presented in Section 2.
Let now λ̂_1 ≥ ... ≥ λ̂_D > 0 be the ordered eigenvalues of V̂_u and let η̂_1, ..., η̂_D be the associated eigenvectors. For any couple (λ̂_d, η̂_d) with d = 1, ..., D, one now has that V̂_u η̂_d = λ̂_d η̂_d and, as such, by substituting the pseudo-response vector ỹ^†_d in this eigen-relation, one obtains λ̂_d η̂_d = (1/n) P (XP)^T ỹ^†_d. As shown in Section 2 in Proposition 1, with PSVM the basic identity is that the column space of V is the same as the column space of B. As such, and with η_d denoting the population version of η̂_d, which corresponds to the d-th column of B, one has λ_d η_d ≈ (1/n) P (XP)^T ỹ^†_d, where the second result is obtained by approximating η_d ≈ η̂_d. Noting now that P (XP)^T = P P^T X^T = (X^†)^T, where X^† = XPP^T denotes the matrix of predictors projected back onto A, the above relation becomes

λ_d η_d ≈ (1/n) (X^†)^T ỹ^†_d.

To obtain now a sparse estimator β̂_d we propose to minimize the ℓ1-penalized quadratic loss function, for a regularization parameter μ > 0:

β̂_d = argmin_β (1/(2n)) ||ỹ^†_d − X^† β||_2^2 + μ ||β||_1,   (3.4)

hence the name 'LassoPSVM' for the proposed procedure. Here ||·||_1 and ||·||_2 denote the usual ℓ1 and ℓ2 vector norms.

We close this section by commenting on the importance and benefits of Theorem 1 and Corollary 1. In light of Theorem 1, we highlight that the crucial information to recover the CDRS is contained in the eigenvectors corresponding to the non-zero eigenvalues of V̂_u. However, these eigenvectors are still contained in R^p. As such, this result is not useful enough on its own, as one still needs the inverse of a p × p covariance matrix, which might not exist. Corollary 1, on the other hand, allows us to improve on the dimensionality of the problem by projecting the crucial information to recover the CDRS onto the space R^r. As r can be much smaller than n, the estimator Σ̂^† we propose is always invertible. One other possibility would be to plug in an estimator of Σ^{-1} in (3.2), as is proposed in Pircalabelu and Artemiou (2021) for graph-based procedures; however, an extra Gaussianity assumption is needed to justify that approach. Alternatively, based on the KKT conditions, an estimator as proposed in Cai et al. (2011) could be used, but this would introduce additional computational complexity.

We argue that principal projections on a lower dimensional subspace offer an elegant and natural solution in the proposed framework.
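The ℓ1-penalized step of LassoPSVM amounts to a standard Lasso fit of the pseudo-response on the projected predictors. A minimal coordinate-descent sketch of that fit, with synthetic inputs standing in for ỹ^†_d and X^† (in practice any off-the-shelf Lasso solver can be used):

```python
import numpy as np

def lasso_direction(X_proj, y_pseudo, mu, n_sweeps=500):
    """Coordinate descent for (1/2n)||y - Xb||_2^2 + mu*||b||_1,
    yielding a sparse direction estimate."""
    n, p = X_proj.shape
    b = np.zeros(p)
    col_sq = (X_proj ** 2).sum(axis=0) / n   # (1/n)||X_j||^2, assumed > 0
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual excluding coordinate j
            r = y_pseudo - X_proj @ b + X_proj[:, j] * b[j]
            rho = X_proj[:, j] @ r / n
            # soft-thresholding update
            b[j] = np.sign(rho) * max(abs(rho) - mu, 0.0) / col_sq[j]
    return b

rng = np.random.default_rng(2)
X_proj = rng.normal(size=(200, 10))
beta_true = np.zeros(10)
beta_true[0] = 2.0                 # single active component
b_hat = lasso_direction(X_proj, X_proj @ beta_true, mu=0.1)
```

On this toy problem the estimate concentrates on the single active coordinate, with the remaining coefficients shrunk towards zero.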

A sufficient dimension reduction algorithm
In this section we present schematically a sufficient dimension reduction algorithm that allows for the estimation of B. Based on the development presented in Section 3, one can put forward the following estimation procedure:

1. Calculate the sample mean and sample covariance of the predictors X_i, i = 1, ..., n, denoted as X̄ and Σ̂. Centre the X_i's to obtain X^c_i = X_i − X̄.
2. Perform an eigenvalue decomposition of the matrix Σ̂ to find the r non-zero eigenvalues τ̂_1, ..., τ̂_r and the corresponding eigenvectors ζ̂_1, ..., ζ̂_r. Using the vectors ζ̂_1, ..., ζ̂_r as columns, create the p × r matrix P̂.
3. For each h = 1, ..., H − 1, solve the reduced problem of Section 3 to obtain the minimizers ŵ_h and construct the candidate matrix V̂_u = Σ_{h=1}^{H−1} P̂ ŵ_h (P̂ ŵ_h)^T.
4. Find the D largest eigenvalues and corresponding eigenvectors of V̂_u and use them to calculate each vector ỹ^†_d, d = 1, ..., D.
5. Run a Lasso algorithm with ỹ^†_d as the response vector and X̂^† = X P̂ P̂^T as the predictor matrix to find each of the D (sparse) columns of the matrix B.
In the first step of the algorithm, one needs an estimator of the covariance matrix Σ. In principle, the simplest estimator that one can use is the sample covariance matrix S. However, when p ≫ n it is well known that the performance of this estimator relative to the true unknown Σ is poor. See for example the works of Ledoit and Wolf (2004), Bickel and Levina (2008), Cai et al. (2010), Fan et al. (2011) or Bien and Tibshirani (2011), among many others. For these reasons, we propose to use the thresholded estimator of Bickel and Levina (2008), due to its computational simplicity, generality and good theoretical properties.
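A minimal sketch of such a hard-thresholding estimator (keeping the diagonal intact is our reading of the usual convention, not a detail taken from the paper):

```python
import numpy as np

def threshold_covariance(S, t_n):
    """Hard-threshold the entries of the sample covariance S at level
    t_n: entries with |s_ab| <= t_n are set to zero, while the diagonal
    is kept intact."""
    Sigma_t = np.where(np.abs(S) > t_n, S, 0.0)
    np.fill_diagonal(Sigma_t, np.diag(S))
    return Sigma_t

S = np.array([[2.0, 0.05, 0.8],
              [0.05, 1.0, -0.02],
              [0.8, -0.02, 3.0]])
Sigma_t = threshold_covariance(S, t_n=0.1)
```

Small off-diagonal entries are zeroed out, while large entries and the variances survive.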

Theoretical results
Let Σ_0 denote the true covariance matrix of the vector X and V_0 the true candidate matrix for which col(V_0) = col(B_0), where B_0 is the true matrix of unknown parameters. Let S_d = {k : β_{d,0,k} ≠ 0} be the support of β_{d,0}, S^c_d the complement of S_d and s_d = |S_d| the cardinality of S_d.
By the basic Lasso inequality and Lemma 6.2 in Bühlmann and Van De Geer (2011), this event, denoted B, holds with high probability, and we work on it in what follows. Now, from the KKT conditions for the Lasso, β̂_d is a solution of (3.4) if and only if

(1/n) (X^†)^T (ỹ^†_d − X^† β̂_d) = μ τ̂, with τ̂_k = sign(β̂_{d,k}) if β̂_{d,k} ≠ 0 and ||τ̂||_∞ ≤ 1.   (5.1)
Working on the event B, the fact that β̂_d is a LassoPSVM solution also implies that ||(1/n) (X^†)^T ỹ^†_d||_∞ ≤ μ. This bound is interesting in itself, as it informs us that, in an ℓ∞ sense, the components of this vector are well controlled when p grows. To see why this inequality holds, we work by contradiction. Suppose ||(1/n) (X^†)^T ỹ^†_d||_∞ > μ. As β̂_d is a Lasso solution, we argued that (5.1) must hold; on the other hand, the triangle inequality would then force a violation of (5.1). We have arrived at a contradiction, hence the supposition is false. Now, since we can guarantee the good control ||(1/n) (X^†)^T ỹ^†_d||_∞ ≤ μ, the term in (5.1) can be rewritten as stated in the lemma.
Proposition 2. Let β̂_d be the LassoPSVM solution of (3.4) when μ = Aσ√(log p / n) and D = 1, and let β_{d,0} be a vector proportional to the true vector that satisfies (2.1). Assume X̂^† satisfies a compatibility condition with respect to the true support S_d for a compatibility constant φ_0 > 0. For sufficiently large constants A, A′ and A″, one has with high probability that

||β̂_d − β_{d,0}||_1 ≤ A′ (s_d σ / φ_0^2) √(log p / n) and (1/n) ||X̂^† (β̂_d − β_{d,0})||_2^2 ≤ A″ s_d σ^2 log p / (φ_0^2 n).

Proof. Under the compatibility assumption we analyse the vector β̂_d − β_{d,0}. Using Lemma 1 we can bound its ℓ1 norm due to the constraint under the compatibility condition. As such, the first result directly follows by plugging in μ = Aσ√(log p / n), and the second result follows analogously by plugging in μ = Aσ√(log p / n).
The conditions D = 1 and that β_{d,0} satisfies (2.1) in Lemma 1 and Proposition 2 put us in the single-index model framework. If one considers a stronger restricted eigenvalue condition on the design, one can additionally obtain a bound on ||β̂_d − β_{d,0}||_2, where φ_1 > 0 is now the constant for which the restricted eigenvalue condition is satisfied and C is an arbitrarily large constant. Note that the restricted eigenvalue condition is now much stronger, since we require it to hold for any vector of the form v = β̂_d − β_{d,0}, where β_{d,0} satisfies (2.1).
Consider without loss of generality that D = 2. Then cos(θ) = β̂_1^T β̂_2 / (||β̂_1||_2 ||β̂_2||_2) = a^T b, where θ is the angle between the two component vectors β̂_1 and β̂_2, and a and b are the unit vectors obtained after normalization. Suppose there exist vectors a and b with a ≠ b such that ||a||_2 = ||b||_2 = 1 and a^T b = 1, which would imply the two vectors are collinear. This implies ||a − b||_2^2 = ||a||_2^2 + ||b||_2^2 − 2 a^T b = 0, and hence that the vectors agree componentwise; in particular a_2 = b_2 and, similarly, one can arrive at a_1 = b_1. However, this is not allowed (due to our supposition), hence the contradiction.
Secondly, also by contradiction, one can show that if ||Σ_0 β_{d,0}||_∞ > μ then necessarily ||β̂_d||_∞ > 0 with high probability, for it to be a solution of (3.4). This implies further that the lengths of the vectors β̂_d, d = 1, ..., D, are bounded away from zero.
Using further a Gram-Schmidt orthogonalisation, we can argue as in Lin et al. (2019), with σ^2_ϕ denoting the corresponding j-th element on the matrix diagonal and Φ(·) the standard normal cumulative distribution function. As σ^2_ϕ is unknown, we replace it by a consistent estimator.

Numerical studies
In this section we present the performance of our proposed procedure on simulated data, on two real gene expression datasets, and on a phonetic dataset.

Simulation study: Frobenius loss performance
The performance of the LassoPSVM procedure has been investigated in a controlled simulation study. We have benchmarked our procedure against LassoSIR (Lin et al., 2019), as this procedure has been observed in practice to produce competitive results and most importantly, it was designed for high-dimensional data as it produces sparse estimators for the coefficient matrix B. Due to these similarities, we argue that the two methods are directly comparable.
Performance for both competitors has been measured by a Frobenius loss between B̂, the estimated coefficient matrix, and B_0, the true coefficient matrix, where ||·||_F denotes the Frobenius norm. Smaller values of the loss denote better performance.
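Since only the column space of B is identifiable, one common way to compute such a loss (used here purely for illustration; the paper's own displayed definition should be preferred) compares the orthogonal projections onto the two column spaces:

```python
import numpy as np

def frobenius_loss(B_hat, B0):
    """||P(B_hat) - P(B0)||_F, where P(B) is the orthogonal projection
    matrix onto col(B); invariant to the choice of basis for col(B)."""
    def proj(B):
        Q, _ = np.linalg.qr(B)   # orthonormal basis of col(B)
        return Q @ Q.T
    return np.linalg.norm(proj(B_hat) - proj(B0))  # Frobenius by default

B0 = np.array([[1.0], [0.0], [0.0]])
perfect = frobenius_loss(2.5 * B0, B0)                  # same column space
worst = frobenius_loss(np.array([[0.0], [1.0], [0.0]]), B0)
```

The loss is zero whenever the two column spaces coincide (regardless of scaling) and maximal for orthogonal spaces, which is exactly the behaviour one wants from an SDR performance measure.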
Three data-generating models have been used throughout the section, defined as:

Model 1: Y = X_1 + X_2 + ... + X_k + σε, where k = 2% of p;
Model 2: Y = X_1 / (.5 + (X_2 + 1)^2) + σε;
Model 3: Y = X_1 (X_1 + X_2 + 1) + σε.

Model 1 investigates the case of a linear, low- or high-dimensional model (depending on the value of p, the number of predictors), where the sparsity coefficient (i.e. the number of active components) is set at 2% of the total number of predictors. Models 2 and 3 investigate the very sparse, non-linear case. The difference between Model 1 and Models 2 and 3 lies in the fact that Model 1 needs one effective direction to model the central subspace, whereas Models 2 and 3 need two distinct directions. The difference between Models 2 and 3 is that X_1 affects both directions in Model 3, while in Model 2 two different variables define the two directions.
In each model X = (X_1, ..., X_p)^T ∼ N(0, Σ), where Σ = (σ_ab) obeys a Toeplitz structure with σ_ab = ρ^{|a−b|} and ρ ∈ {0, .5, .8}. This gives rise to three different scenarios, which range from uncorrelated to highly correlated predictors. The errors followed ε ∼ N(0, 1) and σ ∈ {.5, 1}. From each model we sampled n = 300 independent observations and p, the number of predictors, was set to p ∈ {100, 500, 1000, 2000}, while the number of slices was set to H ∈ {5, 20}. Using each distinct value of the different data-generating parameters, 144 different simulation scenarios have been created, and for each scenario 300 independent repetitions from the same generating process were performed.
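The data-generating mechanism for Model 1 can be sketched as follows (the function name and the seed are ours):

```python
import numpy as np

def simulate_model1(n, p, rho, sigma, seed=0):
    """Model 1: Y = X_1 + ... + X_k + sigma * eps with k = 2% of p,
    X ~ N(0, Sigma), Sigma Toeplitz with entries rho^|a-b|, eps ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # Toeplitz structure
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    k = max(1, int(0.02 * p))       # 2% of the predictors are active
    Y = X[:, :k].sum(axis=1) + sigma * rng.normal(size=n)
    return X, Y

X, Y = simulate_model1(n=300, p=100, rho=0.5, sigma=0.5)
```

With n = 300 samples, the empirical correlation between adjacent predictors is close to the nominal ρ, as the Toeplitz structure intends.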
For both competitors, a 10-fold cross-validation scheme has been used for selecting the tuning parameter μ that dictates the sparsity level of the coefficients β̂, and both competitors have been given information about the true number of directions needed to model the data.
As pointed out at the end of Section 4, in the first step of the algorithm one needs an estimator of the covariance matrix Σ. We have investigated in this section two estimators: (i) the sample covariance matrix S and (ii) the thresholded estimator of Bickel and Levina (2008). For the thresholded estimator of the Σ matrix, i.e. Σ̂_{t_n}, the thresholding level was set to t_n = M√(log p / n), where M = .1 was used. A cross-validation strategy can be used here as well to select M, but from our experiments we have observed that this value performed satisfactorily. Figure 1 presents the obtained results. In the plot, each symbol represents the empirical median over the 300 repetitions for each of the 144 scenarios, and it suggests that LassoPSVM provided similar or better results in a large majority of the different scenarios when compared to the LassoSIR procedure. In only a small number of scenarios the performance seemed to be slightly worse than that of LassoSIR. Panel (b) illustrates that using the thresholded estimator or the empirical estimator leads to similar conclusions, probably due to the choice of the thresholding level t_n.
To illustrate the benefits of using the thresholded estimator rather than the empirical covariance matrix, we plot in Figure 2 the difference between Σ̂_{t_n} and the true Σ, and between S and Σ, in Frobenius norm. The difference is plotted as a function of M for a particular configuration where n = 300, p = 100, H = 5, σ = 0.5, ρ = .8 and Model 3 is used as the data-generating process, which in principle favors the empirical covariance estimator, as the example is low-dimensional and enjoys a high signal-to-noise ratio. As the figure illustrates, for higher values of the thresholding parameter, the accuracy of reconstructing the true underlying Σ increases for Σ̂_{t_n} relative to the empirical covariance estimator. Similar trends have been observed for other configurations of the parameters, but are not reported here.
We 'zoom in' further on the performance of the methods as a function of the data generating parameters in Figure 3. The figure suggests that under the non-linear models 2 and 3, the LassoPSVM is better equipped at identifying the effective directions in the data than the LassoSIR. The same holds for the settings where Σ = I and p = 100 for which LassoPSVM always provided comparable or better performance. For all other values of the data generating parameters, the performance of the LassoPSVM is generally better (by which we mean that the number of settings where it outperforms, is larger than the number of settings where it underperforms) than that of LassoSIR, but for all these cases we could identify a few settings where the LassoSIR is better performing than LassoPSVM.
To conclude this section, we remark that the simulation study suggests that neither technique is uniformly better than the other, but in general LassoPSVM provided better results in more settings than LassoSIR. LassoPSVM thus proved to be a worthy competitor to LassoSIR in a controlled, finite-sample setting.

Simulation study: Type I error and power
In this section we evaluate the Type I error rate and the power of the hypothesis test proposed in Section 6, with significance level α = 5%. We consider two further models, Models 4 and 5 (described below), where the total number of predictors p takes values in the set {500, 1000}, σ ∈ {.5, 1} and ε ∼ N(0, 1). The vector X ∼ N(0, Σ), where σ_ab = ρ^{|a−b|} with ρ ∈ {0, .5, .8}. We sampled n = 300 independent observations, while the number of slices was set to H ∈ {5, 20}.
We evaluate the performance of the test by the average empirical power and the average empirical Type I error, computed respectively over S_0 and S^c_0, the sets of active and non-active components. Model 4 investigates the case of a linear, high-dimensional model where five components of the total number of predictors are active and where one sufficient direction is needed to model the underlying signal. Model 5 investigates the sparse, non-linear case where one needs two distinct directions to model the underlying signal. For Model 5, only the first component is active for the first direction and the first 5 components are active for the second direction. The competitor procedures used in this section are the desparsified LassoPSVM with S and with Σ̂_{t_n}, and the desparsified Lasso (van de Geer et al., 2014). For LassoPSVM with Σ̂_{t_n}, a data-splitting approach as suggested in Bickel and Levina (2008) was repeated 10 times on a grid of 10 different values in order to select an optimal thresholding level t_n. Table 1 presents the obtained results and it suggests that both competitors control well the average Type I error rate at the target level of 5%. For Model 4, the average power with respect to the specified alternative seems to decrease as ρ and H increase when p = 500, but seems to be comparable to that of the Lasso for p = 1000. For the non-linear Model 5, LassoPSVM captures much better the important variable for the first direction, while being slightly less powerful than the Lasso with respect to the components active on the second direction. As the model is highly non-linear, the obtained powers are lower than in the case of Model 4, and increasing the correlation between predictors severely reduces the performance of the tests. Overall, using LassoPSVM with S or Σ̂_{t_n} provided very close results.

(Figure 4: Heatmap of the predictor gene expression levels for the Scheetz et al. (2006) dataset, left panel, and for the Bühlmann et al. (2014) dataset, right panel.)

Applications to real data: the continuous case
In this section we discuss the application of the LassoPSVM procedure to two real datasets. The first dataset we analyse is a simplified version of the 'Eye' gene expression data from Scheetz et al. (2006). The data contain information about the expression level of p = 200 genes (predictors) for a total of n = 120 rats, while the response is represented by the expression level of the TRIM32 gene for all rats. The second dataset we analyse is the 'Riboflavin' dataset from Bühlmann et al. (2014), which contains information regarding riboflavin production by Bacillus subtilis. Information for n = 71 observations on p = 4088 predictors (gene expressions) and a one-dimensional response (riboflavin production) is recorded. Both datasets are freely provided in the R-based packages flare and hdi. A visual inspection of the information in these two datasets is offered in Figure 4. We evaluate LassoPSVM and LassoSIR with respect to the leave-one-out average mean square prediction error and its average standard deviation, based on the errors Err_{i,d} = (y_i − x_i^T β̂_{d,(−i)})^2 and their average (1/n) Σ_{i=1}^n Err_{i,d}. The values y_i and x_i represent the observed values for the response and the vector of gene expression levels for the i-th case, and β̂_{d,(−i)} is the d-th estimated direction when the i-th case is excluded from the training sample. Table 2 presents the obtained results when D (the number of directions) takes values in the set {1, 2} and H (the number of slices) takes values in the set {3, 5, 10} for both the LassoPSVM and LassoSIR techniques. It suggests that both techniques provide very similar results for the two datasets, thus showing that the proposed procedure is a worthy competitor to the LassoSIR procedure.
Moreover, for LassoPSVM a BIC-type criterion, as proposed in Artemiou and Dong (2016), suggests that $D = 1$ provides the best results for both datasets; allowing for a larger number of directions does not appreciably improve performance, which is confirmed by the results in the table.
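The leave-one-out errors defined above can be computed as in the following sketch. The `fit_direction` placeholder stands in for the LassoPSVM or LassoSIR fit (here illustrated with an ordinary least-squares fit for self-containedness); all function names are ours, not the authors'.

```python
import numpy as np

def fit_direction(X, y):
    """Placeholder for an estimated direction; here a least-squares fit.
    In the paper this would be the LassoPSVM or LassoSIR direction."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def loo_prediction_error(X, y, fit=fit_direction):
    """Leave-one-out errors Err_i = (y_i - x_i^T beta_(-i))^2, where
    beta_(-i) is fitted with the i-th case removed; returns their mean
    and standard deviation."""
    n = X.shape[0]
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        beta = fit(X[mask], y[mask])
        errs[i] = (y[i] - X[i] @ beta) ** 2
    return errs.mean(), errs.std()
```

For the real-data comparison, this loop would be run once per direction $d$ and per method, producing the entries of Table 2.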
To illustrate the usefulness of the testing procedure proposed in Section 6, we present in Table 3 the variables selected when a Bonferroni correction is applied to maintain a FWER of 10%, and compare LassoPSVM to the desparsified Lasso and to the desparsified kernel-based procedure for high-dimensional linear single index models of Gueuning and Claeskens (2016). The Eye dataset is used as an example.
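The Bonferroni selection step amounts to keeping the variables whose p-values fall below the corrected threshold. A minimal sketch (the p-values would come from the desparsified estimator; the helper name is ours):

```python
def bonferroni_select(p_values, fwer=0.10):
    """Indices of variables whose p-value survives the Bonferroni
    correction at family-wise error rate `fwer`: p_j < fwer / p."""
    p = len(p_values)
    return [j for j, pv in enumerate(p_values) if pv < fwer / p]
```

For example, with four p-values and a FWER of 10%, the per-test threshold is 0.10/4 = 0.025.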

Applications to real data: the discrete case
In this section we discuss the application of the LassoPSVM procedure to the dataset of Tsanas et al. (2014) from the UC Irvine repository, available at https://archive.ics.uci.edu/ml/datasets/LSVT+Voice+Rehabilitation. The dataset consists of $n = 126$ samples and $p = 309$ features, and the aim is to assess whether voice rehabilitation treatment leads to phonations considered 'acceptable' or 'unacceptable' (a binary classification problem). We evaluate LassoPSVM and LassoSIR with respect to Sensitivity, Specificity and the $F_1$ score, defined as Sensitivity $= A/(A + C)$, Specificity $= D/(B + D)$ and $F_1 = 2PR/(P + R)$, where $P = A/(A + B)$, $R$ = Sensitivity, and the counts $A$, $B$, $C$ and $D$ come from the classification table below:

                              Observed
    Predicted         Acceptable   Unacceptable
    Acceptable            A             B
    Unacceptable          C             D

The predicted class is obtained for each case in a leave-one-out scheme, where the training set uses all cases except the $i$-th case to make a prediction for the $i$-th case. To maintain comparability, we use the same strategy as proposed in the work of Lin et al. (2019): we first standardize each feature, apply LassoSIR and LassoPSVM to identify the directions $\hat{B}$ and the corresponding components, then fit a logistic regression model on the training dataset, after which the probability that the left-out case belongs to the acceptable/unacceptable class is calculated. Table 4 presents the obtained results when $D = 1$ and $H = 2$ for both the LassoPSVM and LassoSIR techniques. It illustrates that both techniques provide very similar results on this classification problem, thus showing again that the proposed procedure is a worthy competitor to LassoSIR.
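The three metrics follow directly from the table counts; a minimal sketch of the computation (the helper name is ours):

```python
def classification_metrics(A, B, C, D):
    """Sensitivity, Specificity and F1 from the 2x2 table counts:
    A = predicted and observed acceptable, B = predicted acceptable but
    observed unacceptable, C = predicted unacceptable but observed
    acceptable, D = predicted and observed unacceptable."""
    sensitivity = A / (A + C)
    specificity = D / (B + D)
    precision = A / (A + B)           # P in the F1 formula
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1
```

In the leave-one-out scheme described above, the counts $A$, $B$, $C$, $D$ are accumulated over the $n$ held-out predictions before the metrics are computed.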

Discussion
In this paper we have introduced a new method for sufficient dimension reduction which handles 'small n, large p' problems. The proposed method is based on an SVM framework for SDR, in conjunction with a principal projections framework that avoids the singularity of the covariance matrix. The method outputs sparse directions, achieved by penalizing a transformed objective function with an $\ell_1$ penalty. The method's performance is assessed on simulated and real datasets, where in the majority of cases it provides similar or better results than a state-of-the-art, direct competitor. The method is flexible enough to admit extensions in multiple directions, but probably the most interesting extension would be towards non-linear settings as framed in model (1.2). At first sight, connections with kernel PCA methods are directly exploitable, but a thorough treatment of the subject is a topic for future research.