Convergence guarantee for the sparse monotone single index model

We consider a high-dimensional monotone single index model (hdSIM), a semiparametric extension of the high-dimensional generalized linear model (hdGLM) in which the link function is unknown but constrained to be monotone non-decreasing. We develop a scalable projection-based iterative approach, the "Sparse Orthogonal Descent Single-Index Model" (SOD-SIM), which alternates between sparse-thresholded orthogonalized "gradient-like" steps and isotonic regression steps to recover the coefficient vector. Our main contribution is finite-sample estimation bounds for both the coefficient vector and the link function in high-dimensional settings, under very mild assumptions on the design matrix $\mathbf{X}$, the error term $\epsilon$, and their dependence. The convergence rate for the link function matches the low-dimensional isotonic regression minimax rate ($n^{-1/3}$) up to poly-logarithmic terms, and the convergence rate for the coefficients is also $n^{-1/3}$ up to poly-logarithmic terms. The method applies to many real data problems, including GLMs with a misspecified link, classification with mislabeled data, and classification with positive-unlabeled (PU) data. We study the performance of this method via numerical studies and an application to Rocker protein sequence data.


Introduction
Single index models (SIMs) provide a semi-parametric extension of linear models, where a scalar response variable Y ∈ R is related to a predictor vector X ∈ R^p via

E[Y | X] = g*(X^⊤ u*),   (1)

for some unknown parameter vector u* and unknown link function g*.
In this paper, we consider a shape-constrained single index model (1) where g* belongs to a class of monotone (but not necessarily smooth) functions and the parameter u* is high-dimensional and sparse (with sparsity level s*). We develop a scalable algorithm and provide theoretical guarantees for high-dimensional single index models. Prior theoretical guarantees for the high-dimensional single index model rely on the distribution of X being known or symmetric (see e.g. [28, 38]). In a number of settings, such assumptions on X are not satisfied (in Section 1.1.3 we show that X follows a non-symmetric mixture distribution) and prior theoretical guarantees do not apply. In this paper, we provide a scalable algorithm with theoretical guarantees for the high-dimensional single index model under more general design assumptions that include non-symmetric distributions.
We consider a general setting in which mild assumptions are made on the design matrix X, the error ε := Y − E[Y | X], and their interaction. In particular, X can be deterministic or random; if X is random, its distribution can be asymmetric, and the distribution of ε may depend on X. In addition, we assume that the unknown vector u* is sparse and high-dimensional. Since g* is not assumed to be known a priori, the model (1) provides a more flexible specification than a parametric model, while still circumventing the curse of dimensionality by avoiding p-dimensional non-parametric function estimation. Two estimation problems arise in the single index model framework: estimation of the unknown vector u* and of the one-dimensional regression function g*. Both are addressed in this paper.
To begin, we provide three motivating examples for the hdSIM in (1). Each example reduces to the estimation of u* or g*, or both, under the model (1).

Example 1: GLMs with a misspecified link function
Generalized linear models (GLMs) are parametric extensions of linear models, which include many popular models such as logistic and Poisson regression. GLMs relate the conditional expectation of Y given X via a link function ψ, such that E[Y | X] = ψ^{−1}(X^⊤ u*), where the link ψ is assumed to be known. The parameter of interest u* is usually estimated via an iterative procedure that uses the knowledge of ψ; therefore, when ψ is misspecified, the output is inevitably biased for u*, even asymptotically. Using the single index model framework, we produce an estimate of u* without specifying a particular link function.

Example 2: Classification with corrupted labels
Another interesting example involves classification with corrupted labels. In particular, instead of observing a sample (X, Ỹ) ∈ R^p × {0, 1}, where Ỹ may follow, e.g., a logistic model, we observe the pair (X, Y) ∈ R^p × {0, 1}, where Y is a systematically corrupted version of Ỹ. Specifically, Ỹ is related to X via E[Ỹ | X] = g(X^⊤ u*) for some unknown parameter vector u* ∈ R^p and a monotone function g, which is potentially unknown.
We assume the corrupted label Y is independent of X given Ỹ, which corresponds to a missing-completely-at-random assumption. The corruption structure is completely specified by the corruption probabilities ρ_1 = P(Y = 0 | Ỹ = 1) (the probability that a positive label is corrupted) and ρ_0 = P(Y = 1 | Ỹ = 0) (the probability that a negative label is corrupted), with ρ_0 + ρ_1 < 1 assumed for identifiability of the model. Under these assumptions, E[Y | X] = P(Y = 1 | X) is given by

E[Y | X] = g̃(X^⊤ u*),   where g̃(s) := g(s)(1 − ρ_1) + (1 − g(s))ρ_0.

(Note g̃ is monotone since g is monotone and ρ_0 + ρ_1 < 1 by assumption.) In many practical situations the corruption probabilities ρ_0 and ρ_1 may be unknown. In such cases, even under a specific assumed g (such as the sigmoid g(u) = 1/(1 + exp(−u)) in the case of a logistic model for Ỹ), the new link function g̃ is still unknown when ρ_0 and ρ_1 are unknown, which motivates moving beyond the parametric model setting.
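As a quick numeric illustration (a minimal sketch, not from the paper's code; the helper names are ours), the transformation above can be checked directly: corruption shrinks the range of the link from (0, 1) to (ρ_0, 1 − ρ_1) but preserves monotonicity whenever ρ_0 + ρ_1 < 1.

```python
import numpy as np

def corrupt_link(g, rho0, rho1):
    # g_tilde(s) = g(s)(1 - rho1) + (1 - g(s)) * rho0
    return lambda s: g(s) * (1 - rho1) + (1 - g(s)) * rho0

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
g_tilde = corrupt_link(sigmoid, rho0=0.2, rho1=0.3)

s = np.linspace(-6, 6, 201)
assert np.all(np.diff(g_tilde(s)) > 0)      # still monotone non-decreasing
print(g_tilde(s).min(), g_tilde(s).max())   # range now lies inside (0.2, 0.7)
```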

Example 3: Classification with positive-unlabeled (PU) data and unknown prevalence π
In various applications such as ecology, text mining, and biochemistry, negative samples are often hard or even impossible to obtain. In positive-unlabeled (PU) learning, we use positive and unlabeled samples, where positive samples are taken from the sub-population whose responses are known to be positive, and unlabeled samples are random samples from the population.
To make this concrete, we begin with a distribution D on data (X̃, Ỹ) ∈ R^p × {0, 1}. Here X̃ is a feature vector while Ỹ ∈ {0, 1} is the true label, "positive" (1) or "negative" (0). In some applications, it is not possible to observe (negatively) labeled data; instead, we observe unlabeled data (i.e., a feature vector X drawn from the marginal distribution of X̃ under the joint distribution D), and a subsample of positively labeled data (i.e., a feature vector X drawn from the conditional distribution of X̃ | Ỹ = 1 derived from the joint distribution D). The new "label" Y ∈ {0, 1} then specifies whether X was drawn as an unlabeled sample (Y = 0) or as a positive-only draw (Y = 1). Writing γ for the ratio of positive-only samples to unlabeled samples, the conditional distribution Y | X can be calculated as

P(Y = 1 | X = x) = γ P(Ỹ = 1 | X̃ = x) / (γ P(Ỹ = 1 | X̃ = x) + π),

where π = P(Ỹ = 1) is the proportion of positive labels under the original distribution D. Therefore, if the distribution of Ỹ | X̃ in the underlying population satisfies the monotone single-index model, then the same is true for the new PU data. Specifically, if P(Ỹ = 1 | X̃ = x) = g(x^⊤ u*) for some monotone function g and some vector u*, then

P(Y = 1 | X = x) = γ g(x^⊤ u*) / (γ g(x^⊤ u*) + π),

which is also a monotone function of x^⊤ u*, inheriting this property from g. In [33], this problem is addressed with a parametric approach, by assuming that g is known (commonly the sigmoid function, i.e., assuming a logistic model for Ỹ | X̃), and assuming that the proportion π = P(Ỹ = 1) of positive labels in the underlying population is known as well. In settings where π is unknown, however, this approach can no longer be applied, even if g is known; reliable estimation of this proportion is very challenging, heavily relies on model specification [16], and can lead to unreliable inference. In contrast, assuming only that g is a monotone function allows for both unknown π and unknown (monotone) g.
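The PU link above composes g with the monotone map t ↦ γt/(γt + π). A small sketch (ours, assuming the display above; γ and π are chosen arbitrarily) makes the inheritance of monotonicity concrete:

```python
import numpy as np

def pu_link(g, gamma, pi):
    # P(Y=1 | X=x) = gamma*g(x'u) / (gamma*g(x'u) + pi), per the display above
    return lambda s: gamma * g(s) / (gamma * g(s) + pi)

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
h = pu_link(sigmoid, gamma=1.0, pi=0.3)
s = np.linspace(-6, 6, 201)
assert np.all(np.diff(h(s)) > 0)   # monotone, inherited from g
```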
Another interesting aspect of the PU problem is that the distribution of X is, in general, not symmetric. Even in a setting where the distribution of features in the population (i.e., the marginal distribution of X̃) can plausibly be assumed to be symmetric, this will no longer be the case for X: the marginal distribution of X is a mixture of the marginal distribution of X̃ and the conditional distribution of X̃ | Ỹ = 1. Therefore, this setting cannot be addressed with existing methods that rely on a symmetric design assumption.

Prior work
The single index model has been an active area of statistics for several decades, since its introduction in econometrics and statistics [19, 17, 23]. There is an extensive literature on estimation of the index vector u*, which can be broadly classified into two categories: methods that directly estimate the index vector while avoiding estimation of a "nuisance" infinite-dimensional function, and methods that estimate both the infinite-dimensional function and the index vector.
Methods of the first type usually require strong design assumptions such as a Gaussian or elliptically symmetric distribution. Duan and Li [13] proposed an estimator of u* using sliced inverse regression in the context of sufficient dimension reduction; √n-consistency and asymptotic normality of the estimator are established in the low-dimensional regime under an elliptically symmetric design assumption. This result is extended by Lin et al. [25] to the high-dimensional regime under similar design assumptions. Plan et al. [28] studied sparse recovery of the index vector u* when X is Gaussian, and established a non-asymptotic O((s* log(2p/s*)/n)^{1/2}) mean-squared error bound, where s* is the sparsity level, i.e., s* := ||u*||_0. Ai et al. [1] studied the error bound of the same estimator when the design is not Gaussian and showed that the estimator is biased, with a bias depending on the size of ||u*||_∞. The estimator of Plan et al. [28] has been generalized in several directions: Yang et al. [37] propose an estimator based on the score function of the covariates, which relies on prior knowledge of the covariate distribution, and Chen and Banerjee [8] proposed estimators based on U-statistics which include Plan et al. [28] as a special case under a standard Gaussian design.
Methods of the second type usually do not require strong design assumptions, and our method falls into this category. One popular approach is via M-estimation or solving estimating equations. Under smoothness assumptions on g*, methods for link function estimation have been proposed via kernel estimation [19, 15, 23, 12, 10], local polynomials [6, 9], or splines [39, 24]. An estimator for the index vector is then obtained by minimizing a criterion function such as the squared loss [19, 15, 39] or by solving an estimating equation [9, 10]. √n-consistency and asymptotic normality of these estimators were established in the low-dimensional regime.
Another approach is the average derivative method [29, 18], which takes advantage of the fact that

∇_x E[Y | X = x] = g*′(x^⊤ u*) u*,

so that the gradient of the regression function is proportional to u*. An estimator is then obtained through non-parametric estimation of this derivative, again under smoothness assumptions on g*. There are also studies of the single index model in the high-dimensional regime, using a PAC-Bayesian approach [2] or incorporating penalties [35, 30].
There have been an increasing number of studies in which the link function is restricted to have a particular shape, such as monotonicity, but is not necessarily smooth. Kalai and Sastry [22] and Kakade et al. [21] investigated the single index problem in the low-dimensional regime under a monotone and Lipschitz-continuous link assumption, and proposed iterative perceptron-type algorithms with prediction error guarantees; in particular, isotonic regression is used to estimate g*, and u* is then updated based on the estimated g*. Ganti et al. [14] extended [22, 21] to the high-dimensional setting by incorporating a projection step onto a sparse subset of the parameter space. They empirically demonstrated good predictive performance of their algorithms, but no theoretical guarantees were provided. Balabdaoui et al. [3] study the least-squares estimator under a monotonicity constraint and establish its n^{−1/3} consistency. Balabdaoui et al. [4], assuming a smooth link function g*, propose a √n-consistent estimator for the index vector u* based on solving a score equation.
Our work focuses on estimation of the index vector in the high-dimensional regime, where the unknown function is assumed to be Lipschitz-continuous and monotone. We provide an efficient algorithm via iterative hard-thresholding, together with non-asymptotic ℓ_2 and mean-function error bounds. Unlike Ganti et al. [14] and Kakade et al. [21], our algorithm does not require the Lipschitz constant of the unknown function as an input.

Our contributions
Our major contributions can be summarized as follows:
• We develop a scalable projection-based iterative approach, the "Sparse Orthogonal Descent Single-Index Model" (SOD-SIM) algorithm, which alternates between sparse-thresholded orthogonalized "gradient-like" steps and isotonic regression steps to recover the coefficient vector.
• We provide finite-sample convergence guarantees for the SOD-SIM algorithm, for estimating both the parameter vector u* and the mean function g*(Xu*), showing that the error scales as O(s^{1/2} log(np)/n^{1/3}), where s is the sparsity level. Note that the minimax rate for low-dimensional isotonic regression is n^{−1/3}, and our algorithm achieves this rate up to log factors. This rate also matches the n^{−1/3} rate shown for the least-squares estimator in [3]; however, the algorithm for obtaining the least-squares estimator is computationally intensive even in low-dimensional cases, whereas we simultaneously establish the statistical and algorithmic convergence of a computationally efficient algorithm that can also handle the high-dimensional case. For estimating u*, with X elliptically symmetrically distributed, the link-free slicing regression estimator of Duan and Li [13] has been shown to be √n-consistent and asymptotically normal; when g* is bounded and continuously differentiable, the score estimator proposed by Balabdaoui et al. [4] has been shown to be √n-consistent and asymptotically normal. In comparison, our convergence result is non-asymptotic and imposes milder assumptions on X and g*; our algorithm is also much more computationally efficient than the score estimator.
• Finally, we provide empirical results based on simulated and real data, which support our theoretical findings and show that our algorithm performs well compared with existing approaches.

Main results
Let X_1, . . ., X_n ∈ R^p be the feature vectors and Y_1, . . ., Y_n ∈ R the response values, which we model as

Y_i = g*(X_i^⊤ u*) + Z_i,

where g* is monotone non-decreasing, u* is an s*-sparse unit vector, and Z_i ∈ R denotes the noise term. We write X ∈ R^{n×p} for the matrix with rows X_i, i ∈ {1, . . ., n}, and Y ∈ R^n and Z ∈ R^n for the vectors with entries Y_i and Z_i, respectively. For any function g : R → R and any v ∈ R^p, we write g(Xv) to mean that g is applied elementwise to the vector Xv, i.e., g(Xv) = (g(X_1^⊤ v), . . ., g(X_n^⊤ v)).
In the remainder of this section, we first present our SOD-SIM algorithm, and then present finite-sample convergence results for the estimation of u* and g*.

Algorithm
Before defining our algorithm, we first set up notation for the hard-thresholding operator Ψ_s(•) and the isotonic regression operator iso(•). First, let Ψ_s : R^p → R^p be the hard-thresholding operator at sparsity level s, i.e.,

(Ψ_s(v))_j = v_j if j ∈ S, and 0 otherwise,

where S ⊆ {1, . . ., p} indexes the s largest-magnitude entries of v (ties may be broken by any mechanism). Next, for any vectors u ∈ R^p and v ∈ R^n, define

iso_{Xu}(v) = argmin { ||v − w||_2^2 : w ∈ R^n monotone non-decreasing with respect to the ordering of Xu },

i.e., the isotonic regression of v onto Xu. This convex problem can be solved efficiently using the pool-adjacent-violators algorithm (PAVA) [11]. With these definitions in place, we are ready to define our algorithm for solving the sparse single index model. Given a target sparsity level s ≥ 1 and a step size η > 0, the algorithm alternates between taking an orthogonalized gradient-like step (with step size η) and hard-thresholding to enforce s-sparsity. The steps of the algorithm are shown in Algorithm 1.
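For concreteness, both operators are a few lines in Python (a minimal sketch using numpy and scikit-learn, whose IsotonicRegression solves the problem via PAVA; the helper names are ours):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def hard_threshold(v, s):
    """Psi_s(v): keep the s largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argpartition(np.abs(v), -s)[-s:]  # indices of the s largest |v_j|
    out[keep] = v[keep]
    return out

def iso_fit(index, v):
    """iso_{Xu}(v): isotonic regression of v onto the ordering of index = Xu."""
    ir = IsotonicRegression(increasing=True, out_of_bounds="clip")
    return ir.fit_transform(index, v)
```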
Algorithm 1 (SOD-SIM). Initialize u^0 as in (3). Then repeat:
• Estimate the link function by isotonic regression of Y onto the current index Xu^t.
• Take an orthogonalized gradient-like step with step size η, projecting the step direction orthogonally to the current iterate u^t.
• Enforce sparsity and unit norm by hard-thresholding at level s and renormalizing,
until some convergence criterion is reached.
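Putting the pieces together, one full pass of the iteration might look as follows. This is a minimal sketch assembled from the description above and from the quantities used in the proofs (in particular the orthogonal projection P⊥_{u^t} and the initialization u^0 ∝ X^⊤(Y − Ȳ1_n), which we read off from Lemma 1); the exact update in Algorithm 1 may differ in details such as scaling. It reuses hard_threshold and iso_fit from the previous sketch.

```python
import numpy as np

def sod_sim(X, Y, s, eta=0.1, max_iter=500, tol=5e-4):
    n, p = X.shape
    u = hard_threshold(X.T @ (Y - Y.mean()), s)   # initialization u^0, cf. (3)
    u /= np.linalg.norm(u)
    for _ in range(max_iter):
        fitted = iso_fit(X @ u, Y)                # isotonic regression step
        direction = X.T @ (Y - fitted) / n        # "gradient-like" direction
        direction -= u * (u @ direction)          # orthogonal projection P_u^perp
        u_new = hard_threshold(u + eta * direction, s)  # sparse step
        u_new /= np.linalg.norm(u_new)            # back to the unit sphere
        if np.linalg.norm(u_new - u) < tol:       # convergence criterion
            return u_new
        u = u_new
    return u
```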
For our theoretical guarantee to hold, we will see that the target sparsity level s needs to be sufficiently large relative to the true sparsity s* of u*. Under mild assumptions, our theory guarantees that the iterates of this algorithm converge to u* up to an error of order O(s^{1/2} log(np)/n^{1/3}).
The iterative framework of this algorithm is similar to the CSI method of Ganti et al. [14]. A key difference is that, in the g* estimation step, the CSI method enforces a Lipschitz constraint in addition to the monotonicity constraint, which requires choosing the Lipschitz constant in advance, whereas our method only uses isotonic regression to enforce monotonicity. In addition, no theoretical guarantee is given for CSI. Our gradient update step also has some similarities to the least-squares method of Balabdaoui et al. [3]; however, their algorithm explicitly minimizes the objective function ||Y − iso_{Xu}(Y)||_2^2, while our SOD-SIM algorithm recovers u* through an iterative procedure. The descent direction is not exactly the gradient of the least-squares objective, but a modified version chosen for computational simplicity. The orthogonal projection step enforces a steady "gradient-like" descent rate. Empirically, in many cases the algorithm performs similarly without the orthogonal projection step; but in some cases, when the initialization is such that ⟨u^0, u*⟩ is close to 0, the orthogonally projected algorithm performs better. The orthogonal projection step is also important for resolving a technical issue arising from the normalization step in the proof.

Convergence guarantees for fixed design X
We first provide upper bounds on the convergence rates for the estimation of u* and g* when X is fixed. We begin with our assumptions.

Assumptions
We assume the model

Y_i = g*(X_i^⊤ u*) + Z_i, i = 1, . . ., n,   (4)

and we place the following assumptions on the underlying function g*, the parameter vector u*, the design matrix X, and the noise Z.

Assumptions on the signal We assume g* is a bounded, Lipschitz, and monotone non-decreasing function: for t ≤ t′,

g*(t) ≤ g*(t′),  |g*(t) − g*(t′)| ≤ L|t − t′|,  and sup_t |g*(t)| ≤ B,   (5)

while the true parameter vector u* is a sparse unit vector,

||u*||_0 ≤ s*,  ||u*||_2 = 1.   (6)

For the design matrix X, we assume upper-bounded sparse eigenvalues,

(1/n)||Xu||_2^2 ≤ β ||u||_2^2 for all (2s + s*)-sparse u ∈ R^p.   (7)

For the corresponding lower bound, we require an identifiability condition, which is slightly stronger than the usual restricted-eigenvalue-type condition; this is because we are working with a substantially larger class of models than the usual parametric regression setting. In particular, there exists ε_n > 0 such that, for any s-sparse unit vector u with ||u − u*||_2 ≥ ε_n and any monotone non-decreasing g,

(1/n)||g(Xu) − g*(Xu*)||_2^2 ≥ α ||u − u*||_2^2.   (8)

This condition effectively ensures that we cannot reproduce the true regression function g*(Xu*) with some other sparse vector u ≠ u* without any loss in accuracy, unless ||u − u*||_2 is very small. In particular, comparing assumptions (5), (7), and (8), we see that we must have α ≤ β. To help interpret the parameters in this assumption, we should think of α and β as constants (expressing properties that hold regardless of sample size), while ε_n vanishes as n → ∞ (i.e., the identifiability condition holds for vectors u increasingly close to u* as the sample size n increases). Later, in Section 2.3, we show that (7) and (8) are satisfied with high probability when X is drawn from a mixture of Gaussian distributions.
Assumptions on the noise For Z, we assume that Z_1, . . ., Z_n are independent, zero-mean, and σ-subgaussian: for all 1 ≤ i ≤ n,

E[Z_i] = 0  and  E[e^{λ Z_i}] ≤ e^{λ^2 σ^2 / 2} for all λ ∈ R.   (9)

Convergence guarantee for the estimation of u
Theorem 1. Assume the model (4) and conditions (5)-(8) hold for the signal, and condition (9) holds for the noise. Assume we run the algorithm with step size η > 0 and working sparsity level s ≥ 1 satisfying

η ≤ 1/(Lβ)  and  s ≥ c_s s*.   (10)

For δ > 0, assume furthermore that the sample size satisfies

n ≥ C_0 (s log(np) + log(1/δ))   (11)

for a sufficiently large constant C_0. Then, with probability at least 1 − δ, it holds for all t ≥ 0 that

||u^t − u*||_2^2 ≤ r^t ||u^0 − u*||_2^2 + Δ,

where r ∈ (0, 1) and Δ = C (ε_n + s^{1/2} log(np) / n^{1/3})^2, and where the constant C (specified in the proof) depends on α, β, L, B, σ, η, c_s, but not on n, p, s, δ, and ε_n.
This theorem provides information about both computational and statistical convergence in estimating u*. From a computational perspective, the algorithm converges linearly up to the time when it reaches the statistical error Δ. In particular, for any tolerance τ > Δ, running the algorithm for

t ≥ log( ||u^0 − u*||_2^2 / (τ − Δ) ) / log(1/r)

iterations will lead to ||u^t − u*||_2^2 ≤ τ (with probability at least 1 − δ).
Next, we consider the dependence on the sample size n. Suppose ε_n ≤ O(n^{−1/3}) (as we will see in Proposition 2 below, this holds with high probability under a random model for X). Then the resulting estimation error for u* scales with the sample size as n^{−1/3} (up to log factors). Balabdaoui et al. [3] also obtained n^{−1/3} consistency of a least-squares estimator of u* under similar assumptions. With g* monotone increasing and twice continuously differentiable, and some further assumptions on u* and X, Balabdaoui et al. [4] proposed a score-based estimator which is √n-consistent, i.e., its error scales with sample size as n^{−1/2}.
Comparing these results, an open question remains regarding the convergence rate of our algorithm in settings where the true g* is smooth: the n^{−1/2} rate achieved by Balabdaoui et al. [4] under slightly stronger assumptions suggests that our n^{−1/3} rate might be improvable under an additional smoothness assumption. In Section 3.1.2 below, we examine this question empirically, and will see that while the n^{−1/3} rate appears to be tight for non-smooth g*, simulation results with a smooth g* suggest that an improved rate may be possible in that setting.

Convergence guarantee for the estimation of g
We show that, with the SOD-SIM algorithm of Section 2.1, the estimate of g* has convergence rate O(ε_n + s^{1/2} log(np)/n^{1/3}). In particular, assuming ε_n ≤ O(n^{−1/3}), the estimate of g* converges at the rate O(n^{−1/3}), omitting log terms.

Proposition 1. Assume the model (4) and conditions (5)-(8) hold for the signal, and condition (9) holds for the noise. For δ > 0, assume (10) and (11) hold for the step size η, the sparsity levels s and s*, and n, p. Then, for n sufficiently large, with probability at least 1 − δ, for all t ≥ 1 the prediction error is bounded by

(1/√n) ||iso_{Xu^t}(Y) − g*(Xu*)||_2 ≤ C ( r^{t/2} + ε_n + s^{1/2} log(np) / n^{1/3} ),

where r is defined in Theorem 1, and where the constant C (specified in the proof) depends on α, β, L, B, σ, η, c_s, but not on n, p, s, δ, and ε_n.
The proof of Proposition 1 is deferred to Section A.4. We show that our estimator iso_{Xu^t}(Y) of g* converges at the n^{−1/3} rate in ℓ_2 norm. This upper bound matches the corresponding lower bound up to log factors, since the minimax rate for isotonic regression is n^{−1/3} [7]. The result also matches the convergence results shown in [3] and [4].
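In practice, the link estimate used for prediction is the isotonic fit on the training index, interpolated at new index values. A short sketch (assumed workflow; the helper and array names are ours):

```python
from sklearn.isotonic import IsotonicRegression

def fit_link(X, Y, u_hat):
    # ghat = iso_{X u_hat}(Y): isotonic regression of Y on the fitted index
    ir = IsotonicRegression(increasing=True, out_of_bounds="clip")
    ir.fit(X @ u_hat, Y)
    return ir

# hypothetical usage, with u_hat from SOD-SIM and a held-out X_test:
# ghat = fit_link(X_train, Y_train, u_hat)
# Y_pred = ghat.predict(X_test @ u_hat)
```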

Convergence guarantees with random design X
In this section, we verify that the assumptions of Section 2.2.1 are likely to hold for random design matrices X. In particular, we establish the convergence rate for a random design X with a normal mixture distribution, a much milder condition than a symmetric elliptical design.

Assumptions
We assume the same model (4) as in Section 2.2.1, and place the following assumptions on g*(•), X, u*, and Z under the random design.
Assumptions on the signal Assume g*(•) is bounded, Lipschitz, and monotone non-decreasing (5), and u* is a sparse unit vector (6). The rows of X are i.i.d. draws from a distribution on R^p (in our setting, a mixture of Gaussians with bounded component means and with component covariances whose restricted eigenvalues are bounded above and below), and g*(•), X, and u* satisfy the non-degeneracy condition

Var( g*(X^⊤ u*) ) ≥ ν^2 > 0.   (15)

Assumptions on the noise Conditional on X, the noise Z is subgaussian with scale σ, i.e.,

E[ e^{⟨v, Z⟩} | X ] ≤ e^{σ^2 ||v||_2^2 / 2}   almost surely over X, for any fixed v ∈ R^n.   (16)
Proposition 2 verifies that, under the random design setting of Section 2.3.1, conditions (7) and (8) are satisfied with high probability. This implies that when X has a normal mixture distribution, with high probability the results of Theorem 1 still apply, i.e., our proposed SOD-SIM algorithm attains the n^{−1/3} convergence rate in ℓ_2 norm for the estimation of the index u*. The proof is deferred to Section A.5.

Proof of Theorem 1
We will now prove our convergence theorem. Detailed proofs of the supporting lemmas are presented in Section A.

Define the noise functionals

Err_∞(Z) = (1/n) ||X^⊤(Z − Z̄ 1_n)||_∞

and

Err_iso(Z) = sup over s-sparse unit vectors u of (1/√n) ||iso_{Xu}(Y) − iso_{Xu}(µ*)||_2,  where µ* = g*(Xu*).

First, we verify that the initialization u^0 defined in (3) is well-defined (i.e., X^⊤(Y − Ȳ 1_n) ≠ 0), and is no worse than a random guess, meaning that ⟨u^0, u*⟩ ≥ 0.

Lemma 1. Assume the model (4) and conditions (5)-(8) on the signal; then the initialization u^0 defined in (3) is well-defined and satisfies ⟨u^0, u*⟩ ≥ 0.

Next, we prove a bound on the iterative update step of the algorithm.
Proof of Lemma 3. We prove the lemma by induction, showing that (18) holds for each t ≥ 0. At t = 0, the claim follows since, by Lemma 1, u^0 and u* are unit vectors satisfying ⟨u^0, u*⟩ ≥ 0, so that ||u^0 − u*||_2^2 = 2 − 2⟨u^0, u*⟩ ≤ 2. Therefore (18) holds at t = 0. Now suppose (18) holds at t = T ≥ 0. Applying Lemma 2 and then the inductive hypothesis (18) with t = T, we conclude that (18) holds with t = T + 1, completing the proof.
With these deterministic results in place, we now bound Err_∞(Z) and Err_iso(Z) under our assumptions on the noise Z.

Lemma 4. Assume the model (4) and conditions (5)-(8) on the signal, and (9) on the noise. For any δ > 0, with probability at least 1 − δ, it holds that

Err_∞(Z) ≤ C σ √( β log(p/δ) / n )  and  Err_iso(Z) ≤ C (B + σ) s^{1/2} log(np/δ) / n^{1/3},

for a constant C. Combining Lemma 3 with Lemma 4, we have proved Theorem 1, with the constant C given explicitly in terms of α, β, L, B, σ, η, c_s.

Experiments
In this section, we first use simulation experiments to demonstrate the performance of the SOD-SIM algorithm in estimating u* and to explore the lower bound for the convergence rate; we then apply the SOD-SIM algorithm to a classification problem with PU Rocker protein data.

Performance in estimating u for PU data
In this section, we study the performance of the SOD-SIM algorithm in estimating u* for PU data when the prevalence π is unknown.
Methods We simulate PU data with 400 positive samples and 400 unlabeled samples. Samples in the population are independent, and for individual i, X_i ∈ R^p ∼ N(0, Σ_ρ), where Σ_ρ is the autoregressive covariance matrix with (i, j)-th entry ρ^{|i−j|}. We let u* be a sparse unit vector with sparsity level s* = 2, and we vary p = 100, 400, 800, 1600.

We compare the performance of our proposed SOD-SIM in estimating u* with a logistic model fit by ℓ_1-penalized maximum likelihood (sparseLR). For sparseLR, we choose the tuning parameter by cross-validation. For the SOD-SIM algorithm, we set the working sparsity s = 10 and the learning rate η = 0.1. We repeat the simulation M = 100 times.
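A sketch of this data-generating scheme (our reading of the setup; the true vector u* and link g are user-supplied, since their exact values are not reproduced here): unlabeled samples are population draws, and positive samples are population draws accepted with probability g(X^⊤u*), i.e., draws from X | Ỹ = 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_cov(p, rho):
    """Sigma_rho with (i, j) entry rho^|i-j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate_pu(n_pos, n_unl, p, rho, u_star, g):
    L = np.linalg.cholesky(ar1_cov(p, rho))
    X_unl = rng.standard_normal((n_unl, p)) @ L.T       # population draws
    X_pos = np.empty((0, p))
    while X_pos.shape[0] < n_pos:                       # rejection sampling: X | Ytilde = 1
        cand = rng.standard_normal((n_pos, p)) @ L.T
        keep = rng.random(n_pos) < g(cand @ u_star)
        X_pos = np.vstack([X_pos, cand[keep]])
    X = np.vstack([X_pos[:n_pos], X_unl])
    Y = np.r_[np.ones(n_pos), np.zeros(n_unl)]          # 1 = positive-only, 0 = unlabeled
    return X, Y
```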

Results
The performance is shown in Figure 1. Since u* is identifiable only up to direction, both u* and the estimates û are rescaled to have unit norm. Besides bias, standard deviation (SD), and root mean square error (RMSE), we also report the mean inner product of û and u* as their correlation. For all settings with different p, the SOD-SIM method has smaller bias and SD than the sparseLR method, and the correlation of SOD-SIM with u* is closer to 1. As the dimension p increases, the performance of the sparseLR method deteriorates, whereas the SOD-SIM method maintains good performance. The results show that the misspecification of the link function and the non-symmetric covariate distribution of the PU data bias the estimation of u* by the parametric sparseLR method; the proposed SOD-SIM method performs well for PU data without specifying the prevalence π.
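For reference, the reported summaries can be computed as follows (a sketch under our reading of the setup: estimates and truth are first rescaled to unit norm, and "correlation" is the mean inner product):

```python
import numpy as np

def u_summaries(u_hats, u_star):
    U = np.array([u / np.linalg.norm(u) for u in u_hats])   # M x p, unit rows
    u0 = u_star / np.linalg.norm(u_star)
    bias = np.linalg.norm(U.mean(axis=0) - u0)
    sd = np.sqrt(((U - U.mean(axis=0)) ** 2).sum(axis=1).mean())
    rmse = np.sqrt(((U - u0) ** 2).sum(axis=1).mean())
    corr = (U @ u0).mean()                                  # mean <u_hat, u*>
    return {"bias": bias, "SD": sd, "RMSE": rmse, "corr": corr}
```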

Lower bound exploration
In this section, we study whether our convergence rate for ||u^t − u*||_2 is tight in terms of its dependence on the sample size n. We are interested in two questions: (1) Is the upper bound of Theorem 1 tight for the SOD-SIM algorithm? (2) Does the SOD-SIM algorithm attain the convergence rate of the global minimizer?
We first construct the link functions g*_n(x). In Example 1, we construct a g*_n(x) whose second derivative is unbounded as n → ∞: the construction views a fixed base function f at scale n^{1/3}, so that (g*_n)″(x) ∝ n^{1/3} f″(n^{1/3} x) ∝ n^{1/3}, with a tuning constant set to 0.1. In Example 2, we construct a smoother version of g*_n(x) (again with the tuning constant set to 0.1), whose limiting second derivative is bounded as n → ∞. These two example functions are plotted in Figure 2.
To construct the covariates, for each sample we first generate t_i ∼ Unif[0, 1] and define the two covariates X_{1i} and X_{2i} as functions of t_i, where ε_i ∼ Unif[−0.5, 0.5] are i.i.d. and independent of X_{1i}. The constants sd_1 and sd_2 are chosen so that X_1 and X_2 have roughly equal variances.

Results
We plot the log root mean square error (log(RMSE)) against log(n) to show the rate of convergence. Example 1 (the non-smooth example) has a convergence rate close to n^{−1/3}, which indicates that our proved convergence rate for the SOD-SIM algorithm is tight and matches the rate of the global minimizer. The second example (with bounded limiting second derivative) has a faster convergence rate, close to n^{−1/2}.
These exploratory simulation results match the reported convergence rate of the score-based estimator in [4]. We do not have a definitive answer for the convergence rate of our algorithm when the function g* is smooth; a better rate could potentially be established with alternative proof techniques, which is beyond the scope of this work.

Application to rocker protein data
In this section, we apply our SOD-SIM algorithm to the Rocker protein sequence data. The Rocker data are composed of sequences with confirmed function (Y = 1) and sequences with unknown functionality (Y = 0).
To give some biological context, Rocker is a de novo designed protein that recreates the biological function of substrate transport [20]. A functional protein successfully transports ions across cell membranes. To study how a mutation at each site affects the membrane transport ability of the protein, mutations are introduced into the original Rocker sequence (wild-type) by double-site saturation mutagenesis. Mutated sequences are screened by fluorescence-activated cell sorting (FACS), which sorts out functional protein variants. Due to the experimental challenge of obtaining negative sequences, an additional set of sequences is taken from the initial library, whose associated functionality is unknown. The resulting data set is therefore a positive-unlabeled (PU) dataset: the first set consists of functional sequence variants and the second set consists of unlabeled examples.
Methods A sequence consists of 25 positions, each taking one of 21 discrete values corresponding to the 20 amino acid letter codes plus an additional letter for the alignment gap. Each sequence contains at most two mutations. The data consist of 703,030 functional (positively labeled) and 1,287,155 unlabeled sequences. To compare the algorithms reliably, we split the data into 10 subsets, and use half of each split for training and the remaining half for testing. Each training dataset contains 35K labeled and 64K unlabeled examples on average. We generate features with main (site-wise) and pairwise (interaction between two sites) effects using one-hot encoding of the sequences. After removing columns with zero counts, we obtain 27K features on average, of which 490 correspond to main effects. We take the amino acid levels in the wild-type sequence as the baseline levels, yielding a sparse design matrix X ∈ R^{n×p} with (n, p) ≈ (100K, 27K). The number of unique sequences is around 26K (i.e., the rank of the row space of X is ≈ 26K), which makes the problem high-dimensional. The response vector Y ∈ R^n indicates whether each sequence i is labeled (Y_i = 1) or unlabeled (Y_i = 0).
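A sketch of the feature construction (ours; wild-type baseline removal and zero-count filtering are omitted, and for the full roughly 100K × 27K design a sparse matrix would be advisable):

```python
import numpy as np
from itertools import combinations

def one_hot_features(seqs, alphabet):
    n, n_sites = len(seqs), len(seqs[0])
    A = len(alphabet)                      # 21: 20 amino acids + gap
    code = {a: i for i, a in enumerate(alphabet)}
    main = np.zeros((n, n_sites * A))      # site-wise (main) effects
    for i, s in enumerate(seqs):
        for j, a in enumerate(s):
            main[i, j * A + code[a]] = 1.0
    blocks = [main]
    for j, k in combinations(range(n_sites), 2):   # pairwise interactions
        bj = main[:, j * A:(j + 1) * A]
        bk = main[:, k * A:(k + 1) * A]
        blocks.append(np.einsum("ia,ib->iab", bj, bk).reshape(n, -1))
    return np.hstack(blocks)
```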
We applied four methods to each training dataset, denoted as follows:
• SOD-SIM: our proposed algorithm.
• sparseLR: logistic regression with ℓ_1 penalty.
• PV1: the method proposed in [26]. We solve the objective

maximize ⟨Y, Xu⟩ over u ∈ K(s),

where K(s) is the s-approximately-sparse set K(s) := {x ∈ R^p : ||x||_2 ≤ 1, ||x||_1 ≤ √s}.
• PV2: the method proposed in [27].
We used the glmnet package to solve the ℓ_1-penalized logistic regression objective. To solve the objectives in PV1 and PV2, we implemented projected gradient methods, where we iteratively project onto the intersection of the ℓ_1 and ℓ_2 balls, using Dykstra's projection algorithm to compute the projection onto the intersection of the two balls [5]. Note that the objectives in PV1 and PV2 are both convex, so projected gradient descent is guaranteed to find a global optimum. For PV1 and PV2, we input the standardized X, with each column centered and scaled. Since sparseLR, PV1, and PV2 are convex problems, we run the algorithms until convergence. For SOD-SIM, we set the learning rate η = 1 and run the algorithm until the change in the estimated parameters is small (< 0.0005) or the iteration count t reaches a pre-defined maximum (≤ 1000). In addition, the PV1 and PV2 methods do not provide estimates of the link function; we therefore run an isotonic regression after obtaining û_{PV1} and û_{PV2} to estimate the link function, and use these estimates for the downstream prediction tasks.
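For completeness, here is a sketch (ours, not the paper's code) of the projection used for PV1 and PV2: Dykstra's alternating projections onto the intersection of the unit ℓ_2 ball and an ℓ_1 ball, with the ℓ_1 projection computed by the standard sort-based method of Duchi et al.

```python
import numpy as np

def project_l2(x, r=1.0):
    nrm = np.linalg.norm(x)
    return x if nrm <= r else x * (r / nrm)

def project_l1(x, r):
    """Euclidean projection onto the l1 ball of radius r (sort-based method)."""
    if np.abs(x).sum() <= r:
        return x
    u = np.sort(np.abs(x))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(x) + 1) > css - r)[0][-1]
    tau = (css[k] - r) / (k + 1.0)
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def dykstra_l1_l2(x, l1_radius, n_iter=100):
    """Dykstra's alternating projections onto B_2(1) intersect B_1(l1_radius)."""
    p_inc, q_inc = np.zeros_like(x), np.zeros_like(x)
    z = x.copy()
    for _ in range(n_iter):
        y = project_l2(z + p_inc)          # project onto the l2 ball
        p_inc = z + p_inc - y              # update the first increment
        z = project_l1(y + q_inc, l1_radius)  # project onto the l1 ball
        q_inc = y + q_inc - z              # update the second increment
    return z
```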
We used four metrics (accuracy, F1 score, Brier score, and AUC) to evaluate the predictive performance of the four methods. Accuracy is the proportion of correctly classified examples. The F1 score is the harmonic mean of precision and recall, with values between 0 and 1; higher values correspond to better performance. The Brier score is the average squared error, Brier Score := (1/n_test) Σ_{i=1}^{n_test} (y_i − ŷ_i)^2, so lower values correspond to better performance. AUC measures the area under the ROC curve; perfect classification corresponds to an AUC of 1, and random classification to 0.5. For the choice of hyperparameters (s for SOD-SIM, PV1, and PV2, and λ for the ℓ_1-penalized logistic regression), we use a grid of 100 working sparsity values s from 1 to p, interpolated on a square-root scale, and a grid of 100 λ values obtained from the glmnet package by default. We picked the hyperparameters that gave the best results on most of the metrics on the test datasets.
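These metrics are standard; a minimal sketch of the evaluation (ours, using scikit-learn; the 0.5 threshold for the hard classification is an assumption):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, prob):
    y_hat = (prob >= 0.5).astype(int)   # hard labels from predicted probabilities
    return {
        "Accuracy": accuracy_score(y_true, y_hat),
        "F1": f1_score(y_true, y_hat),
        "Brier": np.mean((y_true - prob) ** 2),
        "AUC": roc_auc_score(y_true, prob),
    }
```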
Results Figure 3 shows the empirical performance of the four methods. SOD-SIM performed best on all metrics on average. The predictive performance of the sparse logistic regression was slightly worse than that of SOD-SIM, likely due to misspecification error, since it forces the link function to be a sigmoid. Both PV1 and PV2 appear to suffer from the deviation of X from the Gaussian design, although PV2 seems to be more robust to this deviation.

Discussion
In this paper, we propose a scalable iterative projection-based algorithm (SOD-SIM) for the estimation of hdSIMs, for which we establish statistical convergence guarantees for the estimation of both the coefficient vector u* and the unknown function g*. From the perspective of minimax results for isotonic regression, our convergence rate for g* is tight with respect to n up to poly-log terms. Our simulation results suggest that, under our mild model assumptions, the convergence rate for the estimation of u* is also tight. However, our simulation results for the smooth-g* case suggest that the convergence rate might decay faster with sample size n under stronger model assumptions: while the theoretical upper bounds proved here scale as n^{−1/3} (ignoring log terms), the simulations suggest that the rate might be improvable in settings with additional smoothness conditions. The gap between these two rates in the smooth case remains an open question.
Our statistical guarantee requires very mild model assumptions, especially on the covariates X, which allows our theory to cover many useful examples such as PU data. Our simulation studies show that even if the covariates are normally distributed at the population level, the positive-only data have an asymmetric distribution, and parametric models such as sparseLR suffer from estimation bias caused both by misspecification of the link function and by the asymmetric covariate distribution; in contrast, our SOD-SIM algorithm performs well in estimating u* in high-dimensional settings. In our real data analysis, our method (SOD-SIM) also outperforms the methods based on the slicing regression idea (PV1, PV2).

A Additional proofs
A.1 Proof of Lemma 2
First, writing ū = Ψ_s(ũ), since u* is an s*-sparse unit vector, we can calculate the distance from ū to u*, where the second equality holds since ũ equals u plus a vector orthogonal to u. Since u* is a unit vector, we have ||ǔ − u*||_2 ≤ ||ū − u*||_2, since ǔ = ū/||ū||_2 is the projection of ū onto the unit ball, a convex set containing u*. A short calculation then reduces the problem to controlling an inner product. By the definition of ũ, we can expand this inner product, where the last step applies Lemma 5 (presented below). Plugging this back into (21) and rearranging terms, we obtain (22). Moreover, noting that 1_n^⊤(y − iso_v(y)) = 0 for any y, v ∈ R^n by properties of isotonic regression, we can rewrite this accordingly. Next, working with this last term, denote µ* = g*(Xu*), and plug in the definitions of Err_∞(Z) and Err_iso(Z) from above. Since u, ū are s-sparse and u* is s*-sparse, the vector P⊥_u ū − u* is (2s + s*)-sparse, so our condition (7) bounding the sparse eigenvalues of X applies. Combining everything and returning to (22), we next work with the remaining inner product: to prove the last step, we apply (7) to the first term, apply Lemma 6 (presented below) to the second term, and observe that the third term is ≤ 0, because ⟨u, u*⟩ ≥ 0 by assumption while ⟨µ* − iso_{Xu}(µ*), Xu⟩ ≤ 0 by the definition of isotonic regression (isotonic regression is a projection onto a convex cone). Finally, applying Cauchy-Schwarz and plugging these calculations back into (23), together with the assumption that η ≤ 1/(Lβ), we obtain (24). Next, we split into cases. If ||u − u*||_2 ≤ ε_n, then (24) implies the first claimed bound, which can be relaxed as stated. If instead ||u − u*||_2 ≥ ε_n, then by the identifiability assumption (8), and noting that by the definition of isotonic regression we can write iso_{Xu}(µ*) = g(Xu) for some monotone non-decreasing function g, (24) implies the second claimed bound, which can again be relaxed. Finally, since we proved above that ||ǔ − u*||_2 ≤ ||ū − u*||_2, this completes the proof.

A.1.1 Supporting lemmas
Lemma 5 (Liu & Barber 2018, Lemma 1). For any v ∈ R^p and s*-sparse u* ∈ R^p, the hard-thresholded vector satisfies ||Ψ_s(v) − u*||_2 ≤ c(s, s*) ||v − u*||_2, where c(s, s*) depends only on the sparsity levels.

Lemma 6. For any vectors v, w ∈ R^n and any L-Lipschitz monotone non-decreasing function g, the isotonic fit iso_w(g(v)) can be written as (g(t_1), . . ., g(t_n)) for values t_i that are monotone with respect to the ordering of w.

Proof of Lemma 6. Let a_1 < · · · < a_K be the unique values of the vector iso_w(g(v)), and for each k = 1, . . ., K let S_k be the set of indices on which this value is attained. By the definition of isotonic regression, each a_k is a convex combination of values taken by g on the corresponding entries of v; a preimage t_k with g(t_k) = a_k must therefore exist, since g : R → R is continuous (due to being Lipschitz). Assigning t_k to the indices in S_k yields the claim.

A.2 Proof of Lemma 4
First we bound Err_∞(Z). By the sparse eigenvalue bound (7) on X, we have ||X_j||_2 = ||X e_j||_2 ≤ √(βn) for all j = 1, . . ., p. Since Z = Y − µ*, the vector Z − Z̄1_n is zero-mean and σ-subgaussian, and therefore X_j^⊤(Z − Z̄1_n) is also zero-mean and σ√(βn)-subgaussian. A tail bound for each j, followed by a union bound over j = 1, . . ., p, yields the claimed bound on Err_∞(Z).

Next we turn to Err_iso(Z). First, for any vector v ∈ R^n, define π_v to be a permutation of {1, . . ., n} such that v_{π_v(1)} ≤ · · · ≤ v_{π_v(n)}, i.e., π_v rearranges the indices into the ordering of v. (If v has repeated entries then π_v is not unique, but we can arbitrarily choose one such permutation for each v.) For any fixed v, we compute a deterministic bound on ||iso_v(Y) − iso_v(µ*)||_2. We know that µ* ∈ [−B, B]^n, and therefore iso_v(µ*) ∈ [−B, B]^n and is monotone non-decreasing by definition. By Lemma 11.1 in [7], we can find a partition of the sorted indices into M = n^{1/3} consecutive blocks such that iso_v(µ*) has little variation within each index subset S_m; defining k_0 = 0 and k_M = n, we can write S_m = {π_v(k_{m−1} + 1), . . ., π_v(k_m)} for each m = 1, . . ., M.

Next, for any permutation π of {1, . . ., n}, define the seminorm ||·||_{SW,π}: this is the "sliding window" norm of [36], computed along the ordering of the vector v. By Theorem 1 and Lemma 2 of [36], isotonic regression is contractive in this seminorm: ||iso_v(x) − iso_v(x′)||_{SW,π_v} ≤ ||x − x′||_{SW,π_v} for all x, x′ ∈ R^n. Fixing any index ℓ ∈ {1, . . ., n} with ℓ ∈ S_m, and combining the block construction with this contraction, we obtain, for n ≥ 2,

||iso_v(Y) − iso_v(µ*)||_2 ≤ 2Bn^{1/6} + ||Z||_{SW,π_v} √( 2 n^{1/6} log n ).

Finally, for σ-subgaussian zero-mean Z and any fixed permutation π, Lemma 3 of [36] bounds ||Z||_{SW,π} with high probability. To take a supremum over all s-sparse u, note that for any s-sparse u ∈ R^p, letting A ⊆ [p] be a subset of size |A| = s containing the support of u, and writing π_u(x_1, . . ., x_n) for the permutation that sorts x_1^⊤u ≤ x_2^⊤u ≤ · · · ≤ x_n^⊤u, it is clear that π_u(x_1, . . ., x_n) ∈ S_{n,s}(x_{1A}, . . ., x_{nA}); furthermore, the number of such permutations is bounded as in Lemma 7 below, where the last step holds by assumption (7). Combining everything, and recalling the deterministic bound above, completes the proof.

A.4 Proof of Proposition 1
By the definition of isotonic regression, together with the contractive property of isotonic regression in ℓ_2 norm (see Yang and Barber [36]), the prediction error can be bounded in terms of ||u^t − u*||_2 and Err_iso(Z), where Err_iso(Z) is defined as in the proof of Theorem 1. We finish the proof by plugging in the results from Lemma 4 and Theorem 1, where α′ > 0 is a constant that does not depend on n, p, s, ε_n.

For Term 2, we use concentration bounds. First, since both g and g* take values in [−B, B], we have |g(X_i^⊤u) − g*(X_i^⊤u*)| ∈ [0, 2B] for any X_i, any non-decreasing g : R → [−B, B], and any u ∈ S^{p−1}_s; hence replacing a single X_i with a different vector X_i′ can perturb the value of Term 2 by at most 2B/n. Therefore, by McDiarmid's inequality [31], Term 2 concentrates around its expectation for any δ_1 > 0. Next, by the symmetrization lemma [32], the expectation of Term 2 is bounded by a Rademacher complexity, where ξ_i ∼ Unif{±1} are i.i.d. Rademacher variables drawn independently of the X_i's. Next, consider the composition with the absolute value function, which is 1-Lipschitz; by the Lipschitz contraction inequality for Rademacher complexity, the complexity is bounded by that of the underlying function class, and the last inequality follows from Lemma 11. Combining everything, with probability at least 1 − δ/2, the claimed bound on Term 2 holds. This concludes the proof of Lemma 9.
Proof. Let A ∈ {1, . . ., K} be the latent variable indicating which mixture component X was drawn from. Fix any u ∈ S^{p−1}_s. Then, conditional on A = k, the pair (R_k, S_k) := (X^⊤u, X^⊤u*) is bivariate normal, with means, variances, and correlation computable from the component parameters, where c_0 and c_1 denote the smallest and largest restricted eigenvalues of Σ_k; the relevant quantities are bounded since u and u* are both unit vectors. Suppose for the moment that g*(T) − g*(−T) ≥ C for some T > 0 and C > 0. Applying Lemma 12 in the Appendix, we obtain, for each k = 1, . . ., K, a lower bound with constant C_0 > 0 depending only on L, C, T, c_0, V. Plugging in our bound on the correlation ρ, the desired inequality must hold, since c_0 ≤ c_1 and ||u − u*||_2^2 ≤ 4 (because u and u* are unit vectors).

Since this is true for each k, the bound extends to the mixture. Finally, we need to exhibit T, C > 0 such that g*(T) − g*(−T) ≥ C. Recall that Var(g*(X^⊤u*)) ≥ ν^2 > 0 by assumption, and that g* takes values in [−B, B]. Choosing T according to our bounds on µ_k and Σ_k, and setting C = g*(T) − g*(−T), we conclude that the identifiability condition holds for all u ∈ S^{p−1}_s, with a constant α > 0 depending on L, c_0, c_1, B, ν, V.
Proof of Lemma 11. The expectation can be simplified using the fact that E[ξ_i] = 0. Next, any non-decreasing function g : R → [0, 1] can be written as a (possibly infinite) convex combination of step functions; this means that, for any u, the supremum over g is attained at some step function. Now let π_{u;X_1,...,X_n} be the permutation of {1, . . ., n} induced by the ordering of the X_i^⊤u's, i.e., the permutation π ∈ S_n that sorts these values in non-decreasing order. To handle the possibility that this does not define a unique permutation (since X_i^⊤u = X_{i′}^⊤u may occur for some i ≠ i′), we place an arbitrary total ordering on S_n and, in the case of ties, define π_{u;X_1,...,X_n} as the minimal permutation satisfying the ordering above.

The supremum can then be rewritten as a supremum over the set of all permutations attained by any sparse vector u (using the given feature vectors X_1, . . ., X_n). Each term (1/n) Σ_{i≥k} ξ_{π(i)} is a zero-mean subgaussian random variable at scale n^{−1/2}, and we are taking a supremum over at most n^{2s−1}p^s · n = n^{2s}p^s such variables; the expected value is therefore bounded accordingly. Turning to the proof of Lemma 12: first consider the case ρ ≤ ρ_0, where ρ_0 ∈ [0, 1) is chosen so that the relevant probability is minimized at ρ = ρ_0 (under the assumption ρ ≤ ρ_0), and is therefore lower-bounded by some positive value P_0 = P_0(T, ρ_0). The conclusion of the lemma then holds in this case, for a sufficiently small choice of C_0.

Figure 1: Empirical performance of the SOD-SIM and sparseLR algorithms on PU data in terms of bias, SD, correlation and RMSE with their 95% confidence intervals.

Figure 3: Average accuracy rates, F1 scores, 1-Brier Score, and AUC values of the four methods on 10 test datasets.Higher values correspond to better performance in all four plots.The error bars represent one standard error.

so that (26) holds for all s-sparse unit vectors u, and any non-decreasing function g : R → [−B, B].

Moreover, √( log(2n^{2s} p^s) / n ) ≤ 2 √( s log(np) / n ) (where the last step holds as long as p ≥ 2).

Lemma 12. Let (R, S) be bivariate normal with means m_1, m_2, standard deviations r, s, and correlation ρ, and let g, g* : R → R be any non-decreasing functions such that g* is L-Lipschitz and satisfies g*(T) − g*(−T) ≥ C > 0 for some T > 0. Then

E|g(R) − g*(S)| ≥ C_0 √( 1 − (ρ ∨ 0)^2 ),

where C_0 > 0 is a function of L, T, C, s, |m_2| (not dependent on ρ), and is defined in the proof.

Proof of Lemma 12. First, we replace R with R̃ = (R − m_1)/r and S with S̃ = (S − m_2)/s. Define the non-decreasing functions g̃(t) = g(m_1 + r·t) and g̃*(t) = g*(m_2 + s·t). Then we are looking to lower-bound E|g(R) − g*(S)| = E|g̃(R̃) − g̃*(S̃)|. Note that g̃* is L̃-Lipschitz for L̃ = Ls, and g̃*((T − m_2)/s) − g̃*((−T − m_2)/s) = g*(T) − g*(−T) ≥ C; since g̃* is non-decreasing, we can weaken this bound to g̃*(T̃) − g̃*(−T̃) ≥ C, where T̃ = (T + |m_2|)/s.
[37] Z. Yang, K. Balasubramanian, and H. Liu. High-dimensional non-Gaussian single index models via thresholded score function estimation. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3851-3860, Sydney, Australia, 2017. PMLR.
Lemma 7. For any sample size n, dimension p, sparsity level s, and points x_1, . . ., x_n ∈ R^p, the set Π_{p,s}(x_1, . . ., x_n) of permutations induced by sorting (x_1^⊤u, . . ., x_n^⊤u) over s-sparse vectors u satisfies |Π_{p,s}(x_1, . . ., x_n)| ≤ n^{2s−1} p^s.