Estimating multi-index models with response-conditional least squares

The multi-index model is a simple yet powerful high-dimensional regression model which circumvents the curse of dimensionality by assuming $ \mathbb{E} [ Y | X ] = g(A^\top X) $ for some unknown index space $A$ and link function $g$. In this paper we introduce a method for the estimation of the index space, and study how the error of an index space estimate propagates into the regression of the link function. The proposed method approximates the index space by the span of linear regression slope coefficients computed over level sets of the data. Being based on ordinary least squares, our approach is easy to implement and computationally efficient. We prove a tight concentration bound that shows $N^{-1/2}$ convergence, but also faithfully describes the dependence on the chosen partition of level sets, hence giving indications on the hyperparameter tuning. The estimator's competitiveness is confirmed by extensive comparisons with state-of-the-art methods, both on synthetic and real data sets. As a second contribution, we establish minimax optimal generalization bounds for k-nearest neighbors and piecewise polynomial regression when trained on samples projected onto an estimate of the index space, thus providing complete and provable estimation of the multi-index model.


Introduction
Many recent advances in the analysis of high-dimensional data are based on the observation that real-world data are inherently structured, and the relationship between variables, features and responses is often of a lower dimensional nature [1,3,26,27,37,38]. A popular model incorporating this structural assumption is the multi-index model, which poses the relation between a predictor X ∈ R^D and a response Y ∈ R as

Y = g(A^⊤ X) + ζ, (1)

where A ∈ R^{D×d} is an unknown full column rank matrix with d ≪ D, g : R^d → R is an unknown function, and ζ is a noise term with E[ζ|X] = 0, independent of X given A^⊤ X. In the following we refer to g as the link function and A as the index space, assuming, without loss of generality, that the columns of A are orthonormal [11]. Model (1) asserts that the information required to predict the conditional expectation f(x) := E[Y|X = x] is encoded in the distribution of A^⊤ X. Therefore, knowing the projection P := AA^⊤ allows one to estimate f in a nonparametric fashion with a number of samples scaling with the intrinsic dimension d, rather than the ambient dimension D.
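As a toy illustration of model (1) (not from the paper; the dimensions, link function and noise level below are arbitrary illustrative choices), the following sketch generates data and verifies that the regression function depends on x only through the projection P x:

```python
import numpy as np

# Toy instance of model (1): Y = g(A^T X) + zeta with D = 20, d = 2.
# All concrete choices (g, noise level, sample size) are illustrative only.
rng = np.random.default_rng(0)
D, d, N = 20, 2, 1000

A = np.linalg.qr(rng.standard_normal((D, d)))[0]   # orthonormal columns
g = lambda Z: np.sin(Z[:, 0]) + Z[:, 1] ** 2       # hypothetical link function

X = rng.standard_normal((N, D))
Y = g(X @ A) + 0.1 * rng.standard_normal(N)

# f(x) = E[Y|X = x] depends on x only through P x, where P = A A^T,
# since A^T P x = A^T x for orthonormal A:
P = A @ A.T
assert np.allclose(g(X @ A), g(X @ P @ A))
```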
Ways to estimate the index space have been studied extensively over the years and by now several methods have been proposed. Most of them originate from the statistical literature, starting with the seminal work of [34] and going forward with [7,32,33]. In recent years, the problem has gained popularity also in the machine learning community, due to its relation and similarity to (shallow) neural networks [11,12,17]. Despite the variety of available approaches, there is no distinctly best method: some estimators are better suited for practical purposes as they are computationally efficient and easy to implement, while others generally enjoy better theoretical guarantees. We provide an extensive overview in Section 1.1.
In this work we derive and analyze a method for estimating f under the model assumption (1) from a given data set {(X_i, Y_i) : i = 1, …, N}, where (X_i, Y_i) are independent copies of (X, Y). First, we construct an estimate P̂ of the projection P based on the span of response-conditional least squares vectors of the data. Once P̂ has been computed, the second step is a regression task on the reduced data set {(P̂ X_i, Y_i) : i = 1, …, N}, which can be solved by classical nonparametric estimators such as piecewise polynomials or kNN regression. The proposed method is attractive for practitioners due to its simplicity and efficiency, with almost no parameter adjustment needed. Furthermore, it comes with strong theoretical guarantees neatly derivable from a few reasonable assumptions. We establish tight concentration bounds describing the estimator's performance in the finite sample regime. In particular, we prove that ‖P̂ − P‖ ∈ O(N^{−1/2}), and determine the explicit dependence of the constants on the parameters involved. A data-driven approximation of the index space error empirically confirms the tightness of our concentration bound, providing guidance for hyperparameter tuning. Moreover, to the best of our knowledge, we are the first to provide generalization guarantees for model (1) that take into account the propagation of the projection error ‖P̂ − P‖ into the reduced regression problem. Specifically, we analyze two popular regression methods, namely k-nearest neighbors regression (kNN) and piecewise polynomial regression, and show that the minimax optimal d-dimensional estimation rate is achieved if ‖P̂ − P‖ ∈ O(N^{−1/2}).

Related work on index space estimation
Many methods for estimating the index space have been developed in the statistical literature under the name of sufficient dimension reduction [31], where the multi-index model is relaxed to

Y ⊥⊥ X | A^⊤ X. (2)

Note that this setting generalizes our problem, since (1) and ζ ⊥⊥ X|A^⊤ X imply (2). A space Im(A) satisfying (2) is called a dimension reduction subspace, and the intersection of all such spaces, if it again satisfies (2), is called the central subspace. Except for degenerate cases, a unique central subspace exists [5,6]. One can also consider a model where (2) is replaced by Y ⊥⊥ E[Y|X] | A^⊤ X, which leads to the definition of the central mean subspace [8]. In the case of model (1) with ζ ⊥⊥ X|A^⊤ X, the space Im(A) is both the central subspace and the central mean subspace [8]. Thus, we will treat related research under the same umbrella. The methods for sufficient dimension reduction can broadly be grouped into inverse regression based methods and nonparametric methods [1,40]. The first group reverses the regression dependency between X and Y and uses inverse statistical moments to construct a matrix Λ with Im(Λ) ⊆ Im(A). The most prominent representatives are sliced inverse regression (SIR/SIRII) [34,35], sliced average variance estimation (SAVE) [7], and contour regression/directional regression (CR/DR) [32,33] (see Table 1 for the corresponding definition of Λ). Linear combinations of related matrices Λ have been called hybrid methods [51]. Furthermore, in the case where X follows a normal distribution, two popular methods are principal Hessian directions (pHd) [36] and iterative Hessian transformations (iHt) [8]. In this setting, Λ is the averaged Hessian matrix of the regression function, which can be efficiently computed using Stein's Lemma.
If Im(Λ) ⊆ Im(A), the eigenvectors corresponding to nonzero eigenvalues of Λ span an unbiased subspace of the index space Im(A). A typical assumption to guarantee this is the linear conditional mean (LCM) assumption, given by E[X|P X] = P X. It holds, for example, for all elliptically symmetric distributions [34,40]. Methods based on second order moments usually need in addition the constant conditional variance (CCV) assumption, which requires Cov(X|P X) to be nonrandom. In particular, the normal distribution satisfies both LCM and CCV. If Im(Λ) = Im(A), a method is called exhaustive. A condition ensuring exhaustiveness is that E[v^⊤ Z|Y] is non-degenerate (i.e. not almost surely equal to a constant) for all nonzero v ∈ Im(A), where Z is the standardization of X. Table 1 summarizes prominent inverse regression based methods (plus pHd), with Z the standardized X and (Z′, Y′) an independent copy of (Z, Y); the table omits details on contour regression [33] (strongly related to DR), iterative Hessian transformations [8] (related to pHd), and hybrid approaches [51] (linear combinations of the methods above).
As inverse regression based methods require only the computation of finite sample means and covariances, they are efficient and easy to implement. The matrix Λ is usually computed by partitioning the range Im(Y) = ∪_{ℓ=1}^J R_{J,ℓ} and approximating statistics of X|Y by empirical quantized statistics of X|Y ∈ R_{J,ℓ}. Therefore, only a single hyperparameter, the number of subsets J, needs to be tuned. A strategy for choosing J optimally is not known [40].
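For concreteness, a minimal sketch of an inverse regression method of this type — SIR [34] with J equal-probability slices (the slice design here is our assumption; dyadic cells as used later in the paper would work equally well) — could look as follows:

```python
import numpy as np

def sir(X, Y, J, d):
    """Sketch of sliced inverse regression [34] with J equal-probability
    slices (an assumed slice design, for illustration only).
    Returns an orthonormal basis estimate of the index space."""
    N, D = X.shape
    Sigma = np.cov(X, rowvar=False)
    w, U = np.linalg.eigh(Sigma)
    root_inv = U @ np.diag(w ** -0.5) @ U.T            # Sigma^{-1/2}
    Z = (X - X.mean(axis=0)) @ root_inv                # standardized predictors
    edges = np.quantile(Y, np.linspace(0, 1, J + 1))
    idx = np.clip(np.searchsorted(edges, Y, side="right") - 1, 0, J - 1)
    Lam = np.zeros((D, D))
    for j in range(J):                                 # Lambda = sum_j rho_j m_j m_j^T
        mask = idx == j
        if mask.any():
            m = Z[mask].mean(axis=0)
            Lam += mask.mean() * np.outer(m, m)
    vecs = np.linalg.eigh(Lam)[1][:, -d:]              # top-d eigenvectors
    return np.linalg.qr(root_inv @ vecs)[0]            # back to original coordinates
```

As described above, the whole method reduces to slice-wise means and one eigendecomposition, which explains the efficiency of this family of estimators.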
Nonparametric methods try to estimate the gradient field of the regression function f, based on the observation that the d leading eigenvectors of E[∇f(X) ∇f(X)^⊤] (assuming f is differentiable) span the index space. The concrete implementation of this idea differs between methods. Popular examples are minimum average variance estimation (MAVE), outer product of gradient estimation (OPG), and variants thereof [50]. While MAVE converges to the index space under mild assumptions, it suffers from the curse of dimensionality due to the nonparametric estimation of gradients of f. The inverse MAVE (IMAVE) [50] combines MAVE with inverse regression, achieving N^{−1/2}-consistency under LCM. Furthermore, iterative generalizations of the average derivative estimation (ADE) [16] have been proved to be N^{−1/2}-consistent for d ≤ 3 and d ≤ 4, respectively [10,18].
Compared to inverse regression methods, nonparametric methods rely on less stringent assumptions, but are computationally more demanding, require more hyperparameter tuning, and are often more complex to analyze. The relation between inverse regression and nonparametric methods has been investigated in [39,41] by introducing semiparametric methods. The authors showed that the computational efficiency and simplicity of inverse regression methods come at the cost of assumptions such as LCM/CCV. Moreover, they demonstrated that inverse regression methods can be modified by including a nonparametric estimation step to achieve theoretical guarantees even when LCM/CCV do not hold.
The work presented above originates from the statistical literature, and, to the best of our knowledge, focuses only on the index space estimation, completely omitting the subsequent regression step. This is different in the machine learning community, where estimation of both A and f has been recently studied for the case d = 1 [13,20,21,24,28,29,44]. The problem was also considered for d ≥ 1 in an active sampling setting, where the user is allowed to query data points (X, Y) and the goal is to minimize the number of queries [11,17]. Moreover, model (1) has strong ties with shallow neural network models f(x) = Σ_{i=1}^m g_i(a_i^⊤ x), which are currently actively investigated [12,19,43,46].

Content and contributions of this work
Index space estimation We propose to estimate the index space by the span of ordinary least squares solutions computed over level sets of the data (see Section 2 for details). We call our method response-conditional least squares (RCLS). Our approach shares the typical benefits of inverse regression based techniques: it is computationally efficient and easy to implement, as only a single hyperparameter (the number of level sets) needs to be specified. An additional advantage is that ordinary least squares can readily be exchanged for variants leveraging priors such as sparsity [30,45].
On the population level, we guarantee that RCLS finds a subspace Im(P_J) of Im(P) under the LCM assumption. In the finite sample regime, we prove a concentration bound disentangling the influence of the number of samples N and the number of level sets J on the performance of our estimator (Corollary 8). Moreover, we show empirically that C(J) in (3) tightly characterizes the influence of the hyperparameter on the estimator's performance, providing guidance on how to choose it in practice.

Link function regression
We analyze the performance of kNN regression and piecewise polynomial regression (with respect to a dyadic partition) when trained on the perturbed data {(P̂ X_i, Y_i) : i ∈ [N]}. Specifically, we prove for sub-Gaussian X, (L, s)-smooth g (see Definition 9), and almost surely bounded f(X), that the estimator f̂ satisfies the generalization bound (4) (Theorems 10 and 13), where s ∈ (0, 1] in the case of kNN, and s ∈ (0, +∞) in the case of piecewise polynomials.
The bound (4) shows that optimal estimation rates (in the minimax sense) are achieved by traditional regressors for d = 1 and s > 1/2, or d ≥ 2 and any s > 0, provided ‖P̂ − P‖ ∈ O(N^{−1/2}). In particular, combining (3) and (4) we obtain that RCLS paired with piecewise polynomial regression produces an optimal estimation of the multi-index model.

Organization of the paper
Section 2 describes RCLS for index space estimation. Section 3 presents theoretical guarantees on the population level and in the finite sample regime. Section 4 shows the generalization bound (4). Section 5 compares RCLS with state-of-the-art methods on synthetic and real data sets.

General notation
We let N₀ be the set of natural numbers including 0, and [J] := {1, …, J}. We write a ∨ b := max{a, b} and a ∧ b := min{a, b}. Throughout the paper, C stands for a universal constant that may change on each appearance. We use ‖·‖ for the Euclidean norm of vectors, and ‖·‖, ‖·‖_F for the spectral and Frobenius matrix norms, respectively. For a symmetric real matrix A ∈ R^{D×D}, we denote the ordered eigenvalues by λ₁(A) ≥ ⋯ ≥ λ_D(A) and the corresponding eigenvectors by u₁(A), …, u_D(A). We denote the expectation and covariance of a random vector X by E[X] and Cov(X), respectively, and let X̄ := X − E[X]. The sub-Gaussian norm of a random variable Z is ‖Z‖_{ψ₂} := inf{t > 0 : E exp(Z²/t²) ≤ 2}. Similarly, the sub-exponential norm is ‖Z‖_{ψ₁} := inf{t > 0 : E exp(|Z|/t) ≤ 2}. Finally, we abbreviate the mean squared error of an estimator f̂ of f by MSE(f̂, f) := E|f̂(X) − f(X)|².

Index space estimation by response-conditional least squares
We first describe response-conditional least squares (RCLS) and then highlight advantages and disadvantages of the approach compared to other methods in the literature (see Section 1.1).
RCLS For the sake of simplicity we assume here that Im(Y) is bounded. First, let Im(Y) = ∪_{ℓ=1}^J R_{J,ℓ} be a dyadic decomposition of the range into J intervals, and define the corresponding sample sets

X_{J,ℓ} := {X_i : Y_i ∈ R_{J,ℓ}}, ℓ ∈ [J],

which we refer to as level sets in the following. On each level set we solve an ordinary least squares problem. That is, for the empirical conditional covariance matrix

Σ̂_{J,ℓ} := |X_{J,ℓ}|^{−1} Σ_{X_i ∈ X_{J,ℓ}} (X_i − Ê_{X_{J,ℓ}}X)(X_i − Ê_{X_{J,ℓ}}X)^⊤,

where Ê_{X_{J,ℓ}}X is the usual finite sample mean of X over X_{J,ℓ}, we compute the vectors

b̂_{J,ℓ} := Σ̂_{J,ℓ}^† |X_{J,ℓ}|^{−1} Σ_{X_i ∈ X_{J,ℓ}} (X_i − Ê_{X_{J,ℓ}}X)(Y_i − Ê_{X_{J,ℓ}}Y),

aggregate them into M̂_J := Σ_{ℓ=1}^J ρ̂_{J,ℓ} b̂_{J,ℓ} b̂_{J,ℓ}^⊤ with empirical level set probabilities ρ̂_{J,ℓ} := |X_{J,ℓ}|/N, and let P̂_J(d̂) be the orthoprojector onto the span of the d̂ leading eigenvectors of M̂_J. The parameter d̂ ≤ d is user-specified and ideally equals dim(span{b_{J,ℓ} : ℓ ∈ [J]}) in the limit N → ∞. If this value is unknown, we select it via model selection techniques or by inspecting the spectrum of M̂_J. The procedure is summarized in Algorithm 1.
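A minimal numerical sketch of the procedure (our reconstruction of Algorithm 1 under the definitions above; the uniform partition of Im(Y) and the weighting by empirical level set probabilities reflect our reading of the text):

```python
import numpy as np

def rcls(X, Y, J, d_hat):
    """Sketch of response-conditional least squares: partition Im(Y) into J
    intervals, fit an OLS slope on each level set, aggregate weighted outer
    products, and return the spectral projector onto the top d_hat directions."""
    N, D = X.shape
    edges = np.linspace(Y.min(), Y.max(), J + 1)            # uniform cells over Im(Y)
    idx = np.clip(np.searchsorted(edges, Y, side="right") - 1, 0, J - 1)
    M = np.zeros((D, D))
    for j in range(J):
        mask = idx == j
        n_j = int(mask.sum())
        if n_j < 2:
            continue
        Xc = X[mask] - X[mask].mean(axis=0)
        Yc = Y[mask] - Y[mask].mean()
        Sigma_j = Xc.T @ Xc / n_j                           # conditional covariance
        b_j = np.linalg.pinv(Sigma_j) @ (Xc.T @ Yc / n_j)   # OLS slope on level set
        M += (n_j / N) * np.outer(b_j, b_j)                 # weighted aggregation
    U = np.linalg.eigh(M)[1][:, -d_hat:]                    # d_hat leading eigenvectors
    return U @ U.T                                          # projector estimate
```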
Remark 1 (Choice of partition). One possible way to decompose Im(Y ) is by dyadic cells. However, the analysis in Section 3 does not require the dyadic structure and can instead be conducted with arbitrary partitions.
Remark 2 (Algorithmic complexity). The main computational demand is constructing the vectors {b̂_{J,ℓ} : ℓ ∈ [J]}. Assuming we use a partition of disjoint level sets R_{J,ℓ}, i.e. each sample is used only once in the construction of M̂_J, the cost for this is linear in N: forming the level set means and conditional covariance matrices requires O(ND²) operations, and computing the J pseudo-inverses adds O(JD³).

Comparison of RCLS with inverse regression based methods While RCLS does require the LCM assumption, as do all inverse regression methods, differently from second order inverse regression methods it does not need the CCV assumption. This is a desirable generalization since, as pointed out in [40], assuming both LCM and CCV for all directions reduces X to the normal distribution.
Comparison of RCLS with nonparametric methods If we interpret the vectors b̂_{J,ℓ} as approximate averaged gradients of the regression function over level sets, RCLS has strong ties with nonparametric methods, because both rely on gradient information. However, RCLS computes significantly less information than nonparametric methods due to the quantization of samples into level sets and the averaging process.

Guarantees for RCLS
We introduce population counterparts of b̂_{J,ℓ} and M̂_J given by

b_{J,ℓ} := Σ_{J,ℓ}^† Cov(X, Y | Y ∈ R_{J,ℓ}) and M_J := Σ_{ℓ=1}^J ρ_{J,ℓ} b_{J,ℓ} b_{J,ℓ}^⊤,

where Σ_{J,ℓ} := Cov(X | Y ∈ R_{J,ℓ}) and ρ_{J,ℓ} := P(Y ∈ R_{J,ℓ}). All quantities thus far are defined through the random vector (X, Y) without using the regression function. In fact, in this section we can technically avoid specifying the regression function by assuming a more general setting, where Im(P) is defined as the minimal dimensional subspace with

(A1) Y ⊥⊥ X | P X.
In the following analysis, we also require the following assumptions.

(A2) E[X|P X] = P X almost surely;
(A3) X and Y are sub-Gaussian random variables.
(A2) is the LCM assumption introduced in Section 1.1 and is required by all inverse regression based techniques like SIR, SAVE or DR. It is satisfied, for example, for any elliptical distribution, and it ensures Im(M_J) ⊆ Im(A), as shown in Proposition 3 below. (A3) is the natural level of generality for the tools we use, namely finite sample concentration bounds developed in the framework of sub-Gaussian random variables. Examples of sub-Gaussian random variables include bounded distributions, the normal distribution, or, more generally, random variables whose one-dimensional marginals all have tails exhibiting Gaussian-like decay after a certain threshold [48].

Population level
The population level results are summarized in the following proposition.

Proposition 3. Assume (A1) and (A2). Then b_{J,ℓ} ∈ Im(A) for every level set R_{J,ℓ}, and consequently Im(M_J) ⊆ Im(A).

We need the following result for the proof of Proposition 3. Let Q := Id − P.

Lemma 4. Assume (A1) and (A2). Then (a) E[X|P X, Y] = P X almost surely, and in particular E[QX|Y] = 0; (b) Cov(X|Y) = Cov(P X|Y) + Cov(QX|Y).

Proof of Lemma 4, part (b). By the law of total covariance,

Cov(P X, QX|Y) = E[Cov(P X, QX|P X, Y)|Y] + Cov(E[P X|P X, Y], E[QX|P X, Y]|Y) = 0,

where we used E[X|P X, Y] = P X as shown in the proof of (a). Therefore, Cov(X|Y) = Cov(P X|Y) + Cov(QX|Y).

Proof of Proposition 3. We only show that b_{J,ℓ} ∈ Im(A) for all R_{J,ℓ}, since Im(M_J) ⊆ Im(A) then follows immediately. We have

Cov(X, Y | Y ∈ R_{J,ℓ}) = Cov(P X, Y | Y ∈ R_{J,ℓ}) + Cov(QX, Y | Y ∈ R_{J,ℓ}) = Cov(P X, Y | Y ∈ R_{J,ℓ}) ∈ Im(P),

where the last equality follows from E[QX|Y] = 0 by Lemma 4. Therefore,

Σ_{J,ℓ} = Cov(P X | Y ∈ R_{J,ℓ}) + Cov(QX | Y ∈ R_{J,ℓ}),

hence the eigenspace of Σ_{J,ℓ} decomposes orthogonally into eigenspaces of Cov(P X|Y ∈ R_{J,ℓ}) and of Cov(QX|Y ∈ R_{J,ℓ}). The same holds for Σ_{J,ℓ}^†, because its eigenvectors are precisely those of Σ_{J,ℓ}. This implies Σ_{J,ℓ}^† z ∈ Im(P) for all z ∈ Im(P), and the result follows from the definition of b_{J,ℓ}.

Exhaustiveness Proposition 3 ensures exhaustiveness of RCLS (on the population level) whenever d out of the J least squares vectors b_{J,ℓ} are linearly independent. Even when this is not the case, we believe that RCLS generically finds a subspace of the index space that accounts for most of the variability in f, thereby allowing for a sufficient dimension reduction. The rationale behind this is that the b_{J,ℓ}'s can be interpreted as averaged gradients over approximate level sets, and thus they provide samples of the first order behavior of f along the chosen partition. This claim is supported numerically in Section 5.2, where RCLS performs as well as or better than all inverse regression based methods listed in Table 1.
Analyzing the exhaustiveness of inverse regression estimators is challenging, since in general it is easy to construct examples where some directions of the index space show up only in the tails of (X, Y). This also explains why most typical exhaustiveness conditions, such as RCP and RCP 2, are formulated on the nonquantized level, and therefore do not quite imply exhaustiveness of the actual quantized estimator. The only exception we are aware of is [33, Theorem 3.1], where sufficient conditions for the exhaustiveness of the estimator are provided by decoupling the roles of the regression function and the noise.
Lastly, we mention that it is possible to further enrich the space Im(M_J) by adding outer products of vectors B b_{J,ℓ} for matrices B which map Im(P) to Im(P). This resembles the idea behind the iHt method [9], where B is chosen as a positive power of the average residual- or response-based Hessian matrix [9,36]. Other choices are powers of Σ_{J,ℓ} or Σ_{J,ℓ}^†, which map Im(P) to Im(P) under (A1) and (A2).

Finite sample guarantees
We now analyze the finite sample performance of P̂_J(d̂) as an estimator for the orthoprojector P_J satisfying Im(P_J) = span{b_{J,ℓ} : ℓ ∈ [J]}. Our main result is an N^{−1/2} convergence rate, which is typically also achieved by inverse regression based methods. Additionally, however, we carefully track the influence of the induced level set partition on the estimator's performance in order to understand the influence of the hyperparameter J. To achieve this, we rely on an anisotropic bound for the concentration of the individual ordinary least squares vectors b̂_{J,ℓ} around b_{J,ℓ}. The resulting error bound links the accuracy of RCLS to the geometry of X|Y ∈ R_{J,ℓ}, encoded in spectral properties of the conditional covariance matrices Σ_{J,ℓ} = Cov(X|Y ∈ R_{J,ℓ}).
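Projection errors such as ‖P̂ − P‖_F appear throughout; given orthonormal bases of the two subspaces, they can be computed directly and related to the principal angles between the subspaces. A generic numerical helper (our own aid, not code from the paper) might look like:

```python
import numpy as np

def projection_error(B_hat, B):
    """Frobenius distance between the orthoprojectors onto the column spans
    of B_hat and B (both assumed to have orthonormal columns)."""
    P_hat, P = B_hat @ B_hat.T, B @ B.T
    return np.linalg.norm(P_hat - P)        # Frobenius norm is numpy's default

def principal_angle_cosines(B_hat, B):
    """Cosines of the principal angles, i.e. singular values of B_hat^T B.
    For equal-dimensional subspaces, ||P_hat - P||_F^2 = 2 * sum(sin^2 theta_i)."""
    return np.linalg.svd(B_hat.T @ B, compute_uv=False)
```

For two orthogonal one-dimensional subspaces the error equals √2, the maximal value in this metric for d = 1.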
Before we begin the analysis, we introduce some notation. Let X|_{J,ℓ} and Y|_{J,ℓ} denote the random variables X and Y conditioned on Y ∈ R_{J,ℓ}. By (A3) and Lemma 17, X|_{J,ℓ} and Y|_{J,ℓ} are sub-Gaussian whenever ρ_{J,ℓ} > 0, which implies that ‖X|_{J,ℓ}‖_{ψ₂} and ‖Y|_{J,ℓ}‖_{ψ₂} are finite. Moreover, we define P_{J,ℓ} as the orthoprojector onto span{b_{J,ℓ}} and Q_{J,ℓ} := Id − P_{J,ℓ}.

Anisotropic concentration
Our analysis decomposes the least squares error as

b̂_{J,ℓ} − b_{J,ℓ} = P_{J,ℓ}(b̂_{J,ℓ} − b_{J,ℓ}) + Q_{J,ℓ}(b̂_{J,ℓ} − b_{J,ℓ})

and finds separate bounds for the terms ‖P_{J,ℓ}(b̂_{J,ℓ} − b_{J,ℓ})‖ and ‖Q_{J,ℓ} b̂_{J,ℓ}‖. To see why these terms play different roles when estimating Im(P), let us consider the illustrative case of the single-index model, where P = aa^⊤ for some a ∈ S^{D−1}. We can estimate a by the direction of any b_{J,ℓ}, because any nonzero b_{J,ℓ} is aligned with a under (A1) and (A2). Writing Q := Id − P, a few algebraic manipulations bound the projection error in terms of ‖Q b̂_{J,ℓ}‖ and ‖P(b̂_{J,ℓ} − b_{J,ℓ})‖ whenever b̂_{J,ℓ}^⊤ b_{J,ℓ} > 0. This reveals that the error is dominated by ‖Q b̂_{J,ℓ}‖, whereas ‖P(b̂_{J,ℓ} − b_{J,ℓ})‖ is a higher order error term as soon as |X_{J,ℓ}| is sufficiently large. A similar observation will be established for higher dimensional index spaces in (14) below. Anisotropic concentration bounds for ordinary least squares vectors have recently been provided in [23]. To restate the bounds, we introduce a directional sub-Gaussian condition number κ_{J,ℓ}. As described in [23], κ(L, X) is related to the restricted matrix condition number defined by

κ̃(L, Cov(X)) := ‖L^⊤ Cov(X)^† L‖ ‖L^⊤ Cov(X) L‖,

which measures the heterogeneity of the eigenvalues of Cov(X) when the eigenspaces are restricted to Im(L). In fact, if X follows a normal distribution, the sub-Gaussian norm is a tight variance proxy and κ(L, X) differs from κ̃(L, Cov(X)) by a constant factor that depends only on the precise definition of the sub-Gaussian norm.
We further introduce the standardized random variable Z|_{J,ℓ} := Σ_{J,ℓ}^{†/2}(X|_{J,ℓ} − E[X|_{J,ℓ}]), where Σ_{J,ℓ}^{†/2} is the matrix square root of Σ_{J,ℓ}^†. As a consequence of the standardization, we have Cov(Z|_{J,ℓ}) = Id_D.
Lemma 5 ([23, Lemma 4.1]). With probability at least 1 − exp(−u), the terms ‖P_{J,ℓ}(b̂_{J,ℓ} − b_{J,ℓ})‖ and ‖Q_{J,ℓ} b̂_{J,ℓ}‖ satisfy the anisotropic bounds (13), formulated in terms of ‖P_{J,ℓ} Σ_{J,ℓ}^† P_{J,ℓ}‖, ‖Q_{J,ℓ} Σ_{J,ℓ}^† Q_{J,ℓ}‖ and κ_{J,ℓ}.

Proof. The concentration bounds are precisely [23, Lemma 4.1], adjusted to the notation used here.

In many scenarios the two norms in (13), viewed as functions of the parameter J, behave very differently. This is because increasing the number of level sets J typically reduces the variance in the direction of the least squares solution b_{J,ℓ}, and therefore increases ‖P_{J,ℓ} Σ_{J,ℓ}^† P_{J,ℓ}‖, while ‖Q_{J,ℓ} Σ_{J,ℓ}^† Q_{J,ℓ}‖ is often not affected. The effect is particularly strong for single-index models with monotone link functions, as illustrated in Figure 1, but it can also be observed in more general scenarios, for instance if f follows a monotone single-index model locally on one of the level sets. Recalling (9), anisotropic concentration is therefore necessary if we aim at an accurate description of the projection error in terms of both N and J.

Concentration bounds for index space estimation
Our goal is now to provide concentration bounds for P̂_J(d̂) around P_J. Using the Davis-Kahan theorem [2, Theorem 7.3.1] together with Weyl's eigenvalue bound [49], the projection error ‖P̂_J(d̂) − P_J‖ is controlled by the deviation ‖M̂_J − M_J‖ divided by the spectral gap γ_J of M_J (14).

Proof. The first step is to decompose the error ‖M̂_J − M_J‖ into two terms, T₁ and T₂. The second term can be bounded using Lemmas 5 and 19. Specifically, Lemma 5 implies ‖b_{J,ℓ}‖ ≤ 2η_{J,ℓ}, and (33) in Lemma 19 with a union bound argument over ℓ ∈ [J] yields the corresponding concentration, provided N > C(u + log(J))ε^{−2}. This gives the bound on T₂. To bound T₁ we first need to ensure that each level set is sufficiently populated. Equation (34) in Lemma 19 and the union bound over ℓ ∈ [J] give the required estimate, provided N > C(u + log(J))/ρ_{J,min}. It follows that 1/2 ≤ ρ̂_{J,ℓ}/ρ_{J,ℓ} ≤ ρ̂_{J,ℓ}/ρ_{J,min} whenever N > C(D + u + log(J))(ω_{J,max} ρ_{J,min}^{−1} ∨ ε^{−2}). Now we can use Lemma 5 to concentrate ‖b̂_{J,ℓ} − b_{J,ℓ}‖. Finally, (19) and ‖b_{J,ℓ}‖ ≤ 2η_{J,ℓ} from Lemma 5 imply the bound on T₁ whenever (17) and (19) hold,
that is, whenever N > C(D + u + log(J))(ω_{J,max} ρ_{J,min}^{−1} ∨ ε^{−2}). Using the union bound over both events, the conclusion in the statement follows with probability at least 1 − 2 exp(−u) from Lemma 21 in the Appendix and the Davis-Kahan bound (14).
Assuming ε^{−2} maximizes (21), Corollary 8 implies the bound (22). It separates the error into a leading factor, which depends only on the hyperparameter J, respectively the induced level set partition, and a trailing factor, which describes the dependence on √D, N^{−1/2} and the confidence parameter u. By using the anisotropic bounds from Lemma 5, we obtain a linear dependence on η_{J,ℓ}, which scales like ‖b_{J,ℓ}‖, and a linear dependence on η⊥_{J,ℓ}, which scales like ‖Q_J b̂_{J,ℓ}‖. An isotropic concentration bound for b̂_{J,ℓ} − b_{J,ℓ} would instead have given η²_{J,ℓ}, which can be significantly worse judging by the observations in Figure 1.

Data-driven proxy and tightness of (22)
We now empirically study the tightness of (22) when considering a fixed number of samples N but varying the number of level sets J. First, we develop a data-driven proxy to estimate the leading factor in (22) from a given data set. Afterwards, we compare the proxy with the true error on several synthetic examples.
Data-driven proxy We have to replace γ_J, ρ_{J,ℓ}, η_{J,ℓ}, η⊥_{J,ℓ} and κ_{J,ℓ} in (22) by quantities that can be estimated from data. The first three quantities are approximated by γ_J ≈ λ_d(M̂_J), ρ_{J,ℓ} ≈ ρ̂_{J,ℓ} and η_{J,ℓ} ≈ ‖b̂_{J,ℓ}‖, where the last replacement is motivated by the fact that η_{J,ℓ} is used to bound ‖b_{J,ℓ}‖. Note that replacing squared sub-Gaussian norms with spectral norms of the corresponding covariance matrices can underestimate the true value of κ_{J,ℓ}. The same strategy is used for η⊥_{J,ℓ}. We observe that the map J ↦ ‖P̂_J(d̂) − P‖_F initially decreases when increasing the number of level sets J beyond d, and then either stalls, as in Figures 2a, 2c, 2e and 2d, or increases, as in Figures 2b and 2f. This behavior is captured well by the data-driven proxy (23). Furthermore, even if the relation J ↦ ‖P̂_J(d̂) − P‖_F shows kinks, as in Figure 2e, the derived data-driven proxy reproduces the same behavior. The experiments suggest that Corollary 8 characterizes well the influence of J and the induced level set partition on the projection error. Furthermore, they raise the question whether the J minimizing the data-driven proxy (23) can be used for hyperparameter tuning in practice. This is an interesting direction for future work, because choosing J for the related class of inverse regression based methods has been identified as a notoriously difficult problem for which no good strategies exist [40].

Regression in the reduced space
In this section we return to the multi-index model Y = g(A^⊤X) + ζ with E[ζ|X] = 0 almost surely. The assumption ζ ⊥⊥ X|A^⊤X is not strictly required in this part. The second step in estimating the model is to learn the link function g, while leveraging the approximated projection P̂ ≈ P, e.g. constructed using RCLS. We restrict our analysis to two popular and commonly used regressors, namely kNN regression and piecewise polynomial regression. Our analysis reveals how the error ‖P̂ − P‖ affects kNN and piecewise polynomials if they are trained on the perturbed data {(P̂ X_i, Y_i) : i ∈ [N]}. For simplicity, we assume P̂ is deterministic and thus statistically independent of {(X_i, Y_i) : i ∈ [N]}. In practice, statistical independence can be ensured by using separate data sets for learning P̂ and for performing the subsequent regression task.
To study regression rates, smoothness properties of the link function play an important role. We use the following standard definition [14].
Definition 9. Let f : R^D → R, s₁ ∈ N₀, s₂ ∈ (0, 1] and s = s₁ + s₂. We say f is (L, s)-smooth if the partial derivatives ∂^α f exist for all α ∈ N₀^D with Σᵢ αᵢ ≤ s₁, and for all α with Σᵢ αᵢ = s₁ we have

|∂^α f(x) − ∂^α f(y)| ≤ L ‖x − y‖^{s₂} for all x, y ∈ R^D.

The minimax rate for nonparametric estimation in R^d is well known [14,47] and reads, for an (L, s)-smooth regression function f, N^{−2s/(2s+d)} (25). Similarly, the rate (25) is a lower bound for nonparametric estimation of the multi-index model with dim(P) = d, because we are still left with a nonparametric regression problem in R^d once P is identified. In the following, we provide conditions on ‖P̂ − P‖ under which the optimal rate (25) is achieved when training on perturbed data. In the analysis, we assume that X is sub-Gaussian, |f(X)| ≤ 1 almost surely, and Var(ζ|X) ≤ σ²_ζ almost surely.

kNN-regression
Let x be a new data point and denote by 1(x), …, N(x) a reordering of the indices such that

‖P̂(X_{i(x)} − x)‖ ≤ ‖P̂(X_{j(x)} − x)‖ for all j ≥ i and all i,

i.e. i(x) is the index of the i-th nearest neighbor of x after projecting onto Im(P̂). The kNN estimator is defined by f̂_k(x) := k^{−1} Σ_{i=1}^k Y_{i(x)}, and the following theorem characterizes the influence of the projection error on the generalization performance. The proof resembles [14,25] and is given in Appendix A.3.
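This estimator admits a direct sketch (our own minimal implementation of the definition above, with brute-force neighbor search):

```python
import numpy as np

def knn_reduced(x, X, Y, P_hat, k):
    """kNN regression on the reduced data: average Y over the k samples whose
    projections P_hat X_i are closest to P_hat x."""
    d2 = np.sum((X @ P_hat - x @ P_hat) ** 2, axis=1)   # squared distances in Im(P_hat)
    nn = np.argsort(d2)[:k]                             # indices of the k nearest neighbors
    return float(Y[nn].mean())
```

With ‖P̂ − P‖ = 0 this reduces to ordinary kNN on the d-dimensional features A^⊤X_i, which is what makes the d-dimensional rate attainable.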
Theorem 10. Let g be (L, s)-smooth for s ∈ (0, 1], and let d > 2s. Choosing k = ⌈C_k N^{2s/(2s+d)}⌉, it holds that

MSE(f̂_k, f) ≤ C₁ log(N) N^{−2s/(2s+d)} + C₂ ‖P̂ − P‖^{2s}, (26)

where C₁ depends on d, σ_ζ, ‖X‖_{ψ₂}, C_k, L, s, and C₂ additionally depends linearly on D‖X‖²_{ψ₂}.

Remark 11 (d > 2s assumption). The condition d > 2s in Theorem 10 is not due to the error ‖P̂ − P‖, but arises from [25, Lemma 1], where ordinary kNN is analyzed for unbounded marginal distributions. It has been shown in [14] that achieving similar rates for d ≤ 2s requires an extra assumption on the marginal distribution of X (boundedness does not suffice).
Remark 12 (Rate optimality). Assuming ‖P̂ − P‖ ∈ O(N^{−1/2}), we observe that the second term in (26) has order N^{−s}. Therefore, Theorem 10 ensures, up to the logarithmic factor, the optimal rate N^{−2s/(2s+d)} for d ≥ 2. The logarithmic factor disappears if the marginal distribution of X is bounded.

Piecewise polynomial regression
Piecewise polynomial estimators can be defined in different ways as they depend on a partition of the underlying space. Therefore we first have to describe the type of piecewise polynomials that we consider in the following.
Let Â ∈ R^{D×d} contain column-wise an arbitrary orthonormal basis of Im(P̂). Denote by ∆_l the set of dyadic cubes in R^d, i.e. the set of cubes with side length 2^{−l} and corners in the set {2^{−l}(v₁, …, v_d) : v_j ∈ Z}, and let ∆_l(R) ⊆ ∆_l be the subset of cubes having non-empty intersection with {Â^⊤ z : z ∈ B_R}, where B_R := {x ∈ R^D : ‖x‖ ≤ R}. Moreover, let P_k be the space of polynomials of degree k in R^d and 1_A the characteristic function of a set A. The function space of piecewise polynomials we consider is defined by

F(Â, l, k, R) := { x ↦ Σ_{∆ ∈ ∆_l(R)} p_∆(Â^⊤x) 1_∆(Â^⊤x) : p_∆ ∈ P_k }.

To construct the estimator, we perform empirical risk minimization

f̃ := argmin_{h ∈ F(Â, l, k, R)} N^{−1} Σ_{i=1}^N (Y_i − h(X_i))²,

and then set f̂(x) := T_{[−1,1]}(f̃(x)), where T_{[−1,1]}(u) := sign(u)(|u| ∧ 1). Note that piecewise polynomial estimators are typically analyzed after thresholding to avoid technical difficulties with potentially unbounded predictions (see also [4,14]).
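To make the construction concrete, here is a sketch for the simplest case k = 0, where the empirical risk minimizer decouples into an ordinary average on each dyadic cell (the general degree-k case would fit a polynomial per cell instead; this is our illustration, not the paper's code):

```python
import numpy as np

def fit_piecewise_constant(X, Y, A_hat, l):
    """ERM over F(A_hat, l, 0, R): average of Y on each dyadic cell of side
    2^{-l} in the reduced coordinates z = A_hat^T x (sketch for degree k = 0)."""
    Z = X @ A_hat                                     # reduced coordinates in R^d
    keys = map(tuple, np.floor(Z * 2 ** l).astype(int))
    cells = {}
    for key, y in zip(keys, Y):
        cells.setdefault(key, []).append(y)
    return {key: float(np.mean(v)) for key, v in cells.items()}

def predict_piecewise_constant(model, x, A_hat, l):
    key = tuple(np.floor((x @ A_hat) * 2 ** l).astype(int))
    # truncation T_[-1,1] as in the text; predict 0 on cells without samples
    return float(np.clip(model.get(key, 0.0), -1.0, 1.0))
```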
The following theorem characterizes the influence of P − P on the generalization performance of the estimator.
Theorem 13. Let g be (L, s)-smooth for s ∈ (0, +∞). For suitable choices of l, k and R (in particular R ≍ √(2 log(N))), it holds that

MSE(f̂, f) ≤ C₁ log^{1∨(d/2)}(N) N^{−2s/(2s+d)} + C₂ ‖P̂ − P‖^{2(1∧s)}, (29)

where the constants grow with σ_ζ, d, s, L* := L d^{s₁/2}(1 − ‖P̂ − P‖²)^{−s/2}, and C₁ depends linearly on (D‖X‖²_{ψ₂})^{d/2}, and C₂ linearly on (D‖X‖²_{ψ₂})^{1∧s}.

Remark 14 (Boundedness and log-factors). For bounded X, the choice R ≍ √(2 log(N)) is not required and log^{1∨(d/2)}(N) reduces to log(N). Moreover, D‖X‖²_{ψ₂} can be replaced by the squared radius of a ball containing the support of X, which removes the dependency on D entirely.
Proof sketch The first step is to apply the following well-known result.
Theorem 16 (Theorem 11.3 in [14]). Let F be a vector space of functions h : R^D → R with K := dim(F) < ∞, and assume |f(X)| ≤ 1 almost surely and Var(ζ|X = x) ≤ σ²_ζ. Denote by f̃ the empirical risk minimizer in F over N iid copies of (X, Y), and let f̂ = T_{[−1,1]}(f̃). Then there exists a universal constant C such that

MSE(f̂, f) ≤ C (σ²_ζ ∨ 1) (log(N) + 1) K/N + 8 inf_{h ∈ F} E|h(X) − f(X)|². (30)

The first term in (30) is the estimation error, which measures the deviation of the performance of the empirical risk minimizer from that of the best performing estimator in F when having access to the entire distribution. It decreases as more samples become available, but increases with the complexity of F, here measured in terms of its dimension. It can be checked that F(Â, l, k, R) is closed under addition and scalar multiplication and is thus a vector space. A basis can be constructed by combining the standard polynomial basis for each cell of the partition. Therefore dim(F(Â, l, k, R)) = |∆_l(R)| · binom(d+k, k), where |∆_l(R)| is the number of cells required to cover {Â^⊤ z : z ∈ B_R}. Lemma 24 in the Appendix proves |∆_l(R)| ≤ (2^{l+1}R)^d, and therefore

dim(F(Â, l, k, R)) ≤ (2^{l+1}R)^d · binom(d+k, k). (31)

The second term in (30) is the approximation error, which measures how well f can be approximated by any function h ∈ F(Â, l, k, R). Neglecting for a moment the perturbation ‖P̂ − P‖, it is known that a piecewise Taylor expansion of g can be used to approximate g with an accuracy that increases as the underlying partition is refined. The main difficulty in our case is to define a piecewise polynomial function h ∈ F(Â, l, k, R) that approximates g(A^⊤x), despite the fact that h depends on the coordinates Â^⊤x instead of A^⊤x.
To define such a function, we first prove the existence of a function $g^*$ that approximates $g$ uniformly well when evaluated at $\hat A^\top x$. Precisely, Lemma 25 in the Appendix shows
$$|g(A^\top x) - g^*(\hat A^\top x)| \le L^*\,\|x\|^{1\wedge s}\,\|P - \hat P\|^{1\wedge s}$$
for some $(L^*, s)$-smooth function $g^*$. Now, by approximating $g^*$ through a piecewise Taylor expansion, we can construct a function $h \in \mathcal F(\hat A, l, k, R)$ which, using the choices of $l$, $k$ and $R$ as in Theorem 13, satisfies
$$\mathbb E\,|h(X) - f(X)|^2 \le C_1 N^{-\frac{2s}{2s+d}} + C_2 \log^{1\wedge s}(N)\,\|P - \hat P\|^{2(1\wedge s)} \qquad (32)$$
for constants $C_1$ depending on $L^*$, $d$, $s$, and $C_2$ depending on $L^*$ and linearly on $(D\|X\|_{\psi_2}^2)^{1\wedge s}$ (see Corollary 27). The proof of Theorem 13 concludes by combining Theorem 16, the dimensionality bound (31), and the approximation error bound (32) (see Appendix A.4).
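The dimension count $\dim(\mathcal F(\hat A, l, k, R)) = |\Delta_l(R)|\binom{d+k}{k}$ with $|\Delta_l(R)| \le (2^{l+1}R)^d$ is simple combinatorics and can be sketched in a few lines; the function names below are ours, not part of the paper's code.

```python
import math

def piecewise_poly_dim(num_cells: int, d: int, k: int) -> int:
    """dim F = |cells| * C(d + k, k): each cell carries an independent
    polynomial of total degree <= k in d variables."""
    return num_cells * math.comb(d + k, k)

def max_cells(l: int, R: float, d: int) -> int:
    """Lemma 24 bound: at most (2^(l+1) R)^d dyadic cells of side 2^-l
    are needed to cover [-R, R]^d (rounded up if 2^(l+1) R is fractional)."""
    return math.ceil(2 ** (l + 1) * R) ** d
```

For instance, a partition with 16 cells and cubic polynomials in $d = 2$ variables yields a $16\binom{5}{3} = 160$-dimensional space.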

Numerical experiments
We now compare RCLS to the most prominent inverse-regression-based techniques SIR, SIRII, SAVE, pHd and DR, which have been described extensively in Section 1.1. In the first part we consider synthetic problems and directly assess the performance by evaluating $\|\hat P_J(d) - P\|_F$, since the true index space is known. In the second part, we consider real data sets from the UCI data set repository. Here the true index space is unknown, and we instead compare recovered spaces $\mathrm{Im}(\hat P)$ by measuring the predictive performance of kNN-regression trained on $\{(\hat P X_i, Y_i) : i \in [N]\}$. In both cases we construct the partition $\mathcal R_{J,\ell}$ using dyadic cells in the response domain, as described in Section 2. The source code for all experiments is readily available at https://github.com/soply/sdr_toolbox and https://github.com/soply/mim_experiments.
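To fix ideas, the core RCLS computation can be sketched as follows. This is an illustrative reimplementation, not the paper's code: we use a quantile partition of the response for simplicity (the experiments use dyadic cells as described in Section 2), and we aggregate the per-level-set OLS slopes via an SVD, which is one natural way to extract a $d$-dimensional span.

```python
import numpy as np

def rcls(X, Y, J, d):
    """Sketch of response-conditional least squares (RCLS).

    Partition the response range into J level sets, fit an ordinary least
    squares slope on the samples falling into each level set, and return
    the top-d left singular vectors of the stacked slopes as an
    orthonormal basis for the estimated index space."""
    edges = np.quantile(Y, np.linspace(0.0, 1.0, J + 1))
    slopes = []
    for j in range(J):
        mask = (Y >= edges[j]) & (Y <= edges[j + 1])
        if mask.sum() <= X.shape[1]:       # too few samples for OLS
            continue
        Xj, Yj = X[mask], Y[mask]
        Xc = Xj - Xj.mean(axis=0)          # centring absorbs the intercept
        beta, *_ = np.linalg.lstsq(Xc, Yj - Yj.mean(), rcond=None)
        slopes.append(beta)
    U, _, _ = np.linalg.svd(np.column_stack(slopes), full_matrices=False)
    return U[:, :d]                        # orthonormal basis estimate
```

Being a sequence of $J$ ordinary least squares fits plus one small SVD, the estimator is cheap to compute, which is the practical appeal noted in the abstract.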

Synthetic data sets
We sample $X \sim \mathrm{Uni}(\{x \in \mathbb R^{20} : \|x\| \le 1\})$ and generate the response by $Y = g(A^\top X) + \zeta$ for several functions $g$ and $\zeta \sim \mathcal N\!\left(0,\, 0.01^2\,\mathrm{Var}(g(A^\top X))\right)$. The index space is $A = [e_1 | \ldots | e_d] \in \mathbb R^{D \times d}$, where $e_i$ is the $i$-th standard basis vector. The hyperparameter $J$ is chosen optimally for SIR, SIRII, SAVE, DR and RCLS to minimize the projection error within $J \in [100]$. No parameter is required for pHd.
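The sampling scheme above is straightforward to reproduce; a minimal sketch, where the default link $g$ is a placeholder of our choosing (the paper tests several functions):

```python
import numpy as np

def sample_multi_index(N, D=20, d=2, g=lambda z: np.linalg.norm(z, axis=1),
                       noise=0.01, rng=None):
    """Draw (X, Y) from the synthetic model: X uniform on the unit ball
    in R^D, index space A = [e_1 | ... | e_d], and
    Y = g(A^T X) + zeta with zeta ~ N(0, (noise * sd(g(A^T X)))^2)."""
    if rng is None:
        rng = np.random.default_rng()
    # uniform on the unit ball: normalized Gaussians scaled by U^(1/D)
    V = rng.normal(size=(N, D))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    X = V * rng.uniform(size=(N, 1)) ** (1.0 / D)
    A = np.eye(D)[:, :d]                   # first d standard basis vectors
    G = g(X @ A)
    Y = G + noise * G.std() * rng.normal(size=N)
    return X, Y, A
```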
We report projection errors averaged over 100 repetitions of the same experiment in Figures 3a–3f. First, notice that most estimators (except pHd in some cases) achieve the expected $N^{-1/2}$ rate on all problems. pHd fails to detect linear trends and therefore fails to detect the index space in some cases. RCLS achieves the best performance in Figures 3a–3c, is tied with SIRII in Figure 3d, and is runner-up to SIRII in the remaining cases.
In Figures 3a–3c, where RCLS improves upon its competitors, we observe (temporary) convergence speeds beyond $N^{-1/2}$. This can be explained by recent results in [23], once we recognize that the multi-index models in Figures 3a–3c have an approximately monotone single-index structure when restricting $(X, Y)$ to small level sets $Y \in \mathcal R_{J,\ell}$. More precisely, [23] shows that the convergence rate of RCLS can temporarily be as fast as $\log(N)N^{-1}$ if the data follows a monotone single-index model. The reason for this increased convergence speed is that the variance of the local least squares estimator decreases quadratically in the length of the level set $\mathcal R_{J,\ell}$ [23, 24], and hence the convergence speed may exceed $N^{-1/2}$ when choosing $J$ as an increasing function of $N$. These observations suggest that RCLS is particularly suited for multi-index models where a response-local single-index structure is a good fit.

Real data sets
To compare RCLS with inverse regression based competitors on real data sets, we first compute an index space and then compare the predictive performance when training a kNN-regressor on projected samples. More precisely, we conduct the following steps.
1. Split the data $\{(X_i, Y_i) : i \in [N]\}$ into a training set $X_{\mathrm{Train}}, Y_{\mathrm{Train}}$ and a test set $X_{\mathrm{Test}}, Y_{\mathrm{Test}}$.
2. Use pHd, SIR, SIRII, SAVE, DR, or RCLS on the training set to compute an index space $\hat A$.
3. Train a kNN-regressor using $\{(\hat A^\top X_i, Y_i) : X_i \in X_{\mathrm{Train}}\}$.
4. Cross-validate over the hyperparameters $d$ (index space dimension), $k$ (kNN parameter), and $J$ (number of level sets) using a hold-out validation set taken from the training data.
5. Compute the root mean squared error (RMSE) of the kNN-regressor on the test set.

The test set contains 15% of the data, while cross-validation is performed using a 10-fold splitting strategy. Each experiment is repeated 20 times and we report the mean and standard deviation.

Table 2: RMSE, standard deviation, and cross-validated hyperparameters, over 20 repetitions, for several estimators and UCI repository data sets. Values for $d$, $k$, $J$ are averages over the runs of each experiment. The first 5 rows describe the data sets and their characteristics; the remaining rows contain the results. For a simplified presentation, we divide the mean and STD of the RMSE, and the mean and STD of the data (5th row), by the value in the row Factor.
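Steps 3 and 5 of the protocol can be sketched with a plain-numpy kNN regressor; cross-validation (step 4) is omitted for brevity, and all names here are illustrative rather than taken from the paper's code.

```python
import numpy as np

def knn_predict(X_tr, Y_tr, X_te, k):
    """Plain kNN regression: average the responses of the k nearest
    training points in Euclidean distance."""
    d2 = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]
    return Y_tr[idx].mean(axis=1)

def projected_rmse(X_tr, Y_tr, X_te, Y_te, A_hat, k=5):
    """Train kNN on projected samples (A_hat^T X_i, Y_i) and report the
    test RMSE, as in steps 3 and 5 of the evaluation protocol."""
    Z_tr, Z_te = X_tr @ A_hat, X_te @ A_hat
    pred = knn_predict(Z_tr, Y_tr, Z_te, k)
    return float(np.sqrt(np.mean((pred - Y_te) ** 2)))
```

When the response depends on a low-dimensional projection, training on $\hat A^\top X$ instead of $X$ shrinks the nearest-neighbor distances that drive the kNN error, which is the effect the comparison in Table 2 measures.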
We consider the data sets Airquality, Ames-housing, Boston-housing, Concrete, Skillcraft and Yacht. We standardize the components of $X$ to $[-1, 1]$ and apply a log transformation to $Y$ when its marginal has sparsely populated tails; this is indicated by the log-TF row in Table 2. For some data sets we also exclude features with missing values, or, in the case of Ames, some irrelevant and categorical features, to reduce the complexity of the data set. Preprocessed data sets can be found at https://github.com/soply/db_hand. The RMSE and cross-validated hyperparameters are presented in Table 2. To have robust baselines for comparison, we also report the RMSE of standard linear regression and of ordinary kNN regression. We first see that applying a dimension reduction technique significantly improves the performance of linear regression and ordinary kNN on the data sets Airquality, Concrete, Skillcraft and Yacht. Furthermore, on these data sets RCLS achieves the best results among all competitors. Runner-up is DR, while SIR and SAVE share third and fourth place. The results of pHd and SIRII are not convincing on most data sets.
The study confirms that RCLS is a viable alternative to prominent inverse regression methods. The data sets were chosen because the one-dimensional maps $e_i^\top X \mapsto Y$, where $e_i$ is the $i$-th standard basis vector, show a certain degree of monotonicity. We believe that this promotes a response-local monotone single-index structure, which benefits the accuracy of the RCLS approach, as briefly discussed in Section 5.1.

A.1. Probabilistic results
This section contains some probabilistic auxiliary results used in the paper.
Proof. Assume without loss of generality $Z \in \mathbb R$; the result for the vector case then follows by definition. We use the characterization of sub-Gaussianity by the moment bound in [48], where $C$ is some universal constant; the second inequality follows from $\mathbb P(E) \le 1$, and the third from the sub-Gaussianity of $Z$.
Proof. Using Hölder's inequality and the sub-Gaussianity of $X$, we compute the stated bound on the moment generating function.

Lemma 19. Fix $u > 0$, $\varepsilon > 0$. Let $Y \in \mathbb R$ be a random variable, $R$ an interval, and $\hat P(Y \in R) := |\{i : Y_i \in R\}|\,N^{-1}$ the empirical estimate of $P(Y \in R)$ based on $N$ iid samples. Then (33) holds.

Proof. For (33), we write $\hat P(Y \in R) - P(Y \in R)$ as a sum of iid centred random variables, and the result follows after division by $N$, using $N > 12\,P(Y \in R)^{-1} u$.
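Lemma 19's concentration of the empirical level-set probability is easy to see in simulation: the deviation $|\hat P(Y \in R) - P(Y \in R)|$ fluctuates at the scale $\sqrt{P(Y \in R)/N}$. The following Monte Carlo check is purely illustrative and uses a standard Gaussian response.

```python
import numpy as np

def empirical_level_set_prob(Y, interval):
    """P_hat(Y in R) = |{i : Y_i in R}| / N for an interval R = (a, b]."""
    a, b = interval
    return float(np.mean((Y > a) & (Y <= b)))

def mean_abs_deviation(p_true, interval, N, reps, rng):
    """Average |P_hat - P| over independent draws of size N; Lemma 19
    predicts fluctuations of order sqrt(P(Y in R) / N)."""
    devs = [abs(empirical_level_set_prob(rng.normal(size=N), interval) - p_true)
            for _ in range(reps)]
    return float(np.mean(devs))
```

For the interquartile interval $(-0.6745, 0.6745]$ of $\mathcal N(0,1)$ we have $P(Y \in R) \approx 0.5$, so with $N = 4000$ the typical deviation is on the order of $10^{-2}$.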

A.2. Differences of projections
We gather two auxiliary results to rewrite the norm of differences of projections.
Lemma 20. Let $\mathcal A$ and $\mathcal B$ be subspaces with $\dim(\mathcal A) = \dim(\mathcal B)$, and let $P_A$ and $P_B$ be the corresponding orthogonal projections. Then $\|P_A - P_B\| = \|(\mathrm{Id} - P_A)P_B\|$.

Proof. Assume first $\|(\mathrm{Id} - P_A)P_B\| = \|P_{A^\perp} P_B\| < 1$. Then the first case of Theorem 6.34 in Chapter 1 of [22] applies. Note that the second case can be ruled out, since $P_A$ cannot map $\mathrm{Range}(P_B)$ one-to-one onto a proper subspace $V \subset \mathrm{Range}(P_A)$, because $\dim(V) < \dim(\mathrm{Range}(P_A)) = \dim(\mathrm{Range}(P_B))$ by assumption. Thus, in the first case, it follows that $\|P_B v\| = \|v\|$, and hence $P_B v = v$ because $P_B$ is a projection. With the same argument we also deduce $P_{A^\perp} v = v$, and the claim follows.

Lemma 21. Let $\mathcal A$ and $\mathcal B$ be subspaces with $m = \dim(\mathcal A) = \dim(\mathcal B)$, and let $P_A$ and $P_B$ be the corresponding orthogonal projections. Then $\|P_A - P_B\|_F^2 = 2\sum_{i=1}^m \sin^2(\theta_i)$, where the $\theta_i$ are the principal angles between $\mathcal A$ and $\mathcal B$.

Proof. With slight abuse of notation, we denote by $A, B \in \mathbb R^{D \times m}$ two orthonormal bases of $\mathcal A$ and $\mathcal B$, respectively, such that $P_A = AA^\top$ and $P_B = BB^\top$. Now write $A^\top B = U \cos(\theta) V^\top$, where $\cos(\theta) \in \mathbb R^{m \times m}$ is the diagonal matrix containing the cosines $\cos(\theta_i)$ of the principal angles [15]. From [15] we obtain the identity $\mathrm{Trace}(P_A P_B) = \sum_{i=1}^m \cos^2(\theta_i)$. Expanding $\|P_A - P_B\|_F^2 = \mathrm{Trace}(P_A) + \mathrm{Trace}(P_B) - 2\,\mathrm{Trace}(P_A P_B)$, the result follows from $\mathrm{Trace}(P_A) = \mathrm{Trace}(P_B) = \dim(\mathcal B) = m$.

A.3. Proof of the kNN generalization bound

For the first term, we proceed as in [14, 25] by randomly splitting the data set $\{X_i : i \in [N]\}$ into $k+1$ sets, where the first $k$ sets contain $N/k$ samples each. Then we let $X^*_{i(X)}$ denote the nearest neighbor of $X$ (measured in the metric $\hat d$) within the $i$-th set. Since $\{X_{i(X)} : i \in [k]\}$ are by definition the closest $k$ samples (measured in $\hat d$), we can bound the $k$-NN distance by the average over the $k$ splits, where the last equality uses that the distribution of $\|X^*_{i(X)} - X\|$ is independent of the set index $i$. Since $\|\hat P(\cdot)\| = \|\hat A^\top(\cdot)\|$, $d > 2s$ by assumption, and $(\mathbb E\|X\|^\beta)^{1/\beta} \lesssim \beta^{1/2}$ for any $\beta \ge 1$ by the sub-Gaussianity of $X$ (see [48, Proposition 2.5.2]), Lemma 1 in [25] implies the existence of a constant $C_1 = C_1(d, s, \|X\|_{\psi_2})$ satisfying the required bound. It remains to bound the last term in (35). Denote for short $\sigma_X = \|X\|_{\psi_2}$. We first compute a tail probability, which we can control using the sub-Gaussianity of $X$.
More precisely, since $X$ is sub-Gaussian, $X_i - X$ is sub-Gaussian (with $\psi_2$-norm changing only by a universal constant), and Lemma 18 implies $\|X_i - X\|_{\psi_2} \le C\sqrt D\,\sigma_X$. Taking the square and using [48, Lemma 2.7.6], we obtain the corresponding sub-exponential bound. For the first term, we note that $u > \nu D\sigma_X^2 \log(N)\,\delta^{2s} > 4\delta^{2s}$ implies the required inclusion of events. Then the sub-exponentiality of $\|X - X_i\|^2$ and a union bound over $i \in [N]$ give the claim.
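Before moving on, the projection identity of Lemma 21 (Section A.2) admits a quick numerical sanity check: the cosines of the principal angles between two subspaces with orthonormal bases $A$ and $B$ are the singular values of $A^\top B$. The helper below is illustrative only.

```python
import numpy as np

def frobenius_vs_angles(A, B):
    """Evaluate both sides of ||P_A - P_B||_F^2 = 2 * sum_i sin^2(theta_i)
    (Lemma 21), where A and B have orthonormal columns spanning the two
    subspaces and theta_i are the principal angles between them."""
    PA, PB = A @ A.T, B @ B.T
    lhs = np.linalg.norm(PA - PB, "fro") ** 2
    cos_theta = np.linalg.svd(A.T @ B, compute_uv=False)
    cos_theta = np.clip(cos_theta, 0.0, 1.0)   # guard against rounding above 1
    rhs = 2.0 * np.sum(1.0 - cos_theta ** 2)   # 2 * sum_i sin^2(theta_i)
    return lhs, rhs
```

Drawing random orthonormal bases (for example via a QR decomposition of Gaussian matrices) and comparing the two returned values reproduces the identity up to floating-point error.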

A.4. Proof of Theorem 13
Interlude: Smoothness of linear compositions. In this section we establish smoothness properties of compositions with linear maps, with explicit bounds on the corresponding Lipschitz constants.
Example 1. Let $\alpha = e_i + e_j$ and $i(1) = i$, $i(2) = j$. Then the formula yields the derivative $\partial^\alpha \psi(z) = \sum_{a,b=1}^d W_{ai} W_{bj}\, \partial_a \partial_b \phi(Wz)$.

Proof. $\psi$ is a composition of a $C^s$ function with a linear transformation and is therefore as smooth as $\phi$. For the formula, we use induction over $k$. Let $\alpha$ be a multi-index with $\sum_{i=1}^d \alpha_i = 1$, i.e. $\alpha$ equals a standard basis vector $e_i$ for some $i \in [d]$. Since $\nabla\psi(z) = W^\top \nabla\phi(Wz)$, the base case follows. For the induction step $k-1 \to k$, let $\alpha$ be a multi-index with $\sum_{i=1}^d \alpha_i = k$ and calculate $\partial^\alpha \psi(z) = \partial_{i(k)}\, \partial^{\alpha - e_{i(k)}} \psi(z)$. Since $\alpha - e_{i(k)}$ is a multi-index whose entries sum to $k-1$, the induction hypothesis applies, where we used Schwarz's theorem (symmetry of partial derivatives) in the second-to-last equality. The result follows by expanding the product.
Lemma 23. Let $\phi : \mathbb R^d \to \mathbb R$, $s_1 \in \mathbb N_0$, $0 < s_2 \le 1$ and $s = s_1 + s_2$. Assume $\phi$ is $(L, s)$-smooth, let $W \in \mathbb R^{d \times d}$, and define $\psi(z) = \phi(Wz)$. Then $\psi$ is $(Ld^{s_1/2}\|W\|^s, s)$-smooth.

Proof. Since $W$ is a linear transformation, $\psi$ has as many continuous partial derivatives as $\phi$. Combining this with the previous calculation, and the fact that $\phi$ is $(L, s)$-smooth, we get the claim.

Bounding $\dim(\mathcal F(\hat A, l, k, R))$

Lemma 24. We have $|\Delta_l(R)| \le (2^{l+1}R)^d$, and thus $\dim(\mathcal F(\hat A, l, k, R)) \le \binom{d+k}{k}(2^{l+1}R)^d$.

Proof. First note that the number of cells with side length $2^{-l}$ required to cover $[-R, R]^d$ is given by $(2R \cdot 2^l)^d = (2^{l+1}R)^d$. Furthermore, for any $w \in \{\hat A^\top z : z \in B_R(0)\}$ we have $\|w\| \le R$, and hence $w \in [-R, R]^d$. Therefore a bound for $|\Delta_l(R)|$ is given by a bound for the number of cells covering $[-R, R]^d$.
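Lemma 24's covering bound is easy to probe empirically: count the distinct dyadic cells of side $2^{-l}$ hit by points of the ball $B_R$ and compare against $(2^{l+1}R)^d$. The snippet below is an illustrative check, not part of the paper's code.

```python
import numpy as np

def count_hit_cells(Z, l):
    """Number of distinct dyadic cells of side 2^-l containing a row of Z,
    identified by the integer vector floor(z * 2^l) per point."""
    cells = np.floor(Z * 2 ** l).astype(int)
    return len({tuple(c) for c in cells})
```

Sampling densely from the unit disk ($d = 2$, $R = 1$) at level $l = 3$, the number of occupied cells stays below the Lemma 24 bound $(2^{l+1}R)^d = 256$, since the ball occupies only a fraction of the covering cube.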
Bounding the approximation error. We first show the existence of a function $g^*$ that is almost as regular as $g$ and satisfies $g^*(\hat A^\top x) \approx g(A^\top x)$. Then we bound the approximation error between $f$ and $h$ over $B_R$. Finally, we provide the bound for the mean squared approximation error (the second term in (30)).
Proof. First notice that $|h(x) - f(x)| \le |h(x) - f^*(x)| + |f^*(x) - f(x)|$, where $f^*(x) = g^*(\hat A^\top x)$ and $g^*$ is the function defined in Lemma 25. Using the bound in Lemma 25 and $\|x\| \le R$, the second term is bounded by $L^* R^{1\wedge s}\|P - \hat P\|^{1\wedge s}$. It remains to bound $|h(x) - f^*(x)|$ for a suitably chosen $h$. Since $f^*(x) = g^*(\hat A^\top x)$ and $g^*$ is $(L^*, s)$-smooth, we can use the multivariate Taylor theorem to expand $g^*$ around $z_0$ as
$$g^*(z) = \sum_{|\alpha| \le s_1 - 1} \frac{\partial^\alpha g^*(z_0)}{\alpha!}(z - z_0)^\alpha + \sum_{|\alpha| = s_1} \frac{\partial^\alpha g^*(\eta)}{\alpha!}(z - z_0)^\alpha \qquad (39)$$
for some $\eta$ on the line segment from $z$ to $z_0$. We define the function $h$ as follows: for a cell $c \in \Delta_l$, let $z_c \in \mathbb R^d$ denote the center point of the cell, and set
$$h_c(z) := \sum_{|\alpha| \le s_1} \frac{\partial^\alpha g^*(z_c)}{\alpha!}(z - z_c)^\alpha.$$
Then we define $h \in \mathcal F(\hat A, l, s_1, R)$ by $h(x) := h_{c(x)}(\hat A^\top x)$, where $c(x) := \{c \in \Delta_l(R) : \hat A^\top x \in c\}$. To prove (38), we now use (39) with $z_0 = z_{c(x)}$ and compute the difference, where $\eta$ lies on the line between $\hat A^\top x$ and $z_{c(x)}$. The smoothness of $g^*$ yields a Hölder bound on $\partial^\alpha g^*(z_{c(x)}) - \partial^\alpha g^*(\eta)$ for $|\alpha| = s_1$. Since $\hat A^\top x, z_{c(x)} \in c(x)$, we can furthermore bound $\|\hat A^\top x - z_{c(x)}\| \le 2^{-(l+1)}\sqrt d$. Furthermore, since $c(x)$ is convex and $\eta$ is on the line between $\hat A^\top x$ and $z_{c(x)}$, it follows that $\eta \in c(x)$ and therefore also $\|z_{c(x)} - \eta\| \le 2^{-(l+1)}\sqrt d$. The claimed bound follows, where we used the multinomial formula in the second-to-last equality.
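The effect of the refinement level $l$ on the per-cell Taylor construction is easy to visualize in one dimension: with pieces of degree $k$ expanded at cell centres, the sup-norm error decays like $2^{-l(k+1)}$ for a smooth function. The sketch below is a 1-D analogue of the function $h$ built above; names and the grid-based error estimate are ours.

```python
import math
import numpy as np

def piecewise_taylor_error(f, derivs, l, R=1.0, n_grid=4001):
    """Sup-norm error on [-R, R] of the piecewise Taylor approximation
    with dyadic cells of side 2^-l, expanded at the cell centres z_c.
    derivs[j] is the j-th derivative of f (j = 0 is f itself), so the
    polynomial degree per cell is len(derivs) - 1."""
    x = np.linspace(-R, R, n_grid)
    n_cells = int(2 ** (l + 1) * R)                  # cells covering [-R, R]
    cell = np.clip(np.floor((x + R) * 2 ** l), 0, n_cells - 1)
    centres = (cell + 0.5) * 2.0 ** -l - R           # cell centre z_c
    h = sum(derivs[j](centres) * (x - centres) ** j / math.factorial(j)
            for j in range(len(derivs)))
    return float(np.max(np.abs(f(x) - h)))
```

For $f = \sin$ with degree-2 pieces, refining from $l = 2$ to $l = 4$ shrinks the error by roughly $2^{-3 \cdot 2} = 1/64$, matching the $2^{-ls}$ behaviour used in the proof.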
Corollary 27. In the setting of Theorem 13, we have
$$\inf_{h \in \mathcal F(\hat A, l, s_1, R)} \mathbb E\,|h(X) - f(X)|^2 \le C_1 N^{-\frac{2s}{2s+d}} + C_2 \log^{1\wedge s}(N)\,\|P - \hat P\|^{2(1\wedge s)}$$
with $C_1$ depending on $L^*$, $d$, $s$ and $C_2$ depending on $L^*$ and linearly on $(D\|X\|_{\psi_2}^2)^{1\wedge s}$.

Proof. Using the law of total expectation, and $|h(X) - f(X)| = |f(X)| \le 1$ whenever $\|X\| > R$, we obtain for any $h \in \mathcal F(\hat A, l, s_1, R)$ the decomposition (41). For the first term, we use the function $h$ from Proposition 26 satisfying the guarantee (38). Using $l = \lfloor \log_2(N)/(2s+d) \rfloor$, so that $2^{-l} \ge N^{-1/(2s+d)}$, and $R^2 = D\|X\|_{\psi_2}^2 \log(N)$, we get the first bound. For the second term in (41), we note that $\|X\|$ is sub-Gaussian with $\|X\|_{\psi_2}$-norm at most $\sqrt D\,\|X\|_{\psi_2}$ by Lemma 18. Therefore, using $R^2 = D\|X\|_{\psi_2}^2\log(N)$, we obtain $\mathbb P(\|X\| > R) \le \exp(-\log(N)) = N^{-1}$.

Finalizing the argument
Proof of Theorem 13. Theorem 16 and Corollary 27 imply the bound (42), where $\tilde C_i = C\,C_i$ with $C_i$ as in Corollary 27 and $C$ a universal constant. Furthermore, using Lemma 24, $2^l \le N^{1/(2s+d)} + 1$ and $R^2 = D\|X\|_{\psi_2}^2\log(N)$, we bound the complexity of $\mathcal F$ by
$$\dim(\mathcal F(\hat A, l, s_1, R)) \le \binom{d+s_1}{s_1}(2^{l+1}R)^d \le C_3\, N^{\frac{d}{2s+d}} \log^{\frac d2}(N)$$
with $C_3$ depending on $d$, $s_1$ and linearly on $(D\|X\|_{\psi_2}^2)^{d/2}$. Inserting this in (42) and using $N^{\frac{d}{2s+d}-1} = N^{-\frac{2s}{2s+d}}$, the result follows with $C_2 = \tilde C_2$ and $C_1 = \max\{\tilde C_1, C_3, C\max\{\sigma_\zeta^2, 1\}\}$.