Estimating covariance and precision matrices along subspaces

We study the accuracy of estimating the covariance and the precision matrix of a $D$-variate sub-Gaussian distribution along a prescribed subspace or direction using the finite sample covariance with $N \geq D$ samples. Our results show that the estimation accuracy depends almost exclusively on the components of the distribution that correspond to the desired subspaces or directions. This is relevant for problems where the behavior of the data along a lower-dimensional space is of specific interest, such as dimension reduction or structured regression problems. As a by-product of the analysis, we reduce the effect the matrix condition number has on the estimation of precision matrices. Two applications are presented: direction-sensitive eigenspace perturbation bounds, and estimation of the single-index model. For the latter, we propose a new estimator, derived from the analysis, with strong theoretical guarantees and superior numerical performance.


Introduction
Estimating the covariance Σ = E(X − EX)(X − EX)^⊤ and the precision matrix Σ† of a random vector X ∈ R^D is a standard and long-standing problem in multivariate statistics, with applications in a number of mathematical and applied fields. Notable examples include any form of dimension reduction, such as principal component analysis, nonlinear dimension reduction, and manifold learning, but also problems ranging from classification, regression, and signal processing to econometrics, brain imaging, and social networks.
Given independent copies X₁, ..., X_N of X, the most widely used estimator is the sample covariance Σ̂ := N^{-1} ∑_{i=1}^N X_i X_i^⊤ (here assuming EX = 0), and the inverse thereof. The crucial question in estimating covariance and precision matrices is to quantify the minimal number of samples N ensuring that for a desired accuracy ε > 0 and a confidence level u > 0 we have ‖Σ̂ − Σ‖₂ ≤ ε S_Σ(X), respectively, ‖Σ̂† − Σ†‖₂ ≤ ε S_{Σ†}(X), (1.1) with probability at least 1 − exp(−u). The constants S_Σ(X) and S_{Σ†}(X) in (1.1) describe the dependence of the error on the distribution of X and on properties of Σ, respectively Σ†.
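As a small illustration of the finite sample behavior (a sketch of ours, not taken from the paper; the diagonal covariance, dimensions, and sample sizes are hypothetical), the spectral-norm error of the sample covariance decays roughly like N^{−1/2}:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 20
# Hypothetical anisotropic covariance with eigenvalues 1, 1/2, ..., 1/D.
Sigma = np.diag(1.0 / np.arange(1, D + 1))

def sample_covariance(X):
    """Centered sample covariance (1/N) sum_i (X_i - mean)(X_i - mean)^T."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

errors = []
for N in (100, 1000, 10000):
    X = rng.multivariate_normal(np.zeros(D), Sigma, size=N)
    errors.append(np.linalg.norm(sample_covariance(X) - Sigma, 2))
print(errors)  # decays roughly like N^{-1/2}
```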
In practice, however, we are often not directly interested in the matrices Σ or Σ† themselves, but rather in how they act, as (bi-)linear operators, on specific vectors or matrices. In terms of concentration inequalities, this can be interpreted as developing bounds for ‖A(Σ̂ − Σ)B^⊤‖ and ‖A(Σ̂† − Σ†)B^⊤‖, where A and B are a given pair of (rectangular) matrices. Bounds of this type for sub-Gaussian distributions are the principal subject of this work.
For any submultiplicative matrix norm ‖·‖, the perturbations ‖A(Σ̂ − Σ)B^⊤‖ and ‖A(Σ̂† − Σ†)B^⊤‖ are bounded by ‖A‖‖Σ̂ − Σ‖‖B‖ and ‖A‖‖Σ̂† − Σ†‖‖B‖, respectively. This suggests that standard bounds for ‖Σ̂ − Σ‖ and ‖Σ̂† − Σ†‖ (see the overviews in Sections 1.1 and 1.2) suffice as long as the distribution is nearly isotropic, i.e. Σ ≈ Id_D. However, many modern data analysis tasks explicitly rely on anisotropic distributions, because different spectral modalities of the covariance matrix provide crucial, and complementary, information about the task at hand. In this case, using norm submultiplicativity together with standard bounds for ‖Σ̂ − Σ‖ and ‖Σ̂† − Σ†‖ overestimates the incurred errors, because it decouples A and B from their effect on the covariance and precision matrices. Thus, such bounds cannot capture the true behavior of the estimation error.
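The decoupling effect can be seen numerically. In the following sketch (ours; the anisotropic Σ and the projector are illustrative choices), the directional error along low-variance directions is far below what the submultiplicative bound suggests:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 10, 500
# Strongly anisotropic covariance: two dominant directions.
Sigma = np.diag([100.0] * 2 + [0.01] * (D - 2))
X = rng.multivariate_normal(np.zeros(D), Sigma, size=N)
Sigma_hat = X.T @ X / N

# Orthoprojector onto the low-variance coordinates (||P||_2 = 1).
P = np.diag([0.0] * 2 + [1.0] * (D - 2))
directional = np.linalg.norm(P @ (Sigma_hat - Sigma) @ P, 2)
decoupled = np.linalg.norm(Sigma_hat - Sigma, 2)  # = ||P|| * ||err|| * ||P||
print(directional, decoupled)
```

The directional error is orders of magnitude smaller here, since the sample covariance error concentrates in the high-variance directions.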
A typical example that leverages different modalities of (conditional) covariance matrices is given by problems that analyze the structure of point clouds, such as manifold learning. This is because such methods are often prefaced by a linearization step, where the globally non-linear geometry is locally approximated by tangential spaces. In such a case the conditional covariance of localized data points is anisotropic, i.e. the eigenvalues in tangential directions are notably larger than those in non-tangential directions, and the degree of anisotropicity increases as the data is more localized [37], facilitating the linearization.
Anisotropic distributions also play an important role in high-dimensional nonparametric regression problems that use structural assumptions. Denoting by Y the dependent output variable, a popular example is the multi-index model E[Y |X = x] = g(A^⊤x), for a matrix A with rank(A) ≪ D. The complexity of the underlying nonparametric regression problem can be significantly reduced by first identifying Im(A) and then performing nonparametric regression in R^{rank(A)}. Many methods in the structural dimension reduction literature [2,39] propose estimators for A that compare the global covariance matrix Σ (typically assumed to be isotropic) with conditional covariance matrices Cov(X|Y ∈ R). Here R ⊂ Im(Y) represents a connected level set of the output. The rationale behind this approach is that conditioning breaks the isotropicity and induces different spectral modalities with respect to directions belonging to Im(A) and its orthogonal complement. This is then leveraged to identify Im(A) [15,33].
We showcase the usability of bounds for ‖A(Σ̂† − Σ†)B^⊤‖ on a concrete example in Section 4 by analyzing the ordinary least squares estimator Σ̂†Ĉov(X, Y) as an estimator for the index vector a in the single-index model E[Y |X = x] = g(a^⊤x). Our analysis extends the studies [10,5] and shows how the accuracy of the estimator is affected by the anisotropicity of X. Furthermore, an examination of the developed estimation bounds motivates a modification of the ordinary least squares estimator via an approach based on conditioning and averaging, in the spirit of structural dimension reduction techniques [2,39]. We validate the superiority of this modified estimator theoretically and numerically, and thereby show how a careful tracking of the different modalities of the distribution of X helps to develop improved methods for common data analysis tasks.
Bounds developed in this work also have a few immediate corollaries, which might be of independent interest. These include eigenspace perturbation bounds similar to [59, Theorem 1], but sensitive to the behavior of X in the direction corresponding to the eigenspace of interest, and a relative bound for the smallest eigenvalue of Σ̂ comparable to [58, Theorem 2.2], but without the isotropicity assumption.

State of the art: covariance matrix estimation
The most common bounds for estimating the covariance matrix from finitely many observations consider sub-Gaussian [54,52] and bounded [6] random vectors. They state that with probability at least 1 − exp(−u) we have ‖Σ̂ − Σ‖₂ ≲ ‖X‖²_{ψ₂} m(√((D + u)/N)), (1.2) where m(t) = t ∨ t² and A ≲ B means A ≤ CB for some universal constant C. Besides these two cases, researchers have over the years investigated minimal moment or tail conditions on X such that bounds similar to (1.2) can be achieved. We refer to the papers [52,49,1,53], which consider more general classes of distributions. The most general setting we are aware of is [49], which considers distributions that, for universal constants C, η > 0, satisfy a tail condition on ‖PX‖₂ for every orthogonal projection P. Distributions satisfying this condition include log-concave random vectors (e.g. uniform distributions on convex sets) and product distributions whose marginals have uniformly bounded 4 + s moments for some s > 0.
Our bounds show that the sample covariance estimator Σ̂ automatically and implicitly adapts to rank(Σ), which serves as a complexity parameter of the estimation problem. However, in the regime N < rank(Σ) the sample covariance is rank deficient and estimation is in general not possible. Instead, structural assumptions, such as sparsity or bandedness, are needed to reduce the effective complexity of the problem and to allow consistent estimation. These assumptions can be leveraged by regularized estimation techniques, which include banding [9], thresholding [13,8], and penalized likelihood estimation [21]. We refer to [17,13] for detailed reviews of existing methods.
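To illustrate one such regularized technique, the following sketch (ours) applies entrywise thresholding to the sample covariance in the regime N < D; the sparse diagonal truth and the threshold level τ ≍ √(log(D)/N) are illustrative choices:

```python
import numpy as np

def sample_covariance(X):
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

def threshold_covariance(X, tau):
    """Entrywise thresholding: zero out small off-diagonal entries."""
    S = sample_covariance(X)
    T = np.where(np.abs(S) >= tau, S, 0.0)
    np.fill_diagonal(T, np.diag(S))   # keep the diagonal untouched
    return T

rng = np.random.default_rng(8)
D, N = 50, 30                          # fewer samples than dimensions
Sigma = np.eye(D)                      # sparse (diagonal) ground truth
X = rng.multivariate_normal(np.zeros(D), Sigma, size=N)
tau = 2.0 * np.sqrt(np.log(D) / N)
err_raw = np.linalg.norm(sample_covariance(X) - Sigma, 2)
err_thr = np.linalg.norm(threshold_covariance(X, tau) - Sigma, 2)
print(err_raw, err_thr)  # thresholding helps when the truth is sparse
```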
We also mention [29], which considers Gaussian random variables taking values in general Banach spaces, and [30], which analyzes the concentration of spectral projectors of the covariance matrix, and of (bi)-linear forms thereof, for Gaussian random variables in a separable Hilbert space. This is conceptually related to our work, as those bounds take into account information about the relevant spectral projectors and their interplay with the vectors to which they are applied. This has recently been extended to the estimation of smooth functionals of covariance operators [27,28] for Gaussians in separable Hilbert spaces.

State of the art: precision matrix estimation
Estimation of the precision matrix is relevant for many problems, ranging from simple tasks such as data transformations (e.g. the standardization Z := Σ^{−1/2}X) to applications that include linear discriminant analysis, graphical modeling, and complex data visualization. Furthermore, the precision matrix encodes information about partial correlations of the features of X. Namely, if X follows a Gaussian (or paranormal) distribution, the ij-th entry of Σ† is zero if and only if the i-th and the j-th feature are conditionally independent given the remaining features.
The inverse Σ̂† of the sample covariance, constructed from N independent copies of a mean zero random vector X ∈ R^D, is a well-behaved estimator of Σ† as N → ∞ while D is considered fixed [3]. In such a case bounds for the precision matrix can be obtained by using general perturbation bounds for the Moore-Penrose inverse. One of the first such bounds [55] states that for G ∈ R^{d₁×d₂}, and an additively perturbed matrix H = G + Δ, we have ‖H† − G†‖ ≤ ω max{‖G†‖₂², ‖H†‖₂²} ‖Δ‖, where ‖·‖ is any unitarily invariant norm, and ω is a small universal constant [42]. Recent studies [36,57] examine the influence of the perturbation in greater detail, implying the sharper bound (1.5). Using these general perturbation bounds it is easy to derive first concentration bounds for the precision matrix. For example, assume X is sub-Gaussian and that the number of independent data samples satisfies N = C4^k(D + u)‖X‖⁴_{ψ₂}/λ_{rank(Σ)}(Σ)² for some universal C > 0 and k ∈ N. Equation (1.2) and Weyl's bound [56] then imply a high-probability lower bound on λ_{rank(Σ)}(Σ̂) in terms of λ_{rank(Σ)}(Σ).

The perturbation bound (1.5) and the covariance bound (1.2) then give (1.6) with probability at least 1 − exp(−u), where the higher order term in (1.2) vanishes in the applicable regime on N. While this effectively provides bounds as soon as the covariance perturbation is bounded, in this work we show that (1.6) overestimates the error by assigning a quadratic scaling in ‖Σ†‖₂ (Corollary 11). Moreover, we are not aware of precision matrix bounds that take into account the specific nature of the perturbation, nor of bounds for ‖A(H† − G†)B^⊤‖₂ that are sensitive to the objects of interest.
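The discrepancy can be checked numerically. In this sketch (ours; the diagonal Σ is an illustrative choice), the observed error of the inverse sample covariance is orders of magnitude below a generic perturbation bound of the form ‖Σ†‖₂²‖Σ̂ − Σ‖₂:

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, nu = 10, 2000, 1e-4
Sigma = np.diag([1.0] * 5 + [nu] * 5)
Prec = np.diag([1.0] * 5 + [1.0 / nu] * 5)       # exact precision matrix
X = rng.multivariate_normal(np.zeros(D), Sigma, size=N)
Sigma_hat = X.T @ X / N

actual = np.linalg.norm(np.linalg.pinv(Sigma_hat) - Prec, 2)
generic = np.linalg.norm(Prec, 2) ** 2 * np.linalg.norm(Sigma_hat - Sigma, 2)
print(actual, generic)  # the generic bound overshoots by orders of magnitude
```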
If rank(Σ) grows with N, then Σ̂ is not a consistent estimator of Σ, and thus the precision matrix cannot be estimated well by inverting the sample covariance matrix. Various families of structured precision matrices have been studied to mitigate these issues, motivated by applications in genomics, finance, and other fields. Dominant assumptions are sparsity and bandedness, which are exploited through the use of regularized estimators. Algorithms for estimating Σ† based on regularization include computing Σ† column by column through entry-wise Lasso [41,18], constrained ℓ₁ minimization [11], adaptive ℓ₁ minimization [12], ℓ₁-regularized score matching [38], and ridge regressors [51]. See [17,13] for comprehensive overviews.

Overview and contributions
Throughout, X ∈ R^D is a sub-Gaussian random vector with X̄ := X − EX and Σ = Cov(X), and X₁, ..., X_N are independent copies of X. Sub-Gaussians are a broad class of light-tailed distributions that have received increasing attention in recent years and are used in many branches of probability and statistics. We define finite sample estimators of EX and Σ by μ̂_X := N^{-1} ∑_{i=1}^N X_i and Σ̂ := N^{-1} ∑_{i=1}^N (X_i − μ̂_X)(X_i − μ̂_X)^⊤. Let A and B be matrices determining a direction, subspace, or generally an object of interest. We can summarize our findings as follows.
(1) In Section 2 we show that with probability at least 1 − exp(−u) we have ‖A(Σ̂ − Σ)B^⊤‖₂ ≲ ‖AX̄‖_{ψ₂}‖BX̄‖_{ψ₂} m(√((d_A ∨ d_B + u)/N)), (1.7) where m(t) = t ∨ t², d_A := rank(AΣ), and d_B := rank(BΣ). This is similar to [52, Proposition 2.1] but replaces the sub-Gaussian norm ‖X‖_{ψ₂} by the direction/subspace dependent quantities ‖AX̄‖_{ψ₂} and ‖BX̄‖_{ψ₂}.
(2) In Section 3 we show that with probability at least 1 − exp(−u) an analogous bound (1.8) holds for ‖A(Σ̂† − Σ†)B^⊤‖₂ (stated for X ∼ N(μ, Σ), or more generally for strict sub-Gaussians [4]). The right hand side of (1.8) then scales linearly in ‖Σ†‖₂, whereas (1.6) shows a quadratic behavior. This has a significant impact for ill-conditioned covariance matrices and implies that inverting the sample covariance exhibits better conditioning than inverting a general matrix perturbation. The same effect is observed if A and B are arbitrary.
Two applications of bounds (1.7) and (1.8) are presented. In Section 2 we use the covariance bound (1.7) to establish a bound for perturbations of eigenspaces of the covariance matrix that is sensitive to the distribution of the random vector in the eigenspace of interest. This is relevant for example when estimating manifolds from unlabeled point cloud data, see [43,44,25].
In Section 4.1 we use (1.8) to establish sharp concentration bounds for single-index model estimation. In this model a response Y ∈ R is assumed to follow the regression model E[Y |X] = f(a^⊤X), and the task is to estimate the unknown vector a using a finite data set {(X_i, Y_i) : i = 1, ..., N}. A common estimator is the normalized ordinary least squares vector, for which we provide direction-sensitive concentration bounds. Furthermore, our analysis yields an insight into how the estimator can be improved by a simple and straightforward procedure based on conditioning and averaging. This is presented in Section 4.2.
Most proofs are deferred to the Appendix for the sake of brevity and clarity.

General notation
We denote [M] = {1, ..., M}, a ∧ b = min{a, b}, a ∨ b = max{a, b}, and we may use the auxiliary function m(t) = t ∨ t². For a real, symmetric matrix A ∈ R^{d×d} we denote by λ₁(A) ≥ ... ≥ λ_d(A) the ordered eigenvalues, and by u₁(A), ..., u_d(A) the corresponding eigenvectors. ‖·‖₂ denotes the spectral norm of a matrix, and the Euclidean norm of a vector, and ⟨·,·⟩ is the dot product. ‖·‖_F is the Frobenius norm. Id_D ∈ R^{D×D} is the identity matrix. S^{D−1} is the unit sphere in R^D. For any random vector X we write X̄ := X − EX. For p ≥ 1 and a random variable X we define the Orlicz norm ‖X‖_{ψ_p} := inf{t > 0 : E exp(|X|^p/t^p) ≤ 2}. The definition extends to random vectors X ∈ R^D by ‖X‖_{ψ_p} := sup_{v ∈ S^{D−1}} ‖⟨X, v⟩‖_{ψ_p}.
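The Orlicz norm can be approximated empirically by bisection on t, replacing the expectation with a sample mean (a helper of ours for illustration; the estimate is crude since the integrand is heavy-tailed):

```python
import numpy as np

def empirical_psi2(x, iters=60):
    """Bisect on t in the defining condition E exp(x^2/t^2) <= 2,
    with the expectation replaced by an empirical mean."""
    lo, hi = 1e-8, 10.0 * (np.abs(x).max() + 1.0)
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        # Cap the exponent to avoid floating point overflow for tiny t.
        if np.mean(np.exp(np.minimum(x ** 2 / t ** 2, 700.0))) <= 2.0:
            hi = t   # condition satisfied: a smaller t might still work
        else:
            lo = t
    return hi

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
est = empirical_psi2(x)
print(est)  # for N(0,1) the exact psi_2 norm is sqrt(8/3), about 1.63
```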
Throughout the paper C is a placeholder for a positive universal constant that may have a different value on each occurrence, even in the same line of text. Furthermore, we sometimes use A B instead of A ≤ CB.

Covariance matrix estimation
In this section we present bounds for covariance and eigenspace estimation that are sensitive to the distribution of a given random vector in directions of interest. The following matrix concentration bound is the fundamental tool of our analysis.

Remark 3. An analogous result holds for almost surely bounded random vectors by using a different concentration inequality. In that case ‖·‖_{ψ₂} can be replaced by a bound for the Euclidean norm of X, and the dimensionality does not appear in the requirement on N. We will return to this point at the end of Section 3.
The proof of Lemma 2 by and large follows along the lines of traditional concentration results. Indeed, if A = B, the result would follow by applying [52, Proposition 2.1] to the random vector AX̄, along with some minor adjustments to account for the ranks. When A ≠ B, a somewhat more careful tracking of the behavior of the random vector X with respect to the directions induced by the matrices A and B is needed. In the end, as (2.1) suggests, the payoff is that the error rate scales only with the components of X along those directions.

Remark 4. Some works that consider concentration inequalities for sub-Gaussian random variables do so with respect to the effective rank, defined as r(Σ) = tr(Σ)/‖Σ‖₂, see for instance [27] or [54, Remark 5.6.3]. The effective rank cannot exceed the true rank of a matrix, and, unlike rank(Σ), it is less affected by small eigenvalues. It can thus be a useful surrogate for approximately low dimensional distributions, and can in those cases be used to provide informative estimation bounds even if N ≪ rank(Σ). On the other hand, r(Σ) is not as useful for precision matrix estimation, since inverting a matrix reverses the ordering of the eigenvalues; r(Σ) would thus have to be replaced with r(Σ†). Furthermore, our current proof technique for precision matrix estimation requires Im(Σ̂) = Im(Σ), which immediately implies N ≥ rank(Σ). To provide a unified framework, we decided to abstain from using the effective rank in our results.
Applying Lemma 2 we can reconstruct known error rates in case of low-rank distributions in high-dimensional ambient spaces.
Lemma 2 also has an immediate effect on the estimation of eigenvectors and eigenspaces. Denote by P_{i,l}(Σ) := ∑_{k=i}^{l} u_k(Σ)u_k(Σ)^⊤ the orthoprojector onto the space spanned by the i-th to l-th eigenvectors of Σ, and let P_{i,l}(Σ̂) be the corresponding finite sample estimator. Moreover, denote Q_{i,l}(Σ) := Id_D − P_{i,l}(Σ). The claim now follows by applying Lemma 2 with A = Id_D and B = P_{i,l}(Σ).
Typical bounds for eigenspace perturbations ‖Q_{i,l}(Σ)P_{i,l}(Σ̂)‖₂ take the specific eigenspace into account only through the denominator, whereas the numerator relies on squared terms of the form ‖X‖²_{ψ₂} in the sub-Gaussian case, or on a bound for ‖X‖₂² in the bounded case. Expression (2.4) is thus beneficial whenever ‖P_{i,l}(Σ)X̄‖_{ψ₂} is smaller than ‖X‖_{ψ₂}, as it provides a sharper estimate.
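A quick numerical sanity check (ours; the diagonal Σ with a well-separated low-variance block is an illustrative choice) shows that the spectral projector onto a low-variance eigenspace can be estimated accurately even though ‖X‖_{ψ₂} is dominated by the high-variance directions:

```python
import numpy as np

rng = np.random.default_rng(4)
D, N = 8, 4000
# Six high-variance directions and a well-separated low-variance pair.
Sigma = np.diag([100.0] * 6 + [0.01, 0.001])
X = rng.multivariate_normal(np.zeros(D), Sigma, size=N)
Sigma_hat = X.T @ X / N

w, U = np.linalg.eigh(Sigma_hat)          # eigenvalues in ascending order
P_hat = U[:, :2] @ U[:, :2].T             # projector onto two smallest directions
P = np.diag([0.0] * 6 + [1.0, 1.0])       # population projector (known here)
Q = np.eye(D) - P
perturb = np.linalg.norm(Q @ P_hat, 2)
print(perturb)  # small, despite the variance-100 block dominating ||X||_psi2
```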
In order to ensure δ_{il} > 0, the covariance matrix Σ must have a population eigengap, that is, δ*_{il} > 0, and sufficiently many samples are required in order to stabilize δ_{il} around δ*_{il}. The latter is typically achieved by first using Weyl's bound [56], giving |λ_k(Σ̂) − λ_k(Σ)| ≤ ‖Σ̂ − Σ‖₂ for all k, and then applying a concentration bound for ‖Σ̂ − Σ‖₂. A consequence, however, is that Proposition 6 is only informative if we have sufficiently many samples with respect to δ*_{il} and ‖X‖_{ψ₂}. Thus, the estimation error is no longer sensitive to the eigenspace of interest.
Preserving the dependence on ‖P_{i,l}(Σ)X̄‖_{ψ₂}, instead of on ‖X‖_{ψ₂}, requires the use of relative eigenvalue bounds. Such bounds have recently been provided in [48] under the assumption that X is strictly sub-Gaussian [4], i.e. that for some K > 0, and for arbitrary matrices U ∈ R^{k×D}, X satisfies ‖UX̄‖²_{ψ₂} ≤ K‖UΣU^⊤‖₂. (2.5) This asserts that the squared sub-Gaussian norm is a good proxy for the variances of one-dimensional marginals of the random vector X, and is for instance satisfied for X ∼ N(0, Σ) with K = 1. For such distributions we obtain the following concentration of the sample eigengap.
There exists a constant C_K > 0, depending only on K, for which the following holds. Note that the event {λ_{rank(Σ)}(Σ̂) > 0} is equivalent to asking whether Im(Σ̂) = Im(Σ). This holds for N ≥ rank(Σ) if the law of X has a density which is absolutely continuous with respect to the Lebesgue measure on Im(Σ), because X₁, ..., X_N are then almost surely linearly independent. It also holds with probability at least 1 − exp(−u) if X is sub-Gaussian, as soon as N > C_K(rank(Σ) + u) provided (2.5) holds, and as soon as N > C(rank(Σ) + u)‖√Σ† X̄‖⁴_{ψ₂} otherwise. Lemma 7 now allows us to refine Proposition 6 with a population eigengap.
There exists C_K > 0, depending only on K, so that whenever N satisfies (2.6), for any u > 0 we have (2.7) with probability at least 1 − exp(−u). Proof. The first inequality follows immediately by conditioning on the events in Proposition 6 and Lemma 7, and then using the union bound (the probability 1 − 3exp(−u) can be adjusted by adjusting C_K). The second inequality in (2.7) comes from additionally using (2.5) and ‖Cov(P_{i,l}(Σ)X)‖₂ = λ_i(Σ).
Recently, [59] showed a useful alternative bound (2.8). With (2.8) we do not have to stabilize the sample eigengap, and the eigenspace perturbation bound can be used for arbitrary N ≥ 1. A natural question to ask is whether Proposition 6 holds if δ_{il} is replaced with δ*_{il}. The following example strongly suggests that this is not the case.
Notice that in this case ‖X‖_{ψ₂} = 1 and ‖u_D(Σ)^⊤X‖_{ψ₂} = η, by the definition of the sub-Gaussian norm. For any N ≥ 1 such a bound would then hold with probability at least 1 − exp(−u). In particular, we would have an increasingly small estimation error as η → 0 using only one data sample, i.e. when estimating P_{D,D}(Σ) based on the eigendecomposition of a rank one matrix XX^⊤.

Precision matrix estimation
In this section we investigate directional estimates of the precision matrix Σ†, through the empirical precision matrix Σ̂†, analogously to the results in Section 2.
Let us comment on the implications of Theorem 10. First, we note that ‖√Σ† X̄‖_{ψ₂} is the sub-Gaussian norm of the standardization of X. Provided X is strictly sub-Gaussian for some K > 0, see (2.5), we have ‖√Σ† X̄‖²_{ψ₂} ≤ K. It follows that for such distributions ‖√Σ† X̄‖_{ψ₂} has a negligible effect on estimating precision matrices.
Second, similar to Lemma 2, the bound (3.1) depends only on the components of X induced by A and B. Since the sub-Gaussian norm can be interpreted as a proxy for the variance, this improves upon non-directional bounds whenever the eigenvalues of AΣ†A^⊤ and BΣ†B^⊤ are small compared to those of Σ†.
Third, as soon as N > C(rank(Σ) + u)‖√Σ† X̄‖⁴_{ψ₂}, the estimation rate in (3.1) is similar to the covariance estimation rate in Lemma 2. That is, assume we are trying to estimate Σ† with the sample covariance matrix of the random vector Z = Σ†X through i.i.d. copies Z_i = Σ†X_i. In that case Lemma 2 gives precisely the bound (3.1). This should come as a bit of a surprise, since it implies that estimating the precision matrix through the inverse of the sample covariance has the same theoretical guarantees as if we had access to a random vector Z whose covariance is exactly Σ†. To further stress this point, we now compare (3.2) with the bound (3.3), which was derived in Section 1.2 using general perturbation bounds for the matrix inverse.
Assuming the squared sub-Gaussian norm is a good proxy for the variance, i.e. if (2.5) holds, the right hand side in (3.3) becomes, up to a factor of the condition number ‖Σ‖₂‖Σ†‖₂, equal to K‖Σ†‖₂√((rank(Σ) + u)/N). Compared to (3.2), this shows that using a general perturbation bound overestimates the influence that the matrix condition number ‖Σ‖₂‖Σ†‖₂ has on precision matrix estimation. The discrepancy between these two results suggests that the finite sample covariance estimator induces a specific type of perturbation, one that performs a form of regularization when estimating the inverse.
An immediate corollary is a relative bound for the smallest eigenvalue of Σ̂.
If (2.5) does not hold, the right hand side of (3.4) remains valid up to absolute constants and additional factors of ‖√Σ† X̄‖_{ψ₂}.

Numerical validation. To validate the results of Theorem 10 we consider X ∼ N(0, Σ) for two types of covariance matrices Σ ∈ R^{10×10}.

Setting 1: Set Σ = USU^⊤, for U ∈ R^{10×10} sampled uniformly at random from the space of orthonormal matrices, and S = Diag(1, 1, 1, 1, 1, ν, ν, ν, ν, ν). We consider ν = 10^{−j+1} for j ∈ [10], which implies that the matrix condition number ‖Σ‖₂‖Σ†‖₂ ranges from 1 to 10⁹.

Setting 2: Set Σ_{i,j} = ν^{|i−j|}, with ν ∈ {0.5, 0.55, ..., 0.9, 0.95}. This is a common model for distributions where the entries of X correspond to values of a certain feature at different time stamps. It leads to correlated entries when the time stamps are close by, i.e. when |i − j| is small.

We choose the matrices A and B by sampling an orthoprojector A uniformly at random with rank(A) = 3 and setting B = Id_D − A. We repeat each experiment 100 times and report the averaged relative errors in Figure 1. Different lines of the same color correspond to different values of ν.
The estimation error shows the expected N^{−1/2} rate. Moreover, the directional errors show a clear dependence on the directional spectral norm. Furthermore, since we are sampling Gaussians, the squared sub-Gaussian norm is a proxy for the variance of the given random vector, and the results confirm that the error term in Theorem 10 indeed scales with the corresponding (directional) sub-Gaussian norm. Lastly, although the condition number of Σ depends on ν, the specific value of ν does not affect how accurate Σ̂† is, as predicted by the theory.
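A condensed sketch of Setting 1 (ours; it mirrors the experiment only loosely, with a single repetition and fewer values of ν) reproduces the last observation, i.e. the relative error of Σ̂† is essentially unaffected by the condition number:

```python
import numpy as np

rng = np.random.default_rng(5)
D, N = 10, 5000
rel = {}
for nu in (1.0, 1e-4, 1e-8):
    U, _ = np.linalg.qr(rng.normal(size=(D, D)))    # random orthonormal U
    S = np.diag([1.0] * 5 + [nu] * 5)
    Sigma = U @ S @ U.T
    Prec = U @ np.diag(1.0 / np.diag(S)) @ U.T      # exact inverse
    X = rng.multivariate_normal(np.zeros(D), Sigma, size=N)
    Prec_hat = np.linalg.pinv(X.T @ X / N)
    rel[nu] = np.linalg.norm(Prec_hat - Prec, 2) / np.linalg.norm(Prec, 2)
print(rel)  # comparable relative errors despite condition numbers 1 to 1e8
```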
Bounded random vectors. As mentioned in Remark 3, a stronger form of the direction dependent covariance and precision matrix estimation bounds holds for bounded random vectors. The proofs for the bounded case follow along similar lines as for the sub-Gaussian case, except for the use of a slightly different probabilistic argument. We now state only the results for the estimation of the covariance and precision matrix, since the remaining bounds follow by analogy.

Application to single-index model estimation
In this section we use the results of Sections 2 and 3 to establish concentration bounds for estimating the index vector a in the single-index model g(a^⊤x). Moreover, the directional terms that arise from using direction dependent matrix concentration bounds provide an insight into how to further improve the performance of the standard estimator. The second half of the section is thus devoted to describing this strategy (which is based on splitting up the data, conditioning, and averaging), proving the error estimates, and providing numerical evidence to show and examine the claimed performance gains.
This quantity can be seen as a restricted condition number ‖PΣ†P‖₂‖PΣP‖₂ for the matrix Σ = Cov(X). Indeed, for strict sub-Gaussians, i.e. those satisfying (2.5) for some K, the two are equal up to a factor depending on K.

Ordinary least squares for the single-index model
The single-index model (SIM) is a popular semi-parametric regression model that poses the relationship between the features X ∈ R^D and the responses Y ∈ R as E[Y |X] = f(a^⊤X), where f is an unknown link function and a ∈ S^{D−1} is an unknown index vector. The SIM was developed in the 80s and 90s [10,22] as an extension of generalized linear regression that does not specify the link function, and which can thus avoid errors incurred by model misspecification.
Common applications are in econometrics [16,40] and in signal processing under sparsity assumptions on the index vector [46,45]. It has been shown, e.g. in [19], that (in certain scenarios) the minimax estimation rate of the SIM equals that of nonparametric univariate regression. Methods for estimating the SIM from a finite data set {(X_i, Y_i) : i ∈ [N]} often first construct an approximate index vector â, and then use nonparametric univariate regression on {(â^⊤X_i, Y_i) : i ∈ [N]} to estimate the link function. With such an approach the generalization error of the resulting estimator depends largely on the error incurred by estimating the index vector. Thus, the construction of â becomes the critical point.
An efficient approach, which first appeared in [35,10], and later in modified forms in [20,5], is to solve the ordinary least squares (OLS) problem (4.1) and then set d̂ := b̂/‖b̂‖₂. It was shown in [5] that √N(d̂ − a) is asymptotically normal with mean zero, provided X has an elliptical distribution and f is a nondecreasing function that is strictly increasing on some non-empty sub-interval of the support of a^⊤X. Assuming only ellipticity of X and statistical independence of Y and X given a^⊤X, the population vector b := Σ†Cov(X, Y) is still contained in span{a}, see e.g. [26, Proposition 3]. Provided b ≠ 0, in these cases the direction d := b/‖b‖₂ equals the index vector a up to sign. Thus, d̂ is in many cases a consistent estimator of the index vector a, and under certain conditions it converges with an N^{−1/2} rate.
The minimal ‖·‖₂-norm solution of (4.1) admits the closed form b̂ = Σ̂†Ĉov(X, Y). Using the results of the previous two sections we can show a direction dependent concentration bound for the vector b̂.
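The estimator is straightforward to implement. The following sketch (ours; the Gaussian design, the monotone link f(t) = t + sin(t)/2, and the noise level are illustrative choices) computes b̂ = Σ̂†Ĉov(X, Y) and its normalization d̂:

```python
import numpy as np

rng = np.random.default_rng(6)
D, N = 10, 5000
a = np.zeros(D); a[0] = 1.0                         # true index vector
X = rng.normal(size=(N, D))
Y = X @ a + 0.5 * np.sin(X @ a) + 0.1 * rng.normal(size=N)

Xc = X - X.mean(axis=0)
Sigma_hat = Xc.T @ Xc / N
cov_xy = Xc.T @ (Y - Y.mean()) / N                  # empirical Cov(X, Y)
b_hat = np.linalg.pinv(Sigma_hat) @ cov_xy          # OLS vector
d_hat = b_hat / np.linalg.norm(b_hat)               # normalized direction
print(abs(d_hat @ a))  # close to 1: d_hat aligns with the index vector
```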
For the normalized vector, respectively the directions, the following bound holds.

Corollary 15.
Assume the setting of Lemma 14. There exists a constant C > 0 such that for any u > 0, with probability at least 1 − exp(−u), we have (4.6). As mentioned before, under certain conditions we have d = a, where a is the index vector in the given SIM. In such cases Corollary 15 confirms that d̂ is a consistent estimator of a and achieves an N^{−1/2} convergence rate, which has been observed in previous works. To provide some context for our result we give a brief description of the two most popular strategies for estimating a.
The first group of methods is distinguished by simplicity and efficiency, as these methods estimate the index vector using empirical estimates of first and second order moments of the random variables X and Y. The studied OLS estimator (sometimes also referred to as the average derivative estimator [10,35]) belongs to this group. Inverse regression based techniques also fall into this category, see e.g. [34,15,33] or the review [39]; their convergence rate usually equals N^{−1/2} and is thus on par with OLS. A drawback, caused essentially by the simplicity of these methods, is that theoretical guarantees typically require X to follow an elliptical distribution, and sometimes a form of monotonicity is needed to avoid issues that arise when dealing with symmetric link functions such as x ↦ (a^⊤x)².
The second, more sophisticated but also computationally heavier, class of methods is based on solving more complex optimization programs for the recovery of a, for which no closed form solutions exist. This includes non-parametric methods for sufficient dimension reduction, e.g. [14], which aim at estimating the gradients of f(a^⊤X) at all training samples X_i; methods based on combining index estimation and monotonic regression, e.g. [24,23]; and methods that simultaneously estimate the index vector and use spline interpolation [47,32], to name just a few. For these methods convergence rates up to N^{−1/2} (sometimes slower) can typically be proven, and the theory requires less stringent assumptions than for the first class of methods.
A new insight in Corollary 15, which has not been emphasized before, is the directionally-dependent influence of the spectrum of Σ on the estimation guarantee. To make this more precise, we now consider the special case where X is strictly sub-Gaussian, the link function f is Lipschitz-smooth, and the index a is an eigenvector of Σ. We note that similar results hold if a is an approximate eigenvector of Σ, in the sense that a^⊤Σ†a ≈ (a^⊤Σa)^{−1}.

Corollary 16.
Let Y be sub-Gaussian, let X be strictly sub-Gaussian (i.e. satisfying (2.5) for some K > 0), and assume Y = f(a^⊤X) + ζ with Eζ = 0 and σ_ζ := ‖ζ‖_{ψ₂} < ∞. Assume consistent estimation, i.e. d = a, and that Σa = σ²_P a. Then there exists C_K > 0, depending only on K, so that for any u > 0 the stated bound holds with probability at least 1 − exp(−u).

Corollary 16 shows that the variance in the direction of the index vector, σ²_P = Var(a^⊤X), and the variance in the orthogonal directions, quantified through the spectrum of QΣ†Q, influence index vector estimation in significantly different manners. Namely, as long as σ_P ≳ σ_ζ, a smaller σ_P has a provably beneficial effect on the estimation accuracy, whereas small non-zero eigenvalues of Σ (i.e. large eigenvalues of Σ†), corresponding to eigenvectors in Im(Q), can only worsen the accuracy. A similar observation can be made when inspecting the asymptotic covariance of √N(d̂ − a), which, after using Σa = σ²_P a and [5, Theorem 1], is given by the product of a scalar factor T₁ and a matrix T₂. The scalar factor T₁ is comparable to ‖b‖₂², and the matrix T₂ shows that the variance of the estimator grows with large eigenvalues of QΣ†Q. On the other hand, the variance of the residual Ỹ − X̄^⊤b decreases if the accuracy of the linear fit X̄^⊤b ≈ f(a^⊤X) − Ef(a^⊤X) improves, which can typically be observed under monotonic links f and decreasing σ²_P = Var(a^⊤X). We will now use this observation as a guiding principle for developing a modified estimator that splits the data into subsets with small σ_P's, computes corresponding OLS vectors on each subset, and then combines the local estimators into a global estimator by weighted averaging.

[Figure 2: dyadic level-set partitioning of the data into J sub-intervals; the color indicates the labeling.]

Averaged conditional least squares for the single-index model
We now study the just discussed alternative procedure, where we first split the data into subsets, aiming to reduce the variance of the data distribution in the direction of the index vector, and then compute and average the estimators from each subset. Since we have no a priori knowledge about the index vector, constructing such a partition seems challenging. However, when the link function is monotonic, the partitioning is induced by a decomposition of Im(Y ) into equisized intervals, see Figure 2. For the sake of simplicity, in the following we assume Y ∈ [0, 1) holds almost surely. For ℓ = 1, . . . , J we define the intervals R J,ℓ := [(ℓ − 1)/J, ℓ/J) and the level sets S J,ℓ := {i : Y i ∈ R J,ℓ }, which induce a partition of the data set into J subsets based on the responses. Then estimate a according to the following algorithm.
Step 2 Define the empirical density ρ̂ J,ℓ := |S J,ℓ |/N , set the thresholding parameter α > 0, and compute the averaged outer product matrix M J from the OLS vectors of the retained level sets.

570Ž. Kereta and T. Klock
Step 3 Use the eigenvector corresponding to the largest eigenvalue of M J , denoted as u 1 (M J ), as an approximation of the index vector a.
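The three steps above can be sketched in a few lines of numpy. This is an illustrative reading, not the verbatim construction: the exact weighting of M J by the empirical densities ρ̂ J,ℓ , the threshold ρ̂ J,ℓ ≥ α/J, and the use of the pseudo-inverse local OLS vector are our assumptions.

```python
import numpy as np

def acls_index_estimate(X, Y, J, alpha=0.05):
    """Averaged conditional least squares sketch for the single-index model.

    Assumes Y takes values in [0, 1). The thresholding and weighting of
    M_J below are one plausible reading of Steps 1-3 (illustrative only).
    """
    N, D = X.shape
    M = np.zeros((D, D))
    # Step 1: dyadic level sets R_{J,l} = [(l-1)/J, l/J) partition the responses.
    labels = np.minimum((Y * J).astype(int), J - 1)
    for l in range(J):
        S = labels == l
        rho = S.sum() / N                      # empirical density of the level set
        # Step 2: keep only level sets whose mass is as if Y were uniform (alpha-threshold).
        if rho < alpha / J:
            continue
        Xc = X[S] - X[S].mean(axis=0)          # centered covariates in the level set
        Yc = Y[S] - Y[S].mean()                # centered responses in the level set
        # local OLS vector b_{J,l} = Sigma_l^+ Cov(X, Y | level set)
        Sigma_l = Xc.T @ Xc / S.sum()
        r_l = Xc.T @ Yc / S.sum()
        b = np.linalg.pinv(Sigma_l) @ r_l
        M += rho * np.outer(b, b)              # density-weighted average of outer products
    # Step 3: top eigenvector of M_J approximates the index vector (up to sign).
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, -1], eigvals[-1]
```

Since only the span of the index vector is identifiable, the returned eigenvector approximates a up to a sign.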
The parameter α is used to promote numerical stability by suppressing the contributions of sparsely populated subsets. In other words, we only keep those level-sets whose empirical mass behaves as if Y were uniformly distributed over Im(Y ) (which can in some problems be achieved by a suitable transformation of the responses). The parameter J on the other hand defines the number of sets in the partition of the given data set, and dictates the trade-off between Var(a X | Y ∈ R J,ℓ ) and the number of samples |S J,ℓ | in a given level set.
For a random vector Z we denote the conditional random vector Z J,ℓ := Z | Y ∈ R J,ℓ . Note that Z J,ℓ inherits the sub-Gaussianity of Z provided P(Y ∈ R J,ℓ ) > 0, see Lemma 23. We now analyze the approach under the following assumption.

Assumption (A) is not particularly restrictive. For example, it can be shown that (A) holds if X is elliptically symmetric, which is a standard assumption when using the OLS functional.

Theorem 17 states that there exists a sign s ∈ {−1, 1} so that for any u > 0, with probability at least 1 − exp(−u), the bound (4.10) holds, where ε N,J,u := C ((rank(Σ) + log(J) + u)/(αJN )) 1/2 .
The same guarantee holds when replacing the denominator in (4.10) with λ 1 (M J ).
The error rate in Theorem 17 is dictated by ε N,J,u , and there are two ways to interpret Theorem 17. First, for a fixed parameter J all terms in (4.9) and (4.10) (except N and ε N,J,u ) are constants, and we obtain an N −1/2 convergence rate for the non-squared error as soon as the sample size is sufficiently large. This is the same rate as the one achieved with the standard OLS estimator (4.2).
Parameter J however provides additional flexibility, and an intriguing option is to select J as a function that grows with N while still complying with (4.9). Namely, using u = log(J) and J ≍ N/ log(N ), and plugging into (4.10), implies that a rate of order (log(N/ log(N )) log(N )) 1/2 N −1 ≲ log(N )N −1 is achievable, so long as the remaining terms involving the parameter J are balanced.
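To spell this calculation out, assume our reading of the display above, ε N,J,u = C ((rank(Σ) + log(J) + u)/(αJN )) 1/2 , and treat rank(Σ) as a constant. The choices u = log(J) and J ≍ N/ log(N ) then give

```latex
\varepsilon_{N,J,\log J}
  = C\sqrt{\frac{\operatorname{rank}(\Sigma) + 2\log J}{\alpha J N}}
  \;\asymp\; C\sqrt{\frac{\log\!\big(N/\log N\big)\,\log N}{\alpha\,N^{2}}}
  \;\lesssim\; \frac{\log N}{\sqrt{\alpha}\,N},
\qquad J \asymp \frac{N}{\log N},
```

which is the claimed near-N −1 rate up to logarithmic factors.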
To give an illustrative example of such a case, we consider the following idealized setting. We assume the noise-free regime Y = f (a X) with a strictly monotonic link in the sense that for some L > 0 we have Cov(a X, f (a X) | Y ∈ R J,ℓ ) > L Var(a X | Y ∈ R J,ℓ ). (4.11) Furthermore, the conditional random variables X J,ℓ are assumed strictly sub-Gaussian, where the strict sub-Gaussianity constant K > 0 in (2.5) is independent of J and ℓ, and we model the local covariance matrices as in (4.12), where c 1 , c 2 are universal constants independent of J. Condition (4.12) implies that partitioning the data into J level-sets results in a reduction of the variance along the direction of the index vector, while not affecting the variance in directions orthogonal to the index vector.
With strict sub-Gaussianity and (4.12), we can bound the quantities appearing in the guarantee. Then, using u = log(J) and J = τ N/ log(N ) for some τ > 0, the result (4.10) can be written as (4.13), where C K > 0 depends only on K, c 1 , and c 2 . Furthermore, using Assumption (A) and (4.12), we have Cov(X, Y | Y ∈ R J,ℓ ) = Cov(a X, Y | Y ∈ R J,ℓ ) a, and with the asserted monotonicity (4.11) we get (4.14). Plugging this bound into (4.13), and choosing τ ∈ (0, αL 2 /(2C K (rank(Σ) + 1))), we thus obtain a log 2 (N )/N 2 convergence rate for the squared error.

The described idealized setting can be a reasonable approximation of reality for strictly monotonic link functions f in the noise-free regime, since (4.12) can be motivated by the observation that in this case slicing the distribution of X based on Y reduces the variance of the data along the direction of the index vector, as shown in Figure 2. Thus, monotonicity is a key requirement for establishing a faster rate. In the case of noisy responses and strictly monotone links,

the matter is more delicate. Namely, by the minimax rate of linear regression, which equals Var(Y − f (a X))/N , the result (4.13) can only hold for all N ≲ log 2 (N )/Var(Y − f (a X)). The same observation carries over to more general strictly monotonic links because, in the presence of noise, condition (4.12) and the lower bound (4.14) do not hold for all J ∈ N. For instance, if J 2 exceeds 1/Var(Y − f (a X)) or becomes even larger, the conditional distribution Y | Y ∈ R J,ℓ has roughly as much correlation with the label noise as it does with the labels f (a X). This suggests ‖b J,ℓ ‖ 2 ≈ 0, and that further slicing the distribution of X based on Y will not decrease the variance in the index vector direction, as required per (4.12). For J −2 ≫ Var(Y − f (a X)) however, conditions (4.12) and (4.14) can still hold approximately, and our experiments below suggest that the proposed estimator benefits from an aggressive splitting of the data in the case of nonlinear, strictly monotonic links. We now examine the implications of Theorem 17 with synthetic examples, under a data-driven parameter choice rule for J.

Numerical setup and parameter choice
We sample X ∼ N (0, Id D ), with D = 10, and let a = (1, 0, . . . , 0) (the specific choice of a is irrelevant for the results due to the rotational invariance of X). Responses are generated by Y = f (a X) + ζ with ζ ∼ N (0, σ ζ 2 Var(f (a X))). We set α = 0.05 and additionally exclude subsets with |S J,ℓ | < 2D, for the sake of numerical stability. Our goal is twofold: first, to compare estimation of the index vector using the standard OLS approach (4.2) with the strategy proposed in this section, and second, to empirically confirm the observations described after Theorem 17.
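The setup above can be reproduced as follows. The min-max rescaling of Y into [0, 1), needed for the dyadic level-set construction, is our addition; the link f and the noise level sigma_zeta are free parameters.

```python
import numpy as np

def generate_sim_data(N, f, sigma_zeta, D=10, rng=None):
    """Synthetic single-index data matching the numerical setup:
    X ~ N(0, Id_D), a = e_1, Y = f(a^T X) + zeta, where the noise standard
    deviation is scaled by sqrt(Var(f(a^T X))). The rescaling of Y into
    [0, 1) is our assumption, made to match the level-set construction."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((N, D))
    a = np.zeros(D)
    a[0] = 1.0
    signal = f(X @ a)
    noise = rng.normal(0.0, sigma_zeta * np.sqrt(signal.var()), size=N)
    Y = signal + noise
    # rescale responses into [0, 1) for the dyadic partitioning
    Y = (Y - Y.min()) / (Y.max() - Y.min() + 1e-12)
    return X, Y, a
```

Any of the link functions from Figure 3 (e.g. the identity, or monotone nonlinearities such as tanh) can be passed as f.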
A critical step of the modified approach is the selection of J as a function of N . As mentioned in Theorem 17, the denominator effectively lower bounds λ 1 (M J ), and a corresponding concentration bound is available. Neglecting the dependence of κ J,ℓ and ‖QΣ † J,ℓ ‖ ψ2 on J, as well as log(J)-terms originating from union bounds in the proofs, the error is minimized when maximizing Jλ 1 (M J ). We thus propose to adaptively choose J by J * := argmax J Jλ 1 (M J ), (4.15) where we use an exponential grid in N to decrease the computational demand. Note also that, under the parameter choice (4.15), we expect no further increase of J if the norms ‖b J,ℓ ‖ 2 decay rapidly.

Figure 3 shows the errors ‖d − a‖ 2 and the corresponding optimized choices J * of the number of level-sets J, for different link functions f . Solid lines correspond to the estimator (4.2), dashed lines to u 1 (M J * ), and different colors represent different noise levels σ ζ . In Figures 3(b)-3(d) we consider monotonic link functions, and we can see that the approach presented in this section performs substantially better than the standard OLS approach. On the other hand, in Figures 3(a), 3(e) and 3(f) the two approaches achieve similar performance. In the case of 3(a) this is because (4.2) is indeed optimal, according to the Gauss-Markov theorem, whereas the link functions in 3(e) and 3(f) are not monotonic. In the latter case, the plots of J * confirm that λ 1 (M J ) decays rapidly as a function of J, leaving J * essentially constant as a function of N .
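A sketch of the adaptive rule (4.15) over an exponential grid. The construction of M J is our illustrative reading (density-weighted outer products of level-set OLS vectors with an α-threshold), and the |S J,ℓ | ≥ 2D stability cut follows the numerical setup described above; grid base and details are assumptions.

```python
import numpy as np

def top_eig_MJ(X, Y, J, alpha=0.05):
    """lambda_1(M_J) for a given J; M_J is built from level-set OLS
    vectors (one plausible reading of the construction, not verbatim)."""
    N, D = X.shape
    labels = np.minimum((Y * J).astype(int), J - 1)
    M = np.zeros((D, D))
    for l in range(J):
        S = labels == l
        rho = S.sum() / N
        # alpha-threshold on the empirical density plus the |S| >= 2D stability cut
        if rho < alpha / J or S.sum() < 2 * D:
            continue
        Xc = X[S] - X[S].mean(axis=0)
        Yc = Y[S] - Y[S].mean()
        b = np.linalg.pinv(Xc.T @ Xc / S.sum()) @ (Xc.T @ Yc / S.sum())
        M += rho * np.outer(b, b)
    return np.linalg.eigvalsh(M)[-1]

def select_J(X, Y, alpha=0.05, base=2):
    """Adaptive choice J* = argmax_J J * lambda_1(M_J), as in (4.15),
    evaluated over an exponential grid to limit the computational cost."""
    N = X.shape[0]
    grid = []
    J = 1
    while J <= N:
        grid.append(J)
        J *= base
    scores = {J: J * top_eig_MJ(X, Y, J, alpha) for J in grid}
    return max(scores, key=scores.get)
```

For monotone links the score Jλ 1 (M J ) keeps growing until the level sets become too sparse, so J * typically increases with N , consistent with the behavior reported in Figure 3.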

Numerical experiments
Let us examine the results for monotonic functions in more detail. The plots for J * in Figures 3(b)-3(d) show that the number of level-sets J * indeed grows as a function of N , and it does so up to a level dictated by the noise level σ ζ . This shows that λ 1 (M J ) does not decay faster than J −1 over a reasonably large range of N values, i.e. until J −1 ≈ σ ζ . As a consequence, our approach achieves a faster, N −1 estimation rate for the index vector. Specifically, in the noise-free case this holds asymptotically as N → ∞. On the other hand, in the case of corrupted Y 's, we first have a faster N −1 convergence, and then a sharp transition into the usual N −1/2 rate. The number of points N at which this transition occurs depends inversely on the level of noise.

Lemma 19. Let
Applying the Cauchy-Schwarz inequality we obtain the first bound. By the same line of argument we have ‖A 1/2 B 1 u‖ 2 2 = u B 1 AB 1 u, and similarly for B 2 . Notice that since A is positive semidefinite, B 1 AB 1 and B 2 AB 2 are positive semidefinite as well. An analogous expression holds for the other term. Identifying the quadratic form on the left-hand side in (A.2) as the operator norm of B 1 AB 2 , the conclusion follows.
The following is a standard concentration bound for sub-Gaussian and subexponential random vectors around their mean.
Proof. The argument for the two bounds follows along analogous lines. Let δ < 1/4, and let N be a δ-net of S D−1 . We first use [54, Exercise 4.4.3] to rewrite the norm via the net, so that for any v ∈ N the relevant quantity is a sum of either sub-Gaussian or sub-exponential random variables. In the sub-exponential case we have ‖v X i ‖ ψ1 ≤ ‖X‖ ψ1 , and ‖v X i ‖ ψ2 ≤ ‖X‖ ψ2 in the sub-Gaussian case. Hoeffding's inequality [54, Theorem 2.6.2], in the sub-Gaussian case, or Bernstein's inequality [54, Theorem 2.8.1], in the sub-exponential case, now yields concentration bounds for the sums. The claim follows by applying the union bound over v ∈ N , where the number of events is bounded by |N | ≤ 12 D , see for instance [54, Corollary 4.2.13].

A.2. Proofs for Section 2
Proof of Lemma 2. Let X̃ = X − EX, and let U A ∈ R d A ×D and U B ∈ R d B ×D be matrices whose rows contain orthonormal bases of Im(AΣ) and Im(BΣ), respectively. Since Σ and Σ̂ are symmetric, and Im(Σ̂) ⊆ Im(Σ), we have Σ̂ − Σ = P Σ (Σ̂ − Σ)P Σ , where P Σ is the orthogonal projection onto Im(Σ). Notice that Σ̃, compared to the empirical covariance Σ̂, uses the true, instead of the empirical, mean of X̃. By Lemma 20 the second term is always of higher order, and the resulting error can be absorbed into a corresponding upper bound for the first term. Thus, in the following we focus on the first term.

We closely follow the proof of [...]. Consider now any pair (x, y) ∈ N × M. Since ⟨AP Σ X̃ i , y⟩ and ⟨BP Σ X̃ i , x⟩ are sub-Gaussian, their product is sub-exponential, and from [54, Lemma 2.7.7] its sub-exponential norm is bounded by σ AB := ‖AP Σ X̃‖ ψ2 ‖BP Σ X̃‖ ψ2 . Since E[Σ̃] = Σ, by Lemma 20 we have the corresponding concentration bound for any u > 0, with m(t) = t ∨ t 2 .
Using the δ-net approximation bound (A.3) we thus obtain the desired bound. The result now follows from ‖AP Σ X̃‖ ψ2 = ‖AX̃‖ ψ2 and ‖BP Σ X̃‖ ψ2 = ‖BX̃‖ ψ2 , and thus σ AB = ‖AX̃‖ ψ2 ‖BX̃‖ ψ2 .
Proof of Lemma 7. We use the shorthand notation λ i := λ i (Σ) and λ̂ i := λ i (Σ̂). By the definition of δ il , note that if i = 1 or l = rank(Σ), the corresponding gap is trivial and no concentration is required. Hence, we can focus on the case i > 1 and l < rank(Σ). Using [48, Proposition 3.13], we obtain that for any u > 0 a relative eigenvalue bound holds with probability at most exp(−u), provided the summands in the sum in (A.6) are increasing, and we thus get

578Ž. Kereta and T. Klock
Thus, (A.6) is implied by the requirement on N in the statement. Similarly, using [48, Proposition 3.10], we get that the analogous bound holds with probability at most exp(−u), provided a corresponding condition on N is met. Combining the two bounds via a union bound, the result is proven.

A.3. Proofs for Section 3
Proof. We prove the statement by contradiction. Assume there exists a nonzero v ∈ ker(Σ̂) ∩ Im(Σ). Since v ∈ Im(Σ) we have v ∈ Im(√Σ † ), implying the existence of a u ∈ S D−1 with v = √Σ † u (possibly after rescaling v). Using the properties of u and v, we get a contradiction.

Proof of Theorem 10. Assume for now ‖√Σ † (Σ − Σ̂)√Σ † ‖ 2 < 1/2, which will be ensured later by concentration arguments. Conditioned on this event, Lemma 21 implies ker(Σ̂) ∩ Im(Σ) = {0}. Since Im(Σ̂) ⊆ Im(Σ) holds almost surely, this implies Im(Σ̂) = Im(Σ) almost surely. Denote now Δ := Σ̂ − Σ. We want to use a von Neumann series argument to express Σ̂ † in terms of Σ † and the corresponding finite sample perturbation. Recall that for general symmetric matrices U and V with Im(U) = Im(V), we have (UV) † = V † U † [50, Corollary 1.2]. Letting P Σ be the orthogonal projection onto Im(Σ), and Q Σ = Id D − P Σ , we can thus write Σ̂ † accordingly, where we used the additivity of the Moore-Penrose inverse for matrices with orthogonal non-trivial eigenspaces in the last equality. Using a von Neumann series expansion for the second term, and √Σ † √Σ † = Σ † in the last equality, then subtracting Σ † and multiplying by A from the right and B from the left, we arrive at the identity (A.9).

The rest of the proof requires applying Lemma 2 multiple times to concentrate Δ when multiplied by different matrices from the left and right. More specifically, since ‖Cov(√Σ † X)‖ 2 ≥ 1, we can assume the regime N > rank(Σ) + u by the assumption on N in Theorem 10. Then, by Lemma 2, the required bounds hold with probability 1 − 4 exp(−u) whenever N > C(rank(Σ) + u)‖√Σ † X̃‖ 4 ψ2 (we used properties of the sub-Gaussian norm in Lemma 18). Using identity (A.9), combining the above bounds and adjusting the uniform constant C, the desired result holds with probability 1 − exp(−u).

proof of Lemma 2 into
The second term is again of higher order with high probability, and can thus be disregarded. For the first term, define the corresponding random matrices; the stated bound then holds with probability at least 1 − exp(−u 2 ). The conclusion follows after adjusting the confidence level u to account for the higher order term and the union bound.

The first steps for establishing the bound for precision matrices are the same as in the proof of Theorem 10. Indeed, the only difference is in the following lines of inequalities: instead of (A.10), we use the just proven covariance bounds. The claim now follows by combining this with (A.9), as in the proof of Theorem 10.

A.4. Proofs for Section 4
We first need a bound for r = Cov(X, Y ), and a concentration result around its finite sample counterpart.

Proof. By Lemma 18 we have Var(v AX) ≲ ‖AX‖ 2 ψ2 and Var(Y ) ≲ ‖Ỹ ‖ 2 ψ2 , which implies the first claim. Define the random variable Z i as the corresponding summand. As in the proof of Lemma 2, by applying Lemma 20 it follows that the second term is of higher order. On the other hand, the first term is an empirical mean of the centered sub-exponential variables AZ i , where we use the centering property of the sub-exponential norm and the bound of the sub-exponential norm by the product of sub-Gaussian norms (Lemma 18). Applying now Lemma 20, the statement follows.
Proof of Lemma 14. We first note that ‖Cov(√Σ † X)‖ 2 ≥ 1, so by choosing C in the statement large enough, we may assume the regime N > rank(Σ) + u in the following. We begin with a bound for ‖P(Σ † r − Σ̂ † r̂)‖ 2 .
Since Qd = 0 we now have the stated identity. Therefore, it suffices to ensure ‖P(b̂ − b)‖ 2 < (1/2)‖b‖ 2 , which would also give b̂ b > 0. To that end, Lemma 14 gives that ‖P(b̂ − b)‖ 2 ≤ (1/2)‖b‖ 2 holds with probability at least 1 − exp(−u) provided N > C(rank(Σ) + u). The result follows by bounding ‖Qb̂‖ 2 through Lemma 14.
Proof of Corollary 16. We first record a preliminary identity. Furthermore, we can use a bounded difference inequality for unbounded spaces, see [31, Theorem 1], to get ‖f (a X) − Ef (a X)‖ ψ2 ≲ L ‖a X‖ ψ2 ≲ √K L σ P , where we used the strict sub-Gaussianity in (2.5) and a Σa = σ 2 P . Therefore, ‖Ỹ ‖ ψ2 ≲ √K L σ P + σ ζ .

Lemma 23.
If Z is sub-Gaussian and E an event with P(E) > 0, then Z|E is also sub-Gaussian.
Proof. Assume without loss of generality that Z ∈ R; the argument for vectors then follows by the definition. We use the characterization of sub-Gaussianity by the moment bound, where we use P(E) ≤ 1 and the sub-Gaussianity of Z in the last inequality.