Spectrally-truncated kernel ridge regression and its free lunch

Kernel ridge regression (KRR) is a well-known and popular nonparametric regression approach with many desirable properties, including minimax rate-optimality in estimating functions that belong to common reproducing kernel Hilbert spaces (RKHS). The approach, however, is computationally intensive for large data sets, due to the need to operate on a dense $n \times n$ kernel matrix, where $n$ is the sample size. Recently, various approximation schemes for solving KRR have been considered, and some analyzed. Some approaches such as Nystr\"{o}m approximation and sketching have been shown to preserve the rate optimality of KRR. In this paper, we consider the simplest approximation, namely, spectrally truncating the kernel matrix to its largest $r<n$ eigenvalues. We derive an exact expression for the maximum risk of this truncated KRR, over the unit ball of the RKHS. This result can be used to study the exact trade-off between the level of spectral truncation and the regularization parameter of the KRR. We show that, as long as the RKHS is infinite-dimensional, there is a threshold on $r$, above which, the spectrally-truncated KRR, surprisingly, outperforms the full KRR in terms of the minimax risk, where the minimum is taken over the regularization parameter. This strengthens the existing results on approximation schemes, by showing that not only one does not lose in terms of the rates, truncation can in fact improve the performance, for all finite samples (above the threshold). In other words, there is nothing to be gained by running the full KRR and one should always truncate. Our proof is elementary and distribution-free, only requiring the noise vector to be isotropic.


Introduction
The general nonparametric regression problem can be stated as y i = f * (x i ) + w i , i = 1, . . . , n, Ew = 0, cov(w) = σ 2 I n (1) where w = (w i ) ∈ R n is a noise vector and f * : X → R is the function of interest to be approximated from the noisy observations {y i }. Here, X is the space to which the covariates {x i } belong. We consider the fixed design regression where the covariates are assumed to be deterministic. The problem has a long history in statistics and machine learning [1,2]. In this paper, we assume that f * belongs to a reproducing kernel Hilbert space (RKHS), denoted as H [3]. Such spaces are characterized by the existence of a reproducing kernel, that is, a positive semidefinite function K : X × X → R that uniquely determines the underlying function space H. RKHSs are very versatile modeling tools and include, for example, Sobolev spaces of smooth functions whose norms are measures of function roughness (the opposite of smoothness) [4]. Throughout, we think of these Sobolev spaces as the concrete examples of H. By assuming an upper bound on the Hilbert norm of f * , we can encode a prior belief that the true data generating function f * has a certain degree of smoothness. Without loss of generality, we assume that f * belongs to the unit ball of the RKHS, that is, A natural estimator is then, the kernel ridge regression (KRR), defined as the solution of the following optimization problem: where λ > 0 is a regularization parameter. It is well-known that this problem can be reduced to a finite-dimensional problem (by an application of the so-called representer theorem) [5]: is the (normalized empirical) kernel matrix. Although (4) has a closed form solution, it involves inverting an n × n dense matrix, with time complexity O(n 3 ), which is prohibitive in practice. Various approximation schemes have been proposed to mitigate the computational costs, including (i) approximating the kernel matrix or (ii) directly approximating the optimization problem (4). Examples of the former are the Nyström approximation, column sampling and their variants [6,7,8,9,10]. An example of the latter is sketching [11,12] where one restricts ω to the subspace ran(S) := {Sα | α ∈ R r }, for some random matrix S ∈ R n×r . It is in fact known that Nyström can be considered a special case of sketching with random standard basis vectors [12]. Sketching, with sufficiently large r, has been shown in [12] to achieve minimax optimal rates over Sobolev spaces, under mild conditions on the sketching matrix S. Similarly, the Nyström approximation has been analyzed in [13,14,15,11,16] and [17], the latter showing minimax rate optimality. In addition to the above, (iii) divide and conquer approaches have been proposed [18], where one solves the problem over subsamples and then aggregate by averaging, with some rate optimality guarantees. Other notable approaches to scaling include (iv) approximating translation-invariant kernel functions via Monte Carlo averages of tensor products of randomized feature maps [19,20] and (v) applying stochastic gradient in the function space [21].
In this paper, we consider the most direct kernel approximation, namely, replacing K by its best rank r approximation (in Frobenius norm). This amounts to truncating the eigenvalue decomposition of K to its top r eigenvalues. We refer to the resulting KRR approximation as the spectrally-truncated KRR (ST-KRR). Although somewhat slower than the Nyström approximation and fast forms of sketching, ST-KRR can be considered an ideal rank-r spectral approximation. By analyzing it, one can also gain insights about approximate SVD truncation approaches such as Nyström or sketching. In addition, practically, it is a very viable solution for moderate size problems. See Appendix A for a discussion of the time complexity of various schemes.
We derive an exact expression for the maximum (empirical) mean-squared error (MSE) of ST-KRR, uniformly over the unit ball of the RKHS. This expression is solely in terms of the eigenvalues {µ i } of the kernel matrix K, the regularization parameter λ, the truncation level r, and the noise level σ 2 . Thus if one has access to {µ i } and the noise level (or estimates of them), one can plot the exact regularization curve (maximum MSE versus λ) for a given truncation level r and sample size n, and determine the optimal value of λ. We also note that since the empirical eigenvalues {µ i } quickly approach those of the integral operator associated with K (as n → ∞) [22], one can use these idealized eigenvalues instead of {µ i } to get an excellent approximation of these regularization curves.
We then show that there is an optimal threshold on r, the truncation level, which we denote as r n such that for all r ≥ r n , the minimal maximum-risk of r-truncated KRR, with minimum taken over the regularization parameter, is strictly better than that of the full KRR whenever µ r+1 > 0. In particular, for infinite-dimensional RKHSs, µ rn+1 > 0, hence truncating at level r n is guaranteed to strictly improve performance. The slower the decay of the eigenvalues, the larger this gap in performance.
We also show how the exact expression for the maximum MSE can be used to easily establish an slightly weaker bound for ST-KRR, similar to those derived in [12] for sketching. We discuss the link between the statistical dimension considered in [12] and the optimal truncation level r n , and show how the same rate-optimality guarantees hold for ST-KRR. (Rate-optimality also follows form the fact that ST-KRR, with proper r, strictly dominates full KRR and the latter is rate-optional. However, we do these calculations to make the comparison easier.) Finally, we illustrate the results with some numerical simulations showing some further surprises. For example, the Gaussian kernel has a much faster eigendecay rate than a Sobolev-1 kernel (exponential versus polynomial decay). Hence, the optimal truncation level r n asymptotically grows much slower for the Gaussian kernel. However, for finite samples, depending on the choice of the Gaussian bandwidth, the exact optimal truncation level, computed numerically, can be larger than that of Sobolev-1.

Preliminaries
Before stating and studying the spectrally-truncated problem in more details, let us make some observations regarding the original KRR problem in (3). For ω ∈ R n , we define the kernel mapping Note that ω → f ω is a linear map from R n → H. This map is the link between the solutions of the two optimization problems (3) and (4): For any optimal solution ω of (4), f ω will be an optimal solution of (3). The link is easy to establish by observing the following two identities: the first of which uses the reproducing property of the kernel: f, K( · , x) H = f (x). We will frequently use this property in the sequel. The proof of the equivalence follows from an argument similar to our discussion of the identifiability below.

Identifiability
Let us first observe that f * in (1) is not (statistically) identifiable. That is, there are multiple functions f * (in fact infinitely many if H is infinite dimensional) for which the vector (y i ) has the exact same distribution. To see this, let Let f ω * be the projection of f * onto L X . (It is always possible to choose at least one such ω * by the definition of projection and since L X is a closed subspace of H.) Given observations (y i ), we can only hope to recover the following equivalence class: where the last line follows since f * − f ω * ∈ L ⊥ X by the property of orthogonal projection (and can be absorbed into g).
We will use f ω * as the representative of the (identifiable) equivalence class of f * . We are interested in measuring functional deviations (e.g., the error in our estimate relative to the true function) in the empirical ℓ 2 norm: The use of this norm is common in the literature of nonparametric regression [23,24]. It is interesting to note that f * − f ω * n = 0, and f ω * H ≤ f * H , since projections are contractive. Thus, recalling (2), f ω * also belongs to the Hilbert unit ball: f ω * ∈ B H . It is in fact easy to see that f ω * has the least Hilbert norm among the members in the equivalence class (i.e., the smoothest version). Thus, without loss of generality, we can identify f * with f ω * . Equivalently, we can assume from the start that f * is of the form f ω * for some ω * ∈ R n . Note that the "no loss of generality" statement holds as long as we are working with the empirical ℓ 2 norm, due to (8).

Main results
Let K = U DU T the eigenvalue decomposition (EVD) of the empirical kernel matrix defined in (4). Here, U ∈ R n×n is an orthogonal matrix and D = diag(µ i ) n i=1 where µ 1 ≥ µ 2 ≥ · · · ≥ µ n ≥ 0 are the eigenvalues of K. We assume for simplicity that µ n > 0, that is, the exact kernel matrix is invertible. Consider the rank r approximation of K, obtained by keeping the top r eigenvalues and truncating the rest to zero, that is, Here, D r = diag(µ 1 , . . . , µ r ) and U r ∈ R n×r collects the first r columns of U . The idea is to solve (4) with K replaced with K, to obtain ω. We then form our functional estimate f by using the (exact) kernel mapping (5).
A minimizer in (9), without the additional condition K ω = K ω, is not unique due to the rank deficiency of K. Thus, we can ask for it to satisfy additional constraints. The equality condition in (10), which can be stated as ω ∈ ker( K − K) can always be satisfied. It is enough to choose ω to be the unique minimizer in ran( K) = ran(U r ), that is, ω = U r α for some α ∈ R r . This is how the estimator is often implemented in practice.
We are interested in the deviation of f from the true function f * in the empirical ℓ 2 norm. More precisely, we are interested in the mean-squared error as the statistical risk: Our main result is an expression for the worst-case risk of f over the unit ball of the RKHS: Theorem 1. Let f = f r,λ be an r-truncated λ-regularized kernel KRR estimator (Definition 1) applied to input y generated from model (1). Let Then, for all r = 1, 2, . . . , n and λ > 0, with µ n+1 := 0.
The first term in (11) is the worst-case approximation error (WAE) and the second term the estimation error (EE). The approximation error (AE) is the risk (relative to f * ) off which is obtained by passing the noiseless observations (f * (x i )), instead of y, through the estimation procedure. The AE is the deterministic part of the risk and is given by f − f * 2 n . The estimation error is the stochastic part of the risk and is given by E f −f 2 n . The function x → h λ (x) attains its maximum of λ/4, over [0, ∞), at x = λ. Thus, as long as λ ∈ [µ r , µ 1 ], the bound H r (λ) ≤ λ/4 is good. In general, We note that since the KRR estimates are linear in y, Theorem 1 easily gives the maximum MSE expression over the Hilbert ball of arbitrary radius R, by replacing σ 2 in (11) with σ 2 /R 2 and multiplying the entire right-hand side by R 2 . We also have a precise result on the regularized risk of the approximating function: Proposition 1. Letf =f r,λ be obtained by passing the noiseless observations (f * (x i )), instead of y, through the estimation procedure in Definition 1. Then,
In addition, recalling that f n,λ is the full KRR estimator, let MSE( f n,λ , f * ), and r n := r(λ n ).
That is, λ n is the regularization parameter that achieves the minimal maximum-risk for the full KRR. We have the following corollary of Theorem 1: Corollary 1. For every λ > 0, and every r ∈ [n] with r ≥ r(λ), In particular, for every r ≥ r n , Both inequalities are strict whenever µ r+1 > 0.
Corollary 1 shows that λ-optimized f rn,λ strictly improves on optimized full KRR whenever µ rn+1 > 0, in a sense rendering the full KRR inadmissible, as far as the maximum risk over B H is concerned. Note that we are not claiming inadmissibility in the classical sense which requires one estimator to improve on another for all f * ∈ B H . In general, the slower the decay of {µ i }, the more significant the improvement gained by truncation. Note that (14) allows one to set the precise truncation level including the exact constants if one has access to the eigenvalues of the kernel matrix. In practice, for large n, the eigenvalues of the associated kernel integral operator (if available) can act as excellent surrogates for {µ i } [22].

Gaussian complexity and rates
Less precise bounds, albeit good enough to capture the correct asymptotic rate as n → ∞, can be obtained in terms of the Gaussian complexity of the unit ball of the RKHS. These types of results have been obtained for the Sketched-KRR. To make a comparison easier, let us show how such bounds can be obtained from Theorem 1.
Let us define the r-truncated complexity (of the empirical Hilbert ball) as For the case r = n, this matches the definition of the kernel complexity in [12], which we refer to for the related background. In particular, (17) If λ ≥ µ 1 , one can replace the first term with µ 1 λ 2 /(λ + µ 1 ) 2 for a better bound.
The latter upper bound is what one would get for the full KRR. Matching the two terms in that bound, we chooses δ n such that δ 2 n = 2R n (δ n ) which gives the well-known critical radius for the KRR problem [24]. It is known that δ n gives the optimal rate of convergence for estimating functions in B H , i.e., its rate of decay matches that of the minimax risk [12]. The above argument shows that as long as r is taken large enough so that 4µ r+1 ≤ δ 2 n , the r-truncated KRR achieves (at least) the same rate as the full KRR. (For the sketching, the same conclusion is established in [12], where the smallest r satisfying µ r ≤ δ 2 n is referred to as the statistical dimension of the kernel.) For Sobolev-α kernels, with eigendecay µ i ≍ i −2α , we obtain MSE δ 2 n ≍ (σ 2 /n) − 2α 2α+1 . Interestingly, in this case, the estimate based on the weaker bound (18) and the exact bound (11) give the same rate (cf. Appendix C). This is expected since the given rate is known to be minimax optimal for Sobolev spaces. The same goes for the Gaussian kernel for which µ j ≍ e −cj log j and the rate is γ log(1/γ) for γ = σ 2 /n.
Order-wise, δ 2 n will be the same as λ n defined in (14), that is λ n ≍ δ 2 n , whenever δ 2 n matches the optimal rate. Hence, often µ 1 > λ n ≫ µ n for large n and the argument leading to (12) suggests that in this case H n (λ n ) ≈ λ n /4. Then, For Sobolev-α kernels, this suggests truncation level r n (σ 2 /n) 1 2α+1 which gives moderate savings for high smoothness levels α. Similarly, for the Gaussian kernel, it is not hard to see that truncating to r n log(n/σ 2 ) is enough to get the same rate as the full KRR, which is a substantial saving.

Simulations
We now present some numerical experiments to corroborate the theory. We consider a Gaussian kernel K(s, t) = e −(u−v) 2 /2b 2 of bandwidth b = 0.1 on [−1, 1], as well as the Sobolev-1 kernel K(s, t) = min(s, t) on [0, 1]. We take the covariates {x i } to be n = 200 equi-spaced points in each interval. The top row of Figure 1 shows the plot of the theoretical maximum MSE as given by Theorem 1 for the two kernels, for both the full KRR (r = n), and the optimally truncated version (r = r n ). We have used σ = 2 in (11). As predicted by Theorem 1, the minimum achievable maximum MSE is smaller for the truncated KRR.
To compute the optimal truncation, we have evaluated the regularization curve of the full KRR first, obtained the minimizer λ n and then used (14) to compute the optimal truncation level r n . For the setup of the simulation, we get r n = 10 for the Gaussian and r n = 3 for the Sobolev-1. It is interesting to note that although in terms of rates, r n for the Gaussian should be asymptotically much smaller than that of Sobolev-1, in finite samples, the truncation level for the Guassian could be bigger as can be seen here. This is due to the unspecified, potentially large, constants in the rates (that depend on the bandwidth b as well). Also, notice how surprisingly small r n is relative to n in both cases. The bottom row of Figure 1 shows the empirical MSE obtained for a typical random f * ∈ B H , by computing the KRR estimates for observation y and comparing with f * . The random true function is generated as f * = f ω * where ω * ∼ N (0, I n ) and further normalized so that (ω * ) T Kω * = 1. We have generated n = 200 observations from (1) with σ = 2. The plots were obtained using 1000 replications. The truncation levels are those calculated based on the maximum MSE formula (11). The plots show that for a typical application, the truncated KRR also dominates the full KRR.

Proof of the main result
Here we give the proof of Theorem 1. The remaining proofs can be found in Appendix B.
The u-transform. From the discussion in Section 2.1, both the KRR estimate and the true function belong to L X given in (7). It is then useful to have an expression for the empirical ℓ 2 error of functions belonging to this space. First, we observe that f ω 2 n = 1 n n i=1 [f ω (x i )] 2 = Kω 2 . Now, take any ω, ω * ∈ R d , and let u = Kω and u * = Kω * . Then, we have where the fist equality is by the linearity of ω → f ω . For any function f ω ∈ L X , we call u = Kω the u-space representation of f ω . Identity (19) shows that it is often easier to work in the u-space since the u-transform turns empirical ℓ 2 norms on functions into the usual ℓ 2 norms on vectors. In other words, the map f ω → u, is a Hilbert space isometry from (L X , · n ) to (R n , · ). In the u-space, the KRR optimization problem can be equivalently stated as: where K + is the pseudo inverse of K, and ran(K) its range. More precisely: Lemma 1. For any K ∈ R n×n , problems (4) and (20) are equivalent in the following sense: -For any minimizerω of (4), Kω is a minimizer of (20), and -for any minimizerū of (20), anyω ∈ {ω : Kω =ū} is a minimizer of (4).
It is often the case that the kernel matrix itself is invertible, in which case K + = K −1 , ran(K) = R n and problem (20) simplifies. However, the equivalence in Lemma 1 holds even if we replace K with an approximation which is rank deficient. This observation will be useful in the sequel.
Controlling the approximation error: Recall that we are interested in the worst-case approximation error (WAE) over the unit ball of the Hilbert space, i.e., over f * ∈ B H . Also, recall that without loss of generality, we can take f * = f ω * . Hence, where the second equality is from (6), and the latter two are by definitions of u * and v * = U T u * . We obtain WAE r,λ = sup A further change of variable v * = D 1/2 v gives where · , applied to matrices, is the ℓ 2 → ℓ 2 operator norm. Note that Γ λ is a diagonal matrix with diagonal elements, µ i /(µ i + λ) for i = 1, . . . , r followed by n − r zeros. It follows that (I − Γ λ )D 1/2 is diagonal with diagonal elements: . . , r, √ µ i , i = r + 1, . . . , n.