From Gauss to Kolmogorov: Localized Measures of Complexity for Ellipses

The Gaussian width is a fundamental quantity in probability, statistics and geometry, known to underlie the intrinsic difficulty of estimation and hypothesis testing. In this work, we show how the Gaussian width, when localized to any given point of an ellipse, can be controlled by the Kolmogorov width of a set similarly localized. This connection leads to an explicit characterization of the estimation error of least-squares regression as a function of the true regression vector within the ellipse. The rate of error decay varies substantially as a function of location: as a concrete example, in Sobolev ellipses of smoothness $\alpha$, we exhibit rates that vary from $(\sigma^2)^{\frac{2 \alpha}{2 \alpha + 1}}$, corresponding to the classical global rate, to the faster rate $(\sigma^2)^{\frac{4 \alpha}{4 \alpha + 1}}$. We also show how the local Kolmogorov width can be related to local metric entropy.


Introduction
The Gaussian width is an important measure of the complexity of a set, and it plays an important role in geometry, statistics and probability theory. Most relevant to this paper is its central role in empirical process theory, where the Gaussian width and its Bernoulli analogue (known as the Rademacher width) can be used to upper bound the error for various types of non-parametric estimators [26,27,3,15,5]. More recently, these same complexity measures have also been shown to play an important role in high-dimensional testing problems [31].
For a general set, it is non-trivial to provide analytical expressions for its Gaussian or Rademacher widths. There are a variety of techniques for obtaining bounds, including upper bounds via the classical entropy integral of Dudley, as well as lower bounds due to Sudakov-Fernique (see the book [16] for details on these and other results). More recently, Talagrand [22] has introduced a generic chaining technique that leads to sharp lower and upper bounds. However, for a general set, it is impossible to evaluate the expressions obtained from the generic chaining, and so for applications in statistics, it is of considerable interest to develop techniques that yield tractable characterizations of various forms of widths.
In this paper, we study a class of Gaussian widths that arise in the context of estimation over (possibly infinite-dimensional) ellipses. As we describe below, many non-parametric problems, among them regression and density estimation over classes of smooth functions, can be reduced to such ellipse estimation problems. Obtaining sharp rates for such estimation problems requires studying a localized notion of Gaussian width, in which the ellipse is intersected with a Euclidean ball around the element θ* being estimated. The main technical contribution of this paper is to show how this localized Gaussian width can be bounded, from both above and below, using a localized form of the Kolmogorov width [19]. As we show with a number of corollaries, this Kolmogorov width can be calculated in many interesting examples.
Our work makes a connection to the evolving line of work on instance-specific rates in estimation and testing. Within the decision-theoretic framework, the classical approach is to study the (global) minimax risk over a certain problem class. In this framework, methods are compared via their worst-case behavior as measured by performance over the entire problem class. For the ellipse problems considered here, global minimax risks in various norms are well-understood; for instance, see the classic papers [20,11,12]. When the risk function is near to constant over the set, then the global minimax risk is reflective of the typical behavior. If not, then one is motivated to seek more refined ways of characterizing the hardness of different problems, and the performance of different estimators.
One way of doing so is by studying the notion of an adaptive estimator, meaning one whose performance automatically adapts to some (unknown) property of the underlying function being estimated. For instance, estimators using wavelet bases are known to be adaptive to unknown degree of smoothness [7,8]. Similarly, in the context of shape-constrained problems, there is a line of work showing that for functions with simpler structure, it is possible to achieve faster rates than the global minimax ones (e.g. [18,33,6]). A related line of work, including some of our own, has studied adaptivity in the context of hypothesis testing (e.g., [25,2,30]). The adaptive estimation rates established in this work also share this spirit of being instance-specific.

Some motivating examples
A primary motivation for our work is to understand the behavior of least-squares estimators over ellipses. Accordingly, let us give a precise definition of the ellipse estimation problem, along with some motivating examples.
Given a fixed integer d and a sequence of non-negative scalars µ₁ ≥ … ≥ µ_d ≥ 0, we can define an elliptical norm on ℝ^d via $\|\theta\|_E^2 := \sum_{j=1}^d \theta_j^2/\mu_j$. Here for any coefficient µ_k = 0, we interpret the constraint as enforcing that θ_k = 0. For any radius R > 0, this semi-norm defines an ellipse of the form
$$E(R) := \{\theta \in \mathbb{R}^d \mid \|\theta\|_E \le R\}. \quad (1)$$
We frequently focus on the case R = 1, in which case we adopt the shorthand notation E for the set E(1). Whereas equation (1) defines a finite-dimensional ellipse, it should be noted that our theory also applies to infinite-dimensional ellipses for sequences {µ_j}_{j=1}^∞ that are summable. Such results can be recovered by studying a truncated version of the ellipse with finite dimension d, and then taking suitable limits. In order to simplify the exposition, we develop our results with finite d, noting how they extend to infinite dimensions after stating our results.
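To make these definitions concrete, here is a minimal sketch (our own helper names and toy values) of the elliptical semi-norm and the membership test, including the convention that µ_k = 0 forces θ_k = 0:

```python
import numpy as np

def ellipse_norm_sq(theta, mu):
    """Squared elliptical semi-norm ||theta||_E^2 = sum_j theta_j^2 / mu_j.

    By convention, a coordinate with mu_j == 0 forces theta_j == 0;
    any violation is reported as an infinite norm."""
    theta, mu = np.asarray(theta, float), np.asarray(mu, float)
    zero = (mu == 0)
    if np.any(theta[zero] != 0):
        return float("inf")
    return float(np.sum(theta[~zero] ** 2 / mu[~zero]))

def in_ellipse(theta, mu, R=1.0):
    """Membership test for the ellipse E(R) = {theta : ||theta||_E <= R}."""
    return ellipse_norm_sq(theta, mu) <= R ** 2

# the point (2, 0) lies exactly on the boundary of the ellipse with mu = (4, 1)
assert in_ellipse([2.0, 0.0], [4.0, 1.0])
assert not in_ellipse([2.0, 1.0], [4.0, 1.0])
```

Here the aspect ratios µ_j play the role of squared semi-axis lengths: the ellipse extends to ±√µ_j along the j-th coordinate axis.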
Suppose that for some unknown vector θ* ∈ E, we make noisy observations of the form
$$y = \theta^* + \sigma w, \quad \text{where } w \sim N(0, I_d). \quad (2)$$
We assume that the ellipse E and the noise standard deviation σ are known. The goal of ellipse estimation is to specify a mapping $y \mapsto \hat{\theta}(y)$ such that the associated Euclidean risk $\mathbb{E}_y \|\hat{\theta}(y) - \theta^*\|_2^2$ is as small as possible.
Let us consider some concrete problems that can be reduced to instances of ellipse estimation.
Example 1 (Linear prediction with correlated designs). Suppose that we make observations from the standard linear model $y = X\beta^* + \nu w$, where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is a (fixed, non-random) design matrix, ν > 0 is the noise level, and $w \sim N(0, I_n)$ is noise. Suppose moreover that we know a priori that $\|\beta^*\|_2 \le R$ for some radius R > 0. Alternatively, we can think of a condition of this form arising implicitly when using estimators such as ridge regression.

Figure 1 (Varying hardness over the ellipse). Illustration of the ellipse estimation problem. The goal is to estimate an unknown vector θ* belonging to an ellipse based on noisy observations. The local geometry of the ellipse controls the difficulty of the problem: due to its proximity to the narrow end of the ellipse, the vector $\theta^*_E$ is relatively easy to estimate. By contrast, the vector $\theta^*_H$ should be harder, since it lies closest to the center of the ellipse. The theory given in this paper confirms this intuition; see Section 4 for details.
Given an estimate $\hat{\beta}$, its prediction accuracy can be assessed via the mean-squared error $\mathbb{E}\,\frac{1}{n}\|X\hat{\beta} - X\beta^*\|_2^2$, where the expectation is taken over the observation noise. Equivalently, letting $\hat{\theta} = X\hat{\beta}/\sqrt{n}$ and $\theta^* = X\beta^*/\sqrt{n}$, our problem is to minimize the mean-squared error $\mathbb{E}\|\hat{\theta} - \theta^*\|_2^2$. After this transformation, we arrive at the observation model $y = \theta^* + \frac{\nu}{\sqrt{n}}\, w$, which is a version of our original model (2) with d = n and $\sigma = \nu/\sqrt{n}$. Moreover, the constraint on the $\ell_2$-norm of β* translates into an ellipse constraint on θ*. In particular, the ellipse is determined by the non-zero eigenvalues of the matrix $\frac{1}{n}XX^\top \in \mathbb{R}^{n \times n}$. As shown in Figure 1, it is natural to conjecture that the location of θ* within this ellipse affects the difficulty of estimation. Note that $\mathbb{E}\|y - \theta^*\|_2^2 = \sigma^2 d = \nu^2$, so that on average, the observed vector y lies at squared Euclidean distance ν² from the true vector. In certain favorable cases, such as a vector $\theta^*_E$ that lies at or close to the boundary of an elongated side of the ellipse, the side-knowledge that θ* ∈ E is helpful. In other cases, such as a vector $\theta^*_H$ that lies closer to the center of the ellipse, the elliptical constraint is less helpful. The theory to be developed in this paper makes this intuition precise. In particular, Section 4 is devoted to a number of consequences of our main results for the problem of estimation in ellipses.
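As a numerical sanity check on this reduction (dimensions and data are illustrative, not from the paper), one can verify that θ* = Xβ*/√n always satisfies the elliptical constraint defined by the eigenvalues of (1/n)XXᵀ whenever ‖β*‖₂ ≤ R:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, R = 50, 20, 1.0
X = rng.normal(size=(n, p))                  # fixed design matrix
beta_star = rng.normal(size=p)
beta_star *= R / np.linalg.norm(beta_star)   # enforce ||beta*||_2 = R

theta_star = X @ beta_star / np.sqrt(n)      # rescaled mean vector
S = X @ X.T / n                              # its eigenvalues define the ellipse
mu, U = np.linalg.eigh(S)                    # ascending eigenvalues
coords = U.T @ theta_star                    # theta* in the eigenbasis of S

nz = mu > 1e-10                              # ignore numerically zero directions
assert np.all(np.abs(coords[~nz]) < 1e-8)    # theta* carries no mass off col(X)
norm_sq = float(np.sum(coords[nz] ** 2 / mu[nz]))
assert norm_sq <= R ** 2 + 1e-8              # elliptical constraint holds
print(f"||theta*||_E^2 = {norm_sq:.6f}, R^2 = {R ** 2}")
```

In this generic full-rank example (n > p, Gaussian design) the constraint is met with equality, since the elliptical norm of θ* reduces to the norm of the projection of β* onto the row space of X.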
Example 2 (Non-parametric regression using reproducing kernels). We now turn to a class of non-parametric problems that involve a form of ellipse estimation. Suppose that our goal is to predict a response z ∈ ℝ based on observing a collection of predictors x ∈ X. Assuming that pairs (X, Z) are drawn jointly from some unknown distribution ℙ, the optimal prediction in terms of mean-squared error is given by the conditional expectation $f^*(x) = \mathbb{E}[Z \mid X = x]$; the goal of non-parametric regression is to produce an estimate $\hat{f}$ that is as close to f* as possible.

Figure 2. Illustration of the kernel eigenvalues $\{\mu_j\}_{j=1}^n$ for kernel matrices K generated from the kernel functions in part (a). Each log-log plot shows the eigenvalue versus the index: note how the Gaussian kernel eigenvalues decay at an exponential rate, whereas those of the Sobolev-one spline kernel decay at a polynomial rate.
Assuming that the samples are i.i.d., we can rewrite our observations in the form
$$z_i = f^*(x_i) + \gamma v_i, \quad i = 1, \ldots, n, \quad (3)$$
where the $v_i$ form an independent sequence of zero-mean noise variables with unit variance, and γ > 0 is the noise level. A computationally attractive way of estimating f* is to perform least-squares regression over a reproducing kernel Hilbert space, or RKHS for short [1,13,10,28]. Any such function class is defined by a symmetric, positive definite kernel function $\mathcal{K} : X \times X \to \mathbb{R}$; standard examples include the Gaussian kernel, Laplace kernel, and the Sobolev (spline) kernels; see Figure 2 for some illustrative examples. Now suppose that f* belongs to the RKHS induced by the kernel $\mathcal{K}$, say with Hilbert norm $\|f^*\|_{\mathcal{H}} \le R$. In this case, the representer theorem [13] implies that the observation model (3) is equivalent to
$$\tilde{y} = K\alpha^* + \frac{\gamma}{\sqrt{n}}\, v, \quad \text{where } \tilde{y} := z/\sqrt{n},$$
$K \in \mathbb{R}^{n \times n}$ is the n × n kernel matrix with entries $K_{ij} = \mathcal{K}(x_i, x_j)/n$ for each i, j = 1, …, n, and v is the n-dimensional vector formed by the $v_i$. The representer theorem and our choice of scaling ensure that $\|f^*\|_{\mathcal{H}}^2 = (\alpha^*)^\top K \alpha^*$, meaning that α* belongs to the ellipse of radius R defined by the symmetric and PSD kernel matrix K.
Note that the matrix K can be diagonalized as $K = UDU^\top$, where U is orthonormal, and $D = \mathrm{diag}\{\mu_1, \mu_2, \ldots, \mu_n\}$ is a diagonal matrix of non-negative eigenvalues. Following this transformation (that is, applying $U^\top$ to the observation model above), we arrive at an instance of the standard ellipse model $y = \theta^* + w$, where $\theta^* = U^\top K\alpha^*$ belongs to the standard ellipse (1) defined by the eigenvalues of K. Note that the noise vector $w = \gamma U^\top v/\sqrt{n}$ has zero-mean entries, each with standard deviation $\sigma = \gamma/\sqrt{n}$. The entries of w are not exactly Gaussian (unless the initial noise vector v was jointly Gaussian), but are often well-approximated by Gaussian variables due to central limit behavior for large n.
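The following sketch (our own construction, using the first-order spline kernel K(x, x′) = min{x, x′} on uniform random design points) illustrates this reduction: the vector θ* = UᵀKα* lies exactly on the ellipse defined by the eigenvalues of K with radius ‖f*‖_H, and those eigenvalues decay at the polynomial rate shown in Figure 2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = np.sort(rng.uniform(size=n))
K = np.minimum.outer(x, x) / n        # first-order Sobolev (spline) kernel matrix, scaled by 1/n
mu, U = np.linalg.eigh(K)
mu, U = mu[::-1], U[:, ::-1]          # eigenvalues in decreasing order

alpha_star = rng.normal(size=n)
R_sq = float(alpha_star @ K @ alpha_star)   # = ||f*||_H^2 under the paper's scaling
theta_star = U.T @ K @ alpha_star           # coordinates of theta* in the eigenbasis

# theta* lies exactly on the ellipse of radius sqrt(R_sq) defined by the mu_j
norm_sq = float(np.sum(theta_star ** 2 / mu))
print(norm_sq, R_sq)

# polynomial eigenvalue decay: log-log slope close to -2 for this kernel
j = np.arange(3, 30)
slope = np.polyfit(np.log(j + 1), np.log(mu[j]), 1)[0]
print(f"decay exponent ~ {slope:.2f}")
```

The identity ‖θ*‖²_E = (α*)ᵀKα* holds exactly here because θ* = DUᵀα* in the eigenbasis of K; the decay exponent near −2 is what the polynomial-decay examples below specialize to with α = 1.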

Organization
The remainder of this paper is organized as follows. In Section 2, we introduce some background on approximation-theoretic quantities, including the Gaussian width, metric entropy, and the Kolmogorov width. Section 3 is devoted to the statement of our main results, while Section 4 develops a number of their specific consequences for ellipse estimation. In Section 5, we provide the proofs of our main results, with more technical aspects of the arguments provided in the appendices.

Background
Before proceeding to the statements of our main results, we introduce some background on the notions of Gaussian width and Kolmogorov width, as well as the setup of the ellipse-constrained estimation problem.

Gaussian width
Given a bounded subset S ⊂ ℝ^d, the Gaussian width of S is defined as
$$\mathcal{G}(S) := \mathbb{E}\Big[\sup_{\theta \in S} \langle w, \theta \rangle\Big], \quad \text{where } w \sim N(0, I_d).$$
It measures the size of the set S in a certain sense.
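For instance, for the Euclidean ball B(δ) in ℝ^d, the supremum is attained at θ = δw/‖w‖₂, so the Gaussian width equals δ E‖w‖₂ ≈ δ√d; this is easy to confirm by Monte Carlo (illustrative sizes, our own naming):

```python
import numpy as np

rng = np.random.default_rng(5)
d, delta = 200, 0.5
w = rng.normal(size=(2000, d))
# sup over B(delta) of <w, theta> equals delta * ||w||_2 for each draw of w
width = delta * float(np.mean(np.linalg.norm(w, axis=1)))
print(width, delta * np.sqrt(d))   # agree up to a 1 - o(1) factor
```

This closed-form case reappears in Example 3 below, where it is used to check the sharpness of the general upper bound.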
It is also useful to define the classical notions of packing and covering entropy. An ε-cover of a set S with respect to the ‖·‖₂ metric is a discrete set {θ¹, …, θ^N} ⊂ S such that for each θ ∈ S, there exists some i ∈ {1, …, N} satisfying ‖θ − θⁱ‖₂ ≤ ε. The ε-covering number N(ε, S) is the cardinality of the smallest ε-cover, and its logarithm log N(ε, S) is called the covering metric entropy of the set S.
Similarly, an ε-packing of a set S is a set {θ¹, …, θ^M} ⊂ S satisfying ‖θⁱ − θʲ‖₂ > ε for all i ≠ j. The size of the largest such packing is called the ε-packing number of S, which we denote by M(ε, S). It is related to the (covering) metric entropy by the inequalities
$$M(2\epsilon, S) \le N(\epsilon, S) \le M(\epsilon, S).$$
For this reason, we use the term metric entropy to refer to either the covering or packing metric entropy, since they differ only by constant factors in the scale ε.
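These inequalities can be explored empirically: a greedily built maximal ε-packing is automatically an ε-cover, which is exactly the mechanism behind N(ε, S) ≤ M(ε, S). A small sketch on a discretized two-dimensional ellipse (our own toy construction):

```python
import numpy as np

def greedy_packing(points, eps):
    """Greedily select a maximal eps-separated subset of a finite point set.
    By maximality, the selected centers also form an eps-cover of the set."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > eps for c in centers):
            centers.append(p)
    return np.array(centers)

rng = np.random.default_rng(2)
# discretize the ellipse x^2/4 + y^2 <= 1 by rejection sampling
pts = rng.uniform(-1, 1, size=(4000, 2)) * np.array([2.0, 1.0])
pts = pts[pts[:, 0] ** 2 / 4 + pts[:, 1] ** 2 <= 1]

eps = 0.3
pack = greedy_packing(pts, eps)        # an eps-packing that is also an eps-cover
pack2 = greedy_packing(pts, 2 * eps)   # a 2*eps-packing

# every point is within eps of some center (cover property of a maximal packing)
dists = np.min(np.linalg.norm(pts[:, None, :] - pack[None, :, :], axis=2), axis=1)
assert dists.max() <= eps
# consistent with M(2*eps) <= N(eps) <= M(eps)
assert len(pack2) <= len(pack)
```

The second assertion reflects that any two 2ε-separated points cannot share an ε-ball, so a 2ε-packing can never exceed the size of an ε-cover.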
The connection between Gaussian width and metric entropy is well-studied (e.g. [9,23,29]). For our future discussion, we collect a few results here as reference. First, Dudley's entropy integral [9] provides an upper bound for the Gaussian width, namely
$$\mathcal{G}(S) \le c \int_0^\infty \sqrt{\log N(\epsilon, S)}\, d\epsilon$$
for some universal constant c > 0. This upper bound also holds for more general sub-Gaussian processes. Dudley's bound can be much looser than the more refined bounds obtained through Talagrand's generic chaining, which are tight up to a universal constant [23, Thm. 2.4.1]. For Gaussian processes like ours, Sudakov minoration (e.g., [4, Thm. 13.4]) provides a lower bound on the Gaussian width:
$$\mathcal{G}(S) \ge c' \sup_{\epsilon > 0} \epsilon \sqrt{\log M(\epsilon, S)}. \quad (4)$$
Although we do not directly use this lower bound when proving our main lower bound (Theorem 2) below, we follow its spirit by constructing a large collection of well-separated points.

Kolmogorov width
In this section, we define the Kolmogorov width and briefly review its properties. This geometric quantity plays a central role in our main results.
For a given compact set S ⊂ ℝ^d and integer k ∈ [d], the Kolmogorov k-width of S is given by
$$W_k(S) := \min_{\Pi_k \in \mathcal{P}_k} \max_{\theta \in S} \|\theta - \Pi_k \theta\|_2, \quad (5)$$
where $\mathcal{P}_k$ denotes the set of all k-dimensional orthogonal linear projections, and $\Pi_k\theta$ denotes the projection of θ onto the corresponding k-dimensional linear space. Any projection Π_k achieving the minimum in expression (5) is said to be an optimal projection for $W_k(S)$. Note that the Kolmogorov width $W_k(S)$ is a non-increasing function of k, meaning that $W_1(S) \ge W_2(S) \ge \cdots \ge W_d(S) = 0$. We refer the reader to the book by Pinkus [19] for more details on the Kolmogorov width and its properties.
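For an axis-aligned ellipse, the k-width is available in closed form: classical results (see Pinkus [19]) show that an optimal projection keeps the k longest semi-axes, leaving residual √µ_{k+1}. The following check (our own helper names) confirms numerically that the coordinate projection attains this value over random boundary points:

```python
import numpy as np

def kolmogorov_width_ellipse(mu, k):
    """W_k of the axis-aligned ellipse {theta : sum_j theta_j^2/mu_j <= 1}:
    projecting onto the k longest semi-axes leaves residual sqrt(mu_{k+1})."""
    mu = np.sort(np.asarray(mu, float))[::-1]
    return float(np.sqrt(mu[k])) if k < len(mu) else 0.0

mu = np.array([1.0, 0.5, 0.25, 0.1])
k = 2
rng = np.random.default_rng(3)
# worst-case residual of the coordinate projection over random boundary points
best = 0.0
for _ in range(5000):
    v = rng.normal(size=4)
    theta = np.sqrt(mu) * v / np.linalg.norm(v)   # a point on the ellipse boundary
    best = max(best, float(np.linalg.norm(theta[k:])))
w = kolmogorov_width_ellipse(mu, k)
assert best <= w + 1e-12
# the residual sqrt(mu_{k+1}) is attained at the boundary point sqrt(mu_{k+1}) e_{k+1}
assert np.isclose(np.linalg.norm((np.sqrt(mu[k]) * np.eye(4)[k])[k:]), w)
```

The random search only certifies the upper bound for this particular projection; the extremal point on the (k+1)-st axis shows the bound is attained.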

Main results
Let us first define the notion of localized Gaussian width formally, and then turn to the statement of our main results.

Localized Gaussian width
Let B(δ) denote the Euclidean ball of radius δ, and for a given vector θ* ∈ E, define the shifted ellipse $E_{\theta^*} := \{\theta - \theta^* \mid \theta \in E\}$. The localized Gaussian width at θ* and scale δ is defined as
$$\mathcal{G}_{\theta^*}(\delta) := \mathbb{E}\Big[\sup_{\Delta \in E_{\theta^*} \cap B(\delta)} \langle w, \Delta \rangle\Big], \quad \text{where } w \sim N(0, I_d). \quad (6)$$
Note that this quantity is simply the ordinary Gaussian width of the set $E_{\theta^*} \cap B(\delta)$, and we say that it is localized since the Euclidean ball restricts it to a neighborhood of θ*. See Figure 3 for an illustration of this set. We note that localized forms of Gaussian and Rademacher complexity are standard in the literature on empirical processes (e.g., [3,14]), where it is known that they are needed to obtain sharp rates. In the case of least-squares estimation over convex sets, there is an extremely explicit connection between the localized Gaussian width and the associated estimation error [26,5,29]; we describe this relationship in more detail in Section 4 and Appendix D.
Our main results, to be stated in the following subsections, provide conditions under which the localized Gaussian width (6) can be sharply characterized in terms of the Kolmogorov width.

Upper bound on the localized Gaussian width
In order to state our first main result, we introduce an approximation-theoretic quantity having to do with the quality of a given k-dimensional projection. For a given integer k ∈ {1, …, d} and any k-dimensional linear projection Π_k, let us define the set
$$\Gamma(\theta^*, \delta, \Pi_k) := \Big\{\gamma \in \mathbb{R}^d,\ \gamma > 0 \ \Big|\ \sum_{i=1}^d \big(\Delta_i - (\Pi_k \Delta)_i\big)^2/\gamma_i \le 1 \ \text{for all } \Delta \in E_{\theta^*} \cap B(\delta)\Big\}.$$
Here γ > 0 means that γ_i > 0 for each coordinate i = 1, …, d. It can be verified that the set Γ(θ*, δ, Π_k) is always non-empty, since the constant vector $\gamma = \sqrt{\mu_1}\,\delta\,\mathbf{1}$ always belongs to it. (Here 1 denotes the vector of all ones.) To provide some intuition for this definition, the vector ∆ − Π_k(∆) corresponds to the error incurred by using the subspace associated with Π_k to approximate ∆. The positive vector γ ∈ ℝ^d allows us to weight the entries of this error vector in computing the Euclidean norm of the weighted error.
We are now ready to state an upper bound on the localized Gaussian width.

Theorem 1. Given any δ > 0, projection tuple (k, Π_k), and vector θ* ∈ E, we have
$$\mathcal{G}_{\theta^*}(\delta) \le \delta\sqrt{k} + \sqrt{\inf_{\gamma \in \Gamma(\theta^*, \delta, \Pi_k)} \sum_{i=1}^d \gamma_i}. \quad (7)$$

See Section 5.1 for the proof of this result.
Note that Theorem 1 holds for any dimension and projection pair (k, Π_k). Often, we can choose a specific pair for which the set Γ(θ*, δ, Π_k) is easy to characterize. In particular, given any fixed δ > 0, let us define the critical dimension
$$k^*(\theta^*, \delta) := \min\Big\{k \in [d] \ \Big|\ W_k\big(E_{\theta^*} \cap B(\delta)\big) \le (1 - \eta)\,\delta\Big\} \quad (8)$$
for some constant η ∈ (0, 0.1). In words, this integer is the minimal dimension for which there exists a k*-dimensional projection that approximates a neighborhood of the re-centered ellipse to (1 − η)δ-accuracy, which is at least 9δ/10-accuracy. Although our notation does not explicitly reflect it, note that k*(θ*, δ) also depends on the ellipse E.
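At θ* = 0 and for an axis-aligned ellipse, the critical dimension can be computed explicitly: with the coordinate projection one has W_k(E ∩ B(δ)) = min{√µ_{k+1}, δ}, so k*(0, δ) simply counts the semi-axes √µ_j exceeding (1 − η)δ. A sketch (assuming this coordinate-projection formula; all names are ours):

```python
import numpy as np

def critical_dim(mu, delta, eta=0.1):
    """k*(0, delta) for an axis-aligned ellipse: with the coordinate projection,
    W_k(E ∩ B(delta)) = min(sqrt(mu_{k+1}), delta), so k* counts the semi-axes
    sqrt(mu_j) that exceed (1 - eta) * delta."""
    mu = np.sort(np.asarray(mu, float))[::-1]
    return int(np.sum(np.sqrt(mu) > (1 - eta) * delta))

# Sobolev-type ellipse mu_j = j^(-2 alpha): expect k*(0, delta) ~ delta^(-1/alpha)
alpha, d = 1.0, 100000
mu = np.arange(1, d + 1, dtype=float) ** (-2 * alpha)
for delta in (0.1, 0.05, 0.025):
    print(delta, critical_dim(mu, delta))   # halving delta doubles k* when alpha = 1
```

For α = 1, halving δ doubles k*(0, δ), consistent with the δ^{−1/α} scaling derived in Example 4 below.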
Given the integer k* ≡ k*(θ*, δ), we let $\Pi_{k^*} \in \mathcal{P}_{k^*}$ denote the minimizing projection in the definition (5) of the width, and note that for any vector ∆, the error associated with this projection is given by $\Delta - \Pi_{k^*}(\Delta)$. As can be seen in our later examples, this particular choice $(k^*, \Pi_{k^*})$ often yields tight control of the localized Gaussian width. So as to streamline notation, we adopt Γ(θ*, δ) as shorthand for $\Gamma(\theta^*, \delta, \Pi_{k^*})$.
Regularity assumption: For many ellipses encountered in practice, the first term in the upper bound (7) dominates the second term involving the set Γ. In order to capture this condition, we say the ellipse E is regular at θ* if there exists some pair (k, Π_k) such that
$$\inf_{\gamma \in \Gamma(\theta^*, \delta, \Pi_k)} \sum_{i=1}^d \gamma_i \le c\, \delta^2 k. \quad (9)$$
Here c < ∞ is any universal constant. When this condition holds, Theorem 1 implies the existence of another universal constant c′ such that $\mathcal{G}_{\theta^*}(\delta) \le c'\,\delta\sqrt{k}$. As is shown in Appendix A, the regularity condition (9) is a generalization of a condition previously introduced by Yang et al. [32] in the context of kernel ridge regression, and it holds for many examples encountered in practice.
As a direct consequence of Theorem 1, the following corollary holds.
Corollary 1. If the regularity assumption (9) is satisfied with dimension and projection pair (k*, Π_{k*}), then the localized Gaussian width satisfies $\mathcal{G}_{\theta^*}(\delta) \le c\,\delta\sqrt{k^*(\theta^*, \delta)}$.

Let us illustrate the regularity condition (9) and associated consequences of Theorem 1 with some examples.
Example 3 (Gaussian width of the Euclidean ball). We begin with a simple example: suppose that the ellipse E is the Euclidean ball in ℝ^d, specified by the aspect ratios µ_j = 1 for all j = 1, …, d, and let us use Theorem 1 to upper bound the Gaussian width at θ* = 0. For δ ∈ (0, 1/(1 − η)) and any integer k < d, we have $W_k(E \cap B(\delta)) = \min\{\delta, 1\} > (1 - \eta)\delta$, so that k*(0, δ) = d. Consequently, the regularity condition (9) certainly holds, so that Theorem 1 implies that $\mathcal{G}_0(\delta) \le c\,\delta\sqrt{d}$. In fact, a direct calculation yields $\mathcal{G}_0(\delta) = \delta\,\mathbb{E}\|w\|_2 = \delta\sqrt{d}\,(1 - o(1))$, where the o(1) term is a quantity tending to zero as d grows (e.g., [29]). Consequently, our bound is asymptotically sharp up to the constant pre-factor in this special case.
We now turn to a second example that arises in non-parametric regression and density estimation under smoothness constraints:

Example 4 (Gaussian width for Sobolev ellipses). Now consider an ellipse E defined by the aspect ratios $\mu_j = c\,j^{-2\alpha}$, where α > 1/2 is a parameter. Ellipses of this form arise when studying non-parametric estimation problems involving functions that are α-times differentiable with Lebesgue-integrable α-th derivative [24]. Let us again use Theorem 1 to upper bound the localized Gaussian width at θ* = 0. From classical results on Kolmogorov widths of ellipses [19] (see also [30, Sec. 4]), the width of the full ellipse scales as $W_k(E) = \sqrt{\mu_{k+1}} \asymp k^{-\alpha}$. Taking into account the intersection with the Euclidean ball, we find that
$$k^*(0, \delta) \asymp (1/\delta)^{1/\alpha}, \quad (10)$$
where the last relation uses the fact that $\mu_j = c\,j^{-2\alpha}$.
This argument also shows that the corresponding projection subspace is spanned by the first k* standard basis vectors $e_1, \ldots, e_{k^*}$. On the other hand, we also have $\delta^2 k^*(0, \delta) \asymp \delta^{2 - 1/\alpha}$, so there exists some constant c′ such that
$$\inf_{\gamma \in \Gamma(\theta^*, \delta)} \sum_{i=1}^d \gamma_i \le c'\,\delta^2 k^*(0, \delta),$$
which validates the regularity condition (9). Therefore, Theorem 1 guarantees that
$$\mathcal{G}_0(\delta) \le c\,\delta^{1 - \frac{1}{2\alpha}}. \quad (11)$$
In fact, the bound (11) can be shown to be tight up to a constant pre-factor. See the discussion following Corollary 2 in the sequel for further details.

Lower bound on the localized Gaussian width
Thus far, we have derived an upper bound for the localized Gaussian width. In this section, we use information-theoretic methods to prove an analogous lower bound. This lower bound involves both the critical dimension k*(θ*, δ), as previously defined in equation (8), and a second quantity, a mapping δ ↦ Φ(δ) that measures the proximity of θ* to the boundary of the ellipse. As shown by Wei and Wainwright [30], this mapping is well-defined, and it has the limiting behavior Φ(δ) → 0 as δ → 0⁺; for completeness, we include the verification of these claims in Appendix G, along with a sketch of the function. We denote by Φ⁻¹(x) the largest positive value of δ such that Φ(δ) ≤ x. Note that by this definition, we have Φ⁻¹(1) = ∞.
Recall that the elliptical norm on ℝ^d is defined via $\|\theta\|_E^2 = \sum_{j=1}^d \theta_j^2/\mu_j$. We are now ready to state our lower bound for the localized Gaussian width (Theorem 2).
See Section 5.2 for the proof of this theorem.
We remark that the regularity condition (9) is not necessary for this result to hold. Besides, Theorem 2 imposes a condition ensuring that θ* is not too close to the boundary of the ellipse; we assume this since it is not our primary interest to study the case when θ* is sufficiently close to the boundary.

Some consequences
One useful consequence of Theorem 1 and Theorem 2 is in providing sufficient conditions for tight control of the localized Gaussian width. If the ellipse E is regular at θ*, then the above theorems imply that the localized Gaussian width (6) is equivalent to $\delta\sqrt{k^*(\theta^*, \delta)}$ up to a multiplicative constant. Specifically, we have the sandwich relation
$$c_\ell\,\delta\sqrt{k^*(\theta^*, \delta)} \;\le\; \mathcal{G}_{\theta^*}(\delta) \;\le\; c_u\,\delta\sqrt{k^*(\theta^*, \delta)} \quad (13)$$
for some positive constants $c_\ell$ and $c_u$.
Recall our earlier calculation from Example 3, where we showed that the localized Gaussian width of the Euclidean ball scales as δ√d, up to multiplicative constants. The sandwich relation (13) shows that this same scaling holds more generally, with d replaced by k*(θ*, δ). Thus, we can think of k*(θ*, δ) as corresponding to the "effective dimension" of the set $E_{\theta^*} \cap B(\delta)$. It is worth pointing out that our results have a number of corollaries, in particular in terms of how local Gaussian widths and Kolmogorov widths are related to metric entropy. Recall the notion of the metric (packing) entropy log M as previously defined in Section 2.1. The following corollary provides a sandwich for k*(θ*, δ) in terms of the metric entropy of the set $E_{\theta^*} \cap B(\delta)$.

Corollary 2. There are universal constants $c_1, c_2 > 0$ such that for any pair (θ*, E) satisfying the regularity condition (9), we have
$$c_1 \log M\big(\delta/2,\, E_{\theta^*} \cap B(\delta)\big) \;\overset{(i)}{\le}\; k^*(\theta^*, \delta) \;\overset{(ii)}{\le}\; c_2 \log M\big(\delta/2,\, E_{\theta^*} \cap B(\delta)\big).$$

See Appendix B for the proof. The lower bound (i) is a relatively straightforward consequence of Sudakov's inequality (4), when combined with our results connecting the Kolmogorov and Gaussian widths. The upper bound (ii) requires a lengthier argument.
Recall that in Example 4, we argued that for the Sobolev ellipse with smoothness α > 1/2, the critical dimension at θ* = 0 scales as $k^*(0, \delta) \asymp (1/\delta)^{1/\alpha}$. Combining this calculation with Corollary 2, we find that $\log M\big(\delta/2,\, E_{\theta^*} \cap B(\delta)\big) \asymp (1/\delta)^{1/\alpha}$ up to multiplicative constants. This is a known fact that can be verified by constructing explicit packings of these function classes, but it serves to illustrate the sharpness of our results in this particular context.

Consequences for estimation
In the previous section, we established upper and lower bounds on the localized Gaussian width in Theorem 1 and Theorem 2. We now turn to some consequences of these bounds, in particular for the problem of constrained least-squares estimation.
In particular, suppose that we are given observations $y \sim N(\theta^*, \sigma^2 I_d)$ with θ* ∈ E according to the earlier model (2), and we consider the constrained least-squares estimator (LSE)
$$\hat{\theta} := \arg\min_{\theta \in E} \|y - \theta\|_2^2. \quad (15)$$
Let us assume that the ellipse E is regular at θ*, so that the localized Gaussian width satisfies the bounds (13) with constants $c_\ell$ and $c_u$. Connecting the error $\|\hat{\theta} - \theta^*\|_2$ to these Gaussian width bounds involves two functions of δ (among them the function g appearing below), constructed from the critical dimension defined in expression (8).
Let us consider the fixed point equation
$$\sigma^2\, k^*(\delta) = \delta^2. \quad (17)$$
Since δ ↦ k*(δ) is a non-increasing function of δ (see Wei and Wainwright [30, Appendix D.1]) while δ ↦ δ²/σ² is increasing, if this fixed point problem (17) has a solution, then the solution is unique, and we denote it by δ*.
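Because one side of (17) is non-increasing in δ while the other is increasing, the fixed point can be located by bisection. A sketch for the Sobolev ellipse at θ* = 0 (using the explicit coordinate-projection formula for k*, valid for axis-aligned ellipses; all names are ours), checked against the prediction δ* ≍ σ^{2α/(2α+1)}:

```python
import numpy as np

def critical_dim(mu, delta, eta=0.1):
    # explicit k*(0, delta) for an axis-aligned ellipse, sorted mu (coordinate projection)
    return int(np.sum(np.sqrt(mu) > (1 - eta) * delta))

def critical_radius(mu, sigma):
    """Bisection for the fixed point sigma^2 k*(delta) = delta^2: the left side
    is non-increasing in delta and the right side increasing, so the crossing
    point is unique."""
    lo, hi = 1e-8, float(np.sqrt(mu.max()))
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if mid ** 2 < sigma ** 2 * critical_dim(mu, mid):
            lo = mid
        else:
            hi = mid
    return hi

alpha, d = 1.0, 200000
mu = np.arange(1, d + 1, dtype=float) ** (-2 * alpha)
for sigma in (1e-2, 1e-3):
    ratio = critical_radius(mu, sigma) / sigma ** (2 * alpha / (2 * alpha + 1))
    print(sigma, ratio)   # ratio stays bounded: delta* ~ sigma^(2a/(2a+1))
```

The ratio is stable across noise levels, reflecting that the constant in δ* ≍ σ^{2α/(2α+1)} does not depend on σ.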
We can now give a precise statement relating the estimation rate of θ to the solution δ * of the fixed point equation (17).
Proposition 1 (Least squares on ellipses). Let E be regular at θ*, and let δ* be the solution to the fixed point problem (17). Suppose furthermore that the following conditions hold: (a) the function g is unimodal in δ; (b)–(c) the function g does not change too drastically at two points c₁δ* and c₂δ* close to δ* (cf. the discussion following the proposition).
Then the error of the least-squares estimator (15) satisfies
$$c\,\delta_* \le \|\hat{\theta} - \theta^*\|_2 \le C\,\delta_* \quad \text{with probability at least } 1 - \exp\big(-c'\,k^*(\delta_*)\big), \quad (18)$$
for some constants that depend only on c₁ and c₂.
See Appendix D for the proof of this result.
Note that this result is stated for the ellipse E(R) with R = 1. For arbitrary R one can easily rescale to obtain similar results; see equation (37) in Section D.1 for more detail. When we say g is unimodal, we mean that there is some t such that g is nondecreasing for δ < t and nonincreasing for δ > t.
Equation (18) provides a high-probability bound on the least-squares error. If furthermore δ* ≳ σ, then we are also guaranteed that the mean-squared error is sandwiched as
$$c\,\delta_*^2 \le \mathbb{E}\|\hat{\theta} - \theta^*\|_2^2 \le c'\,\delta_*^2$$
for some universal constants (c, c′).
We claim that the conditions of Proposition 1 are relatively mild. Note that a closely related function is strictly concave [5, Thm. 1.1], as mentioned in Appendix D.1, so it is reasonable to believe that its approximation g is unimodal. Moreover, assumptions (b) and (c) essentially assert that g does not change too drastically at two points c₁δ* and c₂δ* close to the critical radius δ*. In the next section, we check these assumptions for different examples.
Note that fixed point problem (17) can be viewed as a kind of a critical equation (e.g., [29,Ch. 13] and [32]), whose solution δ * we call the critical radius. Typically an upper bound on the localized Gaussian width would allow this critical radius to serve as an upper bound for the error θ − θ * 2 . Here, we show that with two-sided control of the localized Gaussian width and a regularity assumption, the error also satisfies a matching lower bound. In the next section, we will illustrate the consequence of this result with some examples.

Adaptive estimation rates
We now demonstrate the consequences of Proposition 1 via some examples. We begin with the simple problem of estimation for θ * = 0, where we see a number of standard rates from the ellipse estimation literature. We then consider some more interesting examples of extremal vectors, and show how the resulting estimation rates differ from the classical ones.

Estimating at θ * = 0
We begin our exploration by considering the ellipse-constrained estimation problem at θ* = 0. In this section, we focus on two types of ellipses, specified by aspect ratios µ_j that exhibit α-polynomial decay and γ-exponential decay, respectively. The first one corresponds to estimating a function in an α-smooth Sobolev class, that is, functions that are almost everywhere α-times differentiable, with the derivative f^(α) being Lebesgue integrable.
α-polynomial decay: Consider an ellipse E defined by the aspect ratios $\mu_j = c\,j^{-2\alpha}$ for some α > 1/2. In Example 4, relation (10), it is verified that this ellipse is regular at 0, and that $k^*(\delta) \asymp \delta^{-1/\alpha}$. Thus, solving the fixed point problem (17) yields $\delta_* \asymp \sigma^{\frac{2\alpha}{2\alpha+1}}$, and one can check that the conditions for Proposition 1 are met. Here our notation ≍ denotes equality up to constants independent of (σ, d). With the rescaling argument (37), the proposition implies
$$c\,(\sigma^2)^{\frac{2\alpha}{2\alpha+1}} \le \|\hat{\theta} - \theta^*\|_2^2 \le C\,(\sigma^2)^{\frac{2\alpha}{2\alpha+1}} \quad \text{with probability at least } 1 - \exp\big(-c'\,\sigma^{-\frac{2}{2\alpha+1}}\big),$$
for some constants C > c > 0 and c′. One may notice that the rate $(\sigma^2)^{\frac{2\alpha}{2\alpha+1}}$ coincides with the minimax estimation rate for an α-smooth Sobolev function class. We show in a later section that this is indeed the case.
γ-exponential decay: Consider another case where the ellipse E is defined by the aspect ratios $\mu_j = c_1 \exp(-c_2 j^{\gamma})$ for some γ > 1/2. Then a slight modification of the computation in Example 4 yields $k^*(\delta) \asymp \log^{1/\gamma}(1/\delta)$. In order to establish the regularity condition, notice that in this case, $\inf_{\gamma' \in \Gamma(\theta^*, \delta)} \sum_{i=1}^d \gamma'_i$ is achieved in the limit by $\gamma'_i = \mu_i \mathbf{1}\{i > k^*(\delta)\}$, and furthermore $\sum_{i > k^*(\delta)} \mu_i \le c\,\delta^2\, k^*(\delta)$, which, by definition, shows that E is regular at θ* = 0.
Solving the fixed point problem (17) yields $\delta_* \asymp \sigma \log^{\frac{1}{2\gamma}}(1/\sigma)$, up to other polylogarithmic factors in σ. One can check that the conditions for Proposition 1 are met, so by the rescaling argument (37), we have, up to polylogarithmic factors,
$$c\,\sigma^2 \log^{1/\gamma}(1/\sigma) \le \|\hat{\theta} - \theta^*\|_2^2 \le C\,\sigma^2 \log^{1/\gamma}(1/\sigma) \quad \text{with probability} \ge 1 - \exp\big(-c''\log^{1/\gamma}(1/\sigma)\big),$$
for some constants C > c > 0 and c''.

Estimating at extremal vectors
In the previous section, we studied the adaptive estimation rate for θ* = 0. In this section, we study some non-zero choices of the vector θ*. For concreteness, we restrict our attention to vectors that are non-zero in some coordinate s ∈ [d] = {1, …, d}, and zero in all other coordinates. Even for such simple vectors, our analysis reveals some interesting and adaptive scalings.
Concretely, we consider vectors of the form $\theta^* = (1 - r)\sqrt{\mu_s}\, e_s$, where r and related quantities are small constants defined in Wei and Wainwright [30, Corollary 2]. Note that the shrinkage −r away from the boundary is due to the boundary issue in Theorem 2. We believe it is an artifact of our analysis that is possibly removable; for instance, in our simulations below (Figure 4), we have an example with $\theta^* = \sqrt{\mu_1}\, e_1$ on the boundary of the ellipse that exhibits the same predicted behavior as its shrunken counterpart. So as to streamline notation, we adopt k*(δ) as shorthand for k*(θ*, δ). Wei and Wainwright [30] (Section 4.4) show that with ξ = (1 − η)δ, the critical dimension is upper bounded by an integer $m_u$; this upper bound is proved by considering the projection onto the $m_u$-dimensional subspace spanned by {e₁, …, e_{m_u}}. At the same time, we prove in Lemma 6 a matching lower bound involving an integer $m_\ell$.

α-polynomial decay: Consider an ellipse E with $\mu_j = c\,j^{-2\alpha}$ for some α > 1/2. From the above calculation, we can conclude that $m_u \asymp m_\ell \asymp k^* \asymp (\mu_s \delta^2)^{-\frac{1}{4\alpha}}$. Here our notation ≍ denotes equality up to constants independent of problem parameters such as (σ, d). Let us verify the regularity condition (9) with dimension $m_u$ and the projection $\Pi_{m_u}$ onto the linear space spanned by {e₁, …, e_{m_u}}. Since $\gamma = \delta^2(0, \ldots, 0, \mu_{m_u+1}, \ldots, \mu_d)$ is feasible in the limit for the set Γ(θ*, δ, Π_{m_u}), we have $\inf_{\gamma \in \Gamma(\theta^*, \delta, \Pi_{m_u})} \sum_{i=1}^d \gamma_i \le \delta^2 \sum_{i > m_u} \mu_i$. Since α > 1/2, and k* and $m_u$ are equal up to a constant, the right-hand side above is bounded above by $c\,\delta^2 k^*$, which establishes the regularity condition at θ*.
Solving the fixed point problem (17) yields $\delta_*^2 \asymp (\sigma^2)^{\frac{4\alpha}{4\alpha+1}}$. One can check that the conditions for Proposition 1 are met, so by the rescaling argument (37), we have
$$c\,(\sigma^2)^{\frac{4\alpha}{4\alpha+1}} \le \|\hat{\theta} - \theta^*\|_2^2 \le C\,(\sigma^2)^{\frac{4\alpha}{4\alpha+1}} \quad \text{with high probability},$$
for some constants C > c > 0 and c′.
Numerical results: To illustrate our findings from above, Figure 4 provides a numerical plot of the mean-squared error of the constrained least-squares estimator (15) for estimating the vector θ* = 0 (blue curve) and the vector θ* = e₁ (red curve). In each case, the plot shows how the error decreases as a function of the inverse noise level 1/σ². The underlying ellipse is defined by the eigenvalues $\mu_j = j^{-2\alpha}$ with α = 1. Consequently, the predicted scaling of the mean-squared error is $(\sigma^2)^{\frac{2\alpha}{2\alpha+1}}$ for the zero vector, and $(\sigma^2)^{\frac{4\alpha}{4\alpha+1}}$ for the "spiked" e₁ vector. Based on these predictions, our theory suggests that on a log-log plot, the mean-squared error should decay at a linear rate with slopes −2/3 and −4/5, respectively. The empirical least-squares fit shows that these predictions are very accurate.
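A simplified reconstruction of this experiment (not the authors' code; all parameters are illustrative): the LSE over the ellipse is the Euclidean projection of y onto E, computable by bisection on the Lagrange multiplier, and the empirical log-log slope of the Monte Carlo mean-squared error at θ* = 0 can be compared with the predicted −2α/(2α + 1) = −2/3:

```python
import numpy as np

def project_to_ellipse(y, mu):
    """Euclidean projection of y onto {theta : sum_j theta_j^2 / mu_j <= 1}.
    Stationarity gives theta_j = mu_j * y_j / (mu_j + lam) for a multiplier
    lam >= 0, found by bisection on the (decreasing) constraint value."""
    if np.sum(y ** 2 / mu) <= 1.0:
        return y.copy()
    def constraint(lam):
        return float(np.sum(mu * y ** 2 / (mu + lam) ** 2))
    hi = 1.0
    while constraint(hi) > 1.0:
        hi *= 2.0
    lo = 0.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if constraint(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return mu * y / (mu + hi)

rng = np.random.default_rng(4)
alpha, d = 1.0, 4000
mu = np.arange(1, d + 1, dtype=float) ** (-2 * alpha)
sigmas = np.array([0.02, 0.01, 0.005, 0.0025])
mse = []
for sigma in sigmas:
    errs = [float(np.sum(project_to_ellipse(sigma * rng.normal(size=d), mu) ** 2))
            for _ in range(30)]              # theta* = 0, so y = sigma * w
    mse.append(np.mean(errs))
slope = np.polyfit(np.log(1.0 / sigmas ** 2), np.log(mse), 1)[0]
print(f"log-log slope = {slope:.2f} (prediction: -2/3 for alpha = 1)")
```

The fitted slope is close to −2/3, matching the blue curve's predicted rate; repeating the simulation at a spiked θ* would analogously probe the −4/5 rate.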

Minimax risk bounds
As another consequence of our main results, in this section we show that the LSE is minimax optimal for the ellipse estimation problem described above. Here the minimax risk over the ellipse $\mathcal{E}$ is defined by taking the supremum over distributions $N(\theta^*, \sigma^2 I_n)$ indexed by $\theta^* \in \mathcal{E}$, and the infimum over all estimators. By this criterion, estimators are compared on their worst-case performance.
In the following, we show that the minimax optimal risk is achieved by the LSE, and that this risk is characterized through the solution to the fixed point problem (17). Let $\delta_*(0)$ be the solution to the fixed point problem (17) for $\theta^* = 0$. If, furthermore, the ellipse is regular (9) for all $\theta^* \in \mathcal{E}$, then the bounds (21a) and (21b) hold. We prove this result in Appendix C.
In contrast to the minimax lower bound of Yang et al. [32], our minimax lower bound (21a) does not require the regularity assumption (9). See Appendix A for a discussion of how the notion of regularity of Yang et al. [32] is a special case of our notion. The lower bound is proved by showing that the ellipse contains a $k^*$-dimensional ball, and then applying the standard minimax bound for estimation in a $k^*$-dimensional space.
On the other hand, the upper bound (21b) does require the regularity assumption, which allows us to apply Proposition 1. That proposition implies that the risk of the LSE for each problem $\theta^* \in \mathcal{E}$ is upper bounded by $\delta_*^2(\theta^*)$. Furthermore, we show that among all $\theta^*$, the largest upper bound $\delta_*^2(\theta^*)$ is achieved at $\theta^* = 0$, which yields the upper bound in Corollary 3. Thus, the hardest problem for the LSE is estimating $\theta^* = 0$, and its risk there matches the lower bound. In short, the LSE is minimax optimal for ellipses that are regular.

Proofs
We now turn to the proofs of our main results, namely Theorem 1 and Theorem 2. The proofs of more technical results are deferred to appendices, as noted within this section.

Proof of Theorem 1
For any dimension and projection pair $(k, \Pi_k)$, we can decompose the localized Gaussian width into two terms $T_1$ and $T_2$, which we now proceed to upper bound.
Bounding $T_1$: From standard properties of orthogonal projections onto subspaces, we have $\langle w - \Pi_k w, \Pi_k \Delta \rangle = 0$ for any $w$ and $\Delta$. Combining this fact with the Cauchy-Schwarz inequality, the term $T_1$ can be upper bounded. By the non-expansiveness of projection onto a subspace, we have $\|\Pi_k \Delta\|_2 \le \|\Delta\|_2 \overset{(i)}{\le} \delta$, where inequality (i) follows from the inclusion $\Delta \in B(\delta)$. Thus, we have established the claimed bound on $T_1$, where the last step follows from first applying Jensen's inequality, and then noting that $\Pi_k w$ is distributed as a $k$-dimensional standard Gaussian vector.

Proof of Theorem 2
As in the preceding proof, we adopt $k^*$ as convenient shorthand for the quantity $k^*(\theta^*, \delta)$. We now divide our analysis into two cases, depending on whether or not $\|\theta^*\|_{\mathcal{E}} \le 1/2$.

Case I
First, suppose that $\|\theta^*\|_{\mathcal{E}} \le \frac{1}{2}$, which implies that $\Phi(\delta) \le (\|\theta^*\|_{\mathcal{E}} - 1)^2 \le 1$. Under this condition, Lemma 2 from the paper [30] guarantees a lower bound on the localized Gaussian width. By definition, the critical dimension $\bar{k}^* := \arg\min\{k : W_k(\mathcal{E}_{\theta^*} \cap B((1-\eta)\delta)) \le \frac{9}{10}\delta\}$ can be upper bounded, where we have used the fact that $\frac{9}{10} \le 1 - \eta$, and that $W_k(\mathcal{E}_{\theta^*} \cap B((1-\eta)\delta))$ is non-decreasing in $k$. Let $E_{k^*}$ denote the $k^*$-dimensional subspace of vectors that are zero in their last $d - k^*$ coordinates. Recalling that $S(r)$ denotes a Euclidean sphere of radius $r$, we claim the inclusion (24). Taking this claim as given for the moment, and combining it with the bounds $\|\theta^*\|_{\mathcal{E}} \le 1/2$ and $\bar{k}^* \le k^*$, we obtain the desired lower bound, which completes the proof of Theorem 2 in this case.

Proof of inequality (24)
In this proof, we adopt the convenient shorthand $b = 3/10$. Part (ii) of the inequality can be seen from the spherical example in the discussion of Theorem 1. It only remains to prove part (i). Let us first show that $S(2b\delta) \cap E_{d-k^*} \subset \mathcal{E}$. Recalling the definition of $k^*$ from equation (23), we have a chain of inequalities in which inequality (iii) follows from the non-increasing order of the $\mu_i$, and inequality (iv) follows from the definition of $k^*$.
In order to establish the inclusion $B_{k^*}(b\delta) \subset \mathcal{E}_{\theta^*}$, we make use of the fact that $\|\theta^*\|_{\mathcal{E}} \le 1/2$. Since $\|2\theta^*\|_{\mathcal{E}} \le 1$, we have $2\theta^* \in \mathcal{E}$. For any $v \in S_{k^*}(b\delta)$, since $B_{k^*}(2b\delta) \subset \mathcal{E}$, we have $2v \in \mathcal{E}$. Combining these two facts with the convexity of the set $\mathcal{E}$, we have $v + \theta^* \in \mathcal{E}$. This in turn implies that $B_{k^*}(b\delta) \subset \mathcal{E}_{\theta^*}$, and finishes the proof of inequality (24).
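The convexity step can be written out as a midpoint argument:

```latex
2v \in \mathcal{E}, \quad 2\theta^* \in \mathcal{E}
\qquad \Longrightarrow \qquad
v + \theta^* \;=\; \tfrac{1}{2}(2v) + \tfrac{1}{2}(2\theta^*) \;\in\; \mathcal{E},
```

since a convex set contains the midpoint of any two of its elements; subtracting $\theta^*$ then shows that $v \in \mathcal{E}_{\theta^*}$.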

Case II
Otherwise, we may assume that $\|\theta^*\|_{\mathcal{E}} > 1/2$, in which case $\Phi(\delta/c) \le (\|\theta^*\|_{\mathcal{E}} - 1)^2 < 1$, and hence by definition of the function $\Phi$, we have $\delta < c\|\theta^*\|_2/a$. For the remainder of the proof, we assume that $k^* \ge 160$; the case $k^* < 160$ is addressed separately at the end of the proof.
The proof of Theorem 2 requires two auxiliary lemmas. The first is a packing lemma, proved in Wei and Wainwright [30, Lemma 4]; here we state a slightly altered version of this claim, better suited to our purposes. Let $M$ denote the diagonal matrix with entries $1/\mu_1, \ldots, 1/\mu_d$, and adopt the shorthands $a := 1 - \eta$ and $b := \frac{3}{10}$ based on the definition of the critical dimension (8).
Lemma 1. For any vector $\theta^* \in \mathcal{E}$ such that $\|\theta^*\|_2 > a$, there exists a vector $\theta^\dagger \in \mathcal{E}$, a collection of $d$-dimensional orthonormal vectors $\{u_i\}_{i=1}^{k^*}$, and an upper triangular matrix $H$ with ordered singular values $\nu_1 \ge \cdots \ge \nu_{k^*-1} \ge 0$ such that: (a) The vectors $u_1$, $M\theta^\dagger$, and $\theta^\dagger - \theta^*$ are all scalar multiples of one another.
(e) For any integers in the stated range, a corresponding spectral lower bound holds.

Before proving Theorem 2, let us introduce some notation. Let $H$, $U$ and $\theta^\dagger$ be as given in Lemma 1 above, and let $X := UH$ have columns $x_1, \ldots, x_{k^*-1}$. Let $V$ be the matrix of right singular vectors of $H$, so that $H^\top H = V \Sigma^2 V^\top$, where $\Sigma^2$ is diagonal with the squared singular values $\nu_1^2 \ge \cdots \ge \nu_{k^*-1}^2$ of $H$ in order.
Let $m_1 := (k^*-1)/8$ and $m_2 := (k^*-1)/4$, and define the sparsity level $s := \rho \frac{k^*-1}{16}$ for some constant $\rho \in (0, 1)$. For a given $s$-sized subset $S$ of $\{m_1, \ldots, m_2\}$, any vector of the form $z_S = (z_{S,1}, \ldots, z_{S,k^*-1}) \in \{-1, 0, 1\}^{k^*-1}$ with zeros in all positions not indexed by $S$ is called an $S$-valid sign vector. Any such sign vector can be used to define a perturbed vector $\theta_S$ via equation (25). The following lemma guarantees the existence of a large collection $T$ of $s$-sized subsets of $\{m_1, \ldots, m_2\}$ such that the collection $\{\theta_S, S \in T\}$ has certain desirable properties. (b) For each $S \in T$, there is an $S$-valid sign vector $z_S$ such that the associated perturbation $\theta_S$ belongs to the ellipse $\mathcal{E}$, and moreover satisfies the bounds in (26). See Appendix E.1 for the proof of this lemma.
Turning back to the proof of Theorem 2, consider those perturbation vectors (25) that are defined via Lemma 2. For each $S \in T$, we define the vectors $\Delta_S := \theta_S - \theta^*$ and $\widetilde{\Delta}_S := \frac{\delta}{\|\Delta_S\|_2}\Delta_S$. Inequality (26) implies that $\frac{\delta}{\|\Delta_S\|_2} \le 1$. By the convexity of the set $\mathcal{E}_{\theta^*}$, we have $\widetilde{\Delta}_S \in \mathcal{E}_{\theta^*} \cap S(\delta)$ for each $S \in T$. By restricting the supremum to this smaller subset, we obtain a lower bound on the localized Gaussian width. Re-writing the definition (25) in the form $\theta_S = \theta^\dagger + \frac{b\delta}{\sqrt{32s}} UHV^\top z_S$, we can rewrite the resulting Gaussian process accordingly, where we use the fact that $\mathbb{E}\langle w, \theta^\dagger - \theta^* \rangle = 0$. The right-hand side is non-negative, since the maximum is at least its value at any fixed choice of $S_0 \in T$. Inequality (26)(ii) can be rewritten as an upper bound on $\|\Delta_S\|_2^2$. Putting together the pieces, it remains to lower bound the expected maximum on the right-hand side; to this end, we state an auxiliary result, whose two inequalities (45) and (46) hold under the conditions of Theorem 2. See Appendix E.2 for the proof of this lemma.
Let us now control the term on the right-hand side of inequality (28). Let $A$ be the event that there are at least $s$ positive elements among the i.i.d. standard Gaussian random variables $\{w_i\}_{i=m_1}^{m_2}$. By the law of total expectation, we can split the expectation into two terms $T_1$ and $T_2$. Beginning our analysis with $T_1$: under the event $A$, there exists some (random) subset $S' \in T$ of cardinality $|S'| \ge s$ such that $w_i > 0$ for all $i \in S'$. (When there are multiple such sets, we choose one of them uniformly at random.) In terms of this set, we condition on the choice of $S'$, where $\mathbb{P}[S' \mid A]$ denotes the conditional probability of the randomly chosen $S'$ given that $A$ holds. Since we are conditioning on a random set $S'$ on which each $w_i$ is positive, each such variable has conditional mean $\sqrt{2/\pi}$. Since $\sum_{S'} \mathbb{P}[S' \mid A] = 1$, we have proved that $T_1 \ge s\sqrt{2/\pi}$.
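The constant $\sqrt{2/\pi}$ here is the mean of $|w_i|$ for a standard Gaussian, which is also the conditional mean of $w_i$ given $w_i > 0$. A quick Monte Carlo sanity check of this fact (our own illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10**6)

# E|w| = sqrt(2/pi) for w ~ N(0,1); conditioning on w > 0 gives the same mean.
mc_abs = np.abs(w).mean()
mc_pos = w[w > 0].mean()
target = np.sqrt(2.0 / np.pi)
print(mc_abs, mc_pos, target)   # all three agree closely
```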
Turning to the term $T_2$, we begin by observing that for any fixed $S_0 \in T$, the maximum $\max_{S \in T} \sum_{i \in S} w_i$ is at least $\sum_{i \in S_0} w_i$. Using this observation we can conclude a lower bound on $T_2$, where step (i) follows from the fact that $A^c$ depends only on the signs of the $w_i$, and the distribution of $|w_i|$ is independent of $A^c$. Combining these two lower bounds yields a lower bound on $\mathbb{E}\max_{S \in T} \sum_{i \in S} w_i$, where the last step uses the fact that $m_2 - m_1 \ge k^*/16 > 10$.
Combining this last bound with inequalities (27) and (28) yields the claimed lower bound, where the last step uses the fact that $s = \rho \frac{k^*-1}{16}$. In order to finish the proof, we deal with the case $k^* < 160$ separately. According to part (b) of Lemma 1, if we denote $v_1 := \theta^* - \theta^\dagger$, then $\theta^\dagger \in \mathcal{E}$ and $\|v_1\|_2 = a\delta$. It is also shown in the proof of Wei and Wainwright [30, Lemma 5] that $\theta^* + v_1 \in \mathcal{E}$. Therefore the two points $\pm v_1$ are both contained in $\mathcal{E}_{\theta^*} \cap B(\delta)$ for a sufficiently small $\delta$. As a result, we have $\mathcal{G}(\mathcal{E}_{\theta^*} \cap S(\delta)) \ge \mathcal{G}(\{\pm v_1\}) = a\delta\sqrt{2/\pi}$, which establishes the lower bound in Theorem 2 with constant $c = \frac{a}{4\sqrt{5\pi}}$.

Discussion
In this paper, we studied the behavior of localized Gaussian widths over ellipses. These localized widths are known to play a fundamental role in controlling the difficulty of associated testing and estimation problems. Despite this fundamental importance, the localized Gaussian width is hard to compute in general. The main contribution of our paper was to show how the localized Gaussian width can be bounded, both from above and below, via the localized Kolmogorov dimension. These Kolmogorov dimensions can be computed in many interesting cases, which leads to an explicit characterization of the estimation error of least-squares regression as a function of the true regression vector within the ellipse. We used this characterization to show how the difficulty of estimating a vector $\theta^*$ within the ellipse can vary dramatically as a function of the location of $\theta^*$. Estimating the all-zeros vector ($\theta^* = 0$) is always the hardest sub-problem, and leads to the global minimax rate. Much faster rates of estimation can be obtained for vectors located near "narrower" portions of the ellipse boundary. While much of the analysis in this paper is specific to ellipses, we do anticipate that the general procedure of moving from the Gaussian width to the Kolmogorov width could be useful in studying adaptivity and local geometry in other estimation problems.

A Comparison with the regularity condition of Yang et al. [32]
For $k < d$, the optimal subspace is $\mathrm{span}\{e_1, \ldots, e_k\}$, and the maximization is achieved by $\theta = \min\{\mu_{k+1}^{1/2}, (1-\eta)\delta\} e_{k+1}$. On the other hand, for $k = d$ we have $W_k(\mathcal{E}_{\theta^*} \cap B((1-\eta)\delta)) = 0$. Putting these two together gives the stated expression (with the convention $k^* = d$ if the minimum is over an empty set). Thus, we have recovered definition (29) up to a constant factor in $\delta$.

Since the optimal projection $\Pi_{k^*}$ is the projection onto the linear subspace $\mathrm{span}\{e_1, \ldots, e_{k^*}\}$, we can consider a sequence of positive vectors approaching $\gamma$. Consequently, our regularity condition (9) holds as long as $\sum_{i=k^*+1}^d \mu_i \le c k^* \delta^2$. Thus, it matches the notion of regularity (30) considered in Yang et al. [32].

B Proof of Corollary 2
Throughout this proof, we use $c, c', c''$, etc. to denote universal constants that do not depend on any problem parameters such as $\delta$, $\mu_i$ and $\theta^*$; their values can vary from line to line.

The proof of inequality (i) in equation (14) is straightforward. By combining the Sudakov minoration (4) with our upper bound (13) on the localized Gaussian width, we find that $c_\ell\, \delta \sqrt{\log M(\delta/2,\ \mathcal{E}_{\theta^*} \cap B(\delta))} \le \mathcal{G}(\mathcal{E}_{\theta^*} \cap B(\delta)) \le c_u\, \delta \sqrt{k^*(\theta^*, \delta)}$. Thus, we have proved inequality (i) in equation (14).
We now turn to the proof of the second inequality (ii). It is convenient to divide our analysis into two cases, depending on whether or not $\|\theta^*\|_{\mathcal{E}} \le \frac{1}{2}$.
Case 1: $\|\theta^*\|_{\mathcal{E}} \le \frac{1}{2}$. As shown earlier in equation (24) from the proof of Theorem 2, the set $\mathcal{E}_{\theta^*} \cap B(\delta)$ contains the $k^*$-dimensional sphere $S(\frac{3}{10}\delta) \cap E_{k^*}$. Thus, by a standard volume argument [21, 29], its log packing number is bounded from below by $c k^* \log\frac{1}{\delta}$. This quantity is lower bounded by $k^*$ up to a universal constant, which establishes inequality (ii) in this case.

Case 2:
$\|\theta^*\|_{\mathcal{E}} > \frac{1}{2}$. We follow the notation from Section 5. In the proof of Theorem 2 (in particular, see equation (25) and Lemma 2), we constructed a set of vectors $\theta_S$ that, after rescaling, all lie in our set $\mathcal{E}_{\theta^*} \cap B(\delta)$. Each such vector $\theta_S$ is formed by taking a certain point $\theta^\dagger$ near $\theta^*$, and adding certain combinations of the orthogonal vectors $u_i$. We argue here that there is a subset of these scaled vectors, of log-cardinality of order $k^*$, whose elements are pairwise separated from each other by a distance of order $\delta$.
We are only interested in proving bounds up to constant factors, meaning that we may assume without loss of generality that $k^* \ge 32 \times 10^4$; otherwise the result (14) holds immediately with a sufficiently large choice of $c'$.
Recall the earlier definition $s := \rho \frac{k^*-1}{16}$ for a fixed constant $\rho \in (0, 1)$; for this argument, we take $\rho = 10^{-4}$. By Lemma 4.10 in Massart [17], we can find a subset of $s$-sparse vectors contained in the binary hypercube $\{0, 1\}^{\frac{1}{16}(k^*-1)}$ with log cardinality at least $s \log \frac{k^*-1}{16s}$, and such that any pair of distinct elements differs in at least $(2 - 2\rho)s$ entries. Transferring this result to the context of Lemma 2, we are guaranteed a collection of vectors of log cardinality of order $k^*$ such that the separation bound holds whenever $z_S \ne z_{S'}$ in our packing.
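A greedy (Gilbert-Varshamov style) construction gives a feel for the kind of sparse packing that Massart's lemma guarantees. The sketch below is our own small-scale illustration, with hypothetical parameters $n = 200$ and $s = 6$ standing in for $\frac{1}{16}(k^*-1)$ and $\rho\frac{k^*-1}{16}$; for simplicity it packs supports, requiring pairwise symmetric difference at least $s$ (i.e., support overlap at most $s/2$) rather than $(2-2\rho)s$.

```python
import numpy as np

def greedy_sparse_packing(n, s, min_hamming, n_candidates=1000, seed=0):
    """Greedily collect supports of s-sparse vectors in an n-dimensional
    hypercube whose pairwise Hamming distance (= size of the symmetric
    difference of supports) is at least min_hamming."""
    rng = np.random.default_rng(seed)
    supports = []
    for _ in range(n_candidates):
        cand = frozenset(rng.choice(n, size=s, replace=False).tolist())
        if all(len(cand ^ kept) >= min_hamming for kept in supports):
            supports.append(cand)
    return supports

n, s = 200, 6
packing = greedy_sparse_packing(n, s, min_hamming=s)
# Massart-style bound: log-cardinality of order s * log(n/s) is achievable.
print(len(packing), s * np.log(n / s))
```

In this regime almost every random candidate is accepted, since two random size-6 supports among 200 coordinates rarely overlap in more than 3 positions; the exponential-in-$s\log(n/s)$ cardinality is far beyond what this toy search enumerates, but the separation property is easy to verify.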
Recalling that $V^\top H^\top H V = \Sigma^2$ and the definition (25) of $\theta_S$, we can compute the pairwise distances. Since $z_S$ and $z_{S'}$ are zero in their first $m_1 - 1$ components, we can use inequality (ii) from Lemma 1 to bound the relevant diagonal entries of $\Sigma$. Doing so yields the claimed separation: we have obtained a collection of vectors $\theta_S$, indexed by subsets $S$, such that $\|\theta_S - \theta_{S'}\|_2 \gtrsim \delta$ for $S \ne S'$.
Finally, we need to show that after shrinking these $\theta_S$ toward $\theta^*$ and re-centering, we obtain a packing of $\mathcal{E}_{\theta^*} \cap B(\delta)$. For each $S$, recall the definitions $\Delta_S := \theta_S - \theta^*$ and $\widetilde{\Delta}_S := \frac{\delta}{\|\Delta_S\|_2}\Delta_S$. In the discussion below Lemma 2, we have already shown that each vector $\widetilde{\Delta}_S$ lies in $\mathcal{E}_{\theta^*} \cap B(\delta)$; it only remains to verify that distinct pairs are well-separated.
First, direct computation yields an expression for $\|\widetilde{\Delta}_S - \widetilde{\Delta}_{S'}\|_2^2$. In order to show that it is lower bounded by a constant multiple of $\delta^2$, it suffices to upper bound the inner product term. We use the fact that $\theta^\dagger - \theta^*$ has norm $a\delta$ and is orthogonal to the columns of $U$ (see Lemma 1). If $z_S \ne z_{S'}$ are from our packing, then by construction they differ in at least $(2 - 2\rho)s$ components, so they can agree on at most $\rho s$ components. Applying inequality (i) from Lemma 1 to bound the relevant entries of $\Sigma^2$, we can continue from above to obtain the desired bound; the last inequality follows from our earlier choices $a := 1 - 10^{-5}$ and $\rho := 10^{-4}$. Dividing both sides by $\|\Delta_S\|_2 \|\Delta_{S'}\|_2 \ge \delta^2$ (where this inequality follows from Lemma 2), we can continue from our earlier step (31) to conclude. Putting together the pieces, we have exhibited the claimed packing of $\mathcal{E}_{\theta^*} \cap B(\delta)$ with log cardinality of order $k^*$ and packing radius of order $\delta$.

C Proof of Corollary 3
We divide our proof into two parts, corresponding to the upper and lower bounds respectively.
Upper bound: Let us start with the proof of the upper bound. Under the regularity assumption, we may apply Proposition 1 to bound the mean-squared error $\mathbb{E}_{\theta^*}\|\hat\theta - \theta^*\|_2^2$ of the LSE; in particular, it is upper bounded by $\delta_*^2(\theta^*)$ up to a universal constant. (Recall that $\delta_*(\theta^*)$ is the solution to the fixed point equation (17).) In order to arrive at the desired minimax upper bound, we need to show that the function $\theta^* \mapsto \delta_*(\theta^*)$ is maximized at $\theta^* = 0$. Since $k^*$ is a non-increasing function of $\delta$ (see the paper [30, Sec. D.1]), a larger $k^*(\theta^*)$ corresponds to a larger value of $\delta_*(\theta^*)$. These two quantities are related via the equation $\delta_*(\theta^*) = c\,\sigma\sqrt{k^*(\theta^*, \delta_*)}$.
The following lemma bounds the supremum of $k^*$. Lemma 4. The critical dimension at any $\theta^*$ can be controlled in terms of its value at $\theta^* = 0$. The proof of this lemma is given in Appendix F.2. Note that it implies the claimed upper bound (21b).
Lower bound: By definition, the minimax risk decreases when the supremum is taken over a smaller subset. In order to establish the lower bound, we restrict the supremum to a ball around zero. Recall our calculations from Example 4, where we computed the Kolmogorov width of a local ball around $\theta^* = 0$. The corresponding critical dimension is given by $k^*(0, \delta) = \arg\min\{k \in \{1, \ldots, d\} \,:\, \sqrt{\mu_{k+1}} \le \frac{9}{10}\delta\}$. We also have the lower bound $\sqrt{\mu_{k^*(0,\delta)}} \ge \frac{9}{10}\delta$ for every $\delta \le \sqrt{\mu_1}$. Note that the ellipse $\mathcal{E}$ always contains a $k$-dimensional ball centered at zero with radius $\sqrt{\mu_k}$. Combined with the bounds just stated, for every $\delta \in (0, \sqrt{\mu_1}]$, the ellipse also contains a ball of radius $\frac{9}{10}\delta$ centered at zero of dimension $k^*(0, \delta)$. Now we are ready to control the minimax risk, by restricting to this embedded ball, where we recall that $E_m$ denotes the subspace of $d$-dimensional vectors whose last $d - m$ coordinates are all equal to zero.
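Both facts invoked here are straightforward to check numerically for a concrete ellipse. The snippet below (our own illustration) uses $\mu_j = j^{-2\alpha}$ with $\alpha = 1$, computes $k^*(0, \delta)$ directly from the displayed definition, and verifies that a point of the $k$-dimensional sphere of radius $\sqrt{\mu_k}$ lies in $\mathcal{E}$:

```python
import numpy as np

d, alpha = 1000, 1.0
mu = np.arange(1, d + 1, dtype=float) ** (-2 * alpha)

def k_star_zero(delta):
    """k*(0, delta): smallest k with sqrt(mu_{k+1}) <= (9/10) delta."""
    for k in range(1, d):
        if np.sqrt(mu[k]) <= 0.9 * delta:   # mu[k] is mu_{k+1} (0-indexed)
            return k
    return d

# The ellipse contains the k-dim ball of radius sqrt(mu_k) centered at zero:
# if ||v||_2 <= sqrt(mu_k) and v is supported on the first k coordinates, then
# sum_j v_j^2 / mu_j <= ||v||_2^2 / mu_k <= 1, because mu_1 >= ... >= mu_k.
k = 50
v = np.zeros(d)
v[:k] = np.sqrt(mu[k - 1] / k)              # a point with ||v||_2 = sqrt(mu_k)
print(np.sum(v**2 / mu) <= 1.0)             # True: v lies in the ellipse

for delta in [0.3, 0.1, 0.03]:
    # k*(0, delta) grows like delta^(-1/alpha), matching the polynomial rate
    print(delta, k_star_zero(delta))
```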

D Proof of Proposition 1
This appendix is devoted to the proof of Proposition 1.

D.1 Reduction to bounding localized Gaussian width
Chatterjee [5] provided one way of obtaining upper and lower bounds on the error $\|\hat\theta - \theta^*\|_2$ of the least-squares estimator for a general convex set, under the Gaussian sequence model (2). Define the function $g$ as in equation (34); it can be shown to be strongly convex on $(0, \infty)$, with a unique minimizer $\delta_0 > 0$. Then a high-probability bound holds for any $t > 0$. Furthermore, there is a universal constant $C > 0$ such that a corresponding bound in expectation holds. In particular, if we take $t = c\sqrt{\delta_0}$, it is guaranteed that the error concentrates around $\delta_0$. The following simple lemma shows how sandwiching $g$ between two functions allows us to obtain upper and lower bounds for its minimizer $\delta_0$.
Lemma 5. Suppose that there are functions $g_\ell, g_u$ such that $g_\ell(\delta) \le g(\delta) \le g_u(\delta)$ for all $\delta \in [0, \infty)$. Then for any $r \ge \inf_{\delta \ge 0} g_u(\delta)$, the minimizer $\delta_0$ lies in the sub-level set $\{\delta : g_\ell(\delta) \le r\}$. In particular, if $g_\ell$ is unimodal, then this sub-level set is an interval.
The proof of this lemma is simple. For a given $r \ge \inf_{\delta \ge 0} g_u(\delta)$, we have $g_\ell(\delta_0) \overset{(i)}{\le} g(\delta_0) \overset{(ii)}{=} \inf_{\delta \ge 0} g(\delta) \overset{(iii)}{\le} \inf_{\delta \ge 0} g_u(\delta) \le r$, where inequalities (i) and (iii) follow from the assumed sandwich relation, and equality (ii) follows from the fact that $\delta_0$ is the minimizer of $g$.

Figure 5: Visualization of Lemma 5 when $r = \inf_{\delta \ge 0} g_u(\delta)$ and $g$ is convex.
Lemma 5 and the bound (36) together show that bounds on the localized Gaussian width that appears in the definition (34) of g can be used to obtain high probability upper and lower bounds on the error of the LSE.
We remark that estimation over $\mathcal{E}(R)$ for $R > 0$ reduces to the case $R = 1$ by rescaling. Let $\mathcal{E}(R)_{\theta^*} := \{\theta - \theta^* : \theta \in \mathcal{E}(R)\}$ denote the re-centered ellipse. Note that $g$ can be rewritten as a rescaled function $\tilde g$ after the changes of variables $\tilde\delta := \delta/R$, $\tilde\theta^* := \theta^*/R$, and $\tilde\sigma := \sigma/R$. One can then focus on bounding $\tilde g$, and ultimately rescale by $R$ any bounds obtained for the minimizer of $\tilde g$, in order to obtain bounds for the original minimizer $\delta_0$.
Since the function $g$ is convex in $\delta$, there are two solutions $\delta_\ell \le \delta_u$ to the equation (38), and Lemma 5 guarantees that $\delta_\ell \le \delta_0 \le \delta_u$. Moreover, we show below that $c_1 \delta_* \le \delta_0 \le c_2 \delta_*$. Taking this inequality to be true for the moment, combining it with equation (36) yields the claimed bounds, which concludes the proof. Note that we arrive at the expectation bounds (19) by simply applying the earlier result (35).
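To make the fixed point concrete, the following sketch (our own illustration) solves $\delta = \sigma\sqrt{k^*(\delta)}$ by bisection in the $\theta^* = 0$ case, with $k^*(\delta)$ the zero-centered critical dimension of a polynomial ellipse $\mu_j = j^{-2\alpha}$ and the leading constant set to one. The gap $\delta - \sigma\sqrt{k^*(\delta)}$ is increasing in $\delta$ (the identity increases while $k^*$ is non-increasing), so the crossing point is unique.

```python
import numpy as np

d, alpha = 2000, 1.0
sqrt_mu = np.arange(1, d + 1, dtype=float) ** (-alpha)

def k_star(delta):
    """Zero-centered critical dimension: smallest k with sqrt(mu_{k+1}) <= 0.9*delta."""
    for k in range(1, d):
        if sqrt_mu[k] <= 0.9 * delta:       # sqrt_mu[k] is sqrt(mu_{k+1})
            return k
    return d

def fixed_point(sigma, iters=80):
    """Bisection for delta solving delta = sigma * sqrt(k_star(delta))."""
    lo, hi = 1e-8, 10.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid < sigma * np.sqrt(k_star(mid)):
            lo = mid                        # still below the fixed point
        else:
            hi = mid
    return 0.5 * (lo + hi)

sigmas = np.array([0.1, 0.05, 0.02, 0.01, 0.005])
deltas = np.array([fixed_point(s) for s in sigmas])
slope = np.polyfit(np.log(sigmas), np.log(deltas), 1)[0]
print(f"empirical exponent {slope:.3f}, theory {2 * alpha / (2 * alpha + 1):.3f}")
```

The fitted exponent recovers $\delta_* \asymp \sigma^{2\alpha/(2\alpha+1)}$, the global rate, up to the discretization of $k^*$.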
It remains to show that $c_1 \delta_* \le \delta_\ell$ and $\delta_u \le c_2 \delta_*$. After some manipulation using the fixed point equation (17), equation (38) can be rewritten in an equivalent form. Note that the solutions $\delta$ of equation (38) must satisfy $c_u^2 k^*(\delta) \ge c_\ell^2 k^*(\delta_*)$, as required for the right-hand side to be non-negative. In addition, they must satisfy one of the two equations (40a) and (40b). Note that any solution $\delta'$ of the first equation (40a) is larger than any solution $\delta''$ of the second equation (40b). Indeed, we have $\delta'' = h_-(\delta'') < h_+(\delta'')$, so the non-increasing nature of $h_+$ guarantees that the solution $\delta'$ of the equation $\delta = h_+(\delta)$ must be larger than $\delta''$.
• We first consider the solution $\delta'$ of the first equation (40a). It is easy to check the required bounds. Recall that $k^*(\delta)$ is non-increasing in $\delta$. We know that $\delta'$ is smaller than the solution of $\delta = 2\sigma c_u \sqrt{k^*(\delta)}$, which in turn is smaller than $c_2 \delta_*$ (by assumption (c) of Proposition 1). We thus have $\delta_* \le \delta' \le c_2 \delta_*$.
• Next we consider the solution $\delta''$ of the second equation (40b). We claim that $\delta'' \ge c_1 \delta_*$. In order to show this, we prove the two inequalities (41), which lower bound $h_-$ at $c_1\delta_*$ for some $c_1 \in (0, 1)$ and control $h_-(\delta_*)$. Taking the above inequalities as given for now, we can combine them with the fact that $h_-(\delta)$ is a non-decreasing function of $\delta$ to conclude that the fixed point solution $\delta''$ of (40b) satisfies $c_1 \delta_* \le \delta'' \le \delta_*$.
Putting these two pieces together with inequality (39), we conclude the proof of Proposition 1. It remains to prove the inequalities (41).

E Auxiliary proofs for Theorem 2
In this appendix, we collect the proofs of various auxiliary results that underlie Theorem 2.

E.1 Proof of Lemma 2
The set class $T$ to be demonstrated consists of all $s$-sized subsets of a particular subset $\widetilde{T} \subset \{m_1, \ldots, m_2\}$; the subset $\widetilde{T}$ is constructed to have cardinality at least $\frac{k^*-1}{16}$, so that the set class $T$ has the required cardinality, where the last step follows from inequality (48). Here we take $\eta$ small enough, for instance $10^{-5}$, such that the right-hand side above is greater than $\delta^2$. (We have made these choices of constants for the sake of convenience in the proof, but other choices of these quantities are also possible.) Now we prove that $\theta_S \in \mathcal{E}$ and that inequality (ii) in equation (26) holds, using a probabilistic argument. Recall that $B := \mathrm{diag}(\mu_1^{-1}, \ldots, \mu_d^{-1})$, so that $\|x\|_{\mathcal{E}}^2 = x^\top B x$. For a given subset $S$, we specify a random choice of $z_S$, in which for each $j \in S$ the value $z_{S,j} \in \{-1, +1\}$ is an independent Rademacher variable. Using this random choice of $z_S$, we then let $\theta_S$ be defined as in equation (25), so that it is now a random vector.
E.2 Proof of the auxiliary lemma
Proof of inequality (45): We claim that the processes $\{g_S, S \in T\}$ and $\{\tilde g_S, S \in T\}$ satisfy the Sudakov-Fernique conditions. In order to prove this claim, we need to verify that for all subsets $S, S' \in T$, we have the relation $\mathrm{var}(g_S - g_{S'}) \ge \mathrm{var}(\tilde g_S - \tilde g_{S'})$. On one hand, we have $\mathrm{var}(g_S - g_{S'}) = \mathbb{E}\langle w, UHV^\top(z_S - z_{S'})\rangle^2 = \|UHV^\top(z_S - z_{S'})\|_2^2 = \|HV^\top(z_S - z_{S'})\|_2^2$, where the last step uses the orthonormality of $U$. On the other hand, we have the equality $\mathrm{var}(\tilde g_S - \tilde g_{S'}) = \|D(z_S - z_{S'})\|_2^2$. Consequently, it suffices to show that there exists an orthogonal matrix $V$ for which the semidefinite domination (47) holds, with the factor $\frac{1}{16}$. To see this fact, note that part (e) of Lemma 1 implies that the $m_2$ largest eigenvalues of $H^\top H = V\Sigma^2 V^\top$ are lower bounded by $1 - \frac{m_2}{k^*-1} - \frac{a^2 - 9b^2}{9b^2}$. With the choice of the constants $(a, b)$ specified above (see the paragraph below Lemma 1), it is guaranteed that $\frac{a^2 - 9b^2}{9b^2} \le \frac{1}{4}$. This observation and the definition $m_2 := (k^*-1)/4$ together imply the claim (47), which completes the proof of the lower bound (45).
Proof of inequality (46): Let $\hat z_S$ denote the indicator vector for the support of $z_S$. Defining a third Gaussian process via the variables $\hat g_S := \langle w, D \hat z_S \rangle$, we have $\mathrm{var}(\tilde g_S - \tilde g_{S'}) = \|D(z_S - z_{S'})\|_2^2 \ge \|D(\hat z_S - \hat z_{S'})\|_2^2 = \mathrm{var}(\hat g_S - \hat g_{S'})$.
A second application of the Sudakov-Fernique inequality then yields the claimed bound, where in the last step we recall that each $S$ is supported on the set $\{m_1, \ldots, m_2\}$.

F Proof of auxiliary lemmas
In this appendix, we collect the proofs of various auxiliary lemmas.

F.1 Proof of Lemma 6
Let us first state the lemma used in Section 4.1.2.
By definition of the critical dimension (8), it suffices to show that the Kolmogorov width is upper bounded as claimed, where $a := 1 - \eta$.
We claim that the set $\mathcal{E}_{\theta^*} \cap B(a\delta)$ is contained within the set $2\mathcal{E} \cap B(a\delta)$. Indeed, any $v \in \mathcal{E}_{\theta^*} \cap B(a\delta)$ has Euclidean norm bounded as $\|v\|_2 \le a\delta$ and Hilbert norm bounded as $\|v + \theta^*\|_{\mathcal{E}} \le 1$. The Cauchy-Schwarz inequality further guarantees that $\|v\|_{\mathcal{E}} \le 2$, where the last step follows from the fact that both $\theta^*$ and $v + \theta^*$ lie in the ellipse $\mathcal{E}$. We have thus established the claimed set inclusion.
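The inclusion amounts to the norm bound $\|v\|_{\mathcal{E}} \le \|v + \theta^*\|_{\mathcal{E}} + \|\theta^*\|_{\mathcal{E}} \le 2$. A quick randomized check of this containment (our own illustration, with an arbitrary polynomial-decay ellipse):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
mu = np.arange(1, d + 1, dtype=float) ** -2.0

def enorm(x):
    """Hilbert (ellipse) norm ||x||_E = sqrt(sum_j x_j^2 / mu_j)."""
    return np.sqrt(np.sum(x**2 / mu))

def random_point_in_ellipse():
    x = rng.standard_normal(d)
    return x / max(enorm(x), 1.0)      # rescale into the ellipse if necessary

# If theta and theta* both lie in E, then v = theta - theta* satisfies
# ||v||_E <= ||theta||_E + ||theta*||_E <= 2, i.e. v lies in 2E.
max_norm = max(
    enorm(random_point_in_ellipse() - random_point_in_ellipse())
    for _ in range(1000)
)
print(max_norm)    # never exceeds 2
```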

G Well-definedness of the function Φ
In this appendix, we verify that the function $\Phi$ from equation (12) is well-defined. We again use the shorthand $a := 1 - \eta$. In order to provide intuition, Figure 6 provides an illustration of $\Phi$.