Scale calibration for high-dimensional robust regression

We present a new method for high-dimensional linear regression when a scale parameter of the additive errors is unknown. The proposed estimator is based on a penalized Huber $M$-estimator, for which theoretical results on estimation error have recently been proposed in high-dimensional statistics literature. However, the variance of the error term in the linear model is intricately connected to the optimal parameter used to define the shape of the Huber loss. Our main idea is to use an adaptive technique, based on Lepski's method, to overcome the difficulties in solving a joint nonconvex optimization problem with respect to the location and scale parameters.


Introduction
Robust statistics, in its classical form, is a mature and established field [37,57,32]. Recently, notions from robust statistics such as -contamination and influence functions have surfaced in theoretical computer science and machine learning [20,48]. The use of the Huber loss in place of a squared error loss to encourage robustness has long been adopted in engineering fields, as well [25].
In statistics, a small but growing body of work concerns analyzing highdimensional analogs of classical robust estimators [47,78,56,10,53,23,70,71,27]. The basic premise is that although it is relatively straightforward to devise reasonable high-dimensional estimators, theoretical analysis may become 5934 P. Loh somewhat trickier in high dimensions [2]. Furthermore, special care must be taken when optimizing such objective functions over a high-dimensional space [1].
Our previous work [53] developed a theory for robust high-dimensional linear regression estimators using penalized M -estimation. The main contribution was to show that global optima of 1 -penalized M -estimators enjoy the same rates of convergence as minimizers of the Lasso program, when the M -estimation loss function is convex and has a bounded derivative-without requiring a Gaussian or sub-Gaussian assumption on the additive errors. In fact, we also established that local optima of penalized M -estimators with a nonconvex, boundedderivative loss are statistically consistent within a constant-radius region of the global optimum, and such local optima may be obtained via a two-step process initialized using a global optimum of the 1 -penalized Huber loss.
However, a drawback of Loh [53], as well as other related work on penalized M -estimation [23,71], is that the theoretically optimal choice of the parameter involved in defining the Huber loss depends critically on the scale of the additive errors. This should not be surprising, given that similar complications were recognized in low-dimensional settings for location estimation, when prior knowledge of the scale was unavailable [36]. The "adaptive" methods proposed for low-dimensional robust regression [39,34] are mostly heuristic suggestions involving, e.g., computing the Huber regression estimate over a grid of values and choosing the parameter that minimizes a surrogate for asymptotic variance. Even in low dimensions, a theoretical gap has remained in terms of how to rigorously calibrate the Huber loss function in a finite-sample setting.
In this paper, we introduce a new solution to the problem of adaptively choosing the scale parameter of a robust M -estimator. The key tool is Lepski's method, and the key observation is that whenever the Huber loss parameter is larger than the true scale parameter of the additive errors, it is possible to derive 1 -and 2 -error bounds on the global optimum that increase linearly with the choice of Huber parameter. This allows us to apply Lepski's method to obtain an estimator that behaves comparably well to the oracle estimator. Importantly, our method bypasses the hard optimization problem of jointly estimating the location and scale. We note that Lepski's method could also be invoked in the low-dimensional, unpenalized setting to rigorously obtain robust regression estimators without needing to optimize a nonconvex problem in an ad hoc manner. In addition to relaxing the usual sub-Gaussian distributional assumptions on the additive errors to a finite variance requirement, we also show how to introduce a weight function to downweight leverage points, thus allowing our theory to be applied to a broader range of heavy-tailed covariate distributions, as well.
We further explain how our estimation results can be used to construct confidence intervals for coordinates of the regression vector when the covariates are sub-exponential. Our approach builds directly on recent literature from highdimensional inference [75,43], where confidence regions are derived based on asymptotic normality of one-step corrections of an 1 -penalized M -estimator. However, as the success of these methods relies on suitable nonasymptotic error bounds on the initial estimator, our results on the 1 -penalized Huber estimator fill a gap by providing an appropriate initial estimator which can be used for a wider range of error distributions. One-step estimators themselves originate from classical robust statistics [7], as a method for improving the efficiency of initial (and more computationally tractable) M -estimators. In the same way, whereas the 1 -penalized Huber estimator may suffer from a loss of efficiency-especially when weight functions are introduced to tame the covariate distribution-we show that our proposed one-step estimators enjoy the property of semiparametric efficiency, thus implying optimality of the resulting confidence regions.
Related work: Other proposals for regression with heavy-tailed errors include work by Hsu and Sabato [35], Minsker [58], and Lugosi and Mendelson [55]. However, many of these methods focus on situations where the covariates are well-behaved, and all of them assume knowledge of an upper bound on the error variance. In contrast, our method produces consistent estimators under much milder assumptions on the covariates, and encompasses situations where preliminary scale estimates are notoriously difficult to obtain. Nonetheless, a benefit of the methods introduced in the aforementioned papers is that they can also be shown to be robust in situations where a constant fraction of the data is adversarially contaminated [46,19,51,18,15,67,62,3].
Another important related work is by Chichignoud et al. [17], who suggest an adaptive method for tuning parameter selection in the Lasso based on Lepski's method. However, the main focus in that paper is in obtaining near-optimal bounds on the ∞ -error. Importantly, the objective function still involves a least-squares loss as in the classical Lasso, whereas our objective functions are designed for robust regression and have the corresponding parameter linked to the regularization parameter involved in the 1 -norm.
On the topic of inference, Belloni et al. [6] introduced a different method for constructing confidence intervals in high-dimensional regression settings based on a one-step correction to 1 -penalized M -estimators. Although this approach is somewhat orthogonal to ours, one benefit of Belloni et al. [6] is that the method can be applied to a broader class of M -estimators than ours, since the smoothness conditions on the loss function are not as stringent. On the other hand, our approach has benefits in terms of semiparametric efficiency for estimation of multiple target parameters (cf. Remark 8 in Section 4.3 below).
Finally, we mention another recent proposal for calibrating the tuning parameter in high-dimensional penalized Huber regression [80]. This is a somewhat heuristic method based on iteratively solving the empirical version of a system of equations which, at the population level, has a unique solution equal to the theoretically optimal parameter. We end by noting that although several alternative tuning-free approaches for high-dimensional regression have been proposed, e.g., based on penalized quantile regression [11,79,77,22], the square root Lasso [5], or the Wilcoxon loss from nonparametric statistics [81], to the best of our knowledge, these alternative approaches also require stronger assumptions on the covariate distributions than we impose in our paper. It is unclear whether the analysis in these papers could be extended to settings where 5936 P. Loh weights are introduced to dampen the effect of outliers.
Notation: For a vector v ∈ R p , we write supp(v) ⊆ {1, . . . , p} to denote the support of v, and for an arbitrary subset S ⊆ {1, . . . , p}, we write v S ∈ R S to denote the vector v restricted to S. For a matrix M , we write |||M ||| q to denote the q -operator norm, and we write M max to denote the elementwise ∞ -norm. We write vec(M ) to denote the vectorized version of the matrix. Let R + denote the positive reals.
We use the notation c, C , c 0 , etc., to denote universal positive constants, where we may use the same notation to refer to different constants as we move between results. We use the abbreviation "w.h.p." to refer to an event occurring with probability tending to 1 as the problem parameters n, p → ∞. We use the standard big-O notation, so that two functions f (n) and g(n) satisfy f = O(g) if there exist a constant C and an integer n 0 such that f (n) ≤ Cg(n) for all n ≥ n 0 . We also write f g, and we define f g (equivalently, f = Ω(g)) analogously. Finally, for sequences of random variables {X n } and {Y n }, we write X n = O P (Y n ) to denote boundedness in probability, i.e., for any > 0, there exist a constant B and an integer n such that P Xn Yn > B < for all n ≥ n . We write X n = o P (Y n ) to mean that Xn Yn P → 1. We write f (n) = polylog(n) when f (n) = g(log n), for some polynomial function g.

Background and problem setup
We begin by describing the regression model to be studied in our paper. We also discuss several previously existing proposals in the literature.

Model and assumptions
Consider observations {(x i , y i )} n i=1 from the linear model where β * ∈ R p is the unknown regression parameter vector. We will also assume that β * 0 ≤ k, where k < n p, and denote S := supp(β * ). We will work in a random design setting, where the x i 's and i 's are i.i.d. draws from covariate and error distributions, such that Our results could be adapted to the fixed design setting in a fairly straightforward manner; however, we are primarily interested in a setting where the distribution of the covariates is heavy-tailed, leading to high-leverage points. We will denote the covariance matrix of the x i 's by Σ x . We will also assume that Turning to the error distribution, we will assume throughout the paper that σ * := Var( i ) is finite. We will assume that the distribution of i is symmetric, as is customary in classical robust statistics to ensure consistency of regression M -estimators. Note, however, that this is not a major limitation of our work-we could first postprocess the data to obtain the transformed dataset and then run the regression algorithm on these points.
We will introduce additional assumptions on the distributions of the i 's and x i 's in Assumptions 1, 2, and 3 later. Recall the following standard definitions of sub-Gaussian and sub-exponential distributions [76], which will be used in the sequel: for all t ≥ 0. We say that a random vector X ∈ R p is sub-Gaussian with parameter σ if v T X is a sub-Gaussian random variable with parameter σ, for any unit vector v ∈ R p .

Definition 2.
We say that a random variable X is sub-exponential with param- for all t ≥ 0.

Previous work
We now briefly describe several previously proposed methods for robust linear regression in high dimensions. We focus on methods that have been devised to handle outliers in the covariates, since our proposed algorithm is provably consistent when the covariate distribution is heavy-tailed, as well. (For additional related work, see the references cited in the introduction.) The sparse least trimmed squares (LTS) estimator [1] aims to optimize the objective in ascending order, and h ≤ n is a truncation parameter. This is an 1 -penalized version of the least trimmed squares estimator [65]. Although sparse LTS has been shown to perform well in simulations, only a heuristic algorithm has been proposed for optimizing the objective, and statistical guarantees for both global and local optima are absent from the literature.

P. Loh
where r > 0, and s(r(β)) is a robust scale estimator based on the residuals where ρ is a robust loss function and β 1 is an initial estimate of β * ; for r = 1, this method is also known as the MM-Lasso. Smucler and Yohai [70] derived the asymptotic consistency of global optima when the loss function ρ is of a redescending type, meaning that ρ is eventually equal to 0. However, the results are asymptotic, and again, no guarantees are provided for the performance of local optima, which may result from the optimization algorithm proposed by the authors. Penalized S-estimators are further analyzed in Freue et al. [27].
Our work builds upon Loh [53], which studied local and global optima of penalized M -estimators. The main contribution in that work is a rigorous nonasymptotic analysis of global optima in the convex case, as well as an analysis of certain consistent local optima when the objective function is nonconvex. However, the success of the methods proposed in that paper require the parameter of the Huber loss to be chosen correctly, i.e., upper-bounding an expression involving moments and tails of the error distribution. Since this information would generally be unknown a priori, the question of how to choose the Huber parameter in an adaptive manner remained unanswered.
Finally, we mention methods based on joint estimation of location and scale. One natural approach is to jointly minimize the objective function (or a high-dimensional analog thereof). However, even when the loss function is convex, this leads to a highly nonconvex objective. Iteratively optimizing with respect to β and σ motivates the MM-estimator [83], but theoretical guarantees in terms of both statistical consistency and convergence of the optimization algorithm are largely absent from the literature. Huber [37] also proposed the concomitant estimator: where a is an appropriate constant to ensure Fisher consistency. The key insight is that if is a convex function, the loss function L n (β, σ) appearing in the objective (2.2) is also jointly convex in (β, σ). However, the choice of the correct constant a to provide consistency is somewhat intricate. A small calculation shows that if we denote L(β, Scale calibration for high-dimensional robust regression   5939 holds. Thus, some prior knowledge of the distribution of i is required to choose a appropriately. In contrast, our method results in a consistent estimate of β * whenever i has a symmetric distribution. Another important issue is that if is nonconvex-as is recommended to deal with high-leverage points in the covariates-Huber's estimator (2.2) would no longer be jointly convex, leading to a more tricky analysis of local optima in the (β, σ) parameter space.

Adaptive scale estimation
Consider the Huber loss function defined with respect to a parameter τ > 0. Importantly, the Huber loss is differentiable, and τ ∞ ≤ τ . We also define a weight function w : R p → R + , with characteristics which will be described later. We will study the behavior of the 1 -regularized Huber estimator The idea of downweighting individual terms as a function of the covariates is a classical idea from robust linear regression, where various authors studied M -estimators of the form 1 Hampel [32,Chapter 6.3] and the references cited therein). The motivation for introducing weights in classical settings was to guarantee infinitesimal robustness of the regression estimator by ensuring that the influence function stayed bounded even when the covariates were contaminated. Although our choice to introduce weights only within the individual arguments of the loss function terms does not exactly coincide with the more popular framework of Mallows or Schweppe weights from classical robust statistics, one should keep in mind that the central study in our analysis is somewhat different (concerning robustness to heavy tails in the covariate distribution, rather than a study of influence or other notions of sensitivity). Nonetheless, the idea of downweighting individual arguments can also be found in the paper by Krasker [49]. See Remark 1 for more connections between suitable choices of weight functions for our theory to hold and classical choices of weight functions from robust statistics.
For our theory, we will assume that the weight function satisfies the following properties:

P. Loh
Note that the conditions of Assumption 1 involve both the weight function and the distribution of the x i 's. As noted in Section 3.1 below, when the x i 's are well-behaved (e.g., sub-Gaussian), we may set w ≡ 1, somewhat simplifying the analysis. However, we do not in general assume that the x i 's follow a sub-Gaussian distribution-Assumption 1 can be satisfied by arbitrarily heavytailed distributions, as long as the weight function is chosen appropriately (cf. Example 2 below).
The proof of the following theorem, based on arguments developed in Loh [53], is contained in Appendix B. Recall that σ * denotes the standard deviation of the error distribution, and is assumed to be finite.
with probability at least 1 − c 1 p −c2 , where the c i 's are universal constants.
Importantly, the choice of λ = 2c 0 log p n in Theorem 1 depends only on a universal constant c 0 . This is in contrast to the usual Lasso, which requires the tuning parameter λ to be proportional to the unknown quantity σ * .
We also comment on the requirement that τ ≥ c τ σ * , where c τ is an appropriately defined constant. We will provide a method in the next subsection for adaptively choosing τ without prior knowledge of σ * , with a guarantee that the estimator obtained from our procedure is at least as good as the estimator obtained by taking the theoretically optimal choice τ = c τ σ * . However, suppose momentarily that we are able to set the Huber parameter τ equal to c τ σ * , and consider for the sake of illustration that the i 's are drawn from a mixture distribution (1 − ζ)F + ζG, where F and G are both zero-mean sub-Gaussian distributions with sub-Gaussian parameters σ F ≤ σ G , and ζ is the mixing probability. Standard results on sub-Gaussian distributions imply that the mixture distribution is also sub-Gaussian, with parameter bounded by σ G .
Thus, Lasso theory implies that β Lasso −β * 2 σ G k log p n . On the other hand, the variance of the mixture distribution is a weighted combination of the variances of F and G, hence is bounded by a constant multiple of (1 − ζ)σ 2 F + ζσ 2 G . If ζ is close to 0, the result of Theorem 1 translates into the 2 -error bound on the 1 -penalized Huber estimator. If σ F σ G , this can lead to significant gains in the estimation error in comparison to the Lasso.

Examples
We now explore the applicability of Theorem 1 in some specific examples. In particular, we will discuss combinations of weight functions and covariate distributions under which the conditions of Assumption 1 are satisfied.
Example 1 (Sub-Gaussian distributions). When the distribution of x i is sub-Gaussian, we can simply choose w ≡ 1; i.e., we do not need to downweight any of the terms in the objective (3.1) in order to obtain the desired error bounds. Indeed, the vanilla form of Huber regression is known to perform well when leverage points are not present. In particular, the distribution of w( will satisfy the desired properties for sufficiently large p, where b > 0 is a constant which does not depend on p. We first verify Assumption 1(i). Recall that √ p xi xi 2 , which is uniformly distributed on the surface of the sphere of radius √ p, is sub-Gaussian with parameter σ = Θ(1) [76,Theorem 3.4.5]. Hence, for a unit vector v ∈ R p , we have We now study conditions for which giving Assumption 1(iii). Note that If x i ∼ N (0, I), this inequality will certainly hold for any b < 1 for sufficiently large p, since the x ij 's are i.i.d. and the empirical average concentrates. More generally, Guédon and Milman [31] established a similar concentration inequality when x i is a log-concave distribution, with later generalizations to distributions with heavier tails [26].
In the aforementioned cases, the left-hand side of inequality (3.4) actually tends to 0 as p → ∞. However, we only need the expression to be upper-bounded by a constant. Inequality (3.4) can be rewritten as where we recall that E[R 2 ] = p due to the assumed isotropy condition. Theorem 2.9 of Fang et al. [24] provides the density function for R, from which the condition (3.4) can further be verified for sufficiently small b for various classes of distributions. Note that this line of argument allows the random variable R, and consequently also x i , to be arbitrarily heavy-tailed, as long as it possesses a finite second moment.
Finally, note that while Example 2 is stated for spherically symmetric distributions, a similar weight assignment would work if the x i 's were elliptically symmetric, instead, i.e., Bx i is spherically symmetric for a well-conditioned matrix B ∈ R p×p . In this case, we could define the weight function w( and follow nearly identical derivations as above. Thus, in practice, one might choose to define the weights according to w(x i ) = min 1, for B ∈ R p×p . Optimal choices of B have accordingly been derived to satisfy various criteria, e.g., maximum efficiency subject to bounds on the gross-error sensitivity and/or local-shift sensitivity, in which case the choice of parameters is implicitly derived from the desired upper bounds. Note, however, that since we are only interested in obtaining highprobability error bounds of the correct order, we do not need as fine-grained a characterization of the matrix B as in the classical setting. Thus, our discussion in Example 2 above, which specifies that B 1 √ p , is sufficient for our purposes. See also Krasker and Welsch [50] and Huber [38].

Lepski's method
We now discuss Lepski's method [52,9,14,59]. Consider τ min and τ max such that τ min ≤ c τ σ * ≤ τ max . Let τ j = τ min 2 j , and define Note that |J | ≤ log 2 2τmax τmin . Let β (j) denote the output of the regression procedure with τ = τ j , and define Thus, to compute j * , we perform pairwise comparisons of regression estimates obtained over the gridding of the interval [τ min , 2τ max ]. Note that if our goal were simply to obtain 2 -consistency, we could apply Lepski's method where j * is defined only with respect to comparisons involving the 2 -error. However, we will need 1 -error bounds for the one-step derivations later, so we include both deviations in the screening process here. We then have the following result: The proof follows from straightforward algebraic manipulations and is contained in Appendix C.
Note that Lepski's method does not correspond to a standard grid search over τ , which would be more reminiscent of the adaptive robust estimation procedures described in the introduction. Indeed, for each candidate value of τ , we perform a type of guided comparison between different values of τ , rather than simply choosing the value of τ that gives rise to the smallest value of some objective function. Furthermore, the output of a Lepski-type procedure does not necessarily correspond to the β τ arising from the "optimal" choice of τ σ * . Rather, we are guaranteed that the 1 -and 2 -error of our final estimate is comparable to the error of the estimator generated using the optimal parameter. In contrast, the adaptive procedures appearing in robust statistics literature suggest a method for choosing the optimal σ by minimizing an approximation of the variance of the estimator thus produced.

Remark 2.
Note that our algorithm based on Lepski's method requires knowledge of the sparsity level k, which is one drawback of the procedure. An upper bound k would also be sufficient, in which case the comparisons used to determine j * in equation (3.6) would involve k rather than k. On the other hand, the error guarantees would then also be looser.
We would also need to have an explicit value for C in order to apply Lepski's method. As seen from the proof, the constant C appearing in our bounds depends on universal constants; the choice τ of the parameter used for the robust loss function; and distributional properties of x i (i.e., the eigenvalue bounds {c min , c max , c min }). The last point is somewhat unsatisfactory. However, in practical applications, we might imagine having numerous observed values of the x i 's available, from which we might be able to estimate these quantities. Importantly, we emphasize that our proposed method does not require any information about the distribution of the i 's, which we would not be accessible without a good initial estimate of β * in practice.
Although we do not include the derivations here, a similar procedure based on ∞ -error comparisons could be used to obtain an estimator based on Lepski's method with ∞ -error guarantees on the same order as the 1 -penalized Huber estimator with a theoretically optimal parameter. Furthermore, such a procedure would not involve knowledge of the sparsity, since ∞ -error bounds are typically O hand, one would need to impose slightly stronger assumptions in order to derive ∞ -error bounds [53].

Rough scale parameter bounds
Our application of Lepski's method requires specifying choices of τ min and τ max . We now describe how to select these values in a reasonable manner. We assume we have prior knowledge of the constant c τ , which only depends on characteristics of the covariate distribution and not the unknown error distribution. Then it suffices to compute rough bounds [σ min , σ max ] on σ * . By independence, we have Var(y i ) = Var(x T i β * ) + Var( i ). Hence, we have (σ * ) 2 ≤ Var(y i ), and we may select σ 2 max to be a rough estimate of Var(y i ).
Various estimators for population means exist that only involve weak distributional assumptions. For instance, the "median of means" estimator takes as input n i.i.d. observations X 1 , . . . , X n , and then computes the means Assuming the existence of (2 + )-moments of x i and i , and using the concentration inequality provided in Lemma 13 of Appendix H, we have with probability at least 1 − c exp(−c n).
We now turn to the problem of choosing σ min . Consider the choice σ min = σmax 2 M , for some integer M . Let β be the final output of Lepski's method. We have the following result: Theorem 3. Suppose Lepski's method is performed on the 1 -penalized Huber problem with σ 2 max equal to the median-of-means estimator of Var(y i ) and σ min = we have with probability at least 1 − cM p −c .

P. Loh
Note that if M = o(p c ), the bounds (3.8) and (3.9) in Theorem 3 hold w.h.p. If we define the signal-to-noise ratio SN R : , then inequality (3.7) can be rewritten as log 2 (SN R + 1) M , which is a fairly mild assumption. In particular, if λ max (Σ x ) and β * 2 are bounded, then SN R is also bounded and we can even choose M to be a constant. Finally, note that some knowledge of the curvature of the covariate distribution (i.e., maximum eigenvalue of Σ x ) can be helpful in determining the choice of M necessary for inequality (3.7) to be satisfied. Note also that in practice, we would not want M to be too large, since the computational complexity of the algorithm will increase linearly with M .

One-step estimators
Although we have established the consistency of our estimators under rather weak distributional assumptions on the x i 's and i 's, the presence of the weight function w(x) may lead to poor efficiency. Classical theory for regression Mestimators suggests that efficiency might be improved by using a loss which is governed by the specific form of the error density. The theory of M -estimation from classical robust statistics also recommends one-step estimators for improved efficiency [65,45,69,29]. In this section, we address the problem of improving efficiency by studying one-step modifications of the estimators proposed in the previous section. Note that recent results in high-dimensional inference have led to theoretical derivations based on similar types of one-step estimators to those analyzed here.
We begin by presenting the "one-step" adjustment which may be performed on an initial estimate β to obtain a final estimate b with desirable asymptotic normality properties. The statement of our main theorem about asymptotic normality is provided in Section 4.1, where we also discuss conditions on β and additional assumptions to be imposed on the covariate and error distributions in order for the results of the theorem to hold. In particular, the theory from Section 3 shows that the 1 -penalized Huber estimator is a suitable choice for β. In Section 4.2, we expand upon the specific sense in which b is a more efficient estimator than β when the score function ψ is chosen appropriately. In Section 4.3, we provide a method for constructing confidence regions for subsets of regression coefficients based on b, which is a natural corollary of our result on asymptotic normality.
Consider a differentiable score function ψ, and let [7], we then define the one-step estimator where Θ is a suitable estimate of Θ x = Σ −1 x , to be described in the sequel. For the theory in this section, we will change our notation slightly and adopt the language of scale families. Thus, we write random variables from a fixed reference distribution, and σ * ξ is an unknown scale parameter (note that σ * ξ agrees with σ * , the standard deviation of i defined earlier, up to a constant factor).
As suggested in Bickel [7], we use a score function ψ σ of the form ψ σ (t) = 1 σ ψ t σ , and plug in an estimate σ of the scale parameter σ * ξ . Then the one-step estimator (4.1) becomes and the scale estimate σ is obtained from the consistent regression parameter For ease of notation, we will redefine the term A(ψ) to be equal to the expression (4.3), and let A( f , where f is the density of ξ i (assumed to be smooth), will play a prominent role in our analysis. This corresponds to the derivative of the negative log likelihood function. In the case when ξ i ∼ N (0, 1), we then have ψ(t) = t and ψ (t) = 1, in which case formula (4.2) reduces to which is the "debiased Lasso" [75,43,13,41,85]. However, in that line of work, β is always taken to be the output of the usual MLE-based objective, whereas we take β to be a more general robust high-dimensional estimator with guaranteed statistical consistency properties even when the covariate or error distributions are non-sub-Gaussian.
We now discuss how to obtain a suitable estimate Θ of Θ x . Note that Bickel [7] proposes to use Θ = X T X n −1 ; however, when p > n, the matrix X T X n is not invertible. We instead choose Θ to be the graphical Lasso estimator [86,28], obtained by solving the following convex optimization program: The reason for choosing this estimator rather than the simpler sample covariance matrix Σ = X T X n will become clear in the statement of Theorem 4 and proof of Proposition 1 below, which proceed by deriving a high-probability bound of the Although such a bound would hold for the sample covariance matrix if the x i 's were sub-Gaussian, it behooves us to impose such stringent tail assumptions. See also Remark 5 below for an alternative approach involving a reweighted sample covariance matrix and its connection to classical robust regression literature.

Asymptotic normality
We now derive the limiting distribution of the one-step estimator. Our arguments involve Taylor expansions of the function ψ, so for simplicity, we assume that ψ is thrice-differentiable. We also assume that ψ is an odd function, and suppose the derivatives of ψ are bounded: where ψ (s) denotes the s th derivative. Extensions to cases where ψ does not satisfy these smoothness criteria (e.g., corresponding to the Huber loss function) may be derived via more careful arguments, but we omit the details here.
We now present the assumptions we will make on the covariate and error distributions in order to guarantee asymptotic normality of the one-step estimator. Our theorem will be stated assuming that β satisfies a suitable error bound; thus, if we wish to use the Huber estimator for β, we will also need the covariates to satisfy the conditions of Assumption 1 in order to guarantee that the results of Section 3 hold, as well.
We make the following assumptions on the distribution of the covariates: Note that the conditions imposed on the covariates in this section are somewhat stronger than the conditions imposed in Section 3 (cf. Assumption 1), since we no longer include a weight function to temper the effect of heavy tails. Thus, unlike the scenario described in Example 2, Assumption 2 does not permit the covariates to have arbitrarily heavy tails. On the other hand, we actually do not require the full power of sub-exponential tails: As our analysis shows, as long as we have a high-probability bound of the form X max polylog(p) (cf. Lemma 6), the theorems of this section will still continue to hold under a sample size condition of the form n k 2 polylog(p).
We also impose the following assumptions on the additive errors: Note that the conditions appearing in Assumption 3 are fairly mild, e.g., if ψ is bounded, then condition (ii) holds regardless of the tails of i . Furthermore, if the i 's are Gaussian and ψ corresponds to the MLE of ξ i , then ψ is the identity function and condition (ii) is again satisfied. However, on top of the finite variance bound imposed in Section 3, we now assume that the fourth moments of the i 's are finite.
Our main result is the following: Let P J denote the projection onto any set of m = |J| coordinates of fixed dimension. Then the one-step estimator (4.2) satisfies as n, p → ∞.
The proof of Theorem 4 is contained in Appendix D. In particular, the error bounds (4.5) follow directly from the guarantees for the Huber estimator derived in Theorem 3 (under the additional distributional conditions stated in Assumptions 2 and 3).
Altogether, we conclude that the limiting distribution of the high-dimensional estimator, restricted to m coordinates, agrees with the result of Bickel [7] for low-dimensional robust M -estimators.

Remark 3.
Note that the assumption n k 2 polylog(p) is somewhat stronger than the sample size condition n k log p usually required for consistency in statistical estimation, in the sense that n = Ω(k 2 ) as opposed to n = Ω(k) (cf. Theorem 1). However, a similar gap also appears in the analysis of van de Geer et al. [75] and Javanmard and Montanari [43] in the random design setting. As noted by a reviewer, it would be interesting to see whether this sample size requirement could be improved using more refined arguments, but to the best of our knowledge, existing work on the Lasso [44,4] relies heavily on a Gaussian distributional assumption on the covariates and the specific form of the leastsquares objective.
where β is the solution to the 1 -penalized program

8)
and ρ is assumed to be a smooth convex function. Furthermore, Θ ρ is defined to be a sparse approximate inverse of the matrix 1 Although clear similarities exist between the one-step estimator (4.7) and the expression (4.2), with ρ taking the place of ψ, the one-step estimator (4.7) is only guaranteed to be asymptotically normal when standardized appropriately. Furthermore, note that the M -estimator (4.8) is not designed to be robust to contaminated covariates, and in order to obtain appropriate error bounds, much stronger assumptions must be made on the distribution of the x i 's. Importantly, our proposed one-step estimator involves using one loss (the Huber loss) to define the initial estimate β, and then a separate score function ψ, which does not necessarily correspond to a derivative of the Huber loss, to obtain both (a) robustness and (b) efficiency.
Finally, we provide conditions for the inverse covariance matrix estimator Θ to satisfy the error bound (4.6). Suppose Σ x satisfies the α-incoherence condition, defined by max where α ∈ (0, 1], and we denote Γ * := Σ x ⊗ Σ x and S = supp(Θ x ). We also denote κ Σ := |||Σ x ||| 1 and κ Γ := (Γ * SS ) −1 1 . Combining a high-probability deviation bound on Σ−Σ x max (cf. Lemma 11 in Appendix H) with standard derivations for the graphical Lasso [63] yields the following result: Proposition 1. Suppose Assumption 2 holds and n polylog(p). Also suppose Θ x satisfies the α-incoherence condition (4.9) and the regularization parameter satisfies With probability at least 1 − exp(−cn), the graphical Lasso estimator (4.4) computed with respect to the entrywise MoM estimator Σ satisfies supp( Θ) ⊆ supp(Θ x ), and In particular, if each row of Θ x is k-sparse, we also have the bound The proof of Proposition 1 is contained in Appendix E. We see that the final conclusion of the lemma, with λ log p n , furnishes the deviation bound (4.6). Note that simply applying Theorem 1 in Ravikumar et al. [63] would produce a weaker result than we want, since the concentration result in Lemma 11 would fall into the category of "polynomial-type tails," thus yielding a suboptimal sample size requirement. Instead, we derive a statistical error guarantee suitable for our setting, building upon some of the key lemmas in Ravikumar et al. [63]. [82] studied higher-order expansions of various one-step estimators using different proposals for the Hessian term, and pointed out that the critical characteristic for equivalence of first-order terms is a certain bound on the rate of convergence of the Hessian to its expectation. Our estimator (4.2) is most closely related to the "method of scoring" one-step estimator discussed in Welsch and Ronchetti [82]. However, the direct analog of that estimator would involve inverting the matrix

Semiparametric efficiency
To make the notions of increased efficiency more precise, we now analyze the one-step estimator b ψ from the point of view of semiparametric efficiency. In particular, consider the semiparametric regression model where the distribution of the i 's is unknown and our goal is to estimate the unknown vector β 0 from i.i.d. observations Recall the notion of semiparametric efficiency: √ n( β − β 0 ) is asymptotically normal), and the asymptotic variance is minimal among all regular estimates of β 0 .
Additional background material is included in Appendix A. In particular, Theorem 7 states that a lower bound on the variance of any semiparametrically efficient estimator is given by where f denotes the density of i . For a fixed set of indices J ⊆ {1, . . . , p}, we partition the linear model as consider it as a subclass of the semiparametric regression model Theorem 5 shows that just as in classical asymptotic theory for M -estimators, a one-step correction with ψ function equal to the (negative) derivative of the log likelihood will yield an estimator with the same asymptotic properties as the maximum likelihood estimator. However, a benefit of using the one-step estimator b ψ rather than directly using the maximum likelihood estimator is that the latter may be difficult to compute, especially when the negative log likelihood is nonconvex and/or the scale parameter of the error distribution is unknown. Our theory shows that using the Huber estimator β for initialization sidesteps both of these potential issues, since the Huber loss is convex and our procedure via Lepski's method adapts to the scale.

Remark 6.
The notions of efficiency we have just described should be contrasted with the discussion of efficiency contained in Loh [53]. Importantly, our present results do not require any conditions for correct support recovery of the regression estimator, which were rather strong requirements imposed in the theory of the aforementioned paper. Furthermore, by using a one-step estimator, we do not require a second subgradient optimization routine performed on a nonconvex objective function in order to achieve efficiency, since a one-step modification of the global optimum of the convex surrogate is sufficient for our purposes.
Finally, we note that another notion of semiparametric efficiency was recently studied in Jankova and van de Geer [40], involving a more complicated infinitedimensional model that is allowed to change with n. It was shown that when Θ x is a sparse matrix, the same bounds may be established for semiparametric efficiency; however, van de Geer [73] showed that without the sparsity condition, the variance of an efficient estimator may in fact be lower. We suspect that these notions could also be adapted to the setting of robust regression estimators discussed in our paper, but such derivations are beyond the scope of our present work.

Confidence intervals
Our results from Section 4.1 naturally allow us to derive confidence intervals with the correct asymptotic coverage, which we briefly describe here. Further-more, the semiparametric efficiency result of Section 4.2 provides a type of "optimality" guarantee for the size of the confidence region. We again consider a fixed subset J ⊆ {1, . . . , p}, where |J| = m.
For an error probability α ∈ (0, 1), we write B α,J to denote the subset of R J corresponding to the direct product of m intervals of the form where Φ is the cdf of a standard normal random variable. In particular, if Z ∼ N (0, I m ) is an m-dimensional Gaussian random vector with i.i.d. standard normal components, we have We have the following main result, proved in Appendix G. We impose one additional condition involving the boundedness of (ψ 2 ) in order to facilitate our derivations.

Theorem 6. Let |J| = m be a fixed set of constant cardinality. In addition to the assumptions of Theorem 4, suppose (ψ 2 ) ∞ < ∞. An asymptotically valid (1 − α)-confidence region for the projection β * J of the regression vector onto J is given by
Note that the region (4.12) is a (pointwise) linear transformation of B α,J .
In the case m = 1, the confidence region for a fixed coordinate j reduces to the interval Note that as in Javanmard and Montanari [43], the set B α,J could be replaced with any other set of measure 1 − α under an m-dimensional standard normal distribution. Note that Theorem 6 is a result that holds for any choice of score functions ψ, not necessarily corresponding to the score function of the true pdf. Importantly, we can construct valid confidence intervals without needing to know the true distribution of the i 's. However, in order to construct optimal intervals, we would need to use the correct ψ function corresponding to the distribution.

Remark 7.
As mentioned in Remark 4, our recipe for constructing confidence intervals resembles the proposal of van de Geer et al. [75]. However, the key 5954 P. Loh difference is that the vanilla Lasso estimator would in general not achieve the correct rates of consistency in order for the confidence intervals to be asymptotically valid for the prescribed sample size scaling. Similarly, Javanmard and Montanari [43] include a section in their paper discussing how to construct confidence intervals in the case of non-Gaussian noise; however, again, they assume that the noise and covariance distributions are sufficiently well-behaved to guarantee fast convergence of the initial Lasso estimator. A way to correct this would be to use the Huber estimator as an initial estimator rather than the Lasso; see the simulations at the end of Section 5.2 for additional discussion and an empirical comparison.
Finally, it is worth discussing the relationship between our proposed method and the robust inference procedures studied in classical robust statistics. These include robust Wald-type and likelihood ratio tests [64,32], which are more generally applicable to hypothesis testing scenarios involving linear combinations of predictors. Our method resembles Wald-type tests in the sense that they are constructed with respect to a robust M -estimator, and also include robust estimates of the (inverse) covariance-however, our results are primarily designed for hypothesis testing of single coordinates. It is an interesting open question to see if analogs of the robust Wald-type or τ -tests [64] could be derived in the high-dimensional setting. It is plausible that such tests exist using an initial M -estimator such as the regression estimator introduced in this paper (cf. van de Geer and Stucky [74] and Sur et al. [72] for some theory in the non-robust setting).

Remark 8.
As pointed out by a reviewer, an alternative approach proposed by Belloni et al. [6] does not require incoherence assumptions (4.9) on the inverse covariance matrix, which are required to ensure the validity of our method. However, since the method of Belloni et al. [6] is a coordinatewise approach, it leads to confidence regions which are direct products of confidence intervals for individual components. This misses out on the optimality property of our confidence regions which is derived from the semiparametric efficiency of our regression estimator; note that in general, the confidence regions constructed in equation (4.12) may correspond to affine transformations of cuboids which are not direct products of intervals.

Simulations
We now report the result of experiments that we performed to validate our theoretical predictions.

Summary of procedure
We first briefly summarize the steps of the robust regression procedure: Composite gradient descent: In order to obtain the estimators β τ in the second step above, we employ the composite gradient descent algorithm, which has fast rates of convergence for convex functions [60]. Specifically, the updates are where S λτ /η (β) is the soft-thresholding operator defined componentwise according to Note also that

Synthetic data
We first ran experiments involving synthetic data to check the validity of our theory. The simulation results confirm that our estimator is (a) consistent and (b) efficient. For (a), we provide simulation results under two different scenarios: (i) Additive errors are drawn from a heavy-tailed distribution, but covariates have a sub-Gaussian distribution. (ii) Both x i 's and i 's are drawn from heavy-tailed distributions.
In case (i), we generated the x i 's from a standard normal distribution. The i 's were generated from a t-distribution with five degrees of freedom, to make the fourth moment finite (recall that moments of order five and above do not exist). Independently, with probability 0.1, we then multiplied each i by 10 to simulate further heavy-tailed contamination. Finally, we scaled the additive errors by 0.1. In case (ii), we generated the i 's in the same manner as in (i), but generated the coordinates of the x i 's independently from a Laplace distribution with mean 0 and scale parameter 1. Note that in this case, the marginals of the x i 's are sub-exponential.

P. Loh
Further implementation details are as follows: We set the error tolerance δ = 0.05 for the MoM estimator, and took σ min = σmax 2 2n 1/3 and (c τ , C) = (1, 20) for the Lepski gridding. We defined the weight function according to the expression (3.2), using b = 1 for the simulations in (i) and a range of values for the simulations in (ii). We defined the regularization parameter to be λ = 0.5 log p n for the simulations in (ii) and λ = C λ log p n for a range of values for C λ for the simulations in (i). We chose the problem dimensions to be p = 100 and k = 4, and defined β * to have 1's in the first four components and 0's everywhere else. Figure 1(a) shows the 2 -error of the adaptively tuned Huber estimator in setting (i), using a range of λ values. Figure 1(b) shows the 2 -error of the adaptively tuned Huber estimator in setting (ii), using a range of λ values. For comparison, we also include error curves for the vanilla Lasso, where the tuning parameter was chosen using 10-fold cross-validation. As expected, the error of both the Huber and Lasso estimators appears to decrease to zero with n. However, the Huber estimator tends to perform better than the Lasso, and the gap becomes more noticeable when both the covariates and errors are heavytailed. The precise values of C λ and b do not seem to affect the performance of the Huber estimator too heavily. In order to explore (b) the relative efficiency of the Huber estimator in comparison to its one-step correction, we borrowed some implementation details from the settings described in (a). We generated the coordinates of the x i 's from a Laplace distribution with mean 0 and scale parameter 1. We generated the i 's from a t-distribution with five degrees of freedom, scaled by 0.1. We set the error tolerance δ = 0.05 for the MoM estimator, and took σ min = σmax 2 2n 1/3 and (c τ , C) = (1, 20) for the Lepski gridding. We defined the weight function according to the expression (3.2) with b = 1, and we defined the regularization parameter to be λ = 0.5 log p n . We chose the problem dimensions to be p = 100 and k = 4, and defined β * to have 1's in the first four components and 0's everywhere else.
For the one-step estimator, we use the formulas in equation (4.2) to define A and σ. Recall that the pdf of a t-distribution with ν degrees of freedom is equal to Then we may compute the one-step estimator will always be smaller than the 2 -error of the initial estimator, and only guarantees that the error will decrease at the same rate, up to constant factors. However, in the plot in (b), we can clearly see that the empirical variance of the estimates of all four of the nonzero coefficients of β * indeed appears to decrease after the one-step correction, corroborating our theoretical conclusions.
Finally, we provide a set of simulation results illustrating the validity of our method for constructing confidence intervals described in Section 4.3. Figure 3 shows the result of 100 confidence intervals constructed using our procedure when the coordinates of the x i 's are drawn i.i.d. from a Laplace distribution with mean 0 and scale parameter 1, and the i 's are generated from a t-distribution with five degrees of freedom, scaled by 0.5.
For comparison, we also constructed confidence intervals according to the method suggested by van de Geer et al. [75] and Javanmard and Montanari [43], which essentially corresponds to our one-step procedure with score function ψ ≡ 1 (corresponding to the MLE for Gaussian errors). Furthermore, we set the initial estimator β to be equal to the Huber estimator rather than the Lasso, since the Lasso estimator has slower rates of convergence under heavytailed covariates and/or error distributions; we take the estimate of variance used in those formulas to be the empirical variance of the residuals computed with respect to β. We observe that the empirical coverage of the confidence intervals constructed according to our procedure is similar to that of the method using a Gaussian one-step correction: In comparison to the coverage percentages reported in Figure 3, the coverage levels for confidence intervals constructed using a Gaussian correction were (a) 78%, (b) 81%, and (c) 73%. On the other hand, the confidence intervals were on average shorter when using the one-step estimator with score function ψ corresponding to the t-distribution: The average lengths of confidence intervals for t-distribution To check for consistency as n → ∞, we ran the same confidence interval experiment with p = 10 and n = 500. The results, averaged over 100 trials, are tabulated in Figure 4. Here, t denotes confidence intervals computed with respect to the t-distribution score function, and z denotes confidence intervals computed with respect to the Gaussian one-step correction. We see that the empirical coverage percentages for both methods are roughly equal to 90%, whereas the average lengths of intervals computed using the t-distribution score function are generally slightly smaller. However, the difference between the average lengths of confidence intervals for the two methods vanishes as the number of degrees of freedom ν increases, since the t-distribution tends toward the standard normal.

Real data experiment
Turning to a real dataset, we analyzed a dataset collected from X-ray microanalysis of archaeological glass vessels [42], which has been analyzed in several other papers on high-dimensional robust linear regression with leverage points [56,70]. The dataset consists of n = 180 observations and p = 486 frequencies, which we used as predictors for the contents of compound 13, which is PbO. As discussed in [56], the dataset contains clear outliers.
Following the method of Smucler and Yohai [70] for tuning parameter selection, we chose the parameter λ in our algorithm via 5-fold cross-validation using a τ -scale of the residuals [84,66]. (Note that our theorems are stated with λ equal to log p n times universal constants, but in practice, choosing λ in a datadriven manner leads to better predictive performance.) Based on this procedure, Lepski's method yielded a sparse vector with six nonzero components. This fit corresponds to the value 0.134 of the τ -scale, which is comparable to the values reported in Smucler and Yohai [70] using alternative methods: MM-Lasso (τscale of 0.086, seven selected variables), adaptive MM-Lasso (τ -scale of 0.083, four selected variables), sparse-LTS (τ -scale of 0.329, three selected variables), Lasso (τ -scale of 0.131, seventy selected variables), and adaptive Lasso (τ -scale of 0.138, forty-nine selected variables); note that as is the case for the robust methods advocated in that paper, our method likewise chooses sparser models than the Lasso and adaptive Lasso, making the model easier to interpret, while maintaining good predictive performance.
We also attempted to construct confidence intervals for the selected frequencies. The simulations were inconclusive, due to the fact that various implementations of the graphical Lasso algorithm on the 486 × 486 matrix of covariates failed to converge. We suspect that this is because the assumption that the population-level inverse covariance matrix is sparse is violated, or the covariate distribution is heavy-tailed and/or possesses extreme outliers, so that the rate of convergence of the estimated covariance matrix Σ to Σ x is too slow. This experiment reveals that the additional assumptions required to construct confidence intervals may be somewhat more stringent than the assumptions needed for consistency in terms of estimation or prediction error.

Discussion
Throughout this paper, we have assumed that the variance of i is finite. We now describe a small adaptation that applies to the consistency results of Section 3 when the second moment does not exist. Indeed, one can still define σ * to be a scale parameter of the distribution of i (e.g., in the case of a Cauchy distribution). However, the place where we have required existence of second moments in our analysis is in the computation of the rough scale parameter bounds σ min and σ max .
Instead of the MoM estimator, we may use the median absolute deviation (MAD) as the scale parameter when the second moments are not finite. Recall that the population-level MAD is given by where med denotes the median operator. By Lemma 19 in Appendix H, we know that under the assumption that the distribution of i is symmetric and unimodal, we have so that the MAD estimate based on the y i 's can indeed be used as an upper bound on the scale of the i 's, analogous to the case of the variance. Furthermore, concentration inequalities for the empirical version of the MAD estimator can be found, e.g., in [68].
It is an open question whether the analysis of the one-step estimation results in Section 4 can also be adapted to remove the dependence on finiteness of the variance (and/or higher moments). We also mention an interesting open question of practical relevance: What type of one-step estimator could we use for obtaining a more efficient estimator and/or confidence intervals when the shape of the error distribution is unknown? Some general guidelines for choosing the ψ function in the one-step estimator, or a more principled procedure for flagging outliers and then fitting confidence intervals based on a fitted distribution, would be quite useful in practice.
Finally, an interesting direction to pursue would be whether an approach based on Lepski's method could also be used to adaptively choose the correct parameter for the Huber loss in the case of an -contaminated model (either in location estimation or linear regression). A related question is how to adaptively choose a trimming parameter for the robust location estimator based on trimmed means. These are both questions of theoretical interest that have largely remained open in the classical robust statistics literature-since they depend on minimizing variance quantities, rather than deriving high-probability error bounds, the machinery developed in this paper does not carry over directly. However, it is plausible that an appropriate modification of the Lepski-based approach may result in theoretically valid conclusions for obtaining a near-optimal estimator from the point of view of variance.

Appendix A: Semiparametric efficiency
In this appendix, we review several concepts in semiparametric estimation. For a more detailed overview, we refer the reader to the textbooks by Bickel et al. [8] or Hansen [33].
Following the treatment of Newey [61], we first define the semiparametric regression model [21]: Definition 4. The semiparametric regression model characterized by a parameter vector β 0 ∈ R q and function g 0 is given by where the x i 's and v i 's are vectors of exogenous observations, y i is a scalar response, and i is independent additive error.
Semiparametric efficiency is usually established by obtaining lower bounds on the asymptotic variance of an efficient estimator by considering Cramer-Rao bounds for different parametric "submodels," which are models that include the semiparametric model under consideration and are equal to the semiparametric model for a certain value of the parameter. In particular, the Cramer-Rao bound for any parametric subclass must provide a lower bound for the semiparametric estimation problem, as well, and we have the variance lower bound where V θ is the Cramer-Rao bound corresponding to a parametric submodel indexed by θ. If one can find a parametric submodel with a Cramer-Rao bound that matches the asymptotic variance of a particular semiparametric estimator, that estimator is guaranteed to be efficient. Note that for multidimensional problems, the supremum is taken with respect to the partial order of positive semidefinite matrices (and the supremum is guaranteed to exist under appropriate regularity conditions, which apply in the setting considered here).
Newey [61] presents an approach to compute the variance bound V directly by considering the projection of the score function of the semiparametric model onto the tangent set corresponding to the scores of all parametric submodels, where the score of the semiparametric model is the partial derivative of the negative log likelihood with respect to the parameter vector. Formally, consider a parametric submodel parametrized by θ = (β, η), where both β and η are vectors, and β corresponds to the q-dimensional parametric part of the original semiparametric model. The overall score function may be partitioned as S θ = (S β , S η ). By block matrix inversion, we may verify that the Cramer-Rao bound for estimation of β in the parametric submodel is then given by In particular, BS η is the best linear predictor of S β as a function of S η . We now define the tangent set to be the mean square closure of all qdimensional linear combinations of scores of parametric submodels: , where the A j 's are matrices with q rows and the S θj 's are the score vectors of various parametric submodels.
We have the following result, which holds generally for semiparametric estimation (not just in the case of the semiparametric regression model): [61,Theorem 3.2]] Suppose T is a linear space, and let S T β denote the projection of S β on T . Then provided the matrix is nonsingular.
For the model (A.1), we denote a parametrization of g 0 (v) as g(v, η), where η is a parameter such that g(v, η 0 ) = g 0 (v). Then the log likelihood may be written as where f is the density of i . Taking partial derivatives and evaluating at the true parameter values (β 0 , η 0 ), we obtain the score functions . It is not hard to verify that the tangent set is equal to using the observation that the parametric submodel with g(v, η) . Furthermore, T is clearly a linear space.
In order to compute S T β , we use the following result: Lemma 2. [Newey [61,Lemma 3.4]] If UW has finite second moment and V and W are functions of some random variable T , such that E[UU T | T ] is constant and positive definite, then the projection of UW on the space Applying Lemma 2 with W = x, V = v, and U = f ( ) f ( ) , we conclude that Combining this with Lemma 1, we arrive at the following result: . Suppose x has finite second moments and Then provided the matrix is nonsingular.

Appendix B: Proof of Theorem 1
We begin by analyzing the estimator where we have introduced a side constraint defined in terms of a parameter r to be specified later. We will show that such optima β τ lie in the interior of the constraint set, hence agree with the global optima β τ of the unconstrained problem.

We have the basic inequality
Hence, where the first inequality is due to the convexity of L n . Therefore, we have Denoting ν = β τ − β * and using the bound (B.1), we then have which is the cone condition. Therefore, the RSC condition (B.2) together with the basic inequality (B.3) implies that so combining with inequalities (B.5) and (B.1), we have implying that Rewriting the bound (B.7), we conclude that with probability at least 1 − c exp(−c n). Further note that for n k log p, we are guaranteed that Cτ k log p n < r.
It follows that β τ lies in the interior of the region {β : β − β * 2 ≤ r}, so β τ must also be a global optimum of the regularized Huber estimator (3.1) that does not include the side constraint. Furthermore, any optima of the unconstrained problem must also lie in the interior of the constraint set.
Finally, note that inequality (B.6) implies giving the desired 1 -bound. This concludes the proof of the theorem.

B.2. Bound on regularization parameter
We now verify the bound (B.1). Note that We first condition on the values of the x i 's. For each 1 ≤ j ≤ p, we see that w(x i )e T j x i n is a sum of independent, zero-mean random variables, where the i th term is bounded by τw(xi)|e T j xi| n . Hence, by Hoeffding's inequality and a union bound, we have (B.8) for any t > 0. We will take t √ log p. Furthermore, the random vectors w(x i )x i are sub-Gaussian with parameter b by assumption, so a union bound together with standard concentration inequalities shows that for any s > 0. In addition, Taking s log p n in the concentration inequality (B.9), we then conclude that when n log p. Now let E denote the high-probability event appearing on the left-hand side of inequality (B.10), and let Scale calibration for high-dimensional robust regression 5967 By a conditioning argument, we have where the first term is bounded via inequality (B.8) and the second term is bounded by P(E c ), which is in turn bounded using inequality (B.10). Hence, we conclude that with probability at least 1 − cp −c , for a universal constant c 0 (note that this constant depends on the bound c max on λ max (Σ x )). In particular, the choice of regularization parameter λ = 2c 0 log p n ensures that ∇L n (β * ) ∞ ≤ λτ 2 , w.h.p.

B.3. RSC condition
We now turn to the more challenging task of establishing the RSC condition (B.2). We show that w.h.p., the inequality holds uniformly over the set Defining Note that for |u 1 |, |u 2 | ≤ τ , we have whereas the convexity of τ implies that

P. Loh
Denote Δ := β − β * , and define the events Note that on the event A β i , we have We will now prove that the following statements hold, where γ is a sufficiently small constant to be specified later. The proofs of Lemmas 3, 4, and 5 may be found in Appendix B.4.

Lemma 3. With probability at least
uniformly over β ∈ C.
uniformly over β ∈ C. In particular, taking δ log e γ and assuming n k log p, we have with probability at least 1 − 2 exp −c γ log e γ n . Combining the results of Lemmas 3, 4, and 5, we see that with probability at least 1 − c exp(−c n), where we choose γ, δ, and δ such that in order to ensure the second inequality. (Note that lim γ→0 γ log e γ = 0.) This completes the proof.

B.4. Proofs of supporting lemmas
We now provide the proofs of Lemmas 3, 4, and 5.

B.4.1. Proof of Lemma 3
We make use of Lemma 14 in Appendix H. We will apply the lemma to the matrix with s = k. (We will verify the deviation condition (H.1) momentarily.) Denoting Δ := β − β * , we then have uniformly over all Δ ∈ R p . Now note that for any β ∈ C, we have from which inequality (B.11) follows. Finally, note that the bound (H.1) in the hypothesis of Lemma 14 holds, w.h.p. Indeed, for v 2 ≤ 1, the quantity v T Γv is the recentered average of i.i.d. random variables, each of which is the square of a sub-Gaussian variable with 5970 P. Loh parameter b . Thus, a standard -net argument over 2k-dimensional subspaces and a union bound over the p 2k choices of the support set implies that (cf. Lemma 15 in Loh and Wainwright [54]). This proves the desired result.

B.4.2. Proof of Lemma 4
The proof is similar to the proof of Lemma 3, except that on top of the arguments used there, we also take a union bound over subsets of size at most γn, leading to an additional factor of n γn in the error probability. Recalling a standard bound on binomial coefficients, we have n γn ≤ e γ γn , and using this expression in the probability bound completes the proof.

B.4.3. Proof of Lemma 5
We write For the first term in inequality (B.15), note that by the Chernoff bound in Lemma 16, we have with probability at least 1 − exp(−cn), where the second inequality comes from Markov's inequality. In particular, we can guarantee that this term is bounded by γ 2 if we take τ ≥ c τ σ * , where the constant c τ depends on γ.
For the second term in inequality (B.15), the bound 1{x ≥ y} ≤ x y for x ≥ 0 and y > 0, together with the Cauchy-Schwarz inequality, implies that (B.16) By an analogous argument to the one employed in the proof of Lemma 3, we can derive the bound with probability at least 1 − 2 exp − cδn (b ) 2 + 2k log p , uniformly over all β ∈ C. Combined with inequality (B.16), this implies that Thus, both terms in inequality (B.15) are bounded by γn 2 , leading to the desired result.
We first consider the term which we claim is asymptotically normal. We have By Lemma 7, the second factor is O P ( √ log p). To handle the first factor, we write where the final inequality leverages Lemma 9 and the condition (4.6).
Together with the convergence statement in Lemma 10, we conclude that I has the desired asymptotic normality property, since assuming n k 2 polylog(p).

P. Loh
We now shift our attention to term II on the right-hand side of equation (D.1). By Taylor's theorem applied to each summand, we have and t i lies on the segment between and the same argument employed to bound the term B 3 in the proof of Lemma 9 and the bound on X max from Lemma 6. Altogether, we have the bound Finally, again using the expansion δ i = xi(β * − β) We then bound using inequality (D.2) and Lemma 8, the bound on X max from Lemma 6, and the 1 -error bound on β in the last inequality. Hence, we conclude that under the assumption n k 2 polylog(p). Next, we bound A 2 by noting that using Lemma 6 and Chebyshev's inequality. Combined with the deviation bound on 1 σ − 1 σ * from Lemma 8 and the bound on

D.2. Supporting lemmas
We begin with a lemma concerning the magnitude of the entries of the design matrix.
Proof. Applying a union bound to the entries of X, we have Taking t = 2σ x log(np) then gives the desired result.

P. Loh
The next lemma is a concentration inequality derived using Lemma 15: For inequality (D.4), let y i = ψ(ξ i )x i . Note that conditioned on the i 's, the y i 's are independent, zero-mean vectors. Since ψ(ξ i ) is sub-exponential, a union bound gives P max 1≤i≤n |ψ(ξ i )| ≤ 2σ ξ log n ≥ 1 − 1 n (cf. the proof of Lemma 6). Furthermore, by Chebyshev's inequality, we have Hence, defining E := max 1≤i≤n |ψ(ξ i )| ≤ 2σ ξ log n we have P(E) ≥ 1 − c n . We claim that the conditions (H.2) and (H.3) of Lemma 15 are satisfied with B n polylog(p), conditioned on E. Indeed, we have By assumption, we have c 1 ≤ Var(x ij ) ≤ c 2 σ x for all j. Furthermore, on the event E, the quantity 1 n n i=1 ψ 2 (ξ i ) is bounded. This establishes inequality (H.2). For condition (H.3), recall that since x ij is sub-exponential, we have for some constant c 3 > 0. Then on E. Hence, by taking B n polylog(p), we can guarantee that condition (H.3) is satisfied. Inequalities (D.5) and (D.6) are proved in a similar manner, noting that ψ ∞ , ψ ∞ < ∞ by assumption, so the terms involving ψ(ξ i ) are still subexponential.
The next two lemmas use the preceding concentration results to prove convergence of certain empirical quantities to their population-level counterparts.

Lemma 8. Under the assumptions of Theorem 4, we have
Proof. Using the triangle inequality, we write We bound the second term by O P log p n via Chebyshev's inequality, using the assumption that E[ξ 4 i ] < ∞. Expanding and using the triangle inequality, we bound the first term as

P. Loh
For the first term in inequality (D.7), we show that We use Lemma 14 with Γ = Σ, δ = Θ(1), and s = k. In particular, we will show that inequality (H.1) holds w.h.p. Then the lemma implies that In order to verify the deviation condition (H.1), note that by Lemma 6, we can define Σ = X T X n , where X is the matrix X with columns truncated according to x i = x i · 1{ x i ∞ > 2σ x log(np)}; then Σ = Σ, w.h.p. Furthermore, Σ is the sample covariance matrix of bounded i.i.d. random vectors, so we have w.h.p., using a standard -net + union bound argument (cf. inequality (B.14) in the proof of Lemma 3), where δ = O k polylog p n ) . Hence, by the triangle inequality, we have Proof. Note that it suffices to show the following convergence results: since we may combine the statements via Slutsky's theorem to obtain the desired result. Convergence results (G.1) and (G.3) are direct consequences of Lemma 9 and condition (4.6) of Theorem 4, under the assumed sample size scaling. To obtain the convergence result (G.2), we may use a parallel argument to the one employed to bound term B in the proof of Lemma 9. The only difference is that we use a Taylor expansion of ψ 2 rather than ψ . Note that we have assumed (ψ 2 ) to be bounded. Since (ψ 2 ) = 2ψψ , the terms we need to control replace B 1 and B 2 in inequality (D.12) by the quantities As in the proof of Lemma 9, these terms may be bounded w.h.p. using Lemma 7 and Chebyshev's inequality.
Hence, by Slutsky's theorem, we also have Combined with equation (4.11), we then have lim n,p,k→∞ Rearranging the argument inside the probability expression yields the desired result.