Iteratively reweighted ℓ1-penalized robust regression

This paper investigates the tradeoffs among optimization error, statistical rates of convergence and the effect of heavy-tailed errors for high-dimensional robust regression with nonconvex regularization. When the additive errors in linear models have only bounded second moments, we show that the iteratively reweighted ℓ1-penalized adaptive Huber regression estimator satisfies exponential deviation bounds and oracle properties, including the oracle convergence rate and variable selection consistency, under a weak beta-min condition. Computationally, we need as many as O(log s + log log d) iterations to reach such an oracle estimator, where s and d denote the sparsity and ambient dimension, respectively. An extension to a general class of robust loss functions is also considered. Numerical studies lend strong support to our methodology and theory. MSC2020 subject classifications: Primary 62A01; secondary 62J07.


Introduction
Suppose we observe independent and identically distributed (i.i.d.) data vectors {(y_i, x_i) : 1 ≤ i ≤ n} from (y, x) that follows the linear model

y = x^T β* + ε,   (1.1)

where β* ∈ R^d is the unknown vector of regression coefficients and ε is the noise variable. Since the invention of the Lasso in the seminal work of [46], a variety of variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. The Lasso estimator β̂_Lasso solves the convex optimization problem

min_{β∈R^d} (2n)^{-1} Σ_{i=1}^n (y_i − x_i^T β)^2 + λ‖β‖_1,

where λ > 0 is the regularization parameter. The Lasso is an ℓ1-penalized least squares method in nature: the quadratic loss serves as a goodness-of-fit measure and the ℓ1-norm induces sparsity. To achieve better performance under different circumstances, several Lasso variants have been proposed and studied; see [3,5,14,43,51,56,57] to name a few. We refer to [8,20] and [49] for comprehensive and systematic introductions to high-dimensional statistical methods and theory.
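As a concrete illustration (our own minimal sketch, not code from the ILAMM repository), the Lasso program above can be solved with scikit-learn, whose objective (2n)^{-1}‖y − Xβ‖_2^2 + α‖β‖_1 matches the display with α playing the role of λ; the data below are simulated purely for illustration.

```python
# A minimal sketch (ours) of the Lasso program above using scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, s = 100, 1000, 6
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:s] = [4, 3, 2, -2, -2, 2]          # sparse truth, as in the numerical study below
y = X @ beta_star + rng.standard_normal(n)     # light-tailed noise in this toy example

beta_lasso = Lasso(alpha=0.2, fit_intercept=False, max_iter=10000).fit(X, y).coef_
print("selected coordinates:", np.flatnonzero(beta_lasso))
```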
As a general regression analysis method, the Lasso, along with many of its variants, has two potential downsides. First, regularized least squares methods are sensitive to the tails of the error distribution, even though various alternative penalties have been proposed to achieve better model selection performance. Consider a Lasso-type estimator that solves the penalized empirical risk minimization min_{β∈R^d} { n^{-1} Σ_{i=1}^n ℓ(y_i − x_i^T β) + λ‖β‖_1 }, where ℓ is a general loss function. The effects of the loss and the noise on the estimation error are coded in the vector {ℓ'(ε_i)}_{i=1}^n. When ℓ is the quadratic loss and ε is heavy-tailed, this vector is likely to have relatively many large coordinates. As a result, the combination of a rapidly growing loss with a heavy-tailed sampling distribution inevitably leads to outliers, which are eventually translated into spurious discoveries. Secondly, the ℓ1-penalty introduces nonnegligible estimation bias [14,56]. For correlated designs, the bias of the Lasso may offset true signals and create spurious effects, leading to inconsistency in support recovery. Technically, this is expressed through the irrepresentable condition for the selection consistency of the Lasso [55]. Under restricted eigenvalue type conditions, the Lasso and its sorted variant Slope [1,2] do achieve rate optimality for prediction and coefficient estimation. However, they do not benefit much from strong signals because the bias does not diminish as the signal strengthens. Under the restricted isometry property (on the design) and Gaussian errors, [39] derived a lower bound for the minimax risk: inf_{β̂} sup_{β*∈Ω(s,a)} E‖β̂ − β*‖_2^2 ≍ σ^2 s/n when a ≳ σ√(log(ed/s)/n), where Ω(s, a) = {β ∈ R^d : ‖β‖_0 ≤ s, min_{j:β_j≠0} |β_j| ≥ a}. For estimating such sparse vectors with sufficiently strong signals, the Lasso cannot achieve the oracle rate without a variant of the irrepresentable condition [37,55], which restricts how strongly the important and unimportant variables can be correlated. This condition, however, is in general very restrictive; see [56] for counterexamples and numerical demonstrations. Recently, [23] provided precise descriptions of the lower and upper irrepresentable conditions, proven to be necessary and sufficient for the selection consistency of the Lasso, and also pointed out that the Lasso estimator cannot achieve selection consistency and √n-consistency simultaneously under general conditions.
In the presence of heavy-tailed noise, outliers occur more frequently and may have a significant impact on (regularized) empirical risk minimization when the loss grows quickly. When the regression error ε only has a finite second moment, the Lasso still achieves the minimax rate √(s log(d)/n) (under the ℓ2-norm) but with worse deviations [28]. To reduce the ill effects of outliers, a widely recognized strategy is to use a robust loss function that is globally Lipschitz continuous and locally quadratic. A prototypical example is the Huber loss [21]:

ℓ_τ(x) = x^2/2 if |x| ≤ τ,  and  ℓ_τ(x) = τ|x| − τ^2/2 if |x| > τ,   (1.2)

where τ > 0 is a robustification parameter that controls the tradeoff between robustness and bias. Next, to alleviate the nonnegligible estimation bias introduced by ℓ1-regularization, [14] introduced a family of folded concave penalties, including the smoothly clipped absolute deviation (SCAD) penalty [14], the minimax concave (MC+) penalty [52], and the capped ℓ1-penalty [41,54]. These ideas motivate the following nonconvex (folded concave) regularized M-estimator

min_{β∈R^d} { L̂_τ(β) + Σ_{j=1}^d p_λ(|β_j|) },   (1.3)

where L̂_τ(β) := n^{-1} Σ_{i=1}^n ℓ_τ(y_i − x_i^T β) is the empirical loss, τ > 0 is a robustification parameter, and p_λ : R → [0, ∞) is a concave penalty function with a regularization parameter λ > 0. We refer to [53] for a comprehensive survey of folded concave regularized methods.
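For concreteness, here is a minimal numpy sketch (ours) of the Huber loss (1.2), its score function ℓ'_τ, and the empirical loss L̂_τ entering the M-estimator (1.3).

```python
# A minimal numpy sketch (ours) of the Huber loss (1.2) and the empirical loss in (1.3).
import numpy as np

def huber_loss(x, tau):
    """ell_tau(x) = x^2/2 if |x| <= tau, and tau*|x| - tau^2/2 otherwise."""
    return np.where(np.abs(x) <= tau, 0.5 * x**2, tau * np.abs(x) - 0.5 * tau**2)

def huber_score(x, tau):
    """ell_tau'(x): the identity near zero, truncated at +/- tau (the loss is tau-Lipschitz)."""
    return np.clip(x, -tau, tau)

def empirical_huber_loss(beta, X, y, tau):
    """hat L_tau(beta) = n^{-1} sum_i ell_tau(y_i - x_i^T beta)."""
    return huber_loss(y - X @ beta, tau).mean()
```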
In practice, it is inherently difficult to solve the nonconvex optimization problem (1.3) directly. Statistical properties, such as rates of convergence under various norms and oracle properties, are established either for the hypothetical global optimum, which is unobtainable by any practical algorithm in polynomial time, or for some local optimum that exists somewhere like a needle in a haystack. To mitigate the gap between statistical theory and algorithmic complexity, we apply an iteratively reweighted ℓ1-penalized algorithm, which originates from [58], to adaptive Huber regression. This multi-step regularized robust regression procedure (provably) yields an estimator with the desired oracle properties, and is computationally efficient because it only involves solving a sequence of (unconstrained) convex programs. Our theoretical analysis is based on and improves upon [16], who established the statistical and algorithmic theory for the iteratively reweighted ℓ1-penalized least squares regression estimator. The aim of this paper is to explore a general class of robust loss functions, typified by the Huber loss, not merely for the purpose of generality but owing to a real downside of the quadratic loss; our general principle applies to a class of robust loss functions, as will be discussed in Section 4. Software implementing the proposed procedure and reproducing our computational results is available at https://github.com/XiaoouPan/ILAMM.

Folded concave regularization: For folded concave penalized M-estimators subject to a convex side constraint, [31] and [32] were among the first to provide rigorous statistical and algorithmic theory for local optima. They quantified statistical accuracy by providing bounds on the ℓ1-, ℓ2- and prediction errors between stationary points and the population-level optimum. They also provided conditions under which the stationary point is unique, and proposed a composite gradient algorithm for provably solving the constrained optimization problem efficiently. In the context of generalized linear models with a sufficiently smooth link function and bounded covariates, [32] proved, under the scaling n ≳ s^3 log(d), that the nonconvex regularized program subject to an ℓ1-ball constraint has a unique stationary point given by the oracle estimator with high probability. For linear regression with symmetric heavy-tailed errors, [30] studied statistical consistency and asymptotic normality of nonconvex regularized robust M-estimators (also subject to an ℓ1-ball constraint) with a locally strongly convex loss. For sub-Gaussian covariates, [30] proved the uniqueness of a stationary point which has ℓ2- and ℓ1-error bounds of order √(s log(d)/n) and s√(log(d)/n), respectively. Furthermore, under the scaling n ≳ max{s^2, s log(d)} and the beta-min condition ‖β*_S‖_min ≳ λ + √(log(s)/n), this stationary point coincides with the oracle estimator.
In this paper, we address nonconvex regularized robust regression from a different angle. Motivated by the local linear approximation (LLA) algorithm proposed by [58], we apply an iteratively reweighted ℓ1-penalized algorithm to adaptive Huber regression, which involves solving a sequence of (unconstrained) convex programs. We simultaneously analyze the statistical properties and the algorithmic complexity of the solutions produced by such an iterative procedure. For sub-exponential covariates and asymmetric errors with finite variance, we show that the multi-step penalized estimator, after O(log s + log(log d)) iterations, achieves exponential deviation bounds with ℓ2- and ℓ1-errors of order √(s/n) and s/√n, respectively, under the scaling n ≳ s log(d) and the above beta-min condition. The strong oracle property can be obtained under a slightly stronger moment condition and the scaling n ≳ max{s^2, s log(d)}.

ℓ0-regularization: Another popular class of sparse recovery algorithms is based on directly solving ℓ0-constrained or ℓ0-penalized empirical risk minimizations, which naturally produce sparse solutions. Such a formulation is NP-hard and believed to be intractable in practice. Despite this computational hardness, many practically useful algorithms have been proposed to solve ℓ0-regularized ERM, while the statistical properties beyond least squares regression are much less studied. We refer to [4] and [19] for two comprehensive survey articles on ℓ0-regularized regression methods.
The idea of having the robustification parameter grow with the sample size in order to achieve exponential deviations even when the sampling distribution only has finite variance dates back to [9] in the context of mean estimation. Therefore, the robustness considered in this paper is primarily about nonasymptotic exponential deviation of the estimator versus polynomial tail of the error distribution. The resulting procedure does sacrifice a fair amount of robustness to adversarial contamination of the data. To echo the message in [9], the motivation of this work is different from and should not be confused with the classical notion of robust statistics.
From a variable selection perspective, this paper focuses on oracle properties of multi-step penalized robust regression estimators when the signal is sufficiently strong. While allowing for heavy-tailed noise, the high-dimensional feature vector x ∈ R^d is assumed to have either sub-exponential or sub-Gaussian tails. For more complex problems in which both the covariates and the noise can be (i) heavy-tailed and/or (ii) adversarially contaminated, the estimator obtained by minimizing a robust loss function is still sensitive to outliers in the feature space. To achieve robustness in both the feature and response spaces, recent years have witnessed a rapid development of the "median-of-means" (MOM) principle, which dates back to [40] and [22], and a variety of MOM-based procedures for regression and classification in both low- and high-dimensional settings [12,13,26,27,33,35]. We refer to [34] for a recent survey. An interesting open problem is how to efficiently combine the MOM principle with nonconvex regularization or iteratively reweighted ℓ1-regularization so as to achieve a high degree of robustness and variable selection consistency simultaneously.

Notation
Let us summarize our notation. For every integer k ≥ 1, we use R^k to denote the k-dimensional Euclidean space. The inner and Hadamard products of any two vectors u = (u_1, …, u_k)^T and v = (v_1, …, v_k)^T in R^k are ⟨u, v⟩ = u^T v = Σ_{i=1}^k u_i v_i and u ∘ v = (u_1 v_1, …, u_k v_k)^T, respectively. For k ≥ 2, I_k represents the identity matrix of size k. For any k × k symmetric matrix Σ ∈ R^{k×k}, ‖Σ‖_2 is the operator norm of Σ, and we use λ_min(Σ) and λ_max(Σ) to denote the minimal and maximal eigenvalues of Σ, respectively. For a positive semidefinite matrix Σ ∈ R^{k×k}, ‖·‖_Σ denotes the norm linked to Σ, given by ‖u‖_Σ = ‖Σ^{1/2} u‖_2, u ∈ R^k. For any two real numbers u and v, we write u ∨ v = max(u, v) and u ∧ v = min(u, v). For any integer d ≥ 1, we write [d] = {1, …, d}. For any set S, we use |S| to denote its cardinality, i.e., the number of elements in S.

Regularized Huber M -estimation
We first revisit the ℓ1-penalized Huber regression estimator in Section 2.1, and point out two different regimes for the robustification parameter τ. In Section 2.2, we propose a multi-step procedure, closely related to folded concave regularized Huber regression, for fitting high-dimensional sparse models with heavy-tailed noise. This multi-step penalized robust regression method is not only computationally efficient but also achieves the optimal rate of convergence and oracle properties, as will be studied in Sections 2.3 and 2.4. Throughout, S = supp(β*) = {1 ≤ j ≤ d : β*_j ≠ 0} denotes the active set and s = |S| is the sparsity.

ℓ1-penalized Huber regression
Given i.i.d. observations {(y_i, x_i)}_{i=1}^n from the linear model (1.1), consider the ℓ1-regularized Huber M-estimator

β̂ ∈ argmin_{β∈R^d} { L̂_τ(β) + λ‖β‖_1 },   (2.1)

which we refer to as the Huber-Lasso, where L̂_τ(·) is the empirical loss function defined in (1.3). Statistical properties of the penalized Huber M-estimator have been studied by [1,15,24,30] under different assumptions. A less-noticed problem is the connection between the robustification parameter and the error distribution, which in turn quantifies the tradeoff between robustness and unbiasedness. Recent studies by [44] reveal that the Huber loss is particularly well suited for heavy-tailed problems in both low and high dimensions. With a properly chosen robustification parameter, calibrated by the noise level, sample size and parametric dimension, the effects of heavy-tailed noise can be removed or dampened.
Remark 2.1. In practice, it is natural to leave the intercept or a given subset of the parameters unpenalized in the penalized M-estimation framework. Denote by R a user-specified index set of unpenalized parameters, which contains at least the index 1. A modified Huber-Lasso estimator is then defined as the solution to min_{β∈R^d} { L̂_τ(β) + λ Σ_{j∉R} |β_j| }. Similar theoretical analysis can be carried out with slight modifications and is thus omitted for ease of exposition.
We first impose the following assumptions on the data generating process. The (random) covariate vectors are assumed to be sub-exponential/sub-gamma [6], and we allow the regression errors to be heavy-tailed and asymmetric.
Theorem 2.1 provides the error bounds for the one-step penalized estimator and paves the way for our subsequent analysis of the multi-step procedure. Theorem 2.1 is a modified version of Theorem B.2 in [44] (when δ = 1) with an explicit relation between the deviation bound and the confidence level, under a slightly relaxed moment condition on the design. When the (conditional) distribution of ε is symmetric, β* can be identified as β* ∈ argmin_{β∈R^d} E L̂_τ(β). Then, with a fixed τ (e.g. τ ≍ σ_2), Theorem 2.1 can also be obtained as a special case of Theorem 2.1 in [1] when the feature vector x is sub-Gaussian.

Iteratively reweighted ℓ1-penalized Huber regression
For fitting sparse regression models, the Lasso-type estimators typically exhibit a suboptimal rate of convergence, as compared to the oracle rate achieved by nonconvex regularization methods, under a minimum signal strength condition [39,53], also known as the beta-min condition [8, Section 7.4]. However, as noted previously, directly solving the nonconvex optimization problem (1.3) is computationally challenging. Moreover, statistical properties are only established for the hypothetical global optimum (or some stationary point), which is typically unobtainable by any polynomial time algorithm.
Inspired by the local linear approximation to folded concave penalties [58], we consider a multi-stage procedure that solves a sequence of convex programs up to a prespecified optimization precision. This is an iteratively reweighted ℓ1-penalized algorithm, which is similar in spirit to the iteratively reweighted basis-pursuit algorithms studied in [18]. Let p_λ(·) be a differentiable penalty function as in (1.3) and recall that L̂_τ(·) is the empirical loss function. Starting with an initial estimate β̂^(0) = (β̂^(0)_1, …, β̂^(0)_d)^T, at the ℓ-th stage (ℓ = 1, 2, …) we compute

β̂^(ℓ) ∈ argmin_{β∈R^d} { L̂_τ(β) + Σ_{j=1}^d p'_λ(|β̂^(ℓ−1)_j|) |β_j| },   (P_ℓ)   (2.3)

that is, β̂^(ℓ) is the optimal solution to program (P_ℓ). Following [53], we assume the following conditions on the penalty function p_λ.
By a Bayesian argument, [14] suggested the choice a = 3.7. For each ℓ ≥ 1, program (P_ℓ) corresponds to a weighted ℓ1-penalized empirical Huber loss minimization of the form

min_{β∈R^d} { L̂_τ(β) + Σ_{j=1}^d λ_j |β_j| },   (2.4)

where λ = (λ_1, …, λ_d)^T is a d-vector of regularization parameters with λ_j ≥ 0. By convex optimization theory, any optimal solution β̃ to the convex program (2.4) satisfies the first-order condition that ∇L̂_τ(β̃) + λ ∘ ξ = 0 for some ξ ∈ ∂‖β̃‖_1. Following the terminology in [16], for a prespecified tolerance level ε > 0, we say that β̃ is an ε-optimal solution to (2.4) if

ω_λ(β̃) := min_{ξ∈∂‖β̃‖_1} ‖∇L̂_τ(β̃) + λ ∘ ξ‖_∞ ≤ ε.   (2.5)

In view of Definition 2.1, for a prespecified sequence of tolerance levels {ε_ℓ}_{ℓ≥1}, we use β̂^(ℓ) = (β̂^(ℓ)_1, …, β̂^(ℓ)_d)^T to denote an ε_ℓ-optimal solution to program (P_ℓ), that is, ω_{λ^(ℓ−1)}(β̂^(ℓ)) ≤ ε_ℓ. For simplicity, we consider a trivial initial estimator β̂^(0) = 0. Since p'_λ(|β̂^(0)_j|) = p'_λ(0) = λ for j = 1, …, d, the program (P_1) coincides with that in (2.1). In Section 3, we will describe an iterative local adaptive majorize-minimization (I-LAMM) algorithm which produces an ε-optimal solution to (2.4) after a few iterations.
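As a concrete illustration of how the weights of program (P_ℓ) are formed, the following sketch (our own; the function name is hypothetical) evaluates the SCAD derivative p'_λ with a = 3.7 and comments on the special case β̂^(0) = 0.

```python
# A small sketch (ours) of the SCAD derivative p'_lambda with a = 3.7 [14],
# used to form the weights lambda_j^{(l-1)} = p'_lambda(|beta_j^{(l-1)}|) in (2.4).
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """p'_lambda(t) for t >= 0: equals lam on [0, lam], (a*lam - t)_+ / (a - 1) beyond."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

# With the trivial initializer beta^{(0)} = 0 all weights equal lam, so the first
# subproblem (P_1) reduces to the Huber-Lasso program (2.1).
beta_prev = np.zeros(8)
print(scad_derivative(beta_prev, lam=0.5))
```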
The above procedure is sequential, and can be categorized into two stages: contraction (ℓ = 1) and tightening (ℓ ≥ 2). As we will see in the next subsection, even starting with a trivial initial estimator that is fairly remote from the true parameter, the contraction stage will produce a reasonably good estimator whose statistical error is of order √(s log(d)/n). Essentially, the contraction stage is equivalent to the ℓ1-penalized Huber regression in (2.1). The tightening stage further refines this coarse contraction estimator consecutively, and eventually gives rise to an estimator that achieves the oracle rate √(s/n) under a weak beta-min condition.

Deterministic analysis
To analyze the statistical properties of {β̂^(ℓ)}_{ℓ≥1}, we first define a "good" event regarding the restricted strong convexity (RSC) property of the empirical Huber loss over a local ℓ1-cone.

Definition 2.2.
For some r, l, κ > 0, define the event E_1(r, l, κ) on which the empirical Huber loss L̂_τ satisfies a restricted strong convexity property with curvature parameter κ over the local region β* + B(r) ∩ C(l), where B(r) = B_d(r) = {δ ∈ R^d : ‖δ‖_2 ≤ r} is an ℓ2-ball and

C(l) := {δ ∈ R^d : ‖δ‖_1 ≤ l ‖δ‖_2}   (2.6)

is an ℓ1-cone. Throughout the following, we assume that the penalty function p_λ(·) satisfies Condition 2.2. Moreover, define

w* := ∇L̂_τ(β*) − ∇L_τ(β*)   and   b*_τ := ‖∇L_τ(β*)‖_2,

where L_τ(β) = E L̂_τ(β) is the population loss. Here w* ∈ R^d is the centered gradient vector, which corresponds to the stochastic error, and b*_τ quantifies the (deterministic) approximation bias induced by the Huber loss. See Lemma A.1 in the Supplementary Material for an upper bound on the bias.

Remark 2.2.
In this paper, we introduce the bias term b*_τ into the results primarily because the error distribution, if not specified, can be asymmetric. This term is typically nullified in the literature for two reasons. First, under the symmetry assumption that ε (conditional on x) is symmetric around zero, E{ℓ'_τ(ε) x} = 0 for any τ > 0. Secondly, it is sometimes assumed that x_i and ε_i are independent and both have zero means. Again, for any τ > 0, it follows that E{ℓ'_τ(ε) x} = E{ℓ'_τ(ε)} · E(x) = 0. In these two scenarios, the bias term b*_τ vanishes for any given τ.
Then, conditioned on the event E_1(r, l, κ) ∩ {λ ≥ 2(‖w*‖_∞ + ε_1)} with l = 6s^{1/2}, any ε_1-optimal solution β̂^(1) of program (P_1) satisfies the error bound (2.9). Proposition 2.1 is deterministic in the sense that the bound (2.9) holds conditioned on the event E_1(r, l, κ) ∩ {λ ≥ 2(‖w*‖_∞ + ε_1)}. Under Condition 2.1 (sub-exponential design and heavy-tailed error with finite variance), we will specify choices of λ, ε_1, r and a sample size requirement under which this event occurs with high probability; specifically, we will show that it does so as long as n ≳ s log(d).
Next, we investigate the statistical properties of {β̂^(ℓ)}_{ℓ≥2} in the tightening stage. We impose a minimum signal strength condition on ‖β*_S‖_min = min_{j∈S} |β*_j|, so that the error rate obtained in Proposition 2.1 is improvable [39,53]. Recall that s = |S|.

Proposition 2.2. Assume there exists some γ > 0 such that p'(γ) > 0. Choose c > 0 so that 0.5 p'(γ)(c^2 + 1)^{1/2} + 2 = cκγ (2.11), set l = {2 + 2 p'(γ)}(c^2 + 1)^{1/2} s^{1/2} + 2 p'(γ) s^{1/2}, and let r > 0 be as specified there. Under the minimum signal strength condition ‖β*_S‖_min ≥ γλ, and conditioned on the corresponding RSC event, any ε_ℓ-optimal solution β̂^(ℓ) (ℓ ≥ 2) satisfies the error bounds (2.13) and (2.14).

Proposition 2.2 unveils how the tightening stage improves the statistical rate: every tightening step shrinks the estimation error from the previous step by a δ-fraction. The second term on the right-hand side of (2.13) or (2.14) dominates the ℓ2-error and, up to constant factors, consists of three components: ‖p'_λ(|β*_S| − γλ)‖_2, ‖w*_S‖_2 + b*_τ, and s^{1/2}ε_ℓ. We identify ‖p'_λ(|β*_S| − γλ)‖_2 as the shrinkage bias induced by the penalty function. This explains the limitation of the ℓ1-penalty p_λ(t) = λ|t|, whose derivative p'_λ(t) = λ sign(t) (t ≠ 0) does not vanish regardless of the signal strength. Intuitively, choosing a proper penalty function p_λ(·) with a descending derivative reduces the bias as the signal strengthens. The second term, ‖w*_S‖_2 + b*_τ, reveals the oracle property. To see this, consider the oracle estimator defined as

β̂^ora ∈ argmin_{β∈R^d : β_{S^c} = 0} L̂_τ(β).   (2.15)

Since s = |S| ≪ n, the finite-sample theory for Huber's M-estimation in low dimensions [44] applies to β̂^ora, indicating that, with high probability, β̂^ora achieves the oracle rate √(s/n). According to Definition 2.1, the last term s^{1/2}ε_ℓ reflects the optimization error, which will be discussed in Section 3.

The above results provide conditions under which the sequence of estimators {β̂^(ℓ)}_{ℓ≥1} satisfies the contraction property and, meanwhile, falls in a local neighborhood of β*. Another important feature of the proposed procedure is that the resulting estimator satisfies the strong oracle property, as demonstrated by the following result. Let {β̂^(ℓ)}_{ℓ≥1} be any optimal solutions to the convex programs {(P_ℓ)}_{ℓ≥1} in (2.3) with β̂^(0) = 0. Similarly to Definition 2.2, for some r, l, κ > 0 we define an event E_2(r, l, κ) in regard to the restricted strong convexity of the empirical Huber loss (see (2.16)). For a prespecified δ ∈ (0, 1), let κ ≥ 1.25/(δγ_0) and choose c_0 > 0 accordingly; the strong oracle property β̂^(ℓ) = β̂^ora then holds under a suitable minimum signal strength condition on ‖β*_S‖_min (Proposition 2.3).
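To make the δ-contraction in Proposition 2.2 explicit, one can unroll the recursion (a sketch under the stated conditions, with the oracle-type terms on the right-hand side collected into a single quantity $r_{\mathrm{ora}}$):

\[
\|\hat{\beta}^{(T)} - \beta^*\|_2 \;\lesssim\; \delta^{T-1}\,\|\hat{\beta}^{(1)} - \beta^*\|_2 \;+\; \frac{r_{\mathrm{ora}}}{1-\delta},
\]

so the geometric term drops below the oracle term once $T \gtrsim \log\bigl(\|\hat{\beta}^{(1)} - \beta^*\|_2 / r_{\mathrm{ora}}\bigr)/\log(1/\delta)$. Since the contraction stage delivers an error of order $\sqrt{s\log(d)/n}$ while $r_{\mathrm{ora}}$ is of the oracle order $\sqrt{s/n}$, a number of tightening steps that is logarithmic in the problem size suffices, consistent with the $O(\log s + \log\log d)$ iteration count stated in the abstract.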

Random analysis
In this section, we complement the previous deterministic results with probabilistic bounds on the random events of interest. To be more specific, events E_1(r, l, κ) and E_2(r, l, κ) correspond to the RSC properties of L̂_τ(·). The order of the regularization parameter λ depends on ‖w*‖_∞, where w* = ∇L̂_τ(β*) − ∇L_τ(β*) is the centered score function evaluated at β*. The oracle convergence rate depends on the ℓ2-norm of w*_S ∈ R^s, the subvector of w* indexed by S.
Under Condition 2.1, x = (x_1, …, x_d)^T is sub-exponential and Σ = E(xx^T) = (σ_{jk})_{1≤j,k≤d} is positive definite. Here we do not require the components of x to have zero means. Moreover, given the true active set S ⊆ [d] of β*, we define the s × s principal submatrix Σ_S := (σ_{jk})_{j,k∈S} of Σ. Throughout, "≲" and "≳" stand for "≤" and "≥", respectively, up to constants that are independent of (n, d, s) but might depend on those in Condition 2.1.
In particular, we define a constant that depends only on σ_x and appears in the bounds below.

Proposition 2.4. Assume Condition 2.1 holds, and let r, l > 0 and t ≥ 0. Then the RSC event E_1(r, l, κ) holds with probability at least 1 − e^{−t} as long as τ ≥ max{C_1 σ_2, C_2 r} and n ≳ log(2d), where C_0, C_1 are absolute constants and C_2 depends only on σ_x.
The next proposition provides high-probability bounds on ‖w*‖_∞ and ‖w*_S‖_2, each holding with probability at least 1 − e^{−t}. Similarly to Theorem 2.1, Propositions 2.4 and 2.5 are also modified versions of Lemmas C.4 and C.6 in [44] under a weaker sub-exponential condition on the feature vector x. Therefore, in the proofs we only provide the necessary steps that help improve upon the existing results. Together, Propositions 2.4 and 2.5 reveal the impact of the robustification parameter on the statistical properties of the resulting estimator. As discussed in Section 2.3 above, the order of ‖w*_S‖_2 determines the oracle rate of convergence.

In Theorem 2.2, we show that after only a small number of iterations, the proposed procedure leads to an estimator that achieves the oracle rate of convergence. Recall from Section 2.2 that {β̂^(ℓ)}_{ℓ=1,2,…} is a sequence of ε_ℓ-optimal solutions of the convex programs (2.3), initialized at β̂^(0) = 0.

Theorem 2.2. Assume Conditions 2.1 and 2.2 hold with constants γ_1 > γ_0 > 0 for the penalty function. Given t ≥ 0, suppose the sample size satisfies n ≳ s log d + t and that ε_ℓ ≤ 1/n for all ℓ ≥ 1. Moreover, suppose that we choose a regularization parameter λ ≍ σ_2 √((log d + t)/n), and let τ satisfy the growth condition discussed below. Then, under the minimum signal strength condition ‖β*_S‖_min ≥ (γ_0 + γ_1)λ, the multi-stage estimator β̂^(T) satisfies the error bounds in (2.27) with high probability.

We refer to the conclusion of Theorem 2.2 as the weak oracle property, in the sense that the proposed estimator achieves the same rate of convergence as the oracle β̂^ora, which knows a priori the support S of β*. We keep the two terms τ(s + t)/n and b*_τ in the upper bounds of (2.27) to keep track of the impact of τ on the estimation error: the former is part of the stochastic error and the latter characterizes the bias. Below are two cases of general interest.

1. (Symmetry) As discussed in Remark 2.2, if ε (conditional on x) is symmetric around zero, then b*_τ = 0 for any τ > 0. To certify (2.2), τ can be taken as a constant multiple of σ_2, and the resulting error bounds in (2.27) simplify accordingly.

2. (Asymmetry) When the conditional distribution of ε_i is asymmetric, there is a bias-robustness tradeoff. If ε only has a bounded second moment, then by Lemma A.1 in the Supplementary Material the bias satisfies b*_τ ≲ σ_2^2/τ. The multi-step iterative estimator β̂^(T) with τ ≍ σ_2 √(n/(s + log d + t)) then balances the bias against the stochastic error and satisfies the corresponding bounds in (2.27), under the scaling n ≳ s log d + t.

A more intriguing result, as revealed by the following theorem, is that our estimator achieves the strong oracle property, namely, it coincides with the oracle estimator with high probability. Here we need slightly stronger moment conditions than those in Condition 2.1: the random predictor x is sub-Gaussian and the noise variable ε satisfies an L_{2+η}-L_2 norm equivalence for some η ∈ (0, 1].

Remark 2.3.
A direct consequence of the strong oracle property is variable selection consistency, namely that P{supp(β̂^(ℓ)) = S} → 1 as n, d → ∞.
In particular, assume Condition 2.3 holds with η = 1, implying that ε satisfies an L_3-L_2 norm equivalence. Then, Theorem 2.3 implies that the multi-step estimator β̂^(ℓ) with λ ≍ σ_2 √((log d)/n), τ ≍ σ_2 √(n/(s + log d)) and ℓ ≳ log s achieves variable selection consistency as n, d → ∞ under the scaling n ≳ max(s log d, s^2) and the necessary beta-min condition ‖β*_S‖_min ≳ σ_2 √(log(d)/n) [39]. As previously noted, the Lasso [46] achieves desirable risk properties, in terms of both estimation and prediction, under mild conditions, yet its variable selection consistency requires much stronger assumptions [23,37,48,55]. In addition to sub-Gaussian errors, it requires a stronger beta-min condition ‖β*_S‖_min ≳ σ_2 √((s log d)/n) and the irrepresentable condition on the design (roughly, max_{j∈S^c} ‖Σ_{SS}^{-1} Σ_{S,j}‖_1 ≤ 1 − γ for some γ > 0); see, for example, Chapter 7 in [8] and Section 7.5 in [49]. We also refer to [23] for the precise lower and upper irrepresentable conditions, which are necessary and sufficient for the variable selection consistency of the Lasso.

Optimization algorithm
In this section, we use the local adaptive majorize-minimization (LAMM) principle [16] to derive an iterative algorithm for solving each subproblem (P_ℓ) in (2.3).

LAMM algorithm
To minimize a nonlinear function f(·) on R^d, at a given point β^(k), the majorize-minimization (MM) algorithm first majorizes it by another function g(· | β^(k)), which satisfies

g(β | β^(k)) ≥ f(β) for all β   and   g(β^(k) | β^(k)) = f(β^(k)),   (3.1)

and then computes β^(k+1) := argmin_{β∈R^d} g(β | β^(k)) [25]. The objective value of such an algorithm is non-increasing at each step, because

f(β^(k+1)) ≤(i) g(β^(k+1) | β^(k)) ≤(ii) g(β^(k) | β^(k)) = f(β^(k)),

where inequality (i) is due to the majorization property of g(· | β^(k)) and inequality (ii) follows from the definition of β^(k+1). [16] observed that the global majorization requirement is not necessary: it suffices that the inequalities in (3.1) hold locally. Using this principle, it suffices to locally majorize the objective L̂_τ(β) in the penalized optimization problem. At the k-th step, with working parameter vector β^(ℓ,k−1), we use an isotropic quadratic function, namely

g(β | β^(ℓ,k−1), φ) = L̂_τ(β^(ℓ,k−1)) + ⟨∇L̂_τ(β^(ℓ,k−1)), β − β^(ℓ,k−1)⟩ + (φ/2)‖β − β^(ℓ,k−1)‖_2^2,

to locally majorize L̂_τ(β), in the sense that

g(β^(ℓ,k) | β^(ℓ,k−1), φ^(ℓ,k)) ≥ L̂_τ(β^(ℓ,k)),   (3.4)

where φ^(ℓ,k) is a proper quadratic coefficient at the k-th update and β^(ℓ,k) is the solution to min_{β∈R^d} { g(β | β^(ℓ,k−1), φ^(ℓ,k)) + Σ_{j=1}^d λ_j |β_j| }. It is easy to see that β^(ℓ,k) takes a simple explicit form, namely a soft-thresholded gradient step in which the j-th coordinate of β^(ℓ,k−1) − φ^{-1}∇L̂_τ(β^(ℓ,k−1)) is shrunk by sign(·) max(|·| − λ_j/φ, 0), the soft-thresholding operator. For simplicity, we refer to this map as the LAMM update. Using this simple update formula, we iteratively search for a pair (φ^(ℓ,k), β^(ℓ,k)) that ensures the local majorization (3.4). Starting with an initial quadratic coefficient φ = φ_0, say 10^{-4}, we iteratively increase φ by a factor of γ_u > 1 and recompute the update until the local property (3.4) holds. This routine is summarized in Algorithm 1.
Algorithm 1: LAMM algorithm at the k-th iteration of the ℓ-th subproblem.
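Since the routine above is a soft-thresholded gradient step combined with a search over the quadratic coefficient φ, the following self-contained Python sketch (ours, not the ILAMM package; apart from φ_0 = 10^{-4} and the SCAD constant a = 3.7 from the text, all defaults and names are illustrative) spells out one weighted-ℓ1 Huber subproblem solved by LAMM together with the outer reweighting loop of Section 2.2.

```python
# A self-contained sketch (ours) of LAMM for one weighted-l1 Huber subproblem,
# plus the outer iteratively reweighted loop of Section 2.2.
import numpy as np

def huber_obj(beta, X, y, tau):
    r = y - X @ beta
    return np.where(np.abs(r) <= tau, 0.5 * r**2, tau * np.abs(r) - 0.5 * tau**2).mean()

def huber_grad(beta, X, y, tau):
    r = y - X @ beta
    return -X.T @ np.clip(r, -tau, tau) / len(y)

def soft_threshold(z, a):
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def lamm_solve(X, y, tau, lam_vec, beta0, phi0=1e-4, gamma_u=2.0, max_iter=500, tol=1e-6):
    """Minimize hat L_tau(beta) + sum_j lam_vec[j]*|beta_j| by LAMM."""
    beta = beta0.copy()
    for _ in range(max_iter):
        f, g = huber_obj(beta, X, y, tau), huber_grad(beta, X, y, tau)
        phi = phi0
        while True:
            beta_new = soft_threshold(beta - g / phi, lam_vec / phi)
            diff = beta_new - beta
            # local majorization check (3.4): the isotropic quadratic at beta
            # must dominate the Huber loss at the candidate update
            if huber_obj(beta_new, X, y, tau) <= f + g @ diff + 0.5 * phi * diff @ diff:
                break
            phi *= gamma_u          # inflate the quadratic coefficient and retry
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

def scad_derivative(t, lam, a=3.7):
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def irw_huber(X, y, tau, lam, n_steps=3):
    """Iteratively reweighted l1-penalized Huber regression (contraction + tightening)."""
    beta = np.zeros(X.shape[1])                  # trivial initial estimator beta^(0) = 0
    for _ in range(n_steps):                     # step 1 = Huber-Lasso (contraction stage)
        lam_vec = scad_derivative(beta, lam)     # weights p'_lambda(|beta_j^{(l-1)}|)
        beta = lamm_solve(X, y, tau, lam_vec, beta)
    return beta
```

The inner while-loop is the φ-inflation step: a candidate update is accepted only once the isotropic quadratic majorizes the Huber loss at the new point, which is exactly the local condition (3.4).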

Complexity theory
To investigate the complexity theory of the proposed algorithm, we first impose the following standard regularity conditions on the objective function.
Our next theorem characterizes the computational complexity of the contraction stage. Recall that λ^(0) = (λ, …, λ)^T ∈ R^d.

Theorem 3.1. Assume Condition 3.1 holds. Then, to compute an approximate solution β̂^(1) satisfying ω_{λ^(0)}(β̂^(1)) ≤ ε_c in the contraction stage, we need at most the order of ε_c^{−2} LAMM iterations.

The sublinear rate in the contraction stage is due to the lack of global strong convexity of the loss function at this stage, because we start from the naive initial value 0. Once we enter the contracting region, where the estimator is relatively close to the underlying true parameter vector, the problem becomes strongly convex (at least with high probability). This endows the algorithm with a linear rate of convergence. Our next theorem provides a formal statement of the geometric convergence rate of LAMM for solving each subproblem in the tightening stage. To this end, we first describe a variant of the sparse eigenvalue condition.

Condition 3.2.
We say that a localized sparse eigenvalue condition LSE(C_0) holds for some C_0 ≥ 1 if there exist an integer s̃ ≳ s and constants κ_*, κ^*, C_1 > 0 such that, over sufficiently sparse directions in a local neighborhood, the curvature of the empirical loss is bounded from below by κ_* and from above by κ^*. Note that if a vector u ∈ R^d belongs to the sparse cone C_0(m) for some m ≥ 1, then by the Cauchy-Schwarz inequality we have ‖u‖_1 ≤ m^{1/2}‖u‖_2. This implies that u also falls into the ℓ1-cone C(m^{1/2}) defined in (2.6). Proposition 2.4 will remain valid, possibly with different constants, if the ℓ1-cone C(l) therein is replaced by a sparse cone. Note that Proposition 2.4 controls the minimum LSE; similar results can be obtained to bound the maximum LSE from above.
We summarize the above two theorems in the following result, which characterizes the computational complexity of the whole algorithm. To obtain approximate solutions satisfying ω_{λ^(0)}(β̂^(1)) ≤ ε_c ≲ λ and ω_{λ^(ℓ−1)}(β̂^(ℓ)) ≤ ε_t ≲ 1/n for 2 ≤ ℓ ≤ T, the required total number of LAMM iterations is of the order C_1 ε_c^{−2} + C_2 T log(s^{1/2}λ/ε_t), where C_1 and C_2 are positive constants independent of (n, d, s).

Extension to general robust losses
Thus far, we have restricted our attention to the Huber loss. As a representative robust loss function, the Huber loss has the merit of being (i) globally τ-Lipschitz continuous and (ii) locally quadratic. A natural question is whether similar results, both statistical and computational, remain valid for more general loss functions that possess these two features. In this section, we introduce a class of loss functions which, combined with folded concave regularization, leads to statistically optimal estimators that are robust against heavy-tailed errors. Note that Condition 4.1 excludes some important Lipschitz continuous losses, such as the check function for quantile regression and the hinge loss for classification, which lack local strong convexity. The recent works [1] and [12,13] established optimal estimation and excess risk bounds for (regularized) empirical risk minimizers and MOM-type estimators based on general convex and Lipschitz loss functions, even without a local quadratic behavior. Our work complements the existing results on ℓ1-regularized ERM by showing oracle properties of nonconvex regularized methods under stronger signals. For this reason, we need an additional local strong convexity condition on the loss. It remains unclear whether the oracle rates or variable selection consistency can still be achieved without such local curvature of the loss function.
Below we list five examples of ℓ(·) (including the Huber loss) that satisfy Condition 4.1.

1. (Huber loss): ℓ_τ(·) as defined in (1.2).
2. (Pseudo-Huber loss I): ℓ(x) = √(1 + x^2) − 1, whose first and second derivatives are ℓ'(x) = x(1 + x^2)^{−1/2} and ℓ''(x) = (1 + x^2)^{−3/2}.
Among the remaining three examples is a loss whose derivative is used in [10] for mean vector estimation; for this loss, one can verify that sup_{x∈R} |ℓ'''(x)| ≤ 2√2/3 and ℓ''(x) ≥ 1 − c^2/2 for all |x| ≤ c and 0 < c < √2. The loss functions discussed above, along with their derivatives up to order three, are plotted in Figure 1, except for the Huber loss. Provided that the loss function ℓ_τ(·) satisfies Condition 4.1, all the theoretical results in Sections 2.3 and 2.4 remain valid, only with different constants. It is worth noticing that the four loss functions described in Examples 2-5 also have Lipschitz continuous second derivatives; see Figure 1. In fact, if the function ℓ satisfies ℓ'(0) = 0, ℓ''(0) = 1 and has an L_2-Lipschitz second derivative, then property (iii) in Condition 4.1 holds with c_2 = L_2/2. The Lipschitz continuity of ℓ''(·) also helps remove the anti-concentration condition (2.29) on ε.
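As a brief illustration (our own sketch; the scaling convention ℓ_τ(x) = τ^2 ℓ(x/τ) is an assumption on our part, chosen because it reproduces the Huber loss (1.2) exactly), the unit-scale losses above can be robustified as follows.

```python
# A minimal sketch of the Huber and pseudo-Huber I losses in unit-scale form,
# together with the assumed scaling ell_tau(x) = tau^2 * ell(x / tau).
import numpy as np

def ell_huber(x):
    """Unit Huber: x^2/2 for |x| <= 1, |x| - 1/2 otherwise."""
    return np.where(np.abs(x) <= 1.0, 0.5 * x**2, np.abs(x) - 0.5)

def ell_pseudo_huber1(x):
    """Pseudo-Huber I: sqrt(1 + x^2) - 1, with derivative x / sqrt(1 + x^2)."""
    return np.sqrt(1.0 + x**2) - 1.0

def scaled_loss(ell, x, tau):
    """ell_tau(x) = tau^2 * ell(x / tau): locally quadratic, linear growth in the tails."""
    return tau**2 * ell(np.asarray(x, dtype=float) / tau)

# sanity check: the scaled unit Huber coincides with the Huber loss (1.2)
x, tau = np.linspace(-5, 5, 11), 2.0
huber_direct = np.where(np.abs(x) <= tau, 0.5 * x**2, tau * np.abs(x) - 0.5 * tau**2)
assert np.allclose(scaled_loss(ell_huber, x, tau), huber_direct)
```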

Numerical study
In this section, we compare the empirical performance of the proposed multi-step penalized robust regression estimator with several benchmark methods: the Lasso [46] and the SCAD- and MC+-penalized least squares [14,52]. All the computational results presented below are reproducible using software available at https://github.com/XiaoouPan/ILAMM.
We generate data vectors {(y_i, x_i)}_{i=1}^n from two types of linear models: 1. (Homoscedastic model): y_i = x_i^T β* + ε_i, and therefore the variance of the noise is the same as that of ε_i.
Except for the normal distribution, all the other three are skewed and heavy-tailed. To meet our model assumption, we subtract the mean from the lognormal and Pareto distributions. In both homoscedastic and heteroscedastic models, the sample size is n = 100, the ambient dimension is d = 1000 and the sparsity parameter is s = 6. The true vector of regression coefficients is β* = (4, 3, 2, −2, −2, 2, 0, …, 0)^T, where the first 6 elements are non-zero and the rest are all equal to 0. We apply the proposed TAC (Tightening After Contraction) algorithm to compute all the estimators, with tuning parameters λ and τ chosen via three-fold cross-validation. To be more specific, we first choose a sequence of λ values in the same way as in the glmnet algorithm [17]. Guided by its theoretically "optimal" magnitude, the candidate set for τ is taken to be {2^j σ̂_MAD √(n/log(nd)) : j = −2, −1, 0, 1, 2}, where σ̂_MAD := median{|R̂ − median(R̂)|}/Φ^{-1}(3/4) is the median absolute deviation (MAD) estimator based on the residuals R̂ = (r̂_1, …, r̂_n)^T obtained from the Lasso.
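For concreteness, the τ grid just described can be computed as in the following sketch (ours; the function name and the use of scipy are our own choices).

```python
# A minimal sketch (ours) of the candidate robustification parameters:
# sigma_MAD is the rescaled MAD of the Lasso residuals, and
# tau ranges over {2^j * sigma_MAD * sqrt(n / log(n*d)) : j = -2,...,2}.
import numpy as np
from scipy.stats import norm

def tau_grid(residuals, n, d, powers=(-2, -1, 0, 1, 2)):
    sigma_mad = np.median(np.abs(residuals - np.median(residuals))) / norm.ppf(0.75)
    base = sigma_mad * np.sqrt(n / np.log(n * d))
    return [2.0**j * base for j in powers]
```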
To highlight the tail robustness and oracle property of our algorithm, we consider the following four measurements to assess the empirical performance: 1. True positive (TP), the number of signal variables that are selected; 2. False positive (FP), the number of noise variables that are selected; 3. Relative errors RE_1 and RE_2, the relative errors of an estimator β̂ with respect to the Lasso under the ℓ1- and ℓ2-norms, namely RE_q = ‖β̂ − β*‖_q / ‖β̂_Lasso − β*‖_q for q = 1, 2. Tables 1 and 2 summarize the averages of each measurement, TP, FP, RE_1 and RE_2, with their standard deviations in brackets, over 200 replications under both homoscedastic and heteroscedastic models. RE_1 and RE_2 for the Lasso are defined to be one, so we omit their standard deviations. Here, Huber-SCAD and Huber-MC+ signify the proposed two-stage algorithm using the SCAD and MC+ penalties, respectively. When the noise distributions are heavy-tailed and/or skewed, we see that Huber-SCAD and Huber-MC+ outperform SCAD and MC+, respectively, with fewer spurious discoveries (false positives), smaller estimation errors and less variability. Under the homoscedastic normal model, Huber-SCAD and Huber-MC+ perform similarly to their least squares counterparts, while under heteroscedasticity the proposed algorithm exhibits a notable advantage over existing methods in selection consistency even though the error is normally distributed. In summary, these numerical studies validate our expectation that the proposed robust regression algorithm improves on the Lasso as a general regression analysis method in two respects: robustness against heavy-tailed (and even heteroscedastic) noise, and selection consistency.
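The four measurements can be computed as in the following short sketch (ours); the relative-error formula, a ratio of estimation errors against the Lasso, is our reading of the definition above.

```python
# A short sketch (ours) of TP, FP, RE_1 and RE_2 for one replication.
import numpy as np

def selection_and_error_metrics(beta_hat, beta_lasso, beta_star):
    support = beta_star != 0
    selected = beta_hat != 0
    tp = int(np.sum(selected & support))        # signal variables that are selected
    fp = int(np.sum(selected & ~support))       # noise variables that are selected
    re1 = np.sum(np.abs(beta_hat - beta_star)) / np.sum(np.abs(beta_lasso - beta_star))
    re2 = np.linalg.norm(beta_hat - beta_star) / np.linalg.norm(beta_lasso - beta_star)
    return tp, fp, re1, re2
```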
To further visualize the advantage of the multi-step penalized robust regression methods over the existing ones (e.g. Lasso, SCAD and MC+), we draw the receiver operating characteristic (ROC) curve, which is the plot of the true positive rate (TPR) against the false positive rate (FPR) at various regularization parameters. Specifically, TPR and FPR are defined, respectively, as the ratio of true positives to s and the ratio of false positives to d − s.

Moreover, define the population loss L_τ(β) = E L̂_τ(β).
The following result provides conditions under which an ε-optimal solution to the convex program (A.1) falls in an ℓ1-cone. Recall that S = supp(β*), and define w(β) := ∇L̂_τ(β) − ∇L_τ(β) and b(β) := ‖∇L_τ(β)‖_2, which are, respectively, the centered score function and the approximation bias. We first characterize the magnitude of the bias b*_τ := b(β*) as a function of τ.
Then, conditioned on the event described above, the conclusion of the lemma holds.

Proof of Lemma A.3. For some r > 0 to be specified, define η = sup{u ∈ [0, 1] : β* + u(β̂ − β*) ∈ β* + B(r)}, so that η = 1 if β̂ ∈ β* + B(r) and η ∈ (0, 1) otherwise. Then consider the intermediate estimate β_η := β* + η(β̂ − β*). By the convexity of the Huber loss and Lemma F.2 in [16], we have the bound (A.4). First we bound the left-hand side of (A.4) from below. Conditioned on the stated event, Lemma A.2 indicates that β_η − β* falls in the cone C(l); hence β_η ∈ β* + B(r) ∩ C(l), and conditioned on event E_1(r, l, κ) the restricted strong convexity applies. Next we upper bound the right-hand side of (A.4). For any ξ ∈ ∂‖β̂‖_1, write
We have shown that, under proper conditions, all the estimates β̂^(ℓ) fall in a local neighborhood of β*, i.e., ‖β̂^(ℓ) − β*‖_Σ ≤ r_crude = cγ s^{1/2}λ. To further refine this bound as the signal strengthens, we need to establish sharper bounds on the terms on the right-hand side of (B.4), and to maintain the bias term b*_τ instead of replacing it with an upper bound of order s^{1/2}λ.

B.3. Proof of Proposition 2.3
By construction, the oracle estimator β̂^ora is such that β̂^ora_{S^c} = 0 ∈ R^{d−s} and ∇L̂_τ(β̂^ora)_S = 0 ∈ R^s. With w^ora = ∇L̂_τ(β̂^ora), the proof strategy is similar to that in the proof of Proposition 2.2 with ε = 0, because the β̂^(ℓ) are exact optimal solutions to (P_ℓ).

B.4. Proof of Proposition 2.4
The proof is based on arguments similar to those used in the proofs of Lemmas C.3 and C.4 in [44]. We only present the steps needed to slightly relax the sub-Gaussian condition on x. For any β ∈ β* + B(r) ∩ C(l), write δ = β − β*. Following the proof of Lemma C.3 in [44], it can be shown under Condition 2.1 that the required lower bound holds, where the second inequality uses the bound e^{−τ/(8σ_x r)} ≤ 1/8. It then follows from (B.10) that the claimed inequality holds uniformly over β ∈ β* + B(r) ∩ C(l).

C.3. Proof of Theorem 2.3
The proof is based primarily on Proposition 2.3, combined with complementary probabilistic analysis. For t ≥ 0 and a prescribed q ≥ max(s, log d), set τ = σ_2 √(n/(q + t)). In order to apply the high-level result in Proposition 2.3, we need the following two technical lemmas to control the events in (2.19). The former controls the event E_2(r, l, κ) defined in (2.16) under a proper sample size requirement, and the latter provides upper bounds on the ℓ∞-error terms ‖β̂^ora − β*‖_∞ and ‖w^ora‖_∞ = ‖∇L̂_τ(β̂^ora)‖_∞.
Lemma C.1. Under the conditions of the theorem, let τ, r, l > 0 satisfy the stated requirements, where C_0 is an absolute constant and C_1 depends only on σ_x. Then the event E_2(r, l, κ) holds with probability at least 1 − e^{−t}.

The next lemma provides statistical properties of the oracle estimator β̂^ora defined in (2.15) with τ = σ_2 √(n/(q + t)). Since the oracle β̂^ora has access to the true active set S, it is essentially an unpenalized Huber estimator based on the covariates indexed by S.

Lemma C.2. Under the sample size scaling n ≳ q + t, the following bounds on ‖β̂^ora − β*‖_∞ and ‖w^ora‖_∞ hold with probability at least 1 − 7e^{−t}.
Compared to Propositions 2.4 and 2.5, the proofs of Lemmas C.1 and C.2, which are placed in the following two subsections, require a more delicate analysis of the local behavior of the gradient process {∇L̂_τ(β) : β ∈ R^d} around both the underlying vector β* and the oracle estimator β̂^ora, with the latter being random itself.

C.4.1. Technical lemmas
We first present three technical lemmas, which are the key ingredients to the proof. The first lemma provides an alternative to the stopping rule.
Proof of Lemma C.3. For simplicity, we write L̂(·) = L̂_τ(·) for the loss function of interest. Since β^(k) is the exact solution of the majorized subproblem at the k-th iteration when ℓ = 1, the first-order optimality condition holds: there exists some ξ^(k) ∈ ∂‖β^(k)‖_1 such that

∇L̂(β^(k−1)) + φ^(k)(β^(k) − β^(k−1)) + λ ∘ ξ^(k) = 0.

For any u ∈ R^d such that ‖u‖_1 = 1, we then have

⟨∇L̂(β^(k)) + λ ∘ ξ^(k), u⟩ = ⟨∇L̂(β^(k)) − ∇L̂(β^(k−1)) − φ^(k)(β^(k) − β^(k−1)), u⟩ ≤ (φ^(k) + L)‖β^(k) − β^(k−1)‖_2,

where the last inequality is due to the Lipschitz continuity of ∇L̂(·). Taking the supremum over all u satisfying ‖u‖_1 ≤ 1, we obtain ω_λ(β^(k)) ≤ (φ^(k) + L)‖β^(k) − β^(k−1)‖_2.
This leads to a contradiction, indicating that φ^(k) ≤ Lγ_u.
To make the right-hand side of the above inequality smaller than ε_t, we need k to be sufficiently large that k ≥ C_1 log(C_2 s^{1/2}λ/ε_t), where C_1, C_2 > 0 are constants depending only on the localized sparse eigenvalues and γ_u. This completes the proof.