High-Dimensional Composite Quantile Regression: Optimal Statistical Guarantees and Fast Algorithms

The composite quantile regression (CQR) was introduced by Zou and Yuan [Ann. Statist. 36 (2008) 1108--1126] as a robust regression method that achieves high efficiency for linear models with heavy-tailed errors. Its penalized counterpart for high-dimensional sparse models was recently studied in Gu and Zou [IEEE Trans. Inf. Theory 66 (2020) 7132--7154], along with a specialized optimization algorithm based on the alternating direction method of multipliers (ADMM). Compared to the various first-order algorithms for penalized least squares, ADMM-based algorithms are less well suited to large-scale problems. To overcome this computational bottleneck, in this paper we apply a convolution-smoothing technique to CQR, complemented with iteratively reweighted $\ell_1$-regularization. The smoothed composite loss function is convex, twice continuously differentiable, and locally strongly convex with high probability. We propose a gradient-based algorithm for penalized smoothed CQR via a variant of the majorize-minimization principle, which gains substantial computational efficiency over ADMM. Theoretically, we show that the iteratively reweighted $\ell_1$-penalized smoothed CQR estimator achieves a near-minimax optimal convergence rate under heavy-tailed errors without any moment constraint, and further achieves a near-oracle convergence rate under a weaker minimum signal strength condition than needed in Gu and Zou (2020). Numerical studies demonstrate that the proposed method exhibits significant computational advantages without compromising statistical performance, compared to two state-of-the-art methods that achieve robustness and high efficiency simultaneously.


Introduction
Consider a sparse linear regression model $y = \beta_0^* + \mathbf{x}^{\mathrm T}\boldsymbol{\beta}^* + \varepsilon$, where $y \in \mathbb{R}$ is the response variable, $\mathbf{x} = (x_1, \ldots, x_p)^{\mathrm T}$ is the $p$-vector of explanatory variables (covariates), and $\varepsilon \in \mathbb{R}$ is the observation noise. In high-dimensional settings where the number of covariates considerably exceeds the number of observations, a common practice is to impose a low-dimensional structure on $\boldsymbol{\beta}^*$, the $p$-vector of regression coefficients. Over the last three decades, various penalized regression methods have been developed for fitting high-dimensional models with low intrinsic dimensions, typified by the $\ell_1$-penalized least squares method, also known as the Lasso (Tibshirani, 1996; Chen, Donoho and Saunders, 1999).
We refer to the monograph by Bühlmann and van de Geer (2011) for a systematic treatment of penalized estimation in high dimensions. One of the main challenges in high-dimensional linear regression is that the maximum spurious correlation between the covariates and the realized noise can be large even when the population counterpart is small. Therefore, the penalized least squares methods are sensitive to the tails of the error distribution, or equivalently, the response distribution. The statistical properties are often derived under exponentially light-tailed error distributions (Bickel, Ritov and Tsybakov, 2009; Bühlmann and van de Geer, 2011), including but not limited to Gaussian, sub-Gaussian or sub-exponential distributions. Heavy-tailedness, however, has been frequently observed in empirical data, such as high-dimensional microarray data as well as financial and economic data. To cope with heavy-tailed error contamination in high dimensions, many robust penalized regression methods have been proposed; see, for example, Wang, Li and Jiang (2007) and Sun, Zhou and Fan (2020). A common thread in these methods is the use of a robust loss function (replacing the $\ell_2$ loss) to achieve either a high breakdown point under arbitrary contamination or near-optimal error bounds under heavy-tailed errors. For the latter, Loh (2017) considered the case where $\varepsilon$ has a symmetric distribution, including the standard Cauchy; Fan, Li and Wang (2017) and Sun, Zhou and Fan (2020) provided a concentration study for penalized Huber regression with a properly tuned cut-off parameter when $\varepsilon$ has a bounded variance but can be skewed/asymmetric. For robustness against gross outliers, we refer to She, Wang and Shen (2021) for the most recent advances and the references therein.
In this work, we focus on heavy-tailed error contamination in a more general scenario.
When the error distribution is not only heavy-tailed but also asymmetric, using a classical robust loss function, such as the $\ell_1$ loss, the Huber loss or the Tukey loss, may induce non-negligible bias. The impact of this bias can be alleviated by letting the cut-off parameter in the Huber/Tukey loss grow with the sample size, yet we still need $\varepsilon$ to have finite variance in order to achieve a (near-)optimal convergence rate, and the parameter tuning is quite delicate in practice. Although the least absolute deviation (LAD) regression requires no moment condition on $\varepsilon$, the relative efficiency of the LAD estimator can be arbitrarily small when compared with the least squares estimator (Zou and Yuan, 2008). To overcome this efficiency loss while remaining robust against heavy-tailed errors, Zou and Yuan (2008) introduced the composite quantile regression (CQR) as a robust regression method that combines quantile information across multiple quantile levels. The asymptotic efficiency of the CQR relative to the least squares has a universal lower bound of 86.4% (Kai, Li and Zou, 2010). Theoretically, CQR requires only an everywhere nonvanishing density function of $\varepsilon$ and no moment constraint, thus allowing the infinite variance case. By complementing a composite loss function with sparsity-inducing penalties, Bradic, Fan and Wang (2011) and Gu and Zou (2020) further proposed penalized composite quasi-likelihood and composite quantile regression estimators, respectively.
While the CQR method inherits the robustness property of quantile regression (Koenker and Bassett, 1978), it also inherits the computational hardness, especially in high dimensions. Note that the $\ell_1$-penalized quantile regression can be recast as a linear program (LP) (Wang, Li and Jiang, 2007; Li and Zhu, 2008), solvable by general-purpose optimization toolboxes. These toolboxes are convenient to use yet are only adapted to small-scale problems (Bach et al., 2012). Yu et al. (2017) and Gu et al. (2018) proposed more efficient algorithms based on the alternating direction method of multipliers (ADMM). For penalized CQR, Gu and Zou (2020) proposed an ADMM-based algorithm which we will revisit in Section 4.1. The computational complexity of each ADMM update is of order $O(pnq + (p+q)^2)$, where $q \geq 1$ is the number of quantile levels used in the CQR. This can be computationally intensive when applied to large-scale datasets; see Section 4 for more detailed discussions.
To extend the capability of CQR to large-scale data, in this paper we propose a convolution-smoothed CQR (SCQR) method, complemented with iteratively reweighted $\ell_1$-penalization for fitting sparse models. Convolution smoothing turns the piecewise linear check function into a twice continuously differentiable, convex and locally strongly convex surrogate. Its success has recently been witnessed in the context of quantile regression in both statistical and computational aspects (Fernandes, Guerre and Horta, 2021; He et al., 2022). Under a Lipschitz continuity condition on the density of $\varepsilon$ and sub-Gaussian (stochastic) designs, we show that the $\ell_1$-penalized SCQR (SCQR-Lasso) estimator achieves the same rate of convergence as the Lasso estimator when $\varepsilon$ is sub-Gaussian. We require neither symmetry of the error distribution nor the existence of any moment, including the mean. Moreover, under a mild minimum signal strength (also known as beta-min) condition, we show that the iteratively reweighted $\ell_1$-penalized SCQR estimator converges at a near-oracle rate $O(\sqrt{(s + \log q)/n})$. This reveals the advantage of folded-concave penalization in terms of its adaptivity to strong signals. Heuristically, the $\ell_1$ penalty applies soft-thresholding to all signals regardless of their magnitudes, thus creating a bias of order $\lambda$ for every nonzero signal, where $\lambda > 0$ is the regularization parameter. Furthermore, we employ a variant of the local adaptive majorize-minimization (LAMM) algorithm (Fan et al., 2018) for computing the weighted $\ell_1$-penalized SCQR estimator. The main idea is to construct an isotropic quadratic objective function that locally majorizes the smoothed composite quantile loss so that closed-form updates are available at each iteration. The quadratic coefficient is adaptively chosen so that the objective function is non-increasing along the iteration path. Compared to ADMM, LAMM is a simpler gradient-based algorithm that is particularly suited to large-scale problems, where the dominant computational effort at each step is a relatively cheap matrix-vector multiplication. The (local) strong convexity of the convolution-smoothed loss facilitates the convergence of such a first-order method.
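To make the smoothing step concrete, below is a minimal sketch (not taken from the paper's code) of the convolution-smoothed check loss for a Gaussian kernel, for which the convolution $\rho_\tau * K_h$ has a closed form; the function names and the illustrative values of $\tau$ and $h$ are our own choices.

```python
import numpy as np
from scipy.stats import norm

def check_loss(u, tau):
    """Standard quantile check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def smoothed_check_loss(u, tau, h):
    """Convolution-smoothed check loss with a Gaussian kernel K_h:
    ell_{h,tau}(u) = u * (tau - Phi(-u/h)) + h * phi(u/h)."""
    return u * (tau - norm.cdf(-u / h)) + h * norm.pdf(u / h)

def smoothed_check_grad(u, tau, h):
    """Derivative of the smoothed check loss: tau - Phi(-u/h),
    a smooth surrogate of the subgradient tau - 1{u < 0}."""
    return tau - norm.cdf(-u / h)

if __name__ == "__main__":
    u = np.linspace(-2, 2, 5)
    tau, h = 0.3, 0.25          # illustrative values
    print(check_loss(u, tau))
    print(smoothed_check_loss(u, tau, h))
    print(smoothed_check_grad(u, tau, h))
```

The smooth derivative $\tau - \Phi(-u/h)$ is what enables gradient-based algorithms such as LAMM, whereas the original check function only admits subgradients.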
Our work complements Gu and Zou (2020) in two aspects. The theoretical results in Gu and Zou (2020) are derived in the case of fixed designs satisfying conditions (C1) and (C2) therein. It is unclear whether these conditions hold with high probability for Gaussian or sub-Gaussian covariates. We provide a random design analysis for sub-Gaussian covariates.
To achieve the oracle convergence rate, our beta-min condition is weaker than that in Gu and Zou (2020) by relaxing a $\sqrt{s}$-factor, where $s$ denotes the model sparsity. Computationally, we develop a fast algorithm for penalized CQR that, by means of convolution smoothing, does not sacrifice statistical efficiency. We believe that this paper offers an appealing compromise between robustness, statistical performance and numerical efficiency for sparse linear regression with heavy-tailed errors.
This work is also closely related to Wang et al. (2020), in which a new robust regression method is proposed along with a simulation-based procedure for choosing the regularization parameter. In low dimensions, we refer to Wang et al. (2020)'s method as pairwise-LAD, as it applies LAD regression to the pairwise differences of the observations.

The rest of the paper is organized as follows. Section 2 starts with a brief review of (penalized) composite quantile regression, followed by the proposed convolution-smoothed CQR with iteratively reweighted $\ell_1$-penalization. The selection of tuning parameters is discussed in Section 2.3. Section 3 provides the statistical guarantees for penalized SCQR, including a bias analysis and rates of convergence of the solution path. In Section 4, we first revisit the ADMM-based algorithm proposed in Gu and Zou (2020), and then introduce a gradient-based LAMM algorithm for convolution-smoothed CQR. Numerical comparisons of the three methods, CQR, SCQR and pairwise-LAD, are conducted in Section 5. All the proofs are placed in the appendix. The Python code for the proposed method and our implementation of the methods in Gu and Zou (2020) are available online.

Preliminaries
Suppose we observe $n$ independent samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ of a random vector $(\mathbf{x}, y) \in \mathbb{R}^p \times \mathbb{R}$ satisfying the linear model
$$ y = \beta_0^* + \mathbf{x}^{\mathrm T}\boldsymbol{\beta}^* + \varepsilon, \qquad (2.1) $$
where $\beta_0^*$ is the intercept, $\boldsymbol{\beta}^* = (\beta_1^*, \ldots, \beta_p^*)^{\mathrm T} \in \mathbb{R}^p$ is the $p$-vector of slope coefficients, and $\varepsilon$ denotes the observation noise. Assume that $\varepsilon$ is independent of $\mathbf{x}$, and has cumulative distribution function $F(\cdot)$ and probability density function $f(\cdot)$. Without loss of generality, we assume $\beta_0^* = 0$; otherwise we absorb $\beta_0^*$ into the noise so that the model becomes $y = \mathbf{x}^{\mathrm T}\boldsymbol{\beta}^* + \varepsilon$. Under these assumptions, the conditional $\tau$-quantile ($0 < \tau < 1$) of $y$ given $\mathbf{x}$ is $F^{-1}(\tau) + \mathbf{x}^{\mathrm T}\boldsymbol{\beta}^*$. To robustly estimate $\boldsymbol{\beta}^*$ in model (2.1), we consider the composite quantile regression (CQR) approach proposed in Zou and Yuan (2008), which delivers consistent estimates even when the error distribution has infinite variance and enjoys high efficiency otherwise (Zou and Yuan, 2008; Kai, Li and Zou, 2010). Given a positive integer $q$, let $\{\tau_k\}_{k=1}^q \subseteq (0, 1)$ be an increasing sequence of quantile indexes and write $\alpha_k^* = F^{-1}(\tau_k)$. When $p < n$, the canonical CQR estimator of $\boldsymbol{\beta}^*$, denoted by $\widehat{\boldsymbol{\beta}}^{\mathrm{CQR}}$, is defined as a minimizer of the empirical composite quantile loss
$$ (\widehat{\boldsymbol{\alpha}}, \widehat{\boldsymbol{\beta}}^{\mathrm{CQR}}) \in \operatorname*{arg\,min}_{\boldsymbol{\alpha} \in \mathbb{R}^q,\, \boldsymbol{\beta} \in \mathbb{R}^p} \; \widehat Q(\boldsymbol{\alpha}, \boldsymbol{\beta}) := \frac{1}{nq} \sum_{k=1}^{q} \sum_{i=1}^{n} \rho_{\tau_k}\big(y_i - \alpha_k - \mathbf{x}_i^{\mathrm T}\boldsymbol{\beta}\big), \qquad (2.2) $$
where $\rho_\tau(u) = \{\tau - \mathbb 1(u < 0)\}u$ is the check function. In the special case $q = 1$, this reduces to the usual quantile regression (Koenker and Bassett, 1978). Zou and Yuan (2008) established the asymptotic normality of $\widehat{\boldsymbol{\beta}}^{\mathrm{CQR}}$ when the density function $f(\cdot)$ of $\varepsilon$ is nonvanishing at the selected quantile levels. Therefore, the root-$n$ consistency of $\widehat{\boldsymbol{\beta}}^{\mathrm{CQR}}$ requires no moment condition on $\varepsilon$, thus allowing very heavy-tailed errors such as the Cauchy error.
Computationally, Gu and Zou (2020) employed the local linear approximation (LLA) algorithm (Zou and Li, 2008) to obtain an approximate solution to the nonconvex problem (2.3), which enjoys desirable statistical properties. The LLA algorithm for the optimization problem (2.3) is iterative, starting at iteration 0 with an initial estimate $\widehat{\boldsymbol\beta}^{\,0} \in \mathbb{R}^p$. At iteration $t = 1, 2, \ldots$, it combines a weighted $\ell_1$-penalty with the composite quantile loss to obtain the updated estimates $(\widehat{\boldsymbol\alpha}^{\,t}, \widehat{\boldsymbol\beta}^{\,t})$. The procedure involves two steps.

1) Using the previous estimate $\widehat{\boldsymbol\beta}^{\,t-1} = (\widehat\beta_1^{\,t-1}, \ldots, \widehat\beta_p^{\,t-1})^{\mathrm T}$, compute the penalty weights $\lambda_j^{t} = p_\lambda'(|\widehat\beta_j^{\,t-1}|)$ for $j = 1, \ldots, p$, where $p_\lambda'(\cdot)$ is the derivative of the folded-concave penalty function (see the code sketch after this paragraph).

2) Solve the convex, weighted $\ell_1$-penalized CQR problem
$$ (\widehat{\boldsymbol\alpha}^{\,t}, \widehat{\boldsymbol\beta}^{\,t}) \in \operatorname*{arg\,min}_{\boldsymbol\alpha \in \mathbb{R}^q,\, \boldsymbol\beta \in \mathbb{R}^p} \Big\{ \widehat Q(\boldsymbol\alpha, \boldsymbol\beta) + \sum_{j=1}^{p} \lambda_j^{t} |\beta_j| \Big\}. \qquad (2.4) $$

Under a "beta-min" (minimum signal strength) condition, together with regularity conditions on the non-stochastic design matrix $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^{\mathrm T} \in \mathbb{R}^{n \times p}$, Gu and Zou (2020) showed that, initialized with the $\ell_1$-penalized CQR (CQR-Lasso) estimator, the LLA algorithm converges to the oracle estimator in two iterations with high probability. For every $\boldsymbol\beta \in \mathbb{R}^p$, let $\widehat F(\cdot; \boldsymbol\beta)$ be the empirical cumulative distribution function of the residuals $\{y_i - \mathbf{x}_i^{\mathrm T}\boldsymbol\beta\}_{i=1}^n$.
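As an illustration of step 1), the following sketch (ours, not the paper's implementation) computes the LLA penalty weights from a previous iterate using the SCAD penalty derivative, a standard folded-concave choice; the value $a = 3.7$ and the toy inputs are illustrative assumptions.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty at t >= 0:
    lam                        if t <= lam,
    (a*lam - t)_+ / (a - 1)    if t >  lam."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

def lla_weights(beta_prev, lam, a=3.7):
    """Penalty weights lambda_j^t = p'_lambda(|beta_j^{t-1}|) used in the
    weighted-l1 subproblem of the LLA / iteratively reweighted scheme."""
    return scad_derivative(np.abs(beta_prev), lam, a)

# toy example: strong signals receive (near-)zero penalty, weak ones full penalty
beta_prev = np.array([0.0, 0.05, 0.8, -1.5])
print(lla_weights(beta_prev, lam=0.1))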

Convolution smoothed composite quantile regression
Then, the empirical composite quantile loss in (2.2) can be written as
$$ \widehat Q(\boldsymbol\alpha, \boldsymbol\beta) = \frac{1}{q} \sum_{k=1}^{q} \int_{-\infty}^{\infty} \rho_{\tau_k}(u - \alpha_k)\, \mathrm d \widehat F(u; \boldsymbol\beta), $$
where $\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_q)^{\mathrm T} \in \mathbb{R}^q$. Let $K : \mathbb{R} \to [0, \infty)$ be a symmetric, non-negative kernel function that integrates to one. For a given sequence of bandwidth parameters $h = h_n > 0$, we smooth the loss $\widehat Q(\cdot, \cdot)$ by replacing $\widehat F(\cdot; \boldsymbol\beta)$ with its kernel-smoothed counterpart, leading to
$$ \widehat Q_h(\boldsymbol\alpha, \boldsymbol\beta) = \frac{1}{q} \sum_{k=1}^{q} \int_{-\infty}^{\infty} \rho_{\tau_k}(u - \alpha_k)\, \mathrm d \widehat F_h(u; \boldsymbol\beta), \qquad (2.6) $$
where $\widehat F_h(u; \boldsymbol\beta) = n^{-1}\sum_{i=1}^n \int_{-\infty}^{u} K_h\big(v - y_i + \mathbf{x}_i^{\mathrm T}\boldsymbol\beta\big)\,\mathrm d v$ and $K_h(u) = h^{-1}K(u/h)$. For each $k = 1, \ldots, q$, define the convolution-smoothed counterpart of $\rho_{\tau_k}(\cdot)$ as $\ell_{h,k} = \rho_{\tau_k} * K_h$, where $*$ denotes the convolution operator. Consequently, the smoothed composite loss $\widehat Q_h(\boldsymbol\alpha, \boldsymbol\beta)$ defined in (2.6) can be equivalently written as
$$ \widehat Q_h(\boldsymbol\alpha, \boldsymbol\beta) = \frac{1}{nq} \sum_{k=1}^{q} \sum_{i=1}^{n} \ell_{h,k}\big(y_i - \alpha_k - \mathbf{x}_i^{\mathrm T}\boldsymbol\beta\big). \qquad (2.7) $$
Starting with an initial estimate $\widehat{\boldsymbol\beta}_h^{\,0} \in \mathbb{R}^p$, we define a sequence of iteratively reweighted $\ell_1$-penalized smoothed CQR estimators as follows. At iteration $t = 1, 2, \ldots$, $(\widehat{\boldsymbol\alpha}_h^{\,t}, \widehat{\boldsymbol\beta}_h^{\,t})$ is defined as a solution to the convex optimization problem
$$ \min_{\boldsymbol\alpha \in \mathbb{R}^q,\, \boldsymbol\beta \in \mathbb{R}^p} \Big\{ \widehat Q_h(\boldsymbol\alpha, \boldsymbol\beta) + \sum_{j=1}^{p} p_\lambda'\big(|\widehat\beta_{h,j}^{\,t-1}|\big)\, |\beta_j| \Big\}, \qquad (2.8) $$
where $\lambda > 0$ is the regularization parameter.
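For concreteness, the following sketch evaluates the smoothed composite loss (2.7) and its gradient for a Gaussian kernel, for which $\ell_{h,k}$ and $\ell_{h,k}'$ have closed forms. It is a minimal illustration written by us (the function name and toy data are assumptions), not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm

def scqr_loss_and_grad(alpha, beta, X, y, taus, h):
    """Smoothed composite quantile loss (2.7) with a Gaussian kernel, and its
    gradient with respect to (alpha, beta).
    Shapes: X (n, p), y (n,), alpha (q,), beta (p,), taus (q,)."""
    taus = np.asarray(taus, dtype=float)
    n, q = X.shape[0], len(taus)
    u = (y - X @ beta)[None, :] - alpha[:, None]     # u_{ki} = y_i - alpha_k - x_i^T beta
    # closed-form smoothed check loss for the Gaussian kernel
    loss = np.mean(u * (taus[:, None] - norm.cdf(-u / h)) + h * norm.pdf(u / h))
    w = taus[:, None] - norm.cdf(-u / h)             # ell'_{h,k}(u_{ki})
    grad_alpha = -w.mean(axis=1) / q                 # partial derivatives in alpha_k
    grad_beta = -(w.mean(axis=0) @ X) / n            # partial derivative in beta
    return loss, grad_alpha, grad_beta

# toy usage with heavy-tailed noise
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
beta_true = np.zeros(10); beta_true[:3] = 1.0
y = X @ beta_true + rng.standard_t(df=2, size=50)
taus = np.arange(1, 20) / 20
print(scqr_loss_and_grad(np.zeros(19), np.zeros(10), X, y, taus, h=0.3))
```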
Section 3 establishes the statistical properties of the solution path {( α t h , β t h )} t≥1 in the stochastic design setting. Specifically, we assume that the random covariate vectors x i 's are sub-Gaussian; see Condition (A3) below. Recall that the theoretical results in Gu and Zou (2020) are derived in the case of fixed designs satisfying conditions (C1) and (C2) therein.
It is natural to question whether or not these conditions hold with high probability for Gaussian or sub-Gaussian covariates. In Section 4.2, we describe a variant of the LAMM algorithm for solving (2.8), which will be compared to the ADMM algorithm for solving (2.4) in terms of statistical accuracy and computation time; see Section 5.

Selection of tuning parameters
The penalized smoothed CQR method relies primarily on two key tuning parameters, the regularization parameter $\lambda$ and the bandwidth $h$. As for the quantile indexes, we follow the suggestion in Zou and Yuan (2008) and take $\tau_k = k/(q+1)$, $k = 1, \ldots, q$, with $q = 19$.
The resulting estimator thus combines the strength of multiple QR estimators at levels $5\%, 10\%, \ldots, 90\%, 95\%$. The default bandwidth is specified as a function of the sample size and the average quantile level $\bar\tau = q^{-1}\sum_{k=1}^q \tau_k$. The penalty level $\lambda$, on the other hand, has a more visible impact on the performance, as it directly controls the sparsity of the solution. One general approach is to use $K$-fold cross-validation (e.g., $K = 5$ or $10$) over a grid of candidate $\lambda$ values. If model selection is of more interest than prediction, information criteria typically produce much smaller models and are thus preferable. As a variant of the high-dimensional Bayesian information criterion (BIC) for penalized QR (Lee, Noh and Park, 2014), Gu and Zou (2020) considered a BIC for composite QR that combines the logarithm of the composite quantile loss evaluated at $(\widehat{\boldsymbol\alpha}(\lambda), \widehat{\boldsymbol\beta}(\lambda))$ with a model-size penalty proportional to $|\mathcal S_\lambda|\, C_n \log(p)/n$, where $(\widehat{\boldsymbol\alpha}(\lambda), \widehat{\boldsymbol\beta}(\lambda))$ is a penalized CQR estimator with regularization parameter $\lambda$, $\mathcal S_\lambda$ is the support of $\widehat{\boldsymbol\beta}(\lambda)$, and $C_n$ is a positive number depending on $n$. Typically $C_n$ is chosen as a slowly growing function of $n$, e.g., $C_n = \log(\log n)$.
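A minimal sketch of such a BIC-based selection is given below. It is not taken from the paper: fit_scqr_lasso is a hypothetical solver returning $(\widehat{\boldsymbol\alpha}(\lambda), \widehat{\boldsymbol\beta}(\lambda))$, and the exact form of the model-size penalty (here $|\mathcal S_\lambda| C_n \log(p)/(2n)$ with $C_n = \log\log n$) is an assumption modeled on high-dimensional BIC criteria for penalized quantile regression.

```python
import numpy as np

def composite_check_loss(alpha, beta, X, y, taus):
    """Composite quantile loss: sum_k sum_i rho_{tau_k}(y_i - alpha_k - x_i^T beta)."""
    u = (y - X @ beta)[None, :] - alpha[:, None]          # shape (q, n)
    return np.sum(u * (np.asarray(taus)[:, None] - (u < 0)))

def select_lambda_bic(lambdas, X, y, taus, fit_scqr_lasso):
    """Pick lambda by a high-dimensional BIC.  fit_scqr_lasso(X, y, taus, lam)
    is assumed to return (alpha_hat, beta_hat); the penalty term below is one
    common variant, not necessarily the exact constant used in the paper."""
    n, p = X.shape
    C_n = np.log(np.log(n))
    crit = []
    for lam in lambdas:
        alpha_hat, beta_hat = fit_scqr_lasso(X, y, taus, lam)
        s_lam = np.count_nonzero(beta_hat)
        crit.append(np.log(composite_check_loss(alpha_hat, beta_hat, X, y, taus))
                    + s_lam * C_n * np.log(p) / (2 * n))
    return lambdas[int(np.argmin(crit))]
```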
Motivated by the simulation-based method proposed by Belloni and Chernozhukov (2011), we further describe a $\lambda$-tuning procedure that is computationally much cheaper than the cross-validation and BIC methods. The key is to utilize the pivotal property of the $\ell_1$-loss (Belloni and Chernozhukov, 2011). As we shall see from the theoretical results in Section 3, the magnitude of $\lambda$ depends in theory on $\|\boldsymbol\omega^*\|_\infty$, where $\boldsymbol\omega^*$ denotes the gradient of the smoothed composite loss at the true parameters; each of its summands involves a kernel-smoothed quantity that serves as a smoothed proxy of $\mathbb 1(\varepsilon_i \leq \alpha_k^*)$. Recall that $\mathbb P(\varepsilon_i \leq \alpha_k^*) = \tau_k$ for each $k = 1, \ldots, q$. Therefore, we consider a pivotal proxy of $\boldsymbol\omega^*$, obtained by replacing each $\mathbb 1(\varepsilon_i \leq \alpha_k^*)$ with $\mathbb 1(U_i \leq \tau_k)$ for independent $U_i \sim \mathrm{Unif}(0, 1)$, whose distribution does not depend on the unknown error distribution.
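The following sketch illustrates one way such a simulation-based rule can be implemented. The exact statistic and scaling in (2.10) are not reproduced here, so the gradient proxy and the constants below (c, alpha, number of repetitions) should be read as our assumptions in the spirit of Belloni and Chernozhukov (2011).

```python
import numpy as np

def simulate_lambda(X, taus, c=1.9, alpha=0.05, B=200, rng=None):
    """Simulation-based choice of lambda exploiting the pivotal property:
    1(eps_i <= alpha_k^*) is replaced by 1(U_i <= tau_k) with U_i ~ Unif(0,1),
    so the distribution of the statistic is free of the error law."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    taus = np.asarray(taus, dtype=float)
    sup_norms = np.empty(B)
    for b in range(B):
        U = rng.uniform(size=n)
        # (q, n) matrix of tau_k - 1(U_i <= tau_k), averaged over quantile levels
        W = taus[:, None] - (U[None, :] <= taus[:, None])
        grad = X.T @ W.mean(axis=0) / n          # pivotal gradient proxy, shape (p,)
        sup_norms[b] = np.max(np.abs(grad))
    return c * np.quantile(sup_norms, 1 - alpha)
```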

Statistical analysis
In this section, we establish the statistical properties of the penalized smoothed CQR estimators $\{(\widehat{\boldsymbol\alpha}_h^{\,t}, \widehat{\boldsymbol\beta}_h^{\,t})\}_{t \geq 1}$ initialized at $\widehat{\boldsymbol\beta}_h^{\,0} = \mathbf 0$ and for an ordered sequence of quantile indexes $0 < \tau_1 < \cdots < \tau_q < 1$ with $q \geq 1$. To begin with, Section 3.1 provides non-asymptotic upper bounds on the smoothing bias. Throughout the section, we assume that $\boldsymbol\beta^* \in \mathbb{R}^p$ in model (2.1) is $s$-sparse, that is, its support $S = \{1 \leq j \leq p : \beta_j^* \neq 0\}$ has cardinality $s$.

Smoothing bias
For any $h > 0$, define the population smoothed composite quantile loss $Q_h(\boldsymbol\alpha, \boldsymbol\beta) = \mathbb E\{\widehat Q_h(\boldsymbol\alpha, \boldsymbol\beta)\}$ for $\boldsymbol\alpha \in \mathbb{R}^q$ and $\boldsymbol\beta \in \mathbb{R}^p$, and denote its minimizer by $(\boldsymbol\alpha_h^*, \boldsymbol\beta_h^*)$. We first show that $Q_h : \mathbb{R}^{q+p} \to \mathbb{R}$ is convex under mild regularity conditions.
Lemma 3.1. Assume that the random covariate vector $\mathbf x \in \mathbb{R}^p$ is non-degenerate with $\bar{\boldsymbol\Sigma} = \mathbb E(\bar{\mathbf x}\bar{\mathbf x}^{\mathrm T}) \succ 0$, where $\bar{\mathbf x} = (1, \mathbf x^{\mathrm T})^{\mathrm T}$. Moreover, let the kernel function $K(\cdot)$ and the bandwidth $h$ satisfy condition (3.2), where $F$ and $f$ denote the CDF and density function of $\varepsilon$, respectively. Then, the population smoothed composite quantile loss $Q_h : \mathbb{R}^{q+p} \to \mathbb{R}$ is convex, and is strictly convex at $(\boldsymbol\alpha^*, \boldsymbol\beta^*)$.

Condition (3.2) can easily be verified if either the kernel function $K(\cdot)$ or the density function $f(\cdot)$ is positive everywhere. Without much loss of generality, we assume the former throughout this section. Intuitively, convolution smoothing induces bias, which leads us to expect $(\boldsymbol\alpha_h^*, \boldsymbol\beta_h^*) \neq (\boldsymbol\alpha^*, \boldsymbol\beta^*)$ for any given $h > 0$. By exploiting the independence of $\varepsilon$ and $\mathbf x$ and a strictly positive kernel (e.g., Gaussian, Laplacian or logistic), we find that $\boldsymbol\beta_h^* = \boldsymbol\beta^*$ for any $h > 0$; therefore, the proposed smoothing mechanism only induces bias on the intercepts $\alpha_1^*, \ldots, \alpha_q^*$, which are of less interest. To obtain explicit upper bounds on the smoothing bias, we impose the following regularity conditions on the density function $f(\cdot)$ of $\varepsilon$ as well as the kernel function $K(\cdot)$.

Define the function $m_h(\mathbf b) = \sum_{k=1}^{q} m_{h,k}(b_k)$ with $m_{h,k}(b) = \mathbb E\,\ell_{h,k}(\varepsilon - b)$ for $\mathbf b = (b_1, \ldots, b_q)^{\mathrm T} \in \mathbb{R}^q$, where $\ell_{h,k} = \rho_{\tau_k} * K_h$. Under assumptions (A1) and (A2), we will show that $m_h : \mathbb{R}^q \to \mathbb{R}$ is strictly convex and has a unique minimizer. Consequently, the smoothed population composite quantile loss $Q_h : \mathbb{R}^{q+p} \to \mathbb{R}$ also has a unique minimizer, as formalized below.

Proposition 3.1. Suppose assumptions (A1) and (A2) hold and that $\bar{\boldsymbol\Sigma}$ is positive definite.
Then, for any $h > 0$, the smoothed population composite quantile loss $Q_h : \mathbb{R}^{q+p} \to \mathbb{R}$ is strictly convex and has a unique minimizer of the form $(\boldsymbol\alpha_h^*, \boldsymbol\beta^*)$, whose intercept components satisfy the smoothing-bias bound (3.4), which is of order $h^2$, for $k = 1, \ldots, q$.
Without smoothing, Gu and Zou (2020) obtained convergence rates (under the $\ell_2$-loss) for penalized CQR under fixed designs. For sub-Gaussian (stochastic) designs, we first establish estimation error bounds for $(\widehat{\boldsymbol\alpha}_h^{\,1}, \widehat{\boldsymbol\beta}_h^{\,1})$, which complement the results in Gu and Zou (2020). Recall from Proposition 3.1 that $(\boldsymbol\alpha_h^*, \boldsymbol\beta^*)$ is the unique minimizer of the population loss $Q_h$. The key elements of our analysis are (i) a cone-like property for $\widehat{\boldsymbol\beta}_h^{\,1} - \boldsymbol\beta^*$, and (ii) a local restricted strong convexity (RSC) property for the empirical loss $\widehat Q_h$, which is based on the symmetrized Bregman divergence defined in (3.8). For any regularization parameter $\lambda > 0$, we define the event $\mathcal G(\lambda)$ in (3.9), which controls the gradient of the empirical loss, and the restricted (cone-like) set $\mathcal C = \mathcal C(S)$. Proposition 3.2 below validates that for all sufficiently large $\lambda$, the event $\mathcal G(\lambda)$ holds with high probability.
Proposition 3.2. Under assumption (A3), the event $\mathcal G(\lambda)$ holds with high probability for a suitable choice of $\lambda$.

Due to the high dimensionality, the empirical loss $\widehat Q_h : \mathbb{R}^q \times \mathbb{R}^p \to \mathbb{R}$ does not have curvature along all directions. In fact, there exists a subspace of directions, of dimension at least $p - n$, in which it is completely flat. Instead, it can be shown that the cone-like subset $\mathcal C = \mathcal C(S)$ is well aligned with the curved directions of the Hessian of $\widehat Q_h$ in a local region with high probability, which we refer to as the local RSC property. For $r, l > 0$, define the (rescaled) $\ell_2$-ball $\mathcal B_\Omega(r)$ in (3.11) and the $\ell_1$-cone $\mathcal C_\Omega(l)$ in (3.12). For any curvature parameter $c > 0$, radius parameter $r > 0$, and cone parameter $l > 0$, define the local RSC event $\mathcal R(c, r, l)$. The following proposition shows that, under certain conditions on $(s, p, n)$ and $h$, there exists a curvature parameter $c > 0$ such that the event $\mathcal R(c, r, l)$ occurs with high probability.
Based on the above preparations, we are now ready to present the first main result of this subsection regarding the estimation error of the SCQR-Lasso estimator defined in (3.5).
Theorem 3.1. Suppose assumptions (A1)-(A3) hold. With regularization parameter $\lambda \asymp \nu_0 \sigma_x \sqrt{\log(2p)/n}$, the SCQR-Lasso estimator defined in (3.5) satisfies an estimation error bound of order $\sqrt{s \log(2p)/n}$ with probability at least $1 - q/p$, provided that the bandwidth $h$ is chosen in a suitable range depending on a constant $C > 0$ that is independent of $(n, s, p, h)$.
The above theorem shows that the $\ell_1$-penalized SCQR estimator achieves the same rate of convergence as the $\ell_1$-penalized quantile regression estimator (Belloni and Chernozhukov, 2011), with a proper, yet flexible, choice of the bandwidth parameter $h$.
Moreover, Theorem 3.1 indicates that the regularization parameter $\lambda \asymp \nu_0 \sigma_x \sqrt{\log(2p)/n}$ is independent of the error distribution. This alleviates the difficulty of tuning parameter selection for the least squares estimator, which typically depends on the standard deviation of the error distribution, and for the high-dimensional CQR estimator of Gu and Zou (2020), which depends on the minimum density of the error distribution at the selected quantile levels.
As a corollary, we derive a prediction error bound for $\widehat{\boldsymbol\beta}_h$, which is a direct consequence of Theorem 3.1.

Corollary 3.1. Under the conditions of Theorem 3.1, the stated prediction error bound holds with probability at least $1 - (q + 2)/p$.
Next, we investigate the statistical properties of the iteratively reweighted $\ell_1$-penalized SCQR estimators. The following result characterizes the dependence of the estimation error $\|\widehat{\boldsymbol\theta}^t\|_2$ at the $t$-th step on $\|\widehat{\boldsymbol\theta}^{t-1}\|_2$ from the previous step. It reveals how iteratively reweighted $\ell_1$-penalization refines the statistical rate when the signals are sufficiently strong. We first derive a deterministic bound on the estimation error, conditioned on certain "good" events.
Theorem 3.2. Suppose assumptions (A1)-(A4) hold, and let $a_0, c > 0$ be such that the contraction factor $\delta \in (0, 1)$ defined in the theorem is strictly less than one. Then the estimation error bound (3.19) holds; in addition, the bound (3.20) holds for any $t \geq 2$.

The above theorem shows that, under proper conditions on the curvature parameter $c$ and the penalty function, the estimation error, or at least its leading term, can be refined iteratively via reweighted $\ell_1$-penalization. From (3.19) we see that there are three terms on the right-hand side that cannot be improved: $\|\boldsymbol\omega_S^*\|_2$, $\|P_\lambda'((|\boldsymbol\beta_S^*| - a_0\lambda)_+)\|_2$ and $\sqrt{q}\,\|\boldsymbol\zeta^*\|_2$. The first term, $\|\boldsymbol\omega_S^*\|_2$, determines the oracle convergence rate because it corresponds to the estimation error of the oracle SCQR estimator, for which only the significant covariates (indexed by $S$) are used in the fitting. The oracle SCQR estimator is formally defined in (3.21) as the minimizer of the smoothed composite loss restricted to the true support $S$, where the subscript $h$ is omitted for brevity of notation. The second term, $\|P_\lambda'((|\boldsymbol\beta_S^*| - a_0\lambda)_+)\|_2$, is the shrinkage bias induced by the penalty function. For the $\ell_1$-penalty $P_\lambda(t) = \lambda t$ ($t \geq 0$), it is easy to see that this term is of order $s^{1/2}\lambda$ regardless of how large the nonzero coordinates of $\boldsymbol\beta^*$ are (in magnitude). For a concave penalty with a decreasing derivative $P_\lambda'$, this shrinkage bias can be reduced when the signals are sufficiently strong. A concave penalty function $P_\lambda$ satisfying (A4) is called folded concave if it further satisfies the following property.
Under assumption (A5) and the minimum signal strength condition (also known as the beta-min condition) that $\|\boldsymbol\beta_S^*\|_{\min} \geq (a_0 + a^*)\lambda$, the shrinkage bias becomes zero. The third term, $\sqrt{q}\,\|\boldsymbol\zeta^*\|_2$, depends on the partial gradient of the empirical loss with respect to $\boldsymbol\alpha$.
Therefore, its order is independent of p and only scales with q.
Recall that Theorem 3.2 is a deterministic result conditioned on some "good" event related to the local RSC structure and the magnitude of the gradient of the empirical loss.
Combining Theorem 3.2 with Propositions 3.2 and 3.3, we further provide a complete result characterizing the oracle convergence rate of the iteratively reweighted L 1 -penalized SCQR estimator under a weaker beta-min condition than needed in Gu and Zou (2020).
Theorem 3.3. Suppose assumptions (A1)-(A5) hold, and that there exist $a_1 \geq a^* > a_0 > 0$ satisfying the conditions stated in the theorem. Moreover, let the regularization parameter $\lambda$ and bandwidth $h$ be chosen as specified therein. Then, for any $z > 0$, under the beta-min condition $\|\boldsymbol\beta_S^*\|_{\min} \geq (a_0 + a_1)\lambda$, the iteratively reweighted $\ell_1$-penalized SCQR estimator $(\widehat{\boldsymbol\alpha}_h^{\,t}, \widehat{\boldsymbol\beta}_h^{\,t})$ with $t \gtrsim \log\log(2p)/\log(1/\delta)$ satisfies the stated estimation error bounds.

The above result shows that, under a beta-min condition $\|\boldsymbol\beta_S^*\|_{\min} \gtrsim \sqrt{\log(p)/n}$, the proposed estimator (of $\boldsymbol\beta^*$) achieves a near-oracle rate $\sqrt{(s + \log q)/n}$ after as many as $\log(\log p)$ steps, where $q \geq 1$ is a predetermined number of quantile levels. This complements the strong oracle property established in Gu and Zou (2020), which requires the minimum signal strength to be of order $\sqrt{s \log(p)/n}$.

Strong oracle property
To establish the strong oracle property of our proposed multi-step estimator $(\widehat{\boldsymbol\alpha}_h^{\,t}, \widehat{\boldsymbol\beta}_h^{\,t})$, we need to show that the estimator equals the oracle estimator defined in (3.21) for sufficiently large $t$. We define in (3.24) an event that resembles the local RSC property: given radius parameters $r, l > 0$ and a curvature parameter $c > 0$, define the event $\mathcal R_{\mathrm{rsc}}(r, l, c)$.

Theorem 3.4. Suppose assumptions (A1)-(A5) hold, and for some predetermined $\delta \in (0, 1)$ and $c > 0$, there exist constants $a_1 > a_0 > 0$ satisfying the conditions stated in the theorem. Moreover, let $r \geq q^{1/2}\gamma_p^{1/2} a_0 b s^{1/2}\lambda$ and let $l$ be chosen as specified therein. Then the strong oracle property holds: $\widehat{\boldsymbol\beta}^{\,\ell} = \widehat{\boldsymbol\beta}^{\,o}$ provided $\ell \geq \log(s^{1/2}/\delta)/\log(1/\delta)$.
Similar to Theorem 3.2, Theorem 3.4 is a deterministic result that depends on the event described in (3.29). Thus, our next goal is to control the probability of this event. To do so, we require a deviation bound and a non-asymptotic Kiefer-Bahadur representation of the oracle estimator, both of which are of independent interest.
(A2') In addition to Condition (A2), assume $\sup_{u \in \mathbb{R}} K(u) \leq \kappa_u$ for some $\kappa_u \in (0, 1]$.

Since the oracle estimator is essentially an unpenalized smoothed CQR estimator in the low-dimensional regime where $s \ll n$, we need to derive relevant results for the low-dimensional smoothed CQR estimator. The following proposition summarizes the results about the oracle estimator that are essential for establishing the strong oracle property. The same results have been derived for low-dimensional smoothed CQR by Yan, Wang and Zhang (2023); we refer to their paper for detailed proofs.
Proposition 3.4. Assume Conditions (A1'), (A2') and (A3) hold. Then, for any $t \geq 0$, the oracle estimator satisfies the stated deviation bound and Bahadur representation.

Before presenting our final theorem characterizing the strong oracle property, one more event needs to hold with high probability, namely $\mathcal R_{\mathrm{rsc}}(r, l, c)$. The following proposition characterizes this event and its probability bound.
Proposition 3.5. Let $r$, $l$ and $h$ satisfy the stated conditions for some sufficiently large constant $C$ independent of $(n, s, p, h)$. Then the event $\mathcal R_{\mathrm{rsc}}(r, l, c)$ holds with probability at least $1 - q/(2p)$ with $c = 0.5\,\kappa \underline{f}$.
With the above preparations, we finally establish the strong oracle property of our iterative estimator.

Algorithms
In this section, we discuss computational methods for penalized composite quantile regression, with a particular focus on the weighted $\ell_1$-penalty. We first revisit the ADMM-based algorithm proposed in Gu and Zou (2020), and then describe a local adaptive majorize-minimization (LAMM) algorithm for convolution-smoothed CQR. The complexities of the two algorithms are also discussed.

An alternating direction method of multipliers algorithm
The computation of either $\ell_1$-penalized or folded-concave penalized composite quantile regression boils down to solving a weighted $\ell_1$-penalized problem of the form (4.1), where $\lambda_j \geq 0$ for $j = 1, \ldots, d$. A conventional strategy is to formulate (4.1) as a linear program, solvable by general-purpose optimization toolboxes. These toolboxes are convenient to use yet are only adapted to small-scale problems. To solve (4.1) more efficiently in high-dimensional settings, Gu and Zou (2020) proposed an algorithm based on the alternating direction method of multipliers (ADMM). The idea is to cast (4.1) as an equivalent program solvable by ADMM. Specifically, they consider a reformulation in terms of the total vector of parameters (of interest). The augmented Lagrangian of the reformulated problem involves an $(nq)$-dimensional vector that stacks $q$ copies of $\mathbf y \in \mathbb{R}^n$ one underneath the other, and an optimization parameter $\sigma > 0$.
The ADMM updates involve two proximity operators: the proximity operator of the check loss $\rho_\tau$ with respect to a parameter $a > 0$, and the proximity operator of $|\cdot|$, also known as the soft-thresholding operator.
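Both proximity operators admit simple closed forms; the sketch below (ours, not the paper's code) implements them.

```python
import numpy as np

def prox_check(v, a, tau):
    """Proximity operator of the check loss: argmin_x { rho_tau(x) + (x - v)^2 / (2a) }.
    Closed form: v - a*tau if v > a*tau, v + a*(1 - tau) if v < -a*(1 - tau), else 0."""
    v = np.asarray(v, dtype=float)
    return np.where(v > a * tau, v - a * tau,
                    np.where(v < -a * (1 - tau), v + a * (1 - tau), 0.0))

def soft_threshold(v, lam):
    """Proximity operator of lam * |.| (the soft-thresholding operator)."""
    v = np.asarray(v, dtype=float)
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# quick sanity check on a small vector
print(prox_check(np.array([-1.0, 0.05, 2.0]), a=0.5, tau=0.3))
print(soft_threshold(np.array([-1.0, 0.05, 2.0]), lam=0.3))
```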

A local adaptive majorize-minimization algorithm for smoothed CQR
In this section, we focus on solving a smoothed version of (4.1) with each $\rho_{\tau_k}$ replaced by $\ell_{h,k} = \rho_{\tau_k} * K_h$. To take advantage of the smoothness and the local strong convexity of the smoothed loss, we employ a variant of the local adaptive majorize-minimization (LAMM) algorithm proposed by Fan et al. (2018). The main idea of LAMM is to construct an isotropic quadratic objective function that locally majorizes the smoothed composite quantile loss, so that closed-form updates are available at each iteration. For $k = 1, 2, \ldots$, let $\boldsymbol\varphi^k = ((\boldsymbol\alpha^k)^{\mathrm T}, (\boldsymbol\beta^k)^{\mathrm T})^{\mathrm T}$ be the iterate after the $k$-th iteration. At the $k$-th iteration, for a sufficiently large quadratic parameter $\phi_k > 0$, we define the locally majorizing isotropic quadratic function
$$ F_k(\boldsymbol\varphi; \boldsymbol\varphi^{k-1}) = \widehat Q_h(\boldsymbol\varphi^{k-1}) + \big\langle \nabla \widehat Q_h(\boldsymbol\varphi^{k-1}), \boldsymbol\varphi - \boldsymbol\varphi^{k-1} \big\rangle + \frac{\phi_k}{2}\,\|\boldsymbol\varphi - \boldsymbol\varphi^{k-1}\|_2^2, $$
and compute $\boldsymbol\varphi^k$ by minimizing $F_k(\boldsymbol\varphi; \boldsymbol\varphi^{k-1}) + \sum_{j=1}^{p} \lambda_j |\beta_j|$ over $\boldsymbol\varphi$, where $\boldsymbol\lambda = (\lambda_1, \ldots, \lambda_p)^{\mathrm T}$; see (4.7). For a large enough $\phi_k$, the majorization property in (4.8), $F_k(\boldsymbol\varphi^k; \boldsymbol\varphi^{k-1}) \geq \widehat Q_h(\boldsymbol\varphi^k)$, holds. This ensures that the penalized objective function decreases after each iteration. From the first-order optimality condition we obtain closed-form updates for $\boldsymbol\alpha^k$ and $\boldsymbol\beta^k$: a plain gradient step for $\boldsymbol\alpha^k$, and a gradient step followed by coordinatewise soft-thresholding at levels $\lambda_j/\phi_k$ for $\boldsymbol\beta^k$. To choose a sufficiently large quadratic coefficient $\phi_k$ that guarantees the majorization property, we start from a relatively small value, say $\phi_0 = 0.01$, and successively inflate it by a factor $\gamma > 1$, that is, $\phi_{k,l} = \gamma^l \phi_0$ for $l = 1, 2, \ldots$. If the solution $\boldsymbol\varphi^{k,l}$ of (4.7) with $\phi_k = \phi_{k,l}$ satisfies (4.8) for some $l \geq 0$, we stop the search and set $\phi_k = \phi_{k,l}$. Therefore, the quadratic coefficient $\phi_k$ is automatically determined at each step. By default, we set the optimization parameters to $(\phi_0, \gamma) = (0.01, 1.25)$. We summarize the whole procedure in Algorithm 2.
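To complement Algorithm 2, here is a minimal code-level sketch of one LAMM update. It is our illustration under the Gaussian-kernel closed form, with helper names of our choosing, and is not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm

def smoothed_cqr_loss(alpha, beta, X, y, taus, h):
    """Gaussian-kernel smoothed composite quantile loss."""
    u = (y - X @ beta)[None, :] - alpha[:, None]
    t = np.asarray(taus)[:, None]
    return np.mean(u * (t - norm.cdf(-u / h)) + h * norm.pdf(u / h))

def smoothed_cqr_grad(alpha, beta, X, y, taus, h):
    """Gradient of the smoothed composite loss w.r.t. (alpha, beta)."""
    u = (y - X @ beta)[None, :] - alpha[:, None]
    w = np.asarray(taus)[:, None] - norm.cdf(-u / h)   # ell'_{h,k}(u)
    n, q = X.shape[0], len(taus)
    return -w.mean(axis=1) / q, -(w.mean(axis=0) @ X) / n

def lamm_step(alpha, beta, X, y, taus, h, lam_vec, phi=0.01, gamma=1.25, max_inflate=100):
    """One LAMM update: gradient step on (alpha, beta), soft-thresholding on beta,
    with the quadratic coefficient phi inflated until the majorization check holds."""
    g_alpha, g_beta = smoothed_cqr_grad(alpha, beta, X, y, taus, h)
    f_old = smoothed_cqr_loss(alpha, beta, X, y, taus, h)
    for _ in range(max_inflate):
        alpha_new = alpha - g_alpha / phi
        b = beta - g_beta / phi
        beta_new = np.sign(b) * np.maximum(np.abs(b) - lam_vec / phi, 0.0)
        d = np.concatenate([alpha_new - alpha, beta_new - beta])
        g = np.concatenate([g_alpha, g_beta])
        majorant = f_old + g @ d + 0.5 * phi * d @ d
        if smoothed_cqr_loss(alpha_new, beta_new, X, y, taus, h) <= majorant:
            break
        phi *= gamma
    return alpha_new, beta_new, phi
```

The inner loop is the adaptive search over $\phi_{k,l} = \gamma^l \phi_0$ described above; once the majorization check passes, the penalized objective is guaranteed not to increase.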
From Algorithms 1 and 2 we see that the dominant computational effort of each LAMM update is the multiplication of a $p \times nq$ matrix and $(nq)$-dimensional vectors, which can be implemented in $O(pnq)$ operations. In addition to this, each ADMM update also involves the multiplication of a $(p+q) \times (p+q)$ matrix and $(p+q)$-dimensional vectors, with a total per-iteration complexity of $O(pnq + (p+q)^2)$. Moreover, the ADMM needs to compute and store the inverse of $\mathbf X_1^{\mathrm T}\mathbf X_1 + \mathbf X_2^{\mathrm T}\mathbf X_2 \in \mathbb{R}^{(p+q)\times(p+q)}$, hence incurring extra computational cost and memory allocation. Via the Sherman-Morrison-Woodbury formula, the real computational effort of this step is to evaluate the inverse of an $n \times n$ matrix (with complexity $O(n^3)$), which is still expensive when $n$ is large. Figure 1 shows a preliminary comparison between the ADMM and LAMM algorithms for computing $\ell_1$-penalized CQR estimators on a simulated dataset with increasing $n, p$ subject to $p = 5n$. To make comparisons that are as fair as possible, each algorithm is implemented in Python, using the NumPy library for basic linear algebra operations. On the statistical side, the CQR-Lasso (by ADMM) and SCQR-Lasso (by LAMM) estimators exhibit nearly identical estimation errors (under the squared model error); in the speed comparison, the runtime of ADMM grows significantly faster than that of LAMM as the sample size and dimension increase. These preliminary numerical results show evidence that LAMM can be faster than ADMM by several orders of magnitude. More empirical evidence will be given in the next section.
(c) The $t$-distribution with 3 degrees of freedom ($t_3$).
The statistical performance of each method is measured via the average squared model error $\|\widehat{\boldsymbol\beta} - \boldsymbol\beta^*\|_\Sigma^2$ (with the standard error in parentheses), the number of false positives (FP), that is, the number of spurious covariates that are selected, and the number of true positives (TP), that is, the number of significant covariates that are selected. For the implementation of CQR, we set $q = 19$ and choose quantile levels $\tau_k = k/20$ for $k = 1, \ldots, 19$. Table 1 summarizes the simulation results for CQR-Lasso, SCQR-Lasso and SCQR-SCAD, the last of which uses the SCAD penalty to compute the weights in (2.8). For a fair comparison of the methods in terms of statistical and numerical efficiency, we first compute an "oracle" $\lambda$ value based on the true model error $\|\widehat{\boldsymbol\beta} - \boldsymbol\beta^*\|_\Sigma$ for each estimator. To be specific, we first compute each estimator along a predetermined sequence of $\lambda$ values, and choose the $\lambda$ that minimizes the true model error averaged over 50 replications. Next, we run 100 additional simulations for each method using the optimally chosen $\lambda$, and report the results in Table 1. When the $\ell_1$ penalty is used, the SCQR has slightly lower model errors than the CQR, yet at the cost of more false positives. The runtime comparison shows a significant computational advantage of the SCQR via LAMM over the CQR via ADMM. As mentioned in Section 4.2, both algorithms are implemented in Python using the NumPy library for basic linear algebra operations. Moreover, with the optimally chosen $\lambda$, the SCQR-SCAD estimator considerably outperforms the Lasso counterparts and achieves oracle-like performance.
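For reference, the performance measures above can be computed as in the following minimal sketch (ours; Sigma, beta_hat and beta_star stand in for the simulation quantities).

```python
import numpy as np

def squared_model_error(beta_hat, beta_star, Sigma):
    """Squared model error ||beta_hat - beta_star||_Sigma^2 = d^T Sigma d."""
    d = beta_hat - beta_star
    return float(d @ Sigma @ d)

def selection_counts(beta_hat, beta_star, tol=1e-10):
    """True positives: selected covariates that are truly nonzero.
    False positives: selected covariates whose true coefficient is zero."""
    selected = np.abs(beta_hat) > tol
    truth = np.abs(beta_star) > tol
    tp = int(np.sum(selected & truth))
    fp = int(np.sum(selected & ~truth))
    return tp, fp
```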
We further implement both methods with λ chosen by two data-driven procedures, the (five-fold) cross-validation and a modified BIC method; see Section 2.3 for details.
Under the four error distributions, Tables 2 and 3 show the simulation results for CQR-Lasso, SCQR-Lasso and SCQR-SCAD with $\lambda$ chosen by five-fold cross-validation and BIC, respectively. Statistically, the $\ell_1$-penalized CQR and SCQR methods perform similarly in terms of model selection accuracy and estimation accuracy (for $\boldsymbol\beta^*$). This empirically validates the theoretical result that smoothing only affects the intercept terms and thus does not compromise the estimation of $\boldsymbol\beta^*$. The runtime comparison, on the other hand, shows that the computational cost of ADMM, combined with either cross-validation or BIC, becomes prohibitive as soon as the data reach even moderately large scales.

In high dimensions, Wang et al. (2020) complement the pairwise-LAD loss with an $\ell_1$ penalty and refer to the resulting estimator as Rank Lasso, defined in (5.2). By utilizing the pivotal property of the $\ell_1$-loss (Belloni and Chernozhukov, 2011), they further proposed a simulation-based procedure to choose $\lambda$ automatically from the data.
Computationally, Wang et al. (2020) reformulate the optimization problem (5.2) as a linear program (LP), and then use general-purpose optimization toolboxes. We thus follow this route and implement Rank Lasso using the SciPy library with method "highs" (Huangfu and Hall, 2018). In the following simulation study, we use equation (7) in Wang et al. (2020) with $c = 1.01$ and $\alpha_0 = 0.1$ to compute the $\lambda$ in (5.2); for SCQR-Lasso, we simulate $\lambda$ via (2.10) with $c = 1.9$ and $\alpha = 0.05$.
For the data-driven Rank Lasso and SCQR-Lasso estimators, Table 4 summarizes the statistical and computational performance under the four error distributions when $(n, p) = (100, 600)$. The data-driven SCQR typically has much smaller estimation errors but more false positives than the data-driven Rank Lasso. This could just be a consequence of the different tuning procedures. The runtime comparison confirms SCQR as a practical and computationally efficient approach to robust regression. The linear program reformulation of (5.2), on the other hand, involves $2n^2 + 2p$ variables and $O(n^2 + p)$ constraints; even state-of-the-art LP solvers are not adapted to large-scale problems of this size.

A Proof of main results
By a change of variables, we can rewrite the smoothed composite quantile loss $\widehat Q_h : \mathbb{R}^{q+p} \to \mathbb{R}$ in (2.6) in an equivalent form; using this expression, we obtain the gradient expressions (A.1) and (A.2).

In this notation, we write $\bar K(u) = \int_{-\infty}^{u} K(v)\,\mathrm d v$ for the integrated kernel.
A.1 Proof of Lemma 3.1

Combining (A.1) and (A.2), we obtain the full gradient of $Q_h$.

For the Hessian, note that the second-order partial derivatives can be computed explicitly for any $1 \leq k, l \leq q$ and $1 \leq j \leq p$, where $r(\boldsymbol\beta) = y - \mathbf{x}^{\mathrm T}\boldsymbol\beta$. It follows that, for any $\mathbf a = (a_1, \ldots, a_q)^{\mathrm T} \in \mathbb{R}^q$ and $\mathbf b \in \mathbb{R}^p$, the corresponding quadratic form is non-negative, which verifies the positive semidefiniteness of $\nabla^2 Q_h(\boldsymbol\alpha, \boldsymbol\beta)$. Using the independence of $\varepsilon$ and $\mathbf x$, together with condition (3.2), integration by parts and a change of variables, we obtain that the quadratic form is strictly positive at $(\boldsymbol\alpha^*, \boldsymbol\beta^*)$ for all nonzero $(\mathbf a, \mathbf b)$. This proves the strict convexity of $Q_h$ at $(\boldsymbol\alpha^*, \boldsymbol\beta^*)$.

Turning to the sample Hessian, an analogous computation applies for any $\mathbf a = (a_1, \ldots, a_q)^{\mathrm T} \in \mathbb{R}^q$ and $\mathbf b \in \mathbb{R}^p$. Hence, the empirical smoothed composite quantile loss $\widehat Q_h$ is twice differentiable and convex.

A.2 Proof of Proposition 3.1
We first show that the function $m_h : \mathbb{R}^q \to \mathbb{R}$ has a unique minimizer, denoted by $\mathbf b_h$.

For each $1 \leq k \leq q$, define the univariate function $m_{h,k}(b) = \mathbb E\,\ell_{h,k}(\varepsilon - b)$, whose first- and second-order derivatives can be computed explicitly. Since $K$ is positive everywhere, we have $m''_{h,k}(b) > 0$ for all $b$. Therefore, $m_{h,k}(\cdot)$ is strictly convex and has a unique minimizer, denoted by $b_{h,k}$. Noting further that $\nabla^2 m_h(\mathbf b)$ is diagonal with entries $m''_{h,k}(b_k)$, the same conclusion holds for $m_h$, with minimizer $\mathbf b_h = (b_{h,1}, \ldots, b_{h,q})^{\mathrm T}$. For any $\boldsymbol\alpha \in \mathbb{R}^q$ and $\boldsymbol\beta \in \mathbb{R}^p$, we write $\Delta_{\mathbf x} = \mathbf x^{\mathrm T}(\boldsymbol\beta - \boldsymbol\beta^*)$ and expand $Q_h(\boldsymbol\alpha, \boldsymbol\beta)$ accordingly. In other words, the function $Q_h : \mathbb{R}^{q+p} \to \mathbb{R}$ is minimized at $(\beta_0^* + \mathbf b_h, \boldsymbol\beta^*)$. With an everywhere positive kernel, $Q_h$ is strictly convex, so this minimizer is unique. For the left-hand side, a direct expansion together with the Lipschitz continuity of $f$ yields an upper bound, and a matching lower bound holds for the right-hand side. Putting together the pieces, and using the fact that $\kappa_1 < \kappa_2^{1/2}$, we may cancel $|\delta_k|$ from both sides. By the definition of $b_k$, we must have $|b_{h,k} - F^{-1}(\tau_k)| \leq \kappa_2^{1/2} h$; otherwise $|\delta_k| = \kappa_2^{1/2} h$, which contradicts the above inequality. Consequently, $t_k = 1$ and $b_k = b_{h,k}$, thus implying the claimed bound (3.4).
We first claim that, when $\|\boldsymbol\Delta_k\|_{\bar\Sigma} \leq r$, the following lower bound holds with high probability for some positive $c > 0$, so that we can combine the bounds over $k$ to derive the desired result of Proposition 3.3. Let us define an event in a neighborhood of the true parameter, where $\kappa := \min_{|x| \leq 1} K(x)$. Also, using a smoothing technique similar to that of Loh (2017), we define, for $R > 0$, the Lipschitz continuous function
$$ \varphi_R(u) := u^2\,\mathbb 1(|u| \leq R/2) + (u - R)^2\,\mathbb 1(R/2 < u \leq R) + (u + R)^2\,\mathbb 1(-R \leq u < -R/2). \qquad (\mathrm{A.9}) $$
Then, we can further lower bound (A.8) in terms of $\chi_{i,k} = \mathbb 1(|\varepsilon_i - \alpha_{h,k}^*| \leq h/2)$. To prove our claim, it now suffices to show that (A.11) holds when $\|\boldsymbol\Delta_k\|_{\bar\Sigma} = \delta r$ for each $\delta \in (0, 1]$; if it holds for $\delta = 1$, the general case follows. Hence, we only need to consider the case $\|\boldsymbol\Delta_k\|_{\bar\Sigma} = r$. Suppose $\|\boldsymbol\Delta_k\|_{\bar\Sigma} = r$; combining this with Proposition 3.1, we obtain a further lower bound when $h \leq \min\{\underline f/(4\kappa_2^{1/2} l_0), \underline f/(2 l_0)\}$, where $C_b := 2\kappa_2 l_0/\underline f$ is the constant in the bias bound for $|b_{h,k} - F^{-1}(\tau_k)|$ derived in Proposition 3.1. Using the above results and the sub-Gaussianity of the covariates, we obtain, for any $u > 0$, a tail bound in which condition (A3) is used to establish the last two inequalities.
By taking expectations on the right-hand side of (A.16), we apply Talagrand's contraction principle (Theorem 4.12 of Ledoux and Talagrand, 1991), which leads to a bound involving independent Rademacher random variables $e_i$.
A.5 Proof of Theorem 3.1

Throughout the proof, we write $\widehat{\boldsymbol\alpha} = \widehat{\boldsymbol\alpha}_h$ and $\widehat{\boldsymbol\beta} = \widehat{\boldsymbol\beta}_h$ for simplicity. To prove Theorem 3.1, we first derive an upper bound on the symmetrized Bregman divergence given in (3.8), along with a cone property for the estimator. Next, we prove a local RSC property based on Proposition 3.3, which in turn implies a lower bound on the Bregman divergence.
Combining these upper and lower bounds yields the claimed estimation error bound.
Set $\boldsymbol\delta = \widehat{\boldsymbol\alpha} - \boldsymbol\alpha_h^* \in \mathbb{R}^q$ and $\boldsymbol\Delta = \widehat{\boldsymbol\beta} - \boldsymbol\beta^* \in \mathbb{R}^p$. Conditioned on the event $\mathcal G(\lambda)$ defined in (3.9), we obtain an upper bound on the symmetrized Bregman divergence. Recall that a lower bound for $D(\widehat{\boldsymbol\alpha}, \widehat{\boldsymbol\beta})$ holds when $(\boldsymbol\delta^{\mathrm T}, \boldsymbol\Delta^{\mathrm T})^{\mathrm T}$ lies in a cone-like set. We thus need to show that the estimator satisfies a cone-like property (with high probability). Using the optimality of $(\widehat{\boldsymbol\alpha}, \widehat{\boldsymbol\beta})$ and the convexity of $\widehat Q_h$, it follows that, conditioned on $\mathcal G(\lambda)$, the error vector obeys the cone constraint. Using this bound and the Cauchy-Schwarz inequality, we get $(\boldsymbol\delta^{\mathrm T}, \boldsymbol\Delta^{\mathrm T})^{\mathrm T} \in \mathcal C_\Omega(l)$ with a cone parameter $l$ of order $\gamma_p^{-1/2}\sqrt{\max(s, q)}$, where $\mathcal C_\Omega(l)$ is defined in (3.12).
Note further that the RSC property only holds in a local neighborhood of $(\boldsymbol\alpha_h^*, \boldsymbol\beta^*)$, which $(\widehat{\boldsymbol\alpha}, \widehat{\boldsymbol\beta})$ does not necessarily belong to. We thus employ a localized argument complemented with a proof by contradiction. For some $r > 0$ to be determined, define $\eta := \sup\{u \in [0, 1] : u(\boldsymbol\delta, \boldsymbol\Delta) \in \mathcal B_\Omega(r)\}$, where $\mathcal B_\Omega(r)$ is defined in (3.11). By definition, $\eta = 1$ when $(\boldsymbol\delta, \boldsymbol\Delta) \in \mathcal B_\Omega(r)$, and $\eta \in (0, 1)$ otherwise. Then define an intermediate "estimate" $(\widetilde{\boldsymbol\alpha}, \widetilde{\boldsymbol\beta}) := (\boldsymbol\alpha_h^*, \boldsymbol\beta^*) + \eta(\boldsymbol\delta, \boldsymbol\Delta)$, so that (i) $(\widetilde{\boldsymbol\alpha}, \widetilde{\boldsymbol\beta}) = (\widehat{\boldsymbol\alpha}, \widehat{\boldsymbol\beta})$ if $(\boldsymbol\delta, \boldsymbol\Delta) \in \mathcal B_\Omega(r)$, and (ii) $(\widetilde{\boldsymbol\alpha}, \widetilde{\boldsymbol\beta})$ lies on the boundary of $(\boldsymbol\alpha_h^*, \boldsymbol\beta^*) + \mathcal B_\Omega(r)$ if $(\boldsymbol\delta, \boldsymbol\Delta) \notin \mathcal B_\Omega(r)$. Moreover, $(\widetilde{\boldsymbol\alpha}, \widetilde{\boldsymbol\beta})$ inherits the cone property of $(\widehat{\boldsymbol\alpha}, \widehat{\boldsymbol\beta})$ conditioned on $\mathcal G(\lambda)$. Applying Proposition 3.3, we obtain the desired lower bound.
Note further that the RSC property only holds in a local neighborhood of α * h and β * , for which ( α, β) does not necessarily satisfy. We thus employ a localized argument complemented with proof by contradiction. For some r > 0 to be determined, define η := sup u ∈ [0, 1] : u( δ, ∆) ∈ B Ω (r) , where B Ω (r) is defined in (3.11). By definition, η = 1 when ( δ, ∆) ∈ B Ω (r), and η ∈ (0, 1) otherwise. Then define an intermediate "estimate" , and (ii) ( α, β) lies on the boundary of (α * h , β * ) + B Ω (r) if ( δ, ∆) / ∈ B Ω (r). Moreover, ( α, β) inherits the cone property of ( α, β) conditioned on G(λ). Applying Proposition 3.3, we obtain that The idea behind this proof is that we need to control the magnitude of false discoveries at each step to refine the estimation error. The larger value of λ j with j ∈ S c tends to penalize the false discoveries harder. Throughout the proof, we write α t = α t h and β t = β t h for simplicity. Let us define a sequence of sets for t ≥ 1 as follows Each set depends on the estimator of the previous iterative step. Using above definition of the index set S t , we claim that We first assume that the above claime holds. On G(P ′ (a 0 )λ), using (A.26), we can derive where g ∈ ∂∥ β t ∥ 1 . For the first term of the right-hand side of the equality above, we split it into three parts, which leads to The inequality above is derived using β * S c = 0 and the property of subdifferential. Then, combining above result with Hölder's inequality, we get the following upper bound of the symmetrized Bregmann divergence As in the proof of Theorem 3.1, define an intermediate vector ( α t , β t ) := (α * h , β * )+η( δ t , ∆ t ) with η := sup u ∈ [0, 1] : u( δ t , ∆ t ) ∈ B Ω (r) , where B Ω (r) is defined in (3.11). On event R(c, r, l)∩G(P ′ (a 0 )λ), we can ensure that η = 1, since D( α t , β t ) ≤ ηD( α t , β t ) from Lemma gives ∥ θ t ∥ Ω < r opt , which implies ( δ t , ∆ t ) ∈ B Ω (r), thus ensuring η = 1 via proof by contradiction. Now, we need to verify the claim (A.31). For the second part of the claim, it holds trivially for t = 1. Assume that it holds for 1, . . . , t. Using the definition of the index set, for j ∈ S c t+1 , we have λ t j ≥ P ′ (a 0 )λ, which verifies the second part of (A.31). For the first part of (A.31), since S 1 = S, it holds for t = 1 trivially. Suppose it holds for some t ≥ 1.
Then, we get P ′ λ (| β t j |) < P ′ (a 0 )λ = P ′ λ (a 0 λ) for j ∈ S t+1 \ S, which implies that | β t j | > a 0 λ due to the monotonicity of P ′ (·). Thus, we get an upper bound for the size of the set as follows.
A.10 Proof of Proposition 3.4

We first need to establish the estimation error bound for the oracle estimator, which is essentially the estimation error bound for a low-dimensional smoothed CQR estimator. In this proof, with a slight abuse of notation, we use the same notation as for the high-dimensional estimation error bound, except that the dimension is now $s \ll n$. The proof strategy is similar to that of Theorem 3.1, but in the low-dimensional setting some steps can be omitted.
We establish upper and lower bounds on $D(\boldsymbol\alpha, \boldsymbol\beta)$ in the low-dimensional setting. We first prove that $\mathcal R(c, r)$ holds with high probability. We follow the proof of Proposition 3.3 up to (A.18), with the slight modification that we no longer require $\mathcal C_\Omega(l)$, and thus take the supremum only over $\|\boldsymbol\Delta_k\|_{\bar\Sigma} = r$; accordingly, we simply write $Z_n$ for $Z_n(l)$ in this proof. We then need to bound the expectation of this supremum to establish the RSC property. Using Rademacher symmetrization and Talagrand's contraction principle on (A.16), we obtain the required bound. Thus, as long as $n \geq C f h s/r^2$ for a sufficiently large $C > 0$, we get the desired RSC property with probability at least $1 - qe^{-(s+t)}$. Next, we need an upper bound on $D(\boldsymbol\alpha, \boldsymbol\beta)$.
Now, using Bernstein's inequality and applying a union bound over the $\epsilon$-net $\mathcal N_\epsilon$, we obtain the desired bound with probability at least $1 - e^{\log(1 + 2/\epsilon)s - u}$. Choosing $\epsilon = 2/(e^2 - 1)$ and $u = 2s + t$ gives the desired upper bound with probability at least $1 - e^{-t}$. Thus, following the argument used in Theorem 3.1 to combine the lower and upper bounds on the Bregman divergence, we obtain the desired estimation error bound for the oracle estimator. For the Bahadur representation part, we refer to Theorem 2 of Yan, Wang and Zhang (2023).