Asymptotic normality of robust $M$-estimators with convex penalty

This paper develops asymptotic normality results for individual coordinates of robust M-estimators with convex penalty in high dimensions, where the dimension $p$ is at most of the same order as the sample size $n$, i.e., $p/n\le\gamma$ for some fixed constant $\gamma>0$. The asymptotic normality requires a bias correction and holds for most coordinates of the M-estimator for a large class of loss functions, including the Huber loss and its smoothed versions, regularized with a strongly convex penalty. The asymptotic variance that characterizes the width of the resulting confidence intervals is estimated with data-driven quantities. This estimate of the variance adapts automatically to low ($p/n\to 0$) or high ($p/n \le \gamma$) dimensions and does not involve the proximal operators seen in previous works on asymptotic normality of M-estimators. For the Huber loss, the estimated variance has a simple expression involving an effective degrees of freedom as well as an effective sample size. The case of the Huber loss with Elastic-Net penalty is studied in detail, and a simulation study confirms the theoretical findings. The asymptotic normality results follow from Stein formulae for high-dimensional random vectors on the sphere developed in the paper, which are of independent interest.


Robust inference
In his seminal paper on robustness, Huber [20] introduced M-estimators of an unknown location parameter µ ∈ R from observations Y_i = µ + ε_i, i = 1, ..., n, where the ε_i are iid noise random variables distributed as a mixture F = (1 − η)N(0, 1) + ηH, with η a contamination proportion and H the distribution of the contaminated samples, possibly chosen by an adversary. Given a differentiable loss function ρ : R → R with derivative ψ = ρ', Huber defined M-estimators as minimizers μ̂ = argmin_{b∈R} Σ_{i=1}^n ρ(Y_i − b), or equivalently as solutions of Σ_{i=1}^n ψ(Y_i − μ̂) = 0. Huber [20] went on to prove consistency and asymptotic normality of such M-estimators, obtaining among other results that if ρ is convex and ψ is absolutely continuous, then under mild assumptions on F the convergence μ̂ → µ in probability holds, as well as the asymptotic normality
√n(μ̂ − µ) →d N(0, E[ψ(ε_1)²]/E[ψ'(ε_1)]²).
Huber's M-estimators were extended to regression models, where a design matrix X ∈ R^{n×p} is observed together with responses y_i = x_i^⊤β + ε_i, where x_1, ..., x_n are the rows of X and ε_1, ..., ε_n are possibly contaminated noise random variables as in the previous paragraph. For fixed or slowly growing dimension p as n → +∞, consistency and asymptotic normality of M-estimators of the form β̂ = argmin_{b∈R^p} Σ_{i=1}^n ρ(y_i − x_i^⊤b) were obtained; see [21, Section 7] or [27] among others. Explicitly, if e_j ∈ R^p is a canonical basis vector and one is interested in the asymptotic normality of β̂_j − β_j for the purpose of confidence intervals, [27] shows that if (p log n)^{3/2}/n → 0 and under mild assumptions on X,
(e_j^⊤(X^⊤X)^{-1}e_j)^{-1/2}(β̂_j − β_j) →d N(0, E[ψ(ε_1)²]/E[ψ'(ε_1)]²).   (1.1)
As in the location parameter problem of the previous paragraph, the asymptotic variance is characterized by the ratio E[ψ²]/E[ψ']². The last decade has seen striking developments of similar asymptotic normality results in high dimensions, where p/n → γ for some constant γ < 1, cf.
[17, 3, 25, 15, 16]. In terms of asymptotic normality, these works show that if X has iid N(0, Σ) rows, the unregularized M-estimator β̂ = argmin_{b∈R^p} Σ_{i=1}^n ρ(y_i − x_i^⊤b) satisfies asymptotic normality of the form
(e_j^⊤Σ^{-1}e_j)^{-1/2} √p (β̂_j − β_j) →d N(0, r²),   (1.2)
where r > 0 is a deterministic constant that captures the high-dimensionality of the problem [17, Lemma 1]. The constant r > 0 is determined by solving a system of nonlinear equations with two unknowns: in the unregularized setting, [17, §2] describes this system (1.3) with unknowns (r, c), where Z ∼ N(0, 1) is independent of ε_1 and prox_c(ρ)(t) = argmin_{u∈R} ρ(u) + (t − u)²/(2c) denotes the proximal operator of the convex function t ↦ cρ(t), with derivative [prox_c(ρ)]'(t). The optimality conditions of the proximal minimization problem lead to explicit expressions for the solutions (r, c) of the above system. Hence (1.2) can be rewritten as the asymptotic normality (1.4), in which the asymptotic variance is the ratio of the second moment of ψ ∘ prox_c(ρ), evaluated at the effective noise ε_1 + rZ, to the square of the corresponding first moment of its derivative; see, e.g., [15, Theorem 4.1 and Corollary 4.6]. These results embody that when p and n are of the same order, the asymptotic variance in (1.1) must be modified to account for the high-dimensionality of the problem by (a) replacing ψ in the numerator and ψ' in the denominator by their compositions with the proximal operator prox_c(ρ), and (b) adding the extra Gaussian term rZ to the initial noise ε_1. The distribution of ε_1 + rZ is sometimes referred to as the effective noise. The Gaussian assumption can be relaxed, and some of the above results remain valid if X has iid centered entries with variance one [25, 16]. Despite the subtle introduction of the proximal operator and the constants (r, c), it is remarkable that the informal ratio
average[ψ²] / average[(d/dt)ψ]²
unifies the results (1.1) and (1.4) in both low and high dimensions.
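To make the role of the proximal operator concrete, the following sketch (our illustration, not taken from the cited references) evaluates prox_c(ρ) for the Huber loss in closed form and checks it against a brute-force grid minimization; the threshold m = 1 and the grid bounds are our choices.

```python
def huber(u, m=1.0):
    # Huber loss with threshold m: quadratic near 0, linear in the tails.
    return 0.5 * u * u if abs(u) <= m else m * (abs(u) - 0.5 * m)

def prox_huber(t, c, m=1.0):
    # prox_c(rho)(t) = argmin_u rho(u) + (t - u)^2 / (2c) for rho = Huber.
    # Closed form: shrink toward 0 by the factor 1/(1+c) in the quadratic
    # zone |t| <= m(1+c), translate by c*m*sign(t) in the linear zone.
    if abs(t) <= m * (1.0 + c):
        return t / (1.0 + c)
    return t - c * m * (1.0 if t > 0 else -1.0)

def prox_huber_grid(t, c, m=1.0, lo=-10.0, hi=10.0, steps=200001):
    # Brute-force check of the closed form by minimizing over a fine grid.
    best_u, best_val = lo, float("inf")
    for k in range(steps):
        u = lo + (hi - lo) * k / (steps - 1)
        val = huber(u, m) + (t - u) ** 2 / (2.0 * c)
        if val < best_val:
            best_u, best_val = u, val
    return best_u
```

The derivative [prox_c(ρ)]'(t) equals 1/(1 + c) in the quadratic zone and 1 in the linear zone, which is the quantity entering the denominator of the high-dimensional asymptotic variance discussed above.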
All results of the previous section are applicable when p/n → γ with γ < 1. For γ > 1, regularization is required to ensure the uniqueness of β̂, for instance through an additive penalty, which leads to regularized M-estimators of the form
β̂ = argmin_{b∈R^p} Σ_{i=1}^n ρ(y_i − x_i^⊤b) + n g(b)   (1.5)
for some convex penalty function g : R^p → R. The case of Ridge regularization with g(b) = τ‖b‖₂²/2 for some constant τ > 0 is treated in [25, 16]. In this case, the two solutions (r, c) of a system of two nonlinear equations similar to (1.3) characterize the error ‖β̂ − β‖₂, as well as the asymptotic normality and asymptotic variance of β̂_j − a β_j, where a is a constant capturing the bias induced by regularization [16, Proposition 3.30] and a is a function of (γ, r, c). [29] characterizes the error ‖β̂ − β‖₂ for a large class of (ρ, g) pairs using the Convex Gaussian Min-Max Theorem pioneered by [28], and the recent paper [19] on Approximate Message Passing focuses on g(b) = λ‖b‖₁ with ρ either the Huber loss or the absolute value. Little is known, however, about asymptotic normality of the regularized estimators (1.5) for penalty functions different from the squared norm b ↦ τ‖b‖₂²/2. The theories developed in [29, 19] do not readily provide asymptotic normality results, and regularized M-estimators of the form (1.5) lack confidence interval capabilities. One goal of the present paper is to fill this gap.
A separate line of research develops asymptotic normality results and confidence intervals for regularized least-squares estimators of the form (1.6), that is, minimizers of the squared loss plus a convex penalty, where X has rows x_1, ..., x_n. Early results studied the Lasso with g(b) = λ‖b‖₁ [32, 22, 30] under the sparsity condition s log(p) = o(√n) where s = ‖β‖₀, or Ridge regression [11]. For the Lasso, the sparsity condition was later improved to s log²(p)/n → 0 [24], to s log(p/s)/n → 0 [7], and to p/n → γ ∈ (0, ∞) with s ≲ n/log(p/s) ([23, 26] for isotropic Gaussian designs and [13], [6, Theorem 3.2] for non-isotropic Gaussian designs). For penalty functions beyond the Lasso and Ridge regularization, [12, Proposition 4.3(iii)] provides asymptotic normality on average over the coordinates for permutation-invariant penalty functions g in (1.6), and [6, Theorem 3.1] proves asymptotic normality for individual coordinates of (1.6) under a strong convexity assumption. A high-level message of these works is that one must de-bias the regularized estimator (1.6) in order to obtain asymptotic normality at the √n-adjusted rate and construct confidence intervals. In the p/n → γ regime that is the focus of the present paper, this bias correction takes the following form. Under a strong convexity assumption and for X with iid N(0_p, Σ) rows, [6] proves that for most coordinates j = 1, ..., p,
Ω_jj^{-1/2} [ (n − df)(β̂_j − β_j) + e_j^⊤Σ^{-1}X^⊤(y − Xβ̂) ] / ‖y − Xβ̂‖₂ →d N(0, 1),   (1.7)
where Ω_jj = e_j^⊤Σ^{-1}e_j and df is the effective degrees of freedom of β̂, defined as the trace of the Jacobian of the map y ↦ Xβ̂ for fixed X. For Σ = I_p, and consequently Ω_jj = 1, a similar bias correction proportional to e_j^⊤X^⊤(y − Xβ̂) is visible in the asymptotic normality result of [12, Proposition 4.3(iii)], although there the coefficients (n − df) and ‖y − Xβ̂‖₂ in (1.7) are replaced with deterministic scalar counterparts obtained by solving a system of nonlinear equations of a similar nature to (1.3). Another goal of the present paper is to equip robust M-estimators (1.5), with ρ different from the squared loss, with de-biasing and asymptotic normality results similar to (1.7), allowing for general robust loss functions ρ coupled with general convex penalty functions g.
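The degrees of freedom df above can be computed directly from its definition as the trace of the Jacobian of y ↦ Xβ̂. The sketch below (our toy setup: squared loss, Ridge penalty, gradient-descent solver with an illustrative step size) does so by finite differences; for Ridge the fitted values are linear in y, so the finite-difference trace is essentially exact.

```python
import random

random.seed(0)
n, p, tau = 20, 5, 0.5

def fit_ridge(X, y, tau, iters=600):
    # Minimize (1/2)||y - Xb||^2 + (n*tau/2)||b||^2 by gradient descent;
    # the step size uses ||X||_F^2 + n*tau as an upper bound on the Hessian norm.
    n_, p_ = len(X), len(X[0])
    lr = 1.0 / (sum(v * v for row in X for v in row) + n_ * tau)
    b = [0.0] * p_
    for _ in range(iters):
        r = [y[i] - sum(X[i][j] * b[j] for j in range(p_)) for i in range(n_)]
        g = [-sum(X[i][j] * r[i] for i in range(n_)) + n_ * tau * b[j] for j in range(p_)]
        b = [b[j] - lr * g[j] for j in range(p_)]
    return b

def df_finite_diff(X, y, tau, h=1e-5):
    # df = tr[(d/dy) X betahat(y)]: perturb each y_i and read off the
    # diagonal entries of the Jacobian of the fitted values.
    n_, p_ = len(X), len(X[0])
    b0 = fit_ridge(X, y, tau)
    fit0 = [sum(X[i][j] * b0[j] for j in range(p_)) for i in range(n_)]
    df = 0.0
    for i in range(n_):
        yh = list(y)
        yh[i] += h
        bh = fit_ridge(X, yh, tau)
        df += (sum(X[i][j] * bh[j] for j in range(p_)) - fit0[i]) / h
    return df

X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(row) + random.gauss(0, 1) for row in X]  # true beta = (1, ..., 1)
df = df_finite_diff(X, y, tau)  # for Ridge, 0 < df < p
```

For Ridge the same quantity admits the closed form tr[X(X^⊤X + nτI_p)^{-1}X^⊤], but the finite-difference computation extends to estimators with no closed form.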

Contributions
Our contributions are the following.
1. We provide de-biasing and asymptotic normality results for robust M-estimators with convex penalty functions when p and n are of the same order. This leads to confidence intervals for the j-th coordinate β_j of the unknown coefficient vector β. Asymptotic normality holds for a large class of robust loss functions, including the Huber loss and its smoothed versions.

2. Although the bias correction required for asymptotic normality resembles the one-step estimators recommended in the theory of classical M-estimators to improve efficiency (e.g., [8]), a notable difference from the low-dimensional case is the requirement of a degrees-of-freedom adjustment to amplify the one-step correction. For the squared loss, this degrees-of-freedom adjustment takes the form of multiplication by (n − df) in (1.7); one contribution of this paper is to identify the degrees-of-freedom adjustment that leads to asymptotic normality for robust and regularized M-estimators, beyond the squared loss.

3. The asymptotic variance is estimated by random, data-driven quantities, as opposed to the deterministic scalars (r, c) that determine the asymptotic variance of unregularized estimators in (1.4). The fact that the asymptotic variance is estimated by observable quantities makes these results more suitable for confidence intervals (case in point: computing the solution (r, c) of (1.3) and the asymptotic variance in (1.4) requires knowledge of the noise distribution subject to contamination). The asymptotic normality result takes the form (2.12) below, where the data-driven variance estimate V̂ is again a ratio of the form average[ψ²]/average[(d/dt)ψ]², for a particular sense of average to be defined in (2.13) below. Interestingly, the expressions for this average and for V̂ do not involve the proximal mapping appearing in (1.4). This informal statement will be made precise in Section 2.4 below.

4. In order to derive these new asymptotic normality results, we develop in Appendix B new identities for random vectors uniformly distributed on the Euclidean sphere. Although the argument of the present paper for asymptotic normality is closely related to that used in [6] for the squared loss, that previous theory, developed for functions of a standard normal vector, does not extend to robust loss functions: robust losses ρ lack strong convexity, and consequently explicit lower bounds on (1/n)Σ_{i=1}^n ψ(y_i − x_i^⊤β̂)² are unavailable. Developing these new identities and the corresponding asymptotic normality results for random vectors uniformly distributed on the sphere is a crucial step to overcome the lack of global strong convexity of ρ for robust losses and to obtain the asymptotic normality results of the previous bullet points. These new identities provide novel Stein formulae for random vectors on the sphere and may be used more broadly, for instance for elliptical distributions.

Model and assumptions
We consider a linear model
y = Xβ + ε,   (2.1)
where y ∈ R^n, X ∈ R^{n×p} and ε ∈ R^n, together with a regularized M-estimator
β̂ ∈ argmin_{b∈R^p} Σ_{i=1}^n ρ(y_i − x_i^⊤b) + n g(b),   (2.2)
where ρ : R → R is the loss and g : R^p → R is the penalty.
Assumption A (Assumptions on the loss). Let ρ : R → R be convex and continuously differentiable on R, with derivative ψ = ρ' being L-Lipschitz and satisfying
inf_{t∈R} ( ψ'(t) + ψ(t)² ) ≥ K²
for some positive constant K > 0 independent of n, p, the infimum being taken over the points t at which ψ is differentiable.
Assumption B (Strong convexity of g). For some constant τ > 0 independent of n, p, the penalty g : R^p → R is τ-strongly convex, in the sense that b ↦ g(b) − τ‖b‖₂²/2 is convex.

Some useful characterizations of strong convexity are the following. Throughout, ∂g(b) denotes the subdifferential of g at b ∈ R^p. Then g is τ-strongly convex if and only if
(u − v)^⊤(a − b) ≥ τ‖a − b‖₂²  for all a, b ∈ R^p, u ∈ ∂g(a), v ∈ ∂g(b).

Assumption C. The rows of the design matrix X are iid N(0, Σ) random vectors, and all eigenvalues of Σ ∈ R^{p×p} lie in [κ, 1/κ] for some constant κ ∈ (0, 1) independent of n, p. The noise ε is independent of X and admits a density with respect to the Lebesgue measure.
Assumption D. p/n ≤ γ for some constant γ > 0 independent of n, p.
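Assumption A can be checked numerically for a given loss. The sketch below (our illustration) verifies on a grid that for the scaled Huber loss of the Elastic-Net example, where the text reports L = 1 and K² = min(1, σ²), the derivative ψ is 1-Lipschitz and ψ'(t) + ψ(t)² is bounded below by min(1, σ²); the exact form of the lower bound in Assumption A is our reading, chosen to be consistent with those reported constants.

```python
def psi_scaled_huber(u, sigma):
    # derivative of rho(u) = sigma^2 * H(u / sigma), Huber threshold 1:
    # psi(u) = u in the quadratic zone |u| <= sigma, +/- sigma outside.
    return u if abs(u) <= sigma else (sigma if u > 0 else -sigma)

def psi_prime(u, sigma):
    # a.e. derivative of psi: 1 in the quadratic zone, 0 outside
    return 1.0 if abs(u) < sigma else 0.0

for sigma in (0.5, 1.0, 2.0):
    grid = [-10.0 + 0.001 * k for k in range(20001)]
    # Lipschitz constant of psi via difference quotients: at most 1.
    L = max(abs(psi_scaled_huber(grid[k + 1], sigma) - psi_scaled_huber(grid[k], sigma)) / 0.001
            for k in range(20000))
    # lower bound on psi'(t) + psi(t)^2 over the grid
    K2 = min(psi_prime(u, sigma) + psi_scaled_huber(u, sigma) ** 2 for u in grid)
    assert L <= 1.0 + 1e-9
    assert abs(K2 - min(1.0, sigma ** 2)) < 0.01
```

Inside the quadratic zone ψ' + ψ² = 1 + u² ≥ 1, outside it equals σ², which is where the constant min(1, σ²) comes from.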

Notation
Throughout the paper, for each j ∈ [p], let e_j denote the j-th vector of the canonical basis of R^p, and define
Ω_jj = e_j^⊤Σ^{-1}e_j,   z_j = XΣ^{-1}e_j / Ω_jj,   Q_j = I_p − Σ^{-1}e_j e_j^⊤ / Ω_jj.
We remark that the above definition implies the following properties: z_j ∼ N(0_n, Ω_jj^{-1}I_n), z_j is independent of XQ_j, and e_j^⊤Q_j = 0^⊤. By construction of z_j and Q_j, the response y can be decomposed as y = β_j z_j + XQ_jβ + ε, where β_j ∈ R is the scalar parameter of interest for a fixed covariate j ∈ [p], the vector Q_jβ is a high-dimensional nuisance parameter, and ε is independent noise. Under the additional assumption ε_i ∼ N(0, 1), Ω_jj^{-1} is the Fisher information for the estimation of β_j.
In the proofs, it will be useful to treat ψ = ψ(ε, X) as a map from R^{n×(p+1)} to R^n, formally defined as ψ(ε, X) = ψ(y − Xβ̂(y, X)) with y = Xβ + ε. Since (β, ε) are unknown, we cannot compute the derivatives of this map a priori. However, for a fixed X, the quantity ψ(ε, X) is observable, since all terms involving (β, ε) cancel out: ψ(ε, X) is simply ψ(y − Xβ̂) with β̂ in (2.2). We can thus define the observable matrix of size n × n
∇_y ψ = (∂/∂y) ψ(y − Xβ̂(y, X)),   (2.8)
which exists for Lebesgue-almost every y, and hence with probability one, since y has a continuous distribution under Assumption C. Furthermore, the gradient (2.8) at the currently observed data (y, X) does not depend on any unknown quantity: it can be computed from (y, X) either by finding a closed-form expression for (2.8) for a given penalty function, or by approximation using finite differences or other numerical methods. Finally, throughout the paper, Φ(t) = (2π)^{-1/2}∫_{−∞}^t exp(−u²/2) du is the standard normal cumulative distribution function.
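The finite-difference route for computing (2.8) can be sketched as follows (our toy setup: p = 1, Huber loss with threshold σ, Ridge penalty g(b) = τb²/2 so that the one-dimensional KKT equation can be solved by bisection). The diagonal entries of the Jacobian of y ↦ ψ(y − Xβ̂(y)) are accumulated into its trace, which can be cross-checked against implicit differentiation of the KKT condition.

```python
import random

random.seed(1)
n, tau, sigma = 25, 0.5, 1.0

def psi(u):
    # derivative of the Huber loss with threshold sigma
    return u if abs(u) <= sigma else (sigma if u > 0 else -sigma)

def fit(x, y):
    # p = 1: solve the KKT equation sum_i x_i psi(y_i - x_i b) = n tau b
    # by bisection (the score below is decreasing in b).
    def score(b):
        return sum(x[i] * psi(y[i] - x[i] * b) for i in range(n)) - n * tau * b
    lo, hi = -100.0, 100.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * x[i] + random.gauss(0, 1) for i in range(n)]

def psi_vec(yv):
    b = fit(x, yv)
    return [psi(yv[i] - x[i] * b) for i in range(n)]

def trace_grad_psi(h=1e-6):
    # trace of the observable n x n matrix (2.8), one diagonal entry at a time
    base = psi_vec(y)
    tr = 0.0
    for i in range(n):
        yh = list(y)
        yh[i] += h
        tr += (psi_vec(yh)[i] - base[i]) / h
    return tr

tr_fd = trace_grad_psi()
```

Implicit differentiation of the KKT equation gives, in this toy model, tr[∇_yψ] = n̂ − S/(S + nτ) with n̂ the number of residuals in the quadratic zone and S = Σ_i x_i² 1{|y_i − x_iβ̂| ≤ σ}, which the finite-difference trace reproduces.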

Main result
In the following result, we implicitly consider a sequence of integer pairs (n, p), regression problems (2.1) and M-estimators (2.2). For instance, one can think of p = p_n as a nondecreasing function of n, with (g, β, β̂, ρ) also implicitly indexed by n and with values possibly changing with n.

Theorem 2.1. Let Assumptions A, B, C and D be fulfilled, write ψ = ψ(y − Xβ̂), and let ∇_yψ be the observable matrix (2.8). Define ξ̃_j as in (2.9) and
ξ_j = ( tr[∇_yψ](β̂_j − β_j) + e_j^⊤Σ^{-1}X^⊤ψ ) / ( Ω_jj ‖ψ‖₂ ).   (2.10)
Then for any positive sequence (a_p) with lim_{p→+∞} a_p = +∞, there exists a subset J_{n,p} ⊆ [p] with |J_{n,p}| ≥ p − a_p such that
max_{j∈J_{n,p}} sup_{t∈R} | P(Ω_jj^{1/2} ξ_j ≤ t) − Φ(t) | → 0,   (2.11)
and the same holds with ξ_j replaced by ξ̃_j.

The proof is given in Appendix A.2. To interpret the above result, we remark that for any slowly increasing sequence a_p, such as a_p = log p or a_p = log log p, the asymptotic normality (2.11) holds uniformly over all coordinates j = 1, ..., p except at most a_p of them, so that both ξ_j and ξ̃_j are asymptotically pivotal for an overwhelming majority of the β_j. Another interpretation is given in the following corollary: for any given precision threshold υ > 0, there exist at most a* coordinates j = 1, ..., p such that |P(Ω_jj^{1/2}ξ_j ≤ t) − Φ(t)| > υ.

Corollary 2.2. Let the setting and assumptions of Theorem 2.1 be fulfilled. For any t ∈ R and any arbitrarily small constant υ > 0 independent of n, p, define
J^υ_{n,p} = { j ∈ [p] : |P(Ω_jj^{1/2}ξ_j ≤ t) − Φ(t)| > υ }.
Then sup_{n,p} |J^υ_{n,p}| ≤ a* for a certain constant a* depending on {υ, τ, R, γ, L, K, t} only. In other words, for any (n, p) with p/n ≤ γ, there are at most a constant number of coordinates j = 1, ..., p such that P(Ω_jj^{1/2}ξ_j ≤ t) deviates from Φ(t) by more than υ. The same conclusion holds with ξ_j replaced by ξ̃_j.

Proof of Corollary 2.2. We proceed by contradiction. If the claim does not hold, there exists a constant υ* > 0 such that |J^{υ*}_{n,p}| ≥ 2a_p for an unbounded sequence a_p. By extracting a subsequence if necessary, we may assume without loss of generality that a_p is monotonically increasing with a_p → +∞. By Theorem 2.1, there exists J_{n,p} ⊆ [p] such that (2.11) holds. By definition of J^{υ*}_{n,p} and J_{n,p}, we have J_{n,p} ∩ J^{υ*}_{n,p} = ∅ for p large enough. This implies that p ≥ |J_{n,p}| + |J^{υ*}_{n,p}| ≥ (p − a_p) + 2a_p for p large enough, a contradiction.

Data-driven variance estimate
Except for at most a constant number of coordinates j ∈ [p], the approximation
β̂_j + e_j^⊤Σ^{-1}X^⊤ψ / tr[∇_yψ] − β_j ≈ N( 0, Ω_jj V̂ / n )   (2.12)
holds, where the observable bias correction e_j^⊤Σ^{-1}X^⊤ψ / tr[∇_yψ] is given by the first term in (2.10), and the data-driven variance estimate that characterizes the length of confidence intervals for β_j is
V̂ = ( ‖ψ‖₂² / n ) / ( tr[∇_yψ] / n )².   (2.13)
This ratio of an average of ψ² to the square of an average of a derivative of ψ mirrors the robust asymptotic results (1.1) and (1.4) discussed in the introduction. Confidence intervals can be readily constructed from Theorem 2.1 or the informal approximation (2.12): a 95%-confidence interval for β_j is given by
β̂_j + e_j^⊤Σ^{-1}X^⊤ψ / tr[∇_yψ] ± 1.96 Ω_jj^{1/2} V̂^{1/2} n^{-1/2}.
In contrast with the asymptotic variance in (1.4), the above V̂ involves observable quantities only. In particular, while the asymptotic variance in (1.4) depends on the distribution of the noise through the solutions (r, c) of the system (1.3), knowledge of the noise distribution is not required to compute V̂ and construct confidence intervals for β_j. Theorem 2.1 is valid for p/n ≤ γ, without requiring a specific limit for the ratio p/n. Theorem 2.1 is also valid for p = o(n), so that Theorem 2.1 and the estimated asymptotic variance (2.13) unify both low- and high-dimensional asymptotic normality results.
For the Huber loss, the quantity tr[∇_yψ] has a simpler form. By the chain rule,
∇_yψ = diag(ψ') ( I_n − (∂/∂y)Xβ̂ ),
where the differentiation is understood holding X fixed as in (2.8). With n̂ = tr[diag(ψ')] the number of observations such that y_i − x_i^⊤β̂ falls in the range where the Huber loss is quadratic, and df̂ = tr[diag(ψ')(∂/∂y)Xβ̂] representing the degrees of freedom of the M-estimator β̂, the quantity tr[∇_yψ] appearing in the denominator of V̂ simplifies to tr[∇_yψ] = n̂ − df̂. In this case, the asymptotic normality for ξ_j in Theorem 2.1 takes the form
Ω_jj^{-1/2} ( (n̂ − df̂)(β̂_j − β_j) + e_j^⊤Σ^{-1}X^⊤ψ ) / ‖ψ‖₂ →d N(0, 1)   (2.15)
uniformly over all j ∈ J_{n,p}. This extends the asymptotic normality result (1.7) to the Huber loss with the following important modifications: the sample size n is replaced by n̂, and the residuals y − Xβ̂ are replaced by ψ = ψ(y − Xβ̂). The variance V̂ that determines the length of the confidence interval resulting from (2.12) presents a trade-off between ‖ψ‖₂²/n, the effective sample size n̂, and the degrees of freedom df̂: for confidence intervals of small length, the M-estimator β̂ should have small residuals as measured by ‖ψ(y − Xβ̂)‖₂², small degrees of freedom df̂, and a large effective sample size n̂.
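In code, the confidence-interval half-width implied by (2.12)-(2.13) for the Huber loss reads as follows (a sketch; the inputs ψ, n̂, df̂ and Ω_jj are assumed to have been computed elsewhere, and the degenerate case n̂ − df̂ ≤ 0 is mapped to an infinite-length interval, as discussed in the simulation section).

```python
import math

def huber_ci_halfwidth(psi_vals, nhat, dfhat, omega_jj, level_z=1.96):
    # V^{1/2} n^{-1/2} = ||psi||_2 / (nhat - dfhat), the Huber analogue of
    # ||y - X betahat||_2 / (n - df) for the squared loss.
    psi_norm = math.sqrt(sum(v * v for v in psi_vals))
    if nhat - dfhat <= 0:
        # degenerate: the interval length blows up
        return float("inf")
    return level_z * math.sqrt(omega_jj) * psi_norm / (nhat - dfhat)

# toy numbers: 200 observations, 150 in the quadratic zone, df = 60
example = huber_ci_halfwidth([0.5] * 200, 150, 60.0, 1.0)
```

The half-width shrinks when the score norm ‖ψ‖₂ is small, the effective sample size n̂ is large, and the degrees of freedom df̂ are small, matching the trade-off described above.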

Example
This section specializes the above results to scaled versions of the Huber loss and the Elastic-Net penalty. The Huber loss is
H(u) = u²/2 for |u| ≤ 1,  H(u) = |u| − 1/2 otherwise,   (2.14)
and we consider the scaled Huber loss ρ(u) = σ²H(σ^{-1}u), where σ > 0 is a scaling parameter, together with the Elastic-Net penalty
g(b) = λ‖b‖₁ + τ‖b‖₂²/2.   (2.16)
Since the derivative H' is 1-Lipschitz, so is ψ = ρ'. Furthermore, ψ'(u) + ψ(u)² equals 1 + u² in the range |u| < σ where ρ is quadratic and equals σ² outside of it, so that Assumption A is satisfied with L = 1 and K² = min(1, σ²). Assumption B is also satisfied, as the penalty is the sum of the τ-strongly convex function b ↦ τ‖b‖₂²/2 and the convex function b ↦ λ‖b‖₁. In this example, n̂ is the number of observations i = 1, ..., n such that y_i − x_i^⊤β̂ falls in the range where ρ is quadratic, Ŝ = {j ∈ [p] : β̂_j ≠ 0} is the selected support, and X_Ŝ ∈ R^{n×|Ŝ|} is the submatrix of X containing the columns indexed in Ŝ.
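The M-estimator (2.2) with scaled Huber loss (2.14) and Elastic-Net penalty (2.16) can be computed by proximal gradient descent. The sketch below is our illustration (the ISTA iteration, the n-scaling of the penalty in the objective, and the crude Frobenius-norm step-size bound are our choices), and it also extracts the support Ŝ and the effective sample size n̂ used above.

```python
import random

random.seed(2)
n, p = 60, 10
sigma, lam, tau = 1.0, n ** -0.5, 0.1

def psi(u):
    # derivative of the scaled Huber loss, quadratic zone |u| <= sigma
    return u if abs(u) <= sigma else (sigma if u > 0 else -sigma)

def soft(t, thr):
    # soft-thresholding, the proximal map of thr * |.|
    return (abs(t) - thr) * (1.0 if t > 0 else -1.0) if abs(t) > thr else 0.0

X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
beta = [1.0] * 3 + [0.0] * (p - 3)                 # sparse truth
y = [sum(X[i][j] * beta[j] for j in range(p)) + random.gauss(0, 1) for i in range(n)]

# ISTA on  sum_i rho(y_i - x_i'b) + n*(lam*||b||_1 + tau*||b||_2^2/2);
# step size 1/(||X||_F^2 + n*tau) upper-bounds the smooth part's curvature
step = 1.0 / (sum(v * v for row in X for v in row) + n * tau)
b = [0.0] * p
for _ in range(1500):
    r = [y[i] - sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
    ps = [psi(ri) for ri in r]
    grad = [-sum(X[i][j] * ps[i] for i in range(n)) + n * tau * b[j] for j in range(p)]
    b = [soft(b[j] - step * grad[j], step * n * lam) for j in range(p)]

S_hat = [j for j in range(p) if b[j] != 0.0]       # selected support
r = [y[i] - sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
n_hat = sum(1 for ri in r if abs(ri) <= sigma)     # residuals in quadratic zone
```

From (β̂, Ŝ, n̂) one can then form the degrees of freedom df̂ restricted to the columns X_Ŝ and the variance estimate V̂ of Section 2.4.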

Simulation study
We provide here simulations to showcase the asymptotic normality in the Huber loss and Elastic-Net penalty example of the previous section.
We set n = 200 and p = 300, generate the coordinates of β from iid Bernoulli variables with parameter 0.1, and consider the heavy-tailed noise distributions specified below. We compute 1000 simulations of the Z-score Ω_jj^{1/2}ξ_j from Theorem 2.1 for j = 1, for the M-estimator with the Huber loss and the Elastic-Net penalty (2.16) with σ = 1 and the four combinations (λ, τ) ∈ {n^{-1/2}, 2n^{-1/2}} × {0, 0.1}. The covariance matrix Σ is set as A^⊤A/n, where A ∈ R^{2p×p} has iid Rademacher entries; Σ is generated once and is the same across the 1000 simulations. The average value and standard error over the 1000 simulations of n̂, df̂, |Ŝ| and V̂^{1/2}n^{-1/2} are presented in Table 2 for ε with iid standard Cauchy components, and in Table 3 for ε with iid components from the t-distribution with 2 degrees of freedom, together with histograms and QQ-plots against standard normal quantiles of Ω_jj^{1/2}ξ_j in (2.15).
The quantity V̂^{1/2}n^{-1/2} = ‖ψ‖₂/(n̂ − df̂), featured in the boxplots of Figure 1, characterizes the length of our confidence intervals in (2.12). Computing the values of V̂^{1/2} for different tuning parameters lets the practitioner pick the tuning parameters leading to the smallest confidence interval width, although this process amounts to the construction of multiple confidence intervals and warrants a Bonferroni multiple-testing correction.
The histograms and QQ-plots in Tables 2 and 3 confirm the normality of ξ_j for these two heavy-tailed continuous noise distributions.
Since n̂ − df̂ appears in the denominator of V̂, the length of the confidence intervals can be large if n̂ − df̂ is close to zero, and the length is infinite if n̂ − df̂ = 0. This explains the large values and large variances observed in the boxplots of Figure 1 for small tuning parameters.

Table 2
Averages and standard errors over 1000 simulations of (n̂, df̂, |Ŝ|, V̂^{1/2}n^{-1/2}), as well as histograms and QQ-plots for Ω_jj^{1/2}ξ_j given in (2.10). The coordinate j is always j = 1. The noise ε_i is iid standard Cauchy. The red line on the QQ-plots is the diagonal line with equation x = y.

Table 3
Averages and standard errors over 1000 simulations of (n̂, df̂, |Ŝ|, V̂^{1/2}n^{-1/2}), as well as histograms and QQ-plots for Ω_jj^{1/2}ξ_j given in (2.10). The coordinate j is always j = 1. The noise ε_i is iid from the t-distribution with 2 degrees of freedom. The red line on the QQ-plots is the diagonal line with equation x = y.
Figure 1: Boxplots of V̂^{1/2}n^{-1/2} = ‖ψ‖₂/(n̂ − df̂). The simulation setup is described in Section 2.6. The top plot corresponds to iid standard Cauchy noise ε_i; the noise in the second plot is iid from the t-distribution with 2 degrees of freedom. Different colors correspond to different values of τ ∈ {0, 10^{-3}, 10^{-2}, 0.1, 0.5}. On the x-axis, λ takes values in {0.1n^{-1/2}, …}.

Lemma A.2. Let Assumptions A, B and C be fulfilled. Let E_j be the conditional expectation given (‖z_j‖₂, XQ_j, ε). Then, when u* > 0, the bound (A.3) holds.

A.2. Proof of the main result

Proof of Theorem 2.1. Since ψ ≠ 0_n for almost every X by Proposition D.2, ξ_j is well-defined with P-probability 1. By Lemma A.2 and the monotone convergence theorem as δ → 0 applied to the left-hand side of (A.3), the bound holds when u* > 0. Under our assumptions, ‖Σ‖_op ≤ 1/κ < +∞ and E[‖Σ^{1/2}h‖₂²] ≤ R < +∞, so that there exist a finite N > 0 and A < +∞, independent of n, p, such that (A.4) holds for n ≥ N. By Markov's inequality with respect to the uniform distribution on [p] = {1, ..., p}, the set J_{n,p} can be chosen with |J_{n,p}| ≥ p − a_p.

In this paragraph, for a given, fixed value of (ε, XQ_j), we view ψ as a function of z_j instead of (2.8), and we denote its Jacobian by ∇ψ(z_j) at any point z_j where the map (A.6) is Fréchet differentiable. Next, we argue conditionally on (XQ_j, ‖z_j‖₂): since z_j/‖z_j‖₂ is independent of (‖z_j‖₂, XQ_j) and by rotational invariance of the Gaussian distribution, conditionally on (‖z_j‖₂, XQ_j) the vector z_j is uniformly distributed on the sphere S^{n-1}(‖z_j‖₂). Let E_j denote the conditional expectation given (XQ_j, ‖z_j‖₂). By Proposition C.4, the above function is locally Lipschitz. From Proposition C.4 (iii) we obtain (A.7), and its right-hand side is finite with probability one with respect to (XQ_j, ‖z_j‖₂) thanks to (A.4) and Tonelli's theorem for non-negative measurable functions. By Proposition D.2, conditionally on almost every (XQ_j, ‖z_j‖₂), we have ψ(z_j) ≠ 0_n for almost every z_j ∈ R^n. We are now in position to apply Proposition A.1 with f = ψ/‖ψ‖₂ and z = z_j, where P_v^⊥ := I_n − P_v with P_v := vv^⊤/‖v‖₂² for any v ∈ R^n. By Proposition A.1 and (A.7), the corresponding remainder terms, where the upper bounds follow from the indicators I_{E_j}, converge in probability to 0 uniformly over j ∈ J_{n,p}.
We now study the asymptotic distribution of ξ_f(z_j). Since z_j ∼ N(0_n, Ω_jj^{-1}I_n) is independent of XQ_j by Proposition D.1, without loss of generality we may assume that z_j = ‖z_j‖₂ζ_j/‖ζ_j‖₂ for some ζ_j ∼ N(0_n, I_n) independent of (XQ_j, ‖z_j‖₂); then E_j coincides with the conditional expectation of ζ_j given (XQ_j, ‖z_j‖₂). After some rearrangement, uniformly over j ∈ J_{n,p} we have (i)-(ii), and (iii) Ω_jj^{1/2}‖z_j‖₂/‖ζ_j‖₂ → 1 in probability. By Slutsky's theorem, Ω_jj^{1/2}ξ_f(z_j) converges in distribution to N(0, 1) uniformly over j ∈ J_{n,p}; that is, for any t ∈ R, max_{j∈J_{n,p}} |P(Ω_jj^{1/2}ξ_f(z_j) ≤ t) − Φ(t)| → 0.

Appendix B: Stein formulae on the sphere

For h a smooth function on S^{n-1}(R) and Ω an open neighborhood of S^{n-1}(R), define the smooth function ȟ : Ω → R by ȟ(x) = h(Rx/‖x‖₂), and define the gradient of h as that of ȟ, i.e., for x ∈ S^{n-1}(R) set ∇h(x) = ((∂/∂x₁)ȟ(x), ..., (∂/∂x_n)ȟ(x)). For every x ∈ S^{n-1}(R), the gradient ∇h(x) belongs to the hyperplane orthogonal to x, which is the tangent space of S^{n-1}(R) at x. Furthermore, if γ : R → S^{n-1}(R) is a smooth curve with γ(0) = x and (d/dt)γ(t)|_{t=0} = v, then v^⊤x = 0 and ∇h(x)^⊤v = (d/dt)h(γ(t))|_{t=0}. For such a smooth function h equipped with its gradient, for α ∈ {1, 2} we define the Sobolev norm
‖h‖_{W^{1,α}} = ( E[|h(z)|^α] + E[‖∇h(z)‖₂^α] )^{1/α},  z ∼ Unif(S^{n-1}(R)),
and the Sobolev space W^{1,α}(S^{n-1}(R)) as the completion of the space of smooth functions over S^{n-1}(R) with respect to the above norm. This definition is equivalent to the definition given in [18, Definition 2.1]. By [18, Proposition 2.3], since the manifold S^{n-1}(R) is compact, the resulting Sobolev spaces do not depend on the chosen metric. Equivalently, the Sobolev space W^{1,α}(S^{n-1}(R)) is also the completion, with respect to the above norm, of the space of once continuously differentiable functions on S^{n-1}(R).
If h is locally Lipschitz on S^{n-1}(R) (i.e., every point has a neighborhood in S^{n-1}(R) on which h is Lipschitz), then again by considering an open neighborhood Ω ⊂ R^n of S^{n-1}(R), the function ȟ(x) = h(Rx/‖x‖₂) is locally Lipschitz on Ω. Thus, in this case and by Rademacher's theorem, ∇h(z) is well defined almost everywhere in S^{n-1}(R) as the gradient of ȟ(x). For a vector-valued f = (f₁, ..., f_n), the gradient ∇f is the matrix in R^{n×n} with columns ∇f₁, ..., ∇f_n.

Lemma B.1 (Stein's formula on the sphere). Let n ≥ 3, R > 0 and z ∼ Unif(S^{n-1}(R)). Then, for all f with coordinates in W^{1,1}(S^{n-1}(R)),
E[ z^⊤f(z) ] = ( R²/(n−1) ) E[ tr( P_z^⊥ ∇f(z) ) ],   (B.1)
where ∇f = (∇f₁, ..., ∇f_n). Moreover, for all f with coordinates in W^{1,2}(S^{n-1}(R)), the inequalities (B.2)-(B.4) hold, with (B.2) being the Poincaré inequality
Var[f₁(z)] ≤ ( R²/(n−1) ) E[ ‖∇f₁(z)‖₂² ].   (B.2)
Note that (B.2) is the classical Poincaré inequality on the sphere. A proof is provided for completeness.
Proof of Lemma B.1. As the operators T and T_{ij}, defined by Tf(z) = f(z)^⊤z and T_{ij}f(z) = [∇f(z)P_z^⊥]_{ij} for i, j = 1, ..., n, are all continuous linear mappings from W^{1,α}(S^{n-1}(R))^n to the L^α space with norm ‖f‖_{L^α} = (E[|f(z)|^α])^{1/α}, we may assume without loss of generality that all coordinates of f are continuously differentiable. Indeed, if (f^{(q)})_{q≥1} is a sequence of smooth functions over the sphere converging to f in W^{1,α}(S^{n-1}(R))^n for α = 1 and (B.1) holds for each f^{(q)}, then (B.1) also holds for the limit by continuity; the same argument applies with α = 2 for (B.2)-(B.3)-(B.4).

Let ζ ∼ N(0, I_n). Assume without loss of generality that z = Rζ/‖ζ‖₂, since Rζ/‖ζ‖₂ ∼ Unif(S^{n-1}(R)). Let φ(t) be a continuously differentiable function on R with φ(t) = 0 for t ≤ 0 and φ(t) = 1 for t ≥ 1. For δ > 0, define ϕ_δ(x) = φ(‖x‖₂/δ)f(Rx/‖x‖₂). As ‖ζ‖₂ is independent of z and ϕ_δ has uniformly bounded gradient, the first-order Stein formula applied to ϕ_δ yields the desired identity; by the product and chain rules, ∇ϕ_δ(x) is computed with φ'(t) = (d/dt)φ(t). Letting δ → 0 and exchanging limit and expectation, which is allowed as ϕ_δ → ϕ_{0+} = f, the Gaussian Poincaré inequality [9, Theorem 1.6.4] yields the bound, and since (‖ζ‖₂, z) are independent, we obtain (B.2). Finally, by applying the second-order Stein formula of [5] to ϕ_δ(ζ), we find by dominated convergence the corresponding identity, where the last equality follows from the independence of z and ‖ζ‖₂. We now simplify the left-hand side in order to get rid of ‖ζ‖₂: by expanding the square and using the independence of (z, ‖ζ‖₂), the above display equals the desired expression. The proof is complete, since (B.4) follows directly from (B.3) by the Cauchy-Schwarz inequality.
Appendix C: Bounds on (h, ψ)

The goal of this section is to prove Lemma A.2.

C.1. High-probability events E_j
Lemma C.1 (High probability of E_j). Assume that X has iid N(0, Σ) rows. Then (C.1) holds. Furthermore, with η_n = √(2 log(n)/n) + n^{-1/2}, the events E_j of Lemma C.1 satisfy the stated probability bounds.

Proof of Lemma C.1. Let us first notice that XΣ^{-1/2} is a random Gaussian matrix with iid standard normal entries. Theorem 7.3.1 in [31] provides the expectation bound, and since the operator norm of a matrix is a 1-Lipschitz function of its entries, the Gaussian Poincaré inequality [10, Theorem 3.20] gives Var(‖XΣ^{-1/2}‖_op) ≤ 1. This proves (C.1). By Theorem II.13 in [14], we have the corresponding deviation inequality for t > 0. With this we can conclude, using a union bound over j ∈ [p] with t = √(2 log(n)/n):
P(∩_{j∈[p]} E_j) ≥ 1 − (2p + 1)Φ(−√(2 log n)).
Thanks to Φ(−t) ≤ (2π)^{-1/2}exp(−t²/2)/t for t > 0, the claimed bound follows.

Remark C.1. Our specific construction of E_j satisfies the following properties: 1. I_{E_j} is a function of ‖XQ_j‖_op and ‖z_j‖₂ only; consequently, the event E_j is independent of the direction z_j/‖z_j‖₂.

C.2. Twice continuously differentiable penalty
Lemma C.2. Let L, τ be such that Assumptions A and B are fulfilled. Further assume that the Hessian matrix G = ∇²g(b)|_{b=β̂} of g at β̂ exists and is symmetric, and define the matrices V and M accordingly. Then, with the partial order A ⪯ B if and only if the matrix B − A is positive semidefinite, the bounds (C.4)-(C.6) hold.

Proof of Lemma C.2. Throughout the proof, we write B = diag(ψ')^{1/2}Xn^{-1/2}G^{-1/2}. By some algebra and combining the resulting upper bounds, we obtain (C.4). For the upper bound on M, the same computation gives (C.5). For the lower and upper bounds on V, we first notice that they hold by the definition of B; by the L-Lipschitz property of ψ, we have diag(ψ') ⪯ LI_n, so that (C.6) holds.
For (ii) and (iii), additionally assume that g is twice continuously differentiable.
(ii) For almost every X and every a ∈ R^p, the identities (C.7)-(C.8) hold. (iii) For almost every X and every a ∈ R^p, if ψ ≠ 0_n, then (C.9) holds. We remark that, in view of (A.6), ∇_zψ = ∇ψ(z_j) when a = e_j.
Proof. (i) We refer the reader to the proof of Proposition C.4.
(ii) As the functions h and ψ are Lipschitz on every compact set, their Fréchet derivatives exist almost everywhere by Rademacher's theorem. Moreover, the chain rule holds almost everywhere by [33, Theorem 2.1.11]. Let G, V and M be as in Lemma C.2. Differentiating the KKT conditions X^⊤ψ(X) = n∇g(b)|_{b=β+h} and applying the chain rule, we obtain the resulting system for almost every X; solving this system gives (C.7) and (C.8).
(iii) Since the map v ↦ v/‖v‖₂, v ∈ R^n, has Fréchet derivative ‖v‖₂^{-1}P_v^⊥ at every point v ≠ 0, the chain rule yields (C.9) for almost every X such that ψ(X) ≠ 0_n.

Proof of Lemma A.2 when g is twice continuously differentiable. Here, we further assume that g is twice differentiable, so that V and M in Lemma C.2 are well-defined.
By Proposition C.4, the map z_j ↦ (h, ψ) is locally Lipschitz; thus the map z_j ↦ h_jψ(‖ψ‖₂² + δ)^{-1} is also locally Lipschitz. By Proposition D.2, without loss of generality we may assume that ψ ≠ 0_n at almost every point z_j ∈ R^n. We apply the first-order Stein formula on the sphere (B.1), with the factor (n − 1). For the terms (C.12)-(C.14), by Lemma C.2 and ‖P_{z_j}^⊥‖_op ≤ 1 we obtain the stated bounds. Leaving term (C.11) unchanged and using the above inequalities, we bound the sum. We now bound (C.10) from below. Using ‖X‖_op ≤ ‖XQ_j‖_op + ‖z_j‖₂ for all j ∈ [p], we have U_j ⪰ 0 and diag(ψ') ⪯ V ⪯ LI_n. Since K²I_n ⪯ diag(ψ') + diag{ψ_i², i = 1, ..., n} holds, multiplying the second inequality by U_j and summing yields K²U_jI_nU_j ⪯ diag{ψ_i²} + V. Multiplying both sides by P_{z_j}^⊥ on the left and on the right, and using P_{z_j}^⊥ ⪯ I_n and tr P_{z_j}^⊥ = n − 1, we find the desired bound. Multiplying by (‖ψ‖₂² + δ)^{-1}h_j² and taking the conditional expectation E_j, we obtain the intermediate inequality. Note that, by definition of u* and the events of Lemma C.1, when u* > 0 the stated lower bound holds. Combining (C.16) with the previous display, multiplying both sides by I_{E_j} and summing over j ∈ [p], we find the aggregated bound. Taking expectations E, and letting diag{I_{E_j}} denote the diagonal matrix with j-th diagonal element I_{E_j}, we find the final display, where the second inequality follows from the Cauchy-Schwarz inequality and (‖ψ‖₂² + δ)^{-1/2}‖ψ‖₂ ≤ 1, and the third inequality follows from the bound on Ω. Since A > 0, the inequality Ax² + Bx + C ≤ 0 implies that x lies between the two real roots of the polynomial AX² + BX + C; in particular, x is smaller than the largest root, i.e., x ≤ (−B + √(B² − 4AC))/(2A). The upper bound (C.1) on E[‖XΣ^{-1/2}‖_op²] completes the proof.
C.3. Non-smooth strongly convex penalty g

C.3.1. Almost everywhere differentiability
In this section, we establish the almost-everywhere existence of the Jacobian matrices. We also note that if the penalty g is not twice differentiable, the matrices M and V of Lemma C.2 are not well defined; in this case we do not have explicit formulae for the Jacobian matrices (∂/∂z_j)ψ and (∂/∂z_j)h such as those in terms of V, diag(ψ'), M in Proposition C.3 (ii) and (iii). We therefore provide upper bounds on the norms of these Jacobian matrices without using Proposition C.3.
Proposition C.4. Let ρ : R → R be convex and continuously differentiable with derivative ψ = ρ' being L-Lipschitz. Let g : R^p → R be strongly convex with parameter τ > 0. Let X, X̃ ∈ R^{n×p} and ε, ε̃ ∈ R^n, and correspondingly define the associated estimators; then (C.20) holds. (iii) For any fixed ε ∈ R^n, and for any η ∈ R^n and a ∈ R^p, the stated bounds hold. The content of the above proposition appeared in [4, Proposition 5.2] with variables (y, X) instead of (ε, X). It follows from strong convexity and the KKT conditions of β̂ and the perturbed estimator. Its proof is provided below for completeness. An application of Proposition C.4 to the normalized ψ yields the following.

C.3.2. Approximation using a smooth penalty g_ν
Lemma C.6 (Approximation of strongly convex functions). Let g : R^p → R be strongly convex with constant τ ≥ 0. Then for every ν > 0 there exists a real-analytic strongly convex function g_ν : R^p → R with constant τ such that g_ν − ν ≤ g ≤ g_ν.
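A standard construction in the spirit of Lemma C.6 is the Moreau envelope, illustrated below in one dimension for an Elastic-Net-type g(x) = λ|x| + τx²/2 (our example; note that the Moreau envelope is only C¹ rather than real-analytic, and its strong-convexity constant is τ/(1 + μτ) rather than exactly τ, so the paper's construction necessarily differs, but the uniform approximation property is the same).

```python
def g(x, lam=1.0, tau=0.5):
    # 1-D Elastic-Net-type penalty: strongly convex with constant tau
    return lam * abs(x) + tau * x * x / 2.0

def moreau_env(x, mu, lam=1.0, tau=0.5):
    # env_mu(x) = min_u g(u) + (x - u)^2 / (2 mu), computed in closed form:
    # the minimizer is u = soft(x, lam*mu) / (1 + mu*tau)
    s = abs(x) - lam * mu
    u = (s if s > 0 else 0.0) * (1.0 if x >= 0 else -1.0) / (1.0 + mu * tau)
    return g(u, lam, tau) + (x - u) ** 2 / (2.0 * mu)

mu = 0.01
grid = [-3.0 + 0.01 * k for k in range(601)]
# env_mu <= g everywhere (take u = x in the min), and the gap shrinks with mu
gaps = [g(x) - moreau_env(x, mu) for x in grid]
```

On the grid, the gap g − env_μ is nonnegative and uniformly small, and taking μ → 0 drives it to 0, mirroring the sandwich g_ν − ν ≤ g ≤ g_ν of Lemma C.6 up to an additive constant.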