Finite-sample Analysis of M-estimators using Self-concordance

We demonstrate how self-concordance of the loss can be exploited to obtain asymptotically optimal rates for M-estimators in finite-sample regimes. We consider two classes of losses: (i) canonically self-concordant losses in the sense of Nesterov and Nemirovski (1994), i.e., with the third derivative bounded by the $3/2$ power of the second; (ii) pseudo self-concordant losses, for which the power is removed, as introduced by Bach (2010). These classes contain some losses arising in generalized linear models, including logistic regression; in addition, the second class includes some common pseudo-Huber losses. Our results consist in establishing the critical sample size sufficient to reach the asymptotically optimal excess risk for both classes of losses. Denoting by $d$ the parameter dimension and by $d_{\text{eff}}$ the effective dimension, which takes into account possible model misspecification, we find the critical sample size to be $O(d_{\text{eff}} \cdot d)$ for canonically self-concordant losses, and $O(\rho \cdot d_{\text{eff}} \cdot d)$ for pseudo self-concordant losses, where $\rho$ is a problem-dependent local curvature parameter. In contrast to existing results, we impose only local assumptions on the data distribution, assuming that the calibrated design, i.e., the design scaled by the square root of the second derivative of the loss, is subgaussian at the best predictor $\theta_*$. Moreover, we obtain improved bounds on the critical sample size, scaling near-linearly in $\max(d_{\text{eff}},d)$, under the extra assumption that the calibrated design is subgaussian in the Dikin ellipsoid of $\theta_*$. Motivated by these findings, we construct canonically self-concordant analogues of the Huber and logistic losses with improved statistical properties. Finally, we extend some of these results to $\ell_1$-regularized M-estimators in high dimensions.


Introduction and problem formulation
Recall the standard statistical learning setup: given a set Θ ⊆ R^d that parametrizes the space of possible hypotheses, and observing a random Z ∈ 𝒵 with unknown distribution P, one would like to minimize the population risk L(θ) := E[ℓ_Z(θ)]. For each possible observation z of Z, the loss ℓ_z : Θ → R specifies the cost of choosing θ under the outcome {Z = z}, and E[·] is the expectation with respect to the distribution P. This distribution is assumed unknown, so the population risk cannot be computed and minimized directly. Instead, one is granted access to the sample (Z_1, ..., Z_n) of independent copies of Z, and uses it to construct an estimate θ̂ of the population risk minimizer θ_*, assuming that such a minimizer exists. As such, we can consider the empirical distribution P_n, the uniform probability measure supported on the sample, and the empirical risk L_n(θ), defined as the observable counterpart of L(θ), namely, L_n(θ) := (1/n) ∑_{i=1}^n ℓ_{Z_i}(θ).
Ideally, we would like to have an estimator with small excess risk L(θ̂) − L(θ_*), in probability or in expectation over the sample. Since for each fixed value θ of the parameter, L_n(θ) is an unbiased estimate of L(θ) which converges to L(θ) almost surely by the law of large numbers, a natural candidate estimator of θ_* is the empirical risk minimizer (ERM), defined as θ̂_n ∈ Argmin_{θ∈Θ} L_n(θ).
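As a concrete illustration of the ERM principle (our own sketch, not part of the analysis in this paper), the following minimal example fits logistic regression by plain gradient descent on the empirical risk; the data-generating parameters, step size, and iteration count are arbitrary choices made for the illustration.

```python
import numpy as np

def logistic_loss(theta, X, y):
    """Empirical risk L_n(theta) for the logistic loss, computed stably."""
    eta = X @ theta
    # log(1 + e^eta) - y*eta, written to avoid overflow for large |eta|
    return np.mean(np.log1p(np.exp(-np.abs(eta))) + np.maximum(eta, 0) - y * eta)

def erm(X, y, steps=2000, lr=0.5):
    """Empirical risk minimizer approximated by plain gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # predicted P{Y = 1}
        theta -= lr * X.T @ (p - y) / len(y)     # gradient of L_n
    return theta

rng = np.random.default_rng(0)
n, d = 5000, 3
theta_star = np.array([1.0, -0.5, 0.25])         # hypothetical "true" parameter
X = rng.standard_normal((n, d))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_star))).astype(float)
theta_hat = erm(X, y)                             # ERM for this sample
```

With this (well-specified) sample size, the ERM lands close to θ_*, as the asymptotic theory reviewed next predicts.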
In this paper, we are concerned with establishing high-probability finite-sample bounds on the excess risk L(θ̂_n) − L(θ_*) of this estimator. The classical Fisher theorem ([30]) implies that the rescaled excess risk has a chi-square type limiting behavior, under weak conditions, as n → ∞. Stated informally, our goal in this paper is to characterize the critical sample size sufficient to enter this "asymptotic regime", i.e., to guarantee a chi-square type high-probability bound for the excess risk in finite sample. Elaborating on this goal in more detail and stating our results would be impossible without first giving a brief overview of the classical asymptotic theory. We give such an overview in the next section.

Classical asymptotic theory
Our main focus in this paper is the setting where L_n(θ) is a negative log-likelihood, that is, ℓ_z(θ) = −log p_θ(z), where p_θ(·) is some probability density supported on 𝒵. In this case, θ̂_n maximizes the likelihood of observing the i.i.d. sample (Z_1, ..., Z_n), with θ ranging over a parametric family 𝒫 = {P_θ, θ ∈ Θ}. In reality, 𝒫 may or may not contain the actual data-generating distribution P. When P ∈ 𝒫, we say that the parametric model corresponding to 𝒫 is well-specified; in this case, ERM becomes the maximum-likelihood estimator (MLE). Otherwise, the model is called misspecified, and ERM can be regarded as MLE under model misspecification, or the quasi maximum likelihood estimator [62]. In this case, P_{θ_*} corresponds to the "projection" of P onto the family 𝒫 in the sense of the Kullback-Leibler divergence, and the quasi MLE approximates P_{θ_*} by replacing P with the empirical distribution P_n. Our goal in this section is to give a brief overview of the main results of the asymptotic theory of M-estimation. Most of them, see monographs [30,24,57,10], rely on local regularity assumptions about the loss, allowing for a second-order Taylor expansion of L(θ) around θ_*. In particular, it is assumed that L(θ) is sufficiently smooth at θ_*, which is an interior point of Θ, so that the first-order optimality condition for θ_* reduces to ∇L(θ_*) = 0. Moreover, the Hessian H := ∇²L(θ_*) is assumed to be non-degenerate, i.e., H ≻ 0. Finally, the empirical risk is assumed to be three times continuously differentiable at θ_*, see, e.g., [30]. When combined together, these assumptions allow to derive, as a starting point, the local asymptotic normality of the quasi MLE: when n → ∞ with fixed d,

√n (θ̂_n − θ_*) ⇝ N(0, H^{-1} G H^{-1}),

where ⇝ denotes convergence in law, and G is the covariance matrix of the loss gradient at θ_* (also called Fisher's information matrix):

G := E[∇ℓ_Z(θ_*) ∇ℓ_Z(θ_*)^⊤].

Matrices G and H remain fixed as n grows. Hence, under mild regularity assumptions,¹ one also has that the variance of θ̂_n decreases as O(1/n).
Moreover, in the well-specified case G = H, see, e.g., [6], which leads to Fisher's theorem:

√n H^{1/2} (θ̂_n − θ_*) ⇝ N(0, I_d),

where I_d is the identity matrix of size d. Thus, denoting by ‖·‖_J the norm linked to a positive semidefinite matrix J via ‖x‖_J = ‖J^{1/2} x‖_2, we have n‖θ̂_n − θ_*‖²_H ⇝ χ²_d, where χ²_d is the chi-square law with d degrees of freedom. The second-order Taylor expansion of the population risk around θ_* then allows to derive the same asymptotic law for the scaled excess risk 2n[L(θ̂_n) − L(θ_*)]; this result is known as Wilks' theorem. In turn, this implies (under mild regularity conditions) that

E_n[L(θ̂_n) − L(θ_*)] = d/(2n) + o(1/n),   (2)

where E_n is the expectation over the product distribution P^⊗n of (Z_1, ..., Z_n). More precisely, by the standard chi-square deviation bounds (see e.g., [28, Lemma 1]) one has that, with probability ≥ 1 − δ,

2n[L(θ̂_n) − L(θ_*)] ≤ d + 2√(d log(1/δ)) + 2 log(1/δ) + o(1).

¹ It suffices for ρ_n := √n ‖θ̂_n − θ_*‖_2 to be uniformly integrable, i.e., lim_{C→∞} sup_n E[ρ_n 1{ρ_n ≥ C}] = 0. This is a very weak condition; see [26, Sec. 6.2] for stronger (but easier to verify) conditions.
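The chi-square deviation bound invoked above is of Laurent-Massart type: P{χ²_d ≥ d + 2√(dt) + 2t} ≤ e^{−t}. A quick Monte Carlo sanity check (our illustration, not from the paper; the dimension, confidence level, and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, t, m = 5, 3.0, 200_000
chi2_draws = rng.chisquare(d, size=m)
threshold = d + 2 * np.sqrt(d * t) + 2 * t    # tail level at confidence e^{-t}
emp_tail = np.mean(chi2_draws >= threshold)   # empirical P{chi2_d >= threshold}
```

The empirical tail probability falls below e^{−t} ≈ 0.05 here, confirming the (rather conservative) bound.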

Finally, these O(d/n) asymptotic bounds can be extended to the general situation of misspecified models by introducing the effective dimension

d_eff := tr(H^{-1} G).

Note that in a well-specified model, d_eff = d since G = H; moreover, in the misspecified case one can still have d_eff = O(d) "in favorable circumstances"; we will consider one such situation, that of misspecified linear regression, later on.² The expected excess risk bound (2) then changes to

E_n[L(θ̂_n) − L(θ_*)] = d_eff/(2n) + o(1/n),   (3)

and the corresponding in-probability bound (see again [28, Lemma 1]) is

2n[L(θ̂_n) − L(θ_*)] ≤ d_eff + 2√(d_eff log(1/δ)) + 2 log(1/δ) + o(1).   (4)

In fact, the main term in the right-hand side of (4) is the minimum possible asymptotic variance of any unbiased estimator; this result is known as the Cramér-Rao bound.
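Once G and H are available, the effective dimension d_eff = tr(H^{-1}G) is directly computable; a small numerical sketch (our illustration, with arbitrary diagonal matrices):

```python
import numpy as np

def effective_dimension(G, H):
    """d_eff = tr(H^{-1} G); reduces to d in a well-specified model (G = H)."""
    return float(np.trace(np.linalg.solve(H, G)))

H = np.diag([4.0, 1.0, 0.25])            # Hessian of the population risk
assert np.isclose(effective_dimension(H, H), 3.0)   # well-specified: d_eff = d

G = np.diag([8.0, 1.0, 0.25])            # inflated gradient covariance (misspecified)
d_eff = effective_dimension(G, H)        # 2 + 1 + 1 = 4 > d
```

Misspecification can thus inflate (or deflate) the dimension-like quantity that governs the rate.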
For what goes next, it is important to note that the asymptotic approach can be summarized as follows:
• First, the estimate is localized: ‖θ̂_n − θ_*‖²_H is upper-bounded by the squared "natural" norm of the score, ‖∇L_n(θ_*)‖²_{H^{-1}}, which can be controlled via the central limit theorem.
• Then, using the second-order Taylor expansion of L(θ) around θ_*, similar behavior is obtained for the excess risk L(θ̂_n) − L(θ_*).
Paying tribute to the clarity and historical significance of the classical asymptotic theory, one should keep in mind that its operating regime, n → ∞ with fixed parameter dimension, usually cannot be applied in the modern context. The recent works [16,5] extend the classical results to the asymptotic high-dimensional regime d → ∞ with d = O(n), analyzing the M-estimator as a fixed point of the approximate message passing algorithm. However, finite-sample analyses of approximate message passing are scarce: the only work we are aware of is [48], which only considers fixed-design linear regression. Postponing a more detailed review of related work to Section 7, let us briefly overview the main approaches in finite-sample analysis.
Consider first least-squares linear regression, where the population risk is a quadratic form corresponding to H = Σ := E[XX^⊤], and the empirical risk is a quadratic form corresponding to H_n := (1/n) ∑_{i=1}^n X_i X_i^⊤:

L_n(θ) = L_n(θ_*) + ⟨∇L_n(θ_*), θ − θ_*⟩ + ½‖θ − θ_*‖²_{H_n}.

As such, the global curvature information about L(θ) is encapsulated in a single matrix H, and we have at our disposal an unbiased estimate H_n of this matrix. This observation allows to dramatically simplify the analysis: it suffices to control the deviations of H_n from its expectation, which can be done using the standard tools of random matrix theory. In particular, in [59], see also Theorem A.2 in the appendix, it is shown that whenever X is K-subgaussian in all directions, and

n ≳ K⁴(d + log(1/δ)),   (7)

where the symbol ≳ hides a constant factor, then with probability at least 1 − δ it holds

½ H ⪯ H_n ⪯ 2 H.   (8)

In other words, the sample second-moment matrix H_n approximates H, up to a constant factor, in the sense of the corresponding Mahalanobis distances (in particular, H_n is non-degenerate whenever H is). This result can then be exploited as follows: since ∇L_n(θ̂_n) = 0 and H_n ⪰ 0,

½‖θ̂_n − θ_*‖²_{H_n} ≤ ⟨∇L_n(θ_*), θ_* − θ̂_n⟩ ≤ ‖∇L_n(θ_*)‖_{H_n^{-1}} ‖θ̂_n − θ_*‖_{H_n}.   (9)

Using (8), this gives ‖θ̂_n − θ_*‖²_H ≲ ‖∇L_n(θ_*)‖²_{H^{-1}}, which, via (6), results in a chi-square type bound for the excess risk. Finally, a non-asymptotic version of (4) is obtained by controlling the squared norm ‖∇L_n(θ_*)‖²_{H^{-1}} under light-tailed (say, subgaussian or subexponential) assumptions on ∇ℓ_Z(θ_*) = εX, through standard concentration inequalities. These light-tailed assumptions can further be relaxed to fourth-moment assumptions, using the generalized median-of-means estimator (see [21]). On the other hand, it is much more challenging to get rid of the light-tailed assumption on X, as obtaining covariance estimators with guarantees of the type (8) under weak moment assumptions is by itself a non-trivial problem. Recently, this problem has been addressed in [45], whose authors then proposed an estimator for ridge and ridgeless regression with near-optimal high-probability guarantees under heavy-tailed assumptions on X (see [45, Theorem 6.1]).
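The covariance concentration phenomenon behind (8) is easy to observe numerically: for a subgaussian (here Gaussian) design, the sample second-moment matrix is spectrally close to H once n exceeds a constant multiple of d. A self-contained simulation (ours; the dimension, sample size, and matrix H are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 10, 4000
A = rng.standard_normal((d, d)) / np.sqrt(d)
H = A @ A.T + 0.5 * np.eye(d)              # population second-moment matrix H
L = np.linalg.cholesky(H)
X = rng.standard_normal((n, d)) @ L.T      # rows X_i ~ N(0, H): subgaussian design
Hn = X.T @ X / n                            # sample second-moment matrix H_n

w, V = np.linalg.eigh(H)
Hinv_sqrt = V @ np.diag(w ** -0.5) @ V.T    # H^{-1/2}
rel_spec = np.linalg.eigvalsh(Hinv_sqrt @ Hn @ Hinv_sqrt)
# (1/2) H <= H_n <= 2 H  iff  all relative eigenvalues lie in [1/2, 2]
```

With n/d = 400 here, the relative spectrum is in fact much tighter than the constant-factor guarantee (8) requires.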
The remarkable feature of the outlined analysis is that, as soon as the curvature of L(θ), as given by H, is reliably estimated, the localization step is "automatic" due to (9). The only requirement is for n to reach the lower bound (7), so that one could relate the norms ‖·‖_{H_n} and ‖·‖_H. The crucial fact here is that for the quadratic loss, the curvature information is global, i.e., it is encoded in a single matrix. However, for more general losses this is not the case, and there seems to be no direct way of extending the above argument. As discussed before, the known solution to the problem [49] involved localization of the estimate, through the control of the global uniform deviations of L_n(θ), to a neighborhood of θ_* where a local quadratic approximation can be used; this approach requires global assumptions on the pointwise deviations of L_n(θ). Yet, we will show that in some other models beyond linear regression with quadratic loss, the local analysis suffices to provide localization of the estimate, and the complicated and opaque localization step using generic chaining, as in [49], can be circumvented.

Contributions and outline
Our analysis applies to linear prediction models: observing a pair Z = (X, Y) with X ∈ 𝒳 ⊆ R^d and Y ∈ 𝒴 ⊆ R, one predicts Y through the linear combination η = X^⊤θ with θ ∈ Θ ⊆ R^d. Accordingly, we consider losses given by

ℓ_Z(θ) = ℓ(Y, X^⊤θ)

for some function ℓ : 𝒴 × R → R assumed to be sufficiently smooth in its second argument. This subsumes regression (𝒴 = R) and classification (𝒴 = {0, 1}). Moreover, we assume the ability to bound the third derivative of ℓ(y, η) with respect to η via the second derivative in two alternative ways, as will be detailed in Section 2. Such self-concordance assumptions originate from [43], where they were used in the context of interior-point methods; later on, they were modified and used in the statistical analysis of logistic regression [2,4]. We consider both variants of self-concordance in our analysis, and show that the canonical self-concordance assumption, introduced in [43], leads to somewhat better bounds on the critical sample size than its modification suggested in [2] (see below). In addition to self-concordance of the loss, we make some assumptions on the local behavior of the gradient and Hessian of the empirical risk at the population risk minimizer θ_*, or in its neighborhood given by the unit Dikin ellipsoid (5) of the population risk at θ_*. To prove our main results (cf. Theorems 4.1-4.2), we carefully combine these assumptions through a non-standard covering argument, which allows us to control the uniform deviations of ∇²L_n(θ) from ∇²L(θ) over the Dikin ellipsoid, and implies localization of the estimator. We mention once again that global assumptions in the vein of [49] about the deviations of the empirical risk, its gradient, and Hessian can be avoided by using self-concordance.
Our framework includes random-design least-squares linear regression as a baseline. However, as we show in Section 2, it is in fact much more general. First, it encompasses some conditional generalized linear models with random design. Here we find that both versions of self-concordance are related to natural assumptions about the moments of Y, and discover several generalized linear models amenable to our analysis, including logistic regression. Second, we can address some common losses in robust estimation, which turn out to be pseudo self-concordant in the sense of [2]. Moreover, we show how to slightly modify these losses to make them canonically self-concordant, while preserving their first- and second-order structure. According to our theory, this leads to improved statistical performance of the M-estimator, as characterized by the sufficient sample size to reach the asymptotically optimal rate for the excess risk.
Our analysis carries out the following plan. First, the local assumptions allow to make sure that, starting from a certain sample size, the sample Hessian approximates the true Hessian H = ∇²L(θ_*) up to a constant factor, in complete analogy with the case of least squares. After that, self-concordance comes into play. First, using simple analytic arguments, we prove that with high probability, ∇²L_n(θ) remains nearly constant in a Dikin ellipsoid of a smaller radius of order O(1/√d), leading to a larger critical sample size than in the case of least squares. We then use these initial results to prove that under slightly stronger, but still local, assumptions, ∇²L_n(θ) in fact remains nearly constant in a constant-radius Dikin ellipsoid, leading to a critical sample size comparable to that in least squares (cf. Theorems 4.1-4.2). This is done via a simple but somewhat non-trivial covering argument, which might be of independent interest.
Let us now give a more detailed overview of the obtained results.
In Section 3, we show that for pseudo self-concordant losses [2], the asymptotically optimal (up to a constant factor) bound on the excess risk is guaranteed when the sample size reaches O(ρ · d · d_eff), up to a logarithmic factor in 1/δ, where ρ is the local curvature parameter linking H and Σ := E[XX^⊤] by Σ ⪯ ρH.
Moreover, for canonically self-concordant losses in the sense of [43], the dependency on ρ can be eliminated, and the critical sample size becomes O(d · d_eff). We now give a simplified (and slightly vulgarized) formulation of these two results.

Theorem 1.1 (Simplified formulation of Theorems 3.1-3.2). Assume that ℓ(y, ·) is self-concordant, for any y, in the sense of Nesterov and Nemirovski [43], i.e.,

|ℓ'''(y, η)| ≤ 2[ℓ''(y, η)]^{3/2},   (10)

and that ℓ'(Y, X^⊤θ_*)X =: ∇ℓ_Z(θ_*) and [ℓ''(Y, X^⊤θ_*)]^{1/2}X are subgaussian. Then

L(θ̂_n) − L(θ_*) ≲ (d_eff + log(1/δ))/n   (11)

with probability ≥ 1 − δ, δ ∈ (0, 1), as long as

n ≳ d(d_eff + log(1/δ)),   (12)

where ≲, ≳ hide constants. Moreover, if the loss satisfies the modified assumption

|ℓ'''(y, η)| ≤ ℓ''(y, η)   (13)

instead of (10), X is subgaussian as well, and Σ ⪯ ρH, then (11) is valid once

n ≳ ρ d(d_eff + log(1/δ)).   (14)

While the only available generic upper bound on ρ is given by the inverse of the global strong convexity modulus of the loss, and can be very large or even infinite in the case of unbounded predictors, the actual value of ρ depends on the data distribution, and is moderate when this distribution is not chosen adversarially, as discussed in [4, Sections 3.1, 4.2] and in our Section 2.2. In this vein, we show in Appendix D that ρ ≲ 1 + ‖θ_*‖³_Σ in logistic regression with Gaussian design X ∼ N(0, Σ). Motivated by this result, we propose canonically self-concordant losses for classification and robust regression in Section 2.1.
In Section 4, we obtain improved bounds for the critical sample size, scaling near-linearly in the parameter dimension, under a slightly stronger assumption on the data distribution. Essentially, we now require that the calibrated design

X(θ) := [ℓ''(Y, X^⊤θ)]^{1/2} X

is subgaussian uniformly over θ in the set

Θ_r(θ_*) := {θ ∈ Θ : ‖θ − θ_*‖_H ≤ r},   (15)

the r-radius Dikin ellipsoid of the population risk at θ_*. Specifically, we require r = 1 for canonically self-concordant losses, and r = 1/√ρ for pseudo self-concordant losses. This assumption is still local, and is not much more restrictive in some practical situations: in Appendix D we show, informally, that in the case of logistic regression with Gaussian design, the tails of X(θ) over θ ∈ Θ_{1/√ρ}(θ_*) are not heavier than those of X(θ_*) (see Proposition D.1). It allows to control the uniform deviations of the empirical Hessians from their means on Θ_r(θ_*), leading to the reduced sample size as per the following result.

Theorem 1.2 (Simplified formulation of Theorems 4.1-4.2). In addition to the premise of Theorem 1.1, assume that the vectors X(θ) := [ℓ''(Y, X^⊤θ)]^{1/2} X are subgaussian for θ ∈ Θ_r(θ_*), cf. (15), with r = 1 in the case of (10) and r = 1/√ρ in the case of (13). Then bounds (11) in Theorem 1.1 are valid once

n ≳ r^{-2} (max(d_eff, d) log(d) + log(1/δ)).   (16)

The main technical challenge when proving this result is the fact that, while (pseudo) self-concordance of the population risk over Θ_r(θ_*) with appropriate r follows from that of the loss function (by relating the directional derivatives of L(θ) to the corresponding moments of X(θ)), this fails to hold for the empirical risk. Hence, we cannot uniformly control its Hessians on Θ_r(θ_*) by simply integrating the directional third derivatives of the empirical risk. Instead, such control is attained by observing that self-concordance of the losses suffices to control the Hessians in a smaller Dikin ellipsoid with radius O(1/d^κ) for some κ ≥ 1/2, and combining this observation with a somewhat non-standard covering argument.
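The Dikin ellipsoid Θ_r(θ_*) in (15) is simply a ball in the local norm ‖·‖_H; a minimal membership test (our illustration, with an arbitrary Hessian):

```python
import numpy as np

def in_dikin_ellipsoid(theta, theta_star, H, r):
    """theta ∈ Θ_r(θ_*) iff ||theta - theta_star||_H <= r."""
    delta = theta - theta_star
    return float(delta @ H @ delta) <= r ** 2

H = np.diag([4.0, 1.0])                 # local Hessian: the x-direction is "stiff"
ts = np.zeros(2)
inside = in_dikin_ellipsoid(np.array([0.4, 0.0]), ts, H, 1.0)    # ||.||_H = 0.8
outside = in_dikin_ellipsoid(np.array([0.6, 0.0]), ts, H, 1.0)   # ||.||_H = 1.2
```

The ellipsoid is anisotropic: directions with larger curvature admit smaller parameter perturbations, which is precisely what makes ‖·‖_H the natural localization norm.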
We hypothesize that the bounds (16) are optimal up to the log(d) factor, i.e., ERM cannot provably achieve the non-asymptotic version of (4) in the regime where n is sublinear in d_eff or d. This hypothesis is motivated by the observation that n ≳ d is necessary to estimate the local norm ‖·‖_H, whereas n ≳ d_eff is necessary to have ‖∇L_n(θ_*)‖_{H^{-1}} ≤ c, which, in turn, allows to localize θ̂_n near θ_*.
In Section 5, we extend some of the above results to the high-dimensional setup. Specifically, we obtain analogues of Theorem 1.1 for ℓ₁-regularized M-estimators, assuming that the optimal parameter θ_* is s-sparse, the matrices G and H are bounded in the operator norm, and the design is uncorrelated (the last assumption can in principle be relaxed). In the case of pseudo self-concordant losses (Theorem 5.1), we replace max(d, d_eff) with O(ρ s log(d)), both in the error rates and the minimal sample size requirements. Unfortunately, for canonically self-concordant losses, we do not get the expected improvement by ρ (see Theorem 5.2), and the bounds essentially remain the same as in the case of pseudo self-concordance. This, however, is not surprising, since sparsity and ℓ₁-regularization depend on the choice of the basis, and are not affine-invariant, which prevents us from fully exploiting self-concordance in the analysis by forcing us to rely on the usual ℓ₁- and ℓ₂-norms instead of ‖·‖_H. A more detailed discussion of these results and their comparison with related work is deferred to Section 5.

Notation
We write f ≲ g or f = O(g) to state that f(·) ≤ Cg(·) for any admissible arguments of f(·), g(·) and some constant C > 0; analogously for f ≳ g or f = Ω(g). Notation f ≈ g means f ≲ g ≲ f. [n] is the set of integers {1, 2, ..., n}. Throughout, θ_* is the unique minimizer of L(θ). Similarly, θ̂_n is the minimizer of L_n(θ), which will be (provably) unique with high probability in all cases. Random vectors are denoted with capital letters (such as Z), and matrices with bold capital letters (such as A). I_d is the d × d identity matrix. A^⊤ is the transpose of A. For two square matrices A₁, A₂ of the same size, we write A₁ ≺ A₂ (resp., A₁ ⪯ A₂) when A₂ − A₁ is positive definite (resp., positive semidefinite). We denote with ‖·‖_p the ℓ_p-norm on R^d and the Schatten ℓ_p-norm of a matrix; in particular, ‖A‖₂ is the Frobenius norm and ‖A‖_∞ the operator norm. For A ⪰ 0, we define the seminorm ‖θ‖_A := ‖A^{1/2} θ‖₂.

Assumptions and examples
Before introducing the assumptions, we remind that the loss ℓ_Z : Θ → R is modeled as ℓ_Z(θ) = ℓ(Y, X^⊤θ) for some function ℓ(y, η) on 𝒴 × R^{(+)}, where 𝒴 is a subset of R, and R^{(+)} is allowed to be either R or the ray R_+ of strictly positive numbers, which allows to encompass the exponential response model (cf. Section 2.1). We refer to both ℓ_Z(θ) and ℓ(y, η) as the loss; which of the two we mean is clear from the context. The derivatives ℓ'(y, η), ℓ''(y, η), ℓ'''(y, η) of ℓ(y, η) are with respect to η.

Self-concordance assumptions
Let us introduce the assumptions related purely to the loss, rather than to the data distribution. Our standing assumption, which we silently use later on, is that the loss ℓ_z(·) is three times differentiable and convex on Θ for any z ∈ 𝒵.
We first present the assumption of pseudo self-concordance, introduced in [2] for the analysis of logistic regression. The reader may refer to [50, 55, 4] for the uses of generalized self-concordance in the context of quasi-Newton algorithms.
Assumption SCa. For any y ∈ 𝒴 and η ∈ R^{(+)}, the loss satisfies

|ℓ'''(y, η)| ≤ ℓ''(y, η).

We also consider the canonical self-concordance assumption, first introduced in [43] in the context of interior-point algorithms. The constant 2 is standard in the literature, but can be replaced with an arbitrary constant by rescaling the loss.
Assumption SCb. For any y ∈ 𝒴 and η ∈ R^{(+)}, the loss satisfies

|ℓ'''(y, η)| ≤ 2[ℓ''(y, η)]^{3/2}.

We now present some examples in which either of these assumptions is satisfied.

Generalized linear models over canonical exponential family
In generalized linear models (GLM) with canonical link function ([37]), one has

ℓ(y, η) = −yη + a(η) − b(y),   (17)

where the cumulant a(η) : R^{(+)} → R normalizes −ℓ(y, η) to be a log-likelihood:

a(η) = log ∫_𝒴 exp(yη + b(y)) dy.

With η = X^⊤θ, we have a conditional GLM for Y given X. Note that the second and third derivatives of ℓ(y, η) with respect to η coincide with those of a(·); hence, ℓ satisfies the basic smoothness/convexity assumption whenever a(·) is three times differentiable (as a cumulant, a(·) must be convex). Moreover, the cumulant derivatives correspond to the central moments of Y:

a'(η) = E_η[Y],  a''(η) = E_η[(Y − E_η[Y])²],  a'''(η) = E_η[(Y − E_η[Y])³],   (18)

where E_η[·] is the expectation with respect to the distribution with negative log-likelihood given by (17). Hence, Assumption SCb states precisely that the skewness of the model distribution is bounded by a constant uniformly over η ∈ R^{(+)}. This is the case in the exponential response GLM, where Y ∼ Exp(η) and a(η) = −log(η) is defined on R^{(+)} = R_+.
On the other hand, Assumption SCa is satisfied whenever the third absolute central moment of Y is uniformly bounded by the variance of Y, without the 3/2 power. This is the case in Poisson regression: Y ∼ Poisson(λ) with λ = exp(η); then b(y) = −log(y!) and a(η) = exp(η), so that a'''(η) = a''(η). This model is appropriate for count data where the rate of arrival itself depends multiplicatively on the canonical parameter η; see, e.g., [15]. Perhaps most importantly, Assumption SCa is automatically satisfied in logistic regression, in which 𝒴 = {0, 1}, and Y is modeled as a Bernoulli random variable with P_η{Y = 1} = σ(η), where σ(η) = 1/(1 + e^{−η}) is the sigmoid function. In this case, a(η) = log(1 + e^η), and one can verify that a'''(η) = a''(η)(1 − 2σ(η)), so Assumption SCa is satisfied since |1 − 2σ(η)| < 1 for any η ∈ R. Another way to see this is by looking at the cumulant derivatives (18) and using that 𝒴 = {0, 1}: since |Y − E_η[Y]| ≤ 1 almost surely,

|a'''(η)| = |E_η[(Y − E_η[Y])³]| ≤ E_η[(Y − E_η[Y])²] = a''(η).
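The cumulant identities above are easy to verify numerically; the sketch below (ours, with arbitrary grid and step sizes) checks, for the logistic cumulant, both the closed form a'''(η) = a''(η)(1 − 2σ(η)) against finite differences and the pseudo self-concordance inequality |a'''| ≤ a''.

```python
import numpy as np

sigma = lambda e: 1.0 / (1.0 + np.exp(-e))          # sigmoid
a2 = lambda e: sigma(e) * (1.0 - sigma(e))          # a''(eta) for logistic
a3 = lambda e: a2(e) * (1.0 - 2.0 * sigma(e))       # a'''(eta) = a''(eta)(1 - 2*sigma)

for eta in np.linspace(-5.0, 5.0, 101):
    # pseudo self-concordance (SCa): |a'''| <= a'' since |1 - 2*sigma| < 1
    assert abs(a3(eta)) <= a2(eta)
    # finite-difference check of the identity a''' = a''(1 - 2*sigma)
    h = 1e-5
    fd = (a2(eta + h) - a2(eta - h)) / (2.0 * h)
    assert abs(fd - a3(eta)) < 1e-6
```

The same two-line check rewritten with a''(η) = a'''(η) = exp(η) confirms the Poisson case.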

Robust estimation
Here, 𝒴 = R, and ℓ(y, η) = ϕ(y − η) for some contrast ϕ : R → R, a function minimized at the origin and usually even. Crucially, ϕ(·) must be globally Lipschitz-continuous, which guarantees robustness of the M-estimator, see [23]. On the other hand, from the statistical perspective, one can motivate contrasts that are locally quadratic, i.e., such that ϕ''(0) exists and is strictly positive, see, e.g., [31].⁴ These considerations, along with some minimax optimality results, lead to the Huber loss (see [22]):

ϕ_τ(t) = t²/2 if |t| ≤ τ,  and  ϕ_τ(t) = τ|t| − τ²/2 otherwise.

The Huber loss is parametrized by τ > 0, which allows to control the tradeoff between robustness and statistical performance. Indeed, on one hand, |ϕ'_τ(t)| ≤ τ for any t ∈ R, and we make estimation more robust by decreasing τ; on the other hand, the variance of the corresponding M-estimator usually decreases as τ grows. However, finite-sample statistical analysis of the Huber loss is complicated by the fact that ϕ_τ(t) is not C³-smooth. This is also unfavorable from the algorithmic perspective, as it complicates the analysis of Newton-type algorithms for the computation of the M-estimator. These issues can be circumvented if one instead uses pseudo-Huber losses, which retain the favorable properties of the Huber loss, yet are C³-smooth. E.g., such are contrasts of the form ϕ_τ(t) = τ²ϕ(t/τ) with

ϕ(t) = √(1 + t²) − 1  or  ϕ(t) = log cosh(t).   (19)

In both cases, the resulting ϕ_τ(·) satisfies ϕ''_τ(0) = 1 for any τ > 0, and |ϕ'_τ(t)| ≤ τ for any t ∈ R. Moreover, simple algebra shows that both functions in (19) satisfy Assumption SCa up to the constant c = 3, whence |ϕ'''_τ(t)| ≤ (3/τ)ϕ''_τ(t). As such, our theory is applicable to both these losses once they are properly renormalized.
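The two properties claimed for the pseudo-Huber contrasts, ϕ''(0) = 1 and the SCa bound with constant 3, can be checked by finite differences; the sketch below is our illustration (the contrasts used are the forms √(1+t²) − 1 and log cosh(t), and the step sizes are arbitrary).

```python
import numpy as np

phis = {
    "sqrt":    lambda t: np.sqrt(1.0 + t ** 2) - 1.0,
    "logcosh": lambda t: np.logaddexp(t, -t) - np.log(2.0),  # log cosh(t), overflow-safe
}

def d2(f, t, h=1e-4):
    """Central second difference, approximating f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / h ** 2

def d3(f, t, h=1e-3):
    """Central third difference, approximating f'''(t)."""
    return (f(t + 2*h) - 2.0*f(t + h) + 2.0*f(t - h) - f(t - 2*h)) / (2.0 * h ** 3)

for name, phi in phis.items():
    assert abs(d2(phi, 0.0) - 1.0) < 1e-4                   # phi''(0) = 1
    for t in np.linspace(-4.0, 4.0, 41):
        assert abs(d3(phi, t)) <= 3.0 * d2(phi, t) + 1e-4   # SCa with c = 3
```

For the square-root contrast the exact ratio |ϕ'''|/ϕ'' = 3|t|/(1+t²) peaks at 3/2, and for log cosh at 2, so the constant 3 is comfortable in both cases.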

Novel self-concordant losses
Here we construct a canonically self-concordant (up to a constant) pseudo-Huber loss, and similarly, a canonically self-concordant loss suitable for classification and similar to the logistic loss. This construction is motivated by the observation that our theory gives a somewhat tighter guarantee on the critical sample size (after which the fast rates occur) under the canonical self-concordance assumption. (However, in practice the situation might be different, as we explore in Sec. 6.) The key idea in this construction is that self-concordance is preserved under convex conjugation (see, e.g., [50, Prop. 6]), while at the same time one can control the range of the derivative of a function through the domain of its convex conjugate (see [47]). Namely, consider φ : (−1, 1) → R_+ given by

φ(u) = −½ log(1 − u²),   (20)

that is, the negative log-barrier of [−1, 1] normalized so that φ''(0) = 1. Its convex conjugate ϕ(t) can be explicitly computed:

ϕ(t) = t u_*(t) + ½ log(1 − u_*(t)²),  where u_*(t) = (√(1 + 4t²) − 1)/(2t) for t ≠ 0, and u_*(0) = 0.   (21)

Note that φ(·) is even, satisfies φ''(0) = 1, and |φ'''(u)| ≤ 2√2 [φ''(u)]^{3/2}, since both functions −log(1 ± u) satisfy Assumption SCb. By simple calculations detailed in Appendix C, ϕ(t) defined in (21) has all the same properties. On the other hand, |ϕ'(t)| < 1, since φ(u) is a barrier on [−1, 1]. Thus, ϕ(t) has all the properties desired of a robust loss, and in addition is canonically self-concordant (albeit with constant 2√2 instead of 2). As illustrated in Figure 1, the new loss approximates the Huber loss essentially as well as the commonly used pseudo-Huber losses (19). The new loss admits a rescaled version ϕ_τ(t) = τ²ϕ(t/τ), as before.

Similarly, we can construct a self-concordant counterpart of the logistic loss suited for classification. In this case, we take φ(u) = −log(−u(1 + u))/2, the normalized log-barrier of [−1, 0], whose convex conjugate φ*(·) can again be computed explicitly (see Appendix C). The derivative of φ*(·) must belong to (−1, 0), and φ*(·) is canonically self-concordant (up to a constant) by the same reasoning as before.
By rescaling and shifting φ*(·), we obtain a loss which can be understood as a convex surrogate of the 0-1 loss, similar to the logistic loss; see Figure 1. However, this loss is negative for yη > 2.4, and therefore does not globally upper-bound the 0-1 loss. Fortunately, its right branch can be lower-bounded by Ω(−log(yη)), so the resulting "leakage" is insignificant. On the other hand, this defect is unavoidable: one can show that a canonically self-concordant function on R_+ cannot have a horizontal asymptote. Finally, let us remark that the "leakage" effect can also be quantified using the so-called calibration theory [7].
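The convex conjugation at the heart of this construction can be reproduced numerically. Below (our sketch, with an arbitrary grid) we conjugate the barrier φ(u) = −(1/2) log(1 − u²) on a fine grid and check that the conjugate inherits the normalization ϕ''(0) = 1 while its slope stays in (−1, 1), i.e., the resulting loss is 1-Lipschitz.

```python
import numpy as np

u = np.linspace(-0.999999, 0.999999, 2_000_001)
phi = -0.5 * np.log1p(-(u ** 2))     # barrier phi(u) = -(1/2) log(1 - u^2), phi''(0) = 1

def conj(t):
    """Grid approximation of the convex conjugate sup_u { t*u - phi(u) }."""
    return float(np.max(t * u - phi))

h = 1e-2
curv0 = (conj(h) - 2.0 * conj(0.0) + conj(-h)) / h ** 2    # approximates conjugate''(0)
slope = (conj(100.0 + h) - conj(100.0)) / h                # approximates conjugate'(100)
```

Conjugation preserves the local curvature (curv0 ≈ 1), while the bounded barrier domain caps the slope below 1 even far in the tail: exactly the combination of properties sought for a robust loss.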

Distribution assumptions
Preliminaries. We now introduce additional assumptions related to the distribution of the design scaled by the derivatives of the loss at the true optimum θ_*. All these assumptions are fully local, i.e., they only concern the true optimal point θ_*. We begin with the basic assumptions. First, we assume the existence of the matrices

Σ := E[XX^⊤],  G := E[∇ℓ_Z(θ_*) ∇ℓ_Z(θ_*)^⊤],  H := ∇²L(θ_*).

Generally, Σ ≠ H (except for least squares), and G ≠ H (except in a well-specified model). Recall that E[∇ℓ_Z(θ_*)] = 0; as such, G is the covariance matrix of ∇ℓ_Z(θ_*). For future reference we also note that, for any θ ∈ Θ, one has

∇²L(θ) = E[ℓ''(Y, X^⊤θ) XX^⊤].   (23)

We assume that X^⊤θ ∈ R^{(+)} for any θ ∈ Θ and X ∈ 𝒳. This assumption is non-trivial only when R^{(+)} = R_+, which is of interest in the exponential response model; in this case, one can take Θ and 𝒳 to be contained in a pair of mutually dual convex cones in R^d. Following [59], we use the formalism of subgaussian, or ψ₂-, norms. The ψ₂-norm ‖ξ‖_{ψ₂} of a random variable ξ ∈ R can be defined in a number of equivalent ways (see Appendix A), e.g., as

‖ξ‖_{ψ₂} := inf{t > 0 : E[exp(ξ²/t²)] ≤ 2}.

This definition is extended to random vectors Z ∈ R^d in the standard way:

‖Z‖_{ψ₂} := sup{‖⟨u, Z⟩‖_{ψ₂} : u ∈ R^d, ‖u‖₂ = 1}.

In other words, ‖Z‖_{ψ₂} is the maximal ψ₂-norm over all one-dimensional marginals of Z. See Appendix A for more details on subgaussian random variables.

Assumption D0. The decorrelated design is subgaussian: it holds

‖Σ^{-1/2} X‖_{ψ₂} ≤ K₀.
Assumption D0 is often satisfied with a constant K₀ not depending on n or d. For example, this is the case for a zero-mean Gaussian design X ∼ N(0, Σ), or a design with independent Bernoulli components. Moreover, it can be shown that any affine transformation of a design X satisfying Assumption D0 also satisfies it, with an at most twice larger K₀ (see Lemma A.5 in the appendix).
Assumption D1. The decorrelated loss gradient at θ_* is subgaussian:

‖G^{-1/2} ∇ℓ_Z(θ_*)‖_{ψ₂} ≤ K₁.

Note that Assumption D1 can be reformulated in terms of the design vector scaled by the loss derivative at θ_*, since ∇ℓ_Z(θ_*) = ℓ'(Y, X^⊤θ_*)X. Similarly, we can consider the random vector

X̃ := [ℓ''(Y, X^⊤θ_*)]^{1/2} X,

which we call the calibrated design. Note that X̃ is linked with H by E[X̃X̃^⊤] = H, cf. (23). As stated next, we assume that the calibrated design is subgaussian. This allows to control the deviations of H_n using Theorem A.2 in the appendix.

Assumption D2. The calibrated design is subgaussian:

‖H^{-1/2} X̃‖_{ψ₂} ≤ K₂.
Assumption D2 can be reformulated in terms of the loss Hessian ∇²ℓ_Z(θ_*) due to (23). However, this reformulation does not bring new insight, and we omit it.
Note, however, that Assumptions D1-D2 may fail in some natural models. For instance, in a GLM one has ℓ''(Y, X^⊤θ) = a''(X^⊤θ), where a(·) is the cumulant function. The transform [a''(X^⊤θ)]^{1/2} that scales X along a direction θ can be highly non-linear, breaking subgaussianity of X(θ). For example, Assumption D2 does not hold in Poisson regression. Another limitation of our approach is that the constants K₁, K₂ in Assumptions D1-D2 can depend on the magnitude of θ_*. In fact, for logistic regression with Gaussian design X ∼ N(0, Σ), one can bound K₁ and K₂ in terms of ‖θ_*‖_Σ. The proof of this estimate (see Appendix D) is highly non-trivial, and relies on the Gaussianity of X. We also obtain analogous estimates when the subgaussian norm ‖·‖_{ψ₂} is replaced with the subexponential norm ‖·‖_{ψ₁} (see Appendix D and Section 3 for details). In other applications, one should carefully verify Assumptions D1-D2, bounding the constants K₁ and K₂. This can be a non-trivial task in itself, especially when the distribution of X is unknown.
Finally, for pseudo self-concordant losses we need compatibility of Σ and H.
Assumption C. It holds Σ ⪯ ρH for some ρ < ∞.
Assumption C has already appeared in the statistical analysis of logistic regression in [4]. Note that the simplest generic upper bound for ρ is the inverse of the infimum of ℓ''(y, ·), and unless ℓ(y, ·) is strongly convex on R(+) (which is usually not the case), this bound is vacuous. On the other hand, the infimum in (25) can be taken over the subset of R(+) corresponding to possible values of X⊤θ*, but such a bound can still be very conservative: for example, it only gives ρ = O(e^{RD}) in the case of logistic regression with ‖X‖2 ≤ R a.s. and Θ = {θ ∈ R^d : ‖θ‖2 ≤ D}. However, the actual value of ρ depends on the true distribution of the data, and is usually much smaller; see, e.g., the discussion in [4, Sections 3.1, 4.2] for the case of logistic regression. For example, consider a "quasi well-specified" robust regression model: ℓ(Y, X⊤θ) = ϕ(Y − X⊤θ) with an even contrast ϕ(·) and unconstrained parameter. Suppose that the true distribution of Y is given by Y = X⊤θ* + ε, with ε independent of X, zero-mean, and symmetrically distributed. One can check that in this case L(θ) is minimized at θ*, and ρ = 1/E[ϕ''(ε)]. On the other hand, the worst-case bounds on ρ can be enforced if the data distribution is chosen adversarially. In particular, for logistic regression, [18] construct an adversarial distribution that enforces ρ = Ω(e^{RD}) as long as n = O(e^{RD}).
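The identity ρ = 1/E[ϕ''(ε)] in this quasi well-specified model is easy to check by Monte Carlo. A minimal sketch, with the pseudo-Huber-type contrast ϕ(t) = √(1 + t²) − 1 and standard Gaussian noise as illustrative choices of ours:

```python
import numpy as np

# Second derivative of the pseudo-Huber-type contrast phi(t) = sqrt(1 + t^2) - 1.
phi_dd = lambda t: (1.0 + t ** 2) ** (-1.5)

rng = np.random.default_rng(0)
eps = rng.standard_normal(10 ** 6)   # zero-mean, symmetric noise, independent of X

# In the quasi well-specified robust model, Sigma <= rho * H holds with
# rho = 1 / E[phi''(eps)]; Monte Carlo gives rho close to 1.8 here.
rho = 1.0 / np.mean(phi_dd(eps))
```

For this contrast and noise pair, ρ is a moderate constant, far from the vacuous worst-case bound based on the infimum of ℓ''.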

Results under minimal assumptions
In this section, we present extensions of the asymptotic deviation bound to the finite-sample regime under minimal assumptions. We then refine these results in Section 4, under a slightly strengthened version of Assumption D2, through a more subtle analysis. In the proofs, we use some probabilistic tools collected in Appendix A; in particular, we use deviation bounds for quadratic forms (Theorem A.1) and for sample covariance matrices (Theorem A.2) of subgaussian vectors. We also use technical results on (pseudo) self-concordant functions collected in Appendix B. Some of them appear to be new and are of independent interest. To improve readability, we defer the proofs to Appendix C.

Preliminaries
In the results we are about to present, a technical difficulty arises due to the unboundedness of the vectors X and X̃, cf. (24). To address it, we observe that, due to Assumptions D0 and D2, these vectors admit O(√d) high-probability bounds on their norms; more precisely, the corresponding events hold with probability ≥ 1 − δ under Assumptions D0 and D2, respectively.
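This truncation is easy to illustrate numerically: with a subgaussian design, the excluded event is so rare that averaging over it or not makes practically no difference. A toy sketch of ours (the threshold constant C = 2 is an arbitrary illustrative choice), using logistic losses with all labels set to +1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, delta = 10000, 10, 1e-3
X = rng.standard_normal((n, d))      # isotropic design, so Sigma^{-1/2} X = X
theta = np.ones(d) / np.sqrt(d)

# High-probability norm threshold of the form C * sqrt(d + log(1/delta)),
# matching the O(sqrt(d)) bounds discussed above; C = 2 is illustrative.
C = 2.0
keep = np.linalg.norm(X, axis=1) <= C * np.sqrt(d + np.log(1.0 / delta))

losses = np.log1p(np.exp(-X @ theta))          # logistic losses at labels y = +1
risk, restricted_risk = losses.mean(), losses[keep].mean()
```

Only a vanishing fraction of samples is excluded, so restricting the averaging leaves the risk essentially unchanged.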
To exploit this fact, we replace the population risk L(θ) with restricted risks, in which we exclude from averaging the low-probability outcomes where the norms of X and X̃ are too large. Provided that δ is small enough, we can show that these restricted risks closely approximate L(θ), so that the second-order structure of the population risk is preserved; at the same time, we can now work with X and X̃ as if they were almost surely bounded. We now present our basic result for M-estimators with self-concordant losses.
Theorem 3.1. Assume that δ satisfies (31) and that n satisfies (27). Then, with probability at least 1 − δ, the bounds (28)-(30) hold.
The main message of Theorem 3.1 is that, under minimal assumptions, the "quadratic" behavior of the population risk, as given by (28)-(30), is guaranteed for sample sizes growing quadratically in the parameter dimension; more precisely, for n = Ω(ρ · d · d_eff), cf. the second bound in (27), where Ω(·) hides subgaussian constants and a logarithmic factor in δ. Technically, the curvature parameter ρ appears in (27) because of the "incorrect" power of the second derivative in Assumption SCa, as compared to the power 3/2 in Assumption SCb. Indeed, for canonically self-concordant losses, the factor ρK0² in the bound for the critical sample size gets replaced with K2², and Assumptions C and D0 are not needed.
Theorem 3.2. Let Assumptions SCb, D1, D2 hold, and assume that δ satisfies (31). Then, (28)-(30) hold with probability at least 1 − δ, with the factor ρK0² in (27) replaced by K2².
We also note that both of the above results include a technical condition (31) that lower-bounds the admissible violation probability δ. This condition is mild, as the admissible δ depends polynomially on n and d. Moreover, the condition can be dropped if one reinforces Assumption D0 (resp., D2) by assuming that Σ^{-1/2}X (resp., H^{-1/2}X̃) is almost surely bounded. The corresponding modifications of Theorems 3.1-3.2 are given in the arXiv version of this paper [44, Thms. 3.1-3.2].
As we previously discussed (cf. Remark 2.2), the distributional assumptions D0-D2, although local, are quite restrictive, as they assume light-tailed behavior. Next, we discuss how these assumptions can be relaxed.

Extension to heavy-tailed distributions
To extend the results, we may use the confidence-boosting technique based on a version of the multi-dimensional sample median, as proposed in [21]. This allows us to completely remove Assumption D1, only assuming the existence of the covariance matrix G(θ*). To use the technique, one first divides the sample into k = ⌈log(e/δ)⌉ non-overlapping subsamples, and computes the corresponding M-estimators θ̂(1), ..., θ̂(k) over each subsample. Then, one aggregates them through [21, Algorithm 3], using a random distance oracle related to each θ̂(i); the aggregate serves as the final estimator.
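A simplified sketch of this confidence-boosting scheme, with plain Euclidean distances standing in for the random distance oracle of [21, Algorithm 3] and per-fold least squares as the base M-estimator (both substitutions are ours, for illustration only):

```python
import numpy as np

def boosted_estimator(X, y, delta, fit):
    """Split the sample into k = ceil(log(e/delta)) folds, fit an estimator
    on each, and return the one with the smallest median distance to the
    others: a multi-dimensional-median-style aggregation."""
    k = int(np.ceil(np.log(np.e / delta)))
    folds = zip(np.array_split(X, k), np.array_split(y, k))
    thetas = [fit(Xs, ys) for Xs, ys in folds]
    med_dist = [np.median([np.linalg.norm(t - s) for s in thetas]) for t in thetas]
    return thetas[int(np.argmin(med_dist))]

# Toy check: linear model with heavy-tailed (Student t, df = 2.5) noise.
rng = np.random.default_rng(0)
n, d = 3000, 4
theta_star = np.ones(d)
X = rng.standard_normal((n, d))
y = X @ theta_star + rng.standard_t(df=2.5, size=n)
ols = lambda Xs, ys: np.linalg.lstsq(Xs, ys, rcond=None)[0]
theta_hat = boosted_estimator(X, y, delta=0.01, fit=ols)
```

The selected fold estimator remains accurate despite the heavy-tailed noise, which is the point of the median-style aggregation.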
A similar technique also allows us to somewhat weaken Assumptions D0 and D2, replacing the subgaussian norm ‖·‖ψ2 with the subexponential norm ‖·‖ψ1 at the expense of an extra logarithmic factor. (By definition, X ∈ R^d satisfies ‖X‖ψ1 ≤ K if for any u on the unit sphere one has (E[|⟨X, u⟩|^p])^{1/p} ≤ Kp for all p ≥ 1.) On the other hand, replacing Assumptions D0 and D2 with finite-moment assumptions (ideally, finite kurtoses of the vectors X and X̃) is challenging. First of all, the sample covariance estimators would have to be replaced by some estimators Σ̃ and H̃ that admit affine-invariant bounds of the form (1 − ε)Σ ⪯ Σ̃ ⪯ (1 + ε)Σ with high probability, under the existence of only a few finite moments (ideally, the fourth moment) of X and X̃ in any direction. Such estimators were recently obtained in [45], based on the iterative application of the truncated covariance estimator analyzed in [61]. Computing such an estimator on a hold-out sample would allow us to remove Assumption D0 in Theorem 3.1. However, this technique by itself does not allow us to relax Assumption D2: note first that we do not know the true minimizer θ*, and hence cannot directly compute the robust estimator H̃. A possible remedy, leading to an extension of Theorems 3.1-3.2, is to apply an approximation technique on top of the affine-invariant covariance estimator, similar to the one used below to prove Theorems 4.1-4.2 with improved critical sample size. As we discuss at the end of Section 4, this would allow us to remove Assumptions D0 and D2 in Theorems 3.1-3.2, but not in Theorems 4.1-4.2.

Improved results: near-linear critical sample size
As we demonstrate next, the previously obtained bounds on the critical sample size can be improved: essentially, the product of d_eff and d can be replaced with their maximum. This requires slightly strengthening Assumption D2, as follows.
Assumption D2*. It holds ‖H(θ)^{-1/2}X̃(θ)‖ψ2 ≤ K̃2(r) for any θ in the Dikin ellipsoid Θr(θ*), given by Θr(θ*) := {θ ∈ Θ : ‖θ − θ*‖H ≤ r}.
Note that Assumption D2 corresponds to Assumption D2* with r = 0, the correspondence being given by K2 = K̃2(0). On the other hand, the strengthened assumption is still local, i.e., it only concerns the points r-close to θ* in the local Hessian metric, rather than the whole domain Θ. With the new assumption at hand, we now state the improved result for canonically self-concordant losses.
Let us briefly explain the key ideas behind this result. First of all, recall that the extra factor d in the bound of Theorem 3.2 appears because self-concordance of the individual losses only allows us to obtain a second-order approximation of the empirical risk in a small Dikin ellipsoid with radius O(1/√d), due to the fact that ‖X̃‖H−1 = Ω(√d) with high probability. This second-order approximation then allows us to localize the estimate as soon as ‖∇Ln(θ*)‖H−1 becomes smaller than the radius of the ellipsoid in which such an approximation holds, cf. the proof of Proposition B.3. Hence, the extra factor d would be eliminated if we managed to provide a second-order Taylor approximation of Ln(θ) in the constant-radius Dikin ellipsoid Θc(θ*). The immediate difficulty is that, unlike the individual losses, the empirical risk is not self-concordant; hence, the desired Taylor approximation cannot be obtained purely by integration. Instead, we conduct a somewhat non-standard argument (see Figure 2), which combines (i) self-concordance of the population risk, following from Assumption D2*; (ii) self-concordance of the individual losses; (iii) a covering argument in which the ellipsoid Θc(θ*) is covered with small ellipsoids of radius O(1/d^γ) for some γ ≥ 1/2. In particular, we choose γ = 2: this simplifies the calculations in the final step of the proof without breaking (35), since d^γ enters the analysis under a logarithm, when bounding covering numbers.
Next we present a counterpart of Theorem 4.1 for pseudo self-concordant losses. As one might expect, the bound on the critical sample size degrades by a factor of ρ.
The two results above merit some discussion. First, note that, in the case of pseudo self-concordance, the radius of the Dikin ellipsoid in which Assumption D2* is required to hold is √ρ times smaller than in the case of canonical self-concordance. As will become clear from the proof of Theorem 4.2, this deflation is related to the fact that we cannot control the Hessians of L(θ) over Dikin ellipsoids with a larger radius, even when Assumption D2* holds on such an ellipsoid. On the other hand, decreasing the radius r of the Dikin ellipsoid allows us to control K̃2(r): as we show in Appendix D, in logistic regression with Gaussian design X ∼ N(0, Σ), one has K̃2(1/√ρ) = O(K2) up to a logarithmic factor. Thus, Assumption D2* with r = 1/√ρ is essentially equivalent to Assumption D2.
Second, note that the second threshold in (35) has an extra K̃2⁴(r) factor compared to that in (32), if we do not distinguish between K̃2(r) and K2 = K̃2(0), and similarly when comparing (36) and (27). This can be a substantial difference, since K2 and K̃2(r) can both depend on the norm of θ*. In fact, in Appendix D (Proposition D.1) we show, by a technical calculation, that in logistic regression with X ∼ N(0, Σ) one has an explicit bound on K2, this bound being tight, while the bound on K̃2(1/√ρ) is tight up to a logarithmic factor. Thus, K̃2⁴(1/√ρ) can potentially be as large as ρ^{2/3}. On the other hand, when the distribution of X̃(θ) is log-concave and centrally symmetric at any θ ∈ Θr(θ*), the factor K̃2⁴(r) can be eliminated. This amounts to using an improved relation between the third and second moments of the marginals of H(θ)^{-1/2}X̃(θ) in step 1° of the analysis of Theorems 4.1-4.2, which follows from [11, Lem. 2] by simple algebra, using log-concavity of H(θ)^{-1/2}X̃(θ).

Extending Theorems 4.1-4.2 to heavy-tailed distributions
One fact playing a key role in the proofs of the last two theorems is that, in the bound for the sample complexity of estimating a single covariance matrix, the confidence term log(1/δ) is additive in d. This allows us to take a union bound over an exponential-in-d number of events corresponding to the centers of the ε-net, while still preserving a near-linear-in-d sample complexity. As discussed in Section 3, the main technical challenge when trying to extend our results to heavy-tailed distributions is posed by Assumption D2, which for Theorems 4.1-4.2 gets strengthened to Assumption D2*. To remove it, one could replace the empirical Hessians by some estimator H̃(θ) that estimates H(θ) with high confidence in the positive-semidefinite sense (cf. (33)) under weak moment assumptions. Given such estimators, we could essentially repeat the covering argument in the analysis of Theorems 4.1-4.2, replacing the Hessian estimate at any θ ∈ Θr(θ*) (with r = 1 or r = 1/√ρ) with the estimate H̃(θ') at the closest center θ' of the cover, and replacing empirical risk minimization with a version of a stochastic quasi-Newton algorithm with H̃(θ') as the Hessian oracle for H(θ). Unfortunately, the only estimator known to us that provably satisfies a high-confidence affine-invariant bound under weak moment assumptions is the one from [45], and in its sample complexity the confidence term enters multiplicatively with d. After taking the union bound over d^{O(d)} events, this bound becomes quadratic in d. While this is sufficient to extend Theorems 3.1-3.2, the argument in Theorems 4.1-4.2 breaks down. Thus, extending the latter theorems, and obtaining near-linear sample complexity, has to rely on covariance estimation with ⪯-type guarantees and an additive confidence term, cf. (37).
The closest result in this direction is the recent work [40], which establishes a high-probability bound in the operator norm with a constant depending only on the kurtosis, where r(Σ) := Tr(Σ)/‖Σ‖ ≤ d is the effective rank. Unfortunately, it is challenging to apply this result in our context, since operator-norm bounds cannot be translated into ⪯-type guarantees akin to (33) when the estimator is not affine-equivariant. Some progress in this direction has recently been obtained in [45]; see also [45, Sec. 2.3] for a detailed discussion.

High-dimensional setup
Our next goal is to extend the results obtained so far to the high-dimensional setting. Namely, we assume that Θ = R^d with d ≫ n, and that the optimal parameter θ* is sparse, i.e., the number of non-zero components of θ* is at most s ≪ d. Note that if the support S of θ* were known, a reasonable estimator could be obtained by replacing X with its projection X_S on S, and minimizing the empirical risk over S. As in the case of the quadratic loss and the classical Lasso estimator, this would lead to an improvement over the results of Sections 3-4: the ambient dimension d would be replaced with s, and d_eff with its restricted counterpart computed on S. However, in reality S is unknown, and the common recommendation is to use the ℓ1-penalized M-estimator, given by (38). In the case of the quadratic loss, it is well known that the risk of the ℓ1-penalized estimator, when measured in terms of the ℓ1-loss or the "prediction" loss corresponding to the design covariance matrix, is within a logarithmic-in-d factor of the "ideal" risk of the projection oracle, provided that the penalization parameter λ is appropriately chosen, and the design is near-isotropic and subgaussian; see, e.g., [54], [12], [9], [25]. While the statistical theory for the quadratic loss is almost complete, this is not yet the case for general M-estimators. Here our goal is to partially close this gap, providing analogues of Theorems 3.1 and 3.2 in the high-dimensional setting. These results extend those obtained in [2] for the logistic loss using pseudo self-concordance, and are close to those proved in [56]; we discuss the connections with these works at the end of this section. Finally, notice that we do not prove analogues of Theorems 4.1-4.2, which would have resulted in a near-linear, rather than quadratic, dependence of the critical sample size on s. We leave such extensions for future work. We now introduce the final assumption, complementary to Assumption C.
Together, Assumptions C and C* imply two-sided bounds on H and G in the operator norm. Moreover, we can reasonably expect that in the ill-specified case G ⪰ H, which is a stronger version of the natural inequality d_eff ≥ d. When this is the case, the eigenvalues of both H and G belong to the interval [ρ^{-1}, κ], where κ := max(κ1, κ2). Then, the product Q := ρκ can be considered the condition number of the estimation problem at hand. In particular, we are about to see that the excess risk bounds, as well as the bounds for the critical sample size, get inflated by Q in the high-dimensional regime. This reflects the requirement that the problem be well-conditioned with respect to the standard coordinate basis, since both the ℓ0-"norm" and the ℓ1-norm depend on the choice of basis. Some further remarks are given below.
• Similarly to the bound (25), we can always bound κ1 and κ2 via the suprema of the corresponding loss derivatives. Arguably, these bounds are more informative than the bound (25) for ρ, as the right-hand sides are constants for the pseudo-Huber and logistic losses.
• Correlated designs can also be considered, but this would inflate the bounds by the condition number of Σ. This is natural, as ℓ1-regularization fixes the basis, and the estimator is not affine-invariant.
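For concreteness, the ℓ1-penalized M-estimator (38) with the logistic loss can be sketched via proximal gradient descent (ISTA). This is our own illustration rather than the procedure analyzed in the theorems; the penalty level follows the √(log(d)/n) scaling, with an arbitrary constant 0.5 in front:

```python
import numpy as np

def l1_logistic(X, y, lam, n_iter=500):
    """l1-penalized logistic ERM, cf. (38), via proximal gradient (ISTA).
    Labels y are in {-1, +1}; lam is the l1 penalty level."""
    n, d = X.shape
    theta = np.zeros(d)
    step = 4.0 * n / np.linalg.norm(X, 2) ** 2    # 1/L for the logistic empirical risk
    for _ in range(n_iter):
        margins = y * (X @ theta)
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return theta

rng = np.random.default_rng(0)
n, d, s = 1000, 50, 3
theta_star = np.zeros(d)
theta_star[:s] = 2.0                              # s-sparse ground truth
X = rng.standard_normal((n, d))
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_star)), 1.0, -1.0)
theta_hat = l1_logistic(X, y, lam=0.5 * np.sqrt(np.log(d) / n))
```

On this toy instance, the recovered vector concentrates on the true support, with coefficients somewhat shrunk by the penalty.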
The next result characterizes the statistical properties of the ℓ1-penalized M-estimator (38) with a pseudo self-concordant loss, extending Theorem 3.1.

Whenever n satisfies (39) and the regularization parameter λ satisfies (40), we have the stated error bounds with probability at least 1 − δ. 2. Define E := {‖X‖∞ ≲ K0 √(log(ed/δ))}. Then, P(E) ≥ 1 − δ, and the bounds hold on E for admissible δ and λ. Clearly, the right choice of λ is the one attaining the lower bound in (40). Next we state a version of Theorem 5.1 for canonically self-concordant losses.

The event
Comparison of Theorems 5.1 and 5.2. The usual gain of ρ that we have observed so far for canonically vs. pseudo self-concordant losses is not preserved for ℓ1-regularized estimators. Instead, the second bound in (39) and the upper bound in (40) get inflated by κ2, and the critical sample size, given the "ideal" choice of the regularization parameter corresponding to the lower bound in (44), becomes n ≳ max(Qs, Q²s²) log(ed/δ). Essentially, the reason is that ℓ1-regularization does not "know" anything about the matrices H and Hn, and, in a sense, violates the affine-invariant structure of the proofs for non-regularized M-estimators. This seems to be a fundamental issue with ℓ1-regularization, rather than an artifact of our proofs, since ℓ1-regularized M-estimators are themselves not affine-invariant. As such, we believe the additional factors of Q and Q² to be unimprovable in the high-dimensional setup without further assumptions.
Comparison with prior work. Theorem 5.1 extends the result of [2, Theorem 5] for logistic regression with fixed design, obtained using the pseudo self-concordance of the logistic loss. While the established error bounds are similar, our results have important novelties. First, we analyze the random-design setting, whereas [2] assumes fixed design. Second, the result of [2] requires a larger sample size, scaling with the product of s and R², where R is an upper bound on ‖X‖2. Typically, R scales as Ω(√d) (e.g., this is the case when the design is pre-generated by sampling from a subgaussian distribution); thus [2] essentially proves the bound O(sd) for the critical sample size.
On the other hand, our results can be compared to those in [56], who establish the rate O(λs) for the ℓ1-error and O(λ²s) for the prediction error (see their Theorems 5.2 and 7.3), addressing a larger class of models including GLMs with non-canonical link functions, and general convex robust losses. However, in order to control the precision of the local quadratic approximations of the risk, the authors of [56] assume that ℓ''(Y, X⊤θ*) is bounded from below (Conditions A4 and B), which can only be guaranteed by assuming that θ* is bounded in ℓ1-norm. Thus, their results do not address the case of an unbounded parameter. Remarkably, these results have a similar requirement for the sample size to scale as Ω(s² log d).
computed from a small subsample. We defer the discussion of further related work on ℓ1-regularized M-estimators to Section 7.

Numerical experiments
We now present two numerical experiments that illustrate our theoretical results.
Critical sample size grows linearly with model dimensionality. Here the point is to illustrate the results of Section 4, namely Theorems 4.1-4.2.
Recall that, in a nutshell, these results state that the fast O(d/n) rate for the excess risk becomes available starting from a critical sample size of order O(d_eff ∨ d), where O(·) hides factors depending on the distribution-dependent constants K0, K1, K̃2, ρ arising in Assumptions D0, D1, D2*, and C. In our first experiment (see Fig. 3), we empirically demonstrate that the critical sample size indeed scales linearly with the parameter dimension. For growing sample size n = 10^k, k ∈ [1, 3], we generate an i.i.d. sample (Xi, Yi), i = 1, ..., n, with standard Gaussian design Xi ∼ N(0, I_d) and the conditional distribution of the (binary) label given either by P[Yi = 1] = 1/(1 + exp(−Xi⊤θ*)) (i.e., such that the logistic model is well-specified), or by P[Yi = 1] = 1 − Φ(Xi⊤θ*), where Φ(·) is the standard Gaussian c.d.f., which corresponds to probit regression. Thus, the logistic model for Y|X is misspecified in the second case. We take θ* = 1_d/√d (so that ‖θ*‖2 = 1) and consider the following three quantities for d ∈ {8, 16, 32, 64}:
1. the excess risk L(θ̂n) − L(θ*) of the logistic-loss estimator θ̂n;
2. the excess risk L_SC(θ̂n^SC) − L_SC(θ*^SC), where L_SC(θ) := E[ℓ_SC(Y, X⊤θ)] with ℓ_SC(y, η) given by (22), and θ*^SC minimizes L_SC(θ) and might differ from θ*; note that ℓ_SC(y, ·) and ℓ(y, ·) have the same second-order Taylor expansion around η = 0 (see Fig. 1);
3. the excess risk L(θ̂n^SC) − L(θ*), which evaluates θ̂n^SC as a surrogate estimator.
In all three cases, we approximate the excess risk via a test sample with N = 10^4 observations, and we compute θ*^SC by running the fmincon optimization routine in Matlab (we use the constraint ‖θ‖2 ≤ 2 to avoid numerical instabilities). Then, for each value of d and each of the three notions of excess risk, we plot the excess risk against the sample size on the log10-log10 scale. The experiment is repeated T = 800 times, and the averaged curve is plotted along with a 3σ-confidence interval.
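A scaled-down version of this experiment (well-specified logistic case only, a single repetition instead of T = 800, and plain Newton's method in place of fmincon; all simplifications ours) can be sketched as follows:

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Logistic ERM via Newton's method; labels y in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (p - y) / len(y)
        H = (X * (p * (1 - p))[:, None]).T @ X / len(y) + 1e-9 * np.eye(X.shape[1])
        theta -= np.linalg.solve(H, grad)
    return theta

def logistic_risk(theta, X, y):
    m = X @ theta
    return np.mean(np.log1p(np.exp(-m)) + (1 - y) * m)   # mean logistic loss

rng = np.random.default_rng(0)
d = 8
theta_star = np.ones(d) / np.sqrt(d)   # ||theta_*||_2 = 1, as in the experiment

def sample(n):
    X = rng.standard_normal((n, d))
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_star))).astype(float)
    return X, y

X_test, y_test = sample(10 ** 5)       # test sample approximating the population risk
L_star = logistic_risk(theta_star, X_test, y_test)
excess = {n: logistic_risk(fit_logistic(*sample(n)), X_test, y_test) - L_star
          for n in (100, 1000)}
```

The excess risk should shrink roughly tenfold between n = 100 and n = 1000 once both sample sizes are above the critical level, matching the O(d/n) fast rate.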
The results are shown in Fig. 3. We can distinctly see the elbow effect: the initial slow convergence rate (slope around −1/2 on the log-log scale) changes to the fast rate (slope −1) for larger sample sizes. This is observed for all three curves, all values of d, and both conditional distributions of Y.
• For the logistic distribution, θ̂n outperforms θ̂n^SC in the fast-rate zone (i.e., with sample sizes above the critical level) in terms of their corresponding "native" risks, as well as the logistic risk. This is expected: while θ̂n is well-specified, the estimator θ̂n^SC has to pay for model misspecification, and its excess risk depends on d_eff rather than d (cf. (34) in Theorem 4.1). Meanwhile, for smaller sample sizes, L_SC(θ̂n^SC) − L_SC(θ*^SC) is smaller than the other two excess risks. This seems to be simply due to ℓ_SC(y, η) being smaller than ℓ(y, η) away from η = 0 (cf. Fig. 1).
• In the case of the probit distribution, both estimators are misspecified, and turn out to have very close performance in terms of all three excess risks.
Finally, and most importantly, we see that the "elbow" on the curves moves to the right in (roughly) constant increments as we increase d geometrically. This is what we expect: according to Theorems 4.1-4.2, the critical sample size grows linearly with d, or with d_eff in the misspecified case (and here d_eff is itself linear in d).
Critical sample size growing as e^{RD} for "hard" design distributions. Here we empirically investigate the dependency of the constants K0, K1, K̃2, ρ on the norm D = ‖θ*‖Σ of the population risk minimizer. Recall that in Appendix D we provide polynomial bounds in the case of logistic regression with Gaussian design. However, for certain (quite artificial) design distributions the dependency might be exponential, as implied by the results of [18]. In this experiment, we consider the adversarial distribution proposed in [18, Sec. 3.2], in which X ∈ R² is supported on three points with carefully chosen probabilities and Y ≡ 1. The authors prove an Ω(1/√n) lower bound (and hence the absence of a fast rate) for the excess risk as long as n ≲ e^{RD}, where R bounds ‖X‖Σ−1. We empirically discover a similar phenomenon for the self-concordant loss (22). To this end, we follow a similar protocol as in the previous experiment, but generate the pairs (Xi, Yi) according to the distribution in [18], and linearly increase D while fixing d = 2. The experiment is repeated T = 1600 times for sample sizes n ∈ [10^1, 10^4], and the population risk is approximated via a test sample of size N = 5·10^4. We then plot the same three dependencies as in the previous experiment (again on the log-log scale) for D ∈ {1, 3, 5, 7}.
The results are presented in Fig. 4. For small sample sizes the curves oscillate, which seems to be due to the special low-dimensional structure of the design distribution. However, the upper envelope of the curves clearly exhibits the same "elbow" effect as before: the slope changes from roughly −1/2 to −1 for large sample sizes. Moreover, the horizontal location of the elbow moves in nearly uniform increments as we change D linearly, precisely as expected from the theory in [18]. We also note that the "transfer" risk L(θ̂n^SC) − L(θ*) converges to a non-zero value, which shows that θ*^SC ≠ θ* for the distribution considered here.
[Figure 3: comparison of the M-estimators θ̂n (logistic loss) and θ̂n^SC (self-concordant loss (22)) in the first experiment. "Logistic", "self-conc" and "calibrated" correspond to the three notions of excess risk: the "native" risks for θ̂n, θ̂n^SC and the "transfer" risk for θ̂n^SC with the logistic loss.]
[Figure 4: comparison of the M-estimators θ̂n, θ̂n^SC in the second experiment, using the adversarial data distribution from [18]. "Logistic", "self-conc" and "calibrated" correspond to the same three notions of excess risk as in the first experiment (see Fig. 3).]

Related work
Self-concordant analysis of logistic regression. Our approach is inspired by [2], and we reuse and extend some of their technical results in our Propositions B.3-B.4. However, our results and analysis differ from those in [2] in several crucial ways. First, we address the random-design setting, whereas in [2] the design is fixed. Second, [2] considers only pseudo self-concordant losses, focusing on logistic regression, whereas we also provide results for canonically self-concordant losses and, crucially, compare the two cases. Third, we obtain similar results for ill-specified models, whereas [2] only establishes a slow rate in this case. Finally, and most importantly, while we use tools very similar to those in [2], the "core" of our analysis is more direct. Namely, [2] studies the minimizer θ̂λ,n of the ℓ2-penalized empirical risk with a strictly positive regularization parameter λ and, moreover, imposes a technical condition on the minimal magnitude of λ, see their Eq. (13). Upon close inspection, this condition implies a sample-size requirement of the form (47), in which the degrees-of-freedom parameter dfλ replaces d in the ℓ2-penalized setting. This, in turn, allows one to carry out an argument analogous to ours, applying Proposition B.4 to the regularized empirical risk. However, ℓ2-penalization makes the analysis much more involved, as it rests on the comparison of the regularized risks and, accordingly, relates θ* and θ̂λ,n through an intermediate point: the minimizer θλ of the regularized average risk. The extra condition in [2], which makes this analysis possible, is non-trivial, and requires a fine balance between the regularization parameter, the sample size, and various types of degrees of freedom and biases. We manage to circumvent these difficulties for plain ERM, including the ill-specified case, by realizing that the only condition needed to carry out the argument based on self-concordance, in the non-regularized case, is a sufficient sample size.

Self-concordant analysis and improper algorithms
Another relevant work is [3], which studies logistic regression with random design, but analyzes an estimate computed by stochastic approximation with averaging. While this estimator is more advantageous from the computational standpoint, the control of the distance to the optimum is more involved (see [3, Proposition 7]), which leads to the suboptimal risk bound (48), where μ is the minimal eigenvalue of H, R is an upper bound for ‖X‖2 and sup_{θ∈Θ}‖∇ℓZ(θ)‖2, and D0 := ‖θ0 − θ*‖2 is the initial ℓ2-distance from the optimum (in fact, if D0 is known up to a constant factor, the factor R⁴D0⁴ in (48) can be replaced with R²D0²). The bound (48) reflects the fact that the gradient descent trajectory is not affine-invariant; hence, the distances are not "measured" in terms of the natural norm ‖·‖H. For natural gradient descent, that is, gradient descent on the transformed problem θ̃ = H^{1/2}θ, the factor μ would disappear from (48), but R would be replaced with max(d_eff, ρ·d), and D0 with the initial prediction distance ‖θ0 − θ*‖H, which would lead to a bound scaling as the cube of max(d_eff, ρd). The follow-up work [4] studies a version of the quasi-Newton method, solving the quadratic subproblems via stochastic approximation. This allows for an affine-invariant analysis of the outer loop, and results in an improved bound whenever n ≳ R⁴D0⁴ + 1. It should be noted that the curvature parameter ρ that appears in these results, as well as in our results for pseudo self-concordant losses, is problem-dependent. In particular, it depends on the true distribution P of the data, and can be very large if this distribution is chosen adversarially. By constructing such an adversarial distribution, [18] prove a lower bound Ω(√(RD/n)) for the excess risk of any proper algorithm in logistic regression in the finite-sample regime n = O(e^{RD}). This implies that ρ grows super-polynomially in RD for this distribution.
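For illustration, here is a bare-bones version of stochastic approximation with averaging for the logistic loss; the decaying step size is our simplification of the constant-step schedule analyzed in [3]:

```python
import numpy as np

def averaged_sgd_logistic(X, y, step=0.5):
    """Single-pass SGD with Polyak-Ruppert averaging for the logistic loss,
    labels y in {-1, +1}; returns the average of the iterates."""
    theta = np.zeros(X.shape[1])
    theta_bar = np.zeros(X.shape[1])
    for t, (x, yt) in enumerate(zip(X, y), start=1):
        g = -yt * x / (1.0 + np.exp(yt * np.dot(x, theta)))  # stochastic gradient
        theta -= step / np.sqrt(t) * g                       # decaying step size
        theta_bar += (theta - theta_bar) / t                 # running average of iterates
    return theta_bar

# Well-specified logistic data: the averaged iterate approaches theta_*.
rng = np.random.default_rng(0)
n, d = 20000, 5
theta_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_star)), 1.0, -1.0)
theta_bar = averaged_sgd_logistic(X, y)
```

Averaging removes most of the oscillation of the raw SGD iterates, but, as discussed above, the trajectory is not affine-invariant, so its accuracy degrades with the conditioning of H.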
Notably, the lower bound of [18] is not applicable in the setting of improper prediction, where one is allowed to estimate η* := X⊤θ* with an arbitrary predictor η : X → R, not necessarily a linear one. Exploiting this observation, [17] recently proposed an improper estimator which attains the excess risk O(d/n) up to factors logarithmic in RD, n, and 1/δ. Their estimator reduces to Vovk's Aggregating Algorithm [60] for online convex optimization, combined with a simple "boosting the confidence" scheme proposed in [38].

Non-parametric setup and ℓ2-regularization
After the arXiv preprint of this work began circulating, our analysis was extended in [36] to M-estimators with ℓ2-regularization, including the case of an infinite-dimensional parameter. The reader may refer to [36] for an overview of related work in this direction. In a nutshell, [36] proves asymptotically near-optimal bounds O(d_eff/n) on the "variance" term corresponding to the excess risk L(θ̂λ,n) − L(θλ), plus an additional "bias" term L(θλ) − L(θ*), under condition (47), and without extra conditions in the vein of those in [2]. Moreover, it is shown that the classical source and capacity conditions [13], known to lead to faster non-parametric rates in ridge regression, can be extended to M-estimators with self-concordant losses. However, [36] does not extend our improved results with near-linear sample size (Theorems 4.1-4.2) to the ℓ2-penalized case. We believe that such an extension is possible by replacing Theorem A.2 with a similar result for regularized covariance matrices, such as [27, Thm. 9]. This, however, would be of little practical interest, since typically, under the source condition, dfλ is a constant depending only on the rate of decay of the eigenvalues of H.

Quasi-Newton algorithms
We also mention in passing that there has recently been a surge of interest in stochastic quasi-Newton methods applied to the finite-sum setting with self-concordant losses; see, e.g., [63], [65]. However, none of these works establishes generalization bounds for the associated estimator. In fact, such bounds have recently been established in [35] for a certain (globally convergent) ad-hoc quasi-Newton scheme. These generalization bounds are similar to those established in [36] for exact ERM, and are subject to similar criticism.

Empirical processes
The use of empirical processes in the context of parametric estimation was pioneered in [49], which has been one of the main inspirations for our work. Apart from the technical difficulty of the proofs, our main critique of [49] concerns the global conditions it imposes; most importantly, it requires ∇Ln(θ) to be subgaussian uniformly over the whole parameter set Θ. As can be seen from the proof of Proposition D.1 in Appendix D, the verification of such global conditions can be quite technical; moreover, they can in fact be far more restrictive than their local counterparts (see, e.g., the bounds in Proposition D.1, which degrade drastically when ‖θ*‖Σ ≫ 1, and which are sharp, as can be seen from the analysis).
Recently we learned about the work [39], which applies empirical processes to study constrained empirical risk minimization with smooth non-convex losses. (Its preprint came out after that of our work.) Essentially, [39] proves that in the regime $n \gtrsim d \log d$ (resp. $n \gtrsim s \log d$ in the high-dimensional setup), the sample gradients $\nabla L_n(\theta)$ and Hessians $\nabla^2 L_n(\theta)$ uniformly converge to their population counterparts, assuming that $\nabla L_n(\theta)$ is subgaussian and $\nabla^2 L_n(\theta)$ is subexponential on the whole domain $\Theta$. This allows one to establish a correspondence between the stationary points of $L(\cdot)$ and $L_n(\cdot)$. While the focus of [39] is different, we expect that one may also prove asymptotically optimal rates for the excess risk in this regime, i.e., prove "local" analogues of our improved results (cf. Section 4), by localizing a minimizer $\widehat\theta_n$ to the unit Dikin ellipsoid of the associated population risk minimizer $\theta_*$. However, [39] has the same limitations as [49], requiring "global" conditions on the tails of $\nabla L_n(\theta)$ and $\nabla^2 L_n(\theta)$. Similar criticism applies to the literature on $\ell_1$-regularized M-estimators in high dimensions ([56, 41, 31]); we discuss these results in more detail in Section 5.
Further related work on $\ell_1$-regularized M-estimators
Interestingly, [64] showed that, in the absence of restricted-eigenvalue (RE) type conditions imposed on the (fixed) design matrix, decomposable regularizers only lead to slow $O(1/\sqrt{n})$ rates, even with the quadratic loss. Hence, the light-tailed design condition that we impose appears to be necessary when $\ell_1$-regularized M-estimators are considered.
Not directly relevant to our results here, we note that, whenever computational considerations are not important, $\ell_1$-regularization can perform worse than other types of regularization. In fact, $\ell_0$-regularized estimators are known to achieve the $O(s/n)$ rate for the prediction error without RE-type conditions (in the fixed-design setup) [64]. Moreover, there are other (non-convex) penalties with favorable statistical properties without incoherence; see, e.g., [34, 33, 31, 32].
After the preprint of this work had been publicized, the statistical performance of regularized M-estimators has been studied in the asymptotic regime $n, d \to \infty$ with $d/n = c$, including the "high-dimensional" case with $c \gg 1$; see [53, 51].

Conclusion
Our work sheds light on the mechanism behind the transition to the fast-rate regime in M-estimators with smooth losses. Our analysis handles M-estimators with losses satisfying self-concordance-type assumptions, including logistic regression, other generalized linear models, and robust regression. Self-concordance assumptions allow us to control the precision of local quadratic approximations of the empirical risk. A simple analysis under minimal assumptions leads to a fast-rate guarantee for large sample sizes: larger (in order) than $d \cdot d_{\text{eff}}$, where $d_{\text{eff}}$ is the effective dimension of the parametric model. However, a refined analysis under slightly stronger assumptions leads to the $O(\max\{d_{\text{eff}}, d \log d\})$ sample size threshold. This is done through a combination of self-concordance with a covering argument, which allows us to control the uniform deviations of the empirical risk Hessian in the Dikin ellipsoid around the population risk minimizer. We also extend the analysis to $\ell_1$-regularized M-estimators in the high-dimensional regime. Finally, we verify the empirical performance of a canonically self-concordant analogue of the logistic loss in numerical experiments.

A.1. Subgaussian distributions
We recall the definition of the subgaussian norm of a random variable $\xi \in \mathbb{R}$: $\|\xi\|_{\psi_2} := \inf\{t > 0 : \mathbb{E}\exp(\xi^2/t^2) \le 2\}$. The lemma below provides equivalent definitions of the subgaussian norm. Following [59], we define the $\|\cdot\|_{\psi_2}$-norm of a random vector $Z \in \mathbb{R}^d$ as $\|Z\|_{\psi_2} := \sup_{u \in S^{d-1}} \|\langle Z, u \rangle\|_{\psi_2}$, where $S^{d-1}$ is the unit sphere in $\mathbb{R}^d$. Note that this is indeed a norm; in particular, it satisfies the triangle inequality $\|Z_1 + Z_2\|_{\psi_2} \le \|Z_1\|_{\psi_2} + \|Z_2\|_{\psi_2}$ for any pair of random vectors $Z_1, Z_2$. Another elementary property is that $\|JZ\|_{\psi_2} \le \|J\|_{\infty} \|Z\|_{\psi_2}$ for an arbitrary matrix $J$. Some well-known properties of subgaussian random vectors are summarized in the following lemmas.
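For concreteness, the standard equivalent characterizations of subgaussianity from [59] read as follows (a sketch with absolute constants $c, C$, not a verbatim restatement of Lemma A.1): for a random variable $\xi$ with $\|\xi\|_{\psi_2} \le K$,

```latex
\begin{align*}
\text{(tails)} \quad
  & \mathbb{P}\{|\xi| \ge t\} \le 2\exp\!\bigl(-t^2/(C K^2)\bigr),
    \qquad t \ge 0; \\
\text{(moment growth)} \quad
  & \bigl(\mathbb{E}|\xi|^p\bigr)^{1/p} \le C K \sqrt{p},
    \qquad p \ge 1; \\
\text{(MGF of } \xi^2 \text{)} \quad
  & \mathbb{E}\exp\!\bigl(\xi^2/(C K)^2\bigr) \le 2,
\end{align*}
```

and each of these properties implies the others with $K$ changed by at most an absolute constant factor.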
Proof. The claim follows from Item 1 of Lemma A.1 by the union bound.
Next we give a bound for the $p$-th moment of $\|Z\|_\infty$. Although this bound is loose for any fixed $p$, we only use it in the regime $p \approx \log d$, where it is tight. 9
Lemma A.3. In the assumptions of the previous lemma, for any $p \ge 1$ it holds
Proof. Using the bound from Lemma A.2, we obtain
We obtain the claim by extracting the $p$-th root and doing simple estimates.
The next result shows that the $\|\cdot\|_{\psi_2}$-norm is stable under affine transforms.

Then for any
Proof. The quantity $\|\Sigma^{-1/2} X\|_{\psi_2}$ is invariant with respect to linear transforms, so it only remains to treat the case $\widetilde X = X + b$. In this case, $\widetilde\Sigma = \Sigma + b b^\top$, and, on the other hand, $\widetilde\Sigma^{-1}$ is given by the Sherman-Morrison formula. Finally, note that $K \ge 1$, as follows from the inequality $\mathbb{E}[\xi^4] \ge (\mathbb{E}[\xi^2])^2$ applied to $\xi = \langle u, X \rangle$, and Item 2 of Lemma A.1.

9 Tight bounds for all moments can be obtained via the Chernoff method combined with the general Orlicz norms $\|\cdot\|_{\psi_\alpha}$ with $\alpha = 2/p$; see, e.g., [46]. This is beyond the scope of this paper.
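For completeness, the Sherman-Morrison step referenced above is the standard rank-one update identity, stated here for the update $\Sigma + b b^\top$:

```latex
(\Sigma + b b^\top)^{-1}
  \;=\; \Sigma^{-1} \;-\; \frac{\Sigma^{-1} b\, b^\top \Sigma^{-1}}{1 + b^\top \Sigma^{-1} b},
\qquad \text{valid whenever } \Sigma \succ 0 .
```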

A.2. Quadratic forms of subgaussian vectors
We call a random vector $Z \in \mathbb{R}^d$ isotropic if $\mathbb{E}[Z] = 0$ and $\mathbb{E}[Z Z^\top] = I_d$. The following result is a deviation bound for quadratic forms of isotropic subgaussian random vectors. It is obtained from [20, Theorem 2.1] using the isotropy assumption, which allows us to get rid of the $K^2$ factor in front of $\mathrm{Tr}(J)$.
Theorem A.1. Let $Z \in \mathbb{R}^d$ be an isotropic random vector with $\|Z\|_{\psi_2} \le K$, and let $J \in \mathbb{R}^{d \times d}$ be positive semidefinite. Then,
In other words, with probability at least $1 - \delta$ it holds
As a consequence, $\mathbb{P}\{\zeta \ge c u K \sqrt{\mathrm{Tr}(J)}\} \le \exp(c_1 - u^2/c_2)$, so that $\|\zeta\|_{\psi_2} \le c K \sqrt{\mathrm{Tr}(J)}$.
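For context, [20, Theorem 2.1] is a Hanson-Wright-type inequality; its usual form reads (a sketch with an absolute constant $c$; the exact statement invoked above may differ in constants):

```latex
\mathbb{P}\bigl\{\, \bigl| Z^\top J Z - \mathbb{E}[Z^\top J Z] \bigr| \ge t \,\bigr\}
\;\le\; 2\exp\!\left( -c \min\!\left\{ \frac{t^2}{K^4 \|J\|_F^2},\;
  \frac{t}{K^2 \|J\|} \right\} \right),
\qquad t \ge 0,
```

where for isotropic $Z$ one has $\mathbb{E}[Z^\top J Z] = \mathrm{Tr}(J)$, which is what removes the $K^2$ factor in front of the trace.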

A.3. Sample covariance matrices
Next we focus on sample second-moment matrices of subgaussian vectors.
where $X_1, \ldots, X_n$ are independent copies of $X$. Whenever $n \gtrsim K^4 (d + \log(1/\delta))$, with probability at least $1 - \delta$ it holds
Next we present an extension of this result to the high-dimensional setting.
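The standard covariance-estimation bound for subgaussian vectors (see, e.g., [59]) has the following shape (a sketch; the constants and the exact norm may differ from the statement above): with probability at least $1 - \delta$,

```latex
\Bigl\| \tfrac{1}{n} \textstyle\sum_{i=1}^{n} X_i X_i^\top - \Sigma \Bigr\|
\;\le\; C K^2 \left( \sqrt{\frac{d + \log(1/\delta)}{n}}
  + \frac{d + \log(1/\delta)}{n} \right) \|\Sigma\| ,
```

so that once $n \gtrsim K^4 (d + \log(1/\delta))$, the sample second-moment matrix is a constant-factor spectral approximation of $\Sigma$.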
The next proposition, whose proof follows [42], allows us to control the second derivative of the loss when it is restricted to a straight line.
for some $c \ge 0$. Then, for any $t \in \mathbb{R}$ such that $c|t|\sqrt{g(0)} \le 1$, it holds Proof. We first treat the case $g(0) > 0$. Consider the segment and assume that $g(t) > 0$ on the whole of $T_0$. Then we can define $\psi(t) := 1/\sqrt{g(t)}$ on $T_0$, and the premise of the proposition translates to $|\psi'(t)| \le c$. Now, let $t \in T_0$ be positive without loss of generality. Integrating from $0$ to $t$, we get Multiplying by the product $g(t) g(0) > 0$ and rearranging the terms, we prove the claim in the case where $g(t)$ does not vanish on $T_0$ (the case of negative $t$ is treated analogously). Now, let $t_0 \in T_0$ be the leftmost zero of $g(t)$ on $T_0 \cap \mathbb{R}_+$ (recall that we still assume $g(0) > 0$). Then the preceding argument is valid for any $t \in [0, t_0]$, which implies that $g(t_0) > 0$, thus yielding a contradiction. This argument can be repeated for negative $t$, taking $t_0$ to be the rightmost negative zero of $g(t)$ on $T_0$. Hence, $g(0) > 0$ in fact implies that $g(t) > 0$ on the whole of $T_0$. Finally, assume that $g(0) = 0$. If $g(t) \equiv 0$ on the whole of $T_0$, we are done. Otherwise, there is a point $t' \in T_0$ at which $g(t') > 0$. W.l.o.g. assume that $t' > 0$, let $t_0$ be the rightmost zero of $g(t)$ on $T_0 \cap \mathbb{R}_+$, and take a pair of points $t_1, t_2 \in T_0$ such that $t_0 < t_1 < t_2$. Integrating $\psi'(t)$ from $t_1$ to $t_2$, we get which, after multiplication by $g(t_1) g(t_2)$ and rearrangement, results in When $t_1 \to t_0$, by continuity of $g(t)$ we get a contradiction with $g(t_0) = 0$.
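The integration step in the proof above can be spelled out as follows (a reconstruction under the premise $|\psi'(t)| \le c$ with $\psi(t) = g(t)^{-1/2}$; the proposition's displayed conclusion should coincide with this sandwich bound up to normalization):

```latex
\left| \frac{1}{\sqrt{g(t)}} - \frac{1}{\sqrt{g(0)}} \right| \le c|t|
\quad\Longrightarrow\quad
\frac{g(0)}{\bigl(1 + c|t|\sqrt{g(0)}\bigr)^2}
\;\le\; g(t) \;\le\;
\frac{g(0)}{\bigl(1 - c|t|\sqrt{g(0)}\bigr)^2},
```

where the upper bound requires $c|t|\sqrt{g(0)} < 1$; this is exactly where the restriction $c|t|\sqrt{g(0)} \le 1$ in the statement comes from.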
The next proposition describes the local properties of multivariate functions whose restrictions to line segments behave as pseudo self-concordant functions (Case (a)), or similarly but with a weaker control of the third derivative (Case (b)). Case (a) repeats [2, Proposition 1] and suffices for pseudo self-concordant losses; Case (b) allows us to treat canonically self-concordant losses.
Lemma B.1 (Lemma 1 in [2]). Let $g : [0, 1] \to \mathbb{R}$ be a three times differentiable convex function such that $g''(0) > 0$ and, for some $S \ge 0$, Then, for any $0 \le t \le 1$, one has Proof. First assume that $g''(t) > 0$ on $[0, 1]$. Then the premise of the lemma implies that $-S\,\mathrm{d}t \le \mathrm{d} \log g''(t) \le S\,\mathrm{d}t$ for $0 \le t \le 1$. Integrating this, we get the two-sided bound (65), i.e., $g''(0) e^{-St} \le g''(t) \le g''(0) e^{St}$. Two more integrations successively give the corresponding bound on $g'(t)$, and then (64). Now, let $t_0 \in (0, 1]$ be the leftmost zero of $g''(t)$. Then the preceding argument can be applied on $[0, t_0]$, yielding a contradiction due to the left inequality in (65).
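The two further integrations can be written out explicitly (a reconstruction consistent with the two-sided bound on $g''$ derived in the proof):

```latex
g'(0) + g''(0)\,\frac{1 - e^{-St}}{S}
\;\le\; g'(t) \;\le\;
g'(0) + g''(0)\,\frac{e^{St} - 1}{S},
\\[4pt]
g(0) + g'(0)\,t + g''(0)\,\frac{e^{-St} + St - 1}{S^2}
\;\le\; g(t) \;\le\;
g(0) + g'(0)\,t + g''(0)\,\frac{e^{St} - St - 1}{S^2}.
```

Each line follows by integrating the previous bound from $0$ to $t$ and using the convention that the fractions extend by continuity to $S = 0$.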
Integrating two more times, and assuming $S < 1$ for the upper bound, we get the intermediate bound and then (66). When $t = 1$, the term $(1 - S)$ vanishes from the denominator of the right-hand side of (66); hence in this case we can take $S < 2$.
In particular, taking $t = 1/S$, we have Proof. We again assume w.l.o.g. that $g''(t) > 0$. Integrating three times, we get the bound implying (68). The last claim follows by continuity of $f(u) = u \log u$ at $0$.
Proof of the proposition. Case (a), the first statement of Case (b), and the second statement of Case (b) follow, correspondingly, from Lemmas B.1, B.3, and B.2 applied to $g(t) = \phi_F(t)$, using that $g(t) = F(\theta_t)$, $g'(0) = \langle \nabla F(\theta_0), \theta_1 - \theta_0 \rangle$, and $g''(0) = \|\theta_1 - \theta_0\|_{H_0}^2$. Note that the inner-product structure of $S$ does not play a role here, but is used in Proposition B.4.
The next result describes the behavior of (pseudo) self-concordant functions close to the optimum. Case (a) corresponds to [2, Proposition 2]. The argument for Case (b) appears to be new and is of independent interest; we note that a very similar argument was independently discovered by U. Marteau-Ferey in [36]. The key message of Proposition B.4 is that local information about $F(\cdot)$ at a single point efficiently translates into global information about how close this point is to the optimum. When applied to the empirical risk with $\theta_0 = \theta_*$ and $\widehat\theta = \widehat\theta_n$, this proposition allows us to localize $\widehat\theta_n$ using that the quantity $\|\nabla L_n(\theta_*)\|_{H^{-1}}^2$ decreases at rate $O(d_{\text{eff}}/n)$ under the i.i.d. assumption.

Case (a). By the bound established in the proposition, we have
where we first used that $u \mapsto (e^{-u} - 1 + u)/u^2$ is a decreasing function, and then the Cauchy-Schwarz inequality. Denoting $u = Rr$, we arrive at (70). By the premise, we know that $\nu R \le 1/2$, hence $e^{-u} - 1 + u/2 \le 0$. We can check numerically that this implies $u \le 2$; moreover, one has $e^{-u} - 1 + u \ge u^2/4$ for such $u$. Plugging this back into (70), we arrive at $u \le 4\nu R$, that is, $\|\theta_1 - \theta_0\|_{H_0} \le 4\nu$. In other words, the level set $\Theta_F(F(\theta_0))$ is compact and belongs to the $\|\cdot\|_{H_0}$-ball of radius $4\nu$ centered at $\theta_0$. Hence, the minimizer $\widehat\theta$ exists and belongs to the same ball; it is also unique since $\nabla^2 F(\theta) \succ 0$.
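The two elementary facts invoked in the numerical check above can be verified with a short script (a sanity check on a fine grid, not part of the proof):

```python
import math

def f(u):
    """e^{-u} - 1 + u/2; its nonpositivity forces u <= 2."""
    return math.exp(-u) - 1 + u / 2

def g(u):
    """e^{-u} - 1 + u - u^2/4; nonnegative on [0, 2]."""
    return math.exp(-u) - 1 + u - u * u / 4

# f(u) > 0 on a grid of (2, 12], so f(u) <= 0 indeed implies u <= 2
# (f is increasing past its minimum at u = log 2, so the grid suffices)
assert all(f(2 + 0.001 * k) > 0 for k in range(1, 10001))

# e^{-u} - 1 + u >= u^2 / 4 on a grid of [0, 2]
assert all(g(0.0002 * k) >= 0 for k in range(10001))
```

In fact $f(u) = 0$ already at $u \approx 1.59$, so the threshold $u \le 2$ used in the proof is slightly conservative.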
Case (b) with $S < 1$. By the lower bound in (63), we have
where we used that $u \mapsto 1/(2 + u)$ is a decreasing function on $\mathbb{R}_+$. Whence,
where $u := Rr$. Since $\nu R \le 1/2$, we have $u \le 2$. Thus, we get $r \le 4\nu$, as required.
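The final step can be reconstructed as follows (assuming the displayed inequality has the form $u/(2 + u) \le \nu R$, which is consistent with the conclusions stated above):

```latex
\frac{u}{2 + u} \;\le\; \nu R \;\le\; \frac{1}{2}
\quad\Longrightarrow\quad
u \le 2
\quad\Longrightarrow\quad
u \;\le\; (2 + u)\,\nu R \;\le\; 4\nu R
\quad\Longrightarrow\quad
r \;=\; \frac{u}{R} \;\le\; 4\nu .
```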