Annales de l'Institut Henri Poincaré, Probabilités et Statistiques

Sparsity in penalized empirical risk minimization

Vladimir Koltchinskii

Full-text: Open access


Let $(X, Y)$ be a random couple in $S×T$ with unknown distribution $P$. Let $(X_1, Y_1), …, (X_n, Y_n)$ be i.i.d. copies of $(X, Y)$, $P_n$ being their empirical distribution. Let $h_1, …, h_N:S↦[−1, 1]$ be a dictionary consisting of $N$ functions. For $λ∈ℝ^N$, denote $f_λ:=∑_{j=1}^Nλ_jh_j$. Let $ℓ:T×ℝ↦ℝ$ be a given loss function, convex with respect to the second variable. Denote $(ℓ•f)(x, y):=ℓ(y; f(x))$. We study the following penalized empirical risk minimization problem $$\hat{\lambda}^{\varepsilon }:=\mathop{\operatorname {argmin}}_{\lambda\in {\mathbb{R}}^{N}}\bigl[P_{n}(\ell\bullet f_{\lambda})+\varepsilon \|\lambda\|_{\ell_{p}}^{p}\bigr],$$ which is an empirical version of the problem $$\lambda^{\varepsilon }:=\mathop{\operatorname{argmin}}_{\lambda\in {\mathbb{R}}^{N}}\bigl[P(\ell \bullet f_{\lambda})+\varepsilon \|\lambda\|_{\ell_{p}}^{p}\bigr]$$ (here $\varepsilon≥0$ is a regularization parameter; $λ^0$ corresponds to $\varepsilon=0$). A number of regression and classification problems fit this general framework. We are interested in the case when $p≥1$ is close enough to 1 (so that $p−1$ is of the order $\frac{1}{\log N}$, or smaller). We show that the “sparsity” of $λ^\varepsilon$ implies the “sparsity” of $\hat{\lambda}^\varepsilon$ and study the impact of “sparsity” on bounding the excess risk $P(ℓ•f_{\hat{\lambda}^\varepsilon})−P(ℓ•f_{λ^0})$ of solutions of empirical risk minimization problems.
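As a toy illustration (not taken from the paper), the penalized empirical problem can be solved numerically for a concrete choice of loss and dictionary: squared loss $\ell(y, u) = (y - u)^2$, a cosine dictionary $h_j(x) = \cos(jx)$ on $[0, 2\pi]$, and $p = 1 + 1/\log N$ as suggested by the abstract. All parameter values below (sample size, $\varepsilon$, the sparse "oracle" $\lambda$) are illustrative assumptions.

```python
# Minimal sketch of penalized empirical risk minimization with an l_p
# penalty, p slightly above 1 (illustrative; not the paper's algorithm).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

N, n = 8, 200                      # dictionary size, sample size (assumed)
p = 1.0 + 1.0 / np.log(N)          # p - 1 of order 1/log N, as in the abstract
eps = 0.1                          # regularization parameter epsilon (assumed)

X = rng.uniform(0.0, 2 * np.pi, size=n)
H = np.cos(np.outer(X, np.arange(1, N + 1)))   # h_j(x) = cos(j x), in [-1, 1]
lam_true = np.zeros(N)
lam_true[0] = 1.5                              # sparse "oracle" lambda
Y = H @ lam_true + 0.1 * rng.standard_normal(n)

def objective(lam):
    # P_n(l . f_lambda) + eps * ||lambda||_{l_p}^p with l(y, u) = (y - u)^2
    resid = Y - H @ lam
    return np.mean(resid ** 2) + eps * np.sum(np.abs(lam) ** p)

# For p > 1 the penalty is differentiable, but its gradient is not Lipschitz
# near zero, so a derivative-free method is a convenient choice here.
res = minimize(objective, np.zeros(N), method="Powell")
lam_hat = res.x
print("estimated lambda:", np.round(lam_hat, 2))
```

With $p$ this close to 1 the penalty behaves much like the $\ell_1$ (Lasso) penalty, so the fitted coefficients on the inactive dictionary functions are shrunk toward zero while the active coefficient survives (somewhat shrunk), which is the "sparsity of $\lambda^\varepsilon$ passes to $\hat{\lambda}^\varepsilon$" phenomenon in miniature.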


Article information

Ann. Inst. H. Poincaré Probab. Statist., Volume 45, Number 1 (2009), 7-57.

First available in Project Euclid: 12 February 2009


Primary: 62G99, 62J99, 62H30 (Classification and discrimination; cluster analysis) [See also 68T10, 91C20]

Keywords: Empirical risk; Penalized empirical risk; ℓ_p-penalty; Sparsity; Oracle inequalities


Koltchinskii, Vladimir. Sparsity in penalized empirical risk minimization. Ann. Inst. H. Poincaré Probab. Statist. 45 (2009), no. 1, 7--57. doi:10.1214/07-AIHP146.


