The Annals of Statistics

Oracle inequalities and optimal inference under group sparsity

Karim Lounici, Massimiliano Pontil, Sara van de Geer, and Alexandre B. Tsybakov


Abstract

We consider the problem of estimating a sparse linear regression vector β* under a Gaussian noise model, for the purpose of both prediction and model selection. We assume that prior knowledge is available on the sparsity pattern, namely the set of variables is partitioned into prescribed groups, only a few of which are relevant in the estimation process. This group sparsity assumption suggests considering the Group Lasso method as a means to estimate β*. We establish oracle inequalities for the prediction and ℓ2 estimation errors of this estimator. These bounds hold under a restricted eigenvalue condition on the design matrix. Under a stronger condition, we derive bounds for the estimation error for mixed (2, p)-norms with 1 ≤ p ≤ ∞. When p = ∞, this result implies that a thresholded version of the Group Lasso estimator selects the sparsity pattern of β* with high probability. Next, we prove that the rate of convergence of our upper bounds is optimal in a minimax sense, up to a logarithmic factor, for all estimators over a class of group sparse vectors. Furthermore, we establish lower bounds for the prediction and ℓ2 estimation errors of the usual Lasso estimator. Using this result, we demonstrate that the Group Lasso can achieve an improvement in the prediction and estimation errors as compared to the Lasso.
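
For orientation, the Group Lasso estimator discussed above minimizes a penalized least-squares criterion whose penalty is a weighted mixed (2, 1)-norm over the prescribed groups. The display below is a minimal sketch in standard notation; the normalization of the residual term and the weights λ_j are placeholders, since the paper prescribes specific data- and group-size-dependent choices.

    \hat{\beta} \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{p}}
    \Big\{ \tfrac{1}{n}\,\| y - X\beta \|_2^{2}
    + 2 \sum_{j=1}^{M} \lambda_j \, \| \beta_{G_j} \|_2 \Big\},
    \qquad
    \| \beta \|_{2,p} := \Big( \sum_{j=1}^{M} \| \beta_{G_j} \|_2^{p} \Big)^{1/p},

where G_1, ..., G_M is the given partition of {1, ..., p} into groups, β_{G_j} is the subvector of β indexed by G_j, and the mixed (2, p)-norm on the right is the one appearing in the estimation-error bounds (with the usual convention ||β||_{2,∞} = max_j ||β_{G_j}||_2).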

An important application of our results is provided by the problem of estimating multiple regression equations simultaneously, also known as multi-task learning. In this case, we obtain refinements of the results in [In Proc. of the 22nd Annual Conference on Learning Theory (COLT) (2009)], which allow us to establish a quantitative advantage of the Group Lasso over the usual Lasso in the multi-task setting. Finally, within the same setting, we show how our results can be extended to more general noise distributions, of which we require only a finite fourth moment. To obtain this extension, we establish a new maximal moment inequality, which may be of independent interest.
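
To make the multi-task connection concrete, the following is a small, hypothetical Python sketch (not taken from the paper) of a Group Lasso estimator for multiple regression equations: the coefficients of each variable across the T tasks form one group, and the estimator is computed by proximal gradient descent with row-wise (block) soft-thresholding. Function names, the step-size rule and the choice of the regularization level lam are illustrative assumptions only.

    import numpy as np

    def block_soft_threshold(B, tau):
        # Shrink each row of B toward zero by tau in Euclidean norm
        # (the proximal operator of the mixed (2, 1)-norm penalty).
        norms = np.linalg.norm(B, axis=1, keepdims=True)
        scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
        return scale * B

    def multitask_group_lasso(Xs, ys, lam, n_iter=500):
        # Proximal-gradient (ISTA) solver for
        #   (1/(2nT)) * sum_t ||y_t - X_t b_t||^2 + lam * sum_j ||B[j, :]||_2,
        # where column t of B holds the coefficients of task t and row j groups
        # the coefficients of variable j across the T tasks.
        T, (n, p) = len(Xs), Xs[0].shape
        B = np.zeros((p, T))
        # Step size 1/L, with L a Lipschitz constant of the gradient of the smooth part.
        L = max(np.linalg.norm(X, 2) ** 2 for X in Xs) / (n * T)
        for _ in range(n_iter):
            grad = np.column_stack(
                [X.T @ (X @ B[:, t] - y) for t, (X, y) in enumerate(zip(Xs, ys))]
            ) / (n * T)
            B = block_soft_threshold(B - grad / L, lam / L)
        return B

Thresholding the row norms of the returned matrix, in the spirit of the p = ∞ result above, then yields an estimate of the set of relevant variables shared across tasks.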

Article information

Source
Ann. Statist. Volume 39, Number 4 (2011), 2164–2204.

Dates
First available: 26 October 2011

Permanent link to this document
http://projecteuclid.org/euclid.aos/1319595462

Digital Object Identifier
doi:10.1214/11-AOS896

Zentralblatt MATH identifier
05987688

Mathematical Reviews number (MathSciNet)
MR2893865

Subjects
Primary: 62J05: Linear regression
Secondary: 62C20: Minimax procedures; 62F07: Ranking and selection

Keywords
Oracle inequalities; group Lasso; minimax risk; penalized least squares; moment inequality; group sparsity; statistical learning

Citation

Lounici, Karim; Pontil, Massimiliano; van de Geer, Sara; Tsybakov, Alexandre B. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics 39 (2011), no. 4, 2164--2204. doi:10.1214/11-AOS896. http://projecteuclid.org/euclid.aos/1319595462.

References

  • [1] Aaker, D. A., Day, G. S. and Kumar, V. (1995). Marketing Research. Wiley.
  • [2] Argyriou, A., Evgeniou, T. and Pontil, M. (2008). Convex multi-task feature learning. Machine Learning 73 243–272.
  • [3] Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res. 9 1179–1225.
  • [4] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
  • [5] Borwein, J. M. and Lewis, A. S. (2006). Convex Analysis and Nonlinear Optimization: Theory and Examples, 2nd ed. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC 3. Springer, New York.
  • [6] Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1 169–194 (electronic).
  • [7] Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
  • [8] Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 2313–2351.
  • [9] Cavalier, L., Golubev, G. K., Picard, D. and Tsybakov, A. B. (2002). Oracle inequalities for inverse problems. Ann. Statist. 30 843–874.
  • [10] Chesneau, C. and Hebiri, M. (2008). Some theoretical results on the grouped variables Lasso. Math. Methods Statist. 17 317–326.
  • [11] Diggle, P. J., Heagerty, P. J., Liang, K.-Y. and Zeger, S. L. (2002). Analysis of Longitudinal Data, 2nd ed. Oxford Statistical Science Series 25. Oxford Univ. Press, Oxford.
  • [12] Donoho, D. L., Elad, M. and Temlyakov, V. N. (2006). Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inform. Theory 52 6–18.
  • [13] Dümbgen, L., van de Geer, S. A., Veraar, M. C. and Wellner, J. A. (2010). Nemirovski’s inequalities revisited. Amer. Math. Monthly 117 138–160.
  • [14] Evgeniou, T., Pontil, M. and Toubia, O. (2007). A convex optimization approach to modeling consumer heterogeneity in conjoint estimation. Marketing Science 26 805–818.
  • [15] Hsiao, C. (2003). Analysis of Panel Data, 2nd ed. Econometric Society Monographs 34. Cambridge Univ. Press, Cambridge.
  • [16] Huang, J., Horowitz, J. L. and Wei, F. (2010). Variable selection in nonparametric additive models. Ann. Statist. 38 2282–2313.
  • [17] Huang, J. and Zhang, T. (2010). The benefit of group sparsity. Ann. Statist. 38 1978–2004.
  • [18] Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Math. 2033. Springer, Berlin.
  • [19] Koltchinskii, V. and Yuan, M. (2008). Sparse recovery in large ensembles of kernel machines. In 21st Annual Conference on Learning Theory—COLT 2008, Helsinki, Finland, July 9-12, 2008 (R. A. Servedio and T. Zhang, eds.) 229–238. Omnipress.
  • [20] Lenk, P. J., DeSarbo, W. S., Green, P. E. and Young, M. R. (1996). Hierarchical Bayes conjoint analysis: Recovery of partworth heterogeneity from reduced experimental designs. Marketing Science 15 173–191.
  • [21] Lounici, K. (2008). Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electron. J. Stat. 2 90–102.
  • [22] Lounici, K., Pontil, M., Tsybakov, A. B. and van de Geer, S. A. (2009). Taking advantage of sparsity in multi-task learning. In Proc. of the 22nd Annual Conference on Learning Theory (COLT 2009) 73–82. Omnipress.
  • [23] Maurer, A. (2006). Bounds for linear multi-task learning. J. Mach. Learn. Res. 7 117–139.
  • [24] Meier, L., van de Geer, S. and Bühlmann, P. (2008). The group Lasso for logistic regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 53–71.
  • [25] Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-dimensional additive modeling. Ann. Statist. 37 3779–3821.
  • [26] Nardi, Y. and Rinaldo, A. (2008). On the asymptotic properties of the group lasso estimator for linear models. Electron. J. Stat. 2 605–633.
  • [27] Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Springer, Berlin.
  • [28] Obozinski, G., Wainwright, M. J. and Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Ann. Statist. 39 1–47.
  • [29] Petrov, V. V. (1995). Limit Theorems of Probability Theory: Sequences of Independent Random Variables. Oxford Studies in Probability 4. The Clarendon Press, New York.
  • [30] Raskutti, G., Wainwright, M. J. and Yu, B. (2009). Minimax rates of estimation for high-dimensional linear regression over q-balls. Available at arXiv:0910.2042.
  • [31] Ravikumar, P., Liu, H., Lafferty, J. and Wasserman, L. (2008). SpAM: Sparse additive models. In Advances in Neural Information Processing Systems (NIPS) (J. C. Platt, D. Koller, Y. Singer and S. Roweis, eds.) 22 1201–1208. MIT Press, Cambridge, MA.
  • [32] Rigollet, P. and Tsybakov, A. (2010). Exponential Screening and optimal rates of sparse estimation. Available at arXiv:1003.2654.
  • [33] Rio, E. (2009). Moment inequalities for sums of dependent random variables under projective conditions. J. Theoret. Probab. 22 146–163.
  • [34] Srivastava, V. K. and Giles, D. E. A. (1987). Seemingly Unrelated Regression Equations Models: Estimation and Inference. Statistics: Textbooks and Monographs 80. Dekker, New York.
  • [35] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York. Revised and extended from the 2004 French original, Translated by Vladimir Zaiats.
  • [36] van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645.
  • [37] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
  • [38] Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge, MA.
  • [39] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67.
  • [40] Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. J. Amer. Statist. Assoc. 57 348–368.
  • [41] Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist. 36 1567–1594.
  • [42] Zhao, P., Rocha, G. and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Statist. 37 3468–3497.