Statistical Science

High-Dimensional Regression with Unknown Variance

Abstract

We review recent results for high-dimensional sparse linear regression in the practical case of unknown variance. Several sparsity settings are covered, including coordinate-sparsity, group-sparsity and variation-sparsity. The emphasis is on nonasymptotic analyses and feasible procedures. In addition, a small numerical study compares the practical performance of three schemes for tuning the lasso estimator, and references are collected for more general models, including multivariate regression and nonparametric regression.

Article information

Source
Statist. Sci. Volume 27, Number 4 (2012), 500-518.

Dates
First available in Project Euclid: 21 December 2012

Permanent link to this document
http://projecteuclid.org/euclid.ss/1356098553

Digital Object Identifier
doi:10.1214/12-STS398

Mathematical Reviews number (MathSciNet)
MR3025131

Zentralblatt MATH identifier
1331.62346

Citation

Giraud, Christophe; Huet, Sylvie; Verzelen, Nicolas. High-Dimensional Regression with Unknown Variance. Statist. Sci. 27 (2012), no. 4, 500-518. doi:10.1214/12-STS398. http://projecteuclid.org/euclid.ss/1356098553.

Supplemental materials

• Supplementary material: Supplement to “High-dimensional regression with unknown variance”. This supplement contains a description of estimation procedures that are minimax adaptive to the sparsity for all designs $\mathbf{X}$. It also contains the proofs of the risk bounds for LinSelect and lasso-LinSelect.