## Bernoulli

### Ridge regression and asymptotic minimax estimation over spheres of growing dimension

Lee H. Dicker

#### Abstract

We study asymptotic minimax problems for estimating a $d$-dimensional regression parameter over spheres of growing dimension ($d\to\infty$). Assuming that the data follow a linear model with Gaussian predictors and errors, we show that ridge regression is asymptotically minimax and derive new closed-form expressions for its asymptotic risk under squared-error loss. The asymptotic risk of ridge regression is closely related to the Stieltjes transform of the Marčenko–Pastur distribution and the spectral distribution of the predictors from the linear model. We also propose adaptive ridge estimators, which adapt to the unknown radius of the sphere, and highlight connections with equivariant estimation. Our results are mostly relevant for asymptotic settings where the number of predictors, $d$, is proportional to the number of observations, $n$, that is, $d/n\to\rho\in(0,\infty)$.
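The link between ridge risk and the Marčenko–Pastur Stieltjes transform can be checked numerically. The sketch below is an illustration under simplifying assumptions that are not taken from the paper: i.i.d. standard Gaussian predictors (identity covariance), unit noise variance, and the ridge penalty $d/\tau^2$, which is the posterior-mean choice under a $N(0, (\tau^2/d) I)$ prior on the parameter. The function names `mp_stieltjes_neg`, `ridge_loss`, and `simulate` are hypothetical. Under these assumptions the squared-error risk on the sphere of radius $\tau$ equals $E\,\mathrm{tr}(X^\top X + (d/\tau^2) I)^{-1}$, which should be close to $\rho\, m(-\rho/\tau^2)$ for large $n$ and $d$, where $m$ is the Stieltjes transform of the Marčenko–Pastur law with ratio $\rho = d/n$.

```python
import numpy as np

def mp_stieltjes_neg(lam, rho):
    """Stieltjes transform m(-lam), lam > 0, of the Marchenko-Pastur law
    with ratio rho = d/n and unit variance (positive root of
    rho*lam*m^2 + (1 - rho + lam)*m - 1 = 0)."""
    b = 1.0 - rho + lam
    return (np.sqrt(b * b + 4.0 * rho * lam) - b) / (2.0 * rho * lam)

def ridge_loss(n, d, tau, rng):
    """Squared-error loss of ridge for one simulated data set.
    Assumptions (illustrative): identity predictor covariance, unit noise
    variance, penalty d / tau^2."""
    beta = rng.standard_normal(d)
    beta *= tau / np.linalg.norm(beta)   # place beta on the sphere of radius tau
    X = rng.standard_normal((n, d))      # Gaussian predictors
    y = X @ beta + rng.standard_normal(n)  # Gaussian errors
    gram = X.T @ X + (d / tau**2) * np.eye(d)
    beta_hat = np.linalg.solve(gram, X.T @ y)
    return float(np.sum((beta_hat - beta) ** 2))

def simulate(n=400, d=200, tau=1.0, reps=20, seed=0):
    """Return (Monte Carlo average loss, rho * m(-rho / tau^2))."""
    rng = np.random.default_rng(seed)
    mc = float(np.mean([ridge_loss(n, d, tau, rng) for _ in range(reps)]))
    rho = d / n
    limit = rho * mp_stieltjes_neg(rho / tau**2, rho)
    return mc, limit

if __name__ == "__main__":
    mc, limit = simulate()
    print(f"Monte Carlo loss: {mc:.4f}, Stieltjes-transform value: {limit:.4f}")
```

With the defaults ($\rho = 0.5$, $\tau = 1$) the closed-form value is $\sqrt{2} - 1 \approx 0.414$, and the Monte Carlo average should land near it. The agreement is only approximate at finite $n$ and $d$.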

#### Article information

Source
Bernoulli, Volume 22, Number 1 (2016), 1–37.

Dates
Received: January 2013
Revised: September 2013
First available in Project Euclid: 30 September 2015

Permanent link to this document
https://projecteuclid.org/euclid.bj/1443620842

Digital Object Identifier
doi:10.3150/14-BEJ609

Mathematical Reviews number (MathSciNet)
MR3449775

Zentralblatt MATH identifier
06543262

#### Citation

Dicker, Lee H. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli 22 (2016), no. 1, 1–37. doi:10.3150/14-BEJ609. https://projecteuclid.org/euclid.bj/1443620842
