## The Annals of Statistics

### High-dimensional asymptotics of prediction: Ridge regression and classification

#### Abstract

We provide a unified analysis of the predictive risk of ridge regression and regularized discriminant analysis in a dense random effects model. We work in a high-dimensional asymptotic regime where $p,n\to\infty$ and $p/n\to\gamma>0$, and allow for arbitrary covariance among the features. For both methods, we provide an explicit and efficiently computable expression for the limiting predictive risk, which depends only on the spectrum of the feature-covariance matrix, the signal strength and the aspect ratio $\gamma$. Especially in the case of regularized discriminant analysis, we find that predictive accuracy has a nuanced dependence on the eigenvalue distribution of the covariance matrix, suggesting that analyses based on the operator norm of the covariance matrix may not be sharp. Our results also uncover an exact inverse relation between the limiting predictive risk and the limiting estimation risk in high-dimensional linear models. The analysis builds on recent advances in random matrix theory.
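To make the setting concrete, here is a minimal simulation sketch of ridge regression under a dense random effects model with $p/n=\gamma$. It is not the authors' method for computing the limiting risk. The abstract does not spell out the model, so the following are assumptions made for illustration only: Gaussian features with a diagonal covariance, unit noise variance, coefficients with variance $\alpha^2/p$, and a ridge penalty scaled as $n\lambda$. The function name `simulate_ridge_risk` and all parameter names are hypothetical.

```python
import numpy as np


def simulate_ridge_risk(n=400, p=200, alpha2=1.0,
                        lambdas=(0.01, 0.1, 0.5, 1.0, 5.0),
                        reps=20, seed=0):
    """Monte Carlo estimate of ridge predictive risk for each penalty in `lambdas`.

    Assumed model (illustrative, not taken from the abstract):
    y = x^T w + noise, x ~ N(0, Sigma) with diagonal Sigma,
    w ~ N(0, (alpha2 / p) I), noise ~ N(0, 1).
    """
    rng = np.random.default_rng(seed)
    sigma_diag = np.ones(p)  # eigenvalues of the feature covariance (identity here)
    totals = {lam: 0.0 for lam in lambdas}
    for _ in range(reps):
        # Dense random effects: every coordinate of w carries a small signal.
        w = rng.normal(scale=np.sqrt(alpha2 / p), size=p)
        X = rng.normal(size=(n, p)) * np.sqrt(sigma_diag)
        y = X @ w + rng.normal(size=n)
        gram, xty = X.T @ X, X.T @ y
        for lam in lambdas:
            # Ridge estimator with penalty n * lam, solved as a linear system.
            w_hat = np.linalg.solve(gram + n * lam * np.eye(p), xty)
            err = w_hat - w
            # Conditional risk on a fresh (x0, y0): noise variance plus
            # (w_hat - w)^T Sigma (w_hat - w).
            totals[lam] += 1.0 + float(err @ (sigma_diag * err))
    return {lam: total / reps for lam, total in totals.items()}


risks = simulate_ridge_risk()
for lam, risk in risks.items():
    print(f"lambda = {lam:<5} estimated predictive risk = {risk:.3f}")
```

Replacing `sigma_diag` with a non-constant vector of eigenvalues gives a rough way to explore how the risk depends on the covariance spectrum at a fixed aspect ratio $\gamma = p/n$.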

#### Article information

Source
Ann. Statist., Volume 46, Number 1 (2018), 247-279.

Dates
Revised: November 2016
First available in Project Euclid: 22 February 2018

https://projecteuclid.org/euclid.aos/1519268430

Digital Object Identifier
doi:10.1214/17-AOS1549

Mathematical Reviews number (MathSciNet)
MR3766952

Zentralblatt MATH identifier
06865111

#### Citation

Dobriban, Edgar; Wager, Stefan. High-dimensional asymptotics of prediction: Ridge regression and classification. Ann. Statist. 46 (2018), no. 1, 247–279. doi:10.1214/17-AOS1549. https://projecteuclid.org/euclid.aos/1519268430

#### Supplemental materials

• Supplement to “High-dimensional asymptotics of prediction: Ridge regression and classification”. In the supplementary material, we give efficient methods to compute the risk formulas, and prove the remaining lemmas and other results.