Bernoulli

Persistence in high-dimensional linear predictor selection and the virtue of overparametrization

Eitan Greenshtein and Ya'Acov Ritov

Full-text: Open access

Abstract

Let $Z^i=(Y^i,X_1^i,\dots,X_m^i)$, $i=1,\dots,n$, be independent and identically distributed random vectors, $Z^i \sim F, \;F \in {\cal F}$. It is desired to predict $Y$ by $\sum \beta_j X_j$, where $(\beta_1,\dots,\beta_m) \in B^n \subseteq \mathbb{R}^m$, under a prediction loss. Suppose that $m=n^\alpha$, $\alpha>1$, that is, there are many more explanatory variables than observations. We consider sets $B^n$ restricted by the maximal number of non-zero coefficients of their members, or by their $l_1$ radius. We study the following asymptotic question: how 'large' may the set $B^n$ be, so that it is still possible to select empirically a predictor whose risk under $F$ is close to that of the best predictor in the set? Sharp bounds for orders of magnitude are given under various assumptions on ${\cal F}$. The algorithmic complexity of the ensuing procedures is also studied. The main message of this paper, and the implication of the orders derived, is that under various sparsity assumptions on the optimal predictor there is 'asymptotically no harm' in introducing many more explanatory variables than observations. Furthermore, such practice can be beneficial in comparison with a procedure that screens in advance a small subset of explanatory variables. Another main result is that 'lasso' procedures, that is, optimization under $l_1$ constraints, could be efficient in finding optimal sparse predictors in high dimensions.
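
To make the setting concrete, the following Python sketch simulates the regime $m=n^\alpha$ with a sparse optimal predictor and fits an $l_1$-penalized least-squares ('lasso') predictor. This is not the paper's procedure: the constants ($n$, $\alpha$, the sparsity $k$, the penalty level) and the use of scikit-learn's penalized Lasso in place of an $l_1$-ball-constrained optimizer are illustrative assumptions. The quantity compared is the prediction risk $E(Y-\sum \hat\beta_j X_j)^2$ of the empirically selected predictor against that of the best predictor in the class.

```python
# Illustrative simulation (not the paper's procedure): m = n^alpha explanatory
# variables, far more than the n observations, with a k-sparse optimal predictor.
# The chosen constants (n, alpha, k, penalty level) are assumptions for this sketch.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, alpha, k = 200, 1.5, 5
m = int(n ** alpha)                       # many more variables than observations

beta_star = np.zeros(m)
beta_star[:k] = rng.normal(size=k)        # only k non-zero coefficients

X = rng.standard_normal((n, m))
y = X @ beta_star + rng.standard_normal(n)

# l1-penalized least squares as a stand-in for optimization over an l1 ball
model = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)

# Compare prediction risks E(Y - sum beta_j X_j)^2 on fresh draws from F
X_new = rng.standard_normal((2000, m))
y_new = X_new @ beta_star + rng.standard_normal(2000)
risk_selected = np.mean((y_new - model.predict(X_new)) ** 2)
risk_best = np.mean((y_new - X_new @ beta_star) ** 2)
print(f"selected predictor risk: {risk_selected:.3f}")
print(f"best-in-class risk:      {risk_best:.3f}")
```

In this toy setting the lasso fit typically attains a risk close to the best-in-class risk despite $m \gg n$, illustrating the abstract's message that overparametrization need not hurt when the optimal predictor is sparse.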

Article information

Source
Bernoulli Volume 10, Number 6 (2004), 971-988.

Dates
First available: 21 January 2005

Permanent link to this document
http://projecteuclid.org/euclid.bj/1106314846

Mathematical Reviews number (MathSciNet)
MR2108039

Digital Object Identifier
doi:10.3150/bj/1106314846

Citation

Greenshtein, Eitan; Ritov, Ya'Acov. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 (2004), no. 6, 971--988. doi:10.3150/bj/1106314846. http://projecteuclid.org/euclid.bj/1106314846.



References

  • [1] Bickel, P. and Levina, E. (2004) Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives where there are more variables than observations. Bernoulli, 10, 989-1010.
  • [2] Billingsley, P. (1995) Probability and Measure. (3rd edition). New York: Wiley.
  • [3] Breiman, L. (2001) Statistical modeling: the two cultures. Statist. Sci., 16, 199-231.
  • [4] Breiman, L. and Freedman, D. (1983) How many variables should be entered in a regression equation? J. Amer. Statist. Assoc., 78, 131-136.
  • [5] Chen, S., Donoho, D. and Saunders, M. (2001) Atomic decomposition by basis pursuit. SIAM Rev., 43, 129-159.
  • [6] Donoho, D.L. and Johnstone, I.M. (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425-455.
  • [7] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least angle regression. Ann. Statist., 32, 407-499.
  • [8] Emery, M., Nemirovski, A. and Voiculescu, D. (2000) Lectures on Probability Theory and Statistics. École d'Été de Probabilités de Saint-Flour XXVIII - 1998 (P. Bernard, ed.), Lecture Notes in Math. 1738. Berlin: Springer-Verlag.
  • [9] Foster, D.P. and George, E.I. (1994) The risk inflation criterion for multiple regression. Ann. Statist., 22, 1947-1975.
  • [10] Huber, P. (1973) Robust regression: asymptotics, conjectures, and Monte Carlo. Ann. Statist., 1, 799-821.
  • [11] Juditsky, A. and Nemirovski, A. (2000) Functional aggregation for nonparametric regression. Ann. Statist., 28, 681-712.
  • [12] Le Cam, L. and Yang, G.L. (1990) Asymptotics in Statistics. New York: Springer-Verlag.
  • [13] Lee, W.S., Bartlett, P.L. and Williamson, R.C. (1996) Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inform. Theory, 42, 2118-2132.
  • [14] Nemirovski, A. and Yudin, D. (1983) Problem Complexity and Method Efficiency in Optimization. Chichester: Wiley.
  • [15] Portnoy, S. (1984) Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. Ann. Statist., 12, 1298-1309.
  • [16] Silverstein, J.W. (1985) The smallest eigenvalue of a large dimensional Wishart matrix. Ann. Probab., 13, 1364-1368.
  • [17] Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58, 267-288.