Bernoulli


Evaluation and selection of models for out-of-sample prediction when the sample size is small relative to the complexity of the data-generating process

Hannes Leeb

Full-text: Open access

Abstract

In regression with random design, we study the problem of selecting a model that performs well for out-of-sample prediction. We do not assume that any of the candidate models under consideration are correct. Our analysis is based on explicit finite-sample results. Our main findings differ from those of other analyses that are based on traditional large-sample limit approximations because we consider a situation where the sample size is small relative to the complexity of the data-generating process, in the sense that the number of parameters in a ‘good’ model is of the same order as sample size. Also, we allow for the case where the number of candidate models is (much) larger than sample size.
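The setting above — comparing many candidate regression models by an estimate of their out-of-sample prediction error — can be illustrated with generalized cross-validation (GCV), one of the criteria named in the keywords. The sketch below is a minimal hypothetical illustration, not the paper's procedure: the data-generating process, the restriction to nested candidate models, and all variable names are assumptions made here for concreteness.

```python
# Hypothetical sketch: selecting among nested linear models in a
# random-design regression by minimizing the GCV score
#   GCV(M) = RSS_M / (n * (1 - p_M/n)^2),
# where p_M is the number of regressors in candidate model M.
import numpy as np

rng = np.random.default_rng(0)
n, p_max = 100, 40                       # sample size comparable to model complexity
X = rng.normal(size=(n, p_max))          # random design
beta = 1.0 / (1.0 + np.arange(p_max))    # decaying coefficients (illustrative choice)
y = X @ beta + rng.normal(size=n)

def gcv(X_m, y):
    """GCV score for an OLS fit of y on the columns of X_m."""
    n_obs, p = X_m.shape
    beta_hat, *_ = np.linalg.lstsq(X_m, y, rcond=None)
    rss = np.sum((y - X_m @ beta_hat) ** 2)
    return rss / (n_obs * (1.0 - p / n_obs) ** 2)

# Score each nested candidate model (first p regressors) and pick the minimizer.
scores = {p: gcv(X[:, :p], y) for p in range(1, p_max + 1)}
p_star = min(scores, key=scores.get)
print("selected model size:", p_star)
```

Note that even the largest candidate here has p/n = 0.4, so the correction factor (1 − p/n)² matters; in the paper's regime this ratio is not negligible, which is what separates the finite-sample analysis from traditional large-sample approximations.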

Article information

Source
Bernoulli Volume 14, Number 3 (2008), 661-690.

Dates
First available in Project Euclid: 25 August 2008

Permanent link to this document
https://projecteuclid.org/euclid.bj/1219669625

Digital Object Identifier
doi:10.3150/08-BEJ127

Mathematical Reviews number (MathSciNet)
MR2537807

Zentralblatt MATH identifier
1155.62029

Keywords
generalized cross-validation; large number of parameters and small sample size; model selection; nonparametric regression; out-of-sample prediction; S_p criterion

Citation

Leeb, Hannes. Evaluation and selection of models for out-of-sample prediction when the sample size is small relative to the complexity of the data-generating process. Bernoulli 14 (2008), no. 3, 661–690. doi:10.3150/08-BEJ127. https://projecteuclid.org/euclid.bj/1219669625

