The Annals of Statistics

Conditional predictive inference post model selection

Hannes Leeb

Abstract

We give a finite-sample analysis of predictive inference procedures after model selection in regression with random design. The analysis focuses on a statistically challenging scenario where the number of potentially important explanatory variables can be infinite, where no regularity conditions are imposed on unknown parameters, where the number of explanatory variables in a “good” model can be of the same order as the sample size, and where the number of candidate models can be of larger order than the sample size. The performance of inference procedures is evaluated conditional on the training sample. Under weak conditions on only the number of candidate models and on their complexity, and uniformly over all data-generating processes under consideration, we show that a certain prediction interval is approximately valid and short with high probability in finite samples: its actual coverage probability is close to the nominal one, and its length is close to the length of an infeasible interval that is constructed by actually knowing the “best” candidate model. Similar results are shown to hold for predictive inference procedures other than prediction intervals, for example, tests of whether a future response will lie above or below a given threshold.
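
To make the notion of coverage "conditional on the training sample" concrete, the following is a minimal simulation sketch. It is not the procedure analyzed in the paper: it swaps in a plain sample-splitting scheme, in which nested linear candidate models are fitted by least squares on one half of the training data, the model with the smallest held-out mean squared error is selected, and that error estimate sets the width of a Gaussian-type prediction interval. The sample sizes, the coefficient vector `beta`, the noise level, and the candidate set are all hypothetical choices made for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Hypothetical setup (not from the paper): Gaussian random design,
# slowly decaying coefficients, unit noise variance.
n_fit, n_val, d = 200, 200, 30
beta = 1.0 / np.arange(1, d + 1)
sigma = 1.0
nominal = 0.95


def draw(n):
    """Draw n i.i.d. (x, y) pairs from the assumed data-generating process."""
    X = rng.standard_normal((n, d))
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y


# Training sample, split into a fitting half and a validation half.
X_fit, y_fit = draw(n_fit)
X_val, y_val = draw(n_val)

# Candidate models: use the first p regressors, p = 1, ..., 20.
best_p, best_mse, best_coef = None, np.inf, None
for p in range(1, 21):
    coef, *_ = np.linalg.lstsq(X_fit[:, :p], y_fit, rcond=None)
    mse = np.mean((y_val - X_val[:, :p] @ coef) ** 2)
    if mse < best_mse:
        best_p, best_mse, best_coef = p, mse, coef

# Symmetric interval around the selected model's prediction, with the
# held-out MSE used as a stand-in for the prediction error variance.
z = NormalDist().inv_cdf(0.5 + nominal / 2)
half_width = z * np.sqrt(best_mse)

# Conditional coverage: the training sample (and hence the selected
# model and the interval width) is held fixed; only new pairs are drawn.
X_new, y_new = draw(20000)
pred = X_new[:, :best_p] @ best_coef
coverage = np.mean(np.abs(y_new - pred) <= half_width)

print(f"selected p = {best_p}, interval length = {2 * half_width:.3f}")
print(f"estimated conditional coverage = {coverage:.3f} (nominal {nominal})")
```

Because the held-out error is minimized over the candidates, it tends to be somewhat optimistic for the selected model, so the estimated conditional coverage in this sketch can fall slightly below the nominal level. Quantifying how close actual and nominal coverage are, uniformly over data-generating processes, is the kind of question the abstract addresses.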

Article information

Source
Ann. Statist. Volume 37, Number 5B (2009), 2838-2876.

Dates
First available in Project Euclid: 17 July 2009

Permanent link to this document
https://projecteuclid.org/euclid.aos/1247836671

Digital Object Identifier
doi:10.1214/08-AOS660

Mathematical Reviews number (MathSciNet)
MR2541449

Zentralblatt MATH identifier
1173.62026

Subjects
Primary: 62G15: Tolerance and confidence regions
Secondary: 62H12: Estimation
62J05: Linear regression
62J07: Ridge regression; shrinkage estimators

Keywords
Predictive inference post model selection; regression with random design; conditional coverage probability; finite sample analysis; approximately honest and short prediction interval

Citation

Leeb, Hannes. Conditional predictive inference post model selection. Ann. Statist. 37 (2009), no. 5B, 2838–2876. doi:10.1214/08-AOS660. https://projecteuclid.org/euclid.aos/1247836671.

