The Annals of Applied Statistics

A study of pre-validation

Holger Höfling and Robert Tibshirani



Given a predictor of outcome derived from a high-dimensional dataset, pre-validation is a useful technique for comparing it to competing predictors on the same dataset. For microarray data, it allows one to compare a newly derived predictor for disease outcome to standard clinical predictors on the same dataset. We study pre-validation analytically to determine if the inferences drawn from it are valid. We show that while pre-validation generally works well, the straightforward “one degree of freedom” analytical test from pre-validation can be biased and we propose a permutation test to remedy this problem. In simulation studies, we show that the permutation test has the nominal level and achieves roughly the same power as the analytical test.
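The procedure summarized above can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the toy learner `fit_top_gene` (pick the single most correlated gene and fit a line), the fold assignment by index, the use of a plain correlation between the pre-validated predictor and the outcome as the test statistic (the paper compares the predictor against clinical covariates in a regression model), and the permutation scheme (shuffling the rows of the expression matrix against a fixed outcome). The essential point it shows is that each sample's predicted value comes from a model fitted without that sample, and that the reference distribution is obtained by repeating the whole pre-validation on permuted data.

```python
import random


def mean(v):
    return sum(v) / len(v)


def corr(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5 if saa > 0 and sbb > 0 else 0.0


def fit_top_gene(X, y):
    """Hypothetical toy learner: choose the gene most correlated with y
    on the training rows and fit a one-variable least-squares line."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    j = max(range(p), key=lambda c: abs(corr(cols[c], y)))
    xj = cols[j]
    mx, my = mean(xj), mean(y)
    sxx = sum((x - mx) ** 2 for x in xj)
    slope = sum((x - mx) * (v - my) for x, v in zip(xj, y)) / sxx if sxx > 0 else 0.0
    return lambda row: my + slope * (row[j] - mx)


def prevalidate(X, y, k=5):
    """Return the pre-validated predictor: the value for sample i comes
    from a model trained only on the folds that do not contain i."""
    n = len(y)
    pv = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        model = fit_top_gene([X[i] for i in train], [y[i] for i in train])
        for i in test:
            pv[i] = model(X[i])
    return pv


def permutation_pvalue(X, y, k=5, n_perm=200, seed=0):
    """One-sided permutation p-value for the association between the
    pre-validated predictor and y (assumed scheme: shuffle rows of X)."""
    rng = random.Random(seed)
    observed = corr(prevalidate(X, y, k), y)
    hits = 0
    for _ in range(n_perm):
        Xp = X[:]          # shallow copy; rows are reordered, not modified
        rng.shuffle(Xp)    # breaks the link between expression rows and y
        if corr(prevalidate(Xp, y, k), y) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Calling `permutation_pvalue(X, y)` with `X` as a list of per-sample expression rows and `y` as a list of numeric outcomes returns a p-value in (0, 1]. Because folds are assigned by index, the samples are assumed to be in random order.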

Article information

Ann. Appl. Stat., Volume 2, Number 2 (2008), 643-664.

First available in Project Euclid: 3 July 2008


Keywords: Cross-validation; hypothesis testing; point estimation; inference; microarray


Höfling, Holger; Tibshirani, Robert. A study of pre-validation. Ann. Appl. Stat. 2 (2008), no. 2, 643–664. doi:10.1214/07-AOAS152.



