G. J. McLachlan, J. Chevelu, J. Zhu
There is increasing interest in the use of diagnostic rules based on microarray data. These rules are formed by considering the expression levels of thousands of genes in tissue samples taken from patients of known classification with respect to a number of classes, representing, say, disease status or treatment strategy. As the final versions of these rules are usually based on a small subset of the available genes, a selection bias arises that must be corrected for when estimating the associated error rates. We address this problem using cross-validation. In particular, we present explicit formulae that are useful in explaining the layers of validation that have to be performed in order to avoid improperly cross-validated estimates.
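The selection bias described above can be illustrated with a minimal sketch. It is not the formulae or the rule used in the paper: the nearest-centroid classifier, the ranking of genes by difference in class means, the fold assignment, and all names (`top_genes`, `centroid_predict`, `cv_error`) are assumptions made for illustration only. The data are pure noise, so no rule can truly do better than chance; an estimate well below 50% therefore reflects selection bias alone.

```python
import random


def top_genes(X, y, k):
    """Return indices of the k genes with the largest absolute
    difference in class means (an illustrative ranking criterion)."""
    n0, n1 = y.count(0), y.count(1)
    scores = []
    for g in range(len(X[0])):
        m0 = sum(x[g] for x, c in zip(X, y) if c == 0) / n0
        m1 = sum(x[g] for x, c in zip(X, y) if c == 1) / n1
        scores.append(abs(m0 - m1))
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]


def centroid_predict(X_train, y_train, x, genes):
    """Assign x to the class whose centroid is nearer on the selected genes."""
    dists = []
    for c in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == c]
        dists.append(sum((x[g] - sum(r[g] for r in rows) / len(rows)) ** 2
                         for g in genes))
    return 0 if dists[0] <= dists[1] else 1


def cv_error(X, y, k, n_folds, select_inside):
    """Cross-validated error rate.

    select_inside=False: genes are chosen once from ALL samples, so the
    held-out samples have already influenced the rule (improper CV).
    select_inside=True: genes are re-selected within each training fold,
    so the selection step is itself cross-validated.
    """
    n = len(y)
    fixed = None if select_inside else top_genes(X, y, k)
    errors = 0
    for f in range(n_folds):
        train = [i for i in range(n) if i % n_folds != f]
        test = [i for i in range(n) if i % n_folds == f]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        genes = top_genes(X_tr, y_tr, k) if select_inside else fixed
        errors += sum(centroid_predict(X_tr, y_tr, X[i], genes) != y[i]
                      for i in test)
    return errors / n


# Pure-noise data: 20 samples, 1000 genes, labels unrelated to expression.
rng = random.Random(0)
n, p, k = 20, 1000, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

biased = cv_error(X, y, k, n_folds=5, select_inside=False)
honest = cv_error(X, y, k, n_folds=5, select_inside=True)
print(f"selection outside CV: {biased:.2f}")
print(f"selection inside CV:  {honest:.2f}")
```

In this sketch the only difference between the two estimates is whether gene selection is repeated within each training fold. Selecting once on all samples lets the held-out samples shape the rule, which is the kind of improperly cross-validated estimate the abstract warns against.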