The Annals of Applied Statistics

Sample size determination for training cancer classifiers from microarray and RNA-seq data

Sandra Safo, Xiao Song, and Kevin K. Dobbin



The objective of many high-dimensional microarray and RNA-seq studies is to develop a classifier of cancer patients based on characteristics of their disease. The germinal center B-cell (GCB) classifier study in lymphoma and the National Cancer Institute’s Director’s Challenge lung (DC-lung) study are two examples. In recent years, such classifiers have often been developed using regularized regression, such as the lasso. Two critical questions are whether a better classifier can be developed from a larger training set and, if so, how large the training set should be. This paper examines these two questions using an existing sample size method and a novel sample size method developed here specifically for lasso logistic regression. Both methods are based on pilot data. We reexamine the lymphoma and lung cancer data sets to evaluate the sample sizes, and use resampling to assess the estimation methods. We also study application to an RNA-seq data set. We find that it is feasible to estimate sample size for regularized logistic regression if an adequate pilot data set exists. The GCB and the DC-lung data sets appear adequate, under specific assumptions. Existing human RNA-seq data sets are by and large inadequate, and cannot be used as pilot data. Pilot RNA-seq data can instead be simulated, and the methods in this paper can then be used for sample size estimation. A MATLAB program is made available.
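The idea of using pilot data and resampling to ask whether a larger training set would help can be illustrated with a generic learning-curve sketch. This is not the sample size method of the paper and not the authors' MATLAB program. It is a minimal Python illustration under stated assumptions: the pilot data are simulated by a hypothetical helper `simulate_pilot`, the lasso penalty `lam` is fixed rather than tuned by cross-validation, and the function names `fit_lasso_logistic` and `learning_curve` are introduced here for illustration only.

```python
import numpy as np


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink each entry toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_lasso_logistic(X, y, lam=0.05, n_iter=300):
    """Fit L1-penalized logistic regression by proximal gradient descent."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    # Step size 1/L, where L is an upper bound on the curvature of the mean logistic loss.
    step = 1.0 / (0.25 * np.linalg.norm(X, 2) ** 2 / n + 0.25)
    for _ in range(n_iter):
        z = np.clip(X @ w + b, -30.0, 30.0)    # clip to avoid overflow in exp
        resid = 1.0 / (1.0 + np.exp(-z)) - y   # predicted probability minus label
        w = soft_threshold(w - step * (X.T @ resid) / n, step * lam)
        b -= step * resid.mean()               # the intercept is not penalized
    return w, b


def learning_curve(X, y, sizes, n_rep=10, lam=0.05, seed=0):
    """Mean held-out error rate for each training set size, over random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    curve = {}
    for m in sizes:
        if not 0 < m < n:
            raise ValueError("each training size must leave at least one test sample")
        errs = []
        for _ in range(n_rep):
            perm = rng.permutation(n)
            train, test = perm[:m], perm[m:]
            w, b = fit_lasso_logistic(X[train], y[train], lam=lam)
            pred = (X[test] @ w + b > 0).astype(int)
            errs.append(np.mean(pred != y[test]))
        curve[m] = float(np.mean(errs))
    return curve


def simulate_pilot(n=120, p=50, n_informative=5, effect=1.5, seed=1):
    """Hypothetical pilot data: Gaussian features, a mean shift in a few of them."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.standard_normal((n, p))
    X[:, :n_informative] += effect * (y[:, None] - 0.5)  # class means differ by `effect`
    return X, y


if __name__ == "__main__":
    X, y = simulate_pilot()
    for m, err in learning_curve(X, y, sizes=[20, 60, 100]).items():
        print(f"training size {m:3d}: estimated error {err:.3f}")
```

If the estimated error keeps falling as the training size grows, a larger training set is likely to yield a better classifier; if the curve has flattened, the pilot size may already be close to adequate. In a real analysis the penalty would normally be chosen inside each training split, and the pilot data would need to be adequate in the sense discussed in the abstract.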

Article information

Ann. Appl. Stat. Volume 9, Number 2 (2015), 1053-1075.

Received: March 2014
Revised: January 2015
First available in Project Euclid: 20 July 2015


Keywords: sample size, lasso, classification, regularized logistic regression, conditional score, high-dimensional data, measurement error


Safo, Sandra; Song, Xiao; Dobbin, Kevin K. Sample size determination for training cancer classifiers from microarray and RNA-seq data. Ann. Appl. Stat. 9 (2015), no. 2, 1053–1075. doi:10.1214/15-AOAS825.




Supplemental materials

  • Supplemental tables, figures, algorithms, details and discussion. Supplemental material for paper by Safo, Song and Dobbin.