Annals of Applied Statistics

Feature selection in omics prediction problems using cat scores and false nondiscovery rate control

Miika Ahdesmäki and Korbinian Strimmer


We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James–Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package “sda” available from the R repository CRAN.
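As a rough illustration of the first step, the following minimal Python sketch computes cat scores for the two-class case only. It assumes a cat score is the vector of t-scores decorrelated by the inverse square root of the pooled correlation matrix. The function name `cat_scores` and the parameter `lam` are hypothetical, and this is not the interface of the "sda" package. A fixed, user-supplied shrinkage intensity stands in for the analytically chosen James–Stein regularization; variances are not shrunk and no FNDR thresholding is performed.

```python
import numpy as np

def cat_scores(X, y, lam=0.5):
    """Two-class cat scores (illustrative sketch, not the sda implementation).

    X   : (n, p) data matrix
    y   : length-n labels with exactly two distinct values
    lam : shrinkage intensity in (0, 1]; ASSUMPTION: fixed by the user here,
          whereas the paper chooses its regularization analytically
    Returns (cat, t): decorrelated scores and ordinary t-scores.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")

    X1, X2 = X[y == classes[0]], X[y == classes[1]]
    n1, n2 = len(X1), len(X2)

    # Residuals after removing class means; pooled (unshrunken) variances.
    resid = np.vstack([X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)])
    var = (resid ** 2).sum(axis=0) / (n1 + n2 - 2)

    # Ordinary two-sample t-scores with pooled variance.
    t = (X1.mean(axis=0) - X2.mean(axis=0)) / np.sqrt((1 / n1 + 1 / n2) * var)

    # Pooled empirical correlation matrix (unit diagonal by construction).
    Z = resid / np.sqrt(var)
    R = Z.T @ Z / (n1 + n2 - 2)

    # Shrink off-diagonal correlations toward zero; eigenvalues stay >= lam.
    P = (1 - lam) * R + lam * np.eye(R.shape[0])

    # Inverse matrix square root via the symmetric eigendecomposition.
    w, V = np.linalg.eigh(P)
    P_inv_sqrt = (V / np.sqrt(w)) @ V.T

    return P_inv_sqrt @ t, t
```

With `lam=1.0` the shrunken correlation matrix is the identity, so the cat scores reduce to ordinary t-scores. Features could then be ranked by the absolute value of the returned scores.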

Article information

Ann. Appl. Stat., Volume 4, Number 1 (2010), 503-519.

First available in Project Euclid: 11 May 2010

Keywords: Feature selection; linear discriminant analysis; correlation; James–Stein estimator; “small n, large p” setting; correlation-adjusted t-score; false discovery rates; higher criticism


Ahdesmäki, Miika; Strimmer, Korbinian. Feature selection in omics prediction problems using cat scores and false nondiscovery rate control. Ann. Appl. Stat. 4 (2010), no. 1, 503–519. doi:10.1214/09-AOAS277.
