The Annals of Applied Statistics

Feature selection in omics prediction problems using cat scores and false nondiscovery rate control

Miika Ahdesmäki and Korbinian Strimmer
Source: Ann. Appl. Stat. Volume 4, Number 1 (2010), 503-519.

Abstract

We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James–Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package “sda” available from the R repository CRAN.
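
The cat scores named in the abstract can be illustrated with a small numerical sketch. It is not the "sda" implementation; it rests on these assumptions:

- Each class centroid is compared with the pooled centroid and scaled by (1/n_k - 1/n)^(-1/2), the factor implied by the variance of that difference.
- The resulting t-score vector is decorrelated by multiplying with the inverse square root of the correlation matrix.
- The function name `cat_scores` and the fixed shrinkage intensity `lam` are hypothetical. The abstract says the regularization parameters are chosen analytically, and that step is not reproduced here.
- FNDR thresholding is not covered.

```python
import numpy as np


def cat_scores(X, y, lam=0.1):
    """Sketch of correlation-adjusted t-scores, one row per class.

    X   : (n, p) data matrix
    y   : (n,) class labels
    lam : hypothetical fixed shrinkage intensity in (0, 1] (an assumption;
          the paper chooses regularization analytically instead)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    mu_pool = X.mean(axis=0)

    # Class centroids and within-class residuals.
    centroids = {}
    resid = np.empty_like(X)
    for k in classes:
        idx = y == k
        centroids[k] = X[idx].mean(axis=0)
        resid[idx] = X[idx] - centroids[k]

    # Pooled within-class variances and correlation matrix.
    var = (resid ** 2).sum(axis=0) / (n - K)
    Z = resid / np.sqrt(var)
    R = Z.T @ Z / (n - K)

    # Shrink correlations toward the identity so the matrix is invertible.
    R_shrunk = (1.0 - lam) * R + lam * np.eye(p)

    # Inverse matrix square root via the eigendecomposition.
    w, U = np.linalg.eigh(R_shrunk)
    R_inv_sqrt = (U / np.sqrt(w)) @ U.T

    scores = np.empty((K, p))
    for i, k in enumerate(classes):
        n_k = np.sum(y == k)
        # t-score of the class centroid against the pooled centroid.
        t = (centroids[k] - mu_pool) / np.sqrt((1.0 / n_k - 1.0 / n) * var)
        # Decorrelate the t-scores to obtain cat scores.
        scores[i] = R_inv_sqrt @ t
    return classes, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 5))
    y = np.array([0] * 12 + [1] * 18)
    classes, scores = cat_scores(X, y, lam=0.2)
    print(classes)
    print(scores.shape)
```

With `lam=1.0` the correlation matrix is replaced by the identity, and in the two-class case the scores reduce to ordinary pooled-variance t-statistics. This makes a convenient sanity check for the sketch.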

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1273584465
Digital Object Identifier: doi:10.1214/09-AOAS277
Zentralblatt MATH identifier: 1189.62102
Mathematical Reviews number (MathSciNet): MR2758182


2012 © Institute of Mathematical Statistics
