The Annals of Applied Statistics

Rank discriminants for predicting phenotypes from RNA expression

Bahman Afsari, Ulisses M. Braga-Neto, and Donald Geman

Full-text: Open access


Statistical methods for analyzing large-scale biomolecular data are commonplace in computational biology. A notable example is phenotype prediction from gene expression data, for instance, detecting human cancers, differentiating subtypes and predicting clinical outcomes. Still, clinical applications remain scarce. One reason is that the complexity of the decision rules that emerge from standard statistical learning impedes biological understanding, in particular, any mechanistic interpretation. Here we explore decision rules for binary classification utilizing only the ordering of expression among several genes; the basic building blocks are then two-gene expression comparisons. The simplest example, just one comparison, is the TSP classifier, which has appeared in a variety of cancer-related discovery studies. Decision rules based on multiple comparisons can better accommodate class heterogeneity, and thereby increase accuracy, and might provide a link with biological mechanism. We consider a general framework (“rank-in-context”) for designing discriminant functions, including a data-driven selection of the number and identity of the genes in the support (“context”). We then specialize to two examples: voting among several pairs and comparing the median expression in two groups of genes. Comprehensive experiments assess accuracy relative to other, more complex, methods, and reinforce earlier observations that simple classifiers are competitive.
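The three decision rules named in the abstract (a single two-gene comparison, voting among several pairs, and comparing the median expression of two gene groups) can be illustrated with a minimal sketch. The function names, the 0/1 class labels, the direction of each comparison, and the example profile are all assumptions made for illustration; they are not taken from the paper, and the data-driven selection of genes is not shown.

```python
from statistics import median


def tsp_predict(x, i, j):
    """Single two-gene comparison: class 1 if gene i is below gene j, else class 0.

    The direction of the comparison and the 0/1 labels are illustrative assumptions.
    """
    return 1 if x[i] < x[j] else 0


def vote_predict(x, pairs):
    """Majority vote over several two-gene comparisons.

    An odd number of pairs avoids ties; a tie falls to class 0 here.
    """
    votes = sum(tsp_predict(x, i, j) for i, j in pairs)
    return 1 if 2 * votes > len(pairs) else 0


def median_predict(x, group_a, group_b):
    """Class 1 if the median expression in group_a is below that in group_b."""
    med_a = median(x[g] for g in group_a)
    med_b = median(x[g] for g in group_b)
    return 1 if med_a < med_b else 0


# Hypothetical expression profile for one sample, indexed by gene position.
profile = [2.1, 5.4, 3.3, 0.7, 4.8, 1.9]

# Hypothetical gene pairs and gene groups; in practice these would be
# learned from training data.
pairs = [(0, 1), (3, 2), (4, 5)]
group_a, group_b = [0, 3, 5], [1, 2, 4]

label_tsp = tsp_predict(profile, 0, 1)
label_vote = vote_predict(profile, pairs)
label_median = median_predict(profile, group_a, group_b)
```

Because each rule uses only the ordering of expression values within a sample, applying any strictly increasing transformation to the profile (for example, a logarithm or a rescaling) leaves all three predictions unchanged.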

Article information

Ann. Appl. Stat., Volume 8, Number 3 (2014), 1469-1491.

First available in Project Euclid: 23 October 2014

Keywords: Cancer classification; gene expression; rank discriminant; order statistics


Afsari, Bahman; Braga-Neto, Ulisses M.; Geman, Donald. Rank discriminants for predicting phenotypes from RNA expression. Ann. Appl. Stat. 8 (2014), no. 3, 1469–1491. doi:10.1214/14-AOAS738.



