The Annals of Applied Statistics

Testing significance of features by lassoed principal components

Daniela M. Witten and Robert Tibshirani


Abstract

We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an L1 penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.
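The projection-and-penalty idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the function names (`two_sample_t`, `soft_threshold`, `lpc_like_scores`) and the penalty parameter `lam` are hypothetical, the per-gene score is a Welch t-statistic, and the L1 penalty is applied as a lasso regression of the scores on the eigenvectors. Because those eigenvectors are orthonormal, that lasso reduces to soft-thresholding the projection coefficients.

```python
import numpy as np


def two_sample_t(X, y):
    """Welch two-sample t-statistic per gene.

    X has shape (n_samples, n_genes); y holds 0/1 class labels.
    """
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (b.mean(axis=0) - a.mean(axis=0)) / se


def soft_threshold(z, lam):
    """Shrink each entry of z toward zero by lam (the lasso proximal step)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def lpc_like_scores(X, scores, lam):
    """De-noise per-gene scores using the principal directions of X.

    Assumption: lasso of `scores` on orthonormal eigenvectors, which has the
    closed-form solution soft_threshold(V^T scores, lam).
    """
    Xc = X - X.mean(axis=0)  # center each gene
    # Rows of Vt are orthonormal eigenvectors of the gene-gene covariance matrix.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vt = Vt[s > 1e-10 * s.max()]  # drop directions with no variance
    coef = soft_threshold(Vt @ scores, lam)  # project, then apply the L1 shrinkage
    return Vt.T @ coef  # fitted values serve as the de-noised gene scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 50))        # 20 samples, 50 genes (synthetic)
    y = np.array([0] * 10 + [1] * 10)    # two classes
    t = two_sample_t(X, y)
    denoised = lpc_like_scores(X, t, lam=0.5)
    print(denoised.shape)
```

With `lam = 0` the function returns the plain projection of the scores onto the span of the eigenvectors; larger values of `lam` zero out weakly supported components. How to choose `lam`, and how to turn the resulting scores into false discovery rate estimates, is outside the scope of this sketch.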

Article information

Source
Ann. Appl. Stat., Volume 2, Number 3 (2008), 986-1012.

Dates
First available in Project Euclid: 13 October 2008

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1223908049

Digital Object Identifier
doi:10.1214/08-AOAS182

Mathematical Reviews number (MathSciNet)
MR2516801

Zentralblatt MATH identifier
1149.62092

Keywords
Microarray; gene expression; multiple testing; feature selection

Citation

Witten, Daniela M.; Tibshirani, Robert. Testing significance of features by lassoed principal components. Ann. Appl. Stat. 2 (2008), no. 3, 986--1012. doi:10.1214/08-AOAS182. https://projecteuclid.org/euclid.aoas/1223908049



