The Annals of Applied Statistics

GaGa: A parsimonious and flexible model for differential expression analysis

David Rossell
Source: Ann. Appl. Stat. Volume 3, Number 3 (2009), 1035-1051.

Abstract

Hierarchical models are a powerful tool for high-throughput data with a small to moderate number of replicates, as they allow information to be shared across units, for example, genes. We propose two such models and show their increased sensitivity in microarray differential expression applications. We build on the gamma–gamma hierarchical model introduced by Kendziorski et al. [Statist. Med. 22 (2003) 3899–3914] and Newton et al. [Biostatistics 5 (2004) 155–176], addressing important limitations that may have hampered its performance and its more widespread use. The models parsimoniously describe the expression of thousands of genes with a small number of hyper-parameters, which makes them easy to interpret and analytically tractable. The first model is a simple extension that improves the fit substantially with almost no increase in complexity. The second extension uses a mixture of gamma distributions to further improve the fit, at the expense of an increased computational burden. We derive several approximations that significantly reduce the computational cost. We find that our models outperform the original formulation of the model, as well as some other popular methods for differential expression analysis. The improved performance is especially noticeable for the small sample sizes commonly encountered in high-throughput experiments. Our methods are implemented in the freely available Bioconductor gaga package.
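
To make the idea of a gamma–gamma hierarchy concrete, the following is a minimal simulation sketch, not the paper's method or the gaga package API. The abstract does not state the parameterization, so everything here is an assumption made for illustration: the function name simulate_gene, the hyper-parameter names alpha0, nu, beta and mu, and the choice to give each gene its own gamma-distributed shape and an inverse-gamma-distributed mean. The point is only that a handful of hyper-parameters can generate expression values for any number of genes.

```python
import random


def simulate_gene(n_reps, alpha0, nu, beta, mu, rng):
    """Draw one gene's replicates from an illustrative gamma-gamma hierarchy.

    Assumed (hypothetical) parameterization, chosen for this sketch only:
      shape_i   ~ Gamma(shape=beta, scale=mu / beta)        -> mean mu
      1/mean_i  ~ Gamma(shape=alpha0, scale=1/(alpha0*nu))  -> mean 1/nu
      x_ij      ~ Gamma(shape=shape_i, scale=mean_i/shape_i) -> mean mean_i
    """
    # Gene-specific shape: controls the coefficient of variation, 1/sqrt(shape_i).
    shape_i = rng.gammavariate(beta, mu / beta)
    # Gene-specific mean: its inverse is gamma, i.e. the mean is inverse gamma.
    inv_mean_i = rng.gammavariate(alpha0, 1.0 / (alpha0 * nu))
    mean_i = 1.0 / inv_mean_i
    # Replicate measurements for this gene, all sharing shape_i and mean_i.
    obs = [rng.gammavariate(shape_i, mean_i / shape_i) for _ in range(n_reps)]
    return shape_i, mean_i, obs


if __name__ == "__main__":
    rng = random.Random(2009)
    # Four hyper-parameters describe the whole (simulated) set of genes.
    for gene in range(3):
        shape_i, mean_i, obs = simulate_gene(
            n_reps=4, alpha0=3.0, nu=100.0, beta=2.0, mu=10.0, rng=rng
        )
        print(gene, round(shape_i, 2), round(mean_i, 1), [round(x, 1) for x in obs])
```

In this sketch, fixing shape_i to a single value shared by all genes would correspond to a more restrictive hierarchy; letting it vary by gene is one plausible reading of "a simple extension", stated here as an assumption rather than as the paper's definition.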

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1254773277
Digital Object Identifier: doi:10.1214/09-AOAS244
Zentralblatt MATH identifier: 05758450
Mathematical Reviews number (MathSciNet): MR2750385

References

Armstrong, S. A., Staunton, J. E., Silverman, L. B., Pieters, R., den Boer, M. L., Minden, M. D., Sallan, S. E., Lander, E. S., Golub, T. R. and Korsmeyer, S. J. (2002). MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 30 41–47.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300.
Mathematical Reviews (MathSciNet): MR1325392
Coombes, K. R., Wang, J. and Baggerly, K. A. (2007). A statistical method for finding biomarkers from microarray data, with application to prostate cancer. Technical report, M.D. Anderson Cancer Center. Available at http://www.mdanderson.org/pdf/biostats_utmdabtr00704.pdf.
Coombes, K. R. (2007). ClassComparison: Classes and methods for “class comparison” problems on microarrays. R package version 2.5.0.
Damsleth, E. (1975). Conjugate classes for gamma distributions. Scand. J. Statist. 2 80–84.
Mathematical Reviews (MathSciNet): MR378169
Dennis, G., Sherman, B. T., Hosack, D. A., Yang, J., Baseler, M. W., Lane, H. C. and Lempicki, R. A. (2003). David: Database for annotation, visualization, and integrated discovery. Genome Biology 4 P3.
Dudoit, S., Yang, H. Y., Callow, M. J. and Speed, T. P. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statist. Sinica 12 972–977.
Mathematical Reviews (MathSciNet): MR1894191
Zentralblatt MATH: 1004.62088
Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y. H. and Zhang, J. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 5.
Gottardo, R., Raftery, A. E., Yeung, K. Y. and Bumgarner, R. E. (2006). Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics 62 10–18.
Mathematical Reviews (MathSciNet): MR2226551
Digital Object Identifier: doi:10.1111/j.1541-0420.2005.00397.x
Gottardo, R. (2004). Bridge: Bayesian robust inference for differential gene expression. R package version 1.4.0.
Irizarry, R., Hobbs, B., Collin, B., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U. and Speed, T. P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4 249–264.
Kendziorski, C., Newton, M. and Sarkar, D. (2005). EBarrays: Empirical Bayes for microarrays. R package version 1.2.0.
Kendziorski, C. M., Newton, M. A., Lan, H. and Gould, M. N. (2003). On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statist. Med. 22 3899–3914.
Lapointe, J., Li, C., Higgins, J. P., Rijn, M., Bair, E., Montgomery, K., Ferrari, M., Egevad, L., Rayford, W., Bergerheim, U., Ekman, P., DeMarzo, A. M., Tibshirani, R., Botstein, D., Brown, P. O., Brooks, J. D. and Pollack, J. R. (2004). Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc. Natl. Acad. Sci. 101 811–816.
Lo, K. and Gottardo, R. (2007). Flexible empirical Bayes models for differential gene expression. Bioinformatics 23 328–335.
Lönnstedt, I. and Speed, T. (2002). Replicated microarray data. Statist. Sinica 12 31–46.
Mathematical Reviews (MathSciNet): MR1894187
Zentralblatt MATH: 1004.62086
MAQC Consortium (2006). The microarray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology 24 1151–1161.
Miller, R. B. (1980). Bayesian analysis of the two-parameter gamma distribution. Technometrics 22 65–69.
Müller, P., Parmigiani, G. and Rice, K. (2007). FDR and Bayesian Multiple Comparisons Rules. Oxford Univ. Press.
Müller, P., Parmigiani, G., Robert, C. and Rousseau, J. (2004). Optimal sample size for multiple testing: The case of gene expression microarrays. J. Amer. Statist. Assoc. 99 990–1001.
Newton, M. A. and Kendziorski, C. M. (2003). Parametric Empirical Bayes Methods for Microarrays. Springer, New York.
Mathematical Reviews (MathSciNet): MR2001399
Digital Object Identifier: doi:10.1007/0-387-21679-0_11
Newton, M. A., Kendziorski, C. M., Richmond, C. S., Blattner, F. R. and Tsui, K. W. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Comput. Biol. 8 37–52.
Newton, M. A., Noueiry, A., Sarkar, D. and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture model. Biostatistics 5 155–176.
Pounds, S. and Morris, S. W. (2003). Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 19 1236–1242.
Rossell, D. (2009). Supplement to “GaGa: A simple and flexible hierarchical model for differential expression analysis.” DOI: 10.1214/09-AOAS244SUPP.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
Mathematical Reviews (MathSciNet): MR468014
Zentralblatt MATH: 0379.62005
Digital Object Identifier: doi:10.1214/aos/1176344136
Project Euclid: euclid.aos/1176344136
Schwender, H. (2007). Siggenes: Multiple testing using SAM and Efron’s empirical Bayes approaches. R package version 1.13.2.
Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statist. Appl. Genet. Mol. Biol. 3 3.
Mathematical Reviews (MathSciNet): MR2101454
Zentralblatt MATH: 1038.62110
Digital Object Identifier: doi:10.2202/1544-6115.1027
Smyth, G. K. (2005). Limma: Linear models for microarray data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (R. Gentleman, V. Carey, S. Dudoit, R. Irizarry and W. Huber, eds.) 397–420. Springer, New York.
Mathematical Reviews (MathSciNet): MR2201836
Stafford, P. (2008). Methods in Microarray Normalization. CRC Press, USA.
Tusher, V. G., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. 98 5116–5121.
Wu, J. and Irizarry, R., with contributions from MacDonald, J. and Gentry, J. (2007). gcrma: Background Adjustment Using Sequence Information. R package version 2.8.1.
Wu, Z., Irizarry, R. A., Gentleman, R., Murillo, F. M. and Spencer, F. (2004). A model based background adjustment for oligonucleotide expression arrays. Technical report, Johns Hopkins Univ., Dept. Biostatistics.
Yuan, M. and Kendziorski, C. (2006). A unified approach for simultaneous gene clustering and differential expression identification. Biometrics 62 1089–1098.
Mathematical Reviews (MathSciNet): MR2297680
Digital Object Identifier: doi:10.1111/j.1541-0420.2006.00611.x

2012 © Institute of Mathematical Statistics
