Brazilian Journal of Probability and Statistics

Bayesian factor models for the detection of coherent patterns in gene expression data

Vinicius D. Mayrink and Joseph E. Lucas



A common problem in the analysis of gene expression microarray data is the identification of groups of features that are coherently expressed. For example, one often wishes to know whether a group of genes, clustered because of correlation in one data set, is still highly co-expressed in another data set. Alternatively, some expression array platforms use many relatively short probes for each gene of interest. In this case, a given probe may measure not its targeted gene but a different gene with a similar region, a phenomenon called cross-hybridization. Accurately detecting the probe sets (groups of probes targeting the same gene) that show highly coherent expression patterns is the best approach to identifying which genes are present in the sample. We develop a Bayesian Factor Model (BFM) to address the general problem of detecting coherent patterns in gene expression data sets. We compare our method to “state of the art” methods for the identification of expressed genes in both synthetic and real data sets, and the results indicate that the BFM outperforms the other procedures for detecting transcripts. We also demonstrate the use of factor analysis to identify the presence/absence status of gene modules (groups of coherently expressed genes). Variation in the number of copies of regions of the genome is a well-known and important feature of most cancers. We examine a group of genes representative of Copy Number Alteration (CNA) in breast cancer, then identify the presence/absence of CNA in this region of the genome for other cancers. Coherent patterns can also be evaluated in data from high-throughput sequencing, a novel technology for measuring gene expression. We analyze this type of data via a factor model and examine the detection calls in terms of read mapping uncertainty.
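As a rough illustration of what "coherent expression" means in a factor-model setting, the sketch below scores a group of features by the share of their variance that a single common factor can explain. It is not the authors' BFM: it swaps the Bayesian model for a plain singular value decomposition, and the function name `coherence_score`, the simulated dimensions, and the noise level are all hypothetical choices made for the example.

```python
import numpy as np

def coherence_score(X):
    """Share of total variance captured by one common factor.

    X has shape (n_features, n_samples), e.g. probes by arrays.
    Each row is centered and scaled, then the squared leading
    singular value is divided by the sum of all squared singular values.
    """
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=1, keepdims=True)
    sd = Z.std(axis=1, keepdims=True)
    Z = Z / np.where(sd > 0, sd, 1.0)  # leave constant rows at zero
    s = np.linalg.svd(Z, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

# Simulated data (assumption): 11 probes measured on 60 samples.
rng = np.random.default_rng(0)
n_probes, n_samples = 11, 60

# Coherent probe set: every probe follows one shared factor plus noise.
factor = rng.normal(size=n_samples)
loadings = rng.uniform(0.8, 1.2, size=n_probes)
coherent = np.outer(loadings, factor) + 0.3 * rng.normal(size=(n_probes, n_samples))

# Incoherent probe set: independent noise only.
noise_only = rng.normal(size=(n_probes, n_samples))

score_coherent = coherence_score(coherent)
score_noise = coherence_score(noise_only)
print(f"coherent: {score_coherent:.2f}, noise only: {score_noise:.2f}")
```

A probe set driven by one shared signal should score close to 1, while unrelated probes should score much lower. A Bayesian treatment such as the one described in the abstract would instead place priors on the loadings and report posterior evidence for the factor's presence.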

Article information

Braz. J. Probab. Stat., Volume 29, Number 1 (2015), 1-33.

First available in Project Euclid: 30 October 2014

Keywords: coherent, copy number alteration, detection call, factor model, high-throughput data, microarray


Mayrink, Vinicius D.; Lucas, Joseph E. Bayesian factor models for the detection of coherent patterns in gene expression data. Braz. J. Probab. Stat. 29 (2015), no. 1, 1–33. doi:10.1214/13-BJPS226.
