Annals of Applied Statistics

Sparse latent factor models with interactions: Analysis of gene expression data

Vinicius Diniz Mayrink and Joseph Edward Lucas

Full-text: Open access


Sparse latent multi-factor models have been used in many exploratory and predictive problems with high-dimensional multivariate observations. Because of concerns with identifiability, the latent factors are almost always assumed to be linearly related to measured feature variables. Here we explore the analysis of multi-factor models with different structures of interactions between latent factors, including multiplicative effects as well as a more general framework for nonlinear interactions introduced via a Gaussian process. We utilize sparsity priors to test whether the factors and interaction terms have a significant effect. The performance of the models is evaluated through simulated and real data applications in genomics. Variation in the number of copies of regions of the genome is a well-known and important feature of most cancers. We examine interactions between factors directly associated with different chromosomal regions detected with copy number alteration in breast cancer data. In this context, significant interaction effects for specific genes suggest synergies between duplications and deletions in different regions of the chromosome.
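To make the model structure described in the abstract concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes two latent factors, a single multiplicative interaction term, and a simple spike-and-slab style draw in which each loading or interaction coefficient is exactly zero with a fixed probability. The function name, parameter names, and default values are all hypothetical, and the code only generates data; it performs no posterior inference.

```python
import numpy as np

def simulate_sparse_interaction_model(n_genes=50, n_samples=200,
                                      p_loading=0.3, p_interaction=0.1,
                                      noise_sd=0.5, seed=0):
    """Simulate expression data from an illustrative sparse two-factor
    model with one multiplicative interaction (all settings assumed)."""
    rng = np.random.default_rng(seed)

    # Two latent factors, one score per sample.
    factors = rng.standard_normal((2, n_samples))
    # Multiplicative interaction: elementwise product of the two factors.
    interaction = factors[0] * factors[1]

    # Spike-and-slab style loadings: nonzero with probability p_loading.
    load_mask = rng.random((n_genes, 2)) < p_loading
    loadings = np.where(load_mask, rng.normal(0.0, 1.0, (n_genes, 2)), 0.0)

    # Sparse interaction coefficients, one per gene.
    inter_mask = rng.random(n_genes) < p_interaction
    inter_coef = np.where(inter_mask, rng.normal(0.0, 1.0, n_genes), 0.0)

    # Gene-by-sample matrix: linear part + interaction part + noise.
    noise = noise_sd * rng.standard_normal((n_genes, n_samples))
    X = loadings @ factors + np.outer(inter_coef, interaction) + noise
    return X, factors, loadings, inter_coef
```

In this sketch a gene whose interaction coefficient is exactly zero follows the usual linear factor model, which is the kind of distinction a sparsity prior on the interaction terms is meant to detect.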

Article information

Ann. Appl. Stat., Volume 7, Number 2 (2013), 799-822.

First available in Project Euclid: 27 June 2013


Keywords: factor model; interactions; sparsity prior; microarray; copy number alteration


Mayrink, Vinicius Diniz; Lucas, Joseph Edward. Sparse latent factor models with interactions: Analysis of gene expression data. Ann. Appl. Stat. 7 (2013), no. 2, 799-822. doi:10.1214/12-AOAS607.




Supplemental materials

  • Supplementary material: Sparse latent factor models with interactions: Posterior computation, simulated studies and gene selection procedure. Additional material containing the following: formulations of the complete conditional posterior distributions for parameters in the proposed models, simulated studies to evaluate the performance of the models, and the description of the procedure used to select genes for the real applications.