Bayesian multistudy factor analysis for high-throughput biological data
Roberta De Vito, Ruggero Bellio, Lorenzo Trippa, Giovanni Parmigiani
Ann. Appl. Stat. 15(4): 1723-1741 (December 2021). DOI: 10.1214/21-AOAS1456

Abstract


This paper analyzes breast cancer gene expression across seven studies to identify genuine and thus replicable gene patterns shared among these studies. Our premise is that genuine biological signal is more likely to be reproducibly present in multiple studies than spurious signal. Our analysis uses a new modeling strategy for the joint analysis of high-throughput biological studies which simultaneously identifies shared as well as study-specific signal. To this end, we generalize the multi-study factor analysis model to handle high-dimensional data and generalize the sparse Bayesian infinite factor model to this context. We provide strategies for the identification of the loading matrices, common and study-specific. Through extensive simulation analysis, we characterize the performance of the proposed approach in various scenarios and show that it outperforms standard factor analysis in identifying replicable signal in all scenarios considered. The analysis of breast cancer gene expression studies identifies clear replicable gene patterns. These patterns are related to well-known biological pathways involved in breast cancer, such as the ER, cell cycle, immune system, collagen, and metabolic pathways. Some of these patterns are also associated with existing breast cancer subtypes, such as LumA, Her2, and basal subtypes, while other patterns identify novel pathways active across subtypes and missed by hierarchical clustering approaches. The R package MSFA implementing the method is available on GitHub.
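
As a rough sketch of the kind of model the abstract describes, the multi-study factor analysis model can be written as a decomposition into shared and study-specific factors. The notation below is assumed for illustration and does not appear in the abstract: $x_{is}$ is the centered $P$-dimensional expression vector for subject $i$ in study $s$, $\Phi$ is the common loading matrix with $K$ factors, and $\Lambda_s$ is the study-specific loading matrix with $J_s$ factors.

```latex
\begin{align*}
  x_{is} &= \Phi f_{is} + \Lambda_s l_{is} + e_{is},
    \qquad i = 1, \dots, n_s, \quad s = 1, \dots, S, \\
  f_{is} &\sim N_K(0, I_K), \qquad
  l_{is} \sim N_{J_s}(0, I_{J_s}), \qquad
  e_{is} \sim N_P(0, \Psi_s), \quad \Psi_s \text{ diagonal}, \\
  \Sigma_s &= \operatorname{Cov}(x_{is})
    = \Phi \Phi^{\top} + \Lambda_s \Lambda_s^{\top} + \Psi_s .
\end{align*}
```

Under these assumptions, the term $\Phi \Phi^{\top}$ is the part of the covariance that replicates across all studies, while $\Lambda_s \Lambda_s^{\top}$ captures variation specific to study $s$. The covariance alone determines each loading matrix only up to an orthogonal rotation, which is why strategies for identifying the loading matrices are needed.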

Funding Statement

Research supported by the Italian Ministry for University and Research under the PRIN2015 grant No. 2015EASZFS_003 (RB), the U.S.A.’s National Science Foundation Grant DMS-1810829 (LT and GP), and the U.S.A.’s National Institutes of Health grant NCI-5P30CA006516-54 (GP).


Acknowledgments

We are grateful to Bianca Dumitrascu for help with Figure 3.


Citation

Roberta De Vito, Ruggero Bellio, Lorenzo Trippa, Giovanni Parmigiani. "Bayesian multistudy factor analysis for high-throughput biological data." Ann. Appl. Stat. 15 (4) 1723-1741, December 2021.


Received: 1 June 2020; Revised: 1 January 2021; Published: December 2021
First available in Project Euclid: 21 December 2021


Keywords: Dimension reduction, factor analysis, gene expression, Gibbs sampling, meta-analysis

Rights: Copyright © 2021 Institute of Mathematical Statistics


