Abstract
This paper analyzes breast cancer gene expression across seven studies to identify genuine and thus replicable gene patterns shared among these studies. Our premise is that genuine biological signal is more likely to be reproducibly present in multiple studies than spurious signal. Our analysis uses a new modeling strategy for the joint analysis of high-throughput biological studies which simultaneously identifies shared as well as study-specific signal. To this end, we generalize the multi-study factor analysis model to handle high-dimensional data and generalize the sparse Bayesian infinite factor model to this context. We provide strategies for the identification of the loading matrices, common and study-specific. Through extensive simulation analysis, we characterize the performance of the proposed approach in various scenarios and show that it outperforms standard factor analysis in identifying replicable signal in all scenarios considered. The analysis of breast cancer gene expression studies identifies clear replicable gene patterns. These patterns are related to well-known biological pathways involved in breast cancer, such as the ER, cell cycle, immune system, collagen, and metabolic pathways. Some of these patterns are also associated with existing breast cancer subtypes, such as LumA, Her2, and basal subtypes, while other patterns identify novel pathways active across subtypes and missed by hierarchical clustering approaches. The R package MSFA implementing the method is available on GitHub.
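The split into shared and study-specific signal can be illustrated with a small simulation. The sketch below assumes the usual multi-study factor model form, in which each observation in study s is x = Phi f + Lambda_s l + e, with a loading matrix Phi common to all studies, a study-specific loading matrix Lambda_s, and independent noise with variances psi. The abstract does not state this form, and all function and variable names here are hypothetical. The sketch only generates data and computes the implied covariance; it does not implement the sparse Bayesian infinite factor prior and does not use the MSFA package.

```python
import random


def gaussian_matrix(rows, cols, rng):
    """Return a rows x cols matrix of independent standard normal draws."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def simulate_study(phi, lam, psi, n, rng):
    """Draw n observations x = Phi f + Lambda_s l + e for a single study."""
    p, k, j = len(phi), len(phi[0]), len(lam[0])
    observations = []
    for _ in range(n):
        f = [rng.gauss(0.0, 1.0) for _ in range(k)]  # shared factor scores
        l = [rng.gauss(0.0, 1.0) for _ in range(j)]  # study-specific scores
        x = []
        for g in range(p):
            shared = sum(phi[g][a] * f[a] for a in range(k))
            specific = sum(lam[g][b] * l[b] for b in range(j))
            noise = rng.gauss(0.0, psi[g] ** 0.5)
            x.append(shared + specific + noise)
        observations.append(x)
    return observations


def implied_covariance(phi, lam, psi):
    """Covariance implied by the model: Phi Phi^T + Lambda_s Lambda_s^T + diag(psi)."""
    p = len(phi)
    cov = [[0.0] * p for _ in range(p)]
    for a in range(p):
        for b in range(p):
            shared = sum(u * v for u, v in zip(phi[a], phi[b]))
            specific = sum(u * v for u, v in zip(lam[a], lam[b]))
            cov[a][b] = shared + specific
        cov[a][a] += psi[a]
    return cov


# Toy setup (illustrative sizes only): 5 genes, 2 shared factors, 2 studies.
rng = random.Random(0)
p, k = 5, 2
phi = gaussian_matrix(p, k, rng)  # loadings shared by every study
study_factors = [1, 2]            # number of study-specific factors per study
study_sizes = [30, 40]            # number of samples per study

lambdas, psis, data = [], [], []
for j_s, n_s in zip(study_factors, study_sizes):
    lam = gaussian_matrix(p, j_s, rng)
    psi = [rng.uniform(0.5, 1.5) for _ in range(p)]
    lambdas.append(lam)
    psis.append(psi)
    data.append(simulate_study(phi, lam, psi, n_s, rng))

covariances = [implied_covariance(phi, lam, psi) for lam, psi in zip(lambdas, psis)]
```

In this toy setup every study's covariance contains the same Phi Phi^T term, which plays the role of the replicable signal, while the Lambda_s Lambda_s^T term differs from study to study.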
Funding Statement
This research was supported by the Italian Ministry for University and Research under PRIN2015 grant No. 2015EASZFS_003 (RB), the U.S. National Science Foundation grant DMS-1810829 (LT and GP), and the U.S. National Institutes of Health grant NCI-5P30CA006516-54 (GP).
Acknowledgments
We are grateful to Bianca Dumitrascu for help with Figure 3.
Citation
Roberta De Vito, Ruggero Bellio, Lorenzo Trippa, Giovanni Parmigiani. "Bayesian multistudy factor analysis for high-throughput biological data." Ann. Appl. Stat. 15 (4): 1723-1741, December 2021. https://doi.org/10.1214/21-AOAS1456
Information