The Annals of Applied Statistics
- Ann. Appl. Stat.
- Volume 12, Number 1 (2018), 178-199.
Integrative exploration of large high-dimensional datasets
Large, high-dimensional datasets containing different types of variables are becoming increasingly common. Exploring such data requires integrated methods. For example, a single genomic experiment can contain large quantities of different types of data (including clinical data), which makes it a challenge to coherently describe the patterns of variability within and between the inter-related datasets. Mutual information (MI) is a widely used information-theoretic dependency measure that can also identify nonlinear and nonmonotonic associations. First, we develop a computationally efficient implementation of MI between a discrete and a continuous variable. This implementation allows us to apply a coherent approach to all comparisons arising from continuous and categorical data. As commonly applied, MI can have high levels of bias. We therefore present a novel development of MI that reduces this bias, which we term bias corrected mutual information (BCMI). Further, BCMI is useful as an association measure that can be incorporated in subsequent analyses such as clustering and visualisation procedures.
To demonstrate our approach, a genomic dataset is re-examined. This dataset contains single nucleotide polymorphisms (SNPs, a discrete variable), gene expression levels and clinical data (all continuous variables). Our approach allows us to integrate these different types of data by exploring associations both within and between these types of variables.
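As a rough illustration of the ideas in the abstract, the sketch below estimates MI between a discrete variable (for instance a SNP genotype) and a continuous variable (for instance an expression level), and then subtracts an estimate of the bias. It is not the paper's implementation or its BCMI estimator. It assumes a simple equal-width binning of the continuous variable and a generic permutation-based correction, and the names `binned_mi`, `permutation_corrected_mi`, `n_bins` and `n_perm` are hypothetical, chosen only for this example.

```python
import math
import random
from collections import Counter


def binned_mi(labels, values, n_bins=8):
    """Plug-in MI (in nats) between a discrete and a continuous variable.

    Assumption: the continuous variable is cut into equal-width bins.
    This is a generic estimator, not the one developed in the paper.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    joint = Counter(zip(labels, bins))
    n_label = Counter(labels)
    n_bin = Counter(bins)
    # Sum of p(a, b) * log(p(a, b) / (p(a) * p(b))) over observed cells.
    return sum(
        (c / n) * math.log(c * n / (n_label[a] * n_bin[b]))
        for (a, b), c in joint.items()
    )


def permutation_corrected_mi(labels, values, n_perm=100, n_bins=8, seed=0):
    """Observed MI minus the mean MI under shuffled labels.

    Assumption: shuffling the labels destroys any association, so the mean
    of the shuffled estimates approximates the bias under independence.
    """
    rng = random.Random(seed)
    observed = binned_mi(labels, values, n_bins)
    shuffled = list(labels)
    null_total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_total += binned_mi(shuffled, values, n_bins)
    return observed - null_total / n_perm


# Simulated data: a nonmonotonic dependence that a linear correlation
# between genotype and expression would largely miss.
rng = random.Random(1)
genotype = [rng.choice([0, 1, 2]) for _ in range(300)]
dependent = [(g - 1) ** 2 + rng.gauss(0.0, 0.3) for g in genotype]
independent = [rng.gauss(0.0, 1.0) for _ in genotype]

mi_dependent = permutation_corrected_mi(genotype, dependent)
mi_independent = permutation_corrected_mi(genotype, independent)
print(f"dependent pair:   {mi_dependent:.3f} nats")
print(f"independent pair: {mi_independent:.3f} nats")
```

The uncorrected plug-in estimate is never negative, so it tends to be positive even for independent variables. Subtracting the shuffled-label mean pulls the independent pair toward zero while leaving the dependent pair clearly positive.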
Received: August 2013
Revised: April 2017
First available in Project Euclid: 9 March 2018
Pardy, Christopher; Galbraith, Sally; Wilson, Susan R. Integrative exploration of large high-dimensional datasets. Ann. Appl. Stat. 12 (2018), no. 1, 178-199. doi:10.1214/17-AOAS1055. https://projecteuclid.org/euclid.aoas/1520564469
- Additional material. Supplementary material is available and includes the proof of Proposition 3.1, alternative information measures, simulations and more results for the genomic and clinical data used in the paper [Pardy, Galbraith and Wilson (2018)].