Open Access
March 2018
Integrative exploration of large high-dimensional datasets
Christopher Pardy, Sally Galbraith, Susan R. Wilson
Ann. Appl. Stat. 12(1): 178-199 (March 2018). DOI: 10.1214/17-AOAS1055

Abstract

Large, high-dimensional datasets containing different types of variables are becoming increasingly common. For exploring such data, there is a need for integrated methods. For example, a single genomic experiment can contain large quantities of different types of data (including clinical data), which makes it a challenge to coherently describe the patterns of variability within and between the inter-related datasets. Mutual information (MI) is a widely used information-theoretic dependency measure that can also identify nonlinear and nonmonotonic associations. First, we develop a computationally efficient implementation of MI between a discrete and a continuous variable. This implementation allows us to apply a coherent approach to all comparisons arising from continuous and categorical data. As commonly applied, MI can have high levels of bias. We therefore present a novel development of MI that reduces this bias, which we term bias corrected mutual information (BCMI). Further, BCMI is useful as an association measure that can be incorporated in subsequent analyses such as clustering and visualisation procedures.
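
To make the idea of MI between a discrete and a continuous variable concrete, the following is a minimal sketch of a naive plug-in estimate. It is not the estimator or the bias correction developed in the paper: it assumes the continuous variable is simply cut into equal-width bins, and the function names (`mutual_information`, `pearson`), the `n_bins` parameter, and the toy data are all hypothetical. Estimates of this naive kind are the sort that can be biased, particularly with few observations per bin.

```python
import math
from collections import Counter


def mutual_information(labels, values, n_bins=2):
    """Naive plug-in MI estimate (in nats) between a discrete and a continuous variable.

    Assumption for illustration only: the continuous variable is discretised
    into equal-width bins. This is not the paper's estimator.
    """
    n = len(labels)
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    # Assign each value to a bin; the maximum value goes into the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]

    joint = Counter(zip(labels, bins))
    label_counts = Counter(labels)
    bin_counts = Counter(bins)

    # Sum p(a, b) * log( p(a, b) / (p(a) * p(b)) ) over observed cells.
    mi = 0.0
    for (a, b), c in joint.items():
        mi += (c / n) * math.log(c * n / (label_counts[a] * bin_counts[b]))
    return mi


def pearson(xs, ys):
    """Pearson correlation, used here only as a linear baseline for comparison."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up toy data: a three-level genotype whose middle level has higher expression.
genotype = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
expression = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0, 1.0, 1.1, 0.9, 1.0]

mi = mutual_information(genotype, expression, n_bins=2)
r = pearson(genotype, expression)
```

In this toy example the relationship is nonmonotonic, so the Pearson correlation is essentially zero while the MI estimate is clearly positive (about 0.64 nats), which illustrates why MI is attractive as an association measure for mixed variable types.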

To demonstrate our approach, a genomic dataset is re-examined. This dataset contains single nucleotide polymorphisms (SNPs, a discrete variable), gene expression levels and clinical data (all continuous variables). Our approach allows us to integrate these different types of data by exploring associations both within and between these types of variables.

Citation

Christopher Pardy, Sally Galbraith, Susan R. Wilson. "Integrative exploration of large high-dimensional datasets." Ann. Appl. Stat. 12 (1): 178-199, March 2018. https://doi.org/10.1214/17-AOAS1055

Information

Received: 1 August 2013; Revised: 1 April 2017; Published: March 2018
First available in Project Euclid: 9 March 2018

zbMATH: 06894703
MathSciNet: MR3773390
Digital Object Identifier: 10.1214/17-AOAS1055

Keywords: categorical, continuous, data integration, exploration, mixed-types of variables, mutual information

Rights: Copyright © 2018 Institute of Mathematical Statistics
