The Annals of Applied Statistics

Integrative exploration of large high-dimensional datasets

Christopher Pardy, Sally Galbraith, and Susan R. Wilson


Large, high-dimensional datasets containing different types of variables are becoming increasingly common, and exploring such data calls for integrated methods. For example, a single genomic experiment can contain large quantities of different types of data (including clinical data), which makes it a challenge to coherently describe the patterns of variability within and between the inter-related datasets. Mutual information (MI) is a widely used information-theoretic dependency measure that can also identify nonlinear and nonmonotonic associations. First, we develop a computationally efficient implementation of MI between a discrete and a continuous variable. This implementation allows us to apply a coherent approach to all comparisons arising from continuous and categorical data. As commonly applied, MI can have high levels of bias. We therefore present a novel development of MI that reduces the bias, which we term bias corrected mutual information (BCMI). Further, BCMI is useful as an association measure that can be incorporated in subsequent analyses such as clustering and visualisation procedures.

To demonstrate our approach, a genomic dataset is re-examined. This dataset contains single nucleotide polymorphisms (SNPs, which are discrete variables), gene expression levels and clinical data (all continuous variables). Our approach allows us to integrate these different types of data by exploring associations both within and between these types of variables.
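
As a rough illustration of the discrete-continuous comparison described above, the sketch below computes a plain plug-in estimate of MI between a SNP-like discrete variable and an expression-like continuous variable. It is not the authors' implementation and does not apply their bias correction: the function name `mi_discrete_continuous`, the Gaussian kernel density estimates, the default bandwidth and the simulated data are all assumptions made for this example. It relies on the standard identity I(X;Y) = Σ_x p(x) E[log f(y|x)/f(y) | X = x].

```python
import numpy as np
from scipy.stats import gaussian_kde


def mi_discrete_continuous(x, y):
    """Plug-in estimate of I(X;Y) in nats for discrete x and continuous y.

    Hypothetical helper for illustration only. Densities are replaced by
    Gaussian kernel density estimates with SciPy's default bandwidth, and
    the expectation is replaced by an average over the observed points.
    """
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    f_y = gaussian_kde(y)  # estimate of the marginal density f(y)
    mi = 0.0
    for level in np.unique(x):
        y_k = y[x == level]
        if y_k.size < 2 or np.ptp(y_k) == 0.0:
            raise ValueError("each level of x needs at least two distinct y values")
        f_y_given_x = gaussian_kde(y_k)  # estimate of f(y | x = level)
        weight = y_k.size / y.size       # estimate of p(x = level)
        # Average log density ratio over the observations in this group.
        mi += weight * np.mean(np.log(f_y_given_x(y_k) / f_y(y_k)))
    return mi


# Simulated data (assumed, not from the paper): genotypes coded 0, 1, 2.
rng = np.random.default_rng(0)
n = 600
snp = rng.integers(0, 3, size=n)

# Nonmonotonic association: only heterozygotes (genotype 1) are shifted.
expr_dependent = np.where(snp == 1, 2.0, 0.0) + rng.normal(size=n)
# No association: expression is drawn independently of genotype.
expr_independent = rng.normal(size=n)

mi_dep = mi_discrete_continuous(snp, expr_dependent)
mi_ind = mi_discrete_continuous(snp, expr_independent)
print(f"dependent pair:   {mi_dep:.3f} nats")
print(f"independent pair: {mi_ind:.3f} nats")
```

The dependent pair has no monotone trend across genotypes 0, 1 and 2, so a rank correlation would be expected to be weak here, while the MI estimate should still separate it from the independent pair. Because each density is evaluated at the same points used to fit it, the estimate for the independent pair will generally not be exactly zero, which is one simple way to see the kind of bias the abstract refers to.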

Article information

Ann. Appl. Stat. Volume 12, Number 1 (2018), 178-199.

Received: August 2013
Revised: April 2017
First available in Project Euclid: 9 March 2018

Keywords: Mutual information; data integration; exploration; mixed-types of variables; continuous; categorical


Pardy, Christopher; Galbraith, Sally; Wilson, Susan R. Integrative exploration of large high-dimensional datasets. Ann. Appl. Stat. 12 (2018), no. 1, 178–199. doi:10.1214/17-AOAS1055.


  • Chu, J., Weiss, S. T., Carey, V. J. and Raby, B. A. (2009). A graphical model approach for inferring large-scale networks integrating gene expression and genetic polymorphism. BMC Syst. Biol. 3 55.
  • Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley-Interscience, Hoboken, NJ.
  • Dawy, Z., Goebel, B., Hagenauer, J., Andreoli, C., Meitinger, T. and Mueller, J. C. (2006). Gene mapping and marker clustering using Shannon’s mutual information. IEEE/ACM Trans. Comput. Biol. Bioinform. 3 47–56.
  • Dutkowski, J., Kramer, M., Surma, M. A., Balakrishnan, R., Cherry, J. M., Krogan, N. J. and Ideker, T. (2013). A gene ontology inferred from molecular networks. Nat. Biotechnol. 31 38–45.
  • Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics 38. SIAM, Philadelphia, PA.
  • Efron, B. (2010). Large-Scale Inference. Empirical Bayes Methods For Estimation, Testing, and Prediction. Institute of Mathematical Statistics (IMS) Monographs 1. Cambridge Univ. Press, Cambridge.
  • Efron, B. and Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. Amer. Statist. 37 36–48.
  • Fuller, T. F., Ghazalpour, A., Aten, J. E., Drake, T. A., Lusis, A. J. and Horvath, S. (2007). Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm. Genome 18 463–472.
  • Ghazalpour, A., Doss, S., Zhang, B., Wang, S., Plaisier, C., Castellanos, R., Brozell, A., Schadt, E. E., Drake, T. A., Lusis, A. J. et al. (2006). Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genet. 2 e130.
  • Hall, P. and Miller, H. (2009). Using generalized correlation to effect variable selection in very high dimensional problems. J. Comput. Graph. Statist. 18 533–550.
  • Hall, P. and Miller, H. (2011). Determining and depicting relationships among components in high-dimensional variable selection. J. Comput. Graph. Statist. 20 988–1006.
  • Hintze, J. L. and Nelson, R. D. (1998). Violin plots: A box plot-density trace synergism. Amer. Statist. 52 181–184.
  • Kraskov, A., Stögbauer, H. and Grassberger, P. (2004). Estimating mutual information. Phys. Rev. E (3) 69 066138.
  • Laird, N. M. and Lange, C. (2011). The Fundamentals of Modern Statistical Genetics. Springer, New York.
  • Marron, J. S. and Padgett, W. J. (1987). Asymptotically optimal bandwidth selection for kernel density estimators from randomly right-censored samples. Ann. Statist. 15 1520–1535.
  • Padgett, W. J. and McNichols, D. T. (1984). Nonparametric density estimation from censored data. Comm. Statist. Theory Methods 13 1581–1611.
  • Pardy, C., Galbraith, S. and Wilson, S. R. (2018). Supplement to “Integrative exploration of large high-dimensional datasets.” DOI:10.1214/17-AOAS1055SUPP.
  • Principe, J. C. (2010). Information Theoretic Learning. Renyi’s Entropy and Kernel Perspectives. Information Science and Statistics. Springer, New York.
  • Qiu, P., Gentles, A. J. and Plevritis, S. K. (2009). Fast calculation of pairwise mutual information for gene regulatory network reconstruction. Comput. Methods Programs Biomed. 94 177–180.
  • Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika 43 353–360.
  • Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M. and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science 334 1518–1524.
  • Schellhase, C. and Kauermann, G. (2012). Density estimation and comparison with a penalized mixture approach. Comput. Statist. 27 757–777.
  • Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D., Amin, N., Schwikowski, B. and Ideker, T. (2003). Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 13 2498–2504.
  • Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. R. Stat. Soc. Ser. B. Stat. Methodol. 53 683–690.
  • Steuer, R., Kurths, J., Daub, C. O., Weise, J. and Selbig, J. (2002). The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics 18(Suppl. 2) S231–S240.
  • Székely, G. J. and Rizzo, M. L. (2009). Brownian distance covariance. Ann. Appl. Stat. 3 1236–1265.
  • Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Monographs on Statistics and Applied Probability 60. Chapman and Hall, London.
  • Wang, S., Yehya, N., Schadt, E. E., Wang, H., Drake, T. A. and Lusis, A. J. (2006). Genetic and genomic analysis of a fat mass trait with complex inheritance reveals marked sex specificity. PLoS Genet. 2 e15. DOI:10.1371/journal.pgen.0020015.

Supplemental materials

  • Additional material. Supplementary material is available and includes the proof of Proposition 3.1, alternative information measures, simulations and more results for the genomic and clinical data used in the paper [Pardy, Galbraith and Wilson (2018)].