The Annals of Applied Statistics

A general framework for association analysis of heterogeneous data

Gen Li and Irina Gaynanova



Multivariate association analysis is of primary interest in many applications. Despite the prevalence of high-dimensional and non-Gaussian data (such as count-valued or binary), most existing methods only apply to low-dimensional data with continuous measurements. Motivated by the Computer Audition Lab 500-song (CAL500) music annotation study, we develop a new framework for the association analysis of two sets of high-dimensional and heterogeneous (continuous/binary/count) data. We model heterogeneous random variables using exponential family distributions, and exploit a structured decomposition of the underlying natural parameter matrices to identify shared and individual patterns for two data sets. We also introduce a new measure of the strength of association, and a permutation-based procedure to test its significance. An alternating iteratively reweighted least squares algorithm is devised for model fitting, and several variants are developed to expedite computation and achieve variable selection. The application to the CAL500 data sheds light on the relationship between acoustic features and semantic annotations, and provides effective means for automatic music annotation and retrieval.
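The permutation-based significance test mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define the association measure, so the statistic below (the Frobenius norm of the sample cross-covariance) is only a stand-in, and the names `cross_cov_strength` and `permutation_pvalue` and the toy data are hypothetical. The sketch shows only the general idea: shuffling the rows of one data set breaks the sample pairing and yields a null distribution for any chosen statistic.

```python
import math
import random


def center_columns(M):
    """Subtract each column's mean from a list-of-rows matrix."""
    n = len(M)
    means = [sum(row[j] for row in M) / n for j in range(len(M[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in M]


def cross_cov_strength(X, Y):
    """Stand-in statistic (an assumption, NOT the paper's association measure):
    Frobenius norm of the sample cross-covariance between X and Y."""
    Xc, Yc = center_columns(X), center_columns(Y)
    n = len(X)
    total = 0.0
    for j in range(len(Xc[0])):
        for k in range(len(Yc[0])):
            c = sum(Xc[i][j] * Yc[i][k] for i in range(n)) / (n - 1)
            total += c * c
    return math.sqrt(total)


def permutation_pvalue(X, Y, statistic=cross_cov_strength, n_perm=200, seed=0):
    """Shuffle the rows of Y to break the pairing with X, recompute the
    statistic each time, and return (observed value, one-sided p-value)."""
    rng = random.Random(seed)
    observed = statistic(X, Y)
    rows = list(Y)  # shallow copy so the caller's Y keeps its row order
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(rows)
        if statistic(X, rows) >= observed:
            exceed += 1
    # The add-one correction keeps the p-value strictly positive.
    return observed, (1 + exceed) / (n_perm + 1)


# Hypothetical toy data: continuous X and binary Y driven by one shared factor u.
rng = random.Random(1)
n = 100
u = [rng.gauss(0, 1) for _ in range(n)]
X = [[u[i] + rng.gauss(0, 1) for _ in range(5)] for i in range(n)]
Y = [[1.0 if rng.random() < 1 / (1 + math.exp(-2 * u[i])) else 0.0
      for _ in range(4)] for i in range(n)]

obs, p = permutation_pvalue(X, Y)
print(f"statistic = {obs:.3f}, permutation p-value = {p:.4f}")
```

Any other association statistic can be passed through the `statistic` argument; the permutation logic is unchanged.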

Article information

Ann. Appl. Stat., Volume 12, Number 3 (2018), 1700-1726.

Received: February 2017
Revised: November 2017
First available in Project Euclid: 11 September 2018


Keywords

Exponential family; inter-battery factor analysis; joint and individual structure; matrix decomposition; generalized linear model; association coefficient


Li, Gen; Gaynanova, Irina. A general framework for association analysis of heterogeneous data. Ann. Appl. Stat. 12 (2018), no. 3, 1700–1726. doi:10.1214/17-AOAS1127.




Supplemental materials

  • Supplementary Material for A General Framework for Association Analysis of Heterogeneous Data. We provide proofs, technical details of the algorithm, a detailed description of the rank estimation procedure, and additional simulation results in the supplementary material.