The Annals of Applied Statistics

A method for visual identification of small sample subgroups and potential biomarkers

Charlotte Soneson and Magnus Fontes

Full-text: Open access


In order to find previously unknown subgroups in biomedical data and generate testable hypotheses, visually guided exploratory analysis can be of tremendous importance. In this paper we propose a new dissimilarity measure that can be used within the Multidimensional Scaling framework to obtain a joint low-dimensional representation of both the samples and variables of a multivariate data set, thereby providing an alternative to conventional biplots. In comparison with biplots, the representations obtained by our approach are particularly useful for exploratory analysis of data sets where there are small groups of variables sharing unusually high or low values for a small group of samples.
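The proposed dissimilarity measure is plugged into classical (Torgerson) multidimensional scaling to produce the low-dimensional map. As background for readers unfamiliar with that step, here is a minimal NumPy sketch of classical MDS via double centering and eigendecomposition; the function name `classical_mds` and the toy point configuration are illustrative assumptions, not taken from the paper, and the paper's own dissimilarity measure is not reproduced here.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: find a k-dimensional configuration whose
    pairwise Euclidean distances approximate the dissimilarities in D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered "Gram" matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder: largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    pos = np.clip(eigvals[:k], 0.0, None)    # guard against tiny negative values
    return eigvecs[:, :k] * np.sqrt(pos)     # coordinates of the embedding

# Demo: distances computed from a known 2-D configuration are recovered
# exactly (up to rotation/reflection) by a 2-D embedding.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
Y = classical_mds(D, k=2)
```

In the paper's setting, `D` would instead be the proposed sample-plus-variable dissimilarity matrix, so that rows and columns of the data set are embedded jointly in one plot.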

Article information

Ann. Appl. Stat., Volume 5, Number 3 (2011), 2131-2149.

First available in Project Euclid: 13 October 2011

Keywords: Principal Components Analysis; biplot; dimension reduction; multidimensional scaling; visualization


Soneson, Charlotte; Fontes, Magnus. A method for visual identification of small sample subgroups and potential biomarkers. Ann. Appl. Stat. 5 (2011), no. 3, 2131--2149. doi:10.1214/11-AOAS460.



  • Alter, O., Brown, P. and Botstein, D. (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA 97 10101–10106.
  • Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15 1373–1396.
  • Ben-Dor, A., Chor, B., Karp, R. and Yakhini, Z. (2003). Discovering local structure in gene expression data: The order-preserving submatrix problem. J. Comput. Biol. 10 373–384.
  • Bergmann, S., Ihmels, J. and Barkai, N. (2003). Iterative signature algorithm for the analysis of large-scale gene expression data. Phys. Rev. E 67 031902.
  • Bisson, G. and Hussain, F. (2008). Chi-Sim: A new similarity measure for the co-clustering task. In Proc. 2008 Seventh International Conference on Machine Learning and Applications 211–217. IEEE Computer Society, Los Alamitos, CA.
  • Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika 48 305–308.
  • Chapman, S., Schenk, P., Kazan, K. and Manners, J. (2001). Using biplots to interpret gene expression patterns in plants. Bioinformatics 1 202–204.
  • Cheng, Y. and Church, G. M. (2000). Biclustering of expression data. In Proc. ISMB’00 93–103. AAAI Press, Menlo Park, CA.
  • Cox, T. F. and Cox, M. A. A. (2001). Multidimensional Scaling, 2nd ed. Chapman & Hall, London.
  • De Crespin de Billy, V., Dolédec, S. and Chessel, D. (2000). Biplot presentation of diet composition data: An alternative for fish stomach contents analysis. J. Fish Biol. 56 961–973.
  • Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In Proc. KDD’01 269–274. ACM, New York.
  • Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika 1 211–218.
  • Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. C-23 881–890.
  • Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika 58 453–467.
  • Getz, G., Levine, E. and Domany, E. (2000). Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA 97 12079–12084.
  • Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53 325–338.
  • Gower, J. C. and Hand, D. J. (1996). Biplots, 1st ed. Chapman & Hall, London.
  • Gower, J. C. and Harding, S. A. (1988). Nonlinear biplots. Biometrika 75 445–455.
  • Guyon, I., Weston, J., Barnhill, S. and Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Mach. Learn. 46 389–422.
  • Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer, New York.
  • Hastie, T., Tibshirani, R., Eisen, M. B., Alizadeh, A., Levy, R., Staudt, L., Chan, W. C., Botstein, D. and Brown, P. (2000). ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol. 1 1–21.
  • Hotelling, H. (1933a). Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24 417–441.
  • Hotelling, H. (1933b). Analysis of a complex of statistical variables into principal components (continued from September issue). J. Educ. Psychol. 24 498–520.
  • Huber, P. (1985). Projection pursuit. Ann. Statist. 13 435–475.
  • Hyvärinen, A. and Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Netw. 13 411–430.
  • Jolliffe, I. T. (2002). Principal Component Analysis, 2nd ed. Springer, New York.
  • Laub, J. and Müller, K.-R. (2004). Feature discovery in non-metric pairwise data. J. Mach. Learn. Res. 5 801–818.
  • Lee, M., Shen, H., Huang, J. Z. and Marron, J. S. (2010). Biclustering via sparse singular value decomposition. Biometrics 66 1087–1095.
  • Madeira, S. C. and Oliveira, A. L. (2004). Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinform. 1 24–45.
  • Mardia, K. V. (1978). Some properties of classical multidimensional scaling. Comm. Statist. Theory Methods 7 1233–1241.
  • Park, M., Lee, J. W., Lee, J. B. and Song, S. H. (2008). Several biplot methods applied to gene expression data. J. Statist. Plann. Inference 138 500–515.
  • Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Phil. Mag. (6) 2 559–572.
  • Phillips, M. S. and McNicol, J. W. (1986). The use of biplots as an aid to interpreting interactions between potato clones and populations of potato cyst nematodes. Plant Pathol. 35 185–195.
  • Rege, M., Dong, M. and Fotouhi, F. (2008). Bipartite isoperimetric graph partitioning for data co-clustering. Data Min. Knowl. Discov. 16 276–312.
  • Ross, M. E., Zhou, X., Song, G., Shurtleff, S. A., Girtman, K., Williams, W. K., Liu, H.-C., Mahfouz, R., Raimondi, S. C., Lenny, N., Patel, A. and Downing, J. R. (2003). Classification of pediatric acute lymphoblastic leukemia by gene expression profiling. Blood 102 2951–2959.
  • Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290 2323–2326.
  • Schölkopf, B., Smola, A. J. and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10 1299–1319.
  • Shamir, R., Maron-Katz, A., Tanay, A., Linhart, C., Steinfeld, I., Sharan, R., Shiloh, Y. and Elkon, R. (2005). EXPANDER—An integrative program suite for microarray data analysis. BMC Bioinformatics 6 232.
  • Sneath, P. H. A. (1957). The application of computers to taxonomy. J. Gen. Microbiol. 17 201–226.
  • Soneson, C. and Fontes, M. (2011). Supplement to “A method for visual identification of small sample subgroups and potential biomarkers.” DOI:10.1214/11-AOAS460SUPPA, DOI:10.1214/11-AOAS460SUPPB.
  • Tanay, A., Sharan, R. and Shamir, R. (2002). Discovering statistically significant biclusters in gene expression data. Bioinformatics 18 S136–S144.
  • Tenenbaum, J. B., de Silva, V. and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290 2319–2322.
  • Torgerson, W. S. (1952). Multidimensional scaling: I. Theory and method. Psychometrika 17 401–419.
  • Wang, H., Wang, W., Yang, J. and Yu, P. S. (2002). Clustering by pattern similarity in large data sets. In Proc. 2002 ACM SIGMOD 394–405. ACM, New York.
  • Wouters, L., Göhlmann, H., Bijnens, L., Kass, S. U., Molenberghs, G. and Lewi, P. J. (2003). Graphical exploration of gene expression data: A comparative study of three multivariate methods. Biometrics 59 1131–1139.

Supplemental materials

  • Supplementary material A: Supplementary material. In the supplementary material we give a small schematic example showing the different steps of CUMBIA. Further, we show how to emphasize both over- and underexpressed variables in the visualization and how the choice of K and s affects the resulting visualization. We also provide scree plots obtained by CUMBIA and PCA for the three data sets studied in the paper.
  • Supplementary material B: Supplementary figures—Projection pursuit results. The supplementary figures show the result of the FastICA projection pursuit algorithm applied to the three data sets considered in the paper. Note that to facilitate the interpretation of the figures, the axes are ungraded and only the origin is marked.