Annals of Applied Statistics

Finding large average submatrices in high dimensional data

Andrey A. Shabalin, Victor J. Weigman, Charles M. Perou, and Andrew B. Nobel



The search for sample-variable associations is an important problem in the exploratory analysis of high dimensional data. Biclustering methods search for sample-variable associations in the form of distinguished submatrices of the data matrix. (The rows and columns of a submatrix need not be contiguous.) In this paper we propose and evaluate a statistically motivated biclustering procedure (LAS) that finds large average submatrices within a given real-valued data matrix. The procedure operates in an iterative-residual fashion, and is driven by a Bonferroni-based significance score that effectively trades off between submatrix size and average value. We examine the performance and potential utility of LAS, and compare it with a number of existing methods, through an extensive three-part validation study using two gene expression datasets. The validation study examines quantitative properties of biclusters, biological and clinical assessments using auxiliary information, and classification of disease subtypes using bicluster membership. In addition, we carry out a simulation study to assess the effectiveness and noise sensitivity of the LAS search procedure. These results suggest that LAS is an effective exploratory tool for the discovery of biologically relevant structures in high dimensional data.
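To make the trade-off between submatrix size and average value concrete, the following is a minimal sketch of one natural form of a Bonferroni-based score. It is not taken from this page. It assumes that, under the null, the entries of the data matrix behave like independent standard normal variables, so that the average of a k-by-l submatrix exceeds a value avg with probability P(Z >= avg * sqrt(k * l)). The Bonferroni correction multiplies this tail probability by the number of k-by-l submatrices of an m-by-n matrix. The function names and the underflow fallback are illustrative choices; the exact score used by LAS is defined in the full paper.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_normal_upper_tail(x):
    """log P(Z >= x) for a standard normal Z."""
    tail = 0.5 * math.erfc(x / math.sqrt(2.0))
    if tail > 0.0:
        return math.log(tail)
    # erfc underflowed to zero (very large x): use the leading-order
    # asymptotic P(Z >= x) ~ exp(-x^2 / 2) / (x * sqrt(2 * pi)).
    return -0.5 * x * x - math.log(x) - 0.5 * math.log(2.0 * math.pi)


def bonferroni_score(m, n, k, l, avg):
    """Assumed score: -log( C(m,k) * C(n,l) * P(Z >= avg * sqrt(k*l)) ).

    m, n : dimensions of the full data matrix
    k, l : dimensions of the candidate submatrix
    avg  : average of the entries in the candidate submatrix
    Larger scores indicate submatrices that are less likely under the
    assumed Gaussian null, after correcting for the number of candidates.
    """
    log_count = log_binom(m, k) + log_binom(n, l)
    log_tail = log_normal_upper_tail(avg * math.sqrt(k * l))
    return -(log_count + log_tail)


# Same submatrix size, different averages, in a 1000 x 100 matrix.
high = bonferroni_score(1000, 100, 10, 10, 2.0)
low = bonferroni_score(1000, 100, 10, 10, 1.0)
```

Under this assumed form, a higher average raises the score for a fixed size, while the binomial terms penalize the large number of candidate submatrices of that size. A modest average on a small submatrix can therefore score below zero.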

Software is available at

Article information

Ann. Appl. Stat., Volume 3, Number 3 (2009), 985-1012.

First available in Project Euclid: 5 October 2009


Keywords: Biclustering, classification, gene expression microarray, breast cancer, lung cancer


Shabalin, Andrey A.; Weigman, Victor J.; Perou, Charles M.; Nobel, Andrew B. Finding large average submatrices in high dimensional data. Ann. Appl. Stat. 3 (2009), no. 3, 985–1012. doi:10.1214/09-AOAS239.



