The Annals of Applied Statistics

Treelets—An adaptive multi-scale basis for sparse unordered data

Ann B. Lee, Boaz Nadler, and Larry Wasserman

Full-text: Open access


In many modern applications, including analysis of gene expression and text documents, the data are noisy, high-dimensional, and unordered—with no particular meaning to the given order of the variables. Yet, successful learning is often possible due to sparsity: the fact that the data are typically redundant with underlying structures that can be represented by only a few features. In this paper we present treelets—a novel construction of multi-scale bases that extends wavelets to nonsmooth signals. The method is fully adaptive, as it returns a hierarchical tree and an orthonormal basis which both reflect the internal structure of the data. Treelets are especially well-suited as a dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis, as well as some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.
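As a rough illustration of how a hierarchical tree and an orthonormal basis can be built together, the following minimal sketch merges the two most correlated variables at each level and decorrelates them with a local two-variable rotation. It is not the authors' implementation: the function name `treelet_basis`, the use of absolute correlation as the similarity measure, the particular rotation-angle convention, and the toy blocked covariance matrix are all assumptions made for this example.

```python
import numpy as np

def treelet_basis(cov, n_levels=None):
    """Hypothetical sketch: build an orthonormal basis by repeated pairwise rotations.

    Returns (B, merges), where the columns of B are basis vectors and
    merges lists the (kept, dropped) variable indices at each level.
    """
    cov = np.array(cov, dtype=float)
    p = cov.shape[0]
    n_levels = p - 1 if n_levels is None else min(n_levels, p - 1)
    B = np.eye(p)
    active = list(range(p))  # variables still eligible for merging
    merges = []
    for _ in range(n_levels):
        # Find the most similar active pair (assumed similarity: |correlation|).
        sd = np.sqrt(np.diag(cov))
        best, pair = -1.0, None
        for i, a in enumerate(active):
            for b in active[i + 1:]:
                denom = sd[a] * sd[b]
                r = abs(cov[a, b]) / denom if denom > 0 else 0.0
                if r > best:
                    best, pair = r, (a, b)
        a, b = pair
        # Rotation angle that zeroes the covariance between a and b.
        theta = 0.5 * np.arctan2(2.0 * cov[a, b], cov[a, a] - cov[b, b])
        c, s = np.cos(theta), np.sin(theta)
        J = np.eye(p)
        J[a, a], J[a, b], J[b, a], J[b, b] = c, -s, s, c
        cov = J.T @ cov @ J
        B = B @ J
        # With this angle, coordinate a carries the larger variance:
        # keep it active and retire b.
        active.remove(b)
        merges.append((a, b))
    return B, merges

# Toy blocked covariance: variables {0, 1} and {2, 3} form two correlated groups.
cov = np.array([[1.0, 0.9, 0.0, 0.0],
                [0.9, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.8],
                [0.0, 0.0, 0.8, 1.0]])
B, merges = treelet_basis(cov)
print(merges)
print(np.round(B, 3))
```

In this toy case the first two merges recover the two blocks, and the resulting basis vectors are supported on the variables within each block rather than spread across all coordinates.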

Article information

Ann. Appl. Stat., Volume 2, Number 2 (2008), 435–471.

First available in Project Euclid: 3 July 2008

Digital Object Identifier: doi:10.1214/07-AOAS137

Keywords: Feature selection; dimensionality reduction; multi-resolution analysis; local best basis; sparsity; principal component analysis; hierarchical clustering; small sample sizes


Lee, Ann B.; Nadler, Boaz; Wasserman, Larry. Treelets—An adaptive multi-scale basis for sparse unordered data. Ann. Appl. Stat. 2 (2008), no. 2, 435–471. doi:10.1214/07-AOAS137.


