The Annals of Statistics

Influential features PCA for high dimensional clustering

Jiashun Jin and Wanjie Wang


We consider a clustering problem where we observe feature vectors $X_{i}\in R^{p}$, $i=1,2,\ldots,n$, from $K$ possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of $p\gg n$, where classical clustering methods face challenges.

We propose Influential Features PCA (IF-PCA) as a new clustering procedure. In IF-PCA, we select a small fraction of features with the largest Kolmogorov–Smirnov (KS) scores, obtain the first $(K-1)$ left singular vectors of the post-selection normalized data matrix, and then estimate the labels by applying the classical $k$-means procedure to these singular vectors. In this procedure, the only tuning parameter is the threshold in the feature selection step. We set the threshold in a data-driven fashion by adapting the recent notion of Higher Criticism. As a result, IF-PCA is a tuning-free clustering method.
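The steps described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the Higher Criticism threshold is replaced by a hypothetical user-supplied count `n_keep` of retained features, the KS score of a feature is taken as the scaled KS distance between its standardized values and the standard normal distribution, and the $k$-means step is a plain Lloyd iteration with random initialization. The function names `ks_scores`, `kmeans`, and `if_pca` are illustrative only.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # elementwise error function, avoids a SciPy dependency


def ks_scores(X):
    """Return (scores, Z): a KS-type score per column and the standardized data.

    Assumption: each column is standardized and compared with N(0, 1).
    """
    n, _ = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Z_sorted = np.sort(Z, axis=0)
    cdf = 0.5 * (1.0 + _erf(Z_sorted / math.sqrt(2.0)))  # normal CDF at sorted values
    upper = np.arange(1, n + 1)[:, None] / n  # empirical CDF just after each point
    lower = np.arange(0, n)[:, None] / n      # empirical CDF just before each point
    d = np.maximum(upper - cdf, cdf - lower).max(axis=0)
    return math.sqrt(n) * d, Z


def kmeans(U, k, n_iter=100, seed=0):
    """Plain Lloyd iteration on the rows of U, started from k random rows."""
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty.
        new_centers = np.array([
            U[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def if_pca(X, K, n_keep):
    """Sketch of the pipeline for an n-by-p data matrix X (rows are samples).

    n_keep is a hypothetical stand-in for the data-driven threshold.
    """
    scores, Z = ks_scores(X)
    keep = np.argsort(scores)[-n_keep:]            # features with the largest scores
    U, _, _ = np.linalg.svd(Z[:, keep], full_matrices=False)
    labels = kmeans(U[:, :K - 1], K)               # first K-1 left singular vectors
    return labels, keep
```

Calling `if_pca(X, K=2, n_keep=20)` returns one estimated label per row of `X` together with the indices of the retained features. Fixing `n_keep` by hand reintroduces a tuning parameter, which is exactly what the data-driven threshold in the paper is meant to avoid.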

We apply IF-PCA to $10$ gene microarray data sets, where it is competitive in clustering performance. In particular, in three of the data sets, the error rates of IF-PCA are at most $29\%$ of the error rates of the other methods. We also rediscover, on microarray data, a phenomenon on the empirical null described by Efron [J. Amer. Statist. Assoc. 99 (2004) 96–104].

With delicate analysis, especially post-selection eigen-analysis, we derive tight probability bounds on the Kolmogorov–Smirnov statistics and show that IF-PCA yields clustering consistency in a broad context. The clustering problem is connected to the problems of sparse PCA and low-rank matrix recovery, but it is different in important ways. We reveal an interesting phase transition phenomenon associated with these problems and identify the range of interest for each.

Article information

Ann. Statist., Volume 44, Number 6 (2016), 2323-2359.

Received: July 2014
Revised: December 2015
First available in Project Euclid: 23 November 2016

Primary: 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20] 62G32: Statistics of extreme values; tail inference
Secondary: 62E20: Asymptotic distribution theory

Empirical null, feature selection, gene microarray, Hamming distance, phase transition, post-selection spectral clustering, sparsity


Jin, Jiashun; Wang, Wanjie. Influential features PCA for high dimensional clustering. Ann. Statist. 44 (2016), no. 6, 2323–2359. doi:10.1214/15-AOS1423.


  • Abramovich, F., Benjamini, Y., Donoho, D. L. and Johnstone, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist. 34 584–653.
  • Amini, A. A. and Wainwright, M. J. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 37 2877–2921.
  • Arias-Castro, E., Lerman, G. and Zhang, T. (2013). Spectral clustering based on local PCA. Available at arXiv:1301.2007.
  • Arias-Castro, E. and Verzelen, N. (2014). Detection and feature selection in sparse mixture models. Available at arXiv:1405.1478.
  • Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms 1027–1035. ACM, New York.
  • Azizyan, M., Singh, A. and Wasserman, L. (2013). Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. In Advances in Neural Information Processing Systems 2139–2147. Curran Associates, Red Hook, NY.
  • Baik, J. and Silverstein, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 97 1382–1408.
  • Birnbaum, A., Johnstone, I. M., Nadler, B. and Paul, D. (2013). Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Statist. 41 1055–1084.
  • Cai, T., Ma, Z. and Wu, Y. (2015). Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields 161 781–815.
  • Chan, Y. and Hall, P. (2010). Using evidence of mixed populations to select variables for clustering very high-dimensional data. J. Amer. Statist. Assoc. 105 798–809.
  • Chen, J. and Li, P. (2009). Hypothesis test for normal mixture models: The EM approach. Ann. Statist. 37 2523–2542.
  • Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 7 1–46.
  • Dettling, M. (2004). BagBoosting for tumor classification with gene expression data. Bioinformatics 20 3583–3593.
  • Donoho, D. (2015). 50 years of data science. Unpublished manuscript.
  • Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32 962–994.
  • Donoho, D. and Jin, J. (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 105 14790–14795.
  • Donoho, D. and Jin, J. (2015). Higher criticism for large-scale inference, especially for rare and weak effects. Statist. Sci. 30 1–25.
  • Durbin, J. (1985). The first-passage density of a continuous Gaussian process to a general boundary. J. Appl. Probab. 22 99–122.
  • Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96–104.
  • Efron, B. (2009). Empirical Bayes estimates for large-scale prediction problems. J. Amer. Statist. Assoc. 104 1015–1028.
  • Fan, Y., Jin, J. and Yao, Z. (2013). Optimal classification in sparse Gaussian graphic model. Ann. Statist. 41 2537–2571.
  • Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 849–911.
  • Fan, J., Ke, Z. T., Liu, H. and Xia, L. (2015). QUADRO: A supervised dimension reduction method via Rayleigh quotient optimization. Ann. Statist. 43 1498–1534.
  • Gordon, G. J., Jensen, R. V., Hsiao, L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., Richards, W. G., Sugarbaker, D. J. and Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 62 4963–4967.
  • Guionnet, A. and Zeitouni, O. (2000). Concentration of the spectral measure for large matrices. Electron. Commun. Probab. 5 119–136 (electronic).
  • Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
  • Jin, J. (2015). Fast community detection by SCORE. Ann. Statist. 43 57–89.
  • Jin, J. and Ke, Z. T. (2016). Rare and weak effects in large-scale inference: Methods and phase diagrams. Statist. Sinica 26 1–34.
  • Jin, J., Ke, Z. T. and Wang, W. (2015a). Optimal spectral clustering by Higher Criticism Thresholding. Manuscript.
  • Jin, J., Ke, Z. T. and Wang, W. (2015b). Phase transitions for high dimensional clustering and related problems. Available at arXiv:1502.06952.
  • Jin, J. and Wang, W. (2016). Supplement to “Influential Features PCA for high dimensional clustering.” DOI:10.1214/15-AOS1423SUPP.
  • Jin, J., Zhang, C. and Zhang, Q. (2014). Optimality of graphlet screening in high dimensional variable selection. J. Mach. Learn. Res. 15 2723–2772.
  • Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295–327.
  • Jung, S. and Marron, J. S. (2009). PCA consistency in high dimension, low sample size context. Ann. Statist. 37 4104–4130.
  • Ke, Z. T., Jin, J. and Fan, J. (2014). Covariate assisted screening and estimation. Ann. Statist. 42 2202–2242.
  • Kolmogorov, A. N. (1933). Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari 4 83–91.
  • Kritchman, S. and Nadler, B. (2008). Determining the number of components in a factor model from limited noisy data. Chemometr. Intell. Lab 94 19–32.
  • Lee, A. B., Luca, D. and Roeder, K. (2010). A spectral graph approach to discovering genetic ancestry. Ann. Appl. Stat. 4 179–202.
  • Lee, S., Zou, F. and Wright, F. A. (2010). Convergence and prediction of principal component scores in high-dimensional settings. Ann. Statist. 38 3605–3629.
  • Lei, J. and Vu, V. Q. (2015). Sparsistency and agnostic inference in sparse PCA. Ann. Statist. 43 299–322.
  • Loader, C. R. (1992). Boundary crossing probabilities for locally Poisson processes. Ann. Appl. Probab. 2 199–228.
  • Ma, Z. (2013). Sparse principal component analysis and iterative thresholding. Ann. Statist. 41 772–801.
  • Ng, A. Y., Jordan, M. I. and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2 849–856.
  • Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist. Sinica 17 1617–1642.
  • Paul, D. and Johnstone, I. M. (2012). Augmented sparse principal component analysis for high dimensional data. Available at arXiv:1202.1242.
  • Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York.
  • Siegmund, D. (1982). Large deviations for boundary crossing probabilities. Ann. Probab. 10 581–588.
  • Woodroofe, M. (1978). Large deviations of likelihood ratio statistics with applications to sequential testing. Ann. Statist. 6 72–84.
  • Yousefi, M. R., Hua, J., Sima, C. and Dougherty, E. R. (2010). Reporting bias when using real data sets to analyze classification performance. Bioinformatics 26 68–76.
  • Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265–286.

See also

  • Discussion of "Influential features PCA for high dimensional clustering".
  • Discussion of "Influential features PCA for high dimensional clustering".
  • Discussion of "Influential feature PCA for high dimensional clustering".
  • Discussion of "Influential features PCA for high dimensional clustering".
  • Rejoinder: "Influential features PCA for high dimensional clustering".

Supplemental materials

  • Supplement to “Influential Features PCA for high dimensional clustering”. Owing to space constraints, the technical proofs are relegated to a supplementary document, Jin and Wang (2016). It contains three sections: Appendices A–C.