## The Annals of Statistics

### Influential features PCA for high dimensional clustering

#### Abstract

We consider a clustering problem where we observe feature vectors $X_{i}\in R^{p}$, $i=1,2,\ldots,n$, from $K$ possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of $p\gg n$, where classical clustering methods face challenges.
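For concreteness, the standard working model for this setup is a Gaussian mixture with class-specific means. The display below is an illustrative assumption for exposition; the paper's precise conditions are stated in the text.

```latex
% Illustrative working model (an assumption for exposition; see the
% paper for its exact conditions): each sample is its class mean plus
% isotropic Gaussian noise, with the labels y_i unobserved.
X_i = \mu_{y_i} + Z_i, \qquad Z_i \overset{iid}{\sim} N(0, I_p), \qquad
y_i \in \{1, \dots, K\}, \qquad i = 1, \dots, n.
```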

We propose Influential Features PCA (IF-PCA) as a new clustering procedure. In IF-PCA, we select a small fraction of features with the largest Kolmogorov–Smirnov (KS) scores, obtain the first $(K-1)$ left singular vectors of the post-selection normalized data matrix, and then estimate the labels by applying the classical $k$-means procedure to these singular vectors. In this procedure, the only tuning parameter is the threshold in the feature selection step. We set the threshold in a data-driven fashion by adapting the recent notion of Higher Criticism. As a result, IF-PCA is a tuning-free clustering method.
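The following is a minimal Python sketch of the pipeline just described. It is an illustration written for this summary, not the authors' code: the function name `if_pca` and all implementation choices are ours, and the KS p-values come from SciPy's asymptotic null, whereas the paper calibrates the KS scores against a simulated, Efron-corrected empirical null.

```python
# Minimal sketch of IF-PCA: KS screening + Higher Criticism threshold
# + PCA on the selected features + k-means. Illustrative only.
import numpy as np
from scipy.stats import kstest
from sklearn.cluster import KMeans

def if_pca(X, K, alpha0=0.5, seed=0):
    """Cluster the rows of X (n samples, p features) into K groups."""
    n, p = X.shape
    # Step 1: feature screening by Kolmogorov-Smirnov scores on the
    # column-standardized data (asymptotic p-values as a stand-in for
    # the paper's simulated empirical null).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    pvals = np.array([kstest(Z[:, j], "norm").pvalue for j in range(p)])
    # Data-driven threshold via Higher Criticism (Donoho-Jin 2004):
    # maximize HC(k) = sqrt(p) * (k/p - pi_(k)) / sqrt(pi_(k)(1 - pi_(k)))
    # over the smallest alpha0-fraction of the sorted p-values pi_(k).
    pi = np.sort(pvals)
    k = np.arange(1, p + 1)
    hc = np.sqrt(p) * (k / p - pi) / np.sqrt(pi * (1 - pi) + 1e-12)
    khat = int(np.argmax(hc[: int(alpha0 * p)])) + 1
    selected = pvals <= pi[khat - 1]        # keep the khat most significant
    # Step 2: first (K-1) left singular vectors of the post-selection
    # normalized data matrix.
    U, s, Vt = np.linalg.svd(Z[:, selected], full_matrices=False)
    scores = U[:, : K - 1]
    # Step 3: classical k-means on the low-dimensional scores.
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(scores)
```

On a data matrix `X` with rows as samples, `if_pca(X, K=2)` returns estimated labels; the only free choice is the fraction `alpha0` of sorted p-values over which the HC objective is maximized (one half is a common default in the HC literature).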

We apply IF-PCA to $10$ gene microarray data sets, where it has competitive clustering performance. In particular, in three of the data sets, the error rates of IF-PCA are only $29\%$ or less of the error rates of other methods. We also rediscover, on microarray data, the empirical-null phenomenon of Efron [J. Amer. Statist. Assoc. 99 (2004) 96–104].

With delicate analysis, especially post-selection eigen-analysis, we derive tight probability bounds on the Kolmogorov–Smirnov statistics and show that IF-PCA yields clustering consistency in a broad context. The clustering problem is connected to the problems of sparse PCA and low-rank matrix recovery, but it is different in important ways. We reveal an interesting phase transition phenomenon associated with these problems and identify the range of interest for each.

#### Article information

- Source: Ann. Statist., Volume 44, Number 6 (2016), 2323–2359.
- Dates: revised December 2015; first available in Project Euclid, 23 November 2016.
- Permanent link: https://projecteuclid.org/euclid.aos/1479891617
- Digital Object Identifier: doi:10.1214/15-AOS1423
- Mathematical Reviews number (MathSciNet): MR3576543

#### Citation

Jin, Jiashun; Wang, Wanjie. Influential features PCA for high dimensional clustering. Ann. Statist. 44 (2016), no. 6, 2323–2359. doi:10.1214/15-AOS1423. https://projecteuclid.org/euclid.aos/1479891617.

#### References

• Abramovich, F., Benjamini, Y., Donoho, D. L. and Johnstone, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist. 34 584–653.
• Amini, A. A. and Wainwright, M. J. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 37 2877–2921.
• Arias-Castro, E., Lerman, G. and Zhang, T. (2013). Spectral clustering based on local PCA. Available at arXiv:1301.2007.
• Arias-Castro, E. and Verzelen, N. (2014). Detection and feature selection in sparse mixture models. Available at arXiv:1405.1478.
• Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms 1027–1035. ACM, New York.
• Azizyan, M., Singh, A. and Wasserman, L. (2013). Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. In Advances in Neural Information Processing Systems 2139–2147. Curran Associates, Red Hook, NY.
• Baik, J. and Silverstein, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 97 1382–1408.
• Birnbaum, A., Johnstone, I. M., Nadler, B. and Paul, D. (2013). Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Statist. 41 1055–1084.
• Cai, T., Ma, Z. and Wu, Y. (2015). Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields 161 781–815.
• Chan, Y. and Hall, P. (2010). Using evidence of mixed populations to select variables for clustering very high-dimensional data. J. Amer. Statist. Assoc. 105 798–809.
• Chen, J. and Li, P. (2009). Hypothesis test for normal mixture models: The EM approach. Ann. Statist. 37 2523–2542.
• Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 7 1–46.
• Dettling, M. (2004). BagBoosting for tumor classification with gene expression data. Bioinformatics 20 3583–3593.
• Donoho, D. (2015). 50 years of data science. Unpublished manuscript.
• Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32 962–994.
• Donoho, D. and Jin, J. (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 105 14790–14795.
• Donoho, D. and Jin, J. (2015). Higher criticism for large-scale inference, especially for rare and weak effects. Statist. Sci. 30 1–25.
• Durbin, J. (1985). The first-passage density of a continuous Gaussian process to a general boundary. J. Appl. Probab. 22 99–122.
• Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96–104.
• Efron, B. (2009). Empirical Bayes estimates for large-scale prediction problems. J. Amer. Statist. Assoc. 104 1015–1028.
• Fan, Y., Jin, J. and Yao, Z. (2013). Optimal classification in sparse Gaussian graphic model. Ann. Statist. 41 2537–2571.
• Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 849–911.
• Fan, J., Ke, Z. T., Liu, H. and Xia, L. (2015). QUADRO: A supervised dimension reduction method via Rayleigh quotient optimization. Ann. Statist. 43 1498–1534.
• Gordon, G. J., Jensen, R. V., Hsiao, L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., Richards, W. G., Sugarbaker, D. J. and Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 62 4963–4967.
• Guionnet, A. and Zeitouni, O. (2000). Concentration of the spectral measure for large matrices. Electron. Commun. Probab. 5 119–136 (electronic).
• Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
• Jin, J. (2015). Fast community detection by SCORE. Ann. Statist. 43 57–89.
• Jin, J. and Ke, Z. T. (2016). Rare and weak effects in large-scale inference: Methods and phase diagrams. Statist. Sinica 26 1–34.
• Jin, J., Ke, Z. T. and Wang, W. (2015a). Optimal spectral clustering by Higher Criticism Thresholding. Manuscript.
• Jin, J., Ke, Z. T. and Wang, W. (2015b). Phase transitions for high dimensional clustering and related problems. Available at arXiv:1502.06952.
• Jin, J. and Wang, W. (2016). Supplement to “Influential Features PCA for high dimensional clustering.” DOI:10.1214/15-AOS1423SUPP.
• Jin, J., Zhang, C. and Zhang, Q. (2014). Optimality of graphlet screening in high dimensional variable selection. J. Mach. Learn. Res. 15 2723–2772.
• Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295–327.
• Jung, S. and Marron, J. S. (2009). PCA consistency in high dimension, low sample size context. Ann. Statist. 37 4104–4130.
• Ke, Z. T., Jin, J. and Fan, J. (2014). Covariate assisted screening and estimation. Ann. Statist. 42 2202–2242.
• Kolmogorov, A. N. (1933). Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari 4 83–91.
• Kritchman, S. and Nadler, B. (2008). Determining the number of components in a factor model from limited noisy data. Chemom. Intell. Lab. Syst. 94 19–32.
• Lee, A. B., Luca, D. and Roeder, K. (2010). A spectral graph approach to discovering genetic ancestry. Ann. Appl. Stat. 4 179–202.
• Lee, S., Zou, F. and Wright, F. A. (2010). Convergence and prediction of principal component scores in high-dimensional settings. Ann. Statist. 38 3605–3629.
• Lei, J. and Vu, V. Q. (2015). Sparsistency and agnostic inference in sparse PCA. Ann. Statist. 43 299–322.
• Loader, C. R. (1992). Boundary crossing probabilities for locally Poisson processes. Ann. Appl. Probab. 2 199–228.
• Ma, Z. (2013). Sparse principal component analysis and iterative thresholding. Ann. Statist. 41 772–801.
• Ng, A. Y., Jordan, M. I. and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2 849–856.
• Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist. Sinica 17 1617–1642.
• Paul, D. and Johnstone, I. M. (2012). Augmented sparse principal component analysis for high dimensional data. Available at arXiv:1202.1242.
• Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York.
• Siegmund, D. (1982). Large deviations for boundary crossing probabilities. Ann. Probab. 10 581–588.
• Woodroofe, M. (1978). Large deviations of likelihood ratio statistics with applications to sequential testing. Ann. Statist. 6 72–84.
• Yousefi, M. R., Hua, J., Sima, C. and Dougherty, E. R. (2010). Reporting bias when using real data sets to analyze classification performance. Bioinformatics 26 68–76.
• Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265–286.