## The Annals of Statistics

### Influential features PCA for high dimensional clustering

#### Abstract

We consider a clustering problem where we observe feature vectors $X_{i}\in \mathbb{R}^{p}$, $i=1,2,\ldots,n$, from $K$ possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of $p\gg n$, where classical clustering methods face challenges.

We propose Influential Features PCA (IF-PCA) as a new clustering procedure. In IF-PCA, we select a small fraction of features with the largest Kolmogorov–Smirnov (KS) scores, obtain the first $(K-1)$ left singular vectors of the post-selection normalized data matrix, and then estimate the labels by applying the classical $k$-means procedure to these singular vectors. In this procedure, the only tuning parameter is the threshold in the feature selection step. We set the threshold in a data-driven fashion by adapting the recent notion of Higher Criticism. As a result, IF-PCA is a tuning-free clustering method.
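The three steps described above (KS-based feature selection, extraction of the leading $(K-1)$ left singular vectors, and $k$-means) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names and the `n_keep` parameter are hypothetical, a fixed number of retained features stands in for the data-driven Higher Criticism threshold, the KS scores are compared against a standard normal without any further recalibration, and a plain Lloyd iteration stands in for a production $k$-means routine.

```python
import math
import numpy as np


def normalize_columns(X):
    """Center each feature and scale it to unit sample variance.

    Assumes no feature is constant (otherwise the division fails).
    """
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def ks_scores(W):
    """sqrt(n) times the KS distance between each column's empirical CDF
    and the standard normal CDF. W is assumed to be column-normalized."""
    n = W.shape[0]
    S = np.sort(W, axis=0)
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(S / math.sqrt(2.0)))
    upper = np.arange(1, n + 1)[:, None] / n   # ECDF just after each point
    lower = np.arange(0, n)[:, None] / n       # ECDF just before each point
    D = np.maximum(upper - Phi, Phi - lower).max(axis=0)
    return math.sqrt(n) * D


def kmeans(U, K, n_iter=50, seed=0):
    """Plain Lloyd iteration with random initial centers (illustrative only)."""
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=K, replace=False)]
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):            # skip empty clusters
                centers[k] = U[labels == k].mean(axis=0)
    return labels


def if_pca_sketch(X, K, n_keep):
    """Simplified IF-PCA-style pipeline.

    n_keep is a hypothetical stand-in for the Higher Criticism threshold:
    it fixes how many top-scoring features are retained.
    """
    W = normalize_columns(X)
    scores = ks_scores(W)
    selected = np.sort(np.argsort(scores)[::-1][:n_keep])
    # Left singular vectors of the post-selection normalized data matrix.
    U, _, _ = np.linalg.svd(W[:, selected], full_matrices=False)
    labels = kmeans(U[:, :K - 1], K)
    return labels, selected
```

Calling `if_pca_sketch(X, K=2, n_keep=20)` on an $n\times p$ array `X` returns a length-$n$ vector of estimated labels and the indices of the retained features. The labels are identified only up to a permutation of the class names.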

We apply IF-PCA to $10$ gene microarray data sets. The method has competitive performance in clustering. In particular, in three of the data sets, the error rates of IF-PCA are only $29\%$ or less of the error rates of other methods. We have also rediscovered a phenomenon on empirical null by Efron [J. Amer. Statist. Assoc. 99 (2004) 96–104] on microarray data.

With delicate analysis, especially post-selection eigen-analysis, we derive tight probability bounds on the Kolmogorov–Smirnov statistics and show that IF-PCA yields clustering consistency in a broad context. The clustering problem is connected to the problems of sparse PCA and low-rank matrix recovery, but it is different in important ways. We reveal an interesting phase transition phenomenon associated with these problems and identify the range of interest for each.

#### Article information

Source
Ann. Statist. Volume 44, Number 6 (2016), 2323–2359.

Dates
Revised: December 2015
First available in Project Euclid: 23 November 2016

https://projecteuclid.org/euclid.aos/1479891617

Digital Object Identifier
doi:10.1214/15-AOS1423

Mathematical Reviews number (MathSciNet)
MR3576543

#### Citation

Jin, Jiashun; Wang, Wanjie. Influential features PCA for high dimensional clustering. Ann. Statist. 44 (2016), no. 6, 2323–2359. doi:10.1214/15-AOS1423. https://projecteuclid.org/euclid.aos/1479891617.
