## The Annals of Statistics

### Phase transitions for high dimensional clustering and related problems

#### Abstract

Consider a two-class clustering problem where we observe $X_{i}=\ell_{i}\mu+Z_{i}$, $Z_{i}\stackrel{\mathit{i.i.d.}}{\sim}N(0,I_{p})$, $1\leq i\leq n$. The feature vector $\mu\in\mathbb{R}^{p}$ is unknown but is presumably sparse. The class labels $\ell_{i}\in\{-1,1\}$ are also unknown and the main interest is to estimate them.
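In matrix form, the model can be restated as follows. The symbols $X$, $Z$, $x_{j}$, $z_{j}$ and $\mu(j)$ are notation assumed here for illustration and are not taken from the abstract. Stacking the observations as rows gives

$$
X=\ell\mu^{\top}+Z\in\mathbb{R}^{n\times p},\qquad x_{j}=\mu(j)\,\ell+z_{j},\quad 1\leq j\leq p,
$$

where $x_{j}$ and $z_{j}$ denote the $j$th columns of $X$ and $Z$, and $\mu(j)$ is the $j$th entry of $\mu$. Since $\ell_{i}^{2}=1$ and the noise entries have mean zero and unit variance,

$$
E\|x_{j}\|^{2}=\sum_{i=1}^{n}E\bigl(\ell_{i}\mu(j)+Z_{i}(j)\bigr)^{2}=n\bigl(1+\mu(j)^{2}\bigr).
$$

A useful feature, one with $\mu(j)\neq 0$, therefore raises the expected squared column norm above the null value $n$, which is what makes column norms a natural screening statistic.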

We are interested in the statistical limits. In the two-dimensional phase space calibrating the rarity and strengths of useful features, we find the precise demarcation for the Region of Impossibility and Region of Possibility. In the former, useful features are too rare/weak for successful clustering. In the latter, useful features are strong enough to allow successful clustering. The results are extended to the case of colored noise using Le Cam’s idea on comparison of experiments.

We also extend the study of statistical limits for clustering to signal recovery and to global testing. We compare the statistical limits for the three problems and expose some interesting insights.

We propose classical PCA and Important Features PCA (IF-PCA) for clustering. For a threshold $t>0$, IF-PCA clusters by applying classical PCA to all columns of $X$ with an $L^{2}$-norm larger than $t$. We also propose two aggregation methods. For any parameter in the Region of Possibility, some of these methods yield successful clustering.
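The IF-PCA recipe described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`simulate`, `select_columns`, `first_left_singular_vector`, `if_pca`, `clustering_error`) are hypothetical, the sample sizes and signal strength are arbitrary, the threshold is picked by hand rather than by any rule from the paper, and a plain power iteration stands in for a library SVD.

```python
import math
import random


def simulate(n, p, s, tau, rng):
    """Draw X_i = l_i * mu + Z_i with s nonzero features of strength tau (illustrative)."""
    labels = [rng.choice((-1, 1)) for _ in range(n)]
    mu = [tau if j < s else 0.0 for j in range(p)]
    X = [[labels[i] * mu[j] + rng.gauss(0.0, 1.0) for j in range(p)]
         for i in range(n)]
    return X, labels


def select_columns(X, t):
    """Return indices of columns whose L2-norm exceeds the threshold t."""
    n, p = len(X), len(X[0])
    return [j for j in range(p)
            if math.sqrt(sum(X[i][j] ** 2 for i in range(n))) > t]


def first_left_singular_vector(A, rng, iters=200):
    """Approximate the leading left singular vector by power iteration on A A'."""
    n, k = len(A), len(A[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(A[i][j] * v[i] for i in range(n)) for j in range(k)]  # A' v
        v = [sum(A[i][j] * w[j] for j in range(k)) for i in range(n)]  # A (A' v)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


def if_pca(X, t, rng):
    """Screen columns by L2-norm, then cluster by the signs of the leading vector."""
    keep = select_columns(X, t)
    if not keep:
        raise ValueError("threshold t removed every column")
    A = [[row[j] for j in keep] for row in X]
    xi = first_left_singular_vector(A, rng)
    return [1 if x >= 0 else -1 for x in xi], keep


def clustering_error(est, truth):
    """Misclassification rate, minimized over the global sign flip."""
    wrong = sum(a != b for a, b in zip(est, truth))
    return min(wrong, len(truth) - wrong) / len(truth)


rng = random.Random(0)
X, labels = simulate(n=40, p=200, s=10, tau=1.5, rng=rng)
# Null columns have squared norm near n = 40; t^2 = 80 sits between null and useful columns.
est, keep = if_pca(X, t=math.sqrt(80.0), rng=rng)
err = clustering_error(est, labels)
print(len(keep), err)
```

The error is computed up to a global sign flip because a singular vector is only determined up to sign, so the two clusters can be recovered but not their names.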

We discover a phase transition for IF-PCA. For any threshold $t>0$, let $\xi^{(t)}$ be the first left singular vector of the post-selection data matrix. The phase space partitions into two different regions. In one region, there is a $t$ such that $\cos(\xi^{(t)},\ell)\rightarrow 1$ and IF-PCA yields successful clustering. In the other, $\cos(\xi^{(t)},\ell)\leq c_{0}<1$ for all $t>0$.
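To make the role of $\cos(\xi^{(t)},\ell)$ concrete, the sketch below computes the cosine between a candidate vector and a label vector and compares it with the clustering error of the sign rule. It assumes the cosine is taken in absolute value, since a singular vector is defined only up to sign; that convention and the helper names `abs_cosine` and `sign_error_rate` are assumptions made here, not definitions from the paper.

```python
import math


def abs_cosine(u, v):
    """Absolute cosine of the angle between u and v (sign-invariant)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(dot) / (norm_u * norm_v)


def sign_error_rate(xi, labels):
    """Fraction of entries where sign(xi_i) disagrees with l_i, up to a global flip."""
    wrong = sum((1 if x >= 0 else -1) != l for x, l in zip(xi, labels))
    return min(wrong, len(labels) - wrong) / len(labels)


# Toy vectors chosen by hand for illustration only.
xi = [0.6, 0.5, -0.4, 0.3, -0.2]
labels = [1, 1, 1, 1, -1]

c = abs_cosine(xi, labels)
rate = sign_error_rate(xi, labels)
print(c, rate, 2 * (1 - c))
```

For labels in $\{-1,1\}^{n}$ and a unit vector $\xi$ signed so that the cosine is nonnegative, each sign disagreement forces $|\sqrt{n}\,\xi_{i}-\ell_{i}|\geq 1$, while $\|\sqrt{n}\,\xi-\ell\|^{2}=2n(1-\cos(\xi,\ell))$. The error rate of the sign rule is therefore at most $2(1-\cos(\xi,\ell))$, which is why a cosine tending to 1 implies successful clustering. This elementary bound is offered as a supporting step and is not quoted from the paper.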

Our results require delicate analysis, especially on post-selection random matrix theory and on lower bound arguments.

#### Article information

Source
Ann. Statist., Volume 45, Number 5 (2017), 2151-2189.

Dates
Revised: June 2016
First available in Project Euclid: 31 October 2017

Permanent link to this document
https://projecteuclid.org/euclid.aos/1509436831

Digital Object Identifier
doi:10.1214/16-AOS1522

Mathematical Reviews number (MathSciNet)
MR3718165

Zentralblatt MATH identifier
06821122

#### Citation

Jin, Jiashun; Ke, Zheng Tracy; Wang, Wanjie. Phase transitions for high dimensional clustering and related problems. Ann. Statist. 45 (2017), no. 5, 2151–2189. doi:10.1214/16-AOS1522. https://projecteuclid.org/euclid.aos/1509436831

#### References

• [1] Addario-Berry, L., Broutin, N., Devroye, L. and Lugosi, G. (2010). On combinatorial testing problems. Ann. Statist. 38 3063–3092.
• [2] Aldous, D. J. (1985). Exchangeability and related topics. In École D’été de Probabilités de Saint-Flour, XIII—1983. Lecture Notes in Math. 1117 1–198. Springer, Berlin.
• [3] Amini, A. and Wainwright, M. J. (2008). High-dimensional analysis of semidefinite relaxations for sparse principal components. In IEEE International Symposium on Information Theory 2454–2458. IEEE, New York.
• [4] Amini, A. A. and Wainwright, M. J. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 37 2877–2921.
• [5] Arias-Castro, E. and Verzelen, N. (2014). Detection and feature selection in sparse mixture models. arXiv:1405.1478.
• [6] Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms 1027–1035. ACM, New York.
• [7] Azizyan, M., Singh, A. and Wasserman, L. (2013). Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. In NIPS 2139–2147.
• [8] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. Stat. Methodol. 57 289–300.
• [9] Berthet, Q. and Rigollet, P. (2013). Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory 1046–1066.
• [10] Cai, T., Ma, Z. and Wu, Y. (2013). Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields 1–35.
• [11] Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Found. Comput. Math. 9 717–772.
• [12] Chan, Y. and Hall, P. (2010). Using evidence of mixed populations to select variables for clustering very high-dimensional data. J. Amer. Statist. Assoc. 105.
• [13] d’Aspremont, A., El Ghaoui, L., Jordan, M. I. and Lanckriet, G. R. G. (2007). A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49 434–448.
• [14] Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 7 1–46.
• [15] Dettling, M. (2004). BagBoosting for tumor classification with gene expression data. Bioinformatics 20 3583–3593.
• [16] Donoho, D. (2015). 50 years of data science. Manuscript.
• [17] Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32 962–994.
• [18] Donoho, D. and Jin, J. (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 105 14790–14795.
• [19] Donoho, D. and Jin, J. (2015). Higher criticism for large-scale inference: Especially for rare and weak effects. Statist. Sci. 30 1–25.
• [20] Donoho, D. L. and Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. Ann. Statist. 26 879–921.
• [21] Donoho, D. L., Maleki, A., Rahman, I. U., Shahram, M. and Stodden, V. (2009). Reproducible research in computational harmonic analysis. Comput. Sci. Eng. 11 8–18.
• [22] Hall, P. and Jin, J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. Ann. Statist. 38 1686–1732.
• [23] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer, Berlin.
• [24] Ingster, Y. I., Pouet, C. and Tsybakov, A. B. (2009). Classification of sparse high-dimensional vectors. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 4427–4448.
• [25] Jin, J. and Ke, Z. T. (2016). Rare and weak effects in large-scale inference: Methods and phase diagrams. Statist. Sinica 26 1–34.
• [26] Jin, J., Ke, Z. T. and Wang, W. (2014). Optimal spectral clustering by higher criticism thresholding. Manuscript.
• [27] Jin, J., Ke, Z. T. and Wang, W. (2017). Supplementary material for “Phase transitions for high dimensional clustering and related problems.” DOI:10.1214/16-AOS1522SUPP.
• [28] Jin, J. and Wang, W. (2016). Influential features PCA for high dimensional clustering. Ann. Statist. 44 2323–2359.
• [29] Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Statist. Assoc. 104 682–693.
• [30] Ke, Z. T., Jin, J. and Fan, J. (2014). Covariate assisted screening and estimation. Ann. Statist. 42 2202–2242.
• [31] Lee, A. B., Luca, D. and Roeder, K. (2010). A spectral graph approach to discovering genetic ancestry. Ann. Appl. Stat. 4 179–202.
• [32] Lei, J. and Vu, V. Q. (2015). Sparsistency and agnostic inference in sparse PCA. Ann. Statist. 43 299–322.
• [33] Le Cam, L. and Yang, G. L. (2000). Asymptotics in Statistics: Some Basic Concepts, 2nd ed. Springer, New York.
• [34] Ma, Z. and Wu, Y. (2015). Computational barriers in minimax submatrix detection. Ann. Statist. 43 1089–1116.
• [35] Pan, W. and Shen, X. (2007). Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 8 1145–1164.
• [36] Raftery, A. E. and Dean, N. (2006). Variable selection for model-based clustering. J. Amer. Statist. Assoc. 101 168–178.
• [37] Rogers, C. A. (1963). Covering a sphere with spheres. Mathematika 10 157–164.
• [38] Shorack, G. and Wellner, J. (1986). Empirical Processes with Applications to Statistics. John Wiley & Sons, New York.
• [39] Spiegelhalter, D. J. (2014). Statistics. The future lies in uncertainty. Science 345 264–265.
• [40] Sun, W., Wang, J., Fang, Y. et al. (2012). Regularized $k$-means clustering of high-dimensional data and its asymptotic consistency. Electron. J. Stat. 6 148–167.
• [41] Van der Vaart, A. (2000). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics 3. Cambridge Univ. Press, Cambridge.
• [42] Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing 210–268. Cambridge Univ. Press, Cambridge.
• [43] Vu, V. Q. and Lei, J. (2013). Minimax sparse principal subspace estimation in high dimensions. Ann. Statist. 41 2905–2947.
• [44] Wang, Z., Lu, H. and Liu, H. (2014). Nonconvex statistical optimization: Minimax-optimal sparse PCA in polynomial time. arXiv:1408.5352.
• [45] Weyl, H. (1912). Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Math. Ann. 71 441–479.
• [46] Witten, D. M. and Tibshirani, R. (2010). A framework for feature selection in clustering. J. Amer. Statist. Assoc. 105 713–726.
• [47] Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265–286.

#### Supplemental materials

• Supplementary Material for “Phase transitions for high dimensional clustering and related problems”. Owing to space constraints, some technical proofs and discussion are relegated to a supplementary document [27]. It contains proofs of Lemmas 2.1–2.4 and 3.1–3.3, and discusses an extension of the ARW model.