August 2022
An ℓ_p theory of PCA and spectral clustering
Emmanuel Abbe, Jianqing Fan, Kaizheng Wang
Ann. Statist. 50(4): 2359-2385 (August 2022). DOI: 10.1214/22-AOS2196

Abstract

Principal Component Analysis (PCA) is a powerful tool in statistics and machine learning. While existing study of PCA focuses on the recovery of principal components and their associated eigenvalues, there are few precise characterizations of individual principal component scores that yield low-dimensional embedding of samples. That hinders the analysis of various spectral methods. In this paper, we first develop an ℓ_p perturbation theory for a hollowed version of PCA in Hilbert spaces which provably improves upon the vanilla PCA in the presence of heteroscedastic noises. Through a novel ℓ_p analysis of eigenvectors, we investigate entrywise behaviors of principal component score vectors and show that they can be approximated by linear functionals of the Gram matrix in ℓ_p norm, which includes ℓ_2 and ℓ_∞ as special cases. For sub-Gaussian mixture models, the choice of p giving optimal bounds depends on the signal-to-noise ratio, which further yields optimality guarantees for spectral clustering. For contextual community detection, the ℓ_p theory leads to simple spectral algorithms that achieve the information threshold for exact recovery and the optimal misclassification rate.

Acknowledgments

E. Abbe was supported by the NSF CAREER Award CCF-1552131.

J. Fan was supported by ONR Grant N00014-19-1-2120 and NSF Grants DMS-2052926, DMS-1712591, and DMS-2053832.

K. Wang was supported by a startup fund from Columbia University and NIH Grant 2R01-GM072611-15 when he was a student at Princeton University.

Citation


Emmanuel Abbe. Jianqing Fan. Kaizheng Wang. "An ℓ_p theory of PCA and spectral clustering." Ann. Statist. 50 (4) 2359 - 2385, August 2022. https://doi.org/10.1214/22-AOS2196

Information

Received: 1 July 2021; Revised: 1 April 2022; Published: August 2022
First available in Project Euclid: 25 August 2022

MathSciNet: MR4474494
zbMATH: 07610774
Digital Object Identifier: 10.1214/22-AOS2196

Subjects:
Primary: 62H25
Secondary: 60B20, 62H30

Keywords: community detection, contextual network models, eigenvector perturbation, mixture models, phase transitions, principal component analysis, spectral clustering

Rights: Copyright © 2022 Institute of Mathematical Statistics

JOURNAL ARTICLE
27 PAGES

