## The Annals of Applied Statistics

### $e$PCA: High dimensional exponential family PCA

#### Abstract

Many applications involve large datasets with entries from exponential family distributions. Our main motivating application is photon-limited imaging, where we observe images with Poisson distributed pixels. We focus on X-ray Free Electron Lasers (XFEL), a quickly developing technology whose goal is to reconstruct molecular structure. In XFEL, estimating the principal components of the noiseless distribution is needed for denoising and for structure determination. However, the standard method, Principal Component Analysis (PCA), can be inefficient in non-Gaussian noise.

Motivated by this application, we develop $e$PCA (exponential family PCA), a new methodology for PCA on exponential families. $e$PCA is a fast method that can be used very generally for dimension reduction and denoising of large data matrices with exponential family entries.
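To illustrate why standard PCA can be biased under Poisson noise, the following is a minimal sketch of one idea that is not spelled out in this abstract and is only assumed here: if pixels are conditionally independent Poisson counts with random intensities $X$, the law of total covariance gives $\operatorname{Cov}(Y) = \operatorname{Cov}(X) + \operatorname{diag}(\mathbb{E}[X])$, so subtracting the diagonal of sample means from the sample covariance removes the noise contribution. The function name `debiased_covariance_poisson` and the simulated rank-one model are hypothetical choices for illustration; this is not the authors' full $e$PCA procedure or their released implementation.

```python
import numpy as np


def debiased_covariance_poisson(Y):
    """Estimate the covariance of latent Poisson intensities.

    Y is an (n, p) array of counts, one observation per row.
    Illustrative assumption: for Poisson noise, Cov(Y) = Cov(X) + diag(E[X]),
    so the sample mean is subtracted from the diagonal of the sample covariance.
    """
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)  # ordinary sample covariance of the counts
    return S - np.diag(mean)


# Simulated example (hypothetical model): rank-one variation around a flat mean.
rng = np.random.default_rng(0)
n, p = 20000, 5
v = np.ones(p) / np.sqrt(p)                 # unit-norm principal direction
z = rng.uniform(-1.0, 1.0, size=n)          # latent score with variance 1/3
X = 10.0 + 6.0 * z[:, None] * v[None, :]    # strictly positive intensities
Y = rng.poisson(X)                          # photon counts

true_cov = 36.0 * (1.0 / 3.0) * np.outer(v, v)   # covariance of X
S = np.cov(Y, rowvar=False)
S_d = debiased_covariance_poisson(Y)

err_raw = np.linalg.norm(S - true_cov)       # dominated by the diagonal noise term
err_debiased = np.linalg.norm(S_d - true_cov)
print(f"raw error: {err_raw:.2f}, debiased error: {err_debiased:.2f}")
```

In this simulation the raw sample covariance carries an extra diagonal term of about 10 per pixel, while the debiased estimate differs from the true covariance only by sampling error. The exact numbers depend on the random seed.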

We conduct a substantive XFEL data analysis using $e$PCA. We show that $e$PCA estimates the PCs of the distribution of images more accurately than PCA and alternatives. Importantly, it also leads to better denoising. We also provide theoretical justification for our estimator, including the convergence rate and the Marchenko–Pastur law in high dimensions. An open-source implementation is available.

#### Article information

Source
Ann. Appl. Stat., Volume 12, Number 4 (2018), 2121–2150.

Dates
Revised: November 2017
First available in Project Euclid: 13 November 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1542078039

Digital Object Identifier
doi:10.1214/18-AOAS1146

Mathematical Reviews number (MathSciNet)
MR3875695

#### Citation

Liu, Lydia T.; Dobriban, Edgar; Singer, Amit. $e$PCA: High dimensional exponential family PCA. Ann. Appl. Stat. 12 (2018), no. 4, 2121–2150. doi:10.1214/18-AOAS1146. https://projecteuclid.org/euclid.aoas/1542078039
