The Annals of Applied Statistics

Torus principal component analysis with applications to RNA structure

Benjamin Eltzner, Stephan Huckemann, and Kanti V. Mardia

Abstract

There are several cutting edge applications needing PCA methods for data on tori, and we propose a novel torus-PCA method that adaptively favors low-dimensional representations while preventing overfitting by a new test—both of which can be generally applied and address shortcomings in two previously proposed PCA methods. Unlike tangent space PCA, our torus-PCA features structure fidelity by honoring the cyclic topology of the data space and, unlike geodesic PCA, produces nonwinding, nondense descriptors. These features are achieved by deforming tori into spheres with self-gluing and then using a variant of the recently developed principal nested spheres analysis. This PCA analysis involves a step of subsphere fitting, and we provide a new test to avoid overfitting. We validate our torus-PCA by application to an RNA benchmark data set. Further, using a larger RNA data set, torus-PCA recovers previously found structure, now globally at the one-dimensional representation, which is not accessible via tangent space PCA.
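
To make the phrase "honoring the cyclic topology of the data space" concrete, here is a minimal Python sketch. It is not the authors' torus-PCA implementation, and the helper names `wrap`, `torus_distance`, and `circular_mean` are illustrative assumptions. It only shows why angle vectors such as dihedral angles should not be treated as ordinary Euclidean coordinates: two points on either side of the cut at ±180° are close on the torus but far apart in a naive Euclidean reading.

```python
import numpy as np

def wrap(angle):
    """Map angles (radians) to the interval [-pi, pi)."""
    return (np.asarray(angle) + np.pi) % (2 * np.pi) - np.pi

def torus_distance(x, y):
    """Distance on the flat torus: Euclidean norm of the wrapped differences."""
    return float(np.linalg.norm(wrap(np.asarray(x) - np.asarray(y))))

def circular_mean(angles, axis=0):
    """Per-coordinate extrinsic circular mean (direction of the mean resultant)."""
    angles = np.asarray(angles)
    return np.arctan2(np.sin(angles).mean(axis=axis),
                      np.cos(angles).mean(axis=axis))

# Two hypothetical points on a 2-torus, close to the cut at +/-180 degrees
# in the first angle and identical in the second.
a = np.radians([179.0, 10.0])
b = np.radians([-179.0, 10.0])

naive = float(np.linalg.norm(a - b))       # ignores periodicity: 358 degrees apart
cyclic = torus_distance(a, b)              # respects periodicity: 2 degrees apart

naive_mean = np.mean([a, b], axis=0)       # first angle averages to 0 degrees
cyclic_mean = circular_mean([a, b])        # first angle averages to +/-180 degrees

print(np.degrees(naive), np.degrees(cyclic))
print(np.degrees(naive_mean), np.degrees(cyclic_mean))
```

The naive mean of the first angle lands at 0°, on the opposite side of the circle from both data points, whereas the circular mean lies at the cut between them. Any linearization that ignores periodicity inherits this kind of distortion when the data spread around the circle.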

Article information

Source
Ann. Appl. Stat., Volume 12, Number 2 (2018), 1332-1359.

Dates
Received: March 2017
Revised: July 2017
First available in Project Euclid: 28 July 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1532743497

Digital Object Identifier
doi:10.1214/17-AOAS1115

Mathematical Reviews number (MathSciNet)
MR3834306

Keywords
Statistics on manifolds; tori deformation; directional statistics; dimension reduction; dihedral angles; fitting small spheres; principal nested spheres analysis

Citation

Eltzner, Benjamin; Huckemann, Stephan; Mardia, Kanti V. Torus principal component analysis with applications to RNA structure. Ann. Appl. Stat. 12 (2018), no. 2, 1332-1359. doi:10.1214/17-AOAS1115. https://projecteuclid.org/euclid.aoas/1532743497


Supplemental materials

  • Supplement A: Data. An illustration of how to choose data-driven parameters for torus PCA.
  • Supplement B: Data. RNA residue data used for the analysis in this paper.
  • Supplement C: Implementation. Source code of the T-PCA implementation used for this paper.