The Annals of Applied Statistics

Local kernel canonical correlation analysis with application to virtual drug screening

Daniel Samarov, J. S. Marron, Yufeng Liu, Christopher Grulke, and Alexander Tropsha

Full-text: Open access

Abstract

Drug discovery is the process of identifying compounds that have potentially meaningful biological activity. A major challenge is that the number of compounds to search over can be quite large, sometimes numbering in the millions, making experimental testing intractable. For this reason, computational methods are employed to filter out those compounds that do not exhibit strong biological activity. This filtering step, also called virtual screening, reduces the search space so that the remaining compounds can be tested experimentally.

In this paper we propose several novel approaches to the problem of virtual screening based on Canonical Correlation Analysis (CCA) and on a kernel-based extension. Spectral learning ideas motivate our proposed new method, called Indefinite Kernel CCA (IKCCA). We show the strong performance of this approach both on a toy problem and on real-world data, with dramatic improvements in the predictive accuracy of virtual screening over an existing methodology.
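For readers unfamiliar with the starting point, the sketch below illustrates classical linear CCA (Hotelling, 1936), which finds paired projections of two data sets that are maximally correlated. It is not the IKCCA method proposed in the paper; the function name `linear_cca`, the small ridge term `reg`, and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np

def linear_cca(X, Y, reg=1e-8):
    """Classical linear CCA via whitening and SVD (illustrative sketch only).

    X: (n, p) array, Y: (n, q) array with matched rows.
    Returns projection matrices A, B and the canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Sample covariance blocks; `reg` is a tiny ridge term (an assumption)
    # that keeps the within-set covariances invertible.
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)

    # Singular values of the whitened cross-covariance are the
    # canonical correlations, returned in descending order.
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy, full_matrices=False)
    A = Wx @ U      # canonical directions for X
    B = Wy @ Vt.T   # canonical directions for Y
    return A, B, s

# Synthetic example: one shared latent signal z hidden in both views.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 2))])
Y = np.hstack([rng.normal(size=(500, 1)),
               -z + 0.1 * rng.normal(size=(500, 1))])

A, B, corrs = linear_cca(X, Y)
```

Here the first canonical correlation should be close to one, because both views contain a noisy copy of the same latent variable, while the second should be small. Kernel variants replace the covariance blocks with quantities built from kernel matrices, which is where the question of indefinite kernels arises.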

Article information

Source
Ann. Appl. Stat., Volume 5, Number 3 (2011), 2169-2196.

Dates
First available in Project Euclid: 13 October 2011

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1318514300

Digital Object Identifier
doi:10.1214/11-AOAS472

Mathematical Reviews number (MathSciNet)
MR2884936

Zentralblatt MATH identifier
1228.62072

Keywords
Kernel methods; canonical correlation analysis; indefinite kernels; drug discovery; virtual screening

Citation

Samarov, Daniel; Marron, J. S.; Liu, Yufeng; Grulke, Christopher; Tropsha, Alexander. Local kernel canonical correlation analysis with application to virtual drug screening. Ann. Appl. Stat. 5 (2011), no. 3, 2169–2196. doi:10.1214/11-AOAS472. https://projecteuclid.org/euclid.aoas/1318514300



References

  • Abraham, D. J. (2003). Burger’s Medicinal Chemistry and Drug Discovery, 6th ed. Drug Discovery 1. Wiley, New York.
  • Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley-Interscience, Hoboken, NJ.
  • Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. J. Mach. Learn. Res. 3 1–48.
  • Chen, J. and Ye, J. (2008). Training SVM with indefinite kernels. In ICML’08: Proceedings of the 25th International Conference on Machine Learning 136–143. ACM Press, New York.
  • Daylight (2004). World drug index. Available at www.daylight.com.
  • Haasdonk, B. (2005). Feature space interpretation of SVMs with indefinite kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 482–492.
  • Hardoon, D. and Shawe-Taylor, J. (2008). Sparse canonical correlation analysis. Technical report, PASCAL EPrints [http://eprints.pascal-network.org/perl/oai2] (United Kingdom). Available at http://eprints.pascal-network.org/archive/00004673/.
  • Hardoon, D. R., Szedmak, S. and Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Comput. 16 2639–2664.
  • Hotelling, H. (1936). Relations between two sets of variates. Biometrika 28 321–377.
  • Kitchen, D. B., Decornez, H., Furr, J. R. and Bajorath, J. (2004). Docking and scoring in virtual screening for drug discovery: Methods and application. Nature Reviews Drug Discovery 3 935–949.
  • Kuss, M. and Graepel, T. (2002). The geometry of kernel canonical correlation analysis. Technical report, Max Planck Institute for Biological Cybernetics, Tübingen, Germany.
  • Lai, P. L. and Fyfe, C. (2000). Kernel and nonlinear canonical correlation analysis. Int. J. Neural. Syst. 10 365–377.
  • Luss, R. and d’Aspremont, A. (2008). Support vector machine classification with indefinite kernels. CoRR abs/0804.0188.
  • Oloff, S., Zhang, S., Sukumar, N., Breneman, C. and Tropsha, A. (2006). Chemometric analysis of ligand receptor complementarity: Identifying complementary ligands based on receptor information (CoLiBRI). J. Chem. Inf. Model. 46 844–851.
  • Ong, C. S., Canu, S. and Smola, A. J. (2004a). Learning with non-positive kernels. In Proc. of the 21st International Conference on Machine Learning (ICML) 639–646.
  • Ong, C. S., Canu, S. and Smola, A. J. (2004b). Learning with non-positive kernels. In Proc. of the 21st International Conference on Machine Learning (ICML) 639–646.
  • Parkhomenko, E. (2008). Sparse canonical correlation analysis. Ph.D. thesis, Dept. Biostatistics, Univ. Toronto.
  • Samarov, D., Marron, J. S., Liu, Y., Grulke, C. and Tropsha, A. (2011). Supplement to “Local kernel canonical correlation analysis with application to virtual drug screening.” DOI:10.1214/11-AOAS472SUPP.
  • Saul, L. K. and Roweis, S. T. (2003). Think globally, fit locally: Unsupervised learning of low dimensional manifolds. J. Mach. Learn. Res. 4 119–155.
  • Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press, Cambridge, MA.
  • Todeschini, R. and Consonni, V. (2009a). Molecular Descriptors for Chemoinformatics: Volume I: Alphabetical Listing 1. Wiley, New York.
  • Todeschini, R. and Consonni, V. (2009b). Molecular Descriptors for Chemoinformatics: Volume II: Appendices, References (Methods and Principles in Medicinal Chemistry) 1. Wiley, New York.
  • Tropsha, A. (2003). Recent Trends in Quantitative Structure-Activity Relationships, 6th ed. Wiley, New York.
  • Vinod, H. (1976). Canonical ridge and econometrics of joint production. J. Econometrics 4 147–166.
  • von Luxburg, U. (2007). A tutorial on spectral clustering. Stat. Comput. 17 395–416.
  • Wang, R., Fang, X., Lu, Y. and Wang, S. (2004). The PDBbind database: Collection of binding affinities for protein–ligand complexes with known three-dimensional structures. J. Med. Chem. 47 2977–2980.
  • Warren, G. L., Andrews, C. W., Capelli, A.-M., Clarke, B., LaLonde, J., Lambert, M. H., Lindvall, M., Nevins, N., Semus, S. F., Senger, S., Tedesco, G., Wall, I. D., Woolven, J. M., Peishoff, C. E. and Head, M. S. (2006). A critical assessment of docking programs and scoring functions. J. Med. Chem. 49 5912–5931.
  • Witten, D. M. and Tibshirani, R. J. (2009). Extensions of sparse canonical correlation analysis with applications to genomic data. Stat. Appl. Genet. Mol. Biol. 8 Art. 28, 29.
  • Witten, D. M., Tibshirani, R. and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10 515–534.

Supplemental materials

  • Supplementary material: Local kernel canonical correlation analysis with application to virtual drug screening.