The Annals of Applied Statistics

Bayesian matching of unlabeled marked point sets using random fields, with an application to molecular alignment

Irina Czogiel, Ian L. Dryden, and Christopher J. Brignell

Full-text: Open access

Abstract

Statistical methodology is proposed for comparing unlabeled marked point sets, with an application to aligning steroid molecules in chemoinformatics. Methods from statistical shape analysis are combined with techniques for predicting random fields in spatial statistics in order to define a suitable measure of similarity between two marked point sets. Bayesian modeling of the predicted field overlap between pairs of point sets is proposed, and posterior inference of the alignment is carried out using Markov chain Monte Carlo simulation. By representing the fields in reproducing kernel Hilbert spaces, the degree of overlap can be computed without expensive numerical integration. Superimposing entire fields rather than the configuration matrices of point coordinates thereby avoids the problem that there is usually no clear one-to-one correspondence between the points. In addition, mask parameters are introduced in the model, so that partial matching of the marked point sets can be carried out. We also propose an adaptation of the generalized Procrustes analysis algorithm for the simultaneous alignment of multiple point sets. The methodology is illustrated with a simulation study and then applied to a data set of 31 steroid molecules, where the relationship between shape and binding activity to the corticosteroid binding globulin receptor is explored.
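The claim that the overlap needs no numerical integration can be illustrated with a minimal sketch. It assumes each field is a finite kernel expansion, f_A = sum_i a_i k(x_i, .), and that k is an isotropic Gaussian kernel; both are assumptions made for illustration, not details taken from the paper. The function names, the bandwidth parameter, and the toy coordinates and weights are all hypothetical, and the weights stand in for whatever coefficients a field predictor would assign to each point. Under these assumptions, the reproducing property reduces the inner product of two fields to a finite double sum over kernel evaluations, and a normalized, Carbo-style index follows directly.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Isotropic Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def rkhs_inner(points_a, weights_a, points_b, weights_b, bandwidth=1.0):
    """Inner product <f_A, f_B> of two finite kernel expansions.

    By the reproducing property this equals the double sum
    sum_i sum_j a_i * b_j * k(x_i, y_j), so no integration grid is needed.
    """
    return sum(
        wa * wb * gaussian_kernel(xa, xb, bandwidth)
        for xa, wa in zip(points_a, weights_a)
        for xb, wb in zip(points_b, weights_b)
    )


def overlap_similarity(points_a, weights_a, points_b, weights_b, bandwidth=1.0):
    """Normalized overlap: cosine of the angle between the two fields in the RKHS."""
    ab = rkhs_inner(points_a, weights_a, points_b, weights_b, bandwidth)
    aa = rkhs_inner(points_a, weights_a, points_a, weights_a, bandwidth)
    bb = rkhs_inner(points_b, weights_b, points_b, weights_b, bandwidth)
    return ab / math.sqrt(aa * bb)


# Hypothetical toy data: two marked point sets with different numbers of
# points, so no one-to-one correspondence between points is available.
set_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
weights_a = [1.0, 0.5, -0.3]
set_b = [(0.1, 0.0, 0.0), (1.0, 0.1, 0.0)]
weights_b = [0.9, 0.6]

self_similarity = overlap_similarity(set_a, weights_a, set_a, weights_a)
cross_similarity = overlap_similarity(set_a, weights_a, set_b, weights_b)
print(self_similarity, cross_similarity)
```

Because the index depends only on pairwise kernel evaluations, it is defined for point sets of unequal size. An alignment procedure would re-evaluate it after each candidate rotation and translation of one set.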

Article information

Source
Ann. Appl. Stat., Volume 5, Number 4 (2011), 2603–2629.

Dates
First available in Project Euclid: 20 December 2011

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1324399608

Digital Object Identifier
doi:10.1214/11-AOAS486

Mathematical Reviews number (MathSciNet)
MR2907128

Zentralblatt MATH identifier
1234.62141

Keywords
Bioinformatics; chemoinformatics; kriging; Markov chain Monte Carlo; reproducing kernel Hilbert space; Procrustes; shape; size; spatial; steroids

Citation

Czogiel, Irina; Dryden, Ian L.; Brignell, Christopher J. Bayesian matching of unlabeled marked point sets using random fields, with an application to molecular alignment. Ann. Appl. Stat. 5 (2011), no. 4, 2603–2629. doi:10.1214/11-AOAS486. https://projecteuclid.org/euclid.aoas/1324399608



References

  • Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337–404.
  • Carbo, R., Leyda, L. and Arnau, M. (1980). An electron density measure of the similarity between two compounds. International Journal of Quantum Chemistry 17 1185–1189.
  • Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley, New York.
  • Czogiel, I., Dryden, I. L. and Brignell, C. J. (2011a). Supplement to “Bayesian matching of unlabeled marked point sets using random fields, with an application to molecular alignment.” DOI:10.1214/11-AOAS486SUPPA.
  • Czogiel, I., Dryden, I. L. and Brignell, C. J. (2011b). Supplement to “Bayesian matching of unlabeled marked point sets using random fields, with an application to molecular alignment.” DOI:10.1214/11-AOAS486SUPPB.
  • Dryden, I. L., Hirst, J. D. and Melville, J. L. (2007). Statistical analysis of unlabeled point sets: Comparing molecules in chemoinformatics. Biometrics 63 237–251.
  • Dryden, I. L. and Mardia, K. V. (1998). Statistical Shape Analysis. Wiley, Chichester.
  • Good, A. C., So, S. S. and Richards, W. G. (1993). Structure-activity relationships from molecular similarity matrices. J. Med. Chem. 36 433–438.
  • Green, P. J. and Mardia, K. V. (2006). Bayesian alignment using hierarchical models, with applications in protein bioinformatics. Biometrika 93 235–254.
  • Handcock, M. S. and Wallis, J. R. (1994). An approach to statistical spatial-temporal modeling of meteorological fields. J. Amer. Statist. Assoc. 89 368–390.
  • Kearsley, S. K. and Smith, G. M. (1990). An alternative method for the alignment of molecular structures: Maximizing electrostatic and steric overlaps. Tetrahedron Computer Methodology 3 615–633.
  • Kenobi, K. and Dryden, I. L. (2010). Bayesian matching of unlabelled point sets using Procrustes and configuration models. Technical report, Univ. Nottingham. Available at arXiv:1009.3072v1.
  • Kirkpatrick, S., Gelatt, C. D. Jr. and Vecchi, M. P. (1983). Optimization by simulated annealing. Science 220 671–680.
  • R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Richards, W. G. (1993). Computers in drug design. Pure and Applied Chemistry 65 231–234.
  • Ruffieux, Y. and Green, P. J. (2009). Alignment of multiple configurations using hierarchical models. J. Comput. Graph. Statist. 18 756–773.
  • Schmidler, S. C. (2007). Fast Bayesian shape matching using geometric algorithms. In Bayesian Statistics 8 471–490. Oxford Univ. Press, Oxford.
  • Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York.
  • Taylor, J. E. and Worsley, K. J. (2008). Random fields of multivariate test statistics, with applications to shape analysis. Ann. Statist. 36 1–27.
  • Wackernagel, H. (2003). Multivariate Geostatistics, 3rd ed. Springer, Berlin.
  • Ward, J. H. Jr. (1963). Hierarchical grouping to optimize an objective function. J. Amer. Statist. Assoc. 58 236–244.
  • Worsley, K. J. (1994). Local maxima and the expected Euler characteristic of excursion sets of χ², F and t fields. Adv. in Appl. Probab. 26 13–42.

Supplemental materials

  • Supplementary material A: R programs for Bayesian molecule alignment. The zip file contains R programs for molecular alignment using random fields. The main R program is fields8.r, which carries out a Bayesian MCMC procedure. The programs were written by Irina Czogiel, with some later edits by Ian Dryden. The program offers two options: a simulation study (as in Section 4.4 of the paper) or a comparison of two molecules using steric information (as in Section 5).
  • Supplementary material B: Steroids data. The zip file contains the data set of steroids first analyzed by Dryden, Hirst and Melville (2007). The data set of (x, y, z) atom co-ordinates and partial charges was constructed by Jonathan Hirst and James Melville (School of Chemistry, University of Nottingham).