The Annals of Applied Statistics

Ranking relations using analogies in biological and information networks

Ricardo Silva, Katherine Heller, Zoubin Ghahramani, and Edoardo M. Airoldi

Abstract

Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects $S = \{A^{(1)}:B^{(1)}, A^{(2)}:B^{(2)}, \ldots, A^{(N)}:B^{(N)}\}$, measures how well other pairs A : B fit in with the set S. Our work addresses the following question: is the relation between objects A and B analogous to those relations found in S? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application to discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if only a small set of protein pairs is provided.
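
As a rough illustration of the ranking idea described in the abstract, the following Python sketch scores candidate pairs against a query set S. It is not the authors' implementation (their supplement is written in Java and uses a variational approximation). It assumes, purely for illustration, a logistic link model on hypothetical pair features, and it replaces Bayesian averaging with MAP point estimates. The function names, the elementwise-product pair features, the zero prior weights, and the score form (log-probability of a link under query-informed weights minus the same quantity under prior weights) are all assumptions and do not come from the abstract.

```python
import numpy as np

def pair_features(x_a, x_b):
    """Hypothetical pair representation: a bias term plus the elementwise product."""
    return np.concatenate(([1.0], np.asarray(x_a, float) * np.asarray(x_b, float)))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map_logistic(X, y, prior_mean, prior_precision=1.0, steps=500, lr=0.1):
    """MAP weights of a logistic link model with a Gaussian prior centred at prior_mean.

    Simplification (assumption): a point estimate stands in for a full posterior.
    """
    prior_mean = np.asarray(prior_mean, float)
    w = prior_mean.copy()
    for _ in range(steps):
        # Gradient of the log-likelihood plus the Gaussian log-prior.
        grad = X.T @ (y - sigmoid(X @ w)) - prior_precision * (w - prior_mean)
        w += lr * grad / len(y)
    return w

def analogy_score(x_pair, w_query, w_prior):
    """How much more plausible a link becomes once the query set is taken into account."""
    return np.log(sigmoid(x_pair @ w_query)) - np.log(sigmoid(x_pair @ w_prior))

# Toy query set S: three linked pairs whose objects agree on the first feature.
S = [([1.0, 0.0], [1.0, 0.0])] * 3
X_S = np.array([pair_features(a, b) for a, b in S])
y_S = np.ones(len(S))  # every pair in S is a known link

w_prior = np.zeros(X_S.shape[1])  # assumption: uninformative prior weights
w_query = fit_map_logistic(X_S, y_S, w_prior)

# Two candidate pairs: one resembling the pairs in S, one that does not.
candidates = {
    "same": pair_features([1.0, 0.0], [1.0, 0.0]),
    "other": pair_features([1.0, 0.0], [-1.0, 0.0]),
}
scores = {name: analogy_score(x, w_query, w_prior) for name, x in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
score_same, score_other = scores["same"], scores["other"]
```

In this toy setting the candidate that resembles the query pairs is ranked first. With real data, the prior weights would more plausibly be estimated from the observed link matrix than set to zero.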

Article information

Source
Ann. Appl. Stat. Volume 4, Number 2 (2010), 615–644.

Dates
First available in Project Euclid: 3 August 2010

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1280842133

Digital Object Identifier
doi:10.1214/09-AOAS321

Mathematical Reviews number (MathSciNet)
MR2758642

Zentralblatt MATH identifier
1194.62022

Keywords
Network analysis; Bayesian inference; variational approximation; ranking; information retrieval; data integration; Saccharomyces cerevisiae

Citation

Silva, Ricardo; Heller, Katherine; Ghahramani, Zoubin; Airoldi, Edoardo M. Ranking relations using analogies in biological and information networks. Ann. Appl. Stat. 4 (2010), no. 2, 615–644. doi:10.1214/09-AOAS321. https://projecteuclid.org/euclid.aoas/1280842133

Supplemental materials

  • Supplementary material: Java implementation of the Relational Bayesian Sets method. We provide complete source code for our method, and instructions on how to rebuild our experiments. With the code it is also possible to test variations of our queries, analyzing the sensitivity of the results to different query sizes and initialization of the variational optimizer.