The Annals of Applied Statistics

Hierarchical relational models for document networks

Jonathan Chang and David M. Blei

Abstract

We develop the relational topic model (RTM), a hierarchical model of both network structure and node attributes. We focus on document networks, where the attributes of each document are its words, that is, discrete observations taken from a fixed vocabulary. For each pair of documents, the RTM models their link as a binary random variable that is conditioned on their contents. The model can be used to summarize a network of documents, predict links between them, and predict words within them. We derive efficient inference and estimation algorithms based on variational methods that take advantage of sparsity and scale with the number of links. We evaluate the predictive performance of the RTM for large networks of scientific abstracts, web documents, and geographically tagged news.
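The abstract says that a link is a binary random variable conditioned on the contents of the two documents, but it does not give the form of that dependence. As a minimal sketch only, the example below assumes each document is summarized by a vector of topic proportions and that the link probability is a logistic function of the element-wise product of the two vectors. The names `link_probability`, `eta`, and `nu`, and all numeric values, are hypothetical and chosen for illustration.

```python
import numpy as np


def link_probability(topics_a, topics_b, eta, nu):
    """Probability of a link between two documents (illustrative assumption).

    topics_a, topics_b: topic-proportion vectors, one per document.
    eta: per-topic weights (hypothetical parameter).
    nu: intercept controlling the baseline link rate (hypothetical parameter).
    """
    # The element-wise product is large only for topics that both documents use.
    shared = topics_a * topics_b
    # Logistic link function: an assumed choice, not stated in the abstract.
    return 1.0 / (1.0 + np.exp(-(eta @ shared + nu)))


# Made-up topic proportions for three documents over three topics.
doc_1 = np.array([0.7, 0.2, 0.1])
doc_2 = np.array([0.6, 0.3, 0.1])  # similar topic mix to doc_1
doc_3 = np.array([0.1, 0.1, 0.8])  # different topic mix from doc_1

eta = np.array([3.0, 3.0, 3.0])
nu = -1.0

p_similar = link_probability(doc_1, doc_2, eta, nu)
p_dissimilar = link_probability(doc_1, doc_3, eta, nu)
print(p_similar > p_dissimilar)
```

Under these assumptions, documents that concentrate on the same topics receive a higher link probability than documents with different topic mixes, which is the sense in which a link is conditioned on content.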

Article information

Source
Ann. Appl. Stat., Volume 4, Number 1 (2010), 124-150.

Dates
First available in Project Euclid: 11 May 2010

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1273584450

Digital Object Identifier
doi:10.1214/09-AOAS309

Mathematical Reviews number (MathSciNet)
MR2758167

Zentralblatt MATH identifier
1189.62191

Keywords
Mixed-membership models, variational methods, text analysis, network models

Citation

Chang, Jonathan; Blei, David M. Hierarchical relational models for document networks. Ann. Appl. Stat. 4 (2010), no. 1, 124-150. doi:10.1214/09-AOAS309. https://projecteuclid.org/euclid.aoas/1273584450

