The Annals of Applied Statistics

Overlapping stochastic block models with application to the French political blogosphere

Pierre Latouche, Etienne Birmelé, and Christophe Ambroise

Full-text: Open access

Abstract

Complex systems in nature and in society are often represented as networks, describing the rich set of interactions between objects of interest. Many deterministic and probabilistic clustering methods have been developed to analyze such structures. Given a network, almost all of them partition the vertices into disjoint clusters, according to their connection profile. However, recent studies have shown that these techniques were too restrictive and that most of the existing networks contained overlapping clusters. To tackle this issue, we present in this paper the Overlapping Stochastic Block Model. Our approach allows the vertices to belong to multiple clusters, and, to some extent, generalizes the well-known Stochastic Block Model [Nowicki and Snijders (2001)]. We show that the model is generically identifiable within classes of equivalence and we propose an approximate inference procedure, based on global and local variational techniques. Using toy data sets as well as the French Political Blogosphere network and the transcriptional network of Saccharomyces cerevisiae, we compare our work with other approaches.

Article information

Source
Ann. Appl. Stat. Volume 5, Number 1 (2011), 309-336.

Dates
First available: 21 March 2011

Permanent link to this document
http://projecteuclid.org/euclid.aoas/1300715192

Digital Object Identifier
doi:10.1214/10-AOAS382

Zentralblatt MATH identifier
05906808

Mathematical Reviews number (MathSciNet)
MR2810399

Citation

Latouche, Pierre; Birmelé, Etienne; Ambroise, Christophe. Overlapping stochastic block models with application to the French political blogosphere. The Annals of Applied Statistics 5 (2011), no. 1, 309--336. doi:10.1214/10-AOAS382. http://projecteuclid.org/euclid.aoas/1300715192.


Export citation

References

  • Airoldi, E., Blei, D., Xing, E. and Fienberg, S. (2006). Mixed membership stochastic block models for relational data with application to protein–protein interactions. In Proceedings of the International Biometrics Society Annual Meeting. Montréal, Québec, Canada.
  • Airoldi, E., Blei, D., Fienberg, S. and Xing, E. (2007). Mixed membership analysis of high-throughput interaction studies: Relational data. Available at ArXiv e-prints.
  • Airoldi, E. M., Blei, D. M., Fienberg, S. E. and Xing, E. P. (2008). Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9 1981–2014.
  • Allman, E. S., Matias, C. and Rhodes, J. A. (2009). Identifiability of parameters in latent structure models with many observed variables. Ann. Statist. 37 3099–3132.
  • Bickel, P. and Chen, A. (2009). A non parametric view of network models and Newman–Girvan and other modularities. Proc. Natl. Acad. Sci. USA 106 21068–21073.
  • Blei, D., Ng, A. and Jordan, M. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res. 3 993–1022.
  • Boer, P., Huisman, M., Snijders, T., Steglich, C., Wichers, L. and Zeggelink, E. (2006). StOCNET: An open software system for the advanced statistical analysis of social networks, Version 1.7.
  • Broyden, C., Fletcher, R., Goldfarb, D. and Shanno, D. F. (1970). BFGS method. J. Inst. Math. Appl. 6 76–90.
  • Byrd, R. H., Lu, P., Nocedal, J. and Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16 1190–1208.
  • Chang, J. (2010). The lda package Version, 1.2.
  • Daudin, J.-J., Picard, F. and Robin, S. (2008). A mixture model for random graphs. Statist. Comput. 18 173–183.
  • Estrada, E. and Rodríguez-Velázquez, J. A. (2005). Spectral measures of bipartivity in complex networks. Phys. Rev. E (3) 72 046105.
  • Everett, M. and Borgatti, S. (1998). Analyzing clique overlap. Connections 21 49–61.
  • Fienberg, S. and Wasserman, S. (1981). Categorical data analysis of single sociometric relations. Soc. Methodol. 12 156–192.
  • Frank, O. and Harary, F. (1982). Cluster inference by using transitivity indices in empirical graphs. J. Amer. Statist. Assoc. 77 835–840.
  • Fu, Q. and Banerjee, A. (2008). Multiplicative mixture models for overlapping clustering. In Proceedings of the IEEE International Conference on Data Mining 791–796. Pisa, Italy.
  • Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99 7821–7826.
  • Goldenberg, A., Zheng, A., Fienberg, S. and Airoldi, E. (2010). A survey of statistical network models. Found. Trends Mach. Learn. 2 129–233.
  • Griffiths, T. and Ghahramani, Z. (2005). Infinite latent feature models and the Indian buffet process. Adv. Neural Inform. Process. Syst. 18 475–482.
  • Handcock, M. S., Raftery, A. E. and Tantrum, J. M. (2007). Model-based clustering for social networks. J. Roy. Statist. Soc. Ser. A 170 301–354.
  • Heller, K. and Ghahramani, Z. (2007). A nonparametric Bayesian approach to modeling overlapping clusters. In Proceedings of the 11th International Conference on AI and Statistics. San Juan, Puerto Rico.
  • Heller, K., Williamson, S. and Ghahramani, Z. (2008). Statistical models for partial membership. In Proceedings of the 25th International Conference on Machine Learning 392–399. Helsinki, Finland.
  • Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002). Latent space approaches to social network analysis. J. Amer. Statist. Assoc. 97 1090–1098.
  • Hofman, J. and Wiggins, C. (2008). A Bayesian approach to network modularity. Phys. Rev. Lett. 100 258701.
  • Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks 5 109–137.
  • Jeffery, C. (1999). Moonlighting proteins. Trends Biochem. Sci. 24 8–11.
  • Krivitsky, P. and Handcock, M. (2009). The latentnet package, Version 2.1-1.
  • Krivitsky, P., Handcock, M., Raftery, A. and Hoff, P. (2009). Representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models. Social Networks 31 204–213.
  • Lacroix, V., Fernandes, C. and Sagot, M.-F. (2006). Motif search in graphs: Application to metabolic networks. Trans. Comput. Biol. Bioinform. 3 360–368.
  • Latouche, P., Birmelé, E. and Ambroise, C. (2009). Advances in Data Analysis, Data Handling, and Business Intelligence, Bayesian Methods for Graph Clustering 229–239. Springer, Berlin, Heidelberg.
  • Latouche, P., Birmelé, E. and Ambroise, C. (2010). Supplement A to “Overlapping stochastic block models with application to the French blogosphere network.” DOI: 10.12.14/10-AOAS382SUPP.
  • Martin, D., Brun, C., Remy, E., Mouren, P., Thieffry, D. and Jacq, B. (2004). GOToolBox: Functional analysis of gene datasets based on Gene Ontology. Genome Biol. 5.
  • Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, D., Chklovskii, D. and Alon, U. (2002). Network motifs: Simple building blocks of complex networks. Science 298 824–827.
  • Moreno, J. (1934). Who Shall Survive?: A New Approach to the Problem of Human Interrelations. Nervous and Mental Disease Publishing, Washington, DC.
  • Newman, M. and Leicht, E. (2007). Mixture models and exploratory analysis in networks. Proc. Natl. Acad. Sci. USA 104 9564–9569.
  • Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc. 96 1077–1087.
  • Palla, G., Derenyi, I., Farkas, I. and Vicsek, T. (2005). Uncovering the overlapping community structure of complex networks in nature and society. Nature 435 814–818.
  • Palla, G., Derenyi, I., Farkas, I. and Vicsek, T. (2006). CFinder the community/cluster finding program, Version 2.0.1.
  • Snijders, T. A. B. and Nowicki, K. (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block sturcture. J. Classification 14 75–100.
  • White, H., Boorman, S. and Breiger, R. (1976). Social structure from multiple networks. I. Blockmodels of roles and positions. Amer. J. Soc. 81 730–780.
  • Zanghi, H., Ambroise, C. and Miele, V. (2008). Fast online graph clustering via Erdös–Renyi mixture. Pattern Recognition 41 3592–3599.

Supplemental materials

  • Supplementary material: Appendix. Describe how global and local variational techniques can be used to obtain a tractable lower bound. Introduce the optimization equations for the inference procedure.