The Annals of Applied Statistics

Strategies for online inference of model-based clustering in large and growing networks

Hugo Zanghi, Franck Picard, Vincent Miele, and Christophe Ambroise


Abstract

In this paper we adapt online estimation strategies to perform model-based clustering on large networks. Our work focuses on two algorithms, the first based on the SAEM algorithm and the second on variational methods. These two strategies are compared with existing approaches on simulated and real data. We use the method to decipher the connection structure of the political websphere during the 2008 US political campaign. We show that our online EM-based algorithms offer a good trade-off between precision and speed when estimating parameters for mixture distributions in the context of random graphs.

Article information

Source
Ann. Appl. Stat. Volume 4, Number 2 (2010), 687-714.

Dates
First available in Project Euclid: 3 August 2010

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1280842136

Digital Object Identifier
doi:10.1214/10-AOAS359

Mathematical Reviews number (MathSciNet)
MR2758645

Zentralblatt MATH identifier
1194.62096

Keywords
Graph clustering; EM algorithms; online strategies; web graph structure analysis

Citation

Zanghi, Hugo; Picard, Franck; Miele, Vincent; Ambroise, Christophe. Strategies for online inference of model-based clustering in large and growing networks. Ann. Appl. Stat. 4 (2010), no. 2, 687–714. doi:10.1214/10-AOAS359. https://projecteuclid.org/euclid.aoas/1280842136.


