## The Annals of Applied Statistics

### Model-based clustering of large networks

#### Abstract

We describe a network clustering framework, based on finite mixture models, that can be applied to discrete-valued networks with hundreds of thousands of nodes and billions of edge variables. Relative to other recent model-based clustering work for networks, we introduce a more flexible modeling framework, improve the variational-approximation estimation algorithm, discuss and implement standard error estimation via a parametric bootstrap approach, and apply these methods to much larger data sets than those seen elsewhere in the literature. The more flexible framework is achieved through introducing novel parameterizations of the model, giving varying degrees of parsimony, using exponential family models whose structure may be exploited in various theoretical and algorithmic ways. The algorithms are based on variational generalized EM algorithms, where the E-steps are augmented by a minorization-maximization (MM) idea. The bootstrapped standard error estimates are based on an efficient Monte Carlo network simulation idea. Last, we demonstrate the usefulness of the model-based clustering framework by applying it to a discrete-valued network with more than 131,000 nodes and 17 billion edge variables.
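To make the scale and the mixture-model idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a directed network without self-loops, so that each ordered pair of distinct nodes contributes one edge variable. It also assumes the simplest binary case, in which each node receives a latent cluster label and edges are independent Bernoulli draws given those labels. The function names `n_edge_variables` and `simulate_block_network` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def n_edge_variables(n_nodes, directed=True):
    """Count edge variables among n_nodes nodes, excluding self-loops."""
    pairs = n_nodes * (n_nodes - 1)
    return pairs if directed else pairs // 2


def simulate_block_network(n_nodes, mix_probs, edge_probs, rng):
    """Draw one directed binary network from a simple finite mixture model.

    Assumption for illustration: each node gets a latent cluster label
    drawn from mix_probs, and given the labels, edge i -> j is an
    independent Bernoulli draw with probability edge_probs[z[i], z[j]].
    """
    z = rng.choice(len(mix_probs), size=n_nodes, p=mix_probs)
    p = edge_probs[z][:, z]  # n x n matrix, p[i, j] = edge_probs[z[i], z[j]]
    y = (rng.random((n_nodes, n_nodes)) < p).astype(int)
    np.fill_diagonal(y, 0)  # no self-loops
    return z, y


rng = np.random.default_rng(0)
mix_probs = np.array([0.5, 0.3, 0.2])         # hypothetical cluster proportions
edge_probs = np.array([[0.30, 0.02, 0.02],    # hypothetical within-cluster and
                       [0.02, 0.30, 0.02],    # between-cluster edge probabilities
                       [0.02, 0.02, 0.30]])
z, y = simulate_block_network(200, mix_probs, edge_probs, rng)

# With 131,000 nodes, a directed network has about 17.2 billion edge variables.
print(n_edge_variables(131_000))  # prints 17160869000
```

Repeatedly simulating networks in this way from fitted parameters and re-estimating on each simulated network is the general shape of a parametric bootstrap. The paper's own simulation scheme is described as more efficient than this dense-matrix sketch, which would not scale to networks of the size discussed in the abstract.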

#### Article information

Source
Ann. Appl. Stat., Volume 7, Number 2 (2013), 1010–1039.

Dates
First available in Project Euclid: 27 June 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1372338477

Digital Object Identifier
doi:10.1214/12-AOAS617

Mathematical Reviews number (MathSciNet)
MR3113499

Zentralblatt MATH identifier
1288.62106

#### Citation

Vu, Duy Q.; Hunter, David R.; Schweinberger, Michael. Model-based clustering of large networks. Ann. Appl. Stat. 7 (2013), no. 2, 1010–1039. doi:10.1214/12-AOAS617. https://projecteuclid.org/euclid.aoas/1372338477
