## The Annals of Applied Statistics

### A testing based extraction algorithm for identifying significant communities in networks

#### Abstract

A common and important problem arising in the study of networks is how to divide the vertices of a given network into one or more groups, called communities, in such a way that vertices of the same community are more interconnected than vertices belonging to different ones. We propose and investigate a testing based community detection procedure called Extraction of Statistically Significant Communities (ESSC). The ESSC procedure is based on $p$-values for the strength of connection between a single vertex and a set of vertices under a reference distribution derived from a conditional configuration network model. The procedure automatically selects both the number of communities in the network and their size. Moreover, ESSC can handle overlapping communities and, unlike the majority of existing methods, identifies “background” vertices that do not belong to a well-defined community. The method has only one parameter, which controls the stringency of the hypothesis tests. We investigate the performance and potential use of ESSC and compare it with a number of existing methods, through a validation study using four real network data sets. In addition, we carry out a simulation study to assess the effectiveness of ESSC in networks with various types of community structure, including networks with overlapping communities and those with background vertices. These results suggest that ESSC is an effective exploratory tool for the discovery of relevant community structure in complex network systems. Data and software are available at http://www.unc.edu/~jameswd/research.html.
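As a rough illustration of the kind of test the abstract describes, the sketch below computes a $p$-value for the connection between each vertex and a candidate set `B`. It assumes, as a simplification not stated in the abstract, that under a configuration-type null the number of edges from a vertex of degree $d$ into `B` is approximately Binomial($d$, $p_B$), where $p_B$ is the share of total degree held by `B`. The function names, the toy graph, and the use of a Benjamini-Hochberg step-up rule with level `alpha` as the single stringency parameter are all illustrative choices, not the authors' implementation.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def connection_pvalues(adj, B):
    """p-value per vertex for its observed number of edges into the set B.

    adj: dict mapping each vertex to the set of its neighbours
         (undirected simple graph, hypothetical input format).
    B:   candidate community, a set of vertices.
    """
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    total_degree = sum(degree.values())  # twice the number of edges
    p_B = sum(degree[v] for v in B) / total_degree  # assumed null edge probability
    return {
        u: binomial_upper_tail(len(adj[u] & B), degree[u], p_B) for u in adj
    }


def benjamini_hochberg(pvals, alpha):
    """Return the keys rejected by the Benjamini-Hochberg step-up rule."""
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    m = len(ordered)
    cutoff = 0
    for i, (_, p) in enumerate(ordered, start=1):
        if p <= alpha * i / m:
            cutoff = i  # largest rank whose p-value is under its threshold
    return {key for key, _ in ordered[:cutoff]}


# Toy graph (an assumption for illustration): two triangles joined by one edge.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
pvals = connection_pvalues(adj, B={0, 1, 2})
```

In this toy graph, vertices inside the first triangle receive smaller $p$-values than those in the second, which is the signal an extraction procedure of this kind would use when deciding which vertices to keep in a candidate community. The graph is far too small for any of these $p$-values to be significant; it only shows the direction of the effect.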

#### Article information

Source
Ann. Appl. Stat., Volume 8, Number 3 (2014), 1853–1891.

Dates
First available in Project Euclid: 23 October 2014

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1414091237

Digital Object Identifier
doi:10.1214/14-AOAS760

Mathematical Reviews number (MathSciNet)
MR3271356

Zentralblatt MATH identifier
1304.62141

#### Citation

Wilson, James D.; Wang, Simi; Mucha, Peter J.; Bhamidi, Shankar; Nobel, Andrew B. A testing based extraction algorithm for identifying significant communities in networks. Ann. Appl. Stat. 8 (2014), no. 3, 1853–1891. doi:10.1214/14-AOAS760. https://projecteuclid.org/euclid.aoas/1414091237
