The Annals of Statistics

Spectral clustering and the high-dimensional stochastic blockmodel

Karl Rohe, Sourav Chatterjee, and Bin Yu

Full-text: Open access

Abstract

Networks or graphs can easily represent a diverse set of data sources that are characterized by interacting units or actors. Social networks, representing people who communicate with each other, are one example. Communities or clusters of highly connected actors form an essential feature in the structure of several empirical networks. Spectral clustering is a popular and computationally feasible method to discover these communities.

The stochastic blockmodel [Social Networks 5 (1983) 109–137] is a social network model with well-defined communities; each node is a member of one community. For a network generated from the Stochastic Blockmodel, we bound the number of nodes “misclustered” by spectral clustering. The asymptotic results in this paper are the first clustering results that allow the number of clusters in the model to grow with the number of nodes, hence the name high-dimensional.

In order to study spectral clustering under the stochastic blockmodel, we first show that under the more general latent space model, the eigenvectors of the normalized graph Laplacian asymptotically converge to the eigenvectors of a “population” normalized graph Laplacian. Aside from the implication for spectral clustering, this provides insight into a graph visualization technique. Our method of studying the eigenvectors of random matrices is original.

Article information

Source
Ann. Statist. Volume 39, Number 4 (2011), 1878-1915.

Dates
First available: 24 August 2011

Permanent link to this document
http://projecteuclid.org/euclid.aos/1314190618

Digital Object Identifier
doi:10.1214/11-AOS887

Zentralblatt MATH identifier
1227.62042

Mathematical Reviews number (MathSciNet)
MR2893856

Subjects
Primary: 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20] 62H25: Factor analysis and principal components; correspondence analysis
Secondary: 60B20: Random matrices (probabilistic aspects; for algebraic aspects see 15B52)

Keywords
Spectral clustering latent space model Stochastic Blockmodel clustering convergence of eigenvectors principal components analysis

Citation

Rohe, Karl; Chatterjee, Sourav; Yu, Bin. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39 (2011), no. 4, 1878--1915. doi:10.1214/11-AOS887. http://projecteuclid.org/euclid.aos/1314190618.


Export citation

References

  • Airoldi, E. M., Blei, D. M., Fienberg, S. E. and Xing, E. P. (2008). Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9 1981–2014.
  • Albert, R. and Barabási, A. L. (2002). Statistical mechanics of complex networks. Rev. Modern Phys. 74 47–97.
  • Barabási, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286 509–512.
  • Belkin, M. (2003). Problems of learning on manifolds. Ph.D. thesis, Univ. Chicago.
  • Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15 1373–1396.
  • Belkin, M. and Niyogi, P. (2008). Towards a theoretical foundation for Laplacian-based manifold methods. J. Comput. System Sci. 74 1289–1308.
  • Bhatia, R. (1987). Perturbation Bounds for Matrix Eigenvalues. Longman, Harlow.
  • Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Nat. Acad. Sci. India Sect. A 106 21068.
  • Bock, R. D. and Husain, S. (1952). Factors of the Tele: A preliminary report. Sociometry 15 206–219.
  • Bousquet, O., Chapelle, O. and Hein, M. (2004). Measure based regularization. Adv. Neural Inf. Process. Syst. 16.
  • Brandes, U., Delling, D., Gaertler, M., Görke, R., Hoefer, M., Nikoloski, Z. and Wagner, D. (2008). On modularity clustering. IEEE Transactions on Knowledge and Data Engineering 20 172–188.
  • Breiger, R. L., Boorman, S. A. and Arabie, P. (1975). An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling. J. Math. Psych. 12 328–383.
  • Champion, M. (1998). How many atoms make up the universe? Available at http://www.madsci.org/posts/archives/oct98/905633072.As.r.html.
  • Choi, D., Wolfe, P. and Airoldi, E. (2010). Stochastic blockmodels with growing number of classes. Preprint. Available at arXiv:1011.4644.
  • Coifman, R., Lafon, S., Lee, A., Maggioni, M., Nadler, B., Warner, F. and Zucker, S. (2005). Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proc. Nat. Acad. Sci. India Sect. A 102 7426–7431.
  • Condon, A. and Karp, R. M. (1999). Algorithms for graph partitioning on the planted partition model. In Randomization, Approximation, and Combinatorial Optimization (Berkeley, CA, 1999). Lecture Notes in Comput. Sci. 1671 221–232. Springer, Berlin.
  • Donath, W. E. and Hoffman, A. J. (1973). Lower bounds for the partitioning of graphs. IBM J. Res. Develop. 17 420–425.
  • Erdös, P. and Rényi, A. (1959). On random graphs. Publ. Math. Debrecen 6 290–297.
  • Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Math. J. 23 298–305.
  • Fjällström, P. (1998). Algorithms for graph partitioning: A survey. Computer and Information Science 3. Available at http://www.ep.liu.se/ea/cis/1998/010/.
  • Fortunato, S. (2010). Community detection in graphs. Phys. Rep. 486 75–174.
  • Frank, O. and Strauss, D. (1986). Markov graphs. J. Amer. Statist. Assoc. 81 832–842.
  • Freeman, L. C. (2000). Visualizing social networks. Journal of Social Structure 1(1).
  • Giné, E. and Koltchinskii, V. (2006). Empirical graph Laplacian approximation of Laplace–Beltrami operators: Large sample results. In High Dimensional Probability. Institute of Mathematical Statistics Lecture Notes—Monograph Series 51 238–259. IMS, Beachwood, OH.
  • Girvan, M. and Newman, M. (2002). Community structure in social and biological networks. Proc. Nat. Acad. Sci. India Sect. A 99 7821–7826.
  • Goldenberg, A., Zheng, A. X., Fienberg, S. E. and Airoldi, E. M. (2010). A survey of statistical network models. Foundations and Trends in Machine Learning 2 129–233.
  • Hagen, L. and Kahng, A. B. (1992). New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Computer-Aided Design 11 1074–1085.
  • Handcock, M. S., Raftery, A. E. and Tantrum, J. M. (2007). Model-based clustering for social networks. J. Roy. Statist. Soc. Ser. A 170 301–354.
  • Hein, M. (2006). Uniform convergence of adaptive graph-based regularization. In Learning Theory. Lecture Notes in Comput. Sci. 4005 50–64. Springer, Berlin.
  • Hein, M., Audibert, J. Y. and von Luxburg, U. (2005). From graphs to manifolds—weak and strong pointwise consistency of graph Laplacians. In Learning Theory. Lecture Notes in Comput. Sci. 3559 470–485. Springer, Berlin.
  • Hendrickson, B. and Leland, R. (1995). An improved spectral graph partitioning algorithm for mapping parallel computations. SIAM J. Sci. Comput. 16 452–469.
  • Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002). Latent space approaches to social network analysis. J. Amer. Statist. Assoc. 97 1090–1098.
  • Holland, P., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: Some first steps. Social Networks 5 109–137.
  • Holland, P. W. and Leinhardt, S. (1981). An exponential family of probability distributions for directed graphs. J. Amer. Statist. Assoc. 76 33–50.
  • Kallenberg, O. (2005). Probabilistic Symmetries and Invariance Principles. Springer, New York.
  • Klein, D. and Randić, M. (1993). Resistance distance. J. Math. Chem. 12 81–95.
  • Koren, Y. (2005). Drawing graphs by eigenvectors: Theory and practice. Comput. Math. Appl. 49 1867–1888.
  • Lafon, S. S. (2004). Diffusion maps and geometric harmonics. Ph.D. thesis, Yale Univ.
  • Leskovec, J., Lang, K. J., Dasgupta, A. and Mahoney, M. W. (2008). Statistical properties of community structure in large social and information networks. In Proceeding of the 17th International Conference on World Wide Web 695–704. ACM, Beijing, China.
  • Liotta, G., ed. (2004). Graph Drawing. Lecture Notes in Comput. Sci. 2912. Springer, Berlin.
  • Meilă, M. and Shi, J. (2001). A random walks view of spectral segmentation. In AI and Statistics (AISTATS) 2001.
  • Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E (3) 69 026113.
  • Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc. 96 1077–1087.
  • Pothen, A., Simon, H. D. and Liou, K. P. (1990). Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl. 11 430–452.
  • Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 888–905.
  • Snijders, T. A. B. and Nowicki, K. (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classification 14 75–100.
  • Spielman, D. A. and Teng, S. H. (2007). Spectral partitioning works: Planar graphs and finite element meshes. Linear Algebra Appl. 421 284–305.
  • Steinhaus, H. (1956). Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci. Cl. III 4 801–804.
  • Stewart, G. W. and Sun, J. (1990). Matrix Perturbation Theory. Academic Press, San Diego, CA.
  • Traud, A. L., Kelsic, E. D., Mucha, P. J. and Porter, M. A. (2008). Community structure in online collegiate social networks. Preprint.
  • Van Driessche, R. and Roose, D. (1995). An improved spectral bisection algorithm and its application to dynamic load balancing. Parallel Comput. 21 29–48.
  • Van Duijn, M. A. J., Snijders, T. A. B. and Zijlstra, B. J. H. (2004). p2: A random effects model with covariates for directed graphs. Stat. Neerl. 58 234–254.
  • von Luxburg, U. (2007). A tutorial on spectral clustering. Stat. Comput. 17 395–416.
  • von Luxburg, U., Belkin, M. and Bousquet, O. (2008). Consistency of spectral clustering. Ann. Statist. 36 555–586.
  • Vu, V. (2011). Singular vectors under random perturbation. Random Structures Algorithms. DOI:10.1002/rsa.20367.
  • Wasserman, S. and Faust, K. (1994). Social Network Analysis: Methods and Applications. Cambridge Univ. Press, Cambridge.
  • Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of ‘small world’ networks. Nature 393 440–442.