## The Annals of Applied Statistics

### Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks

#### Abstract

Networks are a popular tool for representing elements in a system and their interconnectedness. Many observed networks can be viewed as only samples of some true underlying network. Such is frequently the case, for example, in the monitoring and study of massive, online social networks. We study the problem of how to estimate the degree distribution—an object of fundamental interest—of a true underlying network from its sampled network. In particular, we show that this problem can be formulated as an inverse problem. Playing a key role in this formulation is a matrix relating the expectation of our sampled degree distribution to the true underlying degree distribution. Under many network sampling designs, this matrix can be defined entirely in terms of the design and is found to be ill-conditioned. As a result, our inverse problem frequently is ill-posed. Accordingly, we offer a constrained, penalized weighted least-squares approach to solving this problem. A Monte Carlo variant of Stein’s unbiased risk estimation (SURE) is used to select the penalization parameter. We explore the behavior of our resulting estimator of network degree distribution in simulation, using a variety of combinations of network models and sampling regimes. In addition, we demonstrate the ability of our method to accurately reconstruct the degree distributions of various sub-communities within online social networks corresponding to Friendster, Orkut and LiveJournal. Overall, our results show that the true degree distributions from both homogeneous and inhomogeneous networks can be recovered with substantially greater accuracy than reflected in the empirical degree distribution resulting from the original sampling.
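The ill-conditioning described in the abstract can be illustrated with a small sketch. It assumes one particular design that the abstract does not specify: induced-subgraph sampling in which each node is kept independently with probability `p`, so a kept node of true degree `j` shows an observed degree that is Binomial(`j`, `p`). The function names, the value `p = 0.3`, and the `true_counts` vector are hypothetical and chosen only for illustration. The sketch covers only the design matrix and naive inversion, not the constrained, penalized estimator or the SURE-based tuning.

```python
from math import comb

def design_matrix(max_degree, p):
    """P[i][j] = expected fraction of true-degree-j nodes that are sampled
    and observed with degree i (assumed Bernoulli(p) node sampling)."""
    size = max_degree + 1
    return [[p * comb(j, i) * p**i * (1 - p)**(j - i) if i <= j else 0.0
             for j in range(size)]
            for i in range(size)]

def expected_observed(P, counts):
    """Expected observed degree counts: the matrix-vector product P @ counts."""
    return [sum(a * c for a, c in zip(row, counts)) for row in P]

def naive_inverse(P, observed):
    """Unregularized inversion by back substitution (P is upper triangular)."""
    size = len(observed)
    counts = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(P[i][j] * counts[j] for j in range(i + 1, size))
        counts[i] = (observed[i] - tail) / P[i][i]
    return counts

p = 0.3                      # assumed sampling probability
max_degree = 10
P = design_matrix(max_degree, p)

# Hypothetical true degree counts for degrees 0..10.
true_counts = [0, 50, 120, 200, 220, 180, 120, 60, 30, 15, 5]
observed = expected_observed(P, true_counts)

# Add a single extra node to the highest observed-degree bin.
perturbed = list(observed)
perturbed[max_degree] += 1.0

recovered_clean = naive_inverse(P, observed)
recovered_noisy = naive_inverse(P, perturbed)
amplification = recovered_noisy[max_degree] - recovered_clean[max_degree]

# For a triangular matrix the ratio of extreme diagonal entries is a lower
# bound on the 2-norm condition number; here it equals p ** -max_degree.
cond_lower_bound = P[0][0] / P[max_degree][max_degree]

print(f"condition number is at least {cond_lower_bound:.3g}")
print(f"one extra observed node shifts the top-degree estimate by {amplification:.3g}")
```

Under these assumptions the diagonal of the matrix decays geometrically in `p`, so plain inversion turns a one-node perturbation into a very large change in the recovered counts. This is the kind of instability that motivates regularizing the inversion rather than inverting directly.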

#### Article information

Source
Ann. Appl. Stat., Volume 9, Number 1 (2015), 166-199.

Dates
First available in Project Euclid: 28 April 2015

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1430226089

Digital Object Identifier
doi:10.1214/14-AOAS800

Mathematical Reviews number (MathSciNet)
MR3341112

Zentralblatt MATH identifier
06446565

#### Citation

Zhang, Yaonan; Kolaczyk, Eric D.; Spencer, Bruce D. Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks. Ann. Appl. Stat. 9 (2015), no. 1, 166–199. doi:10.1214/14-AOAS800. https://projecteuclid.org/euclid.aoas/1430226089

#### References

• Achlioptas, D., Clauset, A., Kempe, D. and Moore, C. (2005). On the bias of traceroute sampling or, power-law degree distributions in regular graphs. In STOC’05: Proceedings of the 37th Annual ACM Symposium on Theory of Computing 694–703. ACM, New York.
• Ahmed, N., Neville, J. and Kompella, R. R. (2011). Network sampling via edge-based node selection with graph induction. CSD TR # 11-016 1–10. Purdue Univ., West Lafayette, IN.
• Ahmed, N. K., Neville, J. and Kompella, R. R. (2012). Network sampling designs for relational classification. In ICWSM.
• Ahmed, N. K., Berchmans, F., Neville, J. and Kompella, R. (2010). Time-based sampling of social network activity graphs. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs 1–9. ACM, New York.
• Ahn, Y.-Y., Han, S., Kwak, H., Moon, S. and Jeong, H. (2007). Analysis of topological characteristics of huge online social networking services. In Proceedings of the 16th International Conference on World Wide Web 835–844. ACM, New York.
• Bailey, N. T. et al. (1975). The Mathematical Theory of Infectious Diseases and Its Applications. Charles Griffin & Company Ltd., London.
• Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge Univ. Press, Cambridge.
• Cochran, W. G. (1977). Sampling Techniques, 3rd ed. Wiley, New York.
• CVX Research, Inc. (2012). CVX: Matlab software for disciplined convex programming, version 2.0 beta. Available at: http://cvxr.com/cvx.
• Daley, D. J. and Gani, J. M. (1999). Epidemic Modelling: An Introduction. Cambridge Univ. Press, Cambridge.
• Dong, J. and Simonoff, J. S. (1994). The construction and properties of boundary kernels for smoothing sparse multinomials. J. Comput. Graph. Statist. 3 57–66.
• Eldar, Y. C. (2009). Generalized SURE for exponential families: Applications to regularization. IEEE Trans. Signal Process. 57 471–481.
• Frank, O. (1971). Statistical inference in graphs. Ph.D. thesis, Foa Repro Stockholm.
• Frank, O. (1980). Estimation of the number of vertices of different degrees in a graph. J. Statist. Plann. Inference 4 45–50.
• Frank, O. (1981). A survey of statistical methods for graph analysis. Sociol. Method. 12 110–155.
• Frank, O. (2005). Network sampling and model fitting. In Models and Methods in Social Network Analysis 31–56. Cambridge Univ. Press, Cambridge.
• Gjoka, M., Kurant, M., Butts, C. T. and Markopoulou, A. (2010). Walking in Facebook: A case study of unbiased sampling of OSNs. In INFOCOM, 2010 Proceedings IEEE 1–9. IEEE, New York.
• Gjoka, M., Butts, C. T., Kurant, M. and Markopoulou, A. (2011). Multigraph sampling of online social networks. IEEE J. Sel. Areas Commun. 29 1893–1905.
• Hall, P. and Titterington, D. M. (1987). On smoothing sparse multinomial data. Aust. J. Stat. 29 19–37.
• Handcock, M. S. and Gile, K. J. (2010). Modeling social networks from sampled data. Ann. Appl. Stat. 4 5–25.
• Hubler, C., Kriegel, H.-P., Borgwardt, K. and Ghahramani, Z. (2008). Metropolis algorithms for representative subgraph sampling. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on 283–292. IEEE, New York.
• Jin, L., Chen, Y., Hui, P., Ding, C., Wang, T., Vasilakos, A. V., Deng, B. and Li, X. (2011). Albatross sampling: Robust and effective hybrid vertex sampling for social graphs. In Proceedings of the 3rd ACM International Workshop on MobiArch 11–16. ACM, New York.
• Kephart, J. O. and White, S. R. (1991). Directed-graph epidemiological models of computer viruses. In Research in Security and Privacy, 1991. Proceedings, 1991 IEEE Computer Society Symposium on 343–359. IEEE, New York.
• Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer, New York.
• Kurant, M., Markopoulou, A. and Thiran, P. (2011). Towards unbiased BFS sampling. IEEE J. Sel. Areas Commun. 29 1799–1809.
• Kurant, M., Gjoka, M., Butts, C. T. and Markopoulou, A. (2011). Walking on a graph with a magnifying glass: Stratified sampling via weighted random walks. In Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems 281–292. ACM, New York.
• Kurant, M., Gjoka, M., Wang, Y., Almquist, Z. W., Butts, C. T. and Markopoulou, A. (2012). Coarse-grained topology estimation via graph sampling. In Proceedings of the 2012 ACM Workshop on Workshop on Online Social Networks 25–30. ACM, New York.
• Lakhina, A., Byers, J. W., Crovella, M. and Xie, P. (2003). Sampling biases in IP topology measurements. In INFOCOM 2003. Twenty-Second Annual Joint Conference of the IEEE Computer and Communications. IEEE Societies 1 332–341. IEEE, New York.
• Leskovec, J. and Faloutsos, C. (2006). Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 631–636. ACM, New York.
• Li, J.-Y. and Yeh, M.-Y. (2011). On sampling type distribution from heterogeneous social networks. In Advances in Knowledge Discovery and Data Mining 111–122. Springer, Berlin.
• Lim, Y.-s., Menasché, D. S., Ribeiro, B., Towsley, D. and Basu, P. (2011). Online estimating the $k$ central nodes of a network. In Network Science Workshop (NSW), 2011 IEEE 118–122. IEEE, New York.
• Lovász, L. (1993). Combinatorial Problems and Exercises, 2nd ed. North-Holland, Amsterdam.
• Lu, X. and Bressan, S. (2012). Sampling connected induced subgraphs uniformly at random. In Scientific and Statistical Database Management 195–212. Springer, Berlin.
• Maiya, A. S. and Berger-Wolf, T. Y. (2010a). Online sampling of high centrality individuals in social networks. In Advances in Knowledge Discovery and Data Mining 91–98. Springer, Berlin.
• Maiya, A. S. and Berger-Wolf, T. Y. (2010b). Sampling community structure. In Proceedings of the 19th International Conference on World Wide Web 701–710. ACM, New York.
• Mislove, A., Marcon, M., Gummadi, K. P., Druschel, P. and Bhattacharjee, B. (2007). Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement 29–42. ACM, New York.
• Mohaisen, A., Luo, P., Li, Y., Kim, Y. and Zhang, Z.-L. (2012). Measuring bias in the mixing time of social graphs due to graph sampling. In Military Communications Conference, 2012-MILCOM 2012 1–6. IEEE, New York.
• Pastor-Satorras, R. and Vespignani, A. (2001). Epidemic spreading in scale-free networks. Phys. Rev. Lett. 86 3200.
• Ramani, S., Blu, T. and Unser, M. (2008). Monte-Carlo SURE: A black-box optimization of regularization parameters for general denoising algorithms. IEEE Trans. Image Process. 17 1540–1554.
• Ribeiro, B. and Towsley, D. (2010). Estimating and sampling graphs with multidimensional random walks. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement 390–403. ACM, New York.
• Rolls, D. A., Daraganova, G., Sacks-Davis, R., Hellard, M., Jenkinson, R., McBryde, E., Pattison, P. E. and Robins, G. L. (2012). Modelling hepatitis C transmission over a social network of injecting drug users. J. Theoret. Biol. 297 73–87.
• Salehi, M., Rabiee, H. R., Nabavi, N. and Pooya, S. (2011). Characterizing twitter with respondent-driven sampling. In Dependable, Autonomic and Secure Computing (DASC), 2011 IEEE Ninth International Conference on 1211–1217. IEEE, New York.
• Shi, X., Bonner, M., Adamic, L. A. and Gilbert, A. C. (2008). The very small world of the well-connected. In Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia 61–70. ACM, New York.
• Stumpf, M. P. H. and Wiuf, C. (2005). Sampling properties of random graphs: The degree distribution. Phys. Rev. E (3) 72 036118.
• Stumpf, M. P. H., Wiuf, C. and May, R. M. (2005). Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proc. Natl. Acad. Sci. USA 102 4221–4224.
• Van Mieghem, P. (2011). Graph Spectra for Complex Networks. Cambridge Univ. Press, Cambridge.
• Van Mieghem, P., Omic, J. and Kooij, R. (2009). Virus spread in networks. IEEE/ACM Transactions on Networking 17 1–14.
• Wang, T., Chen, Y., Zhang, Z., Xu, T., Jin, L., Hui, P., Deng, B. and Li, X. (2011). Understanding graph sampling algorithms for social network analysis. In Distributed Computing Systems Workshops (ICDCSW), 2011 31st International Conference on 123–128. IEEE, New York.
• Yang, J. and Leskovec, J. (2012). Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics 3. ACM, New York.
• Yoon, S.-H., Kim, K.-N., Kim, S.-W. and Park, S. (2011). A community-based sampling method using DPL for online social network. CoRR abs/1109.1063.
• Zhou, J., Li, Y., Adhikari, V. K. and Zhang, Z.-L. (2011). Counting youtube videos via random prefix sampling. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference 371–380. ACM, New York.