## The Annals of Applied Statistics

### Power-law distributions in binned empirical data

#### Abstract

Many man-made and natural phenomena, including the intensity of earthquakes, population of cities and size of international wars, are believed to follow power-law distributions. The accurate identification of power-law patterns has significant consequences for correctly understanding and modeling complex systems. However, statistical evidence for or against the power-law hypothesis is complicated by large fluctuations in the empirical distribution’s tail, and these are worsened when information is lost from binning the data. We adapt the statistically principled framework for testing the power-law hypothesis, developed by Clauset, Shalizi and Newman, to the case of binned data. This approach includes maximum-likelihood fitting, a hypothesis test based on the Kolmogorov–Smirnov goodness-of-fit statistic and likelihood ratio tests for comparing against alternative explanations. We evaluate the effectiveness of these methods on synthetic binned data with known structure, quantify the loss of statistical power due to binning, and apply the methods to twelve real-world binned data sets with heavy-tailed patterns.
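The maximum-likelihood step for binned data can be illustrated with a minimal sketch. It assumes a continuous power law $p(x) \propto x^{-\alpha}$ above a known lower cutoff `b_min`, so that the probability of a bin is a difference of the complementary CDF at its two edges. The function names (`binned_log_likelihood`, `fit_alpha`, `log_bin`, `sample_power_law`) are hypothetical, and the golden-section search is a generic optimizer chosen here for self-containment, not necessarily the authors' procedure; it also assumes the log-likelihood has a single maximum in the search interval.

```python
import bisect
import math
import random


def binned_log_likelihood(alpha, edges, counts):
    """Log-likelihood of bin counts under p(x) ~ x^(-alpha) for x >= edges[0].

    `edges` has len(counts) + 1 entries; bin i covers [edges[i], edges[i+1]).
    """
    b_min = edges[0]
    ll = 0.0
    for lo, hi, h in zip(edges[:-1], edges[1:], counts):
        if h == 0:
            continue
        # Bin probability: CCDF(lo) - CCDF(hi), with CCDF(x) = (x / b_min)^(1 - alpha).
        p = (lo / b_min) ** (1.0 - alpha) - (hi / b_min) ** (1.0 - alpha)
        ll += h * math.log(p)
    return ll


def fit_alpha(edges, counts, lo=1.0001, hi=10.0, tol=1e-8):
    """Maximize the binned log-likelihood over alpha by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if binned_log_likelihood(c, edges, counts) > binned_log_likelihood(d, edges, counts):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return 0.5 * (a + b)


def sample_power_law(n, alpha, b_min):
    """Draw n values from a continuous power law by inverse-transform sampling."""
    return [b_min * (1.0 - random.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def log_bin(data, b_min, ratio):
    """Count data in bins [b_min * ratio^j, b_min * ratio^(j+1)) covering all values."""
    largest = max(data)
    edges = [b_min]
    while edges[-1] <= largest:
        edges.append(edges[-1] * ratio)
    counts = [0] * (len(edges) - 1)
    for x in data:
        counts[bisect.bisect_right(edges, x) - 1] += 1
    return edges, counts


if __name__ == "__main__":
    random.seed(1)
    synthetic = sample_power_law(20000, alpha=2.5, b_min=1.0)
    bin_edges, bin_counts = log_bin(synthetic, b_min=1.0, ratio=2.0)
    print("estimated alpha:", fit_alpha(bin_edges, bin_counts))
```

On synthetic data with a known exponent, the estimate can be compared directly with the true value, which is the kind of check the abstract describes. The sketch treats `b_min` as given and does not cover the goodness-of-fit or likelihood ratio steps.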

#### Article information

Source
Ann. Appl. Stat. Volume 8, Number 1 (2014), 89–119.

Dates
First available in Project Euclid: 8 April 2014

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1396966280

Digital Object Identifier
doi:10.1214/13-AOAS710

Mathematical Reviews number (MathSciNet)
MR3191984

Zentralblatt MATH identifier
06302229

#### Citation

Virkar, Yogesh; Clauset, Aaron. Power-law distributions in binned empirical data. Ann. Appl. Stat. 8 (2014), no. 1, 89–119. doi:10.1214/13-AOAS710. https://projecteuclid.org/euclid.aoas/1396966280.

#### References

• Aban, I. B. and Meerschaert, M. M. (2004). Generalized least-squares estimators for the thickness of heavy tails. J. Statist. Plann. Inference 119 341–352.
• Arnold, B. C. (1983). Pareto Distributions. Statistical Distributions in Scientific Work 5. International Co-operative Publishing House, Burtonsville, MD.
• Asal, V. and Rethemeyer, R. K. (2008). The nature of the beast: Organizational structures and the lethality of terrorist attacks. The Journal of Politics 70 437–449.
• Barndorff-Nielsen, O. E. and Cox, D. R. (1995). Inference and Asymptotics. Chapman & Hall, London.
• Beirlant, J. and Teugels, J. L. (1989). Asymptotic normality of Hill’s estimator. Extreme Value Theory 51 148–155.
• Breiman, L., Stone, C. J. and Kooperberg, C. (1990). Robust confidence bounds for extreme upper quantiles. J. Stat. Comput. Simul. 37 127–149.
• Cadez, I. V., Smyth, P., McLachlan, G. J. and McLaren, C. E. (2002). Maximum likelihood estimation of mixture of densities for binned and truncated multivariate data. Machine Learning 47 7–34.
• Chapuis, A. and Tetzlaff, T. (2012). The variability of tidewater-glacier calving: Origin of event-size and interval distributions. Available at arXiv:1205.1640.
• Clauset, A., Shalizi, C. R. and Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Rev. 51 661–703.
• Clauset, A. and Woodard, R. (2013). Estimating the historical and future probabilities of large terrorist events. Ann. Appl. Stat. 7 1838–1865.
• Clauset, A., Young, M. and Gleditsch, K. S. (2007). On the frequency of severe terrorist events. Journal of Conflict Resolution 51 58–87.
• Cramér, H. (1946). A contribution to the theory of statistical estimation. Skand. Aktuarietidskr. 29 85–94.
• Danielsson, J., de Haan, L., Peng, L. and de Vries, C. G. (2001). Using a bootstrap method to choose the sample fraction in tail index estimation. J. Multivariate Anal. 76 226–248.
• Dekkers, A. L. M. and de Haan, L. (1993). Optimal choice of sample fraction in extreme-value estimation. J. Multivariate Anal. 47 173–195.
• Drees, H. and Kaufmann, E. (1998). Selecting the optimal sample fraction in univariate extreme value estimation. Stochastic Process. Appl. 75 149–172.
• Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability 57. Chapman & Hall, New York.
• Gabaix, X. (2009). Power laws in economics and finance. Annual Review of Economics 1 255–293.
• Goh, K.-I., Cusick, M. E., Valle, D., Childs, B., Vidal, M. and Barabási, A.-L. (2007). The human disease network. Proc. Natl. Acad. Sci. USA 104 8685–8690.
• Goldstein, M. L., Morris, S. A. and Yen, G. G. (2004). Problems with fitting to the power-law distribution. Eur. Phys. J. B 41 255–258.
• Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT Press, Cambridge, MA.
• Hall, P. (1982). On some simple estimates of an exponent of regular variation. J. R. Stat. Soc. Ser. B Stat. Methodol. 44 37–42.
• Handcock, M. S. and Jones, J. H. (2004). Likelihood-based inference for stochastic models of sexual network evolution. Theoretical Population Biology 65 413–422.
• Heritage Provider Network (2012). Heritage health prize data files, HHP_release3. Available at http://bit.ly/wG8Psl.
• Hill, B. M. (1975). A simple general approach to inference about the tail of a distribution. Ann. Statist. 3 1163–1174.
• Horn, S. D. (1977). Goodness-of-fit tests for discrete data: A review and an application to a health development scale. Biometrics 33 237–247.
• Ijiri, Y. and Simon, H. A. (1977). Skew Distributions and the Sizes of Business Firms. North-Holland, Amsterdam.
• Jarvinen, B., Neumann, C. and Davis, M. A. S. (2012). NHC data archive. National Hurricane Center. Available at http://1.usa.gov/cCcwTg.
• Kass, R. E. and Raftery, A. E. (1995). Bayes factors. J. Amer. Statist. Assoc. 90 773–795.
• Kratz, M. and Resnick, S. I. (1996). The $\mathrm{QQ}$-estimator and heavy tails. Comm. Statist. Stochastic Models 12 699–724.
• McLachlan, G. J. and Jones, P. N. (1988). Fitting mixture models to grouped and truncated data via the EM algorithm. Biometrics 44 571–578.
• Mitzenmacher, M. (2004). A brief history of generative models for power law and lognormal distributions. Internet Math. 1 226–251.
• Mitzenmacher, M. (2006). The future of power law research. Internet Math. 2 525–534.
• Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46 323–351.
• Noether, G. E. (1963). Note on the Kolmogorov statistic in the discrete case. Metrika 7 115–116.
• Orphanet Report Series, Rare Diseases collection (2011). Prevalence of rare diseases: Bibliographic data. Available at http://bit.ly/MezSZ6.
• Persing, J. and Montgomery, M. T. (2003). Hurricane superintensity. J. Atmospheric Sci. 60 2349–2371.
• Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (1992). Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. Cambridge Univ. Press, Cambridge.
• Rao, C. R. (1947). Minimum variance and the estimation of several parameters. Proc. Cambridge Philos. Soc. 43 280–283.
• Rao, C. R. (1957). Maximum likelihood estimation for the multinomial distribution. Sankhyā 18 139–148.
• Reed, W. J. and Hughes, B. D. (2002). From gene families and genera to income and internet file sizes: Why power laws are so common in nature. Phys. Rev. E (3) 66 067103.
• Reiss, R.-D. and Thomas, M. (2007). Statistical Analysis of Extreme Values with Applications to Insurance, Finance, Hydrology and Other Fields, 3rd ed. Birkhäuser, Basel.
• Richardson, L. F. (1960). Statistics of Deadly Quarrels. The Boxwood Press, Pittsburgh.
• Schultze, J. and Steinebach, J. (1996). On least squares estimates of an exponential tail coefficient. Statist. Decisions 14 353–372.
• Shinozaki, K., Yoda, K., Hozumi, K. and Kira, T. (1964). A quantitative analysis of plant form—The pipe model theory II: Further evidence of the theory and its application in forest ecology. Japanese Journal of Ecology 14 133–139.
• Sornette, D. (2006). Critical Phenomena in Natural Sciences: Chaos, Fractals, Selforganization and Disorder: Concepts and Tools, 2nd ed. Springer, Berlin.
• Stoev, S. A., Michailidis, G. and Taqqu, M. S. (2011). Estimating heavy-tail exponents through max self-similarity. IEEE Trans. Inform. Theory 57 1615–1636.
• Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B Stat. Methodol. 36 111–147.
• Storm Prediction Center (2011). Severe weather database files (1950–2011). Available at http://1.usa.gov/Lj7cC9.
• Stumpf, M. P. H. and Porter, M. A. (2012). Critical truths about power laws. Science 335 665–666.
• Tate, M. W. and Hyer, L. A. (1973). Inaccuracy of the $\chi^{2}$ test of goodness of fit when expected frequencies are small. J. Amer. Statist. Assoc. 68 836–841.
• Virkar, Y. and Clauset, A. (2014). Supplement to “Power-law distributions in binned empirical data.” DOI:10.1214/13-AOAS710SUPP.
• Vuong, Q. H. (1989). Likelihood ratio tests for model selection and nonnested hypotheses. Econometrica 57 307–333.
• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer, New York.
• West, G. B., Enquist, B. J. and Brown, J. H. (2009). A general quantitative theory of forest structure and dynamics. Proc. Natl. Acad. Sci. USA 106 7040–7045.
• World Glacier Monitoring Service and National Snow and Ice Data Center (2012). World glacier inventory. Available at http://bit.ly/MhLdt6.
• Yamamoto, K. and Kobayashi, S. (1993). Analysis of crown structure based on the pipe model theory. Journal of the Japanese Forestry Society 75 445–448.

#### Supplemental materials

• Supplementary material: Supplement to “Power-law distributions in binned empirical data”. In this supplemental file, we derive a closed-form expression for the binned MLE in Section 1.1, quantify the information lost by using a coarser binning scheme in Section 1.2, and include the likelihood ratio test for the binned case in Section 2.
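
To illustrate why a closed-form binned MLE can exist, consider one special case, worked out here as a sketch rather than quoted from the supplement: logarithmic bins $[b_{\min}c^{j},\, b_{\min}c^{j+1})$ with ratio $c > 1$, bin counts $h_j$ with $n = \sum_j h_j$, and a continuous power law $p(x) \propto x^{-\alpha}$ for $x \ge b_{\min}$. The symbols $c$, $h_j$, $q$ and $\bar{\jmath}$ are notation assumed for this illustration and may differ from the supplement's.

$$
\begin{aligned}
\Pr(\text{bin } j) &= c^{-j(\alpha-1)} - c^{-(j+1)(\alpha-1)} = q^{j}(1-q), \qquad q = c^{-(\alpha-1)},\\
\mathcal{L}(q) &= \sum_j h_j\left[j\ln q + \ln(1-q)\right] = n\left[\bar{\jmath}\,\ln q + \ln(1-q)\right], \qquad \bar{\jmath} = \frac{1}{n}\sum_j j\,h_j,\\
\frac{d\mathcal{L}}{dq} = 0 &\;\Longrightarrow\; \hat{q} = \frac{\bar{\jmath}}{1+\bar{\jmath}} \;\Longrightarrow\; \hat{\alpha} = 1 + \log_c\!\left(1 + \frac{1}{\bar{\jmath}}\right).
\end{aligned}
$$

Under these assumptions the bin index follows a geometric distribution, so the estimate depends on the data only through the mean bin index $\bar{\jmath}$.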