The Annals of Applied Statistics

Model trees with topic model preprocessing: An approach for data journalism illustrated with the WikiLeaks Afghanistan war logs

Thomas Rusch, Paul Hofmarcher, Reinhold Hatzinger, and Kurt Hornik

Abstract

The WikiLeaks Afghanistan war logs contain nearly 77,000 reports of incidents in the US-led Afghanistan war, covering the period from January 2004 to December 2009. The recent growth of data on complex social systems and the potential to derive stories from them has shifted the focus of journalistic and scientific attention increasingly toward data-driven journalism and computational social science. In this paper we advocate the use of modern statistical methods for problems of data journalism and beyond, which may help journalistic and scientific work and lead to additional insight. Using the WikiLeaks Afghanistan war logs for illustration, we present an approach that builds intelligible statistical models for interpretable segments in the data, in this case to explore the fatality rates associated with different circumstances in the Afghanistan war. Our approach combines preprocessing by Latent Dirichlet Allocation (LDA) with model trees. LDA is used to process the natural language information contained in each report summary by estimating latent topics and assigning each report to one of them. Together with other variables, these topic assignments serve as splitting variables for finding segments in the data, and local statistical models for the reported number of fatalities are fitted to those segments. Segmentation and fitting are carried out with recursive partitioning of negative binomial distributions. We identify segments with different fatality rates that correspond to a small number of topics and other variables as well as their interactions. Furthermore, we carve out the similarities between segments and connect them to stories that have been covered in the media. This gives an unprecedented description of the war in Afghanistan and serves as an example of how data journalism, computational social science and other areas with interest in database data can benefit from modern statistical techniques.
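
The segmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes topic assignments are already available, uses made-up toy counts, fits an intercept-only negative binomial by the method of moments, and picks a single one-vs-rest split on the topic label by log-likelihood gain. All function names and data here are hypothetical, and the splitting criterion is a simple stand-in for whatever tests the paper itself uses.

```python
import math

def fit_nb(counts):
    """Moment estimates (mean, size) of a negative binomial.
    size is None when the sample shows no overdispersion (Poisson limit)."""
    n = len(counts)
    mu = sum(counts) / n
    var = sum((y - mu) ** 2 for y in counts) / (n - 1) if n > 1 else 0.0
    size = mu * mu / (var - mu) if var > mu else None
    return mu, size

def loglik(counts):
    """Log-likelihood of the counts under the moment-fitted model."""
    mu, size = fit_nb(counts)
    if mu == 0:
        return 0.0  # all counts are zero: a point mass at zero fits exactly
    total = 0.0
    for y in counts:
        if size is None:  # Poisson log-probability
            total += y * math.log(mu) - mu - math.lgamma(y + 1)
        else:             # negative binomial with mean mu and size parameter
            total += (math.lgamma(y + size) - math.lgamma(size)
                      - math.lgamma(y + 1)
                      + size * math.log(size / (size + mu))
                      + y * math.log(mu / (size + mu)))
    return total

def best_topic_split(reports):
    """Return (topic, gain) for the one-vs-rest split on topic that most
    improves the log-likelihood over a single model for all reports."""
    parent = loglik([y for _, y in reports])
    best = None
    for topic in sorted({t for t, _ in reports}):
        inside = [y for t, y in reports if t == topic]
        outside = [y for t, y in reports if t != topic]
        if not inside or not outside:
            continue
        gain = loglik(inside) + loglik(outside) - parent
        if best is None or gain > best[1]:
            best = (topic, gain)
    return best

# Hypothetical toy data: (assigned topic, reported fatalities) per report.
reports = (
    [("A", y) for y in [0, 0, 1, 0, 0, 1]]
    + [("B", y) for y in [0, 1, 0, 0, 0, 0]]
    + [("C", y) for y in [5, 9, 2, 12, 7, 3]]
)

topic, gain = best_topic_split(reports)
print(topic, round(gain, 2))
```

In a full model tree the same search would be repeated inside each resulting segment, over all splitting variables rather than the topic alone, until a stopping rule is met.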

Article information

Source
Ann. Appl. Stat., Volume 7, Number 2 (2013), 613-639.

Dates
First available in Project Euclid: 27 June 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1372338461

Digital Object Identifier
doi:10.1214/12-AOAS618

Mathematical Reviews number (MathSciNet)
MR3112911

Zentralblatt MATH identifier
1288.62200

Keywords
Afghanistan; count data; database data; latent Dirichlet allocation; model-based recursive partitioning; WikiLeaks; computational social science; tree stability; tree validation; text mining

Citation

Rusch, Thomas; Hofmarcher, Paul; Hatzinger, Reinhold; Hornik, Kurt. Model trees with topic model preprocessing: An approach for data journalism illustrated with the WikiLeaks Afghanistan war logs. Ann. Appl. Stat. 7 (2013), no. 2, 613–639. doi:10.1214/12-AOAS618. https://projecteuclid.org/euclid.aoas/1372338461


