Statistical Science

The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015

Laura Anderlucci, Angela Montanari, and Cinzia Viroli


In this paper, we retrace the recent history of statistics by analyzing all the papers published since 1970 in five prestigious statistical journals: The Annals of Statistics, Biometrika, Journal of the American Statistical Association, Journal of the Royal Statistical Society, Series B, and Statistical Science. The aim is to construct a kind of “taxonomy” of statistical papers by organizing and clustering them into main themes. In this sense, being identified in a cluster means being important enough to be uncluttered in the vast and interconnected world of statistical research. Since the main statistical research topics are naturally born, evolve, or die over time, we also develop a dynamic clustering strategy, in which a group in one time period is allowed to migrate or to merge into different groups in the following period. The results show that statistics is a very dynamic and evolving science, stimulated by the rise of new research questions and new types of data.

Article information

Statist. Sci., Volume 34, Number 2 (2019), 280-300.

First available in Project Euclid: 19 July 2019

Keywords: Model-based clustering; cosine distance; textual data analysis


Anderlucci, Laura; Montanari, Angela; Viroli, Cinzia. The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015. Statist. Sci. 34 (2019), no. 2, 280–300. doi:10.1214/18-STS686.
