## Statistical Science

### The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015

#### Abstract

In this paper, we retrace the recent history of statistics by analyzing all the papers published since 1970 in five prestigious statistical journals: The Annals of Statistics, Biometrika, Journal of the American Statistical Association, Journal of the Royal Statistical Society, Series B, and Statistical Science. The aim is to construct a kind of “taxonomy” of statistical papers by organizing and clustering them into main themes. In this sense, being identified in a cluster means being important enough to be uncluttered from the vast and interconnected world of statistical research. Since the main statistical research topics are naturally born, evolve, or die over time, we also develop a dynamic clustering strategy, in which a group in one time period is allowed to migrate or merge into different groups in the following one. Results show that statistics is a very dynamic and evolving science, stimulated by the rise of new research questions and types of data.

#### Article information

Source
Statist. Sci., Volume 34, Number 2 (2019), 280-300.

Dates
First available in Project Euclid: 19 July 2019

https://projecteuclid.org/euclid.ss/1563501642

Digital Object Identifier
doi:10.1214/18-STS686

Mathematical Reviews number (MathSciNet)
MR3983329

Zentralblatt MATH identifier
07110697

#### Citation

Anderlucci, Laura; Montanari, Angela; Viroli, Cinzia. The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015. Statist. Sci. 34 (2019), no. 2, 280–300. doi:10.1214/18-STS686. https://projecteuclid.org/euclid.ss/1563501642
