The Annals of Applied Statistics

Topic-adjusted visibility metric for scientific articles

Linda S. L. Tan, Aik Hui Chan, and Tian Zheng

Full-text: Open access

Abstract

Measuring the impact of scientific articles is important for evaluating the research output of individual scientists, academic institutions and journals. Citations are the raw data for constructing impact measures, but biases arise when factors affecting citation patterns are not properly accounted for. In this work, we address the problem of field variation and introduce an article-level metric useful for evaluating the visibility of individual articles. This measure derives from joint probabilistic modeling of the content of the articles and the citations among them, using latent Dirichlet allocation (LDA) and the mixed membership stochastic blockmodel (MMSB). Our proposed model provides a visibility metric for individual articles adjusted for field variation in citation rates, a structural understanding of citation behavior in different fields, and article recommendations that take into account article visibility and citation patterns. We develop an efficient algorithm for model fitting using variational methods. To scale up to large networks, we develop an online variant using stochastic gradient methods and a case-control likelihood approximation. We apply our methods to the benchmark KDD Cup 2003 dataset with approximately 30,000 high-energy physics papers.
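
The case-control likelihood approximation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain Bernoulli link model for a single article, with the link probabilities supplied directly, whereas the paper derives them from LDA and MMSB. The function names `full_loglik` and `case_control_loglik` are hypothetical. The idea shown is that all observed citations (cases) are kept, while the far more numerous non-citations (controls) are subsampled and reweighted by the inverse sampling fraction, so that the subsampled sum is an unbiased estimate of the full one.

```python
import math
import random


def full_loglik(links, probs):
    """Exact Bernoulli log-likelihood over every candidate pair of one node.

    links[i] is 1 if a citation to article i is present, else 0;
    probs[i] is the (assumed, externally supplied) link probability.
    """
    return sum(
        math.log(p) if y else math.log(1.0 - p)
        for y, p in zip(links, probs)
    )


def case_control_loglik(links, probs, n_controls, rng):
    """Approximate log-likelihood: all cases, a reweighted sample of controls."""
    cases = [i for i, y in enumerate(links) if y]
    nonlinks = [i for i, y in enumerate(links) if not y]

    # Draw controls without replacement from the non-links.
    sampled = rng.sample(nonlinks, min(n_controls, len(nonlinks)))

    # Inverse sampling fraction, so the control term is unbiased for the
    # sum over all non-links.
    weight = len(nonlinks) / len(sampled) if sampled else 0.0

    ll_cases = sum(math.log(probs[i]) for i in cases)
    ll_controls = sum(math.log(1.0 - probs[i]) for i in sampled)
    return ll_cases + weight * ll_controls
```

Calling `case_control_loglik(links, probs, n_controls, random.Random(0))` with `n_controls` at least as large as the number of non-links recovers the exact value from `full_loglik`. Smaller values of `n_controls` trade variance for speed, which is what makes the approximation attractive for sparse citation networks.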

Article information

Source
Ann. Appl. Stat. Volume 10, Number 1 (2016), 1-31.

Dates
Received: June 2015
Revised: October 2015
First available in Project Euclid: 25 March 2016

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1458909905

Digital Object Identifier
doi:10.1214/15-AOAS887

Mathematical Reviews number (MathSciNet)
MR3480485

Zentralblatt MATH identifier
06586134

Keywords
Article level metric citation network models stochastic blockmodels variational Bayes stochastic variational inference

Citation

Tan, Linda S. L.; Chan, Aik Hui; Zheng, Tian. Topic-adjusted visibility metric for scientific articles. Ann. Appl. Stat. 10 (2016), no. 1, 1–31. doi:10.1214/15-AOAS887. https://projecteuclid.org/euclid.aoas/1458909905.


Supplemental materials

  • Supplement to “Topic-adjusted visibility metric for scientific articles”. We provide additional material to support the results in this paper. This includes further discussions, detailed derivations, illustrations and a simulation study.