Statistics Surveys

Text Data Mining: Theory and Methods

Jeffrey L. Solka

Full-text: Open access


This paper provides the reader with a brief introduction to some of the theory and methods of text data mining. The intent is to introduce the reader to some of the methodologies currently employed within this discipline, while at the same time making the reader aware of some of the interesting challenges that remain to be solved in the area. Finally, the article serves as a rudimentary tutorial on some of these techniques and provides the reader with a list of references for additional study.
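As a toy illustration (not taken from the paper) of one of the methods the survey covers, the sketch below applies latent semantic analysis in the style of Deerwester et al. [15]: documents are encoded as a term-document count matrix, which is then reduced with a truncated singular value decomposition. The example corpus and the choice of k = 2 are arbitrary assumptions for demonstration only.

```python
import numpy as np

# A tiny example corpus (hypothetical; real applications use thousands of documents).
docs = [
    "text mining finds patterns in text",
    "clustering groups similar documents",
    "document clustering uses similarity of text",
]

# Build the vocabulary and the term-document count matrix A (terms x documents).
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD: keep the k leading singular triplets (the "concept" space of LSA).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T  # each row: one document in k dimensions

# Cosine similarity between documents in the reduced space.
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(doc_coords.shape)
print(cos(doc_coords[1], doc_coords[2]))  # the two clustering-themed documents
```

In practice the raw counts are usually replaced by a term weighting such as tf-idf, and the reduced coordinates feed into the clustering, classification, and visualization methods discussed in the survey.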

Article information

Statist. Surv., Volume 2 (2008), 94-112.

First available in Project Euclid: 16 July 2008


Primary: 62-01: Instructional exposition (textbooks, tutorial papers, etc.)
Secondary: 62A01: Foundations and philosophical topics

Keywords: text data mining, clustering, visualization, pattern recognition, discriminant analysis, dimensionality reduction, feature extraction, manifold learning


Solka, Jeffrey L. Text Data Mining: Theory and Methods. Statist. Surv. 2 (2008), 94--112. doi:10.1214/07-SS016.



  • [1] Baeza-Yates, R. and Ribeiro-Neto, B. (1999)., Modern Information Retrieval. Addison Wesley.
  • [2] Bao, Y. G. and Ishii, N. (2002). Combining multiple k-nearest neighbor classifiers for text classification by reducts. In, Discovery Science 2002.
  • [3] Belkin, M. and Niyogi, P. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation., Neural Comp. 15, 6, 1373–1396.
  • [4] Berry, M. and Browne, M. (2005)., Understanding Search Engines: Mathematical Modeling and Text Retrieval (Software, Environments, Tools). SIAM.
  • [5] Berry, M. W. (2003)., Survey of Text Mining: Clustering, Classification, and Retrieval. Springer.
  • [6] Bird, S., Klein, E., and Loper, E. (2007)., Natural Language Processing in Python. Creative Commons Attribution-NonCommercial-No Derivative Works 3.0, New York, New York.
  • [7] Börner, K., Dall’Asta, L., Ke, W., and Vespignani, A. (2005). Studying the emerging global brain: Analyzing and visualizing the impact of co-authorship teams., Complexity, special issue on Understanding Complex Systems 10, 4, 58–67.
  • [8] Börner, K., Maru, J., and Goldstone, R. (2004). Simultaneous evolution of author and paper networks., Proceedings of the National Academy of Sciences of the United States of America Supplement 1 101, 5266–5273.
  • [9] Boyack, K. W., Klavans, R., and Börner, K. (2005). Mapping the backbone of science., Scientometrics 64, 3, 351–374.
  • [10] Börner, K., Chen, C., and Boyack, K. W. (2003). Visualizing knowledge domains., Annual Review of Information Science and Technology (ARIST) 37, 179–255.
  • [11] Chen, D.-Y., Li, X., Dong, Z. Y., and Chen, X. (2005). Effectiveness of document representation for classification. In, DaWaK. 368–377.
  • [12] Chouchoulas, A. and Shen, Q. (2001). Rough set-aided keyword reduction for text categorization., Applied Artificial Intelligence 15, 9, 843–873.
  • [13] Cleuziou, G., Martin, L., Clavier, V., and Vrain, C. (2004). Ddoc: Overlapping clustering of words for document classification. In, SPIRE. 127–128.
  • [14] Cox, T. and Cox, M. (2000)., Multidimensional Scaling. Chapman and Hall/CRC.
  • [15] Deerwester, S., Dumais, S. T., Furnas, G. W., and Landauer, T. K. (1990). Indexing by latent semantic analysis., Journal of the Am. Soc. for Information Science 41, 6, 391–407.
  • [16] Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In, KDD.
  • [17] Duda, R. O., Hart, P. E., and Stork, D. G. (2000)., Pattern Classification, Second ed. Wiley Interscience.
  • [18] Dy, J. G. and Brodley, C. E. (2004). Feature selection for unsupervised learning., J. Mach. Learn. Res. 5, 845–889.
  • [19] Fei, L., Wen-Ju, Z., Shui, Y., Fan-Yuan, M., and Ming-Lu, L. (2004). A peer-to-peer hypertext categorization using directed acyclic graph support vector machines., Lecture Notes in Computer Science 3320, 54–57.
  • [20] Hadi, A. (2000)., Matrix Algebra as a Tool. CRC.
  • [21] Hand, D. J., Mannila, H., and Smyth, P. (2001)., Principles of Data Mining (Adaptive Computation and Machine Learning). MIT Press.
  • [22] How, B. C. and Kiong, W. T. (2005). An examination of feature selection frameworks in text categorization. In, AIRS. 558–564.
  • [23] Howland, P., Jeon, M., and Park, H. (2003). Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition., SIAM J. Matrix Anal. Appl. 25, 1, 165–179.
  • [24] Howland, P. and Park, H. (2004). Generalizing discriminant analysis using the generalized singular value decomposition., IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 8, 995–1006.
  • [25] Jolliffe, I. (1986)., Principal Component Analysis. Springer-Verlag.
  • [26] Kang, D.-K., Zhang, J., Silvescu, A., and Honavar, V. (2005). Multinomial event model based abstraction for sequence and text classification. In, SARA. 134–148.
  • [27] Karras, D. A. and Mertzios, B. G. (2002). A robust meaning extraction methodology using supervised neural networks. In, Australian Joint Conference on Artificial Intelligence. 498–510.
  • [28] Kim, S.-S., Kwon, S., and Cook, D. (2000). Interactive visualization of hierarchical clusters using mds and mst., Metrika 51, 1, 39–51.
  • [29] Kostoff, R. N. and Block, J. A. (2005). Factor matrix text filtering and clustering: Research articles., J. Am. Soc. Inf. Sci. Technol. 56, 9.
  • [30] Lagus, K., Kaski, S., and Kohonen, T. (2004). Mining massive document collections by the websom method., Information Sciences 163, 1-3, 135–156.
  • [31] Lakshminarayan, C., Yu, Q., and Benson, A. (2005). Improving customer experience via text mining. In, DNIS. 288–299.
  • [32] Lee, J. A. and Verleysen, M. (2007)., Nonlinear Dimensionality Reduction (Information Science and Statistics), First ed. Springer, New York, New York.
  • [33] Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. (2002). Text classification using string kernels., J. Mach. Learn. Res. 2, 419–444.
  • [34] Lu, J., Xu, B., and Jiang, J. (2004). Generating different semantic spaces for document classification. In, AWCC. 430–436.
  • [35] Mane, K. and Börner, K. (2004). Mapping topics and topic bursts in pnas., Proceedings of the National Academy of Sciences of the United States of America Supplement 1 101, 5287–5290.
  • [36] Manning, C. D. and Schütze, H. (1999)., Foundations of Statistical Natural Language Processing. MIT Press.
  • [37] Martinez, A. (2002). A framework for the representation of semantics. Ph.D. thesis, George Mason University.
  • [38] Mather, L. A. (2000). A linear algebra measure of cluster quality., J. Am. Soc. Inf. Sci. 51, 7, 602–613.
  • [39] Morris, S. A. and Yen, G. G. (2004). Crossmaps: Visualization of overlapping relationships in collections of journal papers., Proceedings of the National Academy of Sciences of the United States of America Supplement 1 101, 5291–5296.
  • [40] Park, H., Jeon, M., and Rosen, J. B. (2003). Lower dimensional representation of text data based on centroids and least squares., BIT 43, 2, 1–22.
  • [41] Plaisant, C., Grosjean, J., and Bederson, B. (2002). Spacetree: Supporting exploration in large node link trees, design evolution and empirical evaluation. In, Proceedings of the IEEE Symposium on Information Visualization (InfoVis’02).
  • [42] Porter, M. F. (1980). An algorithm for suffix stripping., Program 14, 3, 130–137.
  • [43] Priebe, C. E., Marchette, D. J., Park, Y., Wegman, E. J., Solka, J., Socolinsky, D., Karakos, D., Church, K., Guglielmi, R., Coifman, R., Lin, D., Healy, D., Jacobs, M., and Tsao, A. (2004). Iterative denoising for cross-corpus discovery. In, Proceedings of the 2004 Symposium on Computational Statistics.
  • [44] Romero, E., Marquez, L., and Carreras, X. (2004). Margin maximization with feed-forward neural networks: a comparative study with svm and adaboost., Neurocomputing 57, 313–344.
  • [45] Roweis, S. and Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding., Science 290, 2323–2326.
  • [46] Sebastiani, F. (2002). Machine learning in automated text categorization., ACM Comput. Surv. 34, 1, 1–47.
  • [47] Shahnaz, F., Berry, M. W., Pauca, V. P., and Plemmons, R. J. (2006). Document clustering using nonnegative matrix factorization., Inf. Process. Manage. 42, 2.
  • [48] Solka, J. L., Bryant, A., and Wegman, E. J. (2005). Text data mining with minimal spanning trees. In, Data Mining and Data Visualization, C. R. Rao, E. J. Wegman, and J. L. Solka, Eds. handbook of statistics, Vol. 24. Elsevier North Holland, Chapter 5, 133–169.
  • [49] Song, M. (1998). Can visualizing document space improve users’ information foraging. In, Proceedings of the Asis Annual Meeting.
  • [50] Steinbach, M., Karypis, G., and Kumar, V. (2000). A comparison of document clustering techniques. In, KDD Workshop on Text Mining.
  • [51] Tenenbaum, J., de Silva, V., and Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction., Science 290, 2319–2323.
  • [52] Toffler, A. (1991)., Future Shock. Bantam.
  • [53] Xu, Y., Olman, V., and Xu, D. (2002). Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees., Bioinformatics 18, 4, 536–545.
  • [54] Zhao, Y. and Karypis, G. (2002). Evaluation of hierarchical clustering algorithms for document datasets. In, 11th Conference on Information and Knowledge Management (CIKM). 515–524.
  • [55] Zheng, X.-S., He, P.-L., Tian, M., and Yuan, F.-Y. (2003). Algorithm of documents clustering based on minimum spanning tree. In, 2003 International Conference on Machine Learning and Cybernetics, Vol. 1. 199–203.
  • [56] Zhong, S. and Ghosh, J. (2005). Generative model-based document clustering: a comparative study., Knowl. Inf. Syst. 8, 3, 374–384.