Likelihood estimation of sparse topic distributions in topic models and its applications to Wasserstein document distance calculations

Xin Bing; Florentina Bunea; Seth Strimas-Mackey; Marten Wegkamp

doi:10.1214/22-AOS2229

Xin Bing,¹ Florentina Bunea,² Seth Strimas-Mackey,² Marten Wegkamp³
¹Department of Statistical Sciences, University of Toronto
²Department of Statistics and Data Science, Cornell University
³Departments of Mathematics, and of Statistics and Data Science, Cornell University

Ann. Statist. 50(6): 3307-3333 (December 2022). DOI: 10.1214/22-AOS2229

Abstract

This paper studies the estimation of high-dimensional, discrete, possibly sparse, mixture models in the context of topic models. The data consists of observed multinomial counts of p words across n independent documents. In topic models, the $p\times n$ expected word frequency matrix is assumed to be factorized as a $p\times K$ word-topic matrix A and a $K\times n$ topic-document matrix T. Since columns of both matrices represent conditional probabilities belonging to probability simplices, columns of A are viewed as p-dimensional mixture components that are common to all documents while columns of T are viewed as the K-dimensional mixture weights that are document specific and are allowed to be sparse.
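As an illustration of the factorization described above, the following is a minimal simulation sketch, not the authors' code. The function name `simulate_topic_model`, the Dirichlet draws used to fill A and T, and all sizes (p, K, n, and the document length N) are assumptions chosen only for the example.

```python
import numpy as np

def simulate_topic_model(p=50, K=3, n=5, N=200, seed=0):
    """Draw word counts from a toy topic model (all sizes are illustrative)."""
    rng = np.random.default_rng(seed)
    # A: p x K word-topic matrix; each column is a distribution over p words.
    A = rng.dirichlet(np.ones(p), size=K).T
    # T: K x n topic-document matrix; each column is a distribution over K topics.
    # (Assumption: Dirichlet weights; they are near-sparse, not exactly sparse.)
    T = rng.dirichlet(0.3 * np.ones(K), size=n).T
    # Pi = A T: p x n expected word frequencies; columns remain on the simplex.
    Pi = A @ T
    # Column i of X is Multinomial(N, Pi[:, i]), independently across documents.
    X = np.column_stack([rng.multinomial(N, Pi[:, i]) for i in range(n)])
    return A, T, Pi, X
```

Because the columns of A and T are probability vectors, each column of the product AT is again a probability vector over the p words, which is what allows it to serve as the multinomial parameter of a document.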

The main interest is to provide sharp, finite-sample, $\ell_1$-norm convergence rates for estimators of the mixture weights T when A is either known or unknown. For known A, we suggest maximum likelihood estimation (MLE) of T. Our nonstandard analysis of the MLE not only establishes its $\ell_1$ convergence rate, but also reveals a remarkable property: the MLE, with no extra regularization, can be exactly sparse and contain the true zero pattern of T. We further show that the MLE is both minimax optimal and adaptive to the unknown sparsity in a large class of sparse topic distributions. When A is unknown, we estimate T by optimizing the likelihood function corresponding to a generic plug-in estimator $\widehat{A}$ of A. For any estimator $\widehat{A}$ that satisfies carefully detailed conditions for proximity to A, we show that the resulting estimator of T retains the properties established for the MLE. Our theoretical results allow the ambient dimensions K and p to grow with the sample sizes.
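For a single document with word counts x and a known A, the likelihood in the weights t is proportional to the product over words j of $(At)_j^{x_j}$, maximized over the probability simplex. The sketch below uses the standard EM fixed-point iteration for mixture weights with known components; this is one common way to compute such a maximizer and is not claimed to be the algorithm used in the paper. The function name `mle_topic_weights` and its parameters are hypothetical.

```python
import numpy as np

def mle_topic_weights(A, x, n_iter=500, tol=1e-10):
    """EM iterations for argmax_t sum_j x_j log((A t)_j) over the simplex.

    Assumes every observed word has positive probability under some topic.
    """
    p, K = A.shape
    freq = x / x.sum()            # empirical word frequencies
    support = freq > 0            # only observed words enter the likelihood
    A_obs = A[support]
    t = np.full(K, 1.0 / K)       # start from the uniform topic distribution
    for _ in range(n_iter):
        mix = A_obs @ t           # fitted word probabilities (A t)_j
        # EM update: t_k <- t_k * sum_j freq_j * A_jk / (A t)_j
        t_new = t * (A_obs.T @ (freq[support] / mix))
        converged = np.abs(t_new - t).sum() < tol
        t = t_new
        if converged:
            break
    return t
```

Each update keeps t on the simplex without any projection step. Note that these iterates only approach zero entries in the limit, so a small threshold would be needed to read off a zero pattern from this particular solver.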

Our main application is to the estimation of 1-Wasserstein distances between document generating distributions. We propose, estimate and analyze new 1-Wasserstein distances between alternative probabilistic document representations, at the word and topic level, respectively. We derive finite sample bounds on the estimated proposed 1-Wasserstein distances. For word level document-distances, we provide contrast with existing rates on the 1-Wasserstein distance between standard empirical frequency estimates. The effectiveness of the proposed 1-Wasserstein distances is illustrated by an analysis of an IMDB movie reviews data set. Finally, our theoretical results are supported by extensive simulation studies.
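A 1-Wasserstein distance between two discrete distributions can be computed as a small linear program over couplings. The sketch below assumes SciPy is available and takes the ground cost matrix D as an input; the specific ground metrics between words or topics proposed in the paper are not given in this abstract, so any D used here is an assumption. The function name `wasserstein1` is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein1(a, b, D):
    """1-Wasserstein distance between discrete distributions a, b with cost D."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    m, n = len(a), len(b)
    # The coupling P (m x n) is flattened row-major: entry (i, j) -> i * n + j.
    # Row-sum constraints: sum_j P[i, j] = a[i].
    row = np.kron(np.eye(m), np.ones((1, n)))
    # Column-sum constraints: sum_i P[i, j] = b[j].
    col = np.kron(np.ones((1, m)), np.eye(n))
    res = linprog(
        c=np.asarray(D, dtype=float).ravel(),  # transport cost per unit mass
        A_eq=np.vstack([row, col]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),                      # couplings are nonnegative
        method="highs",
    )
    return res.fun
```

For example, with a and b taken as two estimated columns of T and D a K by K matrix of distances between topics, the function returns a topic-level document distance; with the 0-1 cost it reduces to the total variation distance.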

Funding Statement

Bunea is supported in part by NSF Grants DMS-2015195 and DMS-2210563. Wegkamp is supported in part by NSF Grants DMS-2015195 and DMS-2210557.

Acknowledgments

We thank the Editor, the Associate Editor and two referees for their detailed reviews, which helped to improve the paper substantially.

Citation

Xin Bing, Florentina Bunea, Seth Strimas-Mackey, Marten Wegkamp. "Likelihood estimation of sparse topic distributions in topic models and its applications to Wasserstein document distance calculations." Ann. Statist. 50 (6) 3307-3333, December 2022. https://doi.org/10.1214/22-AOS2229

Information

Received: 1 July 2021; Revised: 1 September 2022; Published: December 2022

First available in Project Euclid: 21 December 2022

MathSciNet: MR4524498

zbMATH: 07641127

Digital Object Identifier: 10.1214/22-AOS2229

Subjects:

Primary: 62H12, 62H30

Secondary: 62F10

Keywords: adaptive estimation, anchor words, high-dimensional estimation, maximum likelihood estimation, minimax estimation, mixture model, multinomial distribution, nonnegative matrix factorization, sparse estimation, topic models

JOURNAL ARTICLE
27 PAGES

Ann. Statist. Vol. 50, No. 6, December 2022. Institute of Mathematical Statistics.
