A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics

Xin Bing; Florentina Bunea; Marten Wegkamp

doi:10.3150/19-BEJ1166

Bernoulli 26(3): 1765-1796 (August 2020). DOI: 10.3150/19-BEJ1166

Abstract

Topic models have become popular for the analysis of data that consist of a collection of $n$ independent multinomial observations, with parameters $N_{i}\in\mathbb{N}$ and $\Pi_{i}\in[0,1]^{p}$ for $i=1,\ldots,n$. The model links all cell probabilities, collected in a $p\times n$ matrix $\Pi$, via the assumption that $\Pi$ can be factorized as the product of two nonnegative matrices $A\in[0,1]^{p\times K}$ and $W\in[0,1]^{K\times n}$. Topic models were originally developed in text mining, where one browses through $n$ documents, based on a dictionary of $p$ words, and covering $K$ topics. In this terminology, the matrix $A$ is called the word-topic matrix, and is the main target of estimation. It can be viewed as a matrix of conditional probabilities, and it is uniquely defined under appropriate separability assumptions, discussed in detail in this work. Notably, the unique $A$ is required to satisfy what is commonly known as the anchor word assumption, under which $A$ has an unknown number of rows respectively proportional to the canonical basis vectors in $\mathbb{R}^{K}$. The indices of such rows are referred to as anchor words. Recent computationally feasible algorithms, with theoretical guarantees, make constructive use of this assumption by linking the estimation of the set of anchor words to the estimation of the $K$ vertices of a simplex. This crucial step in the estimation of $A$ requires $K$ to be known, and cannot be easily extended to the more realistic set-up in which $K$ is unknown.
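To make the model concrete, the following is a minimal simulation sketch of the data-generating mechanism described above. It is not the estimation method of the paper. The function name `simulate_topic_model`, the default sizes, the uniform and Dirichlet sampling choices, and the use of a common document length $N$ for all $i$ are illustrative assumptions.

```python
import numpy as np

def simulate_topic_model(p=30, K=3, n=50, N=200, anchors_per_topic=2, seed=0):
    """Draw word counts from Pi = A W under an anchor word structure.

    All sampling choices here are illustrative assumptions, not the
    simulation design used in the paper.
    """
    rng = np.random.default_rng(seed)

    # Word-topic matrix A (p x K).
    A = np.zeros((p, K))
    n_anchor = K * anchors_per_topic

    # Anchor rows: each has a single nonzero entry, so the row is
    # proportional to a canonical basis vector of R^K.
    for k in range(K):
        rows = slice(k * anchors_per_topic, (k + 1) * anchors_per_topic)
        A[rows, k] = rng.uniform(0.5, 1.0, size=anchors_per_topic)

    # Non-anchor rows: strictly positive weight on every topic.
    A[n_anchor:, :] = rng.uniform(0.1, 1.0, size=(p - n_anchor, K))

    # Normalize so that each column of A is a probability vector over words.
    A /= A.sum(axis=0, keepdims=True)

    # Topic-document matrix W (K x n): each column is a topic mixture.
    W = rng.dirichlet(np.ones(K), size=n).T

    # Cell probabilities; each column of Pi sums to one.
    Pi = A @ W

    # One multinomial observation per document (common N is an assumption).
    X = np.column_stack([rng.multinomial(N, Pi[:, i]) for i in range(n)])
    return A, W, Pi, X

A, W, Pi, X = simulate_topic_model()
```

Here `X` plays the role of the observed $p\times n$ count matrix, while the first `K * anchors_per_topic` rows of `A` are the anchor words that an estimator would need to recover without being told $K$.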

This work takes a different view on anchor word estimation, and on the estimation of $A$. We propose a new method of estimation in topic models that is not a variation on the existing simplex finding algorithms, and that estimates $K$ from the observed data. We derive new finite sample minimax lower bounds for the estimation of $A$, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any $n$, $N_{i}$, $p$ and $K$, and both $p$ and $K$ are allowed to increase with $n$, a situation not handled well by previous analyses.

We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, although we start out with the computational and theoretical disadvantage of not knowing the correct number of topics $K$, while we provide the competing methods with the correct value in our simulations.

Citation

Xin Bing, Florentina Bunea, Marten Wegkamp. "A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics." Bernoulli 26 (3) 1765-1796, August 2020. https://doi.org/10.3150/19-BEJ1166

Information

Received: 1 May 2018; Revised: 1 June 2019; Published: August 2020

First available in Project Euclid: 27 April 2020

zbMATH: 07193942

MathSciNet: MR4091091

Digital Object Identifier: 10.3150/19-BEJ1166

Keywords: adaptive estimation, anchor words, high dimensional estimation, identification, latent model, minimax estimation, nonnegative matrix factorization, overlapping clustering, separability, topic model