June 2022
Higher criticism for discriminating word-frequency tables and authorship attribution
Alon Kipnis
Ann. Appl. Stat. 16(2): 1236-1252 (June 2022). DOI: 10.1214/21-AOAS1544


We adapt the higher criticism (HC) goodness-of-fit test to measure the closeness between word-frequency tables. We apply this measure to authorship attribution challenges, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting and tuning, reporting accuracy at the state-of-the-art level in various current challenges. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that in comparing the similarity of a new document and a corpus of a single author, HC is mostly affected by words characteristic of the author and is relatively unaffected by topic structure.
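As a rough illustration of the idea summarized above, the following is a minimal sketch and not the paper's implementation. It assumes each word is given a p-value from an exact two-sided binomial test on its counts in the two tables, and that the HC statistic is the largest standardized gap between the sorted p-values and the uniform quantiles over the smallest fraction `gamma0` of p-values. The function names, the binomial null probability, and the value `gamma0=0.5` (chosen only to suit the toy data) are assumptions made for this example.

```python
import math


def binom_two_sided_pvalue(x, n, p):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed count x."""
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    cutoff = pmf[x] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def hc_discrepancy(counts1, counts2, gamma0=0.5):
    """Return (HC score, words at or below the HC threshold) for two
    word-frequency tables given as dicts mapping word -> count.
    gamma0 is a hypothetical default chosen for this toy example."""
    total1, total2 = sum(counts1.values()), sum(counts2.values())
    pvals = {}
    for w in sorted(set(counts1) | set(counts2)):
        x, y = counts1.get(w, 0), counts2.get(w, 0)
        rest = total1 + total2 - x - y
        if x + y == 0 or rest == 0:
            continue  # no information for this word
        # Assumed null: the word's occurrences split between the tables
        # in proportion to the remaining (other-word) totals.
        pvals[w] = binom_two_sided_pvalue(x, x + y, (total1 - x) / rest)

    ranked = sorted(pvals, key=pvals.get)  # most significant words first
    n_words = len(ranked)
    if n_words == 0:
        return float("-inf"), []

    best_hc, best_i = float("-inf"), 0
    for i in range(1, max(1, int(gamma0 * n_words)) + 1):
        u = i / n_words
        if u >= 1.0:
            break
        z = math.sqrt(n_words) * (u - pvals[ranked[i - 1]]) / math.sqrt(u * (1 - u))
        if z > best_hc:
            best_hc, best_i = z, i
    # Words up to the maximizing index form the discriminating subset.
    return best_hc, ranked[:best_i]


# Toy tables: identical except that one uses "upon" and the other "while".
doc1 = {"the": 50, "of": 30, "and": 40, "to": 30, "in": 30, "upon": 20}
doc2 = {"the": 50, "of": 30, "and": 40, "to": 30, "in": 30, "while": 20}

hc_same, _ = hc_discrepancy(doc1, doc1)
hc_diff, words_diff = hc_discrepancy(doc1, doc2)
```

In this toy case the score for the two differing tables exceeds the score for a table compared with itself, and the selected subset consists of the two words whose usage differs. Attribution would then assign a document to the author whose corpus gives the smallest score, under the same assumptions.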

Funding Statement

This work was supported in part by a fellowship from the Koret Foundation.


The author would like to thank David Donoho for fruitful discussions and three anonymous reviewers for providing comments that have greatly improved this paper.


Citation

Alon Kipnis. "Higher criticism for discriminating word-frequency tables and authorship attribution." Ann. Appl. Stat. 16 (2) 1236 - 1252, June 2022. https://doi.org/10.1214/21-AOAS1544


Received: 1 October 2020; Revised: 1 August 2021; Published: June 2022
First available in Project Euclid: 13 June 2022

MathSciNet: MR4438832
zbMATH: 1498.62348
Digital Object Identifier: 10.1214/21-AOAS1544

Keywords: authorship attribution, feature selection, higher criticism, nonparametric methods, two-sample testing, unsupervised learning

Rights: Copyright © 2022 Institute of Mathematical Statistics
