Open Access
Semi-supervised multiple testing
David Mary, Etienne Roquain
Electron. J. Statist. 16(2): 4926-4981 (2022). DOI: 10.1214/22-EJS2050

Abstract

An important limitation of standard multiple testing procedures is that the null distribution must be known. Here, we consider a null distribution-free approach for multiple testing in the following semi-supervised setting: the user does not know the null distribution, but has at hand a sample drawn from this null distribution. In practical situations, this null training sample (NTS) can come from previous experiments, from a part of the data under test, from specific simulations, or from a sampling process. In this work, we present theoretical results that handle such a framework, with a focus on false discovery rate (FDR) control and the Benjamini-Hochberg (BH) procedure. First, we provide upper and lower bounds for the FDR of the BH procedure based on empirical p-values, called here the semi-supervised BH procedure. These bounds match when α(n+1)/m is an integer, where n is the NTS sample size and m is the number of tests. Second, we give a power analysis for that procedure suggesting that it mimics an oracle power when n is sufficiently large relative to m; namely n ≳ m/max(1,k), where k denotes the number of “detectable” alternatives. Third, to complete the picture, we also present a negative result that reveals a phase transition intrinsic to the general semi-supervised multiple testing problem and shows that the semi-supervised BH method is optimal in the sense that its performance boundary follows this phase transition. Our theoretical properties are supported by numerical experiments, which also show that the delineated boundary is of the correct order without further tuning of any constant. Finally, we demonstrate that our work provides a theoretical grounding for standard practice in astronomical data analysis, and in particular for the procedure proposed in Mary et al. (2020) for galaxy detection.
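
To make the procedure named in the abstract concrete, here is a minimal Python sketch of a BH procedure run on empirical p-values computed from a null training sample. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, larger test statistics are assumed to be more extreme, and the empirical p-value is taken in its usual form, p_i = (1 + #{j : Y_j ≥ X_i}) / (n + 1), where Y_1, ..., Y_n is the NTS and X_i is the i-th test statistic.

```python
from bisect import bisect_left


def empirical_p_values(null_sample, test_stats):
    """Empirical p-values p_i = (1 + #{j : Y_j >= X_i}) / (n + 1).

    Assumption: larger statistics are more extreme under the alternative.
    """
    ys = sorted(null_sample)
    n = len(ys)
    # bisect_left counts the Y_j strictly below x, so n minus it is #{Y_j >= x}.
    return [(1 + n - bisect_left(ys, x)) / (n + 1) for x in test_stats]


def benjamini_hochberg(p_values, alpha):
    """Indices rejected by the BH step-up procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_hat = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose ordered p-value is below alpha * rank / m.
        if p_values[i] <= alpha * rank / m:
            k_hat = rank
    return sorted(order[:k_hat])


def semi_supervised_bh(null_sample, test_stats, alpha):
    """BH applied to empirical p-values built from the null training sample."""
    return benjamini_hochberg(empirical_p_values(null_sample, test_stats), alpha)


# Toy data (made up for illustration): n = 100 null draws, m = 5 tests.
nts = list(range(100))
stats = [500, 400, 50, 20, 300]
rejected = semi_supervised_bh(nts, stats, alpha=0.1)
print(rejected)
```

In this sketch the smallest attainable empirical p-value is 1/(n+1), so BH can only make k rejections if 1/(n+1) ≤ αk/m. This is one simple way to see why the NTS size n has to be large enough relative to the number of tests m.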

Funding Statement

This work has been supported by ANR-16-CE40-0019 (SansSouci), ANR-17-CE40-0001 (BASICS) and ANR-21-CE23-0035 (ASCAI) of the French National Research Agency ANR and by the GDR ISIS through the “projets exploratoires” program (project TASTY).

Acknowledgments

We are grateful to Lihua Lei for very interesting discussions, to Sabine Houssaye for her help when proving Lemma E.4 and to Guillaume Lecué for helpful comments.

Citation

David Mary, Etienne Roquain. "Semi-supervised multiple testing." Electron. J. Statist. 16 (2) 4926-4981, 2022. https://doi.org/10.1214/22-EJS2050

Information

Received: 1 November 2021; Published: 2022
First available in Project Euclid: 29 September 2022

MathSciNet: MR4490412
zbMATH: 07603100
Digital Object Identifier: 10.1214/22-EJS2050

Keywords: BH procedure, empirical p-values, false discovery rate, galaxy detection, knockoff, Lasso, multiple testing, phase transition
