Open Access
A distance metric-based space-filling subsampling method for nonparametric models
Huaimin Diao, Dianpeng Wang, Xu He
Electron. J. Statist. 18(2): 3247-3273 (2024). DOI: 10.1214/24-EJS2251

Abstract

Subsampling is an efficient and popular strategy for handling data sets that are too large to be modeled directly. To optimize inference and prediction accuracy, it is crucial to select the observations intelligently. In this paper, we propose a space-filling subsampling method that uses distance metric-based strata to select subsamples from high-volume data sets. To minimize the maximal distance between pairs of samples that lie in the same stratum, the input space is partitioned by the Voronoi cells of thinnest covering lattices. Within each stratum, subsamples that are space-filling with respect to the response are then collected. Because an algorithm quickly identifies the cell in which an observation lies, the computational cost of our subsampling method is proportional to the number of observations and independent of the number of cells, which makes the method applicable to extremely large data sets. Results from simulation studies and real data analysis show that the new method performs remarkably better than existing approaches when used in conjunction with Gaussian process models.
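
The two-stage idea in the abstract can be illustrated with a minimal two-dimensional sketch. It is not the authors' implementation. It assumes the hexagonal lattice (the thinnest lattice covering of the plane) as the partition and uses evenly spaced response ranks as a simple stand-in for the paper's response-based space-filling rule. The names `hex_cell`, `subsample`, `scale`, and `per_cell` are hypothetical.

```python
import math
from collections import defaultdict

SQRT3 = math.sqrt(3.0)


def hex_cell(x, y, scale=1.0):
    """Label the hexagonal Voronoi cell containing (x, y) in O(1) time.

    The hexagonal lattice is the union of the rectangular lattice
    Z x sqrt(3)Z and a copy shifted by (1/2, sqrt(3)/2), so the nearest
    lattice point is the closer of two coordinate-wise roundings.
    """
    u, v = x / scale, y / scale
    # Candidate 1: nearest point of the unshifted rectangular lattice.
    a1, b1 = round(u), round(v / SQRT3)
    d1 = (u - a1) ** 2 + (v - b1 * SQRT3) ** 2
    # Candidate 2: nearest point of the shifted rectangular lattice.
    a2, b2 = round(u - 0.5), round(v / SQRT3 - 0.5)
    d2 = (u - (a2 + 0.5)) ** 2 + (v - (b2 + 0.5) * SQRT3) ** 2
    # Doubled coordinates give a unique integer label for each cell.
    if d1 <= d2:
        return (2 * a1, 2 * b1)
    return (2 * a2 + 1, 2 * b2 + 1)


def subsample(points, responses, scale=1.0, per_cell=2):
    """Return indices of a stratified subsample (illustrative rule only)."""
    cells = defaultdict(list)
    # One pass over the data; cost does not depend on the number of cells.
    for i, (x, y) in enumerate(points):
        cells[hex_cell(x, y, scale)].append(i)
    chosen = []
    for idx in cells.values():
        # Assumed stand-in rule: spread the picks over the response ranks.
        idx.sort(key=lambda i: responses[i])
        n, k = len(idx), min(per_cell, len(idx))
        if k == 1:
            chosen.append(idx[n // 2])
        else:
            chosen.extend(idx[(j * (n - 1)) // (k - 1)] for j in range(k))
    return sorted(chosen)
```

Each observation is assigned to its cell by two roundings and one comparison, which is what keeps the total cost linear in the number of observations regardless of how many cells the partition contains. Extending the sketch beyond two dimensions would require a nearest-point routine for the corresponding lattice, which is not shown here.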

Funding Statement

This work is supported by the National Key R&D Program of China (2021YFA 1000300, 2021YFA 1000301), the National Natural Science Foundation of China (Grant nos. NSFC 12171033, NSFC 12022115, NSFC 11971465, NSFC 71988101), the China Institute of Marine Technology and Economy (Contact Number 2019A128), and the National Center for Mathematics and Interdisciplinary Sciences, CAS.

Acknowledgments

We thank the editor, an associate editor, and three referees for their constructive comments, which led to significant improvements in the paper.

Citation

Huaimin Diao, Dianpeng Wang, Xu He. "A distance metric-based space-filling subsampling method for nonparametric models." Electron. J. Statist. 18(2): 3247-3273, 2024. https://doi.org/10.1214/24-EJS2251

Information

Received: 1 February 2023; Published: 2024
First available in Project Euclid: 2 August 2024

Digital Object Identifier: 10.1214/24-EJS2251

Subjects:
Primary: 60K35
Secondary: 60K35

Keywords: big data, nonparametric model, space-filling design, tall data
