This paper considers distributed statistical inference for general symmetric statistics in the context of massive data with efficient computation. Estimation efficiency and asymptotic distributions of the distributed statistics are provided, which reveal different results between the nondegenerate and degenerate cases, and show that the number of data subsets plays an important role. Two distributed bootstrap methods are proposed and analyzed to approximate the underlying distribution of the distributed statistics with improved computational efficiency over existing methods. The accuracy of the distributional approximation by the bootstrap is studied theoretically. One of the methods, the pseudo-distributed bootstrap, is particularly attractive if the number of datasets is large, as it directly resamples the subset-based statistics, assumes less stringent conditions, and can be improved in performance by studentization.
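To make the idea of resampling subset-based statistics concrete, the following Python sketch may help. It is only an illustration under stated assumptions, not the authors' algorithm: the function names `distributed_statistic` and `resample_subset_statistics`, the choice of the sample variance as the symmetric statistic, the equal-size split, and the percentile interval are all hypothetical choices made here.

```python
import random
import statistics


def distributed_statistic(data, n_subsets, stat):
    """Average a statistic computed separately on equal-size data subsets.

    Assumption: the data are split into consecutive blocks of equal size;
    any trailing observations that do not fill a block are dropped.
    """
    size = len(data) // n_subsets
    blocks = [data[i * size:(i + 1) * size] for i in range(n_subsets)]
    block_stats = [stat(block) for block in blocks]
    return statistics.fmean(block_stats), block_stats


def resample_subset_statistics(block_stats, n_boot, rng):
    """Resample the subset-based statistics with replacement and average them.

    Only the K subset statistics are touched, never the raw data.
    """
    k = len(block_stats)
    return [statistics.fmean(rng.choices(block_stats, k=k)) for _ in range(n_boot)]


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Sample variance is a symmetric statistic (a U-statistic of degree 2).
estimate, block_stats = distributed_statistic(data, 100, statistics.variance)

# Approximate 95% percentile interval from 2000 bootstrap replicates.
boot_means = sorted(resample_subset_statistics(block_stats, 2000, rng))
lower, upper = boot_means[49], boot_means[1949]
print(f"estimate={estimate:.4f}, interval=({lower:.4f}, {upper:.4f})")
```

Each bootstrap replicate in this sketch costs only K operations rather than a pass over the full data, which suggests why such a scheme could be attractive when the number of subsets is large.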
Chen’s research is partially supported by National Natural Science Foundation of China grants 92046021, 12026607, 12071013 and 71973005 and LMEQF at Peking University.
"Distributed statistical inference for massive data." Ann. Statist. 49 (5) 2851 - 2869, October 2021. https://doi.org/10.1214/21-AOS2062