Open Access
Score-matching representative approach for big data analysis with generalized linear models
Keren Li, Jie Yang
Electron. J. Statist. 16(1): 592-635 (2022). DOI: 10.1214/21-EJS1965


We propose a fast and efficient strategy, called the representative approach, for big data analysis with generalized linear models, especially for distributed data with localization requirements or limited network bandwidth. Given a partition of a massive dataset, this approach constructs a representative data point for each data block and fits the target model using the representative dataset. In terms of time complexity, it is as fast as the subsampling approaches in the literature. In terms of efficiency, its accuracy in estimating parameters under a homogeneous partition is comparable with that of the divide-and-conquer method. Supported by comprehensive simulation studies and theoretical justifications, we conclude that mean representatives (MR) work well for linear models, or for generalized linear models with a flat inverse link function and moderate coefficients of continuous predictors. For general cases, we recommend the proposed score-matching representatives (SMR), which may improve the accuracy of estimators significantly by matching the score function values. In an illustrative application to the Airline on-time performance data, we show that the MR and SMR estimates are as good as the full-data estimate when the latter is available.
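The mean-representative idea for the linear-model case can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the quantile-binning partition, the choice to weight each representative by its block size, and the synthetic data are all hypothetical and introduced here for demonstration. The score-matching variant (SMR) is not shown, because the abstract does not specify its construction.

```python
import numpy as np


def mean_representatives(X, y, block_ids):
    """Collapse each block to one (mean x, mean y) point plus its block size."""
    _, inverse, counts = np.unique(
        block_ids, return_inverse=True, return_counts=True
    )
    k = len(counts)
    X_sum = np.zeros((k, X.shape[1]))
    y_sum = np.zeros(k)
    np.add.at(X_sum, inverse, X)  # per-block sums of the predictors
    np.add.at(y_sum, inverse, y)  # per-block sums of the response
    return X_sum / counts[:, None], y_sum / counts, counts


def weighted_least_squares(X, y, w):
    """Fit intercept + slopes by least squares with observation weights w."""
    A = np.column_stack([np.ones(len(y)), X])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta


# Synthetic data (assumption): a linear model with two continuous predictors.
rng = np.random.default_rng(0)
n = 100_000
beta_true = np.array([1.0, 2.0, -0.5])
X = rng.normal(size=(n, 2))
y = beta_true[0] + X @ beta_true[1:] + rng.normal(size=n)

# Assumed partition: 10 quantile bins per predictor, so each block holds
# observations with similar predictor values (100 blocks in total).
bins = []
for j in range(X.shape[1]):
    edges = np.quantile(X[:, j], np.linspace(0, 1, 11)[1:-1])
    bins.append(np.searchsorted(edges, X[:, j]))
block_ids = bins[0] * 10 + bins[1]

# Fit on the representatives only, weighting each by its block size
# (the weighting is an assumption of this sketch).
X_rep, y_rep, counts = mean_representatives(X, y, block_ids)
beta_mr = weighted_least_squares(X_rep, y_rep, counts)

# Full-data fit for comparison.
beta_full = weighted_least_squares(X, y, np.ones(n))

print("MR estimate:  ", beta_mr)
print("Full estimate:", beta_full)
```

In this sketch the model is fitted on 100 representative points instead of 100,000 observations, and only the block means and block sizes would need to leave each data block, which is the property that matters for distributed data with limited bandwidth.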

Funding Statement

Jie Yang was partly supported by NSF grant DMS-1924859.


This work was supported in part by the U.S. National Science Foundation. The authors thank the Editor, the Associate Editor and the referee for their constructive comments and suggestions. The authors also thank Mr. Lie He from the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland for his help with the COLA package.


Keren Li, Jie Yang. "Score-matching representative approach for big data analysis with generalized linear models." Electron. J. Statist. 16(1): 592-635, 2022.


Received: 1 November 2020; Published: 2022
First available in Project Euclid: 10 January 2022

Digital Object Identifier: 10.1214/21-EJS1965

Primary: 62J12
Secondary: 62R07

Keywords: big data regression, distributed database, divide and conquer, mean representative approach, subsampling, user data localization

Vol.16 • No. 1 • 2022