September 2021 Orthogonal subsampling for big data linear regression
Lin Wang, Jake Elmstedt, Weng Kee Wong, Hongquan Xu
Ann. Appl. Stat. 15(3): 1273-1290 (September 2021). DOI: 10.1214/21-AOAS1462

Abstract

The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big-data analysis. We propose an orthogonal subsampling (OSS) approach for big data with a focus on linear regression models. The approach is inspired by the fact that an orthogonal array of two levels provides the best experimental design for linear regression models in the sense that it minimizes the average variance of the estimated parameters and provides the best predictions. The merits of OSS are three-fold: (i) it is easy to implement and fast; (ii) it is suitable for distributed parallel computing and ensures the subsamples selected in different batches have no common data points; and (iii) it outperforms existing methods in minimizing the mean squared errors of the estimated parameters and maximizing the efficiencies of the selected subsamples. Theoretical and extensive numerical results show that the OSS approach is superior to existing subsampling approaches. It is also more robust to the presence of interactions among covariates, and, when interactions do exist, OSS provides more precise estimates of the interaction effects than existing methods. The advantages of OSS are also illustrated through analysis of real data.
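
The design fact that motivates the approach can be checked numerically. The following is a minimal sketch, not the OSS algorithm from the paper: it assumes a main-effects linear model with an intercept, uses the full 2^3 factorial as a small two-level orthogonal array, and compares it with an unbalanced set of eight corner points. The helper names `model_matrix` and `average_variance` are hypothetical and introduced only for illustration.

```python
import itertools
import numpy as np


def model_matrix(design):
    """Prepend an intercept column to a design with one row per run."""
    design = np.asarray(design, dtype=float)
    return np.column_stack([np.ones(len(design)), design])


def average_variance(design):
    """Mean diagonal entry of (X'X)^{-1}: the average variance of the
    least-squares estimates, in units of the error variance sigma^2."""
    X = model_matrix(design)
    return np.trace(np.linalg.inv(X.T @ X)) / X.shape[1]


# Full 2^3 factorial with levels -1 and +1: a two-level orthogonal array.
orthogonal = np.array(list(itertools.product([-1, 1], repeat=3)))

# Comparison design (an assumption for illustration): the same eight runs,
# except the run (-1, -1, -1) is replaced by a second copy of (1, 1, 1).
unbalanced = orthogonal.copy()
unbalanced[0] = orthogonal[7]

# For the orthogonal array, X'X is n times the identity, so every
# coefficient has variance sigma^2 / n.
X_orth = model_matrix(orthogonal)
gram_is_diagonal = np.allclose(X_orth.T @ X_orth, 8 * np.eye(4))

avg_var_orthogonal = average_variance(orthogonal)
avg_var_unbalanced = average_variance(unbalanced)

print("X'X = 8 * I for the orthogonal array:", gram_is_diagonal)
print("average variance, orthogonal array:", avg_var_orthogonal)
print("average variance, unbalanced design:", avg_var_unbalanced)
```

With covariates scaled to [-1, 1], each diagonal entry of (X'X)^{-1} is at least 1/n, and the bound is attained only when the columns of X are mutually orthogonal, which is why the unbalanced design has the larger average variance here.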

Acknowledgments

The authors would like to thank an Associate Editor and reviewers for their helpful comments and suggestions.

Citation


Lin Wang, Jake Elmstedt, Weng Kee Wong, Hongquan Xu. "Orthogonal subsampling for big data linear regression." Ann. Appl. Stat. 15(3): 1273-1290, September 2021. https://doi.org/10.1214/21-AOAS1462

Information

Received: 1 May 2020; Revised: 1 January 2021; Published: September 2021
First available in Project Euclid: 23 September 2021

MathSciNet: MR4316648
zbMATH: 1478.62384
Digital Object Identifier: 10.1214/21-AOAS1462

Keywords: data reduction, experimental design, optimal design, orthogonal array

Rights: Copyright © 2021 Institute of Mathematical Statistics

JOURNAL ARTICLE
18 PAGES

