Abstract
Streaming data refers to a data collection scheme where observations arrive sequentially and perpetually over time, making it challenging to fit into computer memory for statistical analysis. The ordinary least squares estimate for linear regression is sensitive to heavy-tailed errors and outliers, which are commonly encountered in applications. In this case, the Huber loss function is a useful criterion for robust regression. In this paper, we propose robust regression estimation and variable selection for streaming datasets. Unlike the renewable estimation generalized linear regression for streaming datasets, however, the Huber loss function is only first-order differentiable, which poses challenges to renewable estimation in both computation and theoretical development. To address the challenge, we introduce a new smoothed version of the Huber first derivative, which admits a fast and scalable algorithm to perform optimization for streaming data sets and achieves the best fitting of Huber function among different versions. Theoretically, the proposed statistics are shown to have the same asymptotic properties as the standard version computed on an entire data stream with the data batches pooled into one data set, without additional condition. The proposed methods are illustrated using current data and the summary statistics of historical data. Both simulations and real data analysis are conducted to illustrate the finite sample performance of the proposed methods.
Citation
Rong Jiang. Lei Liang. Keming Yu. "Renewable Huber estimation method for streaming datasets." Electron. J. Statist. 18 (1) 674 - 705, 2024. https://doi.org/10.1214/24-EJS2223
Information