Open Access
A partially linear framework for massive heterogeneous data
Tianqi Zhao, Guang Cheng, Han Liu
Ann. Statist. 44(4): 1400-1437 (August 2016). DOI: 10.1214/15-AOS1410


We consider a partially linear framework for modeling massive heterogeneous data. The major goal is to extract common features across all subpopulations while exploring heterogeneity of each subpopulation. In particular, we propose an aggregation type estimator for the commonality parameter that possesses the (nonasymptotic) minimax optimal bound and asymptotic distribution as if there were no heterogeneity. This oracle result holds when the number of subpopulations does not grow too fast. A plug-in estimator for the heterogeneity parameter is further constructed, and shown to possess the asymptotic distribution as if the commonality information were available. We also test the heterogeneity among a large number of subpopulations. All the above results require regularizing each subestimation as though it had the entire sample. Our general theory applies to the divide-and-conquer approach that is often used to deal with massive homogeneous data. A technical by-product of this paper is statistical inferences for general kernel ridge regression. Thorough numerical results are also provided to back up our theory.
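The aggregation and plug-in steps described in the abstract can be illustrated with a minimal sketch. The abstract does not state the model, so everything here is an assumption made for illustration: each subpopulation j is taken to follow y = x'β_j + f(z) + ε, with a shared nonparametric f (the commonality) and a subpopulation-specific β_j (the heterogeneity). The Gaussian kernel, the bandwidth, the fixed penalty `lam`, the simulated data, and all function names are hypothetical choices, not the authors' implementation or tuning rule.

```python
import numpy as np

def gauss_kernel(a, b, h=0.2):
    # Gaussian kernel matrix between 1-D point sets a and b (bandwidth h is an assumption)
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h ** 2))

def fit_subpop(X, z, y, lam):
    # Partially linear kernel ridge fit on one subpopulation:
    # minimize (1/n)||y - X beta - K alpha||^2 + lam * alpha' K alpha.
    n = len(y)
    K = gauss_kernel(z, z)
    A_inv = np.linalg.inv(K + n * lam * np.eye(n))
    R = n * lam * A_inv  # equals I - S, with smoother S = K (K + n*lam*I)^{-1}
    beta = np.linalg.solve(X.T @ R @ X, X.T @ R @ y)  # profiled estimate of beta_j
    alpha = A_inv @ (y - X @ beta)                    # f_hat_j(t) = k(t, z) @ alpha
    return beta, alpha

def aggregate_f(fits, zs, t):
    # Aggregation step: average the subpopulation estimates of f at points t
    return np.mean([gauss_kernel(t, z) @ alpha
                    for (_, alpha), z in zip(fits, zs)], axis=0)

def plug_in_beta(X, z, y, fits, zs):
    # Plug-in step: least squares of y - f_bar(z) on X
    resid = y - aggregate_f(fits, zs, z)
    return np.linalg.lstsq(X, resid, rcond=None)[0]

# Simulated example (hypothetical data, not from the paper)
rng = np.random.default_rng(0)
s, n = 4, 200                                  # subpopulations, samples per subpopulation
true_f = lambda t: np.sin(2 * np.pi * t)
true_betas = [np.array([1.0 + j, -0.5 * j]) for j in range(s)]
Xs, zs, ys = [], [], []
for b in true_betas:
    X = rng.normal(size=(n, 2))
    z = rng.uniform(size=n)
    Xs.append(X)
    zs.append(z)
    ys.append(X @ b + true_f(z) + 0.1 * rng.normal(size=n))

lam = 1e-4  # fixed illustrative penalty, not the rate prescribed by the theory
fits = [fit_subpop(X, z, y, lam) for X, z, y in zip(Xs, zs, ys)]
grid = np.linspace(0.05, 0.95, 50)
f_bar = aggregate_f(fits, zs, grid)
beta_check = [plug_in_beta(X, z, y, fits, zs) for X, z, y in zip(Xs, zs, ys)]
```

In this sketch every subpopulation uses the same `lam`; the abstract's remark about regularizing each subestimation as though it had the entire sample suggests the penalty would be chosen with the total sample size in mind rather than the subsample size.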



Tianqi Zhao, Guang Cheng, Han Liu. "A partially linear framework for massive heterogeneous data." Ann. Statist. 44(4): 1400-1437, August 2016.


Received: 1 June 2015; Revised: 1 October 2015; Published: August 2016
First available in Project Euclid: 7 July 2016

zbMATH: 1358.62050
MathSciNet: MR3519928
Digital Object Identifier: 10.1214/15-AOS1410

Primary: 62F25, 62G20
Secondary: 62F10, 62F12

Keywords: divide-and-conquer method, heterogeneous data, kernel ridge regression, massive data, partially linear model

Rights: Copyright © 2016 Institute of Mathematical Statistics
