Electronic Journal of Statistics

Additive partially linear models for massive heterogeneous data

Binhuan Wang, Yixin Fang, Heng Lian, and Hua Liang



We consider an additive partially linear framework for modelling massive heterogeneous data. The major goal is to extract multiple common features simultaneously across all sub-populations while exploring the heterogeneity of each sub-population. We propose aggregation-type estimators for the commonality parameters that attain the asymptotically optimal bounds and the same asymptotic distributions as if there were no heterogeneity. This oracle result holds when the number of sub-populations does not grow too fast and the tuning parameters are selected carefully. A plug-in estimator for the heterogeneity parameter is further constructed and shown to have the same asymptotic distribution as if the commonality information were available. Furthermore, we develop a heterogeneity test for the linear components and a homogeneity test for the non-linear components. The performance of the proposed methods is evaluated via simulation studies and an application to the Medicare Provider Utilization and Payment data.
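The divide-and-conquer aggregation idea behind the abstract can be illustrated with a minimal numerical sketch. This is not the authors' estimator: the helpers `spline_basis` and `fit_block` are hypothetical, a cubic truncated-power basis stands in for the regression splines used in the paper, and the model is simplified so that the non-linear component is common across sub-populations while each sub-population has its own linear coefficients. Each block is fitted separately, and the spline coefficients of the common curve are averaged across blocks to form the aggregated estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared interior knots so spline coefficients are comparable across blocks.
knots = np.linspace(0.2, 0.8, 4)

def spline_basis(z, knots):
    # Cubic truncated-power basis (a simple stand-in for B-splines);
    # no intercept column, since the linear part of the design carries one.
    cols = [z, z**2, z**3] + [np.clip(z - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def fit_block(x, z, y):
    # One least-squares fit per sub-population: intercept, heterogeneous
    # linear coefficients, and spline coefficients for the common curve.
    D = np.column_stack([np.ones_like(z), x, spline_basis(z, knots)])
    coef = np.linalg.lstsq(D, y, rcond=None)[0]
    return coef[1:3], coef[3:]   # (beta_s, gamma_s)

S, n = 8, 400                    # S sub-populations, n observations each
f = lambda z: np.sin(2 * np.pi * z)   # common non-linear component

betas, gammas = [], []
for s in range(S):
    x = rng.normal(size=(n, 2))
    z = rng.uniform(size=n)
    beta_s = np.array([1.0 + s / S, -1.0])   # sub-population-specific slopes
    y = x @ beta_s + f(z) + rng.normal(scale=0.3, size=n)
    b, g = fit_block(x, z, y)
    betas.append(b)
    gammas.append(g)

# Aggregated (averaged) estimate of the common curve, up to a constant shift.
gamma_bar = np.mean(gammas, axis=0)
zg = np.linspace(0.05, 0.95, 5)
f_hat = spline_basis(zg, knots) @ gamma_bar
```

Averaging per-block estimates of the shared component is the basic aggregation step; the paper's contribution is showing when such an estimator behaves, asymptotically, as if the full pooled sample had been used, and how fast the number of sub-populations may grow for that to hold.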

Article information

Electron. J. Statist., Volume 13, Number 1 (2019), 391-431.

Received: August 2017
First available in Project Euclid: 9 February 2019


Primary: 62G08: Nonparametric regression
Secondary: 62J99: None of the above, but in this section

Keywords: divide-and-conquer; homogeneity; heterogeneity; oracle property; regression splines

Creative Commons Attribution 4.0 International License.


Wang, Binhuan; Fang, Yixin; Lian, Heng; Liang, Hua. Additive partially linear models for massive heterogeneous data. Electron. J. Statist. 13 (2019), no. 1, 391--431. doi:10.1214/18-EJS1528. https://projecteuclid.org/euclid.ejs/1549681242


