The Annals of Statistics

A semiparametric model for cluster data

Wenyang Zhang, Jianqing Fan, and Yan Sun

Full-text: Open access


In the analysis of cluster data, the regression coefficients are frequently assumed to be the same across all clusters. This hampers the ability to study the varying impacts of factors on each cluster. In this paper, a semiparametric model is introduced to account for varying impacts of factors over clusters by using cluster-level covariates. It achieves parsimony of parametrization and allows the exploration of nonlinear interactions. The random effect in the semiparametric model also accounts for within-cluster correlation. A local linear-based estimation procedure is proposed for estimating the functional coefficients, residual variance and within-cluster correlation matrix. The asymptotic properties of the proposed estimators are established, and a method for constructing simultaneous confidence bands is proposed and studied. In addition, relevant hypothesis testing problems are addressed. Simulation studies are carried out to demonstrate the power of the proposed methods in finite samples. The proposed model and methods are used to analyse the second birth interval in Bangladesh, leading to some interesting findings.
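The abstract does not state the model explicitly, so the following is only an illustrative sketch of the general idea of local linear estimation of functional coefficients, not the authors' procedure. It assumes a generic varying-coefficient form y_ij = x_ij' a(u_i) + b_i + e_ij, where u_i is a scalar cluster-level covariate and b_i is a random intercept. For simplicity it fits under working independence, ignoring the within-cluster correlation that the paper models. The function names, the Epanechnikov kernel, the bandwidth and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def epanechnikov(t):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)


def local_linear_coef(u0, u, X, y, h):
    """Local linear estimate of a(u0) in y = X a(u) + noise.

    u : (n,) cluster-level covariate, repeated for each observation in a cluster
    X : (n, p) individual-level design matrix
    y : (n,) responses
    h : bandwidth (hypothetical choice; no data-driven selection here)
    """
    d = u - u0
    w = epanechnikov(d / h) / h                 # kernel weights around u0
    Z = np.hstack([X, X * d[:, None]])          # columns for a(u0) and a'(u0)
    sw = np.sqrt(w)
    # Weighted least squares via ordinary least squares on rescaled rows.
    beta, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
    return beta[: X.shape[1]]                   # first p entries estimate a(u0)


# Simulated cluster data (assumed design, for illustration only).
rng = np.random.default_rng(0)
n_clusters, m = 200, 5
u = np.repeat(rng.uniform(0.0, 1.0, n_clusters), m)
X = np.column_stack([np.ones(u.size), rng.normal(size=u.size)])
b = np.repeat(rng.normal(scale=0.5, size=n_clusters), m)   # random intercept
y = (
    X[:, 0] * np.sin(2 * np.pi * u)
    + X[:, 1] * (1.0 + u**2)
    + b
    + rng.normal(scale=0.3, size=u.size)
)

a_hat = local_linear_coef(0.5, u, X, y, h=0.15)
print(a_hat)  # compare with the true values a(0.5) = (0, 1.25)
```

Evaluating `local_linear_coef` over a grid of `u0` values traces out the estimated coefficient curves. A fuller treatment would also estimate the residual variance and within-cluster correlation, which this sketch leaves out.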

Article information

Ann. Statist., Volume 37, Number 5A (2009), 2377-2408.

First available in Project Euclid: 15 July 2009

Primary: 62G08: Nonparametric regression
Secondary: 62G10: Hypothesis testing; 62G15: Tolerance and confidence regions

Keywords: Varying-coefficient models; local linear modeling; cluster level variable; cluster effect


Zhang, Wenyang; Fan, Jianqing; Sun, Yan. A semiparametric model for cluster data. Ann. Statist. 37 (2009), no. 5A, 2377–2408. doi:10.1214/08-AOS662.


