The Annals of Statistics

GEE analysis of clustered binary data with diverging number of covariates

Lan Wang


Clustered binary data with a large number of covariates have become increasingly common in many scientific disciplines. This paper develops an asymptotic theory for generalized estimating equations (GEE) analysis of clustered binary data when the number of covariates grows to infinity with the number of clusters. In this “large n, diverging p” framework, we provide appropriate regularity conditions and establish the existence, consistency and asymptotic normality of the GEE estimator. Furthermore, we prove that the sandwich variance formula remains valid. Even when the working correlation matrix is misspecified, the use of the sandwich variance formula leads to an asymptotically valid confidence interval and Wald test for an estimable linear combination of the unknown parameters. The accuracy of the asymptotic approximation is examined via numerical simulations. We also discuss the “diverging p” asymptotic theory for general GEE. The results in this paper extend the recent elegant work of Xie and Yang [Ann. Statist. 31 (2003) 310–347] and Balan and Schiopu-Kratina [Ann. Statist. 33 (2005) 522–541] in the “fixed p” setting.
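To make the objects named in the abstract concrete, here is a minimal sketch of a GEE fit for clustered binary data, followed by the sandwich variance, a confidence interval and a Wald statistic for a linear combination of the parameters. It is an illustration under stated assumptions, not the paper's procedure: it assumes a logistic marginal mean and an independence working correlation (which is deliberately misspecified here, since the simulated responses are correlated within clusters), it keeps p fixed and small rather than diverging, and the function name `fit_gee_independence`, the simulation design and the contrast vector `a` are all hypothetical choices.

```python
import numpy as np

def fit_gee_independence(X, y, cluster, max_iter=50, tol=1e-10):
    """GEE for a logistic marginal mean with an independence working
    correlation (an assumption of this sketch). Returns the estimate
    and its sandwich covariance matrix."""
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = mu * (1.0 - mu)
        score = X.T @ (y - mu)                # estimating function
        bread = X.T @ (X * w[:, None])        # its negative derivative
        step = np.linalg.solve(bread, score)  # Newton-type update
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the pieces at the final estimate.
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = mu * (1.0 - mu)
    bread = X.T @ (X * w[:, None])
    resid = y - mu
    # "Meat": sum over clusters of outer products of cluster-level scores.
    meat = np.zeros((p, p))
    for c in np.unique(cluster):
        rows = cluster == c
        s = X[rows].T @ resid[rows]
        meat += np.outer(s, s)
    bread_inv = np.linalg.inv(bread)
    return beta, bread_inv @ meat @ bread_inv

# Hypothetical simulation: 400 clusters of size 4, three coefficients.
rng = np.random.default_rng(0)
n_clusters, m, p = 400, 4, 3
beta_true = np.array([-0.5, 1.0, -1.0])
cluster = np.repeat(np.arange(n_clusters), m)
X = np.column_stack([np.ones(n_clusters * m),
                     rng.normal(size=(n_clusters * m, p - 1))])
mu_true = 1.0 / (1.0 + np.exp(-(X @ beta_true)))

# Each observation uses either a cluster-level uniform or its own uniform.
# Every u is still Uniform(0, 1), so the marginal means equal mu_true,
# while responses within a cluster are positively dependent.
shared = rng.uniform(size=n_clusters)[cluster]
own = rng.uniform(size=n_clusters * m)
use_shared = rng.uniform(size=n_clusters * m) < 0.5
u = np.where(use_shared, shared, own)
y = (u < mu_true).astype(float)

beta_hat, cov = fit_gee_independence(X, y, cluster)

# Interval and Wald statistic for the linear combination a' beta.
a = np.array([0.0, 1.0, -1.0])
est = a @ beta_hat
se = np.sqrt(a @ cov @ a)
ci = (est - 1.96 * se, est + 1.96 * se)   # nominal 95% interval
wald = (est / se) ** 2                    # compare with chi-square(1)
print(beta_hat, ci, wald)
```

The covariance is built from cluster-level score sums rather than from the working correlation, which is why the interval can remain usable when the working correlation is wrong. Other working correlation structures would replace the identity weighting inside the score and bread terms.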

Article information

Ann. Statist., Volume 39, Number 1 (2011), 389–417.

First available in Project Euclid: 3 December 2010

Primary: 62F12: Asymptotic properties of estimators
Secondary: 62J12: Generalized linear models

Keywords: Clustered binary data; generalized estimating equations (GEE); high-dimensional covariates; sandwich variance formula


Wang, Lan. GEE analysis of clustered binary data with diverging number of covariates. Ann. Statist. 39 (2011), no. 1, 389–417. doi:10.1214/10-AOS846.


  • Balan, R. M. and Schiopu-Kratina, I. (2005). Asymptotic results with generalized estimating equations for longitudinal data. Ann. Statist. 33 522–541.
  • Bai, Z. and Wu, Y. (1994). Limiting behavior of M-estimators of regression coefficients in high dimensional linear models, I. Scale-dependent case. J. Multivariate Anal. 51 211–239.
  • Chaganty, N. R. and Joe, H. (2004). Efficiency of generalized estimating equations for binary responses. J. Roy. Statist. Soc. Ser. B 66 851–860.
  • Chen, K. and Jin, Z. (2006). Partial linear regression models for clustered data. J. Amer. Statist. Assoc. 101 195–204.
  • Chen, S. X., Peng, L. and Qin, Y.-L. (2009). Effects of data dimension on empirical likelihood. Biometrika 96 711–722.
  • Chiou, J.-M. and Müller, H.-G. (2005). Estimated estimating equations: Semiparametric inference for clustered and longitudinal data. J. Roy. Statist. Soc. Ser. B 67 531–553.
  • Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. In Math Challenges of 21st Century 1–32. Amer. Math. Soc., Providence, RI.
  • Fan, J. and Li, R. (2004). New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J. Amer. Statist. Assoc. 99 710–723.
  • Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In Proceedings of the International Congress of Mathematicians (M. Sanz-Sole, J. Soria, J. L. Varona and J. Verdera eds.) III 595–622. Eur. Math. Soc., Zürich.
  • Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20 101–148.
  • Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928–961.
  • Fitzmaurice, G. M. (1995). A caveat concerning independence estimating equations with multivariate binary data. Biometrics 51 309–317.
  • He, X., Fung, W. K. and Zhu, Z. Y. (2005). Robust estimation in generalized partial linear models for clustered data. J. Amer. Statist. Assoc. 100 1176–1184.
  • He, X. and Shao, Q. M. (2000). On parameters of increasing dimensions. J. Multivariate Anal. 73 120–135.
  • He, X., Zhu, Z. Y. and Fung, W. K. (2002). Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika 89 579–590.
  • Hjort, N. L., McKeague, I. W. and Van Keilegom, I. (2009). Extending the scope of empirical likelihood. Ann. Statist. 37 1079–1111.
  • Huang, J., Horowitz, J. L. and Ma, S. J. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 36 587–613.
  • Huang, J. Z., Zhang, L. and Zhou, L. (2007). Efficient estimation in marginal partially linear models for longitudinal/clustered data using splines. Scand. J. Statist. 34 451–477.
  • Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Statist. 1 799–821.
  • Lam, C. and Fan, J. (2008). Profile-kernel likelihood inference with diverging number of parameters. Ann. Statist. 36 2232–2260.
  • Li, B. (1997). On the consistency of generalized estimating equations. In Selected Proceedings of the Symposium on Estimating Functions 115–136. IMS Lecture Notes—Monograph Series 32. IMS, Hayward, CA.
  • Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalised linear models. Biometrika 73 13–22.
  • Lin, D. Y. and Ying, Z. (2001). Semiparametric and nonparametric regression analysis of longitudinal data. J. Amer. Statist. Assoc. 96 103–112.
  • Lin, X. and Carroll, R. J. (2001a). Semiparametric regression for clustered data using generalized estimating equations. J. Amer. Statist. Assoc. 96 1045–1056.
  • Lin, X. and Carroll, R. J. (2001b). Semiparametric regression for clustered data. Biometrika 88 1179–1185.
  • Mammen, E. (1989). Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Ann. Statist. 17 382–400.
  • McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London.
  • Ortega, J. M. and Rheinboldt, W. C. (1970). Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, San Diego.
  • Pan, W. (2002). Goodness-of-fit tests for GEE with correlated binary data. Scand. J. Statist. 29 101–110.
  • Portnoy, S. (1984). Asymptotic behavior of M estimators of p regression parameters when p²/n is large. I. Consistency. Ann. Statist. 12 1298–1309.
  • Portnoy, S. (1985). Asymptotic behavior of M estimators of p regression parameters when p²/n is large. II. Normal approximation. Ann. Statist. 13 1403–1417.
  • Portnoy, S. (1988). Asymptotic properties of likelihood methods for exponential families when the number of parameters tends to infinity. Ann. Statist. 16 356–366.
  • Xie, M. and Yang, Y. (2003). Asymptotics for generalized estimating equations with large cluster sizes. Ann. Statist. 31 310–347.
  • Wang, J. L., Xue, L. G., Zhu, L. X. and Chong, Y. S. (2010). Estimation for a partial-linear single-index model. Ann. Statist. 38 246–274.
  • Wang, L. (2010). Supplement to “GEE analysis of clustered binary data with diverging number of covariates.” DOI: 10.1214/10-AOS846SUPP.
  • Wang, N., Carroll, R. J. and Lin, X. (2005). Efficient semiparametric marginal estimation for longitudinal/clustered data. J. Amer. Statist. Assoc. 100 147–157.
  • Welsh, A. H. (1989). On M-processes and M-estimation. Ann. Statist. 17 337–361.
  • Zhu, L. P. and Zhu, L. X. (2009). On distribution-weighted partial least squares with diverging number of highly correlated predictors. J. Roy. Statist. Soc. Ser. B 71 525–548.
  • Zou, H. and Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Ann. Statist. 37 1733–1751.

Supplemental materials

  • Supplementary material: Supplement to “GEE analysis of clustered binary data with diverging number of covariates”. The proofs of (3.3), Lemma 3.5, (3.11) and Theorem 5.1 are provided in this supplementary article [Wang (2010)].