The Annals of Statistics

Model selection and structure specification in ultra-high dimensional generalised semi-varying coefficient models

Degui Li, Yuan Ke, and Wenyang Zhang

Abstract

In this paper, we study model selection and structure specification for generalised semi-varying coefficient models (GSVCMs), where the number of potential covariates is allowed to be larger than the sample size. We first propose a penalised likelihood method with the LASSO penalty function to obtain preliminary estimates of the functional coefficients. Then, using a quadratic approximation of the local log-likelihood function together with the adaptive group LASSO penalty (or the local linear approximation of the group SCAD penalty) built on the preliminary estimates, we introduce a novel penalised weighted least squares procedure to select the significant covariates and to identify the constant coefficients among the coefficients of the selected covariates, thereby specifying the semiparametric modelling structure. The proposed model selection and structure specification approach not only inherits many nice statistical properties from the local maximum likelihood estimation and nonconcave penalised likelihood methods, but is also computationally attractive thanks to the algorithm we develop to implement it. Under some mild conditions, we establish the asymptotic properties of the proposed model selection and estimation procedure, such as sparsity and the oracle property. We also conduct simulation studies to examine the finite sample performance of the proposed method, and finally apply the method to analyse a real data set, which leads to some interesting findings.
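As a rough illustration of the two-stage procedure summarised above, the display below sketches the modelling framework and a first-stage penalised local likelihood in generic notation; the symbols (link function g, log-likelihood contribution \ell, kernel K_h with bandwidth h, tuning parameter \lambda_1) are assumptions for this sketch and not necessarily the paper's exact formulation.

  E(Y_i \mid X_i, U_i) = g^{-1}\Big\{ \sum_{j=1}^{p} \beta_j(U_i) X_{ij} \Big\}, \qquad p \text{ possibly much larger than } n,

and, at a given location u, a local-constant version of the first-stage LASSO-penalised local log-likelihood takes the form

  \max_{a_1, \ldots, a_p} \; \sum_{i=1}^{n} \ell\Big\{ g^{-1}\Big( \sum_{j=1}^{p} a_j X_{ij} \Big),\, Y_i \Big\} K_h(U_i - u) \; - \; n \lambda_1 \sum_{j=1}^{p} |a_j|,

whose maximisers serve as preliminary estimates \hat\beta_j(u). In the second stage, the local log-likelihood is replaced by its quadratic expansion around these preliminary estimates, and a penalised weighted least squares criterion is minimised in which an adaptive group LASSO (or group SCAD) penalty treats each coefficient function as a group, so that an entire function can be shrunk to zero (covariate selection) or towards a constant (structure specification).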

Article information

Source
Ann. Statist., Volume 43, Number 6 (2015), 2676-2705.

Dates
Received: January 2015
Revised: June 2015
First available in Project Euclid: 7 October 2015

Permanent link to this document
https://projecteuclid.org/euclid.aos/1444222089

Digital Object Identifier
doi:10.1214/15-AOS1356

Mathematical Reviews number (MathSciNet)
MR3405608

Zentralblatt MATH identifier
1327.62262

Subjects
Primary: 62G08: Nonparametric regression
Secondary: 62G20: Asymptotic properties

Keywords
GSVCM; LASSO; local maximum likelihood; oracle estimation; SCAD; sparsity; ultra-high dimension

Citation

Li, Degui; Ke, Yuan; Zhang, Wenyang. Model selection and structure specification in ultra-high dimensional generalised semi-varying coefficient models. Ann. Statist. 43 (2015), no. 6, 2676–2705. doi:10.1214/15-AOS1356. https://projecteuclid.org/euclid.aos/1444222089

References

  • Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
  • Bradic, J., Fan, J. and Wang, W. (2011). Penalized composite quasi-likelihood for ultrahigh dimensional variable selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 73 325–349.
  • Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.
  • Cai, Z., Fan, J. and Li, R. (2000). Efficient estimation and inferences for varying-coefficient models. J. Amer. Statist. Assoc. 95 888–902.
  • Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95 759–771.
  • Cheng, M.-Y., Zhang, W. and Chen, L.-H. (2009). Statistical estimation in generalized multiparameter likelihood models. J. Amer. Statist. Assoc. 104 1179–1191.
  • Cheng, M.-Y., Honda, T., Li, J. and Peng, H. (2014). Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Ann. Statist. 42 1819–1849.
  • de Leon, A. P., Anderson, H. R., Bland, J. M., Strachan, D. P. and Bower, J. (1996). Effects of air pollution on daily hospital admissions for respiratory disease in London between 1987-88 and 1991-92. J. Epidemiol. Community Health 50 s63–s70.
  • Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–499.
  • Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Amer. Statist. Assoc. 106 544–557.
  • Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability 66. Chapman & Hall, London.
  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
  • Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 849–911.
  • Fan, J. and Lv, J. (2011). Non-concave penalized likelihood with NP-dimensionality. IEEE Trans. Inform. Theory 57 5467–5484.
  • Fan, J., Ma, Y. and Dai, W. (2014). Nonparametric independence screening in sparse ultra-high-dimensional varying coefficient models. J. Amer. Statist. Assoc. 109 1270–1284.
  • Fan, J., Samworth, R. and Wu, Y. (2009). Ultrahigh dimensional feature selection: Beyond the linear model. J. Mach. Learn. Res. 10 2013–2038.
  • Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist. 38 3567–3604.
  • Fan, Y. and Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 531–552.
  • Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficient models. Ann. Statist. 27 1491–1518.
  • Fan, J. and Zhang, W. (2000). Simultaneous confidence bands and hypothesis testing in varying-coefficient models. Scand. J. Stat. 27 715–731.
  • Huang, J., Horowitz, J. L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 36 587–613.
  • Huang, J. and Xie, H. (2007). Asymptotic oracle properties of SCAD-penalized least squares estimators. In Asymptotics: Particles, Processes and Inverse Problems. Institute of Mathematical Statistics Lecture Notes—Monograph Series 55 149–166. IMS, Beachwood, OH.
  • Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Ann. Statist. 33 1617–1642.
  • Janson, S. (1987). Maximal spacings in several dimensions. Ann. Probab. 15 274–280.
  • Kai, B., Li, R. and Zou, H. (2011). New efficient estimation and variable selection methods for semiparametric varying-coefficient partially linear models. Ann. Statist. 39 305–332.
  • Li, D., Ke, Y. and Zhang, W. (2013). Model selection in generalised semi-varying coefficient models with diverging number of potential covariates. Working paper, Dept. Mathematics, Univ. York.
  • Li, D., Ke, Y. and Zhang, W. (2015). Supplement to “Model selection and structure specification in ultra-high dimensional generalised semi-varying coefficient models.” DOI:10.1214/15-AOS1356SUPP.
  • Li, R. and Liang, H. (2008). Variable selection in semiparametric regression modeling. Ann. Statist. 36 261–286.
  • Li, J. and Zhang, W. (2011). A semiparametric threshold model for censored longitudinal data analysis. J. Amer. Statist. Assoc. 106 685–696.
  • Lian, H. (2012). Variable selection for high-dimensional generalized varying-coefficient models. Statist. Sinica 22 1563–1588.
  • Liu, J., Li, R. and Wu, R. (2014). Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J. Amer. Statist. Assoc. 109 266–274.
  • Schwartz, J. (1995). Short term fluctuations in air pollution and hospital admissions of the elderly for respiratory disease. Thorax 50 531–538.
  • Song, R., Yi, F. and Zou, H. (2014). On varying-coefficient independence screening for high-dimensional varying-coefficient models. Statist. Sinica 24 1735–1752.
  • Strachan, D. P. and Sanders, C. H. (1989). Damp housing and childhood asthma; respiratory effects of indoor air temperature and relative humidity. J. Epidemiol. Community Health 43 7–14.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 58 267–288.
  • Wang, L., Kai, B. and Li, R. (2009). Local rank inference for varying coefficient models. J. Amer. Statist. Assoc. 104 1631–1645.
  • Wang, L., Li, H. and Huang, J. Z. (2008). Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J. Amer. Statist. Assoc. 103 1556–1569.
  • Wang, H. and Xia, Y. (2009). Shrinkage estimation of the varying coefficient model. J. Amer. Statist. Assoc. 104 747–757.
  • Wei, F., Huang, J. and Li, H. (2011). Variable selection and estimation in high-dimensional varying-coefficient models. Statist. Sinica 21 1515–1540.
  • World Health Organization (2003). Health aspects of air pollution with particulate matter, ozone, and nitrogen dioxide. Rep. EUR/03/5042688, Bonn.
  • Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B. Stat. Methodol. 68 49–67.
  • Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.
  • Zhang, W., Fan, J. and Sun, Y. (2009). A semiparametric model for cluster data. Ann. Statist. 37 2377–2408.
  • Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist. 36 1567–1594.
  • Zhang, W. and Peng, H. (2010). Simultaneous confidence band and hypothesis test in generalised varying-coefficient models. J. Multivariate Anal. 101 1656–1680.
  • Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429.
  • Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist. 36 1509–1533.
  • Zou, H. and Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Ann. Statist. 37 1733–1751.

Supplemental materials

  • Supplement to “Model selection and structure specification in ultra-high dimensional generalised semi-varying coefficient models”. We provide the detailed proofs of the main results stated in Section 3 as well as some technical lemmas which are useful in the proofs.