The Annals of Statistics

Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data

Ming-Yen Cheng, Toshio Honda, Jialiang Li, and Heng Peng


Abstract

Ultra-high dimensional longitudinal data are increasingly common, and their analysis is challenging both theoretically and methodologically. We offer a new automatic procedure for finding a sparse semivarying coefficient model, which is widely accepted for longitudinal data analysis. Our proposed method first reduces the number of covariates to a moderate order by employing a screening procedure, and then identifies both the varying and constant coefficients using a group SCAD estimator, which is subsequently refined by accounting for the within-subject correlation. The screening procedure is based on working independence and B-spline marginal models. Under weaker conditions than those in the literature, we show that with high probability only irrelevant variables will be screened out, and the number of selected variables can be bounded by a moderate order. This allows the desirable sparsity and oracle properties of the subsequent structure identification step. Existing methods, by contrast, require some kind of iterative screening to achieve this; they therefore demand heavy computational effort, and consistency is not guaranteed. The refined semivarying coefficient model employs profile least squares, local linear smoothing and nonparametric covariance estimation, and is semiparametric efficient. We also suggest ways to implement the proposed methods and to select the tuning parameters. An extensive simulation study is summarized to demonstrate the finite-sample performance of the method, and yeast cell cycle data are analyzed.
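
As a rough illustration of the screening step described in the abstract, the Python sketch below ranks covariates by fitting one B-spline marginal model per covariate under working independence, that is, pooling all observations and ignoring within-subject correlation. It is a simplified sketch, not the authors' implementation: the marginal model has no intercept function, the utility (mean squared fitted value), the cutoff n_keep, the function names bspline_basis and marginal_screen, and the simulated data are all assumptions made for illustration. The group SCAD and refinement steps are not shown.

```python
import numpy as np

def bspline_basis(t, n_basis, degree=3):
    """B-spline basis on [0, 1] with equally spaced interior knots (Cox-de Boor)."""
    t = np.asarray(t, dtype=float)
    n_interior = n_basis - degree - 1          # requires n_basis >= degree + 1
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    knots = np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])
    n0 = len(knots) - 1
    # Degree 0: indicators of the half-open knot intervals.
    B = np.zeros((t.size, n0))
    for i in range(n0):
        B[:, i] = (knots[i] <= t) & (t < knots[i + 1])
    B[t == 1.0, n_basis - 1] = 1.0             # close the last non-empty interval
    # Raise the degree one step at a time.
    for k in range(1, degree + 1):
        B_new = np.zeros((t.size, n0 - k))
        for i in range(n0 - k):
            left = knots[i + k] - knots[i]
            right = knots[i + k + 1] - knots[i + 1]
            if left > 0:
                B_new[:, i] += (t - knots[i]) / left * B[:, i]
            if right > 0:
                B_new[:, i] += (knots[i + k + 1] - t) / right * B[:, i + 1]
        B = B_new
    return B

def marginal_screen(y, X, t, n_basis=6, n_keep=10):
    """Rank covariates by the size of the fitted marginal term beta_j(t) * X_j."""
    B = bspline_basis(t, n_basis)
    utilities = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        D = B * X[:, [j]]                      # spline design for beta_j(t) * X_j
        coef, *_ = np.linalg.lstsq(D, y, rcond=None)
        utilities[j] = np.mean((D @ coef) ** 2)
    keep = np.argsort(utilities)[::-1][:n_keep]
    return keep, utilities

# Simulated example (hypothetical): 100 subjects, 5 visits each, 200 covariates.
rng = np.random.default_rng(0)
n_rows, p = 100 * 5, 200
t = rng.uniform(0.0, 1.0, n_rows)              # observation times scaled to [0, 1]
X = rng.standard_normal((n_rows, p))
# Covariate 0 has a varying coefficient, covariate 1 a constant one.
y = 2.0 * np.sin(2.0 * np.pi * t) * X[:, 0] + 1.5 * X[:, 1]
y = y + 0.5 * rng.standard_normal(n_rows)
keep, utilities = marginal_screen(y, X, t, n_basis=6, n_keep=10)
```

In this toy setting the two active covariates should rank at the top of keep; in practice the number of retained covariates and the number of basis functions would have to be chosen with care, and the paper discusses how to select such tuning parameters.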

Article information

Source
Ann. Statist. Volume 42, Number 5 (2014), 1819-1849.

Dates
First available in Project Euclid: 11 September 2014

Permanent link to this document
https://projecteuclid.org/euclid.aos/1410440626

Digital Object Identifier
doi:10.1214/14-AOS1236

Mathematical Reviews number (MathSciNet)
MR3262469

Zentralblatt MATH identifier
1305.62169

Subjects
Primary: 62G08: Nonparametric regression

Keywords
Independence screening; longitudinal data; B-spline; SCAD; sparsity; oracle property

Citation

Cheng, Ming-Yen; Honda, Toshio; Li, Jialiang; Peng, Heng. Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Ann. Statist. 42 (2014), no. 5, 1819–1849. doi:10.1214/14-AOS1236. https://projecteuclid.org/euclid.aos/1410440626



Supplemental materials

  • Supplementary material: Supplement to “Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data”. Some lemmas, and proofs of Theorems 2.1–2.2 and Remark 1.