The Annals of Statistics

Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data

Xuming He, Lan Wang, and Hyokyoung Grace Hong


Abstract

We introduce a quantile-adaptive framework for nonlinear variable screening with high-dimensional heterogeneous data. This framework has two distinctive features: (1) it allows the set of active variables to vary across quantiles, which makes it flexible enough to accommodate heterogeneity; (2) it is model-free and avoids the difficult task of specifying the form of a statistical model in a high-dimensional space. Our nonlinear independence screening procedure employs spline approximations to model the marginal effects at a quantile level of interest. Under appropriate conditions on the quantile functions, and without requiring the existence of any moments, the new procedure is shown to enjoy the sure screening property in ultra-high dimensions. Furthermore, the quantile-adaptive framework can naturally handle censored data arising in survival analysis. We prove that the sure screening property remains valid when the response variable is subject to random right censoring. Numerical studies confirm the good performance of the proposed method for various semiparametric models and its effectiveness in extracting quantile-specific information from heteroscedastic data.
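The screening idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation. It makes a deliberate simplification: instead of general spline approximations, it uses piecewise-constant (degree-0) splines with knots at sample quantiles of each covariate, so the marginal quantile fit reduces to a sample quantile within each bin. Each covariate is then scored by the average squared gap between its fitted marginal quantile and the unconditional quantile of the response. The function names (`sample_quantile`, `marginal_utility`, `screen`, `simulate`), the number of bins, and the simulated model are all assumptions made for illustration, and censoring is not handled.

```python
import math
import random


def sample_quantile(values, alpha):
    """Empirical alpha-quantile: smallest order statistic whose ECDF is >= alpha."""
    s = sorted(values)
    k = max(0, math.ceil(alpha * len(s)) - 1)
    return s[k]


def marginal_utility(x, y, alpha, n_bins=5):
    """Score one covariate at quantile level alpha.

    Assumption: a piecewise-constant fit over equal-count bins of x stands in
    for a spline fit. The score is the mean squared difference between the
    bin-wise alpha-quantile of y and the unconditional alpha-quantile of y.
    """
    n = len(y)
    q_y = sample_quantile(y, alpha)
    order = sorted(range(n), key=lambda i: x[i])
    total = 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue
        fit = sample_quantile([y[i] for i in idx], alpha)
        total += len(idx) * (fit - q_y) ** 2
    return total / n


def screen(X_cols, y, alpha, keep):
    """Rank covariates (given as columns) by utility and return the top `keep`."""
    utils = [marginal_utility(col, y, alpha) for col in X_cols]
    ranked = sorted(range(len(X_cols)), key=lambda j: utils[j], reverse=True)
    return ranked[:keep], utils


def simulate(n=2000, p=20, seed=0):
    """Hypothetical heteroscedastic data: x0 shifts y, x1 only scales the noise."""
    rng = random.Random(seed)
    X_cols = [[rng.random() for _ in range(n)] for _ in range(p)]
    y = [3.0 * X_cols[0][i] + (0.2 + 2.0 * X_cols[1][i]) * rng.gauss(0.0, 1.0)
         for i in range(n)]
    return X_cols, y


X_cols, y = simulate()
for alpha in (0.5, 0.9):
    top, _ = screen(X_cols, y, alpha, keep=2)
    print(alpha, top)
```

In this simulated model the second covariate affects only the spread of the response, so it should tend to score well at an upper quantile such as 0.9 while looking like noise at the median. That is the sense in which the active set can vary across quantiles.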

Article information

Source
Ann. Statist., Volume 41, Number 1 (2013), 342-369.

Dates
First available in Project Euclid: 26 March 2013

Permanent link to this document
https://projecteuclid.org/euclid.aos/1364302746

Digital Object Identifier
doi:10.1214/13-AOS1087

Mathematical Reviews number (MathSciNet)
MR3059421

Zentralblatt MATH identifier
1295.62053

Subjects
Primary: 68Q32: Computational learning theory [See also 68T05]; 62G99: None of the above, but in this section
Secondary: 62E99: None of the above, but in this section; 62N99: None of the above, but in this section

Keywords
Feature screening; high dimension; polynomial splines; quantile regression; randomly censored data; sure independence screening

Citation

He, Xuming; Wang, Lan; Hong, Hyokyoung Grace. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann. Statist. 41 (2013), no. 1, 342-369. doi:10.1214/13-AOS1087. https://projecteuclid.org/euclid.aos/1364302746




Supplemental materials

  • Supplementary material: “Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data”. We provide additional technical details and numerical examples in the supplemental material.