The Annals of Statistics

Robust rank correlation based screening

Gaorong Li, Heng Peng, Jun Zhang, and Lixing Zhu

Full-text: Open access

Abstract

Independence screening is a variable selection method that uses a ranking criterion to select significant variables, particularly for statistical models of nonpolynomial dimensionality or in the “large $p$, small $n$” paradigm, where $p$ can be as large as an exponential of the sample size $n$. In this paper we propose a robust rank correlation screening (RRCS) method for ultra-high dimensional data. The new procedure is based on the Kendall $\tau$ correlation coefficient between the response and predictor variables rather than the Pearson correlation used by existing methods. The new method has four desirable features compared with existing independence screening methods. First, the sure independence screening property holds under only the existence of a second moment of the predictor variables, rather than exponential tails or the like, even when the number of predictor variables grows exponentially with the sample size. Second, it can handle semiparametric models, such as transformation regression models and single-index models with a monotone link function, without involving nonparametric estimation even when the models contain nonparametric functions. Third, the procedure is robust against outliers and influential points in the observations. Last, the use of indicator functions in rank correlation screening greatly simplifies the theoretical derivation, owing to the boundedness of the resulting statistics, compared with previous studies on variable screening. Simulations are carried out for comparison with existing methods, and a real data example is analyzed.
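The marginal ranking the abstract describes can be illustrated with a minimal sketch: compute the Kendall $\tau$ correlation between the response and each predictor, then keep the predictors with the largest absolute values. The function names, the $O(n^2)$ pairwise-concordance computation, and the toy data below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kendall_tau(x, y):
    # Kendall tau via pairwise sign concordance (O(n^2); fine for a sketch).
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return 2.0 * s / (n * (n - 1))

def rrcs_screen(X, y, d):
    # Rank predictors by |Kendall tau| with the response; keep the top d.
    taus = np.array([abs(kendall_tau(X[:, k], y)) for k in range(X.shape[1])])
    return np.argsort(taus)[::-1][:d]

# Toy example: y depends (monotonically) on X[:, 0] only,
# so screening should retain predictor 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = np.exp(X[:, 0]) + 0.1 * rng.normal(size=100)
selected = rrcs_screen(X, y, 3)
```

Because $\tau$ depends only on the ranks of the observations, the ranking is unchanged by any monotone transformation of the response, which is why the method extends to transformation and monotone single-index models without estimating the link.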

Article information

Source
Ann. Statist., Volume 40, Number 3 (2012), 1846-1877.

Dates
First available in Project Euclid: 16 October 2012

Permanent link to this document
https://projecteuclid.org/euclid.aos/1350394519

Digital Object Identifier
doi:10.1214/12-AOS1024

Mathematical Reviews number (MathSciNet)
MR3015046

Zentralblatt MATH identifier
1257.62067

Subjects
Primary: 62J02: General nonlinear regression 62J12: Generalized linear models
Secondary: 62F07: Ranking and selection 62F35: Robustness and adaptive procedures

Keywords
Variable selection; rank correlation; screening; dimensionality reduction; semiparametric models; large $p$, small $n$; SIS

Citation

Li, Gaorong; Peng, Heng; Zhang, Jun; Zhu, Lixing. Robust rank correlation based screening. Ann. Statist. 40 (2012), no. 3, 1846--1877. doi:10.1214/12-AOS1024. https://projecteuclid.org/euclid.aos/1350394519


References

  • Albright, S. C., Winston, W. L. and Zappe, C. J. (1999). Data Analysis and Decision Making with Microsoft Excel. Duxbury, Pacific Grove, CA.
  • Bickel, P. J. and Doksum, K. A. (1981). An analysis of transformations revisited. J. Amer. Statist. Assoc. 76 296–311.
  • Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 26 211–252.
  • Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35 2313–2351.
  • Cario, M. C. and Nelson, B. L. (1997). Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical report, Dept. Industrial Engineering and Management Sciences, Northwestern Univ., Evanston, IL.
  • Carroll, R. J. and Ruppert, D. (1988). Transformation and Weighting in Regression. Chapman & Hall, New York.
  • Channouf, N. and L’Ecuyer, P. (2009). Fitting a normal copula for a multivariate distribution with both discrete and continuous marginals. In Proceedings of the 2009 Winter Simulation Conference 352–358.
  • Cook, R. D. and Weisberg, S. (1991). Discussion of “Sliced inverse regression for dimension reduction,” by K. C. Li. J. Amer. Statist. Assoc. 86 328–332.
  • Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. In Aide-Memoire of a Lecture at AMS Conference on Math Challenges of 21st Century.
  • Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–451.
  • Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Amer. Statist. Assoc. 106 544–557.
  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
  • Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In International Congress of Mathematicians. Vol. III (M. Sanz-Sole, J. Soria, J. L. Varona and J. Verdera, eds.) 595–622. Eur. Math. Soc., Zürich.
  • Fan, J. and Lv, J. (2008). Sure independence screening for ultra-high dimensional feature space (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 70 849–911.
  • Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20 101–148.
  • Fan, J. and Lv, J. (2011). Non-concave penalized likelihood with NP-dimensionality. IEEE Trans. Inform. Theory 57 5467–5484.
  • Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928–961.
  • Fan, J., Samworth, R. and Wu, Y. (2009). Ultrahigh dimensional variable selection: Beyond the linear model. J. Mach. Learn. Res. 10 1829–1853.
  • Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist. 38 3567–3604.
  • Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35 109–148.
  • Ghosh, S. and Henderson, S. G. (2003). Behavior of the NORTA method for correlated random vector generation as the dimension increases. ACM Transactions on Modeling and Computer Simulation 13 276–294.
  • Hall, P. and Miller, H. (2009). Using generalized correlation to effect variable selection in very high dimensional problems. J. Comput. Graph. Statist. 18 533–550.
  • Han, A. K. (1987). Nonparametric analysis of a generalized regression model. The maximum rank correlation estimator. J. Econometrics 35 303–316.
  • Huang, J., Horowitz, J. L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 36 587–613.
  • Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics, 2nd ed. Wiley, Hoboken, NJ.
  • Kendall, M. G. (1938). A new measure of rank correlation. Biometrika 30 81–93.
  • Kendall, M. G. (1949). Rank and product-moment correlation. Biometrika 36 177–193.
  • Kendall, M. G. (1962). Rank Correlation Methods, 3rd ed. Griffin & Co, London.
  • Klaassen, C. A. J. and Wellner, J. A. (1997). Efficient estimation in the bivariate normal copula model: Normal margins are least favourable. Bernoulli 3 55–77.
  • Li, K.-C. (1991). Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc. 86 316–342.
  • Li, G., Peng, H. and Zhu, L. (2011). Nonconcave penalized $M$-estimation with a diverging number of parameters. Statist. Sinica 21 391–419.
  • Li, G. R., Peng, H., Zhang, J. and Zhu, L. X. (2012). Supplement to “Robust rank correlation based screening.” DOI:10.1214/12-AOS1024SUPP.
  • Lin, H. and Peng, H. (2013). Smoothed rank correlation of the linear transformation regression model. Comput. Statist. Data Anal. 57 615–630.
  • Lv, J. and Fan, Y. (2009). A unified approach to model selection and sparse recovery using regularized least squares. Ann. Statist. 37 3498–3528.
  • Nelsen, R. B. (2006). An Introduction to Copulas, 2nd ed. Springer, New York.
  • Pitt, M., Chan, D. and Kohn, R. (2006). Efficient Bayesian inference for Gaussian copula regression models. Biometrika 93 537–554.
  • Sen, P. K. (1968). Estimates of the regression coefficient based on Kendall’s tau. J. Amer. Statist. Assoc. 63 1379–1389.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 267–288.
  • van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645.
  • Wackerly, D. D., Mendenhall, W. and Scheaffer, R. L. (2002). Mathematical Statistics with Applications. Duxbury, Pacific Grove, CA.
  • Wang, H. (2012). Factor profiled sure independence screening. Biometrika 99 15–28.
  • Xu, P. R. and Zhu, L. X. (2010). Sure independence screening for marginal longitudinal generalized linear models. Unpublished manuscript.
  • Zhu, L. P., Li, L. X., Li, R. Z. and Zhu, L. X. (2011). Model-free feature screening for ultrahigh-dimensional data. J. Amer. Statist. Assoc. 106 1464–1474.
  • Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429.
  • Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301–320.
  • Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Ann. Statist. 36 1509–1566.

Supplemental materials

  • Supplementary material: Supplement to “Robust rank correlation based screening”. The supplement contains an application to cardiomyopathy microarray data and the proofs of Theorems 1–3 and Proposition 1, which require technical and lengthy arguments.