Electronic Journal of Statistics

Oracle P-values and variable screening

Ning Hao and Hao Helen Zhang

Full-text: Open access

Abstract

The concept of the P-value was proposed by Fisher to measure the inconsistency of data with a specified null hypothesis, and it plays a central role in statistical inference. In classical linear regression analysis, it is standard procedure to calculate P-values for regression coefficients based on the least squares estimator (LSE) to determine their significance. However, for high dimensional data, where the number of predictors exceeds the sample size, ordinary least squares is no longer applicable and there is no valid definition of P-values based on the LSE. It is also challenging to define sensible P-values for other high dimensional regression methods such as penalization and resampling methods. In this paper, we introduce a new concept called the oracle P-value to generalize traditional LSE-based P-values to high dimensional sparse regression models. We then propose several estimation procedures to approximate oracle P-values in real data analysis. We show that the oracle P-value framework is useful for developing new and powerful tools for high dimensional data analysis, including variable ranking, variable selection, and screening procedures with false discovery rate (FDR) control. Numerical examples are presented to demonstrate the performance of the proposed methods.
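
To make the idea of screening with FDR control concrete, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure applied to a vector of P-values. It is a generic illustration, not the estimation or screening procedure developed in the paper: the function name `benjamini_hochberg`, the level `q`, and the P-values in `p_vals` are all assumptions made up for this example, and the validity of the procedure depends on the P-values themselves being valid (and on suitable dependence conditions).

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of variables selected by the BH step-up rule.

    Hypothetical helper for illustration only; p_values is a plain list
    of P-values, one per candidate predictor, and q is the target FDR level.
    """
    m = len(p_values)
    # Rank the variables from smallest to largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose P-value is at most k * q / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    # Select the k variables with the smallest P-values.
    return sorted(order[:k])


# Made-up P-values for ten candidate predictors (illustrative only).
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(p_vals, q=0.05)
print(selected)
```

Sorting the same P-values in increasing order also gives a variable ranking, so ranking and screening can share one set of P-values once a sensible definition is available.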

Article information

Source
Electron. J. Statist. Volume 11, Number 2 (2017), 3251-3271.

Dates
Received: December 2014
First available in Project Euclid: 2 October 2017

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1506931546

Digital Object Identifier
doi:10.1214/17-EJS1284

Zentralblatt MATH identifier
06790060

Keywords
False discovery rate; high dimensional data; inference; P-value; variable selection

Rights
Creative Commons Attribution 4.0 International License.

Citation

Hao, Ning; Zhang, Hao Helen. Oracle P-values and variable screening. Electron. J. Statist. 11 (2017), no. 2, 3251-3271. doi:10.1214/17-EJS1284. https://projecteuclid.org/euclid.ejs/1506931546

