Statistical Science

High-Dimensional Inference: Confidence Intervals, $p$-Values and R-Software hdi

Ruben Dezeure, Peter Bühlmann, Lukas Meier, and Nicolai Meinshausen

Full-text: Open access

Abstract

We present a (selective) review of recent frequentist high-dimensional inference methods for constructing $p$-values and confidence intervals in linear and generalized linear models. We include a broad, comparative empirical study which complements the viewpoint from statistical methodology and theory. Furthermore, we introduce and illustrate the R-package hdi, which makes the different methods easily accessible and supports reproducibility.
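As a taste of the package interface, a minimal sketch of a typical workflow might look as follows. This is illustrative only: it assumes the hdi package (version 0.1-2 or later) is installed and uses `lasso.proj`, the package's implementation of the de-sparsified Lasso of Zhang and Zhang (2014) and van de Geer et al. (2014); the simulated design and coefficients are invented for the example.

```r
library(hdi)

## Simulated high-dimensional data: p > n, two active variables.
set.seed(1)
n <- 100; p <- 200
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + 2 * x[, 2] + rnorm(n)

## De-sparsified Lasso: p-values and confidence intervals
## for the individual regression coefficients.
fit <- lasso.proj(x, y)
fit$pval.corr[1:5]                       # multiplicity-adjusted p-values
confint(fit, parm = 1:5, level = 0.95)   # 95% confidence intervals
```

Other inference methods reviewed in the paper (e.g., multiple sample splitting via `multi.split`) follow the same call-and-extract pattern.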

Article information

Source
Statist. Sci., Volume 30, Number 4 (2015), 533–558.

Dates
First available in Project Euclid: 9 December 2015

Permanent link to this document
https://projecteuclid.org/euclid.ss/1449670857

Digital Object Identifier
doi:10.1214/15-STS527

Mathematical Reviews number (MathSciNet)
MR3432840

Zentralblatt MATH identifier
06946201

Keywords
Clustering; confidence interval; generalized linear model; high-dimensional statistical inference; linear model; multiple testing; $p$-value; R-software

Citation

Dezeure, Ruben; Bühlmann, Peter; Meier, Lukas; Meinshausen, Nicolai. High-Dimensional Inference: Confidence Intervals, $p$-Values and R-Software hdi. Statist. Sci. 30 (2015), no. 4, 533--558. doi:10.1214/15-STS527. https://projecteuclid.org/euclid.ss/1449670857

References

  • Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist. 43 2055–2085.
  • Belloni, A., Chernozhukov, V. and Kato, K. (2015). Uniform post-selection inference for least absolute deviation regression and other $Z$-estimation problems. Biometrika 102 77–94.
  • Belloni, A., Chernozhukov, V. and Wang, L. (2011). Square-root Lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 98 791–806.
  • Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 2369–2429.
  • Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29 1165–1188.
  • Benjamini, Y. and Yekutieli, D. (2005). False discovery rate-adjusted multiple confidence intervals for selected parameters. J. Amer. Statist. Assoc. 100 71–93.
  • Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Valid post-selection inference. Ann. Statist. 41 802–837.
  • Bogdan, M., van den Berg, E., Su, W. and Candès, E. (2013). Statistical estimation and testing via the sorted $\ell_{1}$ norm. Preprint. Available at arXiv:1310.1969.
  • Bogdan, M., van den Berg, E., Sabatti, C., Su, W. and Candès, E. (2014). SLOPE—adaptive variable selection via convex optimization. Preprint. Available at arXiv:1407.3824.
  • Breiman, L. (1996a). Bagging predictors. Mach. Learn. 24 123–140.
  • Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. Ann. Statist. 24 2350–2383.
  • Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19 1212–1242.
  • Bühlmann, P., Kalisch, M. and Meier, L. (2014). High-dimensional statistics with a view towards applications in biology. Annual Review of Statistics and Its Applications 1 255–278.
  • Bühlmann, P. and Mandozzi, J. (2014). High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput. Statist. 29 407–430.
  • Bühlmann, P., Meier, L. and van de Geer, S. (2014). Discussion: “A significance test for the Lasso”. Ann. Statist. 42 469–477.
  • Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.
  • Bühlmann, P. and van de Geer, S. (2015). High-dimensional inference in misspecified linear models. Electron. J. Stat. 9 1449–1473.
  • Candès, E. J. and Tao, T. (2006). Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory 52 5406–5425.
  • Chandrasekaran, V., Parrilo, P. A. and Willsky, A. S. (2012). Latent variable graphical model selection via convex optimization. Ann. Statist. 40 1935–1967.
  • Chatterjee, A. and Lahiri, S. N. (2013). Rates of convergence of the adaptive LASSO estimators to the oracle distribution and higher order refinements by the bootstrap. Ann. Statist. 41 1232–1259.
  • Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B. Stat. Methodol. 39 1–38.
  • Dezeure, R., Bühlmann, P., Meier, L. and Meinshausen, N. (2015). Supplement to “High-Dimensional Inference: Confidence Intervals, $p$-Values and R-Software hdi.” DOI:10.1214/15-STS527SUPP.
  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
  • Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 849–911.
  • Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20 101–148.
  • Fan, J., Xue, L. and Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. Ann. Statist. 42 819–849.
  • Fithian, W., Sun, D. and Taylor, J. (2014). Optimal inference after model selection. Preprint. Available at arXiv:1410.2597.
  • Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York.
  • Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
  • Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15 2869–2909.
  • Knight, K. and Fu, W. (2000). Asymptotics for Lasso-type estimators. Ann. Statist. 28 1356–1378.
  • Lee, J., Sun, D., Sun, Y. and Taylor, J. (2013). Exact post-selection inference, with application to the Lasso. Preprint. Available at arXiv:1311.6238.
  • Leeb, H. and Pötscher, B. M. (2003). The finite-sample distribution of post-model-selection estimators and uniform versus nonuniform approximations. Econometric Theory 19 100–142.
  • Liu, H. and Yu, B. (2013). Asymptotic properties of Lasso+mLS and Lasso+Ridge in sparse high-dimensional linear regression. Electron. J. Stat. 7 3124–3169.
  • Lockhart, R., Taylor, J., Tibshirani, R. J. and Tibshirani, R. (2014). A significance test for the Lasso. Ann. Statist. 42 413–468.
  • Mandozzi, J. and Bühlmann, P. (2015). Hierarchical testing in the high-dimensional setting with correlated variables. J. Amer. Statist. Assoc. To appear. DOI:10.1080/01621459.2015.1007209. Available at arXiv:1312.5556.
  • McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman &amp; Hall, London.
  • Meier, L., Meinshausen, N. and Dezeure, R. (2014). hdi: High-Dimensional Inference. R package version 0.1-2.
  • Meinshausen, N. (2008). Hierarchical testing of variable importance. Biometrika 95 265–278.
  • Meinshausen, N. (2015). Group-bound: Confidence intervals for groups of variables in sparse high-dimensional regression without assumptions on the design. J. R. Stat. Soc. Ser. B. Stat. Methodol. To appear. DOI:10.1111/rssb.12094. Available at arXiv:1309.3489.
  • Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34 1436–1462.
  • Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 417–473.
  • Meinshausen, N., Meier, L. and Bühlmann, P. (2009). $p$-values for high-dimensional regression. J. Amer. Statist. Assoc. 104 1671–1681.
  • Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge Univ. Press, Cambridge.
  • Reid, S., Tibshirani, R. and Friedman, J. (2013). A study of error variance estimation in Lasso regression. Preprint. Available at arXiv:1311.5274.
  • Shah, R. D. and Samworth, R. J. (2013). Variable selection with error control: Another look at stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 55–80.
  • Shao, J. and Deng, X. (2012). Estimation in high-dimensional linear models with deterministic design matrices. Ann. Statist. 40 812–831.
  • Spirtes, P., Glymour, C. and Scheines, R. (2000). Causation, Prediction, and Search, 2nd ed. MIT Press, Cambridge, MA.
  • Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika 99 879–898.
  • Taylor, J., Lockhart, R., Tibshirani, R. and Tibshirani, R. (2014). Exact post-selection inference for forward stepwise and least angle regression. Preprint. Available at arXiv:1401.3889.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 58 267–288.
  • van de Geer, S. (2007). The deterministic Lasso. In JSM Proceedings 140. American Statistical Association, Alexandria, VA.
  • van de Geer, S. (2014). Statistical theory for high-dimensional models. Preprint. Available at arXiv:1409.8557.
  • van de Geer, S. (2015). $\chi^{2}$-confidence sets in high-dimensional regression. Preprint. Available at arXiv:1502.07131.
  • van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3 1360–1392.
  • van de Geer, S., Bühlmann, P. and Zhou, S. (2011). The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electron. J. Stat. 5 688–749.
  • van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
  • Wasserman, L. (2014). Discussion: “A significance test for the Lasso”. Ann. Statist. 42 501–508.
  • Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection. Ann. Statist. 37 2178–2201.
  • Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B. Stat. Methodol. 68 49–67.
  • Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.
  • Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 217–242.
  • Zou, H. (2006). The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429.
  • Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–320.

Supplemental materials