The Annals of Statistics

Valid post-selection inference

Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao

Full-text: Open access

Abstract

It is common practice in statistical data analysis to perform data-driven variable selection and derive statistical inference from the resulting model. Such inference enjoys none of the guarantees that classical statistical theory provides for tests and confidence intervals when the model has been chosen a priori. We propose to produce valid “post-selection inference” by reducing the problem to one of simultaneous inference and hence suitably widening conventional confidence and retention intervals. Simultaneity is required for all linear functions that arise as coefficient estimates in all submodels. By purchasing “simultaneity insurance” for all possible submodels, the resulting post-selection inference is rendered universally valid under all possible model selection procedures. This inference is therefore generally conservative for particular selection procedures, but it is always less conservative than full Scheffé protection. Importantly it does not depend on the truth of the selected submodel, and hence it produces valid inference even in wrong models. We describe the structure of the simultaneous inference problem and give some asymptotic results.
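To make the idea of "simultaneity insurance" concrete, the following is a minimal Monte Carlo sketch, not the authors' implementation. It assumes Gaussian errors, a fixed design matrix with more rows than columns, and an error variance estimated from the full model; the function names `posi_directions` and `posi_constants` are hypothetical. For every submodel and every predictor in it, the sketch forms the unit vector that yields the corresponding coefficient estimate, then simulates the maximum absolute t-statistic over all of them. Its upper quantile is the widened interval multiplier, which is compared with the single-coefficient t multiplier and with the Scheffé multiplier, both estimated from the same draws.

```python
import itertools
import numpy as np


def posi_directions(X):
    """Return unit vectors l_{j.M}, one per (submodel M, predictor j in M)."""
    n, p = X.shape
    directions = []
    for size in range(1, p + 1):
        for M in itertools.combinations(range(p), size):
            # Row j of pinv(X_M) maps Y to the least-squares coefficient of
            # predictor j in submodel M; it is proportional to predictor j
            # adjusted for the other predictors in M.
            for row in np.linalg.pinv(X[:, list(M)]):
                directions.append(row / np.linalg.norm(row))
    return np.array(directions)


def posi_constants(X, alpha=0.05, n_sim=20000, seed=0):
    """Monte Carlo estimates of (single-t, all-submodels, Scheffe) multipliers.

    Assumes Gaussian errors and sigma estimated from the full model (n > p).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    L = posi_directions(X)                 # shape (p * 2**(p-1), n)
    Q, _ = np.linalg.qr(X)                 # orthonormal basis of col(X), (n, p)

    eps = rng.standard_normal((n_sim, n))  # pure-noise responses, sigma = 1
    fitted = eps @ Q                       # coordinates of the projection on col(X)
    rss = (eps ** 2).sum(axis=1) - (fitted ** 2).sum(axis=1)
    s = np.sqrt(rss / (n - p))             # full-model estimate of sigma

    t_all = np.abs(eps @ L.T) / s[:, None]  # |t_{j.M}| for every (j, M)
    single = np.quantile(t_all[:, 0], 1 - alpha)
    posi = np.quantile(t_all.max(axis=1), 1 - alpha)
    # Scheffe: supremum over all unit vectors in col(X), not just the l_{j.M}.
    scheffe = np.quantile(np.linalg.norm(fitted, axis=1) / s, 1 - alpha)
    return single, posi, scheffe


rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))           # illustrative design, 4 predictors
k_t, k_posi, k_scheffe = posi_constants(X, n_sim=5000)
print(k_t, k_posi, k_scheffe)
```

Under these assumptions, an interval of the form estimate ± `k_posi` × standard error would be used for whichever coefficient in whichever submodel is eventually selected. Because the maximum is taken over a finite set of directions inside the column space of the design, the simulated multiplier cannot exceed the Scheffé one, which mirrors the abstract's statement about Scheffé protection. The enumeration grows as p·2^(p−1), so this brute-force version is only practical for a small number of predictors.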

Article information

Source
Ann. Statist. Volume 41, Number 2 (2013), 802-837.

Dates
First available in Project Euclid: 29 May 2013

Permanent link to this document
https://projecteuclid.org/euclid.aos/1369836961

Digital Object Identifier
doi:10.1214/12-AOS1077

Mathematical Reviews number (MathSciNet)
MR3099122

Zentralblatt MATH identifier
1267.62080

Subjects
Primary: 62J05: Linear regression; 62J15: Paired and multiple comparisons

Keywords
Linear regression; model selection; multiple comparison; family-wise error; high-dimensional inference; sphere packing

Citation

Berk, Richard; Brown, Lawrence; Buja, Andreas; Zhang, Kai; Zhao, Linda. Valid post-selection inference. Ann. Statist. 41 (2013), no. 2, 802–837. doi:10.1214/12-AOS1077. https://projecteuclid.org/euclid.aos/1369836961.



References

  • Angrist, J. D. and Pischke, J. S. (2009). Mostly Harmless Econometrics. Princeton Univ. Press, Princeton.
  • Bahadur, R. R. (1966). A note on quantiles in large samples. Ann. Math. Statist. 37 577–580.
  • Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Supplement to “Valid post-selection inference.” DOI:10.1214/12-AOS1077SUPP.
  • Brown, L. (1967). The conditional level of Student’s $t$ test. Ann. Math. Statist. 38 1068–1071.
  • Buehler, R. J. and Feddersen, A. P. (1963). Note on a conditional property of Student’s $t$. Ann. Math. Statist. 34 1098–1100.
  • Claeskens, G. and Hjort, N. L. (2003). The focused information criterion (with discussion). J. Amer. Statist. Assoc. 98 900–945.
  • Dijkstra, T. K. and Veldkamp, J. H. (1988). Data-driven selection of regressors and the bootstrap. In On Model Uncertainty and Its Statistical Implications (T. K. Dijkstra, ed.) 17–38. Springer, Berlin.
  • Hall, P. and Carroll, R. J. (1989). Variance function estimation in regression: The effect of estimating the mean. J. R. Stat. Soc. Ser. B Stat. Methodol. 51 3–14.
  • Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
  • Hjort, N. L. and Claeskens, G. (2003). Frequentist model average estimators. J. Amer. Statist. Assoc. 98 879–899.
  • Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med. 2 e124. DOI:10.1371/journal.pmed.0020124.
  • Kabaila, P. (1998). Valid confidence intervals in regression after variable selection. Econometric Theory 14 463–482.
  • Kabaila, P. (2009). The coverage properties of confidence regions after model selection. International Statistical Review 77 405–414.
  • Kabaila, P. and Leeb, H. (2006). On the large-sample minimal coverage probability of confidence intervals after model selection. J. Amer. Statist. Assoc. 101 619–629.
  • Leeb, H. (2006). The distribution of a linear predictor after model selection: Unconditional finite-sample distributions and asymptotic approximations. In Optimality. Institute of Mathematical Statistics Lecture Notes—Monograph Series 49 291–311. IMS, Beachwood, OH.
  • Leeb, H. and Pötscher, B. M. (2003). The finite-sample distribution of post-model-selection estimators and uniform versus nonuniform approximations. Econometric Theory 19 100–142.
  • Leeb, H. and Pötscher, B. M. (2005). Model selection and inference: Facts and fiction. Econometric Theory 21 21–59.
  • Leeb, H. and Pötscher, B. M. (2006a). Performance limits for estimators of the risk or distribution of shrinkage-type estimators, and some general lower risk-bound results. Econometric Theory 22 69–97.
  • Leeb, H. and Pötscher, B. M. (2006b). Can one estimate the conditional distribution of post-model-selection estimators? Ann. Statist. 34 2554–2591.
  • Leeb, H. and Pötscher, B. M. (2008a). Model selection. In The Handbook of Financial Time Series (T. G. Andersen, R. A. Davis, J. P. Kreiss and T. Mikosch, eds.) 785–821. Springer, New York.
  • Leeb, H. and Pötscher, B. M. (2008b). Can one estimate the unconditional distribution of post-model-selection estimators? Econometric Theory 24 338–376.
  • Leeb, H. and Pötscher, B. M. (2008c). Sparse estimators and the oracle property, or the return of Hodges’ estimator. J. Econometrics 142 201–211.
  • Moore, D. S. and McCabe, G. P. (2003). Introduction to the Practice of Statistics, 4th ed. Freeman, New York.
  • Olshen, R. A. (1973). The conditional level of the $F$-test. J. Amer. Statist. Assoc. 68 692–698.
  • Pötscher, B. M. (1991). Effects of model selection on inference. Econometric Theory 7 163–185.
  • Pötscher, B. M. (2006). The distribution of model averaging estimators and an impossibility result regarding its estimation. In Time Series and Related Topics. Institute of Mathematical Statistics Lecture Notes—Monograph Series 52 113–129. IMS, Beachwood, OH.
  • Pötscher, B. M. and Leeb, H. (2009). On the distribution of penalized maximum likelihood estimators: The LASSO, SCAD, and thresholding. J. Multivariate Anal. 100 2065–2082.
  • Pötscher, B. M. and Schneider, U. (2009). On the distribution of the adaptive LASSO estimator. J. Statist. Plann. Inference 139 2775–2790.
  • Pötscher, B. M. and Schneider, U. (2010). Confidence sets based on penalized maximum likelihood estimators in Gaussian regression. Electron. J. Stat. 4 334–360.
  • Pötscher, B. M. and Schneider, U. (2011). Distributional results for thresholding estimators in high-dimensional Gaussian regression models. Electron. J. Stat. 5 1876–1934.
  • Scheffé, H. (1959). The Analysis of Variance. Wiley, New York.
  • Sen, P. K. (1979). Asymptotic properties of maximum likelihood estimators based on conditional specification. Ann. Statist. 7 1019–1033.
  • Sen, P. K. and Saleh, A. K. M. E. (1987). On preliminary test and shrinkage $M$-estimation in linear models. Ann. Statist. 15 1580–1592.
  • Wyner, A. D. (1967). Random packings and coverings of the unit $n$-sphere. Bell System Tech. J. 46 2111–2118.

Supplemental materials

  • Supplementary material: Supplement to “Valid post-selection inference”. The online supplement contains the following sections: B.1 The Full Model Interpretation of Parameters (as a contrast to the sub-model interpretation adopted in this article). B.2 “Omitted Variables Bias” (which is not bias in the sense of this article). B.3 Proof of Corollary 4.2 (strong error control). B.4 Alternative PoSI Guarantees (conditional on selection). B.5 PoSI P-Value Adjustment for Model Selection. B.6 The PoSI Process [the PoSI problem in terms of a $(j,\mathrm{M})$-indexed process]. B.7 Figures (illustrating PoSI polytopes and results of a simulation for exchangeable designs).