Abstract
Modern data-driven approaches to modeling make extensive use of covariate/model selection. Such selection incurs a cost: it invalidates classical statistical inference. A conservative remedy to the problem was proposed by Berk et al. (Ann. Statist. 41 (2013) 802–837) and further extended by Bachoc, Preinerstorfer and Steinberger (2016). These proposals, labeled “PoSI methods,” provide valid inference after arbitrary model selection. They are computationally NP-hard and have limitations in their theoretical justifications. We therefore propose computationally efficient confidence regions, named “UPoSI” (“U” is for “uniform” or “universal”), and prove large-$p$ asymptotics for them. We do this for linear OLS regression allowing misspecification of the normal linear model, for both fixed and random covariates, and for independent as well as some types of dependent data. We start by proving a general equivalence result for the post-selection inference problem and a simultaneous inference problem in a setting that strips inessential features still present in a related result of Berk et al. (Ann. Statist. 41 (2013) 802–837). We then construct valid PoSI confidence regions that are the first to have vastly improved computational efficiency in that the required computation times grow only quadratically rather than exponentially with the total number $p$ of covariates. These are also the first PoSI confidence regions with guaranteed asymptotic validity when the total number of covariates $p$ diverges (almost exponentially) with the sample size $n$. Under standard tail assumptions, we only require $(\log p)^{7}=o(n)$ and $k=o(\sqrt{n/\log p})$, where $k$ ($\le p$) is the largest number of covariates (model size) considered for selection. We study various properties of these confidence regions, including their Lebesgue measures, and compare them theoretically with those proposed previously.
Citation
Arun K. Kuchibhotla, Lawrence D. Brown, Andreas Buja, Junhui Cai, Edward I. George, Linda H. Zhao. "Valid post-selection inference in model-free linear regression." Ann. Statist. 48 (5) 2953–2981, October 2020. https://doi.org/10.1214/19-AOS1917
Information