Valid post-selection inference in model-free linear regression

Arun K. Kuchibhotla; Lawrence D. Brown; Andreas Buja; Junhui Cai; Edward I. George; Linda H. Zhao

doi:10.1214/19-AOS1917

Ann. Statist. 48(5): 2953-2981 (October 2020). DOI: 10.1214/19-AOS1917

Abstract

Modern data-driven approaches to modeling make extensive use of covariate/model selection. Such selection incurs a cost: it invalidates classical statistical inference. A conservative remedy to the problem was proposed by Berk et al. (Ann. Statist. 41 (2013) 802–837) and further extended by Bachoc, Preinerstorfer and Steinberger (2016). These proposals, labeled “PoSI methods,” provide valid inference after arbitrary model selection. They are computationally NP-hard and have limitations in their theoretical justifications. We therefore propose computationally efficient confidence regions, named “UPoSI” (“U” is for “uniform” or “universal”) and prove large-$p$ asymptotics for them. We do this for linear OLS regression allowing misspecification of the normal linear model, for both fixed and random covariates, and for independent as well as some types of dependent data. We start by proving a general equivalence result for the post-selection inference problem and a simultaneous inference problem in a setting that strips inessential features still present in a related result of Berk et al. (Ann. Statist. 41 (2013) 802–837). We then construct valid PoSI confidence regions that are the first to have vastly improved computational efficiency in that the required computation times grow only quadratically rather than exponentially with the total number $p$ of covariates. These are also the first PoSI confidence regions with guaranteed asymptotic validity when the total number of covariates $p$ diverges (almost exponentially) with the sample size $n$. Under standard tail assumptions, we only require $(\log p)^{7}=o(n)$ and $k=o(\sqrt{n/\log p})$ where $k$ ($\le p$) is the largest number of covariates (model size) considered for selection. We study various properties of these confidence regions, including their Lebesgue measures, and compare them theoretically with those proposed previously.
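To give a rough sense of the contrast between exponential and quadratic growth in $p$ mentioned above, the following sketch counts two quantities. The function names are hypothetical, and using the number of distinct entries of a symmetric $p \times p$ matrix as a proxy for "quadratic cost" is an assumption made for illustration only; it is not taken from the abstract and is not the authors' algorithm.

```python
from math import comb


def num_submodels(p: int, k: int) -> int:
    """Count nonempty covariate subsets of size at most k out of p covariates."""
    return sum(comb(p, j) for j in range(1, k + 1))


def num_pairwise_terms(p: int) -> int:
    """Distinct entries of a symmetric p x p matrix (illustrative quadratic proxy)."""
    return p * (p + 1) // 2


if __name__ == "__main__":
    # With k = p, the submodel count is 2**p - 1, which grows exponentially,
    # while the pairwise count grows only quadratically in p.
    for p in (10, 20, 40):
        print(p, num_submodels(p, p), num_pairwise_terms(p))
```

Restricting the model size to $k < p$ reduces the first count, but under this illustrative proxy it still grows much faster than the second as $p$ increases.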

Citation

Arun K. Kuchibhotla, Lawrence D. Brown, Andreas Buja, Junhui Cai, Edward I. George, Linda H. Zhao. "Valid post-selection inference in model-free linear regression." Ann. Statist. 48 (5) 2953-2981, October 2020. https://doi.org/10.1214/19-AOS1917

Information

Received: 1 October 2018; Revised: 1 September 2019; Published: October 2020

First available in Project Euclid: 19 September 2020

MathSciNet: MR4152630

Digital Object Identifier: 10.1214/19-AOS1917

Subjects:

Primary: 62F12, 62F25, 62F40, 62J05

Keywords: concentration inequalities, high-dimensional linear regression, model selection, multiplier bootstrap, Orlicz norms, simultaneous inference, uniform consistency
