Abstract
We study linear subset regression in the context of a high-dimensional linear model. Consider $y = \vartheta + \theta'z + \epsilon$ with univariate response $y$ and a $d$-vector of random regressors $z$, and a submodel where $y$ is regressed on a set of $p$ explanatory variables that are given by $x = M'z$, for some $d\times p$ matrix $M$. Here, “high-dimensional” means that the number $d$ of available explanatory variables in the overall model is much larger than the number $p$ of variables in the submodel. In this paper, we present Pinsker-type results for prediction of $y$ given $x$. In particular, we show that the mean squared prediction error of the best linear predictor of $y$ given $x$ is close to the mean squared prediction error of the corresponding Bayes predictor $\mathbb{E}[y \mid x]$, provided only that $p/\log d$ is small. We also show that the mean squared prediction error of the (feasible) least-squares predictor computed from $n$ independent observations of $(y,x)$ is close to that of the Bayes predictor, provided only that both $p/\log d$ and $p/n$ are small. Our results hold uniformly in the regression parameters and over large collections of distributions for the design variables $z$.
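The setup can be illustrated numerically. The following is a minimal simulation sketch (not from the paper), assuming a Gaussian design with identity covariance and taking $M$ to select the first $p$ coordinates of $z$; under these assumptions the Bayes predictor $\mathbb{E}[y \mid x]$ is itself linear in $x$, so the sketch only exhibits the $p/n$ effect, i.e., the feasible least-squares predictor approaching the Bayes benchmark. All dimensions and distributional choices here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Regime of the paper: d much larger than p, with p/log(d) and p/n small.
d, p, n = 1_000, 5, 200
n_test = 10_000

vartheta = 1.0
theta = rng.normal(size=d) / np.sqrt(d)   # coefficient d-vector of the overall model
sigma_eps = 1.0

M = np.zeros((d, p))                      # illustrative choice: M selects the
M[:p, :p] = np.eye(p)                     # first p coordinates of z

def draw(m):
    """Draw m observations of (x, y) from the overall model y = vartheta + theta'z + eps."""
    z = rng.normal(size=(m, d))           # Gaussian design, identity covariance (assumption)
    y = vartheta + z @ theta + sigma_eps * rng.normal(size=m)
    x = z @ M                             # submodel regressors x = M'z
    return x, y

# Feasible least-squares predictor fitted from n observations of (y, x).
x_tr, y_tr = draw(n)
X = np.column_stack([np.ones(n), x_tr])
coef = np.linalg.lstsq(X, y_tr, rcond=None)[0]

# Under the Gaussian/identity-covariance assumption, the coordinates of z not
# entering x are independent of x, so E[y|x] = vartheta + theta[:p]'x.
x_te, y_te = draw(n_test)
pred_ls = np.column_stack([np.ones(n_test), x_te]) @ coef
pred_bayes = vartheta + x_te @ theta[:p]

print("MSPE least-squares:", np.mean((y_te - pred_ls) ** 2))
print("MSPE Bayes        :", np.mean((y_te - pred_bayes) ** 2))
```

With $p/n$ small, the two printed mean squared prediction errors should be close; shrinking $n$ toward $p$ widens the gap.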
Citation
Lukas Steinberger, Hannes Leeb. "Prediction when fitting simple models to high-dimensional data." Ann. Statist. 47 (3): 1408–1442, June 2019. https://doi.org/10.1214/18-AOS1719