Prediction when fitting simple models to high-dimensional data

Lukas Steinberger; Hannes Leeb

doi:10.1214/18-AOS1719

Ann. Statist. 47(3): 1408-1442 (June 2019). DOI: 10.1214/18-AOS1719

Abstract

We study linear subset regression in the context of a high-dimensional linear model. Consider $y=\vartheta+\theta'z+\epsilon$ with univariate response $y$ and a $d$-vector of random regressors $z$, and a submodel where $y$ is regressed on a set of $p$ explanatory variables that are given by $x=M'z$, for some $d\times p$ matrix $M$. Here, “high-dimensional” means that the number $d$ of available explanatory variables in the overall model is much larger than the number $p$ of variables in the submodel. In this paper, we present Pinsker-type results for prediction of $y$ given $x$. In particular, we show that the mean squared prediction error of the best linear predictor of $y$ given $x$ is close to the mean squared prediction error of the corresponding Bayes predictor $\mathbb{E}[y\mid x]$, provided only that $p/\log d$ is small. We also show that the mean squared prediction error of the (feasible) least-squares predictor computed from $n$ independent observations of $(y,x)$ is close to that of the Bayes predictor, provided only that both $p/\log d$ and $p/n$ are small. Our results hold uniformly in the regression parameters and over large collections of distributions for the design variables $z$.
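The setup in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: it assumes a standard Gaussian design for $z$, takes $M$ to select the first $p$ coordinates, and uses hypothetical values for $d$, $p$, $n$, $\vartheta$, and $\theta$. Under these assumptions the Bayes predictor $\mathbb{E}[y\mid x]$ is exactly linear, so the sketch only illustrates the role of $p/n$ for the least-squares predictor; it says nothing about the non-Gaussian designs covered by the paper.

```python
import numpy as np


def simulate_submodel_mse(d=500, p=5, n=200, n_test=2000, noise_sd=1.0, seed=0):
    """Compare test MSE of the least-squares submodel fit with the Bayes predictor.

    All defaults are hypothetical illustration values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    intercept = 1.0                              # plays the role of vartheta
    theta = rng.normal(size=d) / np.sqrt(d)      # assumed coefficient vector
    M = np.eye(d)[:, :p]                         # assumed d x p selection matrix

    def draw(m):
        # z has i.i.d. standard normal coordinates (an assumption of this sketch)
        z = rng.normal(size=(m, d))
        y = intercept + z @ theta + noise_sd * rng.normal(size=m)
        return z @ M, y                          # submodel regressors x = M'z

    x_train, y_train = draw(n)
    x_test, y_test = draw(n_test)

    # Feasible least-squares predictor of y given x, with an intercept column.
    design = np.column_stack([np.ones(n), x_train])
    beta, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    pred_ls = np.column_stack([np.ones(n_test), x_test]) @ beta

    # With independent Gaussian coordinates, E[y | x] = vartheta + theta[:p]'x.
    pred_bayes = intercept + x_test @ theta[:p]

    mse_ls = float(np.mean((y_test - pred_ls) ** 2))
    mse_bayes = float(np.mean((y_test - pred_bayes) ** 2))
    return mse_ls, mse_bayes


if __name__ == "__main__":
    mse_ls, mse_bayes = simulate_submodel_mse()
    print(f"least-squares MSE: {mse_ls:.3f}, Bayes MSE: {mse_bayes:.3f}")
```

Raising `n` while holding `p` fixed should bring the two estimated errors closer together, in line with the requirement that $p/n$ be small.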

Citation

Lukas Steinberger, Hannes Leeb. "Prediction when fitting simple models to high-dimensional data." Ann. Statist. 47 (3) 1408-1442, June 2019. https://doi.org/10.1214/18-AOS1719

Information

Received: 1 May 2016; Revised: 1 April 2017; Published: June 2019

First available in Project Euclid: 13 February 2019

zbMATH: 07053513

MathSciNet: MR3911117

Digital Object Identifier: 10.1214/18-AOS1719

Subjects:

Primary: 62H99

Secondary: 62F99, 62G99

Keywords: Bayes predictor, best linear predictor, high-dimensional models, linear subset regression, non-Gaussian data, Pinsker theorem, small sample size
