The Annals of Statistics
- Ann. Statist.
- Volume 47, Number 6 (2019), 3438-3469.
Bootstrapping and sample splitting for high-dimensional, assumption-lean inference
Alessandro Rinaldo, Larry Wasserman, and Max G’Sell
Abstract
Several new methods have been recently proposed for performing valid inference after model selection. An older method is sample splitting: use part of the data for model selection and the rest for inference. In this paper, we revisit sample splitting combined with the bootstrap (or the Normal approximation). We show that this leads to a simple, assumption-lean approach to inference and we establish results on the accuracy of the method. In fact, we find new bounds on the accuracy of the bootstrap and the Normal approximation for general nonlinear parameters with increasing dimension which we then use to assess the accuracy of regression inference. We define new parameters that measure variable importance and that can be inferred with greater accuracy than the usual regression coefficients. Finally, we elucidate an inference-prediction trade-off: splitting increases the accuracy and robustness of inference but can decrease the accuracy of the predictions.
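The split-and-infer recipe described in the abstract can be sketched in a few lines. The following toy example is ours, not the authors' code: one half of the data drives a crude selection rule, the held-out half yields a Normal-approximation confidence interval for the selected coefficient. The single-predictor setup and the correlation threshold 0.1 are illustrative assumptions.

```python
import random

# Simulate data from y = 2*x + noise.
random.seed(0)
n = 200
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

# Step 1: model selection on the first half D1 only.
half = n // 2
x1, y1 = x[:half], y[:half]
mx, my = sum(x1) / half, sum(y1) / half
sxy = sum((a - mx) * (b - my) for a, b in zip(x1, y1))
sxx = sum((a - mx) ** 2 for a in x1)
syy = sum((b - my) ** 2 for b in y1)
corr = sxy / (sxx * syy) ** 0.5
selected = abs(corr) > 0.1  # keep x only if it looks predictive on D1

# Step 2: inference on the held-out half D2, conditional on the selection,
# using the Normal approximation for the least-squares slope.
x2, y2 = x[half:], y[half:]
m2x, m2y = sum(x2) / len(x2), sum(y2) / len(y2)
s2xx = sum((a - m2x) ** 2 for a in x2)
beta = sum((a - m2x) * (b - m2y) for a, b in zip(x2, y2)) / s2xx
resid = [b - m2y - beta * (a - m2x) for a, b in zip(x2, y2)]
sigma2 = sum(r * r for r in resid) / (len(x2) - 2)
se = (sigma2 / s2xx) ** 0.5
ci = (beta - 1.96 * se, beta + 1.96 * se)
print(f"selected={selected}, beta={beta:.3f}, CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

Because selection sees only D1, the interval computed on D2 is valid conditional on the selected model; using the full sample for both steps is exactly the post-selection problem the paper discusses.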
Article information
Source
Ann. Statist., Volume 47, Number 6 (2019), 3438-3469.
Dates
Received: April 2018
Revised: November 2018
First available in Project Euclid: 31 October 2019
Permanent link to this document
https://projecteuclid.org/euclid.aos/1572487399
Digital Object Identifier
doi:10.1214/18-AOS1784
Mathematical Reviews number (MathSciNet)
MR4025748
Subjects
Primary: 62F40 (Bootstrap, jackknife and other resampling methods); 62F35 (Robustness and adaptive procedures)
Secondary: 62J05 (Linear regression); 62G09 (Resampling methods); 62G20 (Asymptotic properties)
Keywords
Sample splitting; bootstrap; regression; assumption-lean
Citation
Rinaldo, Alessandro; Wasserman, Larry; G’Sell, Max. Bootstrapping and sample splitting for high-dimensional, assumption-lean inference. Ann. Statist. 47 (2019), no. 6, 3438--3469. doi:10.1214/18-AOS1784. https://projecteuclid.org/euclid.aos/1572487399
References
- [1] Anastasiou, A. and Gaunt, R. E. (2016). Multivariate normal approximation of the maximum likelihood estimator via the delta method. Preprint. Available at arXiv:1609.03970.
- [2] Anastasiou, A. and Ley, C. (2015). New simpler bounds to assess the asymptotic normality of the maximum likelihood estimator. Preprint. Available at arXiv:1508.04948.
- [3] Anastasiou, A. and Reinert, G. (2017). Bounds for the normal approximation of the maximum likelihood estimator. Bernoulli 23 191–218.
- [4] Andrews, D. W. K. and Guggenberger, P. (2009). Hybrid and size-corrected subsampling methods. Econometrica 77 721–762.
- [5] Bachoc, F., Leeb, H. and Pötscher, B. M. (2014). Valid confidence intervals for post-model-selection predictors. Available at arXiv:1412.4605.
- [6] Bachoc, F., Preinerstorfer, D. and Steinberger, L. (2016). Uniformly valid confidence intervals post-model-selection. Available at arXiv:1611.01043.
- [7] Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist. 43 2055–2085.
- [8] Barnard, G. A. (1974). Discussion of “Cross-validatory choice and assessment of statistical predictions,” by M. Stone. J. Roy. Statist. Soc. Ser. B 133–135.
- [9] Belloni, A., Chernozhukov, V. and Hansen, C. B. (2013). Inference for High-Dimensional Sparse Econometric Models. vol. 3 245–295. Cambridge Univ. Press.
- [10] Belloni, A., Chernozhukov, V. and Kato, K. (2015). Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems. Biometrika 102 77–94.
- [11] Bentkus, V. Y. (1985). Lower bounds for the rate of convergence in the central limit theorem in Banach spaces. Lith. Math. J. 25 312–320.
- [12] Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Valid post-selection inference. Ann. Statist. 41 802–837.
- [13] Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19 1212–1242.
- [14] Bühlmann, P. and van de Geer, S. (2015). High-dimensional inference in misspecified linear models. Electron. J. Stat. 9 1449–1473.
- [15] Buja, A., Berk, R., Brown, L., George, E., Pitkin, E., Traskin, M., Zhao, L. and Zhang, K. (2015). Models as approximations—A conspiracy of random regressors and model deviations against classical inference in regression. Statist. Sci.
- [16] Candès, E., Fan, Y., Janson, L. and Lv, J. (2018). Panning for gold: “model-X” knockoffs for high dimensional controlled variable selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 80 551–577.
- [17] Chatterjee, A. and Lahiri, S. N. (2011). Bootstrapping lasso estimators. J. Amer. Statist. Assoc. 106 608–625.
- [18] Chatterjee, A. and Lahiri, S. N. (2013). Rates of convergence of the adaptive LASSO estimators to the oracle distribution and higher order refinements by the bootstrap. Ann. Statist. 41 1232–1259.
- [19] Chen, L. H. Y. and Shao, Q.-M. (2007). Normal approximation for nonlinear statistics using a concentration inequality approach. Bernoulli 13 581–599.
- [20] Chernozhukov, V., Chetverikov, D. and Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Statist. 41 2786–2819.
- [21] Chernozhukov, V., Chetverikov, D. and Kato, K. (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probab. Theory Related Fields 162 47–70.
- [22] Chernozhukov, V., Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. Ann. Probab. 45 2309–2352.
- [23] Cox, D. R. (1975). A note on data-splitting for the evaluation of significance levels. Biometrika 62 441–444.
- [24] Dezeure, R., Bühlmann, P., Meier, L. and Meinshausen, N. (2015). High-dimensional inference: Confidence intervals, $p$-values and R-software hdi. Statist. Sci. 30 533–558.
- [25] Dezeure, R., Bühlmann, P. and Zhang, C.-H. (2017). High-dimensional simultaneous inference with the bootstrap. TEST 26 685–719.
- [26] Efron, B. (2014). Estimation and accuracy after model selection. J. Amer. Statist. Assoc. 109 991–1007.
- [27] Faraway, J. J. (1995). Data splitting strategies for reducing the effect of model selection on inference. Technical report, Citeseer.
- [28] Fithian, W., Sun, D. L. and Taylor, J. (2014). Optimal inference after model selection. Available at arXiv:1410.2597.
- [29] Hartigan, J. A. (1969). Using subsample values as typical values. J. Amer. Statist. Assoc. 64 1303–1317.
- [30] Hjort, N. L. and Claeskens, G. (2003). Frequentist model average estimators. J. Amer. Statist. Assoc. 98 879–899.
- [31] Hsu, D., Kakade, S. M. and Zhang, T. (2014). Random design analysis of ridge regression. Found. Comput. Math. 14 569–600.
- [32] Hurvich, C. M. and Tsai, C. (1990). The impact of model selection on inference in linear regression. Amer. Statist. 44 214–217.
- [33] Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15 2869–2909.
- [34] Lee, J. D., Sun, D. L., Sun, Y. and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. Ann. Statist. 44 907–927.
- [35] Leeb, H. and Pötscher, B. M. (2008). Can one estimate the unconditional distribution of post-model-selection estimators? Econometric Theory 24 338–376.
- [36] Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J. and Wasserman, L. (2018). Distribution-free predictive inference for regression. J. Amer. Statist. Assoc. 113 1094–1111.
- [37] Li, K.-C. (1989). Honest confidence regions for nonparametric regression. Ann. Statist. 17 1001–1008.
- [38] Lockhart, R., Taylor, J., Tibshirani, R. J. and Tibshirani, R. (2014). A significance test for the lasso. Ann. Statist. 42 413–468.
- [39] Loftus, J. R. and Taylor, J. E. (2015). Selective inference in regression models with groups of variables. Preprint. Available at arXiv:1511.01478.
- [40] Markovic, J. and Taylor, J. (2016). Bootstrap inference after using multiple queries for model selection. Available at arXiv:1612.07811.
- [41] Markovic, J., Xia, L. and Taylor, J. (2017). Comparison of prediction errors: Adaptive p-values after cross-validation. Available at arXiv:1703.06559.
- [42] Meinshausen, N. (2015). Group bound: Confidence intervals for groups of variables in sparse high dimensional regression without assumptions on the design. J. R. Stat. Soc. Ser. B. Stat. Methodol. 77 923–945.
- [43] Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 417–473.
- [44] Meinshausen, N., Meier, L. and Bühlmann, P. (2009). $p$-values for high-dimensional regression. J. Amer. Statist. Assoc. 104 1671–1681.
- [45] Mentch, L. and Hooker, G. (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J. Mach. Learn. Res. 17 Paper No. 26, 41.
- [46] Miller, A. J. (1990). Subset Selection in Regression. Monographs on Statistics and Applied Probability 40. CRC Press, London.
- [47] Moran, P. A. P. (1973). Dividing a sample into two parts. A statistical dilemma. Sankhyā Ser. A 35 329–333.
- [48] Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Series in Behavioral Science: Quantitative Methods.
- [49] Nazarov, F. (2003). On the maximal perimeter of a convex set in ${\mathbb{R}}^{n}$ with respect to a Gaussian measure. In Geometric Aspects of Functional Analysis. Lecture Notes in Math. 1807 169–187. Springer, Berlin.
- [50] Nickl, R. and van de Geer, S. (2013). Confidence sets in sparse regression. Ann. Statist. 41 2852–2876.
- [51] Picard, R. R. and Berk, K. N. (1990). Data splitting. Amer. Statist. 44 140–147.
- [52] Pinelis, I. and Molzon, R. (2016). Optimal-order bounds on the rate of convergence to normality in the multivariate delta method. Electron. J. Stat. 10 1001–1063.
- [53] Portnoy, S. (1987). A central limit theorem applicable to robust regression estimators. J. Multivariate Anal. 22 24–50.
- [54] Pouzo, D. (2015). Bootstrap consistency for quadratic forms of sample averages with increasing dimension. Electron. J. Stat. 9 3046–3097.
- [55] Rinaldo, A., Wasserman, L. and G’Sell, M. (2019). Supplement to “Bootstrapping and sample splitting for high-dimensional, assumption-lean inference.” DOI:10.1214/18-AOS1784SUPP.
- [56] Shah, R. D. and Bühlmann, P. (2018). Goodness-of-fit tests for high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 80 113–135.
- [57] Shah, R. D. and Samworth, R. J. (2013). Variable selection with error control: Another look at stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 55–80.
- [58] Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88 486–494.
- [59] Shao, Q.-M., Zhang, K. and Zhou, W.-X. (2016). Stein’s method for nonlinear statistics: A brief survey and recent progress. J. Statist. Plann. Inference 168 68–89.
- [60] Shorack, G. R. (2000). Probability for Statisticians. Springer Texts in Statistics. Springer, New York.
- [61] Tian, X. and Taylor, J. (2018). Selective inference with a randomized response. Ann. Statist. 46 679–710.
- [62] Tibshirani, R. J., Rinaldo, A., Tibshirani, R. and Wasserman, L. (2018). Uniform asymptotic inference and the bootstrap after model selection. Ann. Statist. 46 1255–1287.
- [63] Tibshirani, R. J., Taylor, J., Lockhart, R. and Tibshirani, R. (2016). Exact post-selection inference for sequential regression procedures. J. Amer. Statist. Assoc. 111 600–620.
- [64] van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
- [65] Wager, S., Hastie, T. and Efron, B. (2014). Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. J. Mach. Learn. Res. 15 1625–1651.
- [66] Wasserman, L. (2014). Discussion: “A significance test for the lasso” [MR3210970]. Ann. Statist. 42 501–508.
- [67] Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection. Ann. Statist. 37 2178–2201.
- [68] Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 217–242.
- [69] Zhang, X. and Cheng, G. (2017). Simultaneous inference for high-dimensional linear models. J. Amer. Statist. Assoc. 112 757–768.
Supplemental materials
- Supplement to “Bootstrapping and sample splitting for high-dimensional, assumption-lean inference”. This supplement provides additional material, including numerical examples, comments on other approaches, an alternative bootstrap approach, and algorithmic statements of the studied procedures. The supplement also includes proofs of many of the results stated in this paper. Digital Object Identifier: doi:10.1214/18-AOS1784SUPP