The Annals of Statistics

Distributed testing and estimation under sparse high dimensional models

Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu

Abstract

This paper studies hypothesis testing and parameter estimation in the context of the divide-and-conquer algorithm. In a unified likelihood-based framework, we propose new test statistics and point estimators obtained by aggregating various statistics from $k$ subsamples of size $n/k$, where $n$ is the sample size. In both low dimensional and sparse high dimensional settings, we address the important question of how large $k$ can be, as $n$ grows large, such that the loss of efficiency due to the divide-and-conquer algorithm is negligible. In other words, the resulting estimators have the same inferential efficiencies and estimation rates as an oracle with access to the full sample. Thorough numerical results are provided to back up the theory.
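For concreteness, here is a minimal NumPy sketch of the aggregation idea described in the abstract, in the low dimensional linear model: split the sample into $k$ disjoint subsamples of size roughly $n/k$, estimate on each, and average. The function name `dc_ols` and the simulated data are our illustration, not code from the paper; the paper's high dimensional procedure instead aggregates debiased penalized subsample estimators and thresholds the average, which is not reproduced here.

```python
import numpy as np

def dc_ols(X, y, k):
    """Average the OLS estimates computed on k disjoint subsamples.

    Illustrative sketch of divide-and-conquer aggregation; not the
    paper's high dimensional (debiased, thresholded) estimator.
    """
    n = X.shape[0]
    estimates = []
    for idx in np.array_split(np.arange(n), k):  # subsamples of size ~ n/k
        beta_hat, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        estimates.append(beta_hat)
    return np.mean(estimates, axis=0)  # aggregate by simple averaging

# Toy check: the averaged estimator is close to the full-sample OLS.
rng = np.random.default_rng(0)
n, p, k = 10_000, 5, 20
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta + rng.standard_normal(n)
print(dc_ols(X, y, k))                        # distributed estimate
print(np.linalg.lstsq(X, y, rcond=None)[0])   # full-sample benchmark
```

In this toy setting the averaged estimator matches the full-sample fit closely; the paper's theory quantifies how large $k$ may grow before such aggregation loses efficiency relative to the full-sample benchmark.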

Article information

Source
Ann. Statist., Volume 46, Number 3 (2018), 1352-1382.

Dates
Received: September 2015
Revised: December 2016
First available in Project Euclid: 3 May 2018

Permanent link to this document
https://projecteuclid.org/euclid.aos/1525313085

Digital Object Identifier
doi:10.1214/17-AOS1587

Mathematical Reviews number (MathSciNet)
MR3798006

Zentralblatt MATH identifier
1392.62060

Subjects
Primary: 62F05: Asymptotic properties of tests; 62F10: Point estimation
Secondary: 62F12: Asymptotic properties of estimators

Keywords
Divide and conquer; debiasing; massive data; thresholding

Citation

Battey, Heather; Fan, Jianqing; Liu, Han; Lu, Junwei; Zhu, Ziwei. Distributed testing and estimation under sparse high dimensional models. Ann. Statist. 46 (2018), no. 3, 1352-1382. doi:10.1214/17-AOS1587. https://projecteuclid.org/euclid.aos/1525313085


References

  • Battey, H., Fan, J., Liu, H., Lu, J. and Zhu, Z. (2018). Supplement to “Distributed testing and estimation under sparse high dimensional models.” DOI:10.1214/17-AOS1587SUPP.
  • Bickel, P. J. (1975). One-step Huber estimates in the linear model. J. Amer. Statist. Assoc. 70 428–434.
  • Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.
  • Candès, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35 2313–2351.
  • Chen, X. and Xie, M. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statist. Sinica 24 1655–1684.
  • Chernozhukov, V., Chetverikov, D. and Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Statist. 41 2786–2819.
  • Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. Chapman & Hall, London.
  • Fan, J., Guo, S. and Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 74 37–65.
  • Fan, J., Han, F. and Liu, H. (2014). Challenges of big data analysis. Nat. Sci. Rev. 1 293–314.
  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
  • Fan, J. and Lv, J. (2011). Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inform. Theory 57 5467–5484.
  • Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist. 38 3567–3604.
  • Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15 2869–2909.
  • Kallenberg, O. (1997). Foundations of Modern Probability. Springer, New York.
  • Kleiner, A., Talwalkar, A., Sarkar, P. and Jordan, M. I. (2014). A scalable bootstrap for massive data. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 795–816.
  • Lee, J. D., Liu, Q., Sun, Y. and Taylor, J. E. (2017). Communication-efficient sparse regression. J. Mach. Learn. Res. 18 Paper No. 5.
  • Liu, Q. and Ihler, A. T. (2014). Distributed estimation, information loss and exponential families. In Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence and K. Q. Weinberger, eds.) 1098–1106. MIT Press, Cambridge, MA.
  • Loh, P.-L. and Wainwright, M. J. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems 26 (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Q. Weinberger, eds.) 476–484. Curran Associates Inc., Red Hook, NY.
  • Loh, P.-L. and Wainwright, M. J. (2015). Regularized $M$-estimators with nonconvexity: Statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16 559–616.
  • Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
  • Negahban, S. N., Yu, B., Wainwright, M. J. and Ravikumar, P. (2009). A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams and A. Culotta, eds.) 1348–1356. Curran Associates Inc., Red Hook, NY.
  • Ning, Y. and Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Statist. 45 158–195.
  • Rosenblatt, J. D. and Nadler, B. (2016). On the optimality of averaging in distributed statistical learning. Inf. Inference 5 379–404.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 58 267–288.
  • van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
  • Wang, Z., Liu, H. and Zhang, T. (2014). Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Ann. Statist. 42 2164–2201.
  • Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.
  • Zhang, Y., Duchi, J. and Wainwright, M. (2015). Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res. 16 3299–3340.
  • Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statist. Sci. 27 576–593.
  • Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 217–242.
  • Zhao, T., Cheng, G. and Liu, H. (2016). A partially linear framework for massive heterogeneous data. Ann. Statist. 44 1400–1437.

Supplemental materials

  • Supplement to “Distributed testing and estimation under sparse high dimensional models”. All technical lemmas, proofs and low dimensional results are collected in the supplementary materials for reference.