Statistical inference for model parameters in stochastic gradient descent
Xi Chen, Jason D. Lee, Xin T. Tong, Yichen Zhang
Ann. Statist. 48(1): 251–273 (February 2020). DOI: 10.1214/18-AOS1801
Open Access
Abstract

The stochastic gradient descent (SGD) algorithm has been widely used in statistical estimation for large-scale data due to its computational and memory efficiency. While most existing works focus on the convergence of the objective function or the error of the obtained solution, we investigate statistical inference for the true model parameters based on SGD when the population loss function is strongly convex and satisfies certain smoothness conditions.
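The estimator studied in this line of work is the Polyak–Ruppert average of the SGD iterates. Below is a minimal Python sketch of averaged SGD; the squared loss, the step-size schedule eta0 * t^(-alpha), and all variable names are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def averaged_sgd(grad, x0, n_steps, eta0=0.5, alpha=0.505, rng=None):
    """SGD with step size eta0 * t**(-alpha); returns the Polyak-Ruppert
    average of the iterates and the full trajectory (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float).copy()
    traj = np.empty((n_steps, x.size))
    for t in range(1, n_steps + 1):
        x = x - eta0 * t ** (-alpha) * grad(x, rng)  # one stochastic step
        traj[t - 1] = x
    return traj.mean(axis=0), traj                   # average iterate, trajectory

# Illustration: linear regression with true coefficients theta_star.
d = 5
theta_star = np.linspace(1.0, 2.0, d)

def lin_reg_grad(x, rng):
    a = rng.standard_normal(d)                  # random design vector
    y = a @ theta_star + rng.standard_normal()  # noisy response
    return (a @ x - y) * a                      # stochastic gradient of squared loss

theta_bar, traj = averaged_sgd(lin_reg_grad, np.zeros(d), n_steps=20_000)
```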

Our main contributions are twofold. First, in the fixed-dimension setup, we propose two consistent estimators of the asymptotic covariance of the average iterate from SGD: (1) a plug-in estimator, and (2) a batch-means estimator, which is computationally more efficient and uses only the iterates from SGD. Both proposed estimators allow us to construct asymptotically exact confidence intervals and hypothesis tests.
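As a rough illustration of the batch-means idea, the sketch below splits the iterates into equal-length batches and estimates the asymptotic covariance from the spread of the batch means around the overall average. Equal batch sizes are a simplifying assumption here: the paper's estimator uses growing batch sizes to accommodate the nonstationarity of SGD iterates. The plug-in alternative would instead estimate a sandwich covariance of the form A^{-1} S A^{-1} (inverse Hessian of the population loss times the gradient covariance) directly from the data.

```python
import numpy as np
from scipy import stats

def batch_means_cov(iterates, n_batches):
    """Equal-batch-size batch-means estimate of the asymptotic covariance
    of the averaged iterate (simplified sketch; not the paper's growing
    batch-size construction). `iterates` is an (n, d) array."""
    n, d = iterates.shape
    b = n // n_batches                               # batch length
    x = iterates[: b * n_batches].reshape(n_batches, b, d)
    batch_means = x.mean(axis=1)                     # mean of each batch
    overall = x.reshape(-1, d).mean(axis=0)          # overall average iterate
    diff = batch_means - overall
    # Scaled so that sigma_hat / n estimates Cov(average iterate).
    return b * (diff.T @ diff) / (n_batches - 1)

def confidence_interval(x_bar, sigma_hat, n, j, level=0.95):
    """Asymptotic confidence interval for the j-th coordinate."""
    z = stats.norm.ppf(0.5 + level / 2.0)
    half = z * np.sqrt(sigma_hat[j, j] / n)
    return x_bar[j] - half, x_bar[j] + half

# Usage with theta_bar and traj from the previous sketch:
sigma_hat = batch_means_cov(traj, n_batches=40)
lo, hi = confidence_interval(theta_bar, sigma_hat, len(traj), j=0)
```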

Second, for high-dimensional linear regression, using a variant of the SGD algorithm, we construct a debiased estimator of each regression coefficient that is asymptotically normal. This gives a one-pass algorithm for computing both the sparse regression coefficients and their confidence intervals, which is computationally attractive and applicable to online data.
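The abstract does not spell out the debiasing step. As general background only (not the paper's exact construction), debiased estimators in high-dimensional linear regression typically correct a regularized estimate by a one-step adjustment along an approximate inverse of the sample design covariance:

```latex
% Background sketch: given a regularized estimate \hat{\beta} and an
% approximate inverse M of the sample covariance \hat{\Sigma} = X^{\top}X/n,
% a debiased estimator takes the form
\[
  \hat{\beta}^{\mathrm{d}}
  \;=\;
  \hat{\beta} \;+\; \frac{1}{n}\, M X^{\top}\bigl(y - X\hat{\beta}\bigr),
\]
% whose coordinates are asymptotically normal, so that
% \hat{\beta}^{\mathrm{d}}_{j} \pm z_{1-\alpha/2}\,\widehat{\mathrm{se}}_{j}
% is an asymptotically exact confidence interval for \beta_{j}.
```

The paper's contribution is to obtain such an estimator in a single pass with a variant of SGD, rather than from a full-data regularized fit.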

Received: 1 October 2017; Published: February 2020
Copyright © 2020 Institute of Mathematical Statistics