The Annals of Statistics

A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning

Hanyuan Hang and Ingo Steinwart



We establish a Bernstein-type inequality for a class of stochastic processes that includes the classical geometrically $\phi$-mixing processes, Rio’s generalization of these processes, and many time-discrete dynamical systems. Modulo a logarithmic factor and some constants, our Bernstein-type inequality coincides with the classical Bernstein inequality for i.i.d. data. We further use this new Bernstein-type inequality to derive an oracle inequality for generic regularized empirical risk minimization algorithms and data generated by such processes. Applying this oracle inequality to support vector machines using Gaussian kernels for binary classification, we obtain essentially the same rate as for i.i.d. processes. For least squares and quantile regression, the resulting learning rates match, up to some arbitrarily small extra term in the exponent, the optimal rates for i.i.d. processes.
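
For orientation, the classical Bernstein inequality for i.i.d. data, which the abstract uses as its benchmark, can be written as below. This is the standard textbook form and is not quoted from the article; the symbols $B$, $\sigma^2$, $\varepsilon$ and $\tau$ are notation assumed here for illustration.

```latex
% Classical Bernstein inequality (standard i.i.d. form; notation assumed here).
% Let X_1, ..., X_n be i.i.d. real-valued random variables with
%   E X_i = 0,  |X_i| <= B almost surely,  E X_i^2 <= sigma^2.
% Then, for every epsilon > 0,
\[
  P\Biggl( \frac{1}{n} \sum_{i=1}^{n} X_i \ge \varepsilon \Biggr)
  \le \exp\Biggl( - \frac{n \varepsilon^{2}}{2\bigl(\sigma^{2} + \varepsilon B / 3\bigr)} \Biggr).
\]
% Equivalent confidence form: for every tau > 0, with probability
% at least 1 - e^{-tau},
\[
  \frac{1}{n} \sum_{i=1}^{n} X_i
  < \sqrt{\frac{2 \sigma^{2} \tau}{n}} + \frac{2 B \tau}{3 n}.
\]
```

According to the abstract, the inequality established in the article for the dependent processes has this shape up to a logarithmic factor and some constants; the exact statement is given in the full text.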

Article information

Ann. Statist., Volume 45, Number 2 (2017), 708-743.

Received: March 2015
Revised: March 2016
First available in Project Euclid: 16 May 2017


Primary: 60E15: Inequalities; stochastic orderings
Secondary:
  • 60G10: Stationary processes
  • 37D20: Uniformly hyperbolic systems (expanding, Anosov, Axiom A, etc.)
  • 60F10: Large deviations
  • 68T05: Learning and adaptive systems [See also 68Q32, 91E40]
  • 62G08: Nonparametric regression
  • 62M10: Time series, auto-correlation, regression, etc. [See also 91B84]

Keywords: Bernstein-type inequalities; mixing processes; dynamical systems; nonparametric classification and regression; support vector machines (SVMs)


Hang, Hanyuan; Steinwart, Ingo. A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning. Ann. Statist. 45 (2017), no. 2, 708–743. doi:10.1214/16-AOS1465.



  • [1] Adamczak, R. (2008). A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electron. J. Probab. 13 1000–1034.
  • [2] Adams, R. A. and Fournier, J. J. F. (2003). Sobolev Spaces, 2nd ed. Pure and Applied Mathematics (Amsterdam) 140. Elsevier/Academic Press, Amsterdam.
  • [3] Alquier, P., Li, X. and Wintenberger, O. (2013). Prediction of time series by statistical learning: general losses and fast rates. Dependence Modeling 1 65–93.
  • [4] Ambrosio, L., Fusco, N. and Pallara, D. (2000). Functions of Bounded Variation and Free Discontinuity Problems. Oxford Univ. Press, New York.
  • [5] Araújo, V., Galatolo, S. and Pacifico, M. J. (2014). Decay of correlations for maps with uniformly contracting fibers and logarithm law for singular hyperbolic attractors. Math. Z. 276 1001–1048.
  • [6] Baladi, V. (2000). Positive Transfer Operators and Decay of Correlations. Advanced Series in Nonlinear Dynamics 16. World Scientific, River Edge, NJ.
  • [7] Baladi, V. (2001). Decay of correlations. In Smooth Ergodic Theory and Its Applications (Seattle, WA, 1999). Proc. Sympos. Pure Math. 69 297–325. Amer. Math. Soc., Providence, RI.
  • [8] Bartlett, P. L., Jordan, M. I. and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101 138–156.
  • [9] Belomestny, D. (2011). Spectral estimation of the Lévy density in partially observed affine models. Stochastic Process. Appl. 121 1217–1244.
  • [10] Benedicks, M. and Young, L.-S. (2000). Markov extensions and decay of correlations for certain Hénon maps. Astérisque 261 13–56.
  • [11] Blanchard, G., Lugosi, G. and Vayatis, N. (2004). On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res. 4 861–894.
  • [12] Bosq, D. (1993). Bernstein-type large deviations inequalities for partial sums of strong mixing processes. Statistics 24 59–70.
  • [13] Bowen, R. (1975). Equilibrium States and the Ergodic Theory of Anosov Diffeomorphisms. Lecture Notes in Mathematics 470. Springer, Berlin.
  • [14] Bradley, R. C. (2007). Introduction to Strong Mixing Conditions. Vol. 1. Kendrick Press, Heber City, UT.
  • [15] Chazottes, J.-R., Collet, P. and Schmitt, B. (2005). Statistical consequences of the Devroye inequality for processes. Applications to a class of non-uniformly hyperbolic dynamical systems. Nonlinearity 18 2341–2364.
  • [16] Chazottes, J.-R., Collet, P. and Schmitt, B. (2005). Devroye inequality for a class of non-uniformly hyperbolic dynamical systems. Nonlinearity 18 2323–2340.
  • [17] Chazottes, J.-R. and Gouëzel, S. (2012). Optimal concentration inequalities for dynamical systems. Comm. Math. Phys. 316 843–889.
  • [18] Chernov, N. (1999). Decay of correlations and dispersing billiards. J. Stat. Phys. 94 513–556.
  • [19] Collet, P., Martinez, S. and Schmitt, B. (2002). Exponential inequalities for dynamical measures of expanding maps of the interval. Probab. Theory Related Fields 123 301–322.
  • [20] Davydov, Y. A. (1968). Convergence of distributions generated by stationary stochastic processes. Theory Probab. Appl. 13 691–696.
  • [21] Dedecker, J., Doukhan, P., Lang, G., León, J. R., Louhichi, S. and Prieur, C. (2007). Weak Dependence: With Examples and Applications. Lecture Notes in Statistics 190. Springer, New York.
  • [22] Dedecker, J. and Prieur, C. (2005). New dependence coefficients. Examples and applications to statistics. Probab. Theory Related Fields 132 203–236.
  • [23] Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Applications of Mathematics (New York) 31. Springer, New York.
  • [24] Devroye, L. and Lugosi, G. (2001). Combinatorial Methods in Density Estimation. Springer, New York.
  • [25] Eberts, M. and Steinwart, I. (2013). Optimal regression rates for SVMs using Gaussian kernels. Electron. J. Stat. 7 1–42.
  • [26] Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
  • [27] Hang, H. and Steinwart, I. (2014). Fast learning from $\alpha$-mixing observations. J. Multivariate Anal. 127 184–199.
  • [28] Hang, H. and Steinwart, I. (2016). Supplement to “A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning.” DOI:10.1214/16-AOS1465SUPP.
  • [29] Hofbauer, F. and Keller, G. (1982). Ergodic properties of invariant measures for piecewise monotonic transformations. Math. Z. 180 119–140.
  • [30] Ibragimov, I. A. (1962). Some limit theorems for stationary processes. Theory Probab. Appl. 7 349–382.
  • [31] Jager, L., Maes, J. and Ninet, A. (2015). Exponential decay of correlations for a real-valued dynamical system embedded in $\mathbb{R}^{2}$. C. R. Math. Acad. Sci. Paris 353 1041–1045.
  • [32] Keller, G. and Nowicki, T. (1992). Spectral theory, zeta functions and the distribution of periodic points for Collet–Eckmann maps. Comm. Math. Phys. 149 31–69.
  • [33] Lasota, A. and Mackey, M. C. (1985). Probabilistic Properties of Deterministic Systems. Cambridge Univ. Press, Cambridge.
  • [34] Liverani, C. (1995). Decay of correlations. Ann. of Math. (2) 142 239–301.
  • [35] Luzzatto, S. and Melbourne, I. (2013). Statistical properties and decay of correlations for interval maps with critical points and singularities. Comm. Math. Phys. 320 21–35.
  • [36] Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829.
  • [37] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896. Springer, Berlin.
  • [38] Maume-Deschamps, V. (2006). Exponential inequalities and functional estimations for weak dependent data; applications to dynamical systems. Stoch. Dyn. 6 535–560.
  • [39] McGoff, K., Mukherjee, S., Nobel, A. and Pillai, N. (2015). Consistency of maximum likelihood estimation for some dynamical systems. Ann. Statist. 43 1–29.
  • [40] McGoff, K., Mukherjee, S. and Pillai, N. S. (2012). Statistical inference for dynamical systems: A review. Preprint. Available at arXiv:1204.6265.
  • [41] Merlevède, F., Peligrad, M. and Rio, E. (2009). Bernstein inequality and moderate deviations under strong mixing conditions. In High Dimensional Probability V: The Luminy Volume. Inst. Math. Stat. Collect. 5 273–292. IMS, Beachwood, OH.
  • [42] Modha, D. S. and Masry, E. (1996). Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. Inform. Theory 42 2133–2145.
  • [43] Rio, E. (1996). Sur le théorème de Berry–Esseen pour les suites faiblement dépendantes. Probab. Theory Related Fields 104 255–282.
  • [44] Rosenblatt, M. (1956). A central limit theorem and a strong mixing condition. Proc. Nat. Acad. Sci. USA 42 43–47.
  • [45] Ruelle, D. (1976). A measure associated with axiom-A attractors. Amer. J. Math. 98 619–654.
  • [46] Runst, T. and Sickel, W. (1996). Sobolev Spaces of Fractional Order, Nemytskij Operators, and Nonlinear Partial Differential Equations. De Gruyter Series in Nonlinear Analysis and Applications 3. de Gruyter, Berlin.
  • [47] Rychlik, M. (1983). Bounded variation and invariant measures. Studia Math. 76 69–80.
  • [48] Samson, P.-M. (2000). Concentration of measure inequalities for Markov chains and $\Phi$-mixing processes. Ann. Probab. 28 416–461.
  • [49] Shub, M. (1987). Global Stability of Dynamical Systems. Springer, New York.
  • [50] Sinaĭ, J. G. (1972). Gibbs measures in ergodic theory. Russ. Math. Surveys 27 21–69.
  • [51] Steinwart, I. (2009). Two oracle inequalities for regularized boosting classifiers. Stat. Interface 2 271–284.
  • [52] Steinwart, I. and Anghel, M. (2009). Consistency of support vector machines for forecasting the evolution of an unknown ergodic dynamical system from observations with unknown noise. Ann. Statist. 37 841–875.
  • [53] Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer, New York.
  • [54] Steinwart, I. and Christmann, A. (2011). Estimating conditional quantiles with the help of the pinball loss. Bernoulli 17 211–225.
  • [55] Steinwart, I., Hush, D. and Scovel, C. (2006). An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Trans. Inform. Theory 52 4635–4643.
  • [56] Steinwart, I., Hush, D. and Scovel, C. (2009). Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory (S. Dasgupta and A. Klivans, eds.) 79–93. Available at
  • [57] Takeuchi, I., Le, Q. V., Sears, T. D. and Smola, A. J. (2006). Nonparametric quantile estimation. J. Mach. Learn. Res. 7 1231–1264.
  • [58] Triebel, H. (2010). Theory of Function Spaces. Birkhäuser/Springer, Basel.
  • [59] Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
  • [60] Viana, M. (1997). Stochastic Dynamics of Deterministic Systems 21. IMPA, Brazil.
  • [61] Wintenberger, O. (2010). Deviation inequalities for sums of weakly dependent time series. Electron. Commun. Probab. 15 489–503.
  • [62] Young, L.-S. (1998). Statistical properties of dynamical systems with some hyperbolicity. Ann. of Math. (2) 147 585–650.
  • [63] Zhang, J. (2004). Sieve estimates via neural network for strong mixing processes. Stat. Inference Stoch. Process. 7 115–135.

Supplemental materials

  • Supplement to “A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning”. The supplement [28] contains an Appendix, in which we provide the proofs for Sections 2 and 4.