The Annals of Statistics

A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning

Abstract

We establish a Bernstein-type inequality for a class of stochastic processes that includes the classical geometrically $\phi$-mixing processes, Rio’s generalization of these processes, and many time-discrete dynamical systems. Modulo a logarithmic factor and some constants, our Bernstein-type inequality coincides with the classical Bernstein inequality for i.i.d. data. We further use this new Bernstein-type inequality to derive an oracle inequality for generic regularized empirical risk minimization algorithms and data generated by such processes. Applying this oracle inequality to support vector machines using Gaussian kernels for binary classification, we obtain essentially the same rate as for i.i.d. processes. For least squares and quantile regression, the resulting learning rates match, up to some arbitrarily small extra term in the exponent, the optimal rates for i.i.d. processes.
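
For orientation, the classical Bernstein inequality for i.i.d. data referred to above can be stated in one common textbook form as follows. The notation ($X_i$, $B$, $\sigma^2$, $\varepsilon$) is an assumption made here for illustration and is not taken from the article; the article's own inequality for dependent data differs from this form by the logarithmic factor and constants mentioned in the abstract. Let $X_1, \dots, X_n$ be independent, zero-mean random variables with $|X_i| \le B$ almost surely and $\mathbb{E} X_i^2 \le \sigma^2$. Then, for all $\varepsilon > 0$,

$$
\mathbb{P}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \ge \varepsilon \right) \le \exp\left( - \frac{n \varepsilon^2}{2\left(\sigma^2 + \varepsilon B / 3\right)} \right).
$$

When $\sigma^2$ is small relative to $B$, the variance term in the exponent gives a sharper bound than Hoeffding-type inequalities, which is the usual reason such inequalities lead to fast learning rates.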

Article information

Source
Ann. Statist., Volume 45, Number 2 (2017), 708-743.

Dates
Revised: March 2016
First available in Project Euclid: 16 May 2017

https://projecteuclid.org/euclid.aos/1494921955

Digital Object Identifier
doi:10.1214/16-AOS1465

Mathematical Reviews number (MathSciNet)
MR3650398

Zentralblatt MATH identifier
06754748

Citation

Hang, Hanyuan; Steinwart, Ingo. A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning. Ann. Statist. 45 (2017), no. 2, 708–743. doi:10.1214/16-AOS1465. https://projecteuclid.org/euclid.aos/1494921955

References

• [1] Adamczak, R. (2008). A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electron. J. Probab. 13 1000–1034.
• [2] Adams, R. A. and Fournier, J. J. F. (2003). Sobolev Spaces, 2nd ed. Pure and Applied Mathematics (Amsterdam) 140. Elsevier/Academic Press, Amsterdam.
• [3] Alquier, P., Li, X. and Wintenberger, O. (2013). Prediction of time series by statistical learning: general losses and fast rates. Dependence Modeling 1 65–93.
• [4] Ambrosio, L., Fusco, N. and Pallara, D. (2000). Functions of Bounded Variation and Free Discontinuity Problems. Oxford Univ. Press, New York.
• [5] Araújo, V., Galatolo, S. and Pacifico, M. J. (2014). Decay of correlations for maps with uniformly contracting fibers and logarithm law for singular hyperbolic attractors. Math. Z. 276 1001–1048.
• [6] Baladi, V. (2000). Positive Transfer Operators and Decay of Correlations. Advanced Series in Nonlinear Dynamics 16. World Scientific, River Edge, NJ.
• [7] Baladi, V. (2001). Decay of correlations. In Smooth Ergodic Theory and Its Applications (Seattle, WA, 1999). Proc. Sympos. Pure Math. 69 297–325. Amer. Math. Soc., Providence, RI.
• [8] Bartlett, P. L., Jordan, M. I. and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101 138–156.
• [9] Belomestny, D. (2011). Spectral estimation of the Lévy density in partially observed affine models. Stochastic Process. Appl. 121 1217–1244.
• [10] Benedicks, M. and Young, L.-S. (2000). Markov extensions and decay of correlations for certain Hénon maps. Astérisque 261 13–56.
• [11] Blanchard, G., Lugosi, G. and Vayatis, N. (2004). On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res. 4 861–894.
• [12] Bosq, D. (1993). Bernstein-type large deviations inequalities for partial sums of strong mixing processes. Statistics 24 59–70.
• [13] Bowen, R. (1975). Equilibrium States and the Ergodic Theory of Anosov Diffeomorphisms. Lecture Notes in Mathematics 470. Springer, Berlin.
• [14] Bradley, R. C. (2007). Introduction to Strong Mixing Conditions. Vol. 1. Kendrick Press, Heber City, UT.
• [15] Chazottes, J.-R., Collet, P. and Schmitt, B. (2005). Statistical consequences of the Devroye inequality for processes. Applications to a class of non-uniformly hyperbolic dynamical systems. Nonlinearity 18 2341–2364.
• [16] Chazottes, J.-R., Collet, P. and Schmitt, B. (2005). Devroye inequality for a class of non-uniformly hyperbolic dynamical systems. Nonlinearity 18 2323–2340.
• [17] Chazottes, J.-R. and Gouëzel, S. (2012). Optimal concentration inequalities for dynamical systems. Comm. Math. Phys. 316 843–889.
• [18] Chernov, N. (1999). Decay of correlations and dispersing billiards. J. Stat. Phys. 94 513–556.
• [19] Collet, P., Martinez, S. and Schmitt, B. (2002). Exponential inequalities for dynamical measures of expanding maps of the interval. Probab. Theory Related Fields 123 301–322.
• [20] Davydov, Y. A. (1968). Convergence of distributions generated by stationary stochastic processes. Theory Probab. Appl. 13 691–696.
• [21] Dedecker, J., Doukhan, P., Lang, G., León, J. R., Louhichi, S. and Prieur, C. (2007). Weak Dependence: With Examples and Applications. Lecture Notes in Statistics 190. Springer, New York.
• [22] Dedecker, J. and Prieur, C. (2005). New dependence coefficients. Examples and applications to statistics. Probab. Theory Related Fields 132 203–236.
• [23] Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Applications of Mathematics (New York) 31. Springer, New York.
• [24] Devroye, L. and Lugosi, G. (2001). Combinatorial Methods in Density Estimation. Springer, New York.
• [25] Eberts, M. and Steinwart, I. (2013). Optimal regression rates for SVMs using Gaussian kernels. Electron. J. Stat. 7 1–42.
• [26] Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
• [27] Hang, H. and Steinwart, I. (2014). Fast learning from $\alpha$-mixing observations. J. Multivariate Anal. 127 184–199.
• [28] Hang, H. and Steinwart, I. (2016). Supplement to “A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning.” DOI:10.1214/16-AOS1465SUPP.
• [29] Hofbauer, F. and Keller, G. (1982). Ergodic properties of invariant measures for piecewise monotonic transformations. Math. Z. 180 119–140.
• [30] Ibragimov, I. A. (1962). Some limit theorems for stationary processes. Theory Probab. Appl. 7 349–382.
• [31] Jager, L., Maes, J. and Ninet, A. (2015). Exponential decay of correlations for a real-valued dynamical system embedded in $\mathbb{R}^{2}$. C. R. Math. Acad. Sci. Paris 353 1041–1045.
• [32] Keller, G. and Nowicki, T. (1992). Spectral theory, zeta functions and the distribution of periodic points for Collet–Eckmann maps. Comm. Math. Phys. 149 31–69.
• [33] Lasota, A. and Mackey, M. C. (1985). Probabilistic Properties of Deterministic Systems. Cambridge Univ. Press, Cambridge.
• [34] Liverani, C. (1995). Decay of correlations. Ann. of Math. (2) 142 239–301.
• [35] Luzzatto, S. and Melbourne, I. (2013). Statistical properties and decay of correlations for interval maps with critical points and singularities. Comm. Math. Phys. 320 21–35.
• [36] Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829.
• [37] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896. Springer, Berlin.
• [38] Maume-Deschamps, V. (2006). Exponential inequalities and functional estimations for weak dependent data; applications to dynamical systems. Stoch. Dyn. 6 535–560.
• [39] McGoff, K., Mukherjee, S., Nobel, A. and Pillai, N. (2015). Consistency of maximum likelihood estimation for some dynamical systems. Ann. Statist. 43 1–29.
• [40] McGoff, K., Mukherjee, S. and Pillai, N. S. (2012). Statistical inference for dynamical systems: A review. Preprint. Available at arXiv:1204.6265.
• [41] Merlevède, F., Peligrad, M. and Rio, E. (2009). Bernstein inequality and moderate deviations under strong mixing conditions. In High Dimensional Probability V: The Luminy Volume. Inst. Math. Stat. Collect. 5 273–292. IMS, Beachwood, OH.
• [42] Modha, D. S. and Masry, E. (1996). Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. Inform. Theory 42 2133–2145.
• [43] Rio, E. (1996). Sur le théorème de Berry–Esseen pour les suites faiblement dépendantes. Probab. Theory Related Fields 104 255–282.
• [44] Rosenblatt, M. (1956). A central limit theorem and a strong mixing condition. Proc. Nat. Acad. Sci. USA 42 43–47.
• [45] Ruelle, D. (1976). A measure associated with axiom-A attractors. Amer. J. Math. 98 619–654.
• [46] Runst, T. and Sickel, W. (1996). Sobolev Spaces of Fractional Order, Nemytskij Operators, and Nonlinear Partial Differential Equations. De Gruyter Series in Nonlinear Analysis and Applications 3. de Gruyter, Berlin.
• [47] Rychlik, M. (1983). Bounded variation and invariant measures. Studia Math. 76 69–80.
• [48] Samson, P.-M. (2000). Concentration of measure inequalities for Markov chains and $\Phi$-mixing processes. Ann. Probab. 28 416–461.
• [49] Shub, M. (1987). Global Stability of Dynamical Systems. Springer, New York.
• [50] Sinaĭ, J. G. (1972). Gibbs measures in ergodic theory. Russ. Math. Surveys 27 21–69.
• [51] Steinwart, I. (2009). Two oracle inequalities for regularized boosting classifiers. Stat. Interface 2 271–284.
• [52] Steinwart, I. and Anghel, M. (2009). Consistency of support vector machines for forecasting the evolution of an unknown ergodic dynamical system from observations with unknown noise. Ann. Statist. 37 841–875.
• [53] Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer, New York.
• [54] Steinwart, I. and Christmann, A. (2011). Estimating conditional quantiles with the help of the pinball loss. Bernoulli 17 211–225.
• [55] Steinwart, I., Hush, D. and Scovel, C. (2006). An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Trans. Inform. Theory 52 4635–4643.
• [56] Steinwart, I., Hush, D. and Scovel, C. (2009). Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory (S. Dasgupta and A. Klivans, eds.) 79–93. Available at http://www.cs.mcgill.ca/~colt2009/papers/038.pdf.
• [57] Takeuchi, I., Le, Q. V., Sears, T. D. and Smola, A. J. (2006). Nonparametric quantile estimation. J. Mach. Learn. Res. 7 1231–1264.
• [58] Triebel, H. (2010). Theory of Function Spaces. Birkhäuser/Springer, Basel.
• [59] Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
• [60] Viana, M. (1997). Stochastic Dynamics of Deterministic Systems 21. IMPA, Brazil.
• [61] Wintenberger, O. (2010). Deviation inequalities for sums of weakly dependent time series. Electron. Commun. Probab. 15 489–503.
• [62] Young, L.-S. (1998). Statistical properties of dynamical systems with some hyperbolicity. Ann. of Math. (2) 147 585–650.
• [63] Zhang, J. (2004). Sieve estimates via neural network for strong mixing processes. Stat. Inference Stoch. Process. 7 115–135.

Supplemental materials

• Supplement to “A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning”. The supplement [28] contains an Appendix, in which we provide the proofs for Sections 2 and 4.