## Bernoulli

### Perturbation theory for Markov chains via Wasserstein distance

#### Abstract

Perturbation theory for Markov chains addresses the question of how small differences in the transition probabilities of Markov chains are reflected in differences between their distributions. We prove powerful and flexible bounds on the distance between the $n$th step distributions of two Markov chains when one of them satisfies a Wasserstein ergodicity condition. Our work is motivated by the recent interest in approximate Markov chain Monte Carlo (MCMC) methods in the analysis of big data sets. By using an approach based on Lyapunov functions, we provide estimates for geometrically ergodic Markov chains under weak assumptions. In an autoregressive model, our bounds cannot be improved in general. We illustrate our theory by showing quantitative estimates for approximate versions of two prominent MCMC algorithms, the Metropolis–Hastings and stochastic Langevin algorithms.
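The autoregressive setting mentioned in the abstract admits a simple numerical illustration. The sketch below (not taken from the paper; the chain parameters `a`, `eps`, and the function name `coupled_gap` are illustrative choices) runs an AR(1) chain and an additively perturbed copy under a synchronous coupling, where the coupled gap after $n$ steps is exactly $\varepsilon(1-a^n)/(1-a)$ and hence uniformly bounded by $\varepsilon/(1-a)$, a bound of the perturbation-size-over-spectral-gap type that the theory quantifies:

```python
import math
import random

def coupled_gap(a, eps, n, seed=0):
    """Run the exact chain X_{k+1} = a X_k + Z_k and the perturbed chain
    Y_{k+1} = a Y_k + Z_k + eps with the same noise (synchronous coupling),
    starting both at 0. The coupling makes |X_n - Y_n| an upper bound on the
    Wasserstein distance W_1 between the two n-step distributions."""
    rng = random.Random(seed)
    x = y = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # shared innovation for both chains
        x = a * x + z
        y = a * y + z + eps      # perturbation: constant drift eps
    return abs(x - y)

a, eps, n = 0.8, 0.01, 50
gap = coupled_gap(a, eps, n)
closed_form = eps * (1 - a**n) / (1 - a)   # exact coupled gap
uniform_bound = eps / (1 - a)              # bound independent of n
print(gap, closed_form, uniform_bound)
```

Because the noise cancels under the coupling, the gap is deterministic and matches the closed form exactly, which is why this example is a natural test case for sharpness of perturbation bounds.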

#### Article information

**Source**
Bernoulli, Volume 24, Number 4A (2018), 2610–2639.

**Dates**
Revised: January 2017
First available in Project Euclid: 26 March 2018

**Permanent link**
https://projecteuclid.org/euclid.bj/1522051219

**Digital Object Identifier**
doi:10.3150/17-BEJ938

**Mathematical Reviews number (MathSciNet)**
MR3779696

**Zentralblatt MATH identifier**
06853259

#### Citation

Rudolf, Daniel; Schweizer, Nikolaus. Perturbation theory for Markov chains via Wasserstein distance. Bernoulli 24 (2018), no. 4A, 2610–2639. doi:10.3150/17-BEJ938. https://projecteuclid.org/euclid.bj/1522051219

#### References

• [1] Ahn, S., Korattikara, A. and Welling, M. (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning.
• [2] Alquier, P., Friel, N., Everitt, R. and Boland, A. (2016). Noisy Monte Carlo: Convergence of Markov chains with approximate transition kernels. Stat. Comput. 26 29–47.
• [3] Bardenet, R., Doucet, A. and Holmes, C. (2014). Towards scaling up Markov chain Monte Carlo: An adaptive subsampling approach. In Proceedings of the 31st International Conference on Machine Learning 405–413.
• [4] Bardenet, R., Doucet, A. and Holmes, C. (2015). On Markov chain Monte Carlo methods for tall data. arXiv:1505.02827.
• [5] Baxendale, P.H. (2005). Renewal theory and computable convergence rates for geometrically ergodic Markov chains. Ann. Appl. Probab. 15 700–738.
• [6] Betancourt, M. (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In Proceedings of the 32nd International Conference on Machine Learning 533–540.
• [7] Breyer, L., Roberts, G.O. and Rosenthal, J.S. (2001). A note on geometric ergodicity and floating-point roundoff error. Statist. Probab. Lett. 53 123–127.
• [8] Dobrushin, R.L. (1956). Central limit theorem for non-stationary Markov chains. I. Teor. Veroyatn. Primen. 1 72–89.
• [9] Dobrushin, R.L. (1956). Central limit theorem for nonstationary Markov chains. II. Teor. Veroyatn. Primen. 1 365–425.
• [10] Dobrushin, R.L. (1996). Perturbation methods of the theory of Gibbsian fields. In Lectures on Probability Theory and Statistics: Ecole d’Eté de Probabilités de Saint-Flour XXIV – 1994. Lecture Notes in Mathematics 1648 1–66. Berlin: Springer.
• [11] Durmus, A. and Moulines, E. (2015). Quantitative bounds of convergence for geometrically ergodic Markov chain in the Wasserstein distance with application to the Metropolis Adjusted Langevin Algorithm. Stat. Comput. 25 5–19.
• [12] Eberle, A. (2014). Error bounds for Metropolis–Hastings algorithms applied to perturbations of Gaussian measures in high dimensions. Ann. Appl. Probab. 24 337–377.
• [13] Ferré, D., Hervé, L. and Ledoux, J. (2013). Regular perturbation of $V$-geometrically ergodic Markov chains. J. Appl. Probab. 50 184–194.
• [14] Gibbs, A.L. (2004). Convergence in the Wasserstein metric for Markov chain Monte Carlo algorithms with applications to image restoration. Stoch. Models 20 473–492.
• [15] Guibourg, D., Hervé, L. and Ledoux, J. (2012). Quasi-compactness of Markov kernels on weighted-supremum spaces and geometrical ergodicity. arXiv:1110.3240v5.
• [16] Hairer, M. (2006). Ergodic properties of Markov processes. Lecture notes, Univ. Warwick. Available at http://www.hairer.org/notes/Markov.pdf.
• [17] Hairer, M. and Mattingly, J.C. (2011). Yet another look at Harris’ ergodic theorem for Markov chains. In Seminar on Stochastic Analysis, Random Fields and Applications VI. Progress in Probability 63 109–117. Basel: Birkhäuser/Springer Basel AG.
• [18] Hairer, M., Stuart, A.M. and Vollmer, S.J. (2014). Spectral gaps for a Metropolis–Hastings algorithm in infinite dimensions. Ann. Appl. Probab. 24 2455–2490.
• [19] Johndrow, J., Mattingly, J.C., Mukherjee, S. and Dunson, D. (2015). Approximations of Markov chains and Bayesian inference. arXiv:1508.03387.
• [20] Kartashov, N.V. (1986). Inequalities in stability and ergodicity theorems for Markov chains with a common phase space. I. Theory Probab. Appl. 30 247–259.
• [21] Kartashov, N.V. and Golomozyĭ, V. (2013). Maximal coupling procedure and stability of discrete Markov chains. I. Theory Probab. Math. Statist. 86 93–104.
• [22] Keller, G. and Liverani, C. (1999). Stability of the spectrum for transfer operators. Ann. Sc. Norm. Super. Pisa Cl. Sci. (4) 28 141–152.
• [23] Korattikara, A., Chen, Y. and Welling, M. (2014). Austerity in MCMC land: Cutting the Metropolis–Hastings budget. In Proceedings of the 31st International Conference on Machine Learning 181–189.
• [24] Lee, A., Doucet, A. and Łatuszyński, K. (2014). Perfect simulation using atomic regeneration with application to sequential Monte Carlo. arXiv:1407.5770.
• [25] Madras, N. and Sezer, D. (2010). Quantitative bounds for Markov chain convergence: Wasserstein and total variation distances. Bernoulli 16 882–908.
• [26] Mao, Y., Zhang, M. and Zhang, Y. (2013). A generalization of Dobrushin coefficient. Chinese J. Appl. Probab. Statist. 29 489–494.
• [27] Marin, J.-M., Pudlo, P., Robert, C.P. and Ryder, R.J. (2012). Approximate Bayesian computational methods. Stat. Comput. 22 1167–1180.
• [28] Mathé, P. (2004). Numerical integration using V-uniformly ergodic Markov chains. J. Appl. Probab. 41 1104–1112.
• [29] Medina-Aguayo, F.J., Lee, A. and Roberts, G.O. (2016). Stability of noisy Metropolis–Hastings. Stat. Comput. 26 1187–1211.
• [30] Mengersen, K.L. and Tweedie, R.L. (1996). Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 24 101–121.
• [31] Meyn, S.P. and Tweedie, R.L. (2009). Markov Chains and Stochastic Stability, 2nd ed. Cambridge: Cambridge Univ. Press.
• [32] Mitrophanov, A.Yu. (2003). Stability and exponential convergence of continuous-time Markov chains. J. Appl. Probab. 40 970–979.
• [33] Mitrophanov, A.Yu. (2005). Sensitivity and convergence of uniformly ergodic Markov chains. J. Appl. Probab. 42 1003–1014.
• [34] Ollivier, Y. (2009). Ricci curvature of Markov chains on metric spaces. J. Funct. Anal. 256 810–864.
• [35] Pillai, N. and Smith, A. (2015). Ergodicity of approximate MCMC chains with applications to large data sets. arXiv:1405.0182v2.
• [36] Roberts, G.O. and Rosenthal, J.S. (1997). Geometric ergodicity and hybrid Markov chains. Electron. Commun. Probab. 2 13–25.
• [37] Roberts, G.O. and Rosenthal, J.S. (2004). General state space Markov chains and MCMC algorithms. Probab. Surv. 1 20–71.
• [38] Roberts, G.O., Rosenthal, J.S. and Schwartz, P.O. (1998). Convergence properties of perturbed Markov chains. J. Appl. Probab. 35 1–11.
• [39] Roberts, G.O. and Tweedie, R.L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2 341–363.
• [40] Rudolf, D. (2012). Explicit error bounds for Markov chain Monte Carlo. Dissertationes Math. 485 93 pp.
• [41] Shardlow, T. and Stuart, A.M. (2000). A perturbation theory for ergodic Markov chains and application to numerical approximations. SIAM J. Numer. Anal. 37 1120–1137.
• [42] Singh, S., Wick, M. and McCallum, A. (2012). Monte Carlo MCMC: Efficient inference by approximate sampling. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 1104–1113. Stroudsburg, PA: Association for Computational Linguistics.
• [43] Teh, Y.W., Thiery, A.H. and Vollmer, S.J. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 17 Art. ID 7.
• [44] Tierney, L. (1998). A note on the Metropolis–Hastings kernels for general state spaces. Ann. Appl. Probab. 8 1–9.
• [45] Villani, C. (2003). Topics in Optimal Transportation. Graduate Studies in Mathematics 58. Providence, RI: Amer. Math. Soc.
• [46] Villani, C. (2009). Optimal Transport: Old and New. Grundlehren der Mathematischen Wissenschaften 338. Berlin: Springer.
• [47] Welling, M. and Teh, Y.W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning 681–688.