## The Annals of Applied Probability

### Measuring sample quality with diffusions

#### Abstract

Stein’s method for measuring convergence to a continuous target distribution relies on an operator characterizing the target and Stein factor bounds on the solutions of an associated differential equation. While such operators and bounds are readily available for a diversity of univariate targets, few multivariate targets have been analyzed. We introduce a new class of characterizing operators based on Itô diffusions and develop explicit multivariate Stein factor bounds for any target with a fast-coupling Itô diffusion. As example applications, we develop computable and convergence-determining diffusion Stein discrepancies for log-concave, heavy-tailed and multimodal targets and use these quality measures to select the hyperparameters of biased Markov chain Monte Carlo (MCMC) samplers, compare random and deterministic quadrature rules and quantify bias-variance tradeoffs in approximate MCMC. Our results establish a near-linear relationship between diffusion Stein discrepancies and Wasserstein distances, improving upon past work even for strongly log-concave targets. The exposed relationship between Stein factors and Markov process coupling may be of independent interest.
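As a rough illustration of what "an operator characterizing the target" means, the sketch below checks numerically that a Stein operator averages to zero under its target and to a nonzero value under a biased sample. It is a one-dimensional toy under stated assumptions, not the paper's construction: the target is a standard normal, the operator is the Langevin special case $(\mathcal{T}g)(x) = g'(x) + g(x)\,\frac{d}{dx}\log p(x)$, and only a single test function $g = \sin$ is used, whereas a Stein discrepancy takes a supremum over a whole function class. The names `langevin_stein_operator` and `stein_mean` are hypothetical.

```python
import math
import random


def langevin_stein_operator(g, dg, score, x):
    """Apply (T g)(x) = g'(x) + g(x) * score(x) in one dimension.

    `score` is d/dx log p(x) for the target density p (assumed known).
    """
    return dg(x) + g(x) * score(x)


def stein_mean(samples, g=math.sin, dg=math.cos, score=lambda x: -x):
    """Sample average of T g; defaults assume a standard normal target."""
    total = sum(langevin_stein_operator(g, dg, score, x) for x in samples)
    return total / len(samples)


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 200_000

# Draws from the target N(0, 1) and a deliberately biased copy N(0.5, 1).
exact = [rng.gauss(0.0, 1.0) for _ in range(n)]
shifted = [x + 0.5 for x in exact]

mean_exact = stein_mean(exact)
mean_shifted = stein_mean(shifted)

print(f"mean of T g under target sample:  {mean_exact:+.4f}")
print(f"mean of T g under shifted sample: {mean_shifted:+.4f}")
```

For this choice of $g$, the shifted sample has population mean $-0.5\,e^{-1/2}\sin(0.5) \approx -0.145$, so the second printed value should sit visibly away from zero while the first stays within Monte Carlo error of it.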

#### Article information

Source
Ann. Appl. Probab., Volume 29, Number 5 (2019), 2884-2928.

Dates
Revised: November 2018
First available in Project Euclid: 18 October 2019

Permanent link to this document
https://projecteuclid.org/euclid.aoap/1571385625

Digital Object Identifier
doi:10.1214/19-AAP1467

Mathematical Reviews number (MathSciNet)
MR4019878

#### Citation

Gorham, Jackson; Duncan, Andrew B.; Vollmer, Sebastian J.; Mackey, Lester. Measuring sample quality with diffusions. Ann. Appl. Probab. 29 (2019), no. 5, 2884-2928. doi:10.1214/19-AAP1467. https://projecteuclid.org/euclid.aoap/1571385625
