Electronic Journal of Statistics

Dynamics of Bayesian updating with dependent data and misspecified models

Cosma Rohilla Shalizi

Full-text: Open access

Abstract

Much is now known about the consistency of Bayesian updating on infinite-dimensional parameter spaces with independent or Markovian data. Necessary conditions for consistency include the prior putting enough weight on the correct neighborhoods of the data-generating distribution; various sufficient conditions further restrict the prior in ways analogous to capacity control in frequentist nonparametrics. The asymptotics of Bayesian updating with misspecified models or priors, or non-Markovian data, are far less well explored. Here I establish sufficient conditions for posterior convergence when all hypotheses are wrong, and the data have complex dependencies. The main dynamical assumption is the asymptotic equipartition (Shannon-McMillan-Breiman) property of information theory. This, along with Egorov’s Theorem on uniform convergence, lets me build a sieve-like structure for the prior. The main statistical assumption, also a form of capacity control, concerns the compatibility of the prior and the data-generating process, controlling the fluctuations in the log-likelihood when averaged over the sieve-like sets. In addition to posterior convergence, I derive a kind of large deviations principle for the posterior measure, extending in some cases to rates of convergence, and discuss the advantages of predicting using a combination of models known to be wrong. An appendix sketches connections between these results and the replicator dynamics of evolutionary theory.

Article information

Source
Electron. J. Statist. Volume 3 (2009), 1039-1074.

Dates
First available in Project Euclid: 29 October 2009

Permanent link to this document
http://projecteuclid.org/euclid.ejs/1256822130

Digital Object Identifier
doi:10.1214/09-EJS485

Mathematical Reviews number (MathSciNet)
MR2557128

Zentralblatt MATH identifier
06166472

Subjects
Primary:
62C10: Bayesian problems; characterization of Bayes procedures
62G20: Asymptotic properties
62M09: Non-Markovian processes: estimation

Secondary:
60F10: Large deviations
62M05: Markov processes: estimation
92D15: Problems related to evolution
94A17: Measures of information, entropy

Keywords
Asymptotic equipartition; Bayesian consistency; Bayesian nonparametrics; Egorov’s theorem; large deviations; posterior convergence; replicator dynamics; sofic systems

Citation

Shalizi, Cosma Rohilla. Dynamics of Bayesian updating with dependent data and misspecified models. Electron. J. Statist. 3 (2009), 1039–1074. doi:10.1214/09-EJS485. http://projecteuclid.org/euclid.ejs/1256822130.


