Electronic Journal of Statistics

Linear Thompson sampling revisited

Marc Abeille and Alessandro Lazaric


Abstract

We derive an alternative proof for the regret of Thompson sampling (TS) in the stochastic linear bandit setting. While we obtain a regret bound of order $\widetilde{O}(d^{3/2}\sqrt{T})$, as in previous results, the proof sheds new light on the functioning of TS. We leverage the structure of the problem to show how the regret is related to the sensitivity (i.e., the gradient) of the objective function and how selecting optimal arms associated with optimistic parameters controls it. We thus show that TS can be seen as a generic randomized algorithm whose sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach. Furthermore, we show that our proof readily applies to regularized linear optimization and generalized linear model problems.
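
To make the abstract's description concrete, the following is a minimal sketch of linear Thompson sampling as a randomized algorithm of the kind discussed above: the regularized least-squares estimate is perturbed by a Gaussian whose covariance is inflated by a confidence radius, so that the sampled parameter is optimistic with fixed probability. This is a sketch under assumptions, not the paper's exact algorithm: the arm set `arms`, the reward oracle `reward_fn`, and the $O(\sqrt{d \log t})$ schedule for the radius `beta` are illustrative stand-ins for the paper's precise constants.

    import numpy as np

    def linear_ts(arms, reward_fn, T, sigma=1.0, lam=1.0, delta=0.1, rng=None):
        """Sketch of linear Thompson sampling on a finite arm set.

        arms: (n, d) array of feature vectors; reward_fn: x -> noisy reward.
        """
        rng = np.random.default_rng() if rng is None else rng
        n, d = arms.shape
        V = lam * np.eye(d)   # regularized design matrix V_t
        b = np.zeros(d)       # running sum of x_s * r_s
        total = 0.0
        for t in range(1, T + 1):
            theta_hat = np.linalg.solve(V, b)  # regularized least-squares estimate
            # Hypothetical O(sqrt(d log t)) confidence radius standing in for the
            # paper's exact beta_t. Sampling with covariance beta^2 * V^{-1}
            # (inflated relative to the Bayesian posterior) is what yields a
            # fixed probability that theta_tilde is optimistic.
            beta = sigma * np.sqrt(2 * d * np.log(t / delta)) + np.sqrt(lam)
            theta_tilde = rng.multivariate_normal(theta_hat, beta**2 * np.linalg.inv(V))
            x = arms[np.argmax(arms @ theta_tilde)]  # greedy arm for sampled parameter
            r = reward_fn(x)
            V += np.outer(x, x)   # update design matrix with chosen arm
            b += r * x
            total += r
        return total

    # Illustrative usage: unit-norm arms on the half-circle, unknown theta_star.
    theta_star = np.array([1.0, 0.0])
    angles = np.linspace(0.0, np.pi, 20)
    arms = np.stack([np.array([np.cos(a), np.sin(a)]) for a in angles])
    linear_ts(arms, lambda x: x @ theta_star + 0.1 * np.random.randn(), T=1000)

The extra inflation by `beta` relative to the Bayesian posterior is the source of the additional $\sqrt{d}$ factor in the regret bound compared to a UCB-like approach.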

Article information

Source
Electron. J. Statist., Volume 11, Number 2 (2017), 5165-5197.

Dates
Received: June 2017
First available in Project Euclid: 15 December 2017

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1513306870

Digital Object Identifier
doi:10.1214/17-EJS1341SI

Mathematical Reviews number (MathSciNet)
MR3738208

Zentralblatt MATH identifier
06825043

Keywords
Linear bandit; Thompson sampling

Rights
Creative Commons Attribution 4.0 International License.

Citation

Abeille, Marc; Lazaric, Alessandro. Linear Thompson sampling revisited. Electron. J. Statist. 11 (2017), no. 2, 5165–5197. doi:10.1214/17-EJS1341SI. https://projecteuclid.org/euclid.ejs/1513306870

