## The Annals of Statistics

### Kullback–Leibler upper confidence bounds for optimal sequential allocation

#### Abstract

We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins [J. R. Stat. Soc. Ser. B Stat. Methodol. 41 (1979) 148–177], based on upper confidence bounds of the arm payoffs computed using the Kullback–Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: the kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins [Adv. in Appl. Math. 6 (1985) 4–22] and Burnetas and Katehakis [Adv. in Appl. Math. 17 (1996) 122–142], respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art.
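As a rough illustration of the kind of index described above, the following sketch computes a Kullback–Leibler upper confidence bound for a single Bernoulli arm by bisection. It is a minimal sketch, not the authors' implementation: the function names `bernoulli_kl` and `kl_ucb_index` are hypothetical, and the abstract does not specify the exploration threshold, so it is left as a parameter here (a threshold of the form $\log t + c \log\log t$ is a common choice, stated as an assumption).

```python
import math


def bernoulli_kl(p, q, eps=1e-15):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, threshold, tol=1e-9):
    """Approximate the largest q in [mean, 1] with pulls * kl(mean, q) <= threshold.

    `mean` is the empirical mean of the arm, `pulls` the number of times it was
    played, and `threshold` the exploration level (an assumption of this sketch,
    e.g. log(t) + c * log(log(t)) at round t).
    """
    lo, hi = mean, 1.0
    # kl(mean, q) is increasing in q on [mean, 1], so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mean, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Arm with empirical mean 0.5 after 10 pulls, evaluated at round t = 100.
    print(kl_ucb_index(0.5, 10, math.log(100)))
```

An index policy of this type would play, at each round, the arm with the largest index. The index shrinks toward the empirical mean as the number of pulls grows and widens slowly as the threshold increases with time.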

#### Article information

Source
Ann. Statist., Volume 41, Number 3 (2013), 1516–1541.

Dates
First available in Project Euclid: 1 August 2013

Permanent link to this document
https://projecteuclid.org/euclid.aos/1375362558

Digital Object Identifier
doi:10.1214/13-AOS1119

Mathematical Reviews number (MathSciNet)
MR3113820

Zentralblatt MATH identifier
1293.62161

#### Citation

Cappé, Olivier; Garivier, Aurélien; Maillard, Odalric-Ambrym; Munos, Rémi; Stoltz, Gilles. Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist. 41 (2013), no. 3, 1516–1541. doi:10.1214/13-AOS1119. https://projecteuclid.org/euclid.aos/1375362558

#### References

• Agrawal, R. (1995). Sample mean based index policies with $O(\log n)$ regret for the multi-armed bandit problem. Adv. in Appl. Probab. 27 1054–1078.
• Audibert, J.-Y. and Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11 2785–2836.
• Audibert, J.-Y., Munos, R. and Szepesvári, C. (2009). Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410 1876–1902.
• Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 235–256.
• Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5 1–122.
• Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Adv. in Appl. Math. 17 122–142.
• Burnetas, A. N. and Katehakis, M. N. (1997). Optimal adaptive policies for Markov decision processes. Math. Oper. Res. 22 222–255.
• Burnetas, A. N. and Katehakis, M. N. (2003). Asymptotic Bayes analysis for the finite-horizon one-armed-bandit problem. Probab. Engrg. Inform. Sci. 17 53–82.
• Cappé, O., Garivier, A. and Kaufmann, E. (2012). py/maBandits: Matlab and Python packages for multi-armed bandits. Available at http://mloss.org/software/view/415/.
• Cappé, O., Garivier, A., Maillard, O.-A., Munos, R. and Stoltz, G. (2013). Supplement to “Kullback–Leibler upper confidence bounds for optimal sequential allocation.” DOI:10.1214/13-AOS1119SUPP.
• Chang, F. and Lai, T. L. (1987). Optimal stopping and dynamic allocation. Adv. in Appl. Probab. 19 829–853.
• Chow, Y. S. and Teicher, H. (1988). Probability Theory: Independence, Interchangeability, Martingales, 2nd ed. Springer, New York.
• Dembo, A. and Zeitouni, O. (1998). Large Deviations Techniques and Applications, 2nd ed. Applications of Mathematics (New York) 38. Springer, New York.
• Filippi, S., Cappé, O. and Garivier, A. (2010). Optimism in reinforcement learning and Kullback–Leibler divergence. In Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing. IEEE Press, Piscataway, NJ.
• Garivier, A. and Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory. JMLR W&CP.
• Gittins, J. C. (1979). Bandit processes and dynamic allocation indices (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 41 148–177.
• Gittins, J., Glazebrook, K. and Weber, R. (2011). Multi-Armed Bandit Allocation Indices. Wiley, New York.
• Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58 13–30.
• Honda, J. and Takemura, A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of the 23rd Annual Conference on Learning Theory. Omnipress, Madison, WI.
• Honda, J. and Takemura, A. (2011). An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Machine Learning 85 361–391.
• Honda, J. and Takemura, A. (2012). Finite-time regret bound of a bandit algorithm for the semi-bounded support model. Available at arXiv:1202.2277.
• Kaufmann, E., Cappé, O. and Garivier, A. (2012). On Bayesian upper confidence bounds for bandit problems. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics 22 592–600. JMLR W&CP.
• Kaufmann, E., Korda, N. and Munos, R. (2012). Thompson sampling: An asymptotically optimal finite time analysis. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory 199–213. Springer, New York.
• Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math. 6 4–22.
• Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer, New York.
• Maillard, O.-A., Munos, R. and Stoltz, G. (2011). A finite-time analysis of multi-armed bandits problems with Kullback–Leibler divergences. In Proceedings of the 24th Annual Conference on Learning Theory. JMLR W&CP.
• Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896. Springer, Berlin.
• Owen, A. B. (2001). Empirical Likelihood. Chapman & Hall/CRC, Boca Raton, FL.
• Robbins, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. (N.S.) 58 527–535.
• Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 285–294.
• Thompson, W. R. (1935). On the theory of apportionment. Amer. J. Math. 57 450–456.
• van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge Univ. Press, Cambridge.
• Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1 1–305.
• Wald, A. (1945). Sequential tests of statistical hypotheses. Ann. Math. Statist. 16 117–186.
• Weber, R. (1992). On the Gittins index for multiarmed bandits. Ann. Appl. Probab. 2 1024–1033.
• Whittle, P. (1980). Multi-armed bandits and the Gittins index. J. R. Stat. Soc. Ser. B Stat. Methodol. 42 143–149.