Multi-Armed Bandits with Discount Factor Near One: The Bernoulli Case

F. P. Kelly

doi:10.1214/aos/1176345578

Ann. Statist. 9(5): 987-1001 (September, 1981). DOI: 10.1214/aos/1176345578

Abstract

Each of $n$ arms generates an infinite sequence of Bernoulli random variables. The parameters of the sequences are themselves random variables, and are independent with a common distribution satisfying a mild regularity condition. At each stage we must choose an arm to observe (or pull) based on past observations, and our aim is to maximize the expected discounted sum of the observations. In this paper it is shown that, as the discount factor approaches one, the optimal policy tends to the rule of least failures, defined as follows: pull the arm which has incurred the least number of failures or, if this does not define an arm uniquely, select from amongst the set of arms which have incurred the least number of failures an arm with the largest number of successes.
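
The rule of least failures stated in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the paper: the function names `least_failures_arm` and `simulate`, the finite `horizon` that truncates the infinite discounted sum, and the lowest-index choice when arms tie on both failures and successes are all assumptions made here for the example.

```python
import random
from typing import List, Sequence, Tuple


def least_failures_arm(successes: Sequence[int], failures: Sequence[int]) -> int:
    """Return the index of the arm chosen by the rule of least failures.

    Fewest failures wins; ties are broken by most successes. Any remaining
    tie goes to the lowest index (an assumption; the abstract leaves it open).
    """
    if len(successes) != len(failures) or len(failures) == 0:
        raise ValueError("need one success count and one failure count per arm")
    # min() returns the first minimal element, which gives the lowest-index tie-break.
    return min(range(len(failures)), key=lambda i: (failures[i], -successes[i]))


def simulate(
    params: Sequence[float], horizon: int, discount: float, seed: int = 0
) -> Tuple[float, List[int], List[int]]:
    """Run the rule on Bernoulli arms with the given success probabilities.

    Returns the discounted sum of observations over `horizon` pulls (a finite
    truncation, assumed here for illustration) and the final counts per arm.
    """
    rng = random.Random(seed)
    successes = [0] * len(params)
    failures = [0] * len(params)
    total, weight = 0.0, 1.0
    for _ in range(horizon):
        arm = least_failures_arm(successes, failures)
        if rng.random() < params[arm]:  # Bernoulli draw for the pulled arm
            successes[arm] += 1
            total += weight
        else:
            failures[arm] += 1
        weight *= discount  # the next observation is discounted once more
    return total, successes, failures
```

For example, with success counts `[3, 5, 1]` and failure counts `[2, 2, 4]`, arms 0 and 1 tie on failures and the rule selects arm 1 because it has more successes.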

Citation

F. P. Kelly. "Multi-Armed Bandits with Discount Factor Near One: The Bernoulli Case." Ann. Statist. 9(5): 987-1001, September, 1981. https://doi.org/10.1214/aos/1176345578

Information

Published: September, 1981

First available in Project Euclid: 12 April 2007

zbMATH: 0478.90073

MathSciNet: MR628754

Digital Object Identifier: 10.1214/aos/1176345578

Subjects:

Primary: 90C40

Secondary: 62L05, 62L15

Keywords: Bernoulli bandit process, discount optimality, least failures rule, limit rule, Markov decision process, multi-armed bandit, play-the-winner rule

JOURNAL ARTICLE
15 PAGES