The Annals of Statistics

Multi-Armed Bandits with Discount Factor Near One: The Bernoulli Case

F. P. Kelly

Full-text: Open access

Abstract

Each of $n$ arms generates an infinite sequence of Bernoulli random variables. The parameters of the sequences are themselves random variables, and are independent with a common distribution satisfying a mild regularity condition. At each stage we must choose an arm to observe (or pull) based on past observations, and our aim is to maximize the expected discounted sum of the observations. In this paper it is shown that as the discount factor approaches one, the optimal policy tends to the rule of least failures, defined as follows: pull the arm that has incurred the fewest failures; if this does not define an arm uniquely, select, from among the arms with the fewest failures, one with the largest number of successes.
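The least failures rule is a simple index policy, so it is easy to state in code. The following is a minimal sketch of the tie-breaking logic described in the abstract, not an implementation from the paper; the function name and the $(successes, failures)$ count representation are illustrative assumptions.

```python
def least_failures_arm(counts):
    """Select an arm under the least failures rule (a sketch).

    counts: list of (successes, failures) pairs, one per arm.
    Returns the index of an arm with the fewest failures,
    breaking ties in favour of the most successes.
    """
    # Sort key: fewest failures first, then most successes
    # (negated so that larger success counts compare smaller).
    return min(range(len(counts)),
               key=lambda i: (counts[i][1], -counts[i][0]))

# Example: arms 1 and 2 tie with one failure each; arm 1 is
# chosen because it has the larger number of successes.
counts = [(3, 2), (5, 1), (4, 1)]
print(least_failures_arm(counts))  # -> 1
```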

Article information

Source
Ann. Statist., Volume 9, Number 5 (1981), 987-1001.

Dates
First available in Project Euclid: 12 April 2007

Permanent link to this document
https://projecteuclid.org/euclid.aos/1176345578

Digital Object Identifier
doi:10.1214/aos/1176345578

Mathematical Reviews number (MathSciNet)
MR628754

Zentralblatt MATH identifier
0478.90073


Subjects
Primary: 90C40: Markov and semi-Markov decision processes
Secondary: 62L05: Sequential design; 62L15: Optimal stopping [See also 60G40, 91A60]

Keywords
Bernoulli bandit process; Markov decision process; multi-armed bandit; limit rule; play-the-winner rule; least failures rule; discount optimality

Citation

Kelly, F. P. Multi-Armed Bandits with Discount Factor Near One: The Bernoulli Case. Ann. Statist. 9 (1981), no. 5, 987-1001. doi:10.1214/aos/1176345578. https://projecteuclid.org/euclid.aos/1176345578
