The Annals of Applied Probability

Finite state multi-armed bandit problems: sensitive-discount, average-reward and average-overtaking optimality

Michael N. Katehakis and Uriel G. Rothblum

Abstract

We express Gittins indices for multi-armed bandit problems as Laurent expansions around discount factor 1. The coefficients of these expansions are then used to characterize stationary optimal policies when the optimality criteria are sensitive-discount optimality (otherwise known as Blackwell optimality), average-reward optimality and average-overtaking optimality. We also obtain bounds and derive optimality conditions for policies of a type that continue playing the same bandit as long as the state of that bandit remains in prescribed sets.

Article information

Source
Ann. Appl. Probab., Volume 6, Number 3 (1996), 1024-1034.

Dates
First available in Project Euclid: 18 October 2002

Permanent link to this document
https://projecteuclid.org/euclid.aoap/1034968239

Digital Object Identifier
doi:10.1214/aoap/1034968239

Mathematical Reviews number (MathSciNet)
MR1410127

Zentralblatt MATH identifier
0862.90127

Subjects
Primary:
90C47: Minimax problems [See also 49K35]
90C31: Sensitivity, stability, parametric optimization
90C39: Dynamic programming [See also 49L20]
60G40: Stopping times; optimal stopping problems; gambling theory [See also 62L15, 91A60]

Keywords
Bandit problems; optimality criteria; Markov decision chains; Gittins index; Laurent expansions

Citation

Katehakis, Michael N.; Rothblum, Uriel G. Finite state multi-armed bandit problems: sensitive-discount, average-reward and average-overtaking optimality. Ann. Appl. Probab. 6 (1996), no. 3, 1024-1034. doi:10.1214/aoap/1034968239. https://projecteuclid.org/euclid.aoap/1034968239


References

  • BLACKWELL, D. 1962. Discrete dynamic programming. Ann. Math. Statist. 32 719-726.
  • DENARDO, E. V. and MILLER, B. L. 1968. An optimality criterion for discrete dynamic programming with no discounting. Ann. Math. Statist. 39 1220-1227.
  • DERMAN, C. 1970. Finite State Markovian Decision Processes. Academic Press, New York.
  • GITTINS, J. C. 1989. Multi-Armed Bandit Allocation Indices. Wiley-Interscience, New York.
  • GITTINS, J. C. and JONES, D. M. 1974. A dynamic allocation index for the sequential design of experiments. In Progress in Statistics. European Meeting of Statisticians (J. Gani, K. Sarkadi and I. Vince, eds.) 1 241-266. North-Holland, Amsterdam.
  • GLAZEBROOK, K. D. 1982. On the evaluation of suboptimal strategies for families of alternative bandit processes. J. Appl. Probab. 19 716-722.
  • GLAZEBROOK, K. D. 1990. Procedures for the evaluation of strategies for resource allocation in a stochastic environment. J. Appl. Probab. 27 215-220.
  • GLAZEBROOK, K. D. 1991. Bounds for discounted stochastic scheduling problems. J. Appl. Probab. 28 791-801.
  • KATEHAKIS, M. N. and VEINOTT, A. F., JR. 1987. The multi-armed bandit problem: decomposition and computation. Math. Oper. Res. 12 262-268.
  • KELLY, F. P. 1981. Multi-armed bandits with discount factor near one: the Bernoulli case. Ann. Statist. 9 987-1001.
  • LAI, T. S. and YING, Z. 1988. Open bandit processes and optimal scheduling of queuing networks. Adv. in Appl. Probab. 20 447-472.
  • MILLER, B. L. and VEINOTT, A. F., JR. 1969. Discrete dynamic programming with a small interest rate. Ann. Math. Statist. 40 366-370.
  • ROSS, S. M. 1983. Introduction to Stochastic Dynamic Programming. Academic Press, New York.
  • ROTHBLUM, U. J. and VEINOTT, A. F., JR. 1992. Branching Markov decision chains: immigration induced optimality. Unpublished manuscript.
  • VEINOTT, A. F., JR. 1966. On finding optimal policies in discrete dynamic programming with no discounting. Ann. Math. Statist. 37 1284-1294.
  • VEINOTT, A. F., JR. 1969. Discrete dynamic programming with sensitive optimality criteria. Ann. Math. Statist. 40 1635-1660.
  • VEINOTT, A. F., JR. 1974. Markov decision chains. In Studies in Optimization (G. B. Dantzig and B. C. Eaves, eds.) 124-159. Math. Association of America, Washington, DC.
  • WHITTLE, P. 1980. Multi-armed bandits and the Gittins index. J. Roy. Statist. Soc. Ser. B 42 143-149.
  • WHITTLE, P. 1982. Optimization over Time 1. Wiley, New York.

Newark, New Jersey 07102. E-mail: mnk@andromeda.rutgers.edu
Technion-Israel Institute of Technology, Technion City, Haifa 32000, Israel. E-mail: rothblum@ie.technion.ac.il