Abstract
For a multiarmed bandit problem with exponential discounting, the optimal allocation rule is given by a dynamic allocation index defined for each arm on its state space. The index for an arm equals the expected immediate reward from the arm plus an upward adjustment that reflects any uncertainty about the prospects of obtaining rewards from the arm and the possibility of resolving that uncertainty by selecting the arm. The learning component of the index is accordingly defined as the difference between the index and the expected immediate reward. For two arms with the same expected immediate reward, the learning component should be larger for the arm whose reward rate is more uncertain. This is shown to be true for arms based on independent samples from a fixed distribution with an unknown parameter in the cases of Bernoulli and normal distributions, and similar results are obtained in other cases.
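As a rough numerical illustration of the learning component for a Bernoulli arm, the sketch below approximates the index by a standard calibration argument: it searches for the constant reward rate at which retiring forever and pulling the arm are equally attractive. This is not taken from the paper. The Beta(a, b) prior, the discount factor `gamma = 0.9`, the truncation `horizon`, and the function names are all assumptions made for illustration.

```python
def gittins_index_bernoulli(a, b, gamma=0.9, horizon=100, tol=1e-6):
    """Approximate dynamic allocation index of a Bernoulli arm with a
    Beta(a, b) posterior (hypothetical helper; truncated calibration)."""

    def continuation_advantage(lam):
        # Value of retiring forever on a known reward rate lam.
        retire = lam / (1 - gamma)
        # Truncation assumption: after `horizon` pulls, stop learning and take
        # the better of retiring or pulling forever at the posterior mean.
        values = [
            max(retire, (a + s) / (a + b + horizon) / (1 - gamma))
            for s in range(horizon + 1)
        ]
        # Backward induction over the number of pulls already made.
        for depth in range(horizon - 1, 0, -1):
            new_values = []
            for s in range(depth + 1):  # s = successes observed so far
                p = (a + s) / (a + b + depth)
                pull = p * (1 + gamma * values[s + 1]) + (1 - p) * gamma * values[s]
                new_values.append(max(retire, pull))
            values = new_values
        p = a / (a + b)
        pull = p * (1 + gamma * values[1]) + (1 - p) * gamma * values[0]
        return pull - retire

    # The advantage of pulling is decreasing in lam, so bisect for its root.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if continuation_advantage(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def learning_component(a, b, gamma=0.9):
    """Index minus the expected immediate reward a / (a + b)."""
    return gittins_index_bernoulli(a, b, gamma) - a / (a + b)


if __name__ == "__main__":
    # Both arms have expected immediate reward 0.5; Beta(1, 1) is more uncertain.
    print("Beta(1, 1):  ", learning_component(1, 1))
    print("Beta(10, 10):", learning_component(10, 10))
```

Under these assumptions, the two arms have the same expected immediate reward, and the arm with the more diffuse Beta(1, 1) posterior should show the larger learning component, in line with the ordering described in the abstract.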
Citation
John Gittins. You-Gan Wang. "The Learning Component of Dynamic Allocation Indices." Ann. Statist. 20 (3) 1625 - 1636, September, 1992. https://doi.org/10.1214/aos/1176348788
Information