Abstract
Suppose the arms of a two-armed bandit generate i.i.d. Bernoulli random variables with success probabilities $\rho$ and $\lambda$ respectively. It is desired to maximize the expected sum of $N$ trials where $N$ is fixed. If the prior distribution of $(\rho, \lambda)$ is concentrated at two points $(a, b)$ and $(c, d)$ in the unit square, a characterization of the optimal policy is given. In terms of $a, b, c$, and $d$, necessary and sufficient conditions are given for the optimality of the myopic policy.
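As a rough numerical illustration of this setup (not the paper's characterization), the sketch below computes the optimal expected sum by backward induction and compares it with the myopic policy, which always pulls the arm with the larger current expected reward. The prior weight `p` on the point $(a, b)$, the function names `posterior` and `bandit_values`, and the rule that ties go to the first arm are all assumptions introduced here for the example.

```python
from functools import lru_cache


def posterior(p, a, b, c, d, s1, f1, s2, f2):
    """Posterior probability that (rho, lambda) = (a, b), given the
    success/failure counts (s1, f1) on arm 1 and (s2, f2) on arm 2.
    p is the assumed prior probability of the point (a, b)."""
    like_ab = a**s1 * (1 - a)**f1 * b**s2 * (1 - b)**f2
    like_cd = c**s1 * (1 - c)**f1 * d**s2 * (1 - d)**f2
    denom = p * like_ab + (1 - p) * like_cd
    # denom == 0 only on histories of probability zero; any value works there.
    return p * like_ab / denom if denom > 0 else p


def bandit_values(p, a, b, c, d, N):
    """Return (optimal value, myopic value): the expected sum of N trials
    under the optimal policy and under the myopic policy."""

    @lru_cache(maxsize=None)
    def value(s1, f1, s2, f2, myopic):
        if s1 + f1 + s2 + f2 == N:
            return 0.0  # no trials left
        q = posterior(p, a, b, c, d, s1, f1, s2, f2)
        m1 = q * a + (1 - q) * c  # current success probability of arm 1
        m2 = q * b + (1 - q) * d  # current success probability of arm 2
        # Expected total reward from pulling each arm now, then continuing
        # with the same kind of policy.
        v1 = (m1 * (1 + value(s1 + 1, f1, s2, f2, myopic))
              + (1 - m1) * value(s1, f1 + 1, s2, f2, myopic))
        v2 = (m2 * (1 + value(s1, f1, s2 + 1, f2, myopic))
              + (1 - m2) * value(s1, f1, s2, f2 + 1, myopic))
        if myopic:
            # Myopic rule: compare immediate rewards only (ties go to arm 1).
            return v1 if m1 >= m2 else v2
        return max(v1, v2)

    return value(0, 0, 0, 0, False), value(0, 0, 0, 0, True)


if __name__ == "__main__":
    optimal, myopic = bandit_values(p=0.5, a=0.7, b=0.3, c=0.2, d=0.5, N=6)
    print(f"optimal: {optimal:.6f}  myopic: {myopic:.6f}")
```

Because the posterior depends on the history only through the four counts, memoizing on those counts keeps the recursion polynomial in $N$. A gap between the two returned values for a given $(a, b, c, d)$ indicates that the myopic policy is not optimal for that prior weight and horizon.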
Citation
Thomas A. Kelley. "A Note on the Bernoulli Two-Armed Bandit Problem." Ann. Statist. 2 (5) 1056-1062, September 1974. https://doi.org/10.1214/aos/1176342827