Abstract
One of two independent stochastic processes (arms) is to be selected at each of $n$ stages. The selection is sequential and depends on past observations as well as the prior information. The objective is to maximize the expected future-discounted sum of the $n$ observations. We study structural properties of this classical bandit problem, in particular how the maximum expected payoff and the optimal strategy vary with the priors, in two settings: (a) observations from each arm have an exponential family distribution and different arms are assigned independent conjugate priors; (b) observations from each arm have a nonparametric distribution and different arms are assigned independent Dirichlet process priors. In both settings, we derive results of the following type: (i) for a particular arm and a fixed prior weight, the maximum expected payoff increases as the prior mean yield increases; (ii) for a fixed prior mean yield, the maximum expected payoff increases as the prior weight decreases. Specializing to the one-armed bandit, the second result captures the intuition that, given the same immediate payoff, the less one knows about an arm, the more desirable it becomes, because more information remains to be gained when selecting that arm. In the parametric case, our results extend those of (Ann. Statist. 20 (1992) 1625–1636) concerning Bernoulli and normal bandits (see also (In Time Series and Related Topics (2006) pp. 284–294 IMS)). In the nonparametric case, we extend those of (Ann. Statist. 13 (1985) 1523–1534). A key tool in the derivations is the theory of stochastic orders.
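Properties (i) and (ii) can be checked numerically in the simplest parametric instance: a one-armed Bernoulli bandit with a conjugate Beta$(a,b)$ prior (prior mean yield $a/(a+b)$, prior weight $a+b$) played against a known arm. The sketch below is illustrative only and not from the paper; the discount factor, horizon, and known-arm payoff are arbitrary choices, and it uses the standard fact that with geometric discounting a one-armed bandit is a stopping problem (once the known arm is optimal, it remains optimal).

```python
from functools import lru_cache

GAMMA = 0.9  # discount factor (illustrative choice; must be < 1 here)


def one_armed_value(a, b, lam, n):
    """Maximum expected discounted payoff over n stages when the unknown
    Bernoulli arm has a Beta(a, b) prior and the known arm pays lam per stage."""

    @lru_cache(maxsize=None)
    def V(a, b, m):
        if m == 0:
            return 0.0
        # Retire to the known arm for all remaining stages; with geometric
        # discounting this is optimal once the known arm is ever preferred.
        stay = lam * (1 - GAMMA ** m) / (1 - GAMMA)
        # Pull the unknown arm once, update the posterior, continue optimally.
        p = a / (a + b)  # posterior mean yield
        pull = p + GAMMA * (p * V(a + 1, b, m - 1) + (1 - p) * V(a, b + 1, m - 1))
        return max(stay, pull)

    return V(a, b, n)


# Property (ii): same prior mean yield 1/2, smaller prior weight is worth more.
v_light = one_armed_value(1, 1, 0.5, 20)  # Beta(1,1): weight 2
v_heavy = one_armed_value(2, 2, 0.5, 20)  # Beta(2,2): weight 4
assert v_light >= v_heavy

# Property (i): same prior weight 3, larger prior mean yield is worth more.
assert one_armed_value(2, 1, 0.5, 20) >= one_armed_value(1, 2, 0.5, 20)
```

Here `v_light >= v_heavy` reflects the intuition stated in the abstract: the Beta(1,1) arm has the same immediate payoff as the Beta(2,2) arm but carries more information to be gained, hence more option value.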
Citation
Yaming Yu. "Some monotonicity properties of parametric and nonparametric Bayesian bandits." Bernoulli 23 (4B) 3685–3710, November 2017. https://doi.org/10.3150/16-BEJ862