Open Access
A linear response bandit problem
Alexander Goldenshluger, Assaf Zeevi
Stoch. Syst. 3(1): 230-261 (2013). DOI: 10.1214/11-SSY032

Abstract

We consider a two-armed bandit problem which involves sequential sampling from two non-homogeneous populations. The response in each is determined by a random covariate vector and a vector of parameters whose values are not known a priori. The goal is to maximize cumulative expected reward. We study this problem in a minimax setting, and develop rate-optimal policies that combine myopic action based on least squares estimates with a suitable “forced sampling” strategy. It is shown that the regret grows logarithmically in the time horizon $n$, and that no policy can achieve a slower growth rate over all feasible problem instances. In this setting of linear response bandits, the identity of the sub-optimal action changes with the values of the covariate vector, and the optimal policy is subject to sampling from the inferior population at a rate that grows like $\sqrt{n}$.
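
The policy structure described in the abstract (myopic actions driven by per-arm least-squares estimates, plus occasional forced sampling so that both arms' parameters remain estimable) can be sketched in a few lines. The Python sketch below is an illustration under stated assumptions, not the paper's exact construction: the class name, the power-of-two forced-sampling schedule, and the ridge regularization are all choices made here to keep the example self-contained and runnable.

```python
import numpy as np

class ForcedSamplingLinearBandit:
    """Minimal sketch of a two-armed linear-response bandit policy.

    Greedy (myopic) actions are based on per-arm least-squares estimates;
    a sparse deterministic schedule forces each arm to be sampled
    O(log n) times so both estimates stay consistent. The schedule and
    the ridge regularization are assumptions of this sketch, not the
    exact construction in the paper.
    """

    def __init__(self, dim):
        # Per-arm regularized least-squares statistics: A = I + sum x x',
        # b = sum (reward * x). The identity term keeps A invertible.
        self.A = [np.eye(dim), np.eye(dim)]
        self.b = [np.zeros(dim), np.zeros(dim)]
        self.t = 0

    def _forced_arm(self):
        # Force sampling at times 2^k and 2^k + 1, alternating arms,
        # so each arm gets roughly log2(n) forced pulls by time n.
        k = self.t.bit_length() - 1
        if self.t in (2 ** k, 2 ** k + 1):
            return self.t % 2
        return None

    def choose(self, x):
        """Select an arm (0 or 1) for the current covariate vector x."""
        self.t += 1
        arm = self._forced_arm()
        if arm is not None:
            return arm
        # Myopic step: play the arm whose estimated mean response is larger.
        est = [np.linalg.solve(self.A[a], self.b[a]) @ x for a in (0, 1)]
        return int(est[1] > est[0])

    def update(self, arm, x, reward):
        """Fold the observed (covariate, reward) pair into that arm's estimate."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Tiny simulation with hypothetical ground-truth parameter vectors.
rng = np.random.default_rng(0)
theta = [np.array([1.0, -0.5]), np.array([-0.5, 1.0])]
policy = ForcedSamplingLinearBandit(dim=2)
for _ in range(10_000):
    x = rng.normal(size=2)
    arm = policy.choose(x)
    policy.update(arm, x, theta[arm] @ x + rng.normal(scale=0.1))
```

The forced pulls are what keep a scheme of this kind honest: without them, an early unlucky estimate could starve one arm of data indefinitely. The $\sqrt{n}$ rate of sampling from the inferior population mentioned in the abstract arises separately, from covariate draws near the decision boundary where the two arms' predicted responses nearly coincide.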

Citation


Alexander Goldenshluger, Assaf Zeevi. "A linear response bandit problem." Stoch. Syst. 3(1): 230–261, 2013. https://doi.org/10.1214/11-SSY032

Information

Published: 2013
First available in Project Euclid: 24 February 2014

zbMATH: 1352.91009
MathSciNet: MR3353472
Digital Object Identifier: 10.1214/11-SSY032

Subjects:
Primary: 62L05
Secondary: 60G40, 62C20

Keywords: bandit problems, estimation, minimax, rate-optimal policy, regret, sequential allocation

Rights: Copyright © 2013 INFORMS Applied Probability Society
