Stochastic Systems

A linear response bandit problem

Alexander Goldenshluger and Assaf Zeevi

Abstract

We consider a two-armed bandit problem involving sequential sampling from two non-homogeneous populations. The response in each population is determined by a random covariate vector and a vector of parameters whose values are not known a priori. The goal is to maximize cumulative expected reward. We study this problem in a minimax setting and develop rate-optimal policies that combine myopic actions based on least squares estimates with a suitable "forced sampling" strategy. It is shown that the regret grows logarithmically in the time horizon $n$, and that no policy can achieve a slower growth rate over all feasible problem instances. In this setting of linear response bandits, the identity of the sub-optimal action changes with the values of the covariate vector, and the optimal policy must sample from the inferior population at a rate that grows like $\sqrt{n}$.
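The policy structure described in the abstract, myopic least-squares choices interleaved with occasional forced exploration, can be illustrated in a few lines. The following Python snippet is a minimal sketch, not the paper's exact construction: the parameter values, covariate distribution, noise model, periodic forcing schedule, and the helper forced_sampling_greedy are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem instance: each arm k has linear mean response
# <beta[k], x> plus Gaussian noise; beta is unknown to the policy.
d, n = 3, 5000
beta = np.array([[1.0, 0.5, -0.5],   # arm 0 (assumed values)
                 [1.0, -0.5, 0.5]])  # arm 1 (assumed values)

def forced_sampling_greedy(n, force_gap=100):
    """Greedy least-squares policy with periodic forced sampling.

    The periodic schedule is a placeholder for the paper's forced-sampling
    construction; the greedy step picks the arm whose estimated linear
    response at the current covariate is larger.
    """
    X_hist = [[], []]             # covariates observed per arm
    y_hist = [[], []]             # responses observed per arm
    beta_hat = np.zeros((2, d))   # least squares estimates
    regret = 0.0
    for t in range(n):
        # covariate vector with an intercept component
        x = np.concatenate(([1.0], rng.uniform(-1.0, 1.0, d - 1)))
        if t % force_gap < 2:     # forced exploration: sample both arms briefly
            a = t % 2
        else:                     # myopic action on current estimates
            a = int(beta_hat[1] @ x > beta_hat[0] @ x)
        y = beta[a] @ x + rng.normal()   # observed noisy response
        X_hist[a].append(x)
        y_hist[a].append(y)
        # naive full OLS refit for the sampled arm (fine for a sketch)
        A = np.asarray(X_hist[a])
        beta_hat[a] = np.linalg.lstsq(A, np.asarray(y_hist[a]), rcond=None)[0]
        # oracle regret uses the true parameters, for evaluation only
        regret += max(beta[0] @ x, beta[1] @ x) - beta[a] @ x
    return regret

print(f"cumulative regret over n=5000: {forced_sampling_greedy(5000):.1f}")
```

The role of the forced samples in this sketch mirrors their role in the paper: they keep each arm's design matrix well conditioned, so the least squares estimates remain reliable even for an arm the greedy step would otherwise stop pulling. Note that because the better arm depends on the realized covariate $x$, purely greedy play without forcing can lock onto a single estimate and incur linear regret.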

Article information

Source
Stoch. Syst., Volume 3, Number 1 (2013), 230-261.

Dates
First available in Project Euclid: 24 February 2014

Permanent link to this document
https://projecteuclid.org/euclid.ssy/1393251985

Digital Object Identifier
doi:10.1214/11-SSY032

Mathematical Reviews number (MathSciNet)
MR3353472

Zentralblatt MATH identifier
1352.91009

Subjects
Primary: 62L05: Sequential design
Secondary: 60G40, 62C20

Keywords
Sequential allocation; estimation; bandit problems; regret; minimax; rate-optimal policy

Citation

Goldenshluger, Alexander; Zeevi, Assaf. A linear response bandit problem. Stoch. Syst. 3 (2013), no. 1, 230–261. doi:10.1214/11-SSY032. https://projecteuclid.org/euclid.ssy/1393251985

