Bernoulli, Volume 23, Number 4B (2017), 3685-3710.

Some monotonicity properties of parametric and nonparametric Bayesian bandits

Yaming Yu


Abstract

One of two independent stochastic processes (arms) is to be selected at each of $n$ stages. The selection is sequential and depends on past observations as well as the prior information. The objective is to maximize the expected future-discounted sum of the $n$ observations. We study structural properties of this classical bandit problem, in particular how the maximum expected payoff and the optimal strategy vary with the priors, in two settings: (a) observations from each arm have an exponential family distribution and different arms are assigned independent conjugate priors; (b) observations from each arm have a nonparametric distribution and different arms are assigned independent Dirichlet process priors. In both settings, we derive results of the following type: (i) for a particular arm and a fixed prior weight, the maximum expected payoff increases as the prior mean yield increases; (ii) for a fixed prior mean yield, the maximum expected payoff increases as the prior weight decreases. Specializing to the one-armed bandit, the second result captures the intuition that, given the same immediate payoff, the less one knows about an arm, the more desirable it becomes because there remains more information to be gained when selecting that arm. In the parametric case, our results extend those of Gittins and Wang (Ann. Statist. 20 (1992) 1625–1636) concerning Bernoulli and normal bandits (see also Yao (In Time Series and Related Topics (2006) 284–294, IMS)). In the nonparametric case, we extend those of Clayton and Berry (Ann. Statist. 13 (1985) 1523–1534). A key tool in the derivation is stochastic orders.
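As an illustration only (not taken from the article), the two monotonicity statements can be checked numerically in the simplest parametric case: two Bernoulli arms with independent Beta priors. The sketch below computes the maximum expected payoff by backward induction. It assumes a geometric discount factor, and it parametrizes each Beta prior by its mean and its weight (the sum of the two Beta parameters). The function name `max_expected_payoff` and its arguments are hypothetical.

```python
from functools import lru_cache


def max_expected_payoff(mean1, weight1, mean2, weight2, n, discount=0.9):
    """Maximum expected discounted payoff over n stages, two Bernoulli arms.

    Assumption: arm i has a Beta(mean_i * weight_i, (1 - mean_i) * weight_i)
    prior, so mean_i is the prior mean yield and weight_i the prior weight.
    Assumption: the reward at stage k is discounted by discount ** (k - 1).
    """

    @lru_cache(maxsize=None)
    def value(s1, f1, s2, f2, stages_left):
        # s_i and f_i count the successes and failures observed on arm i.
        if stages_left == 0:
            return 0.0
        best = float("-inf")
        for arm in (1, 2):
            if arm == 1:
                # Posterior mean of arm 1 under the conjugate Beta update.
                p = (mean1 * weight1 + s1) / (weight1 + s1 + f1)
                win = value(s1 + 1, f1, s2, f2, stages_left - 1)
                lose = value(s1, f1 + 1, s2, f2, stages_left - 1)
            else:
                # Posterior mean of arm 2 under the conjugate Beta update.
                p = (mean2 * weight2 + s2) / (weight2 + s2 + f2)
                win = value(s1, f1, s2 + 1, f2, stages_left - 1)
                lose = value(s1, f1, s2, f2 + 1, stages_left - 1)
            # Expected immediate reward plus discounted continuation value.
            payoff = p * (1.0 + discount * win) + (1.0 - p) * discount * lose
            best = max(best, payoff)
        return best

    return value(0, 0, 0, 0, n)
```

For example, comparing `max_expected_payoff(0.5, 2.0, 0.5, 100.0, 8)` with `max_expected_payoff(0.6, 2.0, 0.5, 100.0, 8)` varies the prior mean of the first arm at a fixed weight, as in (i), while comparing it with `max_expected_payoff(0.5, 4.0, 0.5, 100.0, 8)` varies the prior weight at a fixed mean, as in (ii). The second arm is given a large prior weight so that it behaves almost like a known arm, which mimics the one-armed bandit.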

Article information

Source
Bernoulli Volume 23, Number 4B (2017), 3685-3710.

Dates
Received: April 2011
Revised: May 2016
First available in Project Euclid: 23 May 2017

Permanent link to this document
https://projecteuclid.org/euclid.bj/1495505106

Digital Object Identifier
doi:10.3150/16-BEJ862

Zentralblatt MATH identifier
06778300

Keywords
Bernoulli bandits; convex order; Dirichlet bandits; log-concavity; optimal stopping; sequential decision; two-armed bandits

Citation

Yu, Yaming. Some monotonicity properties of parametric and nonparametric Bayesian bandits. Bernoulli 23 (2017), no. 4B, 3685-3710. doi:10.3150/16-BEJ862. https://projecteuclid.org/euclid.bj/1495505106



References

  • [1] Bellman, R. (1956). A problem in the sequential design of experiments. Sankhyā 16 221–229.
  • [2] Berry, D.A. (1972). A Bernoulli two-armed bandit. Ann. Math. Statist. 43 871–897.
  • [3] Berry, D.A. and Fristedt, B. (1979). Bernoulli one-armed bandits—arbitrary discount sequences. Ann. Statist. 7 1086–1105.
  • [4] Berry, D.A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Monographs on Statistics and Applied Probability. London: Chapman & Hall.
  • [5] Bradt, R.N., Johnson, S.M. and Karlin, S. (1956). On sequential designs for maximizing the sum of $n$ observations. Ann. Math. Statist. 27 1060–1074.
  • [6] Brown, L.D. (1986). Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Institute of Mathematical Statistics Lecture Notes—Monograph Series, 9. Hayward, CA: IMS.
  • [7] Chattopadhyay, M.K. (1994). Two-armed Dirichlet bandits with discounting. Ann. Statist. 22 1212–1221.
  • [8] Chernoff, H. (1968). Optimal stochastic control. Sankhyā Ser. A 30 221–252.
  • [9] Chernoff, H. and Petkau, A.J. (1986). Numerical solutions for Bayes sequential decision problems. SIAM J. Sci. Statist. Comput. 7 46–59.
  • [10] Clayton, M.K. and Berry, D.A. (1985). Bayesian nonparametric bandits. Ann. Statist. 13 1523–1534.
  • [11] Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209–230.
  • [12] Gittins, J.C. (1979). Bandit processes and dynamic allocation indices. J. Roy. Statist. Soc. Ser. B 41 148–177.
  • [13] Gittins, J.C., Glazebrook, K.D. and Weber, R.R. (2011). Multi-Armed Bandit Allocation Indices, 2nd ed. New York: Wiley.
  • [14] Gittins, J.C. and Jones, D.M. (1974). A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (European Meeting Statisticians, Budapest, 1972) (J. Gani, ed.) 241–266. North-Holland: Amsterdam.
  • [15] Gittins, J. and Wang, Y.-G. (1992). The learning component of dynamic allocation indices. Ann. Statist. 20 1625–1636.
  • [16] Herschkorn, S.J. (1997). Bandit bounds from stochastic variability extrema. Statist. Probab. Lett. 35 283–288.
  • [17] Karlin, S. (1968). Total Positivity. Vol. I. Stanford, CA: Stanford Univ. Press.
  • [18] Kaspi, H. and Mandelbaum, A. (1998). Multi-armed bandits in discrete and continuous time. Ann. Appl. Probab. 8 1270–1290.
  • [19] Marshall, A.W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Mathematics in Science and Engineering 143. New York: Academic Press.
  • [20] Müller, A. and Stoyan, D. (2002). Comparison Methods for Stochastic Models and Risks. Wiley Series in Probability and Statistics. Chichester: Wiley.
  • [21] Rieder, U. and Wagner, H. (1991). Structured policies in the sequential design of experiments. Ann. Oper. Res. 32 165–188.
  • [22] Shaked, M. and Shanthikumar, J.G. (2007). Stochastic Orders. Springer Series in Statistics. New York: Springer.
  • [23] Whitt, W. (1985). Uniform conditional variability ordering of probability distributions. J. Appl. Probab. 22 619–633.
  • [24] Whittle, P. (1980). Multi-armed bandits and the Gittins index. J. Roy. Statist. Soc. Ser. B 42 143–149.
  • [25] Yao, Y.-C. (2006). Some results on the Gittins index for a normal reward process. In Time Series and Related Topics. Institute of Mathematical Statistics Lecture Notes—Monograph Series 52 284–294. Beachwood, OH: IMS.
  • [26] Yu, Y. (2009). On the entropy of compound distributions on nonnegative integers. IEEE Trans. Inform. Theory 55 3645–3650.
  • [27] Yu, Y. (2009). Monotonic convergence in an information-theoretic law of small numbers. IEEE Trans. Inform. Theory 55 5412–5422.
  • [28] Yu, Y. (2010). Relative log-concavity and a pair of triangle inequalities. Bernoulli 16 459–470.