## Bernoulli


### Some monotonicity properties of parametric and nonparametric Bayesian bandits

Yaming Yu

#### Abstract

One of two independent stochastic processes (arms) is to be selected at each of $n$ stages. The selection is sequential and depends on past observations as well as the prior information. The objective is to maximize the expected future-discounted sum of the $n$ observations. We study structural properties of this classical bandit problem, in particular how the maximum expected payoff and the optimal strategy vary with the priors, in two settings: (a) observations from each arm have an exponential family distribution and different arms are assigned independent conjugate priors; (b) observations from each arm have a nonparametric distribution and different arms are assigned independent Dirichlet process priors. In both settings, we derive results of the following type: (i) for a particular arm and a fixed prior weight, the maximum expected payoff increases as the prior mean yield increases; (ii) for a fixed prior mean yield, the maximum expected payoff increases as the prior weight decreases. Specializing to the one-armed bandit, the second result captures the intuition that, given the same immediate payoff, the less one knows about an arm, the more desirable it becomes, because there remains more information to be gained when selecting that arm. In the parametric case, our results extend those of Gittins and Wang (Ann. Statist. 20 (1992) 1625–1636) concerning Bernoulli and normal bandits (see also Yao (In Time Series and Related Topics (2006) pp. 284–294, IMS)). In the nonparametric case, we extend those of Clayton and Berry (Ann. Statist. 13 (1985) 1523–1534). A key tool in the derivation is stochastic orders.
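The monotonicity statements (i) and (ii) can be checked numerically in the simplest special case the abstract mentions: a finite-horizon, geometrically discounted one-armed Bernoulli bandit with a conjugate Beta prior, where the alternative is an arm with a known payoff. The sketch below is an illustration, not code from the paper; the known payoff `LAMBDA`, discount `BETA`, and horizon are arbitrary choices for the demo.

```python
from functools import lru_cache

# Illustrative parameters (assumed values, not from the paper).
LAMBDA = 0.5   # per-pull payoff of the known arm
BETA = 0.9     # geometric discount factor

@lru_cache(maxsize=None)
def value(a, b, n):
    """Maximum expected discounted payoff with n pulls remaining,
    when the unknown Bernoulli arm has a Beta(a, b) posterior."""
    if n == 0:
        return 0.0
    p = a / (a + b)  # posterior mean yield of the unknown arm
    pull_known = LAMBDA + BETA * value(a, b, n - 1)
    pull_unknown = (p * (1.0 + BETA * value(a + 1, b, n - 1))
                    + (1.0 - p) * BETA * value(a, b + 1, n - 1))
    return max(pull_known, pull_unknown)

# (i) Fixed prior weight a + b = 3: a larger prior mean yield
#     (Beta(2,1), mean 2/3) beats a smaller one (Beta(1,2), mean 1/3).
print(value(2, 1, 10) >= value(1, 2, 10))
# (ii) Fixed prior mean 1/2: a smaller prior weight (Beta(1,1))
#      beats a larger one (Beta(2,2)) — less is known, more can be learned.
print(value(1, 1, 10) >= value(2, 2, 10))
```

Because each pull of the unknown arm updates the Beta posterior by a unit of weight, the dynamic program stays on a small integer lattice and memoization makes it cheap; the two printed comparisons mirror results (i) and (ii) for this special case.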

#### Article information

**Source**
Bernoulli, Volume 23, Number 4B (2017), 3685–3710.

**Dates**
Received: April 2011
Revised: May 2016
First available in Project Euclid: 23 May 2017

**Permanent link to this document**
https://projecteuclid.org/euclid.bj/1495505106

**Digital Object Identifier**
doi:10.3150/16-BEJ862

**Zentralblatt MATH identifier**
06778300

#### Citation

Yu, Yaming. Some monotonicity properties of parametric and nonparametric Bayesian bandits. Bernoulli 23 (2017), no. 4B, 3685–3710. doi:10.3150/16-BEJ862. https://projecteuclid.org/euclid.bj/1495505106

#### References

• [1] Bellman, R. (1956). A problem in the sequential design of experiments. Sankhyā 16 221–229.
• [2] Berry, D.A. (1972). A Bernoulli two-armed bandit. Ann. Math. Statist. 43 871–897.
• [3] Berry, D.A. and Fristedt, B. (1979). Bernoulli one-armed bandits—arbitrary discount sequences. Ann. Statist. 7 1086–1105.
• [4] Berry, D.A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Monographs on Statistics and Applied Probability. London: Chapman & Hall.
• [5] Bradt, R.N., Johnson, S.M. and Karlin, S. (1956). On sequential designs for maximizing the sum of $n$ observations. Ann. Math. Statist. 27 1060–1074.
• [6] Brown, L.D. (1986). Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Institute of Mathematical Statistics Lecture Notes—Monograph Series, 9. Hayward, CA: IMS.
• [7] Chattopadhyay, M.K. (1994). Two-armed Dirichlet bandits with discounting. Ann. Statist. 22 1212–1221.
• [8] Chernoff, H. (1968). Optimal stochastic control. Sankhyā Ser. A 30 221–252.
• [9] Chernoff, H. and Petkau, A.J. (1986). Numerical solutions for Bayes sequential decision problems. SIAM J. Sci. Statist. Comput. 7 46–59.
• [10] Clayton, M.K. and Berry, D.A. (1985). Bayesian nonparametric bandits. Ann. Statist. 13 1523–1534.
• [11] Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209–230.
• [12] Gittins, J.C. (1979). Bandit processes and dynamic allocation indices. J. Roy. Statist. Soc. Ser. B 41 148–177.
• [13] Gittins, J.C., Glazebrook, K.D. and Weber, R.R. (2011). Multi-Armed Bandit Allocation Indices, 2nd ed. New York: Wiley.
• [14] Gittins, J.C. and Jones, D.M. (1974). A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (European Meeting of Statisticians, Budapest, 1972) (J. Gani, ed.) 241–266. Amsterdam: North-Holland.
• [15] Gittins, J. and Wang, Y.-G. (1992). The learning component of dynamic allocation indices. Ann. Statist. 20 1625–1636.
• [16] Herschkorn, S.J. (1997). Bandit bounds from stochastic variability extrema. Statist. Probab. Lett. 35 283–288.
• [17] Karlin, S. (1968). Total Positivity. Vol. I. Stanford, CA: Stanford Univ. Press.
• [18] Kaspi, H. and Mandelbaum, A. (1998). Multi-armed bandits in discrete and continuous time. Ann. Appl. Probab. 8 1270–1290.
• [19] Marshall, A.W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Mathematics in Science and Engineering 143. New York: Academic Press.
• [20] Müller, A. and Stoyan, D. (2002). Comparison Methods for Stochastic Models and Risks. Wiley Series in Probability and Statistics. Chichester: Wiley.
• [21] Rieder, U. and Wagner, H. (1991). Structured policies in the sequential design of experiments. Ann. Oper. Res. 32 165–188.
• [22] Shaked, M. and Shanthikumar, J.G. (2007). Stochastic Orders. Springer Series in Statistics. New York: Springer.
• [23] Whitt, W. (1985). Uniform conditional variability ordering of probability distributions. J. Appl. Probab. 22 619–633.
• [24] Whittle, P. (1980). Multi-armed bandits and the Gittins index. J. Roy. Statist. Soc. Ser. B 42 143–149.
• [25] Yao, Y.-C. (2006). Some results on the Gittins index for a normal reward process. In Time Series and Related Topics. Institute of Mathematical Statistics Lecture Notes—Monograph Series 52 284–294. Beachwood, OH: IMS.
• [26] Yu, Y. (2009). On the entropy of compound distributions on nonnegative integers. IEEE Trans. Inform. Theory 55 3645–3650.
• [27] Yu, Y. (2009). Monotonic convergence in an information-theoretic law of small numbers. IEEE Trans. Inform. Theory 55 5412–5422.
• [28] Yu, Y. (2010). Relative log-concavity and a pair of triangle inequalities. Bernoulli 16 459–470.