## The Annals of Mathematical Statistics

### Some Remarks on the Two-Armed Bandit

#### Abstract

In this paper we consider the following situation: An experimenter has to perform a total of $N$ trials on two Bernoulli-type experiments $E_1$ and $E_2$ with success probabilities $\alpha$ and $\beta$ respectively, where both $\alpha$ and $\beta$ are unknown to him. The trials are to be carried out sequentially and independently, except that for each trial the experimenter may choose between $E_1$ and $E_2$, using the information obtained in all previous trials. The decisions on the part of the experimenter to use $E_1$ or $E_2$ in the successive trials may be randomized, i.e., for any trial he may use a chance mechanism in order to choose $E_1$ or $E_2$ with probabilities $\delta$ and $1 - \delta$ respectively, where $\delta$ may depend on the decisions taken and the results obtained in the previous trials. A strategy $\Delta$ will be a set of such $\delta$'s, completely describing the experimenter's behavior in every conceivable situation. We assume the experimenter wants to maximize the number of successes. More precisely, we assume that he incurs a loss \begin{equation*}\tag{1.1} L(\alpha, \beta, s) = N \max(\alpha, \beta) - s\end{equation*} if he scores a total of $s$ successes. If he uses a strategy $\Delta$, his expected loss is then given by the risk function \begin{equation*}\tag{1.2} R(\alpha, \beta, \Delta) = N \max(\alpha, \beta) - E(S\mid\alpha, \beta, \Delta),\end{equation*} where $S$ denotes the random number of successes obtained. Thus the risk of a strategy $\Delta$ equals the expected amount by which the number of successes the experimenter will obtain using $\Delta$ falls short of the number of successes he would score if he were clairvoyant and used the more favorable experiment throughout the $N$ trials. It is easy to see that $R(\alpha, \beta, \Delta)$ also equals $|\alpha - \beta|$ times the expected number of trials in which the less favorable experiment is performed under $\Delta$.
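
The identity between the risk and the expected number of trials with the less favorable experiment can be checked in a few lines. The following is only a sketch; the symbol $M$, denoting the random number of trials in which $E_1$ is performed under $\Delta$, is introduced here for illustration and is not part of the original notation. Given the past and the choice made, a trial with $E_1$ succeeds with probability $\alpha$ and a trial with $E_2$ with probability $\beta$, so summing over the $N$ trials and taking expectations gives \begin{equation*} E(S\mid\alpha, \beta, \Delta) = \alpha E(M) + \beta\,(N - E(M)).\end{equation*} If $\alpha \geq \beta$, substitution in the risk function yields \begin{equation*} R(\alpha, \beta, \Delta) = N\alpha - \alpha E(M) - \beta\,(N - E(M)) = (\alpha - \beta)(N - E(M)),\end{equation*} where $N - E(M)$ is the expected number of trials with $E_2$, the less favorable experiment. If $\alpha < \beta$, the same computation gives \begin{equation*} R(\alpha, \beta, \Delta) = N\beta - \alpha E(M) - \beta\,(N - E(M)) = (\beta - \alpha) E(M),\end{equation*} and $E(M)$ is now the expected number of trials with the less favorable experiment $E_1$.
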
We say that state $(m, k; n, l)$ is reached during the series of trials if in the first $m + n$ trials $E_1$ is performed $m$ times, yielding $k$ successes, and $E_2$ is performed $n$ times, yielding $l$ successes. Clearly, under a strategy $\Delta$, the probability that this will happen is of the form \begin{equation*}\tag{1.3} \pi_{\alpha,\beta,\Delta}(m, k; n, l) = p_\Delta(m, k; n, l)\,\alpha^k(1 - \alpha)^{m-k} \beta^l(1 - \beta)^{n-l},\end{equation*} where $p_\Delta(m, k; n, l)$ depends on the state $(m, k; n, l)$ and the strategy $\Delta$, but not on $\alpha$ and $\beta$. It is easy to show (e.g. by induction on $N$) that the class of all strategies is convex in the sense that there exists, for every pair of strategies $\Delta_1$ and $\Delta_2$ and for every $\lambda \in \lbrack 0, 1 \rbrack$, a strategy $\Delta$ such that \begin{equation*}\tag{1.4} p_\Delta(m, k; n, l) = \lambda p_{\Delta_1}(m, k; n, l) + (1 - \lambda)p_{\Delta_2}(m, k; n, l)\end{equation*} for every state $(m, k; n, l)$. Moreover, this strategy $\Delta$ can always be chosen so that the experimenter bases all his decisions exclusively on the numbers of successes and failures observed with $E_1$ and $E_2$, irrespective of the order in which these data became available. Denoting the class of all such strategies by $\mathscr{D}$ and remarking that $R(\alpha, \beta, \Delta)$ can be expressed in terms of the $\pi_{\alpha,\beta,\Delta}(m, k; n, l)$, we may conclude that $\mathscr{D}$ is an essentially complete class of strategies. We denote the probabilities $\delta$ constituting any strategy in $\mathscr{D}$ by $\delta(m, k; n, l)$: the probability with which the experimenter, having completed the first $m + n$ trials and thereby having reached state $(m, k; n, l)$, chooses $E_1$ for the next trial.
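
For a strategy in $\mathscr{D}$, the factor $p_\Delta(m, k; n, l)$ in (1.3) can be computed by a forward pass over the states. The following Python sketch illustrates this under stated assumptions: the function names `state_weights` and `state_probability` are hypothetical, and a strategy is represented as a function `delta(m, k, n, l)` returning the probability of choosing $E_1$.

```python
from collections import defaultdict


def state_weights(delta, N):
    """Return {(m, k, n, l): p_Delta(m, k; n, l)} for all states with m + n <= N.

    delta(m, k, n, l) is the probability of choosing E1 in state (m, k; n, l).
    """
    weights = {(0, 0, 0, 0): 1.0}
    frontier = {(0, 0, 0, 0): 1.0}
    for _ in range(N):
        nxt = defaultdict(float)
        for (m, k, n, l), p in frontier.items():
            d = delta(m, k, n, l)
            # The success probabilities are factored out, as in (1.3), so a
            # success and a failure with the same experiment carry the same weight.
            nxt[(m + 1, k + 1, n, l)] += p * d        # E1, success
            nxt[(m + 1, k, n, l)] += p * d            # E1, failure
            nxt[(m, k, n + 1, l + 1)] += p * (1 - d)  # E2, success
            nxt[(m, k, n + 1, l)] += p * (1 - d)      # E2, failure
        frontier = dict(nxt)
        weights.update(frontier)
    return weights


def state_probability(p, state, alpha, beta):
    """Probability (1.3) of reaching `state`, given its weight p = p_Delta(state)."""
    m, k, n, l = state
    return p * alpha**k * (1 - alpha)**(m - k) * beta**l * (1 - beta)**(n - l)


# A strategy that always uses E1: the weights reduce to binomial coefficients.
always_e1 = state_weights(lambda m, k, n, l: 1.0, 3)

# A fair-coin strategy: the probabilities of the states with m + n = N sum to 1.
coin = state_weights(lambda m, k, n, l: 0.5, 4)
total = sum(
    state_probability(p, s, 0.3, 0.8)
    for s, p in coin.items()
    if s[0] + s[2] == 4
)
```

The weights do not depend on $\alpha$ and $\beta$, so the same table can be reused to evaluate (1.3) for any pair of success probabilities.
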
We note that if $p_\Delta(m, k; n, l) = 0$ for a state $(m, k; n, l)$, then $\delta(m, k; n, l)$ does not play any role in the description of $\Delta$ and may be assigned an arbitrary value without affecting the strategy. We shall say that any strategy $\Delta'$ such that $p_{\Delta'}(m, k; n, l) = p_\Delta(m, k; n, l)$ for all states $(m, k; n, l)$ constitutes a version of $\Delta$. Since we are considering a symmetric problem in the sense that it remains invariant when $\alpha$ and $\beta$ are interchanged, it seems reasonable to consider strategies with a similar symmetry. Thus we are led to define the class $\mathscr{L}$ of all symmetric strategies: $\Delta\in\mathscr{L}$ iff $\Delta\in\mathscr{D}$ and $\delta(m, k; n, l) = 1 - \delta(n, l; m, k)$ for all states $(m, k; n, l)$ with $p_\Delta(m, k; n, l) \neq 0$. Clearly, for $\Delta\in\mathscr{L}$, \begin{equation*}\tag{1.5} \delta(m, k; m, k) = \frac{1}{2} \quad \text{if } p_\Delta(m, k; m, k) \neq 0,\quad \text{and}\end{equation*} \begin{equation*}\tag{1.6} p_\Delta(m, k; n, l) = p_\Delta(n, l; m, k) \quad \text{for all states } (m, k; n, l).\end{equation*} It follows that, for $\Delta \in \mathscr{L}$ and all $(\alpha, \beta)$, \begin{equation*}\tag{1.7} R(\alpha, \beta, \Delta) = R(\beta, \alpha, \Delta).\end{equation*} Among the contributions to the two-armed bandit problem the work of W. Vogel deserves special mention. Considering the same set-up as we do, he discussed a certain subclass of the class $\mathscr{L}$ in [4], and obtained asymptotic bounds for the minimax risk for $N \rightarrow \infty$ in [5]. Since we shall not be concerned with asymptotics in this paper, we state the following result without a formal proof: The lower bound for the asymptotic minimax risk for $N \rightarrow \infty$ obtained by Vogel in [5] may be raised by a factor $2^\frac{1}{2}$. This is proved by applying the same method that was used in [5] to the optimal symmetric strategy for $\alpha + \beta = 1$ that was discussed in [4].
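
The symmetry (1.7) of the risk can be checked numerically for a concrete symmetric strategy. The Python sketch below computes the risk (1.2) exactly by a forward pass over the states. The names `risk` and `follow_the_leader` are hypothetical, and the rule itself (prefer the experiment with the larger value of (successes + 1)/(trials + 2), and randomize evenly in case of a tie) is chosen here only as an example of a strategy satisfying $\delta(m, k; n, l) = 1 - \delta(n, l; m, k)$; it is not taken from the paper.

```python
def risk(delta, N, alpha, beta):
    """Exact risk N * max(alpha, beta) - E(S) of the strategy given by delta."""
    frontier = {(0, 0, 0, 0): 1.0}  # probabilities of the states reached so far
    expected_successes = 0.0
    for _ in range(N):
        nxt = {}
        for (m, k, n, l), pr in frontier.items():
            d = delta(m, k, n, l)  # probability of choosing E1 in this state
            expected_successes += pr * (d * alpha + (1 - d) * beta)
            moves = [
                ((m + 1, k + 1, n, l), d * alpha),             # E1, success
                ((m + 1, k, n, l), d * (1 - alpha)),           # E1, failure
                ((m, k, n + 1, l + 1), (1 - d) * beta),        # E2, success
                ((m, k, n + 1, l), (1 - d) * (1 - beta)),      # E2, failure
            ]
            for state, q in moves:
                if q > 0:
                    nxt[state] = nxt.get(state, 0.0) + pr * q
        frontier = nxt
    return N * max(alpha, beta) - expected_successes


def follow_the_leader(m, k, n, l):
    """Hypothetical symmetric rule, compared in integers to avoid rounding ties."""
    a = (k + 1) * (n + 2)  # (k + 1)/(m + 2) scaled by (m + 2)(n + 2)
    b = (l + 1) * (m + 2)  # (l + 1)/(n + 2) scaled by (m + 2)(n + 2)
    if a > b:
        return 1.0
    if a < b:
        return 0.0
    return 0.5


r_ab = risk(follow_the_leader, 6, 0.3, 0.7)
r_ba = risk(follow_the_leader, 6, 0.7, 0.3)

# For contrast, always using E1 is not symmetric, and neither is its risk.
r_always_ab = risk(lambda m, k, n, l: 1.0, 6, 0.3, 0.7)
r_always_ba = risk(lambda m, k, n, l: 1.0, 6, 0.7, 0.3)
```

The two values `r_ab` and `r_ba` agree up to floating-point rounding, whereas the risk of the asymmetric rule changes when $\alpha$ and $\beta$ are interchanged.
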
Combining this lower bound with the upper bound given in [5] we find that the asymptotic minimax risk must be between $0.265\,N^{\frac{1}{2}}$ and $0.376\,N^{\frac{1}{2}}$. In Section 2 we study the Bayes strategies in $\mathscr{D}$. By means of a certain recurrence relation we arrive at a complete characterization of these strategies, thus generalizing D. Feldman's well-known result in [3] for the case where the experimenter knows the values of $\alpha$ and $\beta$ except for their order. In addition we obtain expressions for the Bayes risk of any prior distribution. Using these results we proceed to derive in Section 3 certain monotonicity properties of $\delta(m, k; n, l)$ for any admissible strategy $\Delta$ in $\mathscr{D}$. Though these relations may seem intuitively evident, one does well to remember that the two-armed bandit problem has been shown to defy intuition in many respects (cf. [2]). In Section 4 we prove the existence of an admissible symmetric minimax-risk strategy having the monotonicity properties just mentioned. This fact to some degree facilitates the search for minimax-risk strategies, but even so, the algebra involved becomes progressively more complicated with increasing $N$ and seems to be prohibitive already for $N$ as small as 5.
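
To give a concrete picture of a Bayes strategy in $\mathscr{D}$, the Python sketch below computes the maximal expected number of successes by backward induction over the states. It assumes independent uniform prior distributions for $\alpha$ and $\beta$, an assumption made here purely for illustration. The abstract does not state the recurrence relation used in Section 2, and this standard dynamic-programming recursion is not claimed to be it. The function names `bayes_value` and `bayes_risk` are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


def bayes_value(N):
    """Maximal expected number of successes in N trials under uniform priors."""

    @lru_cache(maxsize=None)
    def value(m, k, n, l):
        if m + n == N:
            return Fraction(0)
        # Posterior means of alpha and beta under independent uniform priors.
        a = Fraction(k + 1, m + 2)
        b = Fraction(l + 1, n + 2)
        # Expected future successes when the next trial uses E1 or E2.
        use_e1 = a * (1 + value(m + 1, k + 1, n, l)) + (1 - a) * value(m + 1, k, n, l)
        use_e2 = b * (1 + value(m, k, n + 1, l + 1)) + (1 - b) * value(m, k, n + 1, l)
        return max(use_e1, use_e2)

    return value(0, 0, 0, 0)


def bayes_risk(N):
    """Bayes risk for this prior: N * E max(alpha, beta) minus the value above.

    For two independent uniform variables, E max(alpha, beta) = 2/3.
    """
    return Fraction(2 * N, 3) - bayes_value(N)


v2 = bayes_value(2)
r2 = bayes_risk(2)
```

Exact rational arithmetic is used so that ties between the two experiments are detected without rounding error. In a state where `use_e1` and `use_e2` coincide, any value of $\delta(m, k; n, l)$ attains the maximum under this prior.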

#### Article information

Source
Ann. Math. Statist., Volume 41, Number 6 (1970), 1906-1916.

Dates
First available in Project Euclid: 27 April 2007

https://projecteuclid.org/euclid.aoms/1177696692

Digital Object Identifier
doi:10.1214/aoms/1177696692

Mathematical Reviews number (MathSciNet)
MR278454

Zentralblatt MATH identifier
0222.62007
