Gittins' theorem under uncertainty

We study dynamic allocation problems for discrete-time multi-armed bandits under uncertainty, based on the theory of nonlinear expectations. We show that, under strong independence of the bandits and with some relaxation in the definition of optimality, a Gittins allocation index gives optimal choices. This involves studying the interaction of our uncertainty with controls which determine the filtration. We also run a simple numerical example which illustrates the interaction between the willingness to explore and the uncertainty aversion of the agent when making decisions.


Introduction
When making decisions, people generally have a strict preference for options which they understand well. Since the classical work of Knight [52] and Keynes [50], there has been a stream of thinking within economics and statistics that focuses on the difference between the randomness of an outcome and lack of knowledge of its probability distribution (sometimes called 'Knightian uncertainty'). This lack of knowledge is often related to estimation, as the probabilities used are often based on past observations. This raises a natural question: how should we make decisions, given that they will affect both our short-term outcomes and the information available in the future? Shall we make a decision to explore and obtain new information, or shall we exploit the information available to optimize our profit? A simple setting in which this arises is a multi-armed bandit problem.
Modeling learning of the distribution of outcomes leads us to a paradox due to inconsistency in our decisions. As Keynes is said to have remarked 1 , "When the facts change, I change my mind. What do you do, sir?" The question a rational decision maker faces is, "if I suspect that I will change my opinions or preferences tomorrow, how do I account for this today?" In this paper, we use the theory of nonlinear expectations (or equivalently risk measures) which are known to model Knightian uncertainty (see Föllmer and Schied [36]). This has been used to address statistical uncertainty, for example in [23,25]. To achieve consistency in decision making, we usually have to consider a time-consistent nonlinear expectation. This is widely studied through backward stochastic differential equations (BSDEs) (see, for example, the work of Peng and others [62,61,63,59]).
However, these approaches presume that the flow of information is not controlled. (Formally, the filtration of our agent is fixed and independent of their controls.) When we can control the observations which we will receive, this is not the case. In order to address this issue, while accounting for uncertainty, we discuss an alternative approach to deriving a time-consistent control problem, based on ideas from indifference pricing and the martingale optimality principle. Using this approach, we show that when comparing different independent options, we can calculate an index separately for each alternative such that the 'optimal' strategy is always to choose the option with the smallest index. This idea was initially proposed by Gittins and Jones [41] (see also [42,40]) in a context where the probability measure is fixed but estimation (in a Bayesian perspective) is modeled by the evolution of a Markov process.
Given this result, we demonstrate a numerical solution in a simple setting. We shall see that our algorithm gives behaviour which is both optimistic and pessimistic in different regimes, and compares well with existing methods for multi-armed bandits.

Multi-armed bandits
Multi-armed bandits are a classical toy example with which to study decision making with randomness. They are commonly known to have applications in medical trials (Armitage [4] or Anscombe [3]) and experimental design (Berry and Fristedt [13] or the classic paper of Robbins [69]), along with other areas. A few recent works in finance for portfolio selection can also be found in Huo and Fu [45] or Shen et al. [72]. The basic idea is that one has M 'bandits', or equivalently, a bandit with M arms, and one must choose which bandit should be played at each time. A key paper studying these systems, Gittins and Jones [41], argued that for a collection of independent bandits, each governed by a countable state Markov process, one could compute the "Gittins index" for each bandit separately, and the optimal strategy is to play the machine with the lowest index (or the highest, depending on the sign of gains/losses). The proof of this result has been obtained using a number of different perspectives, for example Weber's prevailing charge formulation [75] (which we consider in more detail below), Whittle's retirement option formulation [76] and its extension without a Markov assumption by El Karoui and Karatzas [32] (and [33] in continuous time). A review of the proofs in discrete time is given by Frostig and Weiss [39]. However, in all these cases, the objective to be optimized is the discounted expected gain/loss; in particular, we are assumed to have no risk-aversion or uncertainty-aversion.
Gittins' index theory is commonly known as the first solution to an adaptive and sequential experimental design problem (from a Bayesian perspective) where the payoff of each bandit is assumed to be generated from a fixed unknown distribution which must be inferred 'on-the-fly', but where experimentation may be costly. As an alternative to Gittins' index, Agrawal [1] proposed the 'Upper Confidence Bound (UCB) algorithm' which achieves a regret (deviation of average reward from the optimal reward) with the minimal asymptotic order of log(N), as proved by Lai and Robbins [54]. In the UCB algorithm, we compute a confidence interval for the expected reward at each step, and then play the bandit with the largest upper bound (where positive outcomes are preferred). Intuitively, using an upper bound encourages us to try bandits where we are less certain of the average reward, which encourages exploration. This is a form of 'optimism' in decisions, which is counter-intuitive from the classical 'pessimistic' utility theory (à la von Neumann and Morgenstern [74]), where our preferences are for more certain outcomes.
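The UCB rule just described is simple to implement. The following is a minimal sketch (our own illustration, not from the paper) of the classical UCB1 index for reward maximization, where the exploration bonus sqrt(2 log t / n) plays the role of the confidence-interval width; all names and parameter values are illustrative.

```python
import math
import random

def ucb1_choice(means, counts, t):
    """Pick the arm with the largest upper confidence bound.

    means[i]  -- current empirical mean reward of arm i
    counts[i] -- number of times arm i has been played (assumed > 0)
    t         -- total number of plays so far
    """
    bounds = [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(means)), key=lambda i: bounds[i])

# Toy usage: three Bernoulli arms with unknown success probabilities.
true_p = [0.3, 0.5, 0.7]
means, counts = [0.0] * 3, [0] * 3
for i in range(3):                      # play each arm once to initialise
    r = float(random.random() < true_p[i])
    means[i], counts[i] = r, 1
for t in range(4, 1001):
    i = ucb1_choice(means, counts, t)
    r = float(random.random() < true_p[i])
    counts[i] += 1
    means[i] += (r - means[i]) / counts[i]   # running average update
```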
Typically, under appropriate assumptions, it is also the case that Gittins' index is a form of upper confidence bound for the estimated reward, an idea originally based on observations in Bather [12] and Kelly [49] and explored in more detail by Chang and Lai [21], followed up by Brezzi and Lai [18] (see Yao [78] for an error correction). Lattimore [55] also proves that Gittins' index achieves a minimal order bound on regret.
The apparent contradiction between the optimism of the UCB algorithm and Gittins' index and the pessimism of classical utility theory is what led to this paper. We extend the notion of Gittins' index to a robust (nonlinear) operator, allowing for uncertainty aversion. We work in a generic discrete-time setting, allowing for the possibility of online learning, non-stationary and continuous outcomes, embedding all these effects in an abstract 'nonlinear expectation'. (A concrete application to a simple setting with learning and uncertainty is given in Section 6.) In particular, we reformulate the proof of Gittins index theorem proposed by Weber [75], as this proof relies the least on the linearity of the expectation, and gives a natural form of time-consistency. We also remove a Markov assumption in Weber's proof by adapting El Karoui and Karatzas' formulation [32]. Our solution involves an optimal stopping problem under a nonlinear expectation, which can be converted to a low dimensional reflected BSDE (see for example El Karoui et al. [31] and Cheng and Riedel [22] in continuous time or An, Cohen and Ji [2] in discrete time). This allows us to see a balance between the desire to explore and to exploit in our decision making.
The robust version of Gittins index has some correlation to the adversarial bandit problem (see, for example, Auer et al. [7,8]) where we are playing the bandit against an adversary. Our theory proposes an 'optimal' deterministic strategy (no additional randomness is introduced at the decision time) against an adversary who tries to maximize our cost, which is slightly different from the known random algorithms for the adversarial bandit problem. The key difference is that, in the classical adversarial problem, an adversary is trying to maximize our 'regret' whereas in our setting, we view the adversary as trying to maximize our cost. In our setting, the adversary is also permitted to respond to our current controls at every time, and we do not assume a minimax theorem holds.
The study of an adversary for the payoff in the bandit problem (via Gittins index) has been considered by Caro and Gupta [20] and Kim and Lim [51] (with additional penalty in the reward) using Whittle's retirement option argument [76]. In their works, they rely heavily on a Markov assumption, which allows them to postulate a robust dynamic programming principle (see also Iyengar [46], Nilim and El Ghaoui [58]). Their formulation considers the robust Gittins' strategy as a promising solution due to its optimality for a single bandit, but they do not show optimality for multiple bandits. Furthermore, their Markov assumption restricts them to have a fixed uncertainty at all times and, therefore, it is not clear how to incorporate learning in their model. In contrast, our framework pays more attention to defining a good notion of dynamic optimality for our nonlinear expectation without any Markov assumption.
By encoding learning through nonlinear expectations, a wide range of modeling options are included in our approach. For example, statistical concerns are treated in this framework in [23,24] or Bielecki, Cialenco and Chen [14]. We could also allow adversarial choices with a range of a fixed set (as in the classical adversarial bandit problem [7] or as in [20,51]) or a random set which can be used to model learning as in the classical Gittins' theory. We also allow dynamic adversaries, which are not considered in the usual adversarial setting.
The paper proceeds as follows: In Section 2, we present some relevant existing approaches to multi-armed bandits, which we will adapt and combine to obtain our result. In Section 3, we give the required definitions for the nonlinear expectations that we use to evaluate our decisions. We also discuss the different notions of optimality which are available, and how they interact with the dynamic programming principle.
In Section 4, we give a summary of how we apply these expectations to a multi-armed bandit problem, state the key result, and give a sketch outline of the proof. The full details of this (rather technical) proof are given in two appendices: Appendix A works through the first half of the proof, giving careful analysis of an optimal stopping problem under nonlinear expectation, and the corresponding 'fair value' process, for a single bandit; Appendix B gives the second half of the proof, and demonstrates that the single bandit analysis yields an optimal strategy when deciding between multiple bandits. Further technical lemmas, which are used but do not contribute significantly to the main proof, are given in Appendix C.
Section 5 considers a simple example of a multiple bandit problem numerically, suggesting some connections with behavioural finance; the algorithm used to compute this example is given in Appendix D.
Remark 1. In most of the literature on Gittins' theorem, maximization of rewards is usually considered. For convenience, as is common in the theory of nonlinear expectations, we will consider the minimization of costs instead. Our presentation of others' results is done with the corresponding changes in sign.
The goal of the decision maker is to minimize the discounted total cost, for a given discount factor β ∈ (0, 1), when they can choose the order in which bandits are played.
Before considering a robust approach, we first outline the solution to this problem in a standard setting of classical expectation.
Definition 2.2. The Gittins index at time s ≥ 0 of the mth bandit is given by
γ^(m)(s) := ess inf_{τ ∈ T^(m)(s)} E[ Σ_{t=1}^{τ} β^t h^(m)(s+t) | F^(m)_s ] / E[ Σ_{t=1}^{τ} β^t | F^(m)_s ],
where T^(m)(s) denotes the family of positive (F^(m)_{s+t})_{t≥0}-stopping times and the essential infimum is taken in L^∞(F^(m)_s).
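As a quick sanity check (our own worked example, not from the paper): if the mth bandit has deterministic constant cost h^(m)(s+t) = c for all t, then for every stopping time τ the ratio above equals
E[ Σ_{t=1}^{τ} β^t c | F^(m)_s ] / E[ Σ_{t=1}^{τ} β^t | F^(m)_s ] = c,
so γ^(m)(s) = c; the index of a constant-cost bandit is just its cost.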

Classical Gittins Theorem
Our policies will be described by a (random) path in the space S, which indicates how many times each bandit has been played.
Definition 2.4 (Mandelbaum [57]). A Mandelbaum allocation strategy is an S-valued random sequence (η(n))_{n≥0} such that η(0) = 0, η(n+1) − η(n) ∈ {e^(1), ..., e^(M)} for all n, and {η(n+1) = η(n) + e^(m)} ∈ F_{η(n)} for each m. Here, e^(m) denotes the mth unit vector in S. We denote by A the family of all Mandelbaum allocation strategies.
Theorem 2.5 (Gittins' theorem, as proved by El Karoui and Karatzas [32]). Let (η*(n))_{n≥0} be a Mandelbaum allocation strategy such that {η*(n+1) = η*(n) + e^(m)} ⊆ {γ^(m)(η*_m(n)) = min_k γ^(k)(η*_k(n))} for all m ∈ M and n ≥ 0. Then (η*(n))_{n≥0} is an optimal solution to the optimisation problem (2.1).

In particular, Theorem 2.5 says that the strategy which always plays the bandit with the minimum index minimizes the expectation of the total discounted cost.
Remark 2. A Mandelbaum allocation strategy (η(n))_{n≥0} can also be represented by its increments, in particular, by a sequence of decision variables (ρ_n)_{n≥0} taking values in M. In other words, we can define (ρ_n)_{n≥0} such that {ρ_n = m} = {η(n+1) = η(n) + e^(m)}. We may then replace the objective (2.1) by (2.2): the minimization of E[ Σ_{n≥1} β^n h^(ρ_{n−1})(t^ρ_n) ], where t^ρ_n denotes the number of times the bandit ρ_{n−1} has been played up to and including the nth play.

Remark 3. Our paper considers the orthant filtration as the product of filtrations defined on different spaces. This is slightly different from Mandelbaum [57] (and thus El Karoui and Karatzas [32]), where the orthant filtration is considered as the join of filtrations defined on the same space. This technical difference will allow us to more easily define a 'nonlinear expectation' which still carries some form of independence and 'time-consistency'. (See discussion in Section 3.2.)

Remark 4. In El Karoui and Karatzas [32], it is assumed that the cost process is predictable, instead of adapted, with respect to the filtration of the bandit. When using a classical expectation, there is no modelling difference between predictable and adapted cost processes (as one can just take the conditional expectation to reduce adapted costs to predictable costs). However, under a 'nonlinear expectation', this is not the case, so we give the more general result with adapted costs.

Robust Gittins Index
Under a Markovian assumption, Caro and Gupta [20] consider a 'robust' Gittins index based on the robust Bellman equation studied in Iyengar [46] and Nilim and El Ghaoui [58]. (Similar work is considered by Kim and Lim [51] with an additional penalty in the formulation.) The following assumptions are used in Caro and Gupta [20] (translated into our notation): (i) The cost process is driven by an underlying finite-state process (X^(m)_t)_{t≥0}, that is, h^(m)(t) = h̃^(m)(X^(m)_t) for some deterministic function h̃^(m).
(ii) Ambiguity is described by families of transition matrices, (U (m) ) m∈M for the dynamics of X (m) , which may vary in time.
The construction is then based on Whittle indexability [77]. In particular, they reduce the problem to considering two bandits, where one bandit always generates a constant cost γ and the other bandit is identical to the mth bandit. The worst-case expected cost obtained when starting in state i in the mth bandit, V^(m)(i), allowing any combination of transition rates, will then satisfy the robust dynamic programming principle. Let D^(m)(γ) be the set of states for which it is optimal to rest the mth bandit when the reward of the constant bandit is γ. Caro and Gupta show that the robust bandit is Whittle indexable in the sense that D^(m)(γ) increases monotonically from ∅ to the whole state space as γ increases from −∞ to +∞. The index of the mth bandit at state i is the unique value γ such that the player is indifferent between playing the mth bandit and the constant bandit.
This index, denoted (2.4), can be characterized as a worst-case analogue of the classical index, computed under the family of measures Q^(m) corresponding to the family of transition matrices U^(m). Unfortunately, as discussed in Caro and Gupta [20], the robust Gittins index (2.4) does not yield a strategy optimizing the robust Bellman equation. In short, this non-optimality arises due to the fact that the robust Bellman equation introduces dependency between bandits. In particular, at equilibrium, the adversary (who determines the transition probabilities for each bandit) may choose differently depending on the state of all bandits, rather than just the bandit of interest.
The index (2.4) can also be interpreted as a Lagrangian relaxation of the optimal control problem (see also Gocgun and Ghate [43]). The natural question that arises is, 'Does this relaxation satisfy some adjusted notion of optimality?' In this paper, we propose a new form of optimality in terms of compensators of the value function. This can be seen as a relaxation of the dynamic programming principle through the martingale optimality principle, in order to address a control problem under an inconsistent nonlinear operator. We will show that the strategy given by the robust Gittins index satisfies this optimality criterion. We also allow the cost to be continuous valued and non-Markovian, as in El Karoui and Karatzas [32]. This allows the study of various numerical methods to estimate our probabilistic state in the learning problem, whereas the numerical method in Caro and Gupta [20] is limited to finite-state Markov processes. A simple numerical example then allows us to observe some qualitative peculiarities arising from the interaction between uncertainty aversion and learning.
Remark 5. In a non-Markovian framework, Whittle indexability is not well-defined. Hence, a careful interpretation of optimality is required to understand a solution to the multi-armed bandit problem under uncertainty aversion.
Remark 6. Li [56] considers a Bayesian formulation for the index, but allowing for multiple priors. The focus is on describing how the uncertainty set affects the index, but without proving any form of Whittle indexability. Our models also verify and generalize these results.

Uncertainty, Nonlinear Expectation and Optimality
In this section, we will outline how 'nonlinear expectation' operators can be used to model Knightian uncertainty. We will also discuss how we can use these tools to study a control problem, under uncertainty, while retaining some form of time consistency. We will build on the modelling framework of El Karoui and Karatzas [32] as proposed in Assumption 2.1. We will first outline our setup and the additional assumptions we use in our study of the robust bandit problem. We will use a 'nonlinear expectation' E^(m) (Assumption 3.6) to model uncertainty on the space (Ω^(m), P^(m), (F^(m)_t)_{t≥0}) of a single bandit, and then extend our uncertainty to the orthant joint space (Definition 2.3) via the combined nonlinear expectation E (Definition 3.7). We will omit the superscript (m) when it is clear from context.
In order to avoid technical difficulties, we will make the following assumption on the cost processes.
Assumption 3.1 is purely technical. We may replace boundedness of h^(m) by an integrability assumption on the total discounted cost (as in [32]); we would then need to generalize the domain of the nonlinear expectation. We can also remove the assumption on the convergence of h^(m) to its bound, but we then need to take more care to ensure that the stopping times considered in (2.4) and elsewhere can be assumed to be a.s. finite. Given the discount factor, this assumption does not have a large impact on our modelling.

Nonlinear Expectations and Time Consistency
We now focus on the filtered probability space (Ω, P, (F_t)_{t≥0}) modelling the returns from playing a single bandit. As in Peng [61], we define a nonlinear expectation as follows.

Definition 3.2. A system of operators E(·|F_t) : L^∞(F_∞) → L^∞(F_t), for t ∈ T := {0, 1, 2, ...}, is said to be an (F_t)_{t≥0}-consistent coherent nonlinear expectation if it satisfies the following properties: for X_n, X, Y ∈ L^∞(F_∞) and c ∈ L^∞(F_t), with all (in)equalities holding P-a.s., we have
(i) Strict monotonicity: if X ≥ Y then E(X|F_t) ≥ E(Y|F_t), with equality only if X = Y;
(ii) Translation equivariance: E(X + c|F_t) = E(X|F_t) + c;
(iii) Subadditivity: E(X + Y|F_t) ≤ E(X|F_t) + E(Y|F_t);
(iv) Positive homogeneity: E(cX|F_t) = cE(X|F_t) for every non-negative c ∈ L^∞(F_t);
(v) Lebesgue property: if {X_n}_{n∈N} is uniformly P-a.s. bounded and X_n → X P-a.s., then E(X_n|F_t) → E(X|F_t) P-a.s.;
together with the consistency (tower) property E(E(X|F_s)|F_t) = E(X|F_t) for all t ≤ s.
We write E( · ) for E · F 0 .
Remark 7. For simplicity, we assume the Lebesgue property throughout this paper. In the static case, upper semi-continuity can be shown to be equivalent to the Lebesgue property over L ∞ (see [36,Corollary 4.38]). Moreover, if the operator E is induced by a BSDE (as in [27,28,61,34] and many other papers), then the Lebesgue property typically follows from the L 2 -continuous dependence of the BSDE on its terminal value.
Remark 8. It is also known (see e.g. Detlefsen and Scandolo [30]) that any coherent nonlinear expectation satisfies the (F_t)-regularity property. That is, for any X, Y ∈ L^∞(F_∞) and A ∈ F_t,
E(I_A X + I_{A^c} Y | F_t) = I_A E(X|F_t) + I_{A^c} E(Y|F_t).

In order to study decision making, we often require a conditional expectation defined at a stopping time. As we are working in discrete time, this is an easy construction.

Definition 3.3. Given a consistent coherent nonlinear expectation E and a stopping time τ ≤ T, we define the conditional expectation at τ by
E(X|F_τ) := Σ_{t∈T} I_{{τ=t}} E(X|F_t).

With this definition, the following easy observations can be made.

Nonlinear expectations are well suited to the study of Knightian uncertainty, that is, uncertainty over the probability measure. This is most easily seen through the robust representation theorem (over a finite horizon) given by Artzner et al. [5], see also Föllmer and Schied [36] and Frittelli and Rosazza-Gianin [37]. Extensions to a dynamic setting are also considered by Detlefsen and Scandolo [30], Föllmer and Schied [36] and Riedel [67]. We state a version of this result which is dynamic over stopping times.
Theorem 3.5. Let E be a consistent coherent nonlinear expectation. If there exists T < ∞ such that F_∞ = F_T, then E admits the representation
E(X|F_τ) = ess sup_{Q∈Q} E_Q[X | F_τ],
where τ is a stopping time, Q ⊆ {Q : Q ≈ P}, and the essential supremum is taken in L^∞(F_τ, P).

Proof. See Föllmer and Schied [36, Theorem 11.22] (with further discussion in Föllmer and Penner [35]). The sensitivity assumption made in these references (i.e. for every nonnegative nonconstant X ∈ L^∞(F_T), there exists λ > 0 such that E(λX) > 0) follows from strict monotonicity in our definition.
Remark 9. Theorem 3.5 can also be obtained by construction, by considering the stability of pasting in the family Q. (See e.g. Bion-Nadal [15] and Artzner et al. [6].)
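To make the robust representation concrete, the following sketch (our own illustration; the two-point sample space, the two measures and all names are assumptions, not from the paper) evaluates a coherent nonlinear expectation as a maximum over a small family of measures and checks subadditivity and positive homogeneity numerically.

```python
# A two-point sample space and a family of two probability measures on it.
omega = [0, 1]
Q = [
    {0: 0.5, 1: 0.5},
    {0: 0.2, 1: 0.8},
]

def E(X):
    """Coherent nonlinear expectation: worst case (sup) over the family Q."""
    return max(sum(q[w] * X(w) for w in omega) for q in Q)

X = lambda w: 1.0 if w == 1 else 0.0   # cost of outcome 1
Y = lambda w: 1.0 - 2.0 * w            # a second bounded cost

# Subadditivity: E(X + Y) <= E(X) + E(Y); positive homogeneity: E(cX) = c E(X).
assert E(lambda w: X(w) + Y(w)) <= E(X) + E(Y) + 1e-12
for c in (0.0, 0.5, 2.0):
    assert abs(E(lambda w, c=c: c * X(w)) - c * E(X)) < 1e-12
print("E(X) =", E(X), " E(Y) =", E(Y))   # 0.8 and 0.0
```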

Uncertainty on multiple bandits
In the classical Gittins theorem, independence is crucial to separate the behaviour of different bandits. In the robust representation (Theorem 3.5) we have seen that a nonlinear expectation can be viewed as the supremum of classical expectations over a family of probability measures. Therefore, the notion of independence between bandits becomes ambiguous, as statistical independence is based on the probability measure. Thanks to our explicit construction of the space (Definition 2.3), we can explicitly construct a nonlinear expectation space where each bandit remains independent.
Remark 10. In [63], Peng proposed a definition of independence for a nonlinear expectation. In his approach, independence is not a symmetric relation, but typically describes independence based on the order of events: often 'Y is independent of X' when Y occurs after X. In the setting of multiple bandits, the order of events cannot be pre-identified, as it depends on the control chosen. Hence, it is not clear how to exploit the independence notion of [63] in this setting.
Let us make the last universal assumption of our paper, which describes model uncertainty for each individual bandit in our problem, inspired by the robust representation (Theorem 3.5).
Definition 3.7. We define the partially consistent orthant nonlinear expectation (E_S)_{S∈T(S)} to be the family of operators obtained from the robust representation by taking, at each S ∈ T(S), the worst case over product measures built from the families Q^(m) of Assumption 3.6. We also write E for E_0.
Proposition 3.8. The system of operators (E S ) satisfies the following properties.
(i) The properties (i)-(v) in Definition 3.2 (with appropriate replacements of the operator and σ-algebra) hold for the operators E_S, S ∈ T(S) (i.e. strict monotonicity, translation equivariance, subadditivity, positive homogeneity and the Lebesgue property hold for E).
(ii) Sub-consistency: For S, S' ∈ T(S) with S ≤ S', we have E_S(X) ≤ E_S(E_{S'}(X)).
(iii) In particular, for any measurable X, if E_S(X) ≤ 0 P-a.s., then the corresponding scaled inequality also holds, where, for each m ∈ M, the scaling factor is a non-negative random variable, P-a.s.
(iv) Marginal projection: For a given m ∈ M, let X be a random variable defined on the marginal space of the mth bandit. Then E_S(X) agrees with the marginal expectation E^(m) evaluated at the mth component of S.

Proof. See Proposition C.4 in the appendix.
Remark 12. We deliberately choose our nonlinear expectation E to be defined on a product space to simplify our discussion of the existence of the operator. In fact, one can weaken our assumption by taking a nonlinear expectation E on a joint filtration (as in El Karoui and Karatzas [32]) such that Proposition 3.8 holds. All proofs are identical except the proof of Theorem B.3 (in the appendix). We just need an extra step to show that the product of the marginal probability measures is also a probability measure considered under the robust representation of E.
In the proposition above, we have seen that E is sub-consistent on the orthant filtration. However, E is not consistent in the sense of Definition 3.2, i.e. if S ≤ S' (componentwise), it is not necessarily the case that E_S( · ) = E_S(E_{S'}( · )). A counterexample can be easily constructed based on the following:

Example 3.9. Let X and X̃ be random variables taking values in {0, 1} and defined on different spaces Ω and Ω̃. Let Q and Q̃ be families of probability measures defined on these spaces. Suppose that for all p ∈ [0, 1] there exists Q ∈ Q such that Q(X = 0) = p, and that for all Q̃ ∈ Q̃, Q̃(X̃ = 0) = 1/2. Let f : R^2 → R be a given function. Then it is easy to show that
sup_{Q∈Q, Q̃∈Q̃} E_{Q⊗Q̃}[f(X, X̃)] = max_{x∈{0,1}} (f(x,0) + f(x,1))/2,
whereas evaluating iteratively, conditioning on X̃ first, gives
sup_{Q̃∈Q̃} E_{Q̃}[ sup_{Q∈Q} E_Q[f(X, X̃) | X̃] ] = (max_x f(x,0) + max_x f(x,1))/2.

By considering F^(1)_1 = σ(X) and F^(2)_1 = σ(X̃), and defining nonlinear expectations using the supremum over the families Q and Q̃, the above result shows that the joint operator E is not consistent. In particular, we can find a function f such that the two quantities above differ, for example f(x, x̃) = I_{{x = x̃}}.
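The failure of consistency in Example 3.9 can also be checked numerically. The sketch below (our own illustration; the grid over p and the specific choice of f are assumptions) compares the worst case over product measures with the iterated evaluation that conditions on X̃ first.

```python
# f(x, xtilde) = indicator that the two outcomes agree.
f = lambda x, xt: float(x == xt)

# Family Q: X = 0 with any probability p in [0, 1] (approximated on a grid).
# Family Qtilde: Xtilde = 0 with probability exactly 1/2.
p_grid = [i / 100.0 for i in range(101)]

def joint(p):
    """E under the product measure Q_p x Qtilde of f(X, Xtilde)."""
    return sum(px * 0.5 * f(x, xt)
               for x, px in ((0, p), (1, 1 - p))
               for xt in (0, 1))

# Joint (product-measure) worst case: sup over p of the product expectation.
E_joint = max(joint(p) for p in p_grid)

# Iterated evaluation: condition on Xtilde first, take the worst case over Q
# inside, then average over the (fixed) law of Xtilde.
inner = {xt: max(f(0, xt), f(1, xt)) for xt in (0, 1)}
E_iter = 0.5 * (inner[0] + inner[1])

print(E_joint)  # 0.5 for this f
print(E_iter)   # 1.0, strictly larger: E is not consistent
```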

Optimality
We have discussed in the previous section that the robust Gittins index (2.4) in the sense of Caro and Gupta [20] is not optimal, as it does not lead to a solution of the robust Bellman equation (discussed in [46,58]). In order to understand what sense of optimality the robust index strategy does satisfy, we first consider a form of optimality criterion used by El Karoui and Karatzas [32]. Let us consider an abstract stochastic control problem on a space (Ω, F, P) in which a choice of control ρ results in an instantaneous cost process (g^ρ(n))_{n≥1}. We may view g^ρ(n) as a cost incurred at time n. For example, we have g^ρ(n) := β^n h^(ρ_{n−1})(t^ρ_n) in (2.2). We can also define the filtration (G^ρ_n)_{n≥0} of information obtained up to time n when following ρ, in terms of the orthant filtration evaluated along the corresponding Mandelbaum allocation sequence η (Definition 2.4, Remark 2). We will discuss this filtration in detail in Remarks 22 and 23.

Remark 13. It is clear from the definition that the strategy process (ρ_n)_{n≥0} is (G^ρ_n)_{n≥0}-adapted. We will show later that the cost process g^ρ(n) := β^n h^(ρ_{n−1})(t^ρ_n) (as in (2.2)) is also adapted with respect to (G^ρ_n).

Suppose that we are given a nonlinear expectation operator E, as in Definition 3.7, and consider a minimization problem over the space of Mandelbaum allocation strategies, as represented by their equivalent form ρ (Remark 2). The process ρ not only describes our strategy and the corresponding cost, but also determines the observed filtration. Therefore, at any point in time, it does not make sense to compare strategies unless those strategies yield the same information at the considered time.
Definition 3.10. We say strategies ρ and ρ' are historically equivalent at time N, denoted by ρ ∼_N ρ', if ρ_n = ρ'_n for all n ≤ N.
We can now give a standard form of optimality which is often considered when we have a consistent nonlinear expectation operator.
Definition 3.11. We say a strategy ρ* is a strong optimum if, for every strategy ρ such that ρ ∼_N ρ*, the expected remaining cost under ρ* is no greater than that under ρ, conditionally on G^{ρ*}_N.

Remark 15. When E is replaced by an (F_n)-consistent nonlinear expectation and (G^ρ_N) is replaced by (F_N), strong optimality simplifies to the usual comparison of conditional expected costs given F_N.

A standard approach to decision making under a time-inconsistent (nonlinear expectation) operator is to define 'the optimal strategy' through the solution of the robust Bellman equation [46,58], as considered in Caro and Gupta [20]. Using the tower property, we can show that the strong optimum under an (F_n)-consistent nonlinear expectation is equivalent to the solution to the robust Bellman equation.

C-Optimality
In the bandit setting, our nonlinear expectation is not necessarily (time-)consistent. In order to understand the Gittins index strategy under an inconsistent operator, we propose an alternative notion of optimality, which is inspired by martingale optimality.
For motivation, consider an (F_n)-consistent nonlinear expectation E. Suppose that we wish to solve the minimization problem inf_ρ E( Σ_{n≥1} g^ρ(n) ). For a given strategy ρ, we define a process (X^ρ_N) as the sum of the costs incurred up to time N and the value of the remaining minimization problem given F_N. Under mild conditions, we know from the martingale optimality principle that (X^ρ_N) is an E-submartingale for every strategy ρ and is a martingale for an optimal strategy ρ*.
By using the Doob–Meyer decomposition for nonlinear expectations (see e.g. [26, Theorem 8]), we can write X^ρ_N = M^ρ_N + Σ_{n=1}^{N} C^ρ(n), where (M^ρ_N) is an E-martingale and (C^ρ(n)) is a non-negative predictable process with C^ρ(n) ≡ 0 for the optimal strategy ρ*.
By rearranging the equation above, we obtain, for every ρ, a bound on the expected total cost in terms of the compensators (C^ρ(n)); moreover, for an optimal strategy ρ*, this bound is attained. Inspired by the analysis above, we propose an alternative notion of optimality in an inconsistent setting.
Definition 3.12. We say a strategy ρ* is C-optimal if there exists a (G^{ρ*}_n)-adapted process (V_n) (called a value process) and a collection of random variables (C^ρ_N(n)), the (sub-)compensators, satisfying four conditions (i)-(iv), with equality for ρ = ρ*.

We can see that C^ρ_N(n) acts as a '(sub-)compensator' to the cost, and V_N acts as the value function. This approach is loosely related to the capital requirement approach discussed by Frittelli and Scandolo [38]. We can interpret Definition 3.12 as requiring that the (sub-)compensators (C^ρ_N(n)):
(i) are known one step in advance, before observing the cost (condition (i));
(ii) consistently (sub-)compensate the cost; in particular, as time elapses, we obtain more information and thus require the same amount, or possibly less, to (sub-)compensate (condition (ii));
(iii) offset the extra cost incurred by a sub-optimal strategy (condition (iii));
(iv) are bounded below by the compensator of a particular strategy ρ*, which we call 'optimal' (condition (iv)).
Remark 16. We have mentioned the robust Bellman equation [46,58] as an approach to force time-consistency in our decision making. The fundamental idea of this approach is to freeze our value function and propagate its value backward in time. In particular, suppose we have V * n+1 as our expected remaining cost at time n. We then define an optimal strategy at time n to be a strategy ρ n such that g ρ (n) + V * n+1 is optimized. A closely related approach to ensure time-consistency was proposed by Strotz [73] and Pollak [64] and developed further in Peleg and Yaari [60] and Koopmans [53]. Recent extensions include Björk and Murgoci [17], Björk, Khapko and Murgoci [16], Yong [79] and Hu, Jin and Zhou [44]. For a problem with horizon L, suppose that the optimal control is determined after time n, in other words, (ρ * n+1 , ..., ρ * L ) is known. We then find a control ρ * n at time n to optimize over the space of possible strategies ρ : (ρ n+1 , ..., ρ L ) = (ρ * n+1 , ..., ρ * L ) . In this way, the (optimal) control rather than the value function, is constructed recursively. This idea is then extended by searching for (sub-game perfect) Nash equilibria, to allow for non-uniqueness of the optimal controls.
As discussed in Section 2.1.2, the robust Bellman approach may introduce some dependency between bandits in our system. Hence, the Gittins index strategy is not optimal under that approach. On the other hand, when considering a system of bandits, the measurability of our future states is determined by our current action. Therefore, the σ-algebra that is used to define the future control (ρ*_k)_{k≥n+1} cannot be chosen independently of our current control. This means that we cannot directly consider the Strotz-Pollak approach for the bandit setting, as we cannot freeze our future control without freezing our current control.
The notion of C-optimality can be loosely interpreted as a third variation on these time-consistency approaches. In particular, we can interpret the compensator process as propagating a value backward in time, as in the robust Bellman approach. Optimality can then be defined forward in time, which relaxes the dependence on the filtration.

Endowment Effect
One natural question to ask is whether we can give an interpretation of C-optimality (Definition 3.12) in terms of classical strong optimality (Definition 3.11). To see this, we will consider an endowment effect through strong optimality.
Example 3.13. Let H and G be random variables representing the costs of two strategies, and let Q be a family of probability measures such that E(H) < E(G) but E(H + (H+G)/2) > E(G + (H+G)/2). From these inequalities, we see that, without any endowment, we strictly prefer H to G, whereas our preference reverses with the endowment (H+G)/2. We know that in classical linear expectation theory (where the classical Gittins theorem holds), an endowment does not affect our preference between strategies.
In this section, we will show that C-optimality is nearly equivalent to strong optimality 'up to an endowment' when our nonlinear expectation is time-consistent.
The following proposition follows from the definition of C-optimality and monotonicity of nonlinear expectation (in particular, Definition 3.11(iii)-(iv)).
Proposition 3.14. Let ρ* be a C-optimal strategy with a predictable compensator (C^{ρ*}_N(n)). Then for every ρ ∼_N ρ* and A ∈ G^{ρ*}_N, inequality (3.5) holds.

Let us consider the case where G^ρ_N = F_N for every strategy ρ, and suppose that E is an (F_n)-consistent nonlinear expectation operator. Then (3.5) says that C-optimality implies strong optimality when our agent is given the predictable endowment −Σ_{n=N+1}^{∞} C^{ρ*}_N(n) at time N. We will now show that a converse result also holds when our operator is consistent.
Definition 3.15. Let E be an (F_n)-consistent nonlinear expectation. We say a strategy ρ* is optimal up to a predictable endowment if there exists a family of random variables (D_N(n)) such that the strong optimality comparison holds once the endowment determined by (D_N(n)) is included.

Suppose that ρ* is an optimal strategy up to a predictable endowment; then ρ* is C-optimal.
In the coming section, we will show that Gittins' theorem holds in the sense of guaranteeing C-optimality under an operator E. This means that we prove that the Gittins index strategy is a (strong) optimum up to some predictable endowment.
It is an open question under which conditions the C-optimum is unique. In the most trivial case when our operator E is simply a classical expectation, the endowment never affects our evaluation; thus it is reduced to the uniqueness of the value function in the classical setting.

Overview of Bandits under uncertainty
Let us recall that the objective of our problem is to dynamically allocate a single resource amongst M bandits to minimize the total discounted cost. We have made a few assumptions to model uncertainty in the cost process, which can be found in Assumptions 2.1, 3.1 and 3.6.
We also introduced the Mandelbaum allocation strategy (Definition 2.4) and the equivalent notion ρ (Remark 2) representing the choice of our control. We are now ready to establish a robust Gittins theorem with optimality in the sense of Definition 3.12. Our robust Gittins theorem generalizes the result of El Karoui and Karatzas [32] to the uncertain case. One may also see this result as providing a sense of optimality for the index strategy considered by Caro and Gupta [20] and Li [56].

Robust Gittins theorem
We will first give an alternative definition to the robust Gittins' index inspired by Weber [75], which is more convenient to use in our analysis.
Definition 4.1. The robust Gittins index at time s ≥ 0 of the mth bandit is given by
γ^(m)(s) := ess inf{ λ ∈ L^∞(F^(m)_s) : ess inf_{τ∈T^(m)(s)} E^(m)( Σ_{t=1}^{τ} β^t (h^(m)(s+t) − λ) | F^(m)_s ) ≤ 0 },
where T^(m)(s) denotes the family of positive (F^(m)_{s+t})_{t≥0}-stopping times and the outer essential infimum is taken in L^∞(F^(m)_s).

By using the results proved in the later sections, we can write the robust Gittins index explicitly. We present this result here for clarity, but make no use of it in subsequent arguments.

Proposition 4.2. The robust Gittins index admits an explicit representation in terms of the family Q of probability measures defined in Theorem 3.5.
Proof. See Theorem C.3 in the appendix.
Recall that E is the partially consistent orthant nonlinear expectation induced by the family E (m) m∈M as given in Definition 3.7. We can obtain an optimal allocation strategy by considering the following theorem.
Theorem 4.3. The simple form allocation strategy ρ* given by ρ*_n := min{ arg min_k γ^(k)(ψ^(k)_n) }, where ψ^(k)_n denotes the number of times the kth bandit has been played before the nth play, is C-optimal (Definition 3.12) under E for the cost (2.2).

Remark 18. We choose ρ* to be the minimum value in the (random) set of minimum Gittins index machines {arg min_k γ^(k)(ψ^(k)_n)} as a simple method of symmetry breaking, in order to avoid complexities due to measurable selection. In fact, any choice of ρ*_n ∈ {arg min_k γ^(k)(ψ^(k)_n)} also yields C-optimality.

Remark 19. The robust Gittins theorem states that an optimal choice is given by always playing a bandit with the lowest robust Gittins index. At each time, the indices of unplayed bandits do not change. This leads to a form of consistency in the values associated with different bandits, even though E is not consistent.

Sketch of the Proof
We will separate the proof into two parts: in Part A, we analyze a one-armed bandit in a robust setting; in Part B, we combine M bandits together. The main body of the rigorous proof can be found in Appendices A and B (respectively), which are written as self-contained sections. We summarize the structure and approach of the proof here.

One-armed bandit optimality
We begin by considering play of the mth machine (with the superscript (m) omitted).
Step A.1 Observe that the robust Gittins index is the minimum compensation for which we are willing to continue to play the bandit (with compensation).
By minimality, the net expected cost under optimal play must be zero (Theorem A.2); that is, the expected discounted cost, net of the compensation γ(s), vanishes under the optimal stopping rule. In particular, for any subsequent stopping time τ ∈ T(s), the corresponding expected net cost is non-negative (inequality (4.2)).

Step A.2 We view the process γ as the 'average' cost of playing the bandit. Once the process (γ(t))_{t≥s} exceeds γ(s), the reward γ(s) will no longer be sufficient to encourage continued play; so it will be optimal to stop. In particular, the stopping time
σ(s, γ(s)) := inf{θ ≥ 1 : γ(s + θ) > γ(s)}   (4.3)
yields equality in (4.2) (Theorem A.8).
Step A.3 Imagine that, whenever the bandit (with compensating reward) is no longer attractive to play, we were to increase the compensation sufficiently to make ourselves indifferent to continuing. The expected value of future loss, with this increased compensation, must again be zero (Proposition A.11). The offered compensation can be written as a running maximum (Γ(t)) of the robust Gittins index process, and the expected return of playing indefinitely with this compensation is zero (equation (4.4)). With the compensating reward (Γ(t)), we are always willing to continue to play. In particular, at any point in time, we have a non-positive expected future cost (Theorem A.13), as expressed in (4.5).

Step A.4 Now suppose we were to take a break from playing for some period, and then resume our earlier strategy. In this case, we may lose some expected profit (equation (4.5)) due to the discount effect of the delay. By (4.4), the total reward of this game is zero. Therefore, the delay in receiving the reward must result in a possibly worse outcome. In Theorem A.14, we use this observation, together with the robust representation (Assumption 3.6), to show that for any fixed ε > 0 there is a probability measure Q ∈ Q such that, for every decreasing predictable process (α(t)) taking values in [0, 1], the Q-expected discounted net cost, with the delay encoded by α, is at least −ε (inequality (4.6)).

Remark 20.
Step A.4 is the key point in which positive homogeneity of E is used. A predictable process (α(t)) represents the delay due to taking a break to play another bandit. In step A.3, we choose the compensator such that the total expected return is zero but the bandit is always attractive to be played.
(i.e. we always have a reward for the future.) We therefore cannot expect a better outcome than zero if we delay our play. Mathematically, one can replace positive homogeneity and subadditivity by convexity and the property that: if E(X|F t ) ≤ 0, then for all F t -measurable random variables α taking values in [0, 1], we have E(αX|F t ) ≥ E(X|F t ).

Information structures for Multi-armed bandits
We now consider combining play over multiple machines.
To retain consistency for a single bandit, the nonlinear expectation needs to be defined together with the filtration. It follows that we need to define an 'independent' nonlinear expectation on the joint space of the bandits, which we do via an orthogonal product space. This restriction does not allow us to directly implement Mandelbaum's [57] original approach for a dynamic allocation strategy (Definition 2.4). This is because the multi-parameter process (η(n)) is only defined to be measurable with respect to the orthant filtration. In particular, it is not clear how one could directly extract the component of (η(n)) to the marginal space Ω (m) where our single-bandit nonlinear expectation is defined.
The importance of decomposing a strategy on the multi-armed bandit to strategies for one-armed bandits can be seen in the proof of El Karoui and Karatzas [32,Equation 5.1] (via Whittle's approach [76]), and is described more explicitly in their continuous time paper [33,Equation 6.9].
In order to overcome this difficulty, we introduce a class of allocation strategies where there is a component associated to the stopping times of the marginal filtrations. This component allows us to connect and separate the space of multiple bandits to the marginal space of each single bandit.
Our class of allocation strategies consists of two components (τ, p). The collection of random times τ = (τ^(m)_k)_{k≥0, m∈{1,...,M}} will identify the duration for which we will play the mth bandit, the kth time we start to play it. This sequence is chosen based on historical observations of the mth bandit only; that is, the random times τ^(m)_k are stopping times with respect to the filtration of the mth bandit alone. Intuitively, the random sequence (p_n) is allowed to depend on all prior observations from all bandits. For the sake of precise bookkeeping, we need to record, at each moment, how many times we have already played each bandit. This leads to the following definition.

In this case, we define (X_t)_{t≥1} to be the outcome of the first bandit and define θ_{k+1} := inf{t ≥ 1 : X_{t+Σ_{i=0}^{k} θ_i} = l}. We then have the representation of this strategy τ^(1) = (θ_0, θ_1, ...), τ^(2) = (2, 2, 2, ...), and p = (1, 2, 1, 2, ...).
The corresponding recording sequence can then be written down explicitly. Extending this example, we can generally write our strategy (τ, p) in terms of ρ and vice versa. This unique representation provides a simple (if inefficient) description of our strategy, which we now make precise.
Definition 4.8. Define the random variable ρ n to be the bandit which will be observed in the nth play under an admissible allocation strategy (τ, p). We call the process (ρ n ) n≥0 , a simple form allocation sequence. The construction of the sequence (ρ n ) is given explicitly in Lemma C.8 in the appendix.
Furthermore, one can check that the recording sequence corresponding to (1, ρ) is exactly a Mandelbaum allocation strategy (Definition 2.4). In particular, we can explicitly construct a one-to-one correspondence between our equivalence classes of admissible strategies (Definition 4.6) and Mandelbaum allocation strategies, and the associated filtrations agree.

If we parameterize our actions by a simple form strategy (1, ρ) with associated recording sequence η, then ρ_{n−1} is the decision made at time n − 1 to generate the outcome observed at time n. The observation at the nth play is the outcome of the bandit just played, which we denote ξ^ρ_n. We define the observed filtration by H^ρ_0 = {∅, Ω} and H^ρ_n := σ(ξ^ρ_1, ..., ξ^ρ_n). We prove, in the appendix, that the observed filtration agrees with that used in Definition 4.5 when considering measurability of ρ; that is, measurability with respect to (H^ρ_n) coincides with measurability with respect to the orthant filtration evaluated along η, where η is the recording sequence corresponding to (1, ρ).
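To illustrate the bookkeeping behind admissible strategies, the following sketch (our own illustration, with deterministic block durations in place of the random times of the example above, and bandits indexed from 0) expands a pair (τ, p) into its simple form allocation sequence ρ and the recording sequence counting plays of each bandit.

```python
def simple_form(tau, p, total_plays):
    """Expand an allocation strategy (tau, p) into its simple form.

    tau[m] -- list of durations: tau[m][k] is how long we play bandit m
              on the (k+1)th occasion we switch to it (here deterministic).
    p[j]   -- which bandit is chosen at the jth switching time.
    Returns (rho, eta): rho[n] is the bandit played at the nth play and
    eta[n] counts how many times each bandit has been played after n plays.
    """
    M = len(tau)
    visits = [0] * M              # how many blocks of each bandit used so far
    rho, eta = [], [[0] * M]
    for m in p:
        block = tau[m][visits[m]]
        visits[m] += 1
        for _ in range(block):
            rho.append(m)
            state = eta[-1][:]
            state[m] += 1
            eta.append(state)
            if len(rho) >= total_plays:
                return rho, eta
    return rho, eta

# Alternate between bandit 0 (blocks of length 3) and bandit 1 (blocks of 2).
tau = [[3, 3, 3], [2, 2, 2]]
p = [0, 1, 0, 1, 0, 1]
rho, eta = simple_form(tau, p, total_plays=10)
print(rho)       # [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
print(eta[-1])   # [6, 4]
```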

Multi-armed bandit optimality
We can now give the second half of the proof of the robust Gittins index theorem, where we will consider 0 as our reference value function.
In order to prove the optimality of the robust Gittins strategy, we define the target function for an allocation strategy as the expected total discounted cost net of the compensations, where ρ is the simple form derived from (τ, p) and Γ^(m)(t) is the running maximum of the robust Gittins index of the mth bandit, as considered in (4.4).
Step B.1 Suppose that we have M bandits, with associated indifference rewards (Γ (m) ) m∈M as in step A.3. If we mix the play of these bandits, this is equivalent to taking a break in a single bandit to play the others. This delay will result in a possibly worse outcome (Equation (4.6) in step A.4).
In Theorem B.3, we use the definition of E and apply Fubini's theorem to show that, for all allocation strategies (τ, p), this implies that for any ε > 0, there exists a probability measure under which the expected total discounted cost, net of the compensations, is at least −ε, where the factor applied to the mth bandit is the delay effect due to playing other bandits. As ε is arbitrary, it follows that the target function is non-negative for all allocation strategies (τ, p) (inequality (4.10)).

Step B.2 In step A.2, we noticed that the total expected loss of a single bandit between S and S' is zero, for S and S' the consecutive stopping times at which the robust Gittins index hits a new maximum (equation (4.3)). We use this fact to construct a family of time allocation sequences as a candidate optimal strategy.
Using our construction of the class of allocation strategies, we can project the joint nonlinear valuation onto its marginal space, which is equipped with a consistent nonlinear expectation. We can then use the result from step A.2, that the hitting times σ^(m) attain the value zero for each single bandit. We consider C^ρ(n) = β^n Γ^(ρ_{n−1})(t^ρ_n) as a (sub-)compensator in Definition 3.12. The strategy ρ* given in Theorem 4.3 is the strategy of always playing the bandit with the minimal index. Therefore, it lies in the same equivalence class as a strategy with the time allocation sequences (σ^(m)) (and with p indicating the minimum index amongst all bandits at each time). Hence, by (4.12), the target function of ρ* vanishes; furthermore, by (4.10), no allocation strategy can do better. By monotonicity of the process Γ, we prefer lower values earlier, due to the discount effect. Thus, we prove the optimality condition when N = 0. We can then restart our analysis at the considered (orthant) time to obtain the optimality condition for N > 0. This shows that ρ* satisfies the conditions for C-optimality. The formal proof of this result can be found in Theorem B.5.

Numerical Results
In this section, we study the behaviour of the robust Gittins index using a numerical example. Again we omit the superscript (m) for notational simplicity.
We suppose the bandit under consideration generates independent identically distributed costs (h(t))_{1≤t≤T} of either $1 or $0, with (unknown) probability P(h(t) = 1) = θ, and h(t) = 2 for all t > T. The filtration (F_t)_{t≥0} is generated by the observed cost process (ξ_t)_{1≤t≤T} = (h(t))_{1≤t≤T} (with F_0 trivial). The horizon T can be thought of as the maximum number of times that each bandit can be played.
Remark 24. An imaginary horizon T is introduced in order to allow us to easily construct a data-driven recursive nonlinear expectation (5.1) by backward induction. The future cost h(t) = 2 is introduced to simplify our numerical method. By considering (4.1), we can see that the robust Gittins index (γ(t))_{t≥1} takes values between 0 and 1 when t ≤ T and γ(t) = 2 for t > T. Moreover, the optimal stopping time satisfies σ(t, γ(t)) ≤ T − t. Hence, one can calculate the robust index γ(t) by considering a finite-horizon optimal stopping problem.
We model uncertainty in this setting by constructing a one-step coherent nonlinear expectation E^(t)(·) : L^∞(F_{t+1}) → L^∞(F_t). Once we have a one-step coherent nonlinear expectation, we can construct an (F_t)-consistent coherent nonlinear expectation by composing the one-step operators backward in time,
E(X|F_t) := E^(t)(E^(t+1)(⋯ E^(T−1)(X) ⋯)).   (5.1)

Remark 25. We will consider a one-step coherent nonlinear expectation inspired by the DR-expectation ([23], see also Bielecki, Chen and Cialenco [14]):
E^(t)(X) := sup_{θ∈Θ_t} E_θ[X | F_t],
where Θ_t = [p^−(p_t, n_t), p^+(p_t, n_t)] corresponds to a credible interval for θ given our observations at time t, using a (possibly improper) Beta prior distribution. The processes n_t and p_t correspond to the number of observations and the (posterior mean) estimate of θ at time t.
In particular, we may choose a credible level k ∈ [0, 1] and obtain p^±(p_t, n_t) by p^±(p_t, n_t) = I^{-1}_{(p_t n_t, (1−p_t) n_t)}(0.5 ± k/2), where q ↦ I^{-1}_{(a,b)}(q) is the quantile function of the Beta(a, b) distribution. One could also use the central limit theorem to obtain an asymptotic confidence interval. However, due to the fact that Θ_t ⊆ [0, 1], we restrict ourselves to the credible set above to avoid end-effects, and to allow for asymmetry in the plausible values around the 'best' estimate.
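A minimal sketch of the credible interval p^±(p_t, n_t) using the Beta quantile function (scipy is assumed; the small floor on the Beta parameters is our own guard against the improper-prior endpoints):

```python
from scipy.stats import beta

def credible_interval(p, n, k, eps=1e-8):
    """Return (p_minus, p_plus): a level-k credible interval for theta,
    treating p*n ones and (1-p)*n zeros as Beta 'pseudo-counts'."""
    a = max(p * n, eps)          # guard against a = 0 (improper prior)
    b = max((1.0 - p) * n, eps)
    return beta.ppf(0.5 - k / 2.0, a, b), beta.ppf(0.5 + k / 2.0, a, b)

print(credible_interval(p=0.3, n=10, k=0.8))   # e.g. roughly (0.13, 0.49)
```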
As our credible set is constructed from p_t and n_t, and the pair (p_t, n_t) can be computed recursively, the nonlinear expectation at time t can be written as a function of the pair (p_t, 1/√n_t). By recalling the definition of γ(s) (Definition 4.1), one can show (using a general robust dynamic programming argument, as in Ruszczyński [71], or the nonlinear Snell envelope, as in Riedel [68]) that we can write γ(t) = γ_{k,β,T−t}(p_t, 1/√n_t) for some function γ_{k,β,T−t}. We then use a simple finite-difference algorithm (see Appendix D) to estimate this function, where, in our simulations, we fix n_0 = 1. Plots of this estimate, for various values of k, β and T, can be found in Figure 1.
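The following is a rough sketch (our own, not the finite-difference scheme of Appendix D) of how γ_{k,β,T}(p, n) can be computed for this Bernoulli example: for a candidate compensation λ, the worst-case value of continuing to play is computed by backward induction over the posterior state (number of ones, number of observations), with the option to stop at each step, and γ is then found by bisection over λ. All parameter values and names are illustrative.

```python
from functools import lru_cache
from scipy.stats import beta as beta_dist

BETA_DISC, K, HORIZON = 0.95, 0.5, 50   # discount, credible level, horizon

def theta_interval(s, n, eps=1e-8):
    """Credible interval for theta given s ones out of n observations."""
    a, b = max(s, eps), max(n - s, eps)
    return (beta_dist.ppf(0.5 - K / 2, a, b), beta_dist.ppf(0.5 + K / 2, a, b))

def continuation(lam, s, n):
    """Worst-case expected discounted cost of playing once more (cost h - lam)
    and then behaving optimally, with the option to stop at each later step."""
    @lru_cache(maxsize=None)
    def W(s, n, steps_left):
        if steps_left == 0:
            return 0.0
        # value after observing a 1 / a 0, with the option to stop (value 0)
        w1 = min(0.0, W(s + 1, n + 1, steps_left - 1))
        w0 = min(0.0, W(s, n + 1, steps_left - 1))
        lo, hi = theta_interval(s, n)
        # worst case over theta of an expression linear in theta
        val = lambda th: th * (1.0 - lam + w1) + (1.0 - th) * (-lam + w0)
        return BETA_DISC * max(val(lo), val(hi))
    return W(s, n, HORIZON)

def robust_index(s, n, tol=1e-4):
    """Smallest compensation lam making continued play acceptable (W <= 0)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if continuation(mid, s, n) <= 0.0:
            hi = mid
        else:
            lo = mid
    return hi

print(robust_index(s=3, n=10))   # index for p = 0.3 after 10 observations
```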
For h(t) = ξ t , with uncertainty modeled by (5.1), at each time step we wish to play the bandit with the lowest θ. Classically, this is estimated by p, so a naïve (greedy) strategy would suggest playing the bandit with the lowest estimated average loss p. By using C-optimality, at each point, we choose a bandit with the lowest γ. Therefore, we may think of γ as an implied probability p, distorted to account for exploration and exploitation of the system of bandits.
In Figure 1, we see the following broad phenomena:
• When 1/√n is small, the difference between γ and p is close to zero. In particular, this says that when we have high certainty in our estimates, γ is equivalent to the estimated probability.
• When we increase β, the difference γ − p typically decreases. This corresponds to the fact that β is a discount factor which determines how much we value future costs. Therefore, increasing β increases the degree to which we wish to explore the system, i.e. we become more optimistic in our evaluation. We also observe that decreasing β yields a similar effect to shortening the horizon.
• When k is increased, the difference γ − p increases. This is due to the fact that k corresponds to the 'width' of the 'credible interval'. Hence, large k means that we become more conservative and favour exploiting over exploring.

Prospect Theory
One result suggested in Figure 1, for β = 0.9999 and k = 0.01, is that, when we do not worry about uncertainty, we are more optimistic when p is large (close to 1); that is, γ is clearly less than p. On the other hand, when uncertainty dominates, e.g. when β = 0.9999 and k = 0.95, or β = 0.95 and k = 0.8, we become more pessimistic. Curiously, when β = 0.9999 and k = 0.8, or β = 0.95 and k = 0.5, both optimism and pessimism can be seen. For large p (when the game seems bad), pessimism dominates, while for small p (when the game seems good) we become optimistic in our optimal strategy. This gives a bias in the probabilities, related to that used in the probability weighting functions considered in prospect theory by Kahneman and Tversky [47] or in rank-dependent expected utility by Quiggin [65,66]. In this literature, models are proposed to explain irrationality in human decisions under risk. It is argued that people generally reweigh the probabilities of different outcomes using a nonlinear increasing map p → π(p), with various assumptions on its curvature.
Our result (for appropriate values of β and k) reflects this behaviour without imposing a probability weighting function as in classical prospect theory.

Monte-Carlo Simulation
In order to illustrate the performance of the robust Gittins index calculated above in actual decision making, we consider the Bernoulli bandit as described above, with 50 exchangeable bandits and a horizon T = 10^4. We run 10^3 Monte-Carlo simulations and compare the performance of various strategies for decision making. To provide a wide range of scenarios in which our strategies must perform, in each simulation we first generate a, b independently from a Γ(1, 1/100) distribution, then generate the 'true' probabilities for each bandit independently from Beta(a, b). We generate 10 trials on each bandit to provide initial information. N.B. Formally, we assume that each bandit can be played for at least T = 10^4 trials in constructing our Gittins index. We illustrate the performance of the first 10^4 plays to compare with other algorithms.
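A sketch of this simulation set-up (our own illustration: the Gamma rate/scale convention, the random seed and the greedy stand-in policy are assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
M, HORIZON, INIT = 50, 10_000, 10

def one_simulation(policy):
    # Gamma(1, 1/100): rate-parameter convention assumed here (mean 100).
    a, b = rng.gamma(1.0, 100.0, size=2)
    theta = rng.beta(a, b, size=M)                 # 'true' cost probabilities
    ones = rng.binomial(INIT, theta).astype(float) # 10 initial trials per bandit
    n = np.full(M, float(INIT))
    regret = 0.0
    for _ in range(HORIZON):
        i = policy(ones / n, n)                    # choose a bandit to play
        cost = rng.binomial(1, theta[i])
        ones[i] += cost
        n[i] += 1
        regret += theta[i] - theta.min()
    return regret

greedy = lambda p, n: int(np.argmin(p))            # play lowest estimated cost
print(one_simulation(greedy))
```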

Measures of Regret
There are a number of possible objectives to measure the loss of our decisions. We will consider the following examples (from Bertini et al. [19] and Lai and Robbins [54]).
• Expected-expected regret. This is the difference in the true expectations under our strategy and an optimal strategy with perfect information. In our setting, this can be given by R(L) = Σ_{n=0}^{L} (θ^(ρ_n) − θ*), where θ^(m) is the true probability of the mth bandit and θ* = min_m θ^(m).
• Sub-optimal plays. This measures the number of times that we play a sub-optimal bandit, given by N_∨(L) = Σ_{n=0}^{L} I(θ^(ρ_n) ≠ θ*).
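Both measures are straightforward to accumulate along a simulated path; a small sketch (names illustrative):

```python
def regret_measures(played, theta):
    """Expected-expected regret R(L) and number of sub-optimal plays N(L).

    played -- sequence of arm indices (rho_0, ..., rho_L) actually chosen
    theta  -- list of true cost probabilities, one per bandit
    """
    theta_star = min(theta)
    R = sum(theta[i] - theta_star for i in played)
    N_sub = sum(1 for i in played if theta[i] != theta_star)
    return R, N_sub

print(regret_measures([2, 0, 1, 0, 0], [0.2, 0.5, 0.7]))  # (0.8, 2)
```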

Policy for multi-armed-bandits
In our simulation, we will label our algorithm the DR (Data-Robust) algorithm. We also consider the following classical policies which are commonly used to solve the Bernoulli bandit problem. These policies choose an arm by considering the minimal index I evaluated on each bandit separately. Literature about these policies and further developments can be found in the reviews by Bertini et al. [19] or Russo et al. [70]. For notational simplicity, we will denote by p and n the estimated probability and the number of observations of the considered bandit at the time before making a decision.
• Greedy strategy. In this policy, we choose the bandit with the minimal estimated probability given by I Greedy = p.
• Thompson strategy. This is a Bayesian adaptive decision strategy for the bandit problem. It proceeds by first randomly generating a sample from the posterior distribution of the mean cost of each bandit, then chooses to play the bandit which gave the minimal sample. In our setting, these samples are given by I_Thompson ∼ Beta(a_0 + pn, b_0 + (1 − p)n), where a_0, b_0 > 0 are the parameters of a Beta prior distribution for the mean. To avoid biasing our estimation, we consider initial values a_0 = b_0 ∈ {0.0001, 1, 50}, where larger values correspond to a more informative prior.
• UCB strategy. This is an optimistic strategy which chooses the bandit based on a confidence bound for its mean cost; in our minimization setting, we play the bandit with the minimal lower confidence bound.

In Figure 2, considering first the cases where β = 0.9999, we can see that an increase in the value of k has a nonlinear effect on the distribution of regret. Initially, increasing k appears to lead to a reduction in the typical regret, but a possible increase in the average and variability of the number of suboptimal plays. However, setting k too large clearly leads to worse outcomes. This is because k corresponds to the level of robustness; the more robust we are, the less willing we are to explore and the more willing we are to exploit. It follows that a large value of k encourages us to exploit early, and we may not find the optimal bandit to play.
On the other hand, the discount rate β determines how much we value our future costs. If we have a high level of robustness (large k) but do not value the future cost enough (small β), we may end up settling for a sub-optimal decision. This can be seen most clearly when β = 0.9999 and k = 0.8. In this case the average expected-expected regret is relatively small when compared to other strategies, but its average number of suboptimal plays is relatively high. Reducing β to 0.95 emphasizes these effects even further.
As discussed in the introduction, the UCB algorithm asymptotically achieves a minimal regret bound (see [54]). It does so by ensuring that, over short horizons, the algorithm explores a sufficient amount, in order to guarantee good asymptotic performance. We can see that in our simulation (with 10^4 plays over 50 bandits), the UCB algorithm is still in its high exploration regime, which results in a high regret and very few optimal plays. In contrast, a Greedy algorithm always chooses an arm to play without taking into account its uncertainty (and so without considering the possibility of exploration) and therefore there is no learning in its procedure. This results in the greedy algorithm yielding a low average regret but a high average number of suboptimal plays. In Figure 3, we illustrate the interquartile range and the standard deviation of the total expected-expected regret when β = 0.9999 with different values of k over 1000 simulations. We can see that, by introducing an appropriate value of k, we can obtain a substantial reduction in the interquartile range and the standard deviation. In particular, the DR algorithm not only gives a low average regret but also does so consistently over different simulations.

A Part A: Analysis of a single bandit
We will now flesh out the sketch given in Section 4.2.
In this section, we will focus the discussion on a single bandit.

A.1 Step A.1: Indifference reward and Optimal Stopping problem
We first recall the definition of the robust Gittins index (process).
where T(s) denotes the family of (F_{s+t})_{t≥0}-positive stopping times.
To study the process γ, we introduce an auxiliary optimal stopping problem. At each time step, the player decides whether to continue or to stop play of the machine. If the player decides to continue to play, he will be offered a fixed reward λ (known at the initial time s) in addition to the cost h(t).
Definition A.1. The target function V_s : T(s) × L^∞(F_s) → L^∞(F_s) for a stopping time τ ∈ T(s) with a reward λ is defined by

Recall that γ(s) is defined to be the minimum reward λ such that, with a choice of τ minimizing V_s(τ, λ), the expected loss is at most zero. By minimality of γ(s) and monotonicity of E, the reward γ(s) will yield zero loss under optimal stopping and, therefore, cannot yield a positive expected reward under suboptimal stopping. In particular, the following holds.
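For orientation, writing E_s for the conditional nonlinear expectation given F_s, a sketch of the objects involved (consistent with the summation appearing in Remark 29 below, but not necessarily the exact formulation used elsewhere in the paper) is

\[
V_s(\tau,\lambda) \;=\; \mathcal{E}_s\Big[\sum_{t=s+1}^{s+\tau}\beta^{t}\bigl(h(t)-\lambda\bigr)\Big],
\qquad
\gamma(s) \;=\; \operatorname*{ess\,inf}\Bigl\{\lambda\in L^{\infty}(\mathcal{F}_s)\;:\;\operatorname*{ess\,inf}_{\tau\in T(s)}V_s(\tau,\lambda)\le 0\Bigr\}.
\]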
Theorem A.2. The function V_s defined above satisfies:
Proof. This can be done by showing that V_s satisfies the regularity assumptions of Lemma C.6.
For condition (ii), suppose that λ' > λ. Then, by monotonicity and translation equivariance, V_s is Lipschitz in λ. Condition (iii) follows from (F_t)_{t≥0}-regularity of E (Remark 8). The result follows from Lemma C.6.

Remark 28. Theorem A.2 shows that, under optimal stopping, with the reward γ(s), the expected total loss is zero. In particular, we may view γ(s) as an 'average cost under optimal play' of the bandit.
A.2 Step A.2: Optimal Stopping time

By considering a Snell envelope argument, as in Riedel [68] with slight modification, we can establish that a stopping time τ* achieving the minimum value V_s(τ*, λ) = ess inf_{τ ∈ T(s)} V_s(τ, λ) exists (Theorem C.7). In this subsection, we will show that τ* can be expressed as a hitting time of the Gittins index process (γ(s)).
Definition A.4. Let λ be a non-negative F_s-measurable random variable. Define a stopping time σ(s, λ) by

σ(s, λ) := inf{θ ≥ 1 : γ(s + θ) > λ}.

As mentioned in Remark 28, we may view γ as a time-average cost under optimal stopping. The stopping time σ(s, λ) can be interpreted as the first time when this average cost exceeds a fixed λ. Once γ exceeds λ, the offered compensation λ is insufficient to make the bandit attractive, so, to minimize the total 'expected' cost, we will stop.
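As a small illustration of this hitting-time rule, given the values of the index γ along a single (finite) scenario, σ(s, λ) could be evaluated as follows; the function name and the handling of paths on which the index never exceeds λ are our own conventions.

    def hitting_time(gamma_path, s, lam):
        # gamma_path[t] = gamma(t) along one scenario.
        # Returns the first theta >= 1 with gamma(s + theta) > lam,
        # or None if the index never exceeds lam on the observed path.
        theta = 1
        while s + theta < len(gamma_path):
            if gamma_path[s + theta] > lam:
                return theta
            theta += 1
        return None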
In what follows, we formalize this intuition. We will show that σ(s, λ) is an optimal stopping time when the reward λ is offered. In particular, we will show that σ(s, γ(s)) attains the optimal value with the reward λ = γ(s). Moreover, the value for this optimal stopping problem is zero (by Theorem A.2).
The optimality of σ(s, λ) can be proved by showing that for any stopping time τ ∈ T(s), if τ > σ(s, λ) on some event, our value can be improved by stopping at σ(s, λ) (Lemma A.5). On the other hand, if τ < σ(s, λ) on some event, the value can be improved by continuing to play (Lemma A.6). The (easy) proofs of these lemmata are given in Appendix C.
Lemma A.5. For every λ ∈ L^∞(F_s) taking values in [0, C) and τ ∈ T(s),

Proof. We will prove this result by applying Corollary A.3 together with time-consistency and monotonicity of our nonlinear expectation.

By translation equivariance and regularity (Remark 8), it follows that
Finally, by applying monotonicity and time-consistency as in the previous lemma, the result follows.
By combining these observations with the Lebesgue property of E, we have the following theorem (Theorem A.8). In particular, σ(s, γ(s)) yields equality in Corollary A.3.
Remark 29. Bank and El Karoui [9] consider a similar result to this theorem, but under a classical expectation, with the summation Σ_{t=s+1}^{s+τ} β^t (h(t) − γ(s)) replaced by a more general function in continuous time. (See also [10] and [11] for further discussion.)

A.3 Step A.3: Fair Game and Prevailing process
Previously, we considered an optimal stopping problem when the Gittins index is offered as compensation for continued play. In this subsection, we consider a 'fair game' when we offer a compensation which is (just) sufficient to encourage us to continue playing the bandit. In particular, the compensation increases at each optimal stopping time in order to encourage the agent to continue.
We will first define a sequence of optimal stopping times that we have to consider in order to analyze our (minimal) compensation process.
Definition A.9. We define Ŝ_n to be the stopping time where the Gittins index process (γ(s))_{s≥0} exceeds its running maximum for the nth time. We write σ_n for the duration between Ŝ_n and Ŝ_{n+1}; that is, σ_n is a random time identifying how long after time Ŝ_n the process (γ(s))_{s≥0} hits a new maximum.
Definition A.10. We define the prevailing reward process Γ by the running maximum of γ, that is, Γ(t) := max_{θ≤t−1} γ(θ).

We can then show that the process Γ serves as an indifference reward (process) for our agent, when evaluated from the perspective of one of the stopping times Ŝ_n.
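To make Definitions A.9 and A.10 concrete, the following sketch computes, for the values of γ along one scenario, the prevailing process as a running maximum together with the record times at which the index exceeds its previous maximum. The convention Γ(t) = max_{θ≤t−1} γ(θ), and the counting of the initial time as the first record, are assumptions of this illustration.

    def prevailing_and_record_times(gamma_path):
        # gamma_path[t] = gamma(t) along one scenario.
        # Gamma[t] = max of gamma_path[0..t-1] for t >= 1; Gamma[0] is unused.
        # S_hat collects the times at which gamma sets a new running maximum.
        Gamma, S_hat = [None], []
        running_max = float("-inf")
        for t, g in enumerate(gamma_path):
            if g > running_max:
                running_max = g
                S_hat.append(t)
            Gamma.append(running_max)
        return Gamma, S_hat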
Proposition A.11. For all n ∈ N,

In particular,

Proof. By Theorem A.8, we have, for all k ∈ N,

Fix n, N ∈ N with N ≥ n; by time-consistency and translation equivariance,

By our definition of Ŝ_n, we have Ŝ_N ≥ N. Hence, Ŝ_N → ∞ as N → ∞. Therefore, by applying the Lebesgue property, the result follows.
Intuitively, as Γ(t) ≥ γ(t − 1), the process Γ should be sufficient to compensate for continuing to play. This means that the total 'expected' loss, evaluated from any point in time, must be non-positive if a reward Γ(t) is offered. This is stated formally in the following lemma and theorem.
Moreover, the above inequality is strict on A. Hence, if A is not a P-null set, strict monotonicity yields a contradiction with the minimality of σ(s, λ) established in Theorem A.8.
Proof. Define τ_n := (Ŝ_{n+1} ∧ N) ∨ Ŝ_n. Since Ŝ_n is a stopping time for all n ∈ N, so is τ_n. Hence, by Proposition A.11 and Lemma A.12,

Therefore, as {Ŝ_n ≤ N < Ŝ_{n+1}} is F_N-measurable, by the Lebesgue property and regularity (Remark 8),

Remark 30. The above theorem says that, with compensation Γ(t), at any point in time we expect to obtain a net reward from continuing to play, i.e. we have a non-positive expected total loss.

A.4 Step A.4: Reward Delay and Robust Representation Theorem
In Step A.3, we showed that the reward Γ is defined to be (just) sufficient to encourage the player to continue playing (Theorem A.13) until the horizon (i.e. the total expected loss is zero, as in Proposition A.11). We now show that taking a break from play cannot improve a player's expected discounted costs. We formulate this observation by establishing the existence of a probability measure in our representing set Q such that the expected discounted costs, accounting for the break in play, are bounded below by a value close to zero. This result will be useful when considering multiple bandits.
Theorem A.14. By Assumption 3.6, recall that E admits a robust representation of the form

For every fixed ε > 0, there exists a probability measure Q ∈ Q such that, for every predictable decreasing process (α(t))_{t≥0} taking values in [0, 1], we have

Proof. By Proposition A.11 and the robust representation theorem, for a fixed ε > 0, we can find a probability measure Q ∈ Q such that

For each predictable decreasing process (α(t))_{t≥0} taking values in [0, 1], we define

We claim that

Indeed, it is clear that the result holds when N = 0. For the sake of induction, assume that the result holds for a given N. We then have

By the robust representation theorem (Theorem 3.5),

By Theorem A.13, we know that

Since α is decreasing,

As α is predictable, by rearranging the above inequality, we obtain

Hence, by the tower property,

By substituting this into (A.2), we obtain (A.1) with N replaced by N + 1, which completes the induction step. Finally, by the bounded convergence theorem, we can take N → ∞ and obtain the required result.

B Part B: Analysis of multiple bandits
We are now ready to consider the problem of choosing between multiple bandits.
In Definition 4.6, we introduced our class of admissible controls which can be considered in our dynamic allocation problems. This class of controls introduces a few natural ways of parameterizing time. We will therefore use the following terminology to describe the evolution of time in different ways; this terminology will be useful in our discussion of the proof.

Remark 32. The mth component of the recording sequence η_n (Definition 4.5) represents the number of runs of the mth bandit before the nth decision. We can see that the random variable η_n takes values in {r ∈ S : Σ_{m=1}^{M} r^{(m)} ≤ n}. In order to prove C-optimality, we then consider the target function as briefly stated in (4.9).
Definition B.1. For each m ∈ M, let (h^{(m)}(t))_{t≥1} be the uniformly bounded non-negative cost process at the tth trial of the mth bandit, with prevailing reward process (Γ^{(m)}(t))_{t≥1} (Definition A.10). For an allocation strategy (τ, p) (Definition 4.6), we define the Gittins' target function by

where ρ is the simple form of (τ, p) with corresponding counting processes t^ρ_n := Σ_{k=0}^{n−1} I(ρ_k = ρ_{n−1}), and E is a partially consistent orthant nonlinear expectation, as in Definition 3.7.
Remark 33. We can also write V (τ, p) in terms of τ and p directly without identifying the simple form ρ. This is done in the proof of Theorem B.4 in Step B.2. This definition, however, makes it clear that V depends on (τ, p) only through its simple form.
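One plausible reading of the omitted display, consistent with the cost g_ρ(n) = β^n h^{(ρ_{n−1})}(t^ρ_n) used in Step B.3 and with the role of Γ as compensation, is the sketch

\[
V(\tau,p) \;=\; \mathcal{E}\Big[\sum_{n\ge 1}\beta^{n}\Bigl(h^{(\rho_{n-1})}\bigl(t^{\rho}_{n}\bigr)-\Gamma^{(\rho_{n-1})}\bigl(t^{\rho}_{n}\bigr)\Bigr)\Big],
\]

though the exact expression (in particular the indexing conventions for ρ and t^ρ_n) may differ from the one intended in the main text.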

B.1 Step B.1: Fubini theorem and Suboptimality
In this subsection, we will show that considering generic stopping times and choice of bandits yields a non-negative expected loss. This can be shown using the robust representation result.
First, we recall the following corollary of Fubini's theorem.
By the definition of E and Fubini's theorem, it follows that

As ε is arbitrary, the result follows.

B.2 Step B.2: Optimality
In this subsection, we will show that the strategy determined by a particular time allocation sequence yields a zero expected cost in the Gittins' target function.
Theorem B.4. For each m ∈ M, let (σ^{(m)}_k)_{k≥0} be the sequence of running maximum random times associated to the mth bandit, as defined in Definition A.9, i.e. we define σ

Using this notation, we can define a variation on the Gittins' target function, with the restriction that we consider only the first N plays of the system, that is, with the convention h^{(0)}(t) = Γ^{(0)}(t) = 0 for all t.
By considering the simple form ρ of the strategy (σ, p) and applying the Lebesgue property of E, we can show that V agrees with Definition B.1 as N → ∞, that is, lim_{N→∞} V(N, σ, p) = V(σ, p). Hence, it suffices to show that V(N, σ, p) ≤ 0 for all N ∈ N. This will be proved by induction. It is clear that V(0, σ, p) = 0. Fix N ∈ N and assume that V(N, σ, p) ≤ 0. To show that V(N + 1, σ, p) ≤ 0, by subadditivity, it suffices to show that

Define the following random variables as in (4.

We see that V(N + 1, σ, p) ≤ 0, and the desired result follows by induction.

B.3 Step B.3: C-optimality
In the previous subsections, we introduced an allocation problem when the prevailing process is offered as compensation (Definition B.1). We also proved that the optimal value can be achieved by choosing a proper family of allocation time sequences (i.e. σ as in Theorem B.4). The prevailing reward process Γ for each bandit is non-decreasing, and the optimal allocation sequences σ require us to make a new decision whenever the process Γ increases. By exploiting this fact, together with the discounting effect, we will see that it is preferable to play the bandit with the lowest value of Γ first. In particular, we can establish the Robust Gittins index theorem, which we repeat for the convenience of the reader: the minimum-Gittins-index strategy ρ* (which always plays a bandit with the lowest current index) is C-optimal (Definition 3.12) under E for the cost g_ρ(n) = β^n h^{(ρ_{n−1})}(t^ρ_n), where t^ρ_n = Σ_{k=0}^{n−1} I(ρ_k = ρ_{n−1}).
Proof. Recall the definition of Ψ^{(m)}_r in (4.7). We can see that Ψ_r determines the orthant filtration when η_n = r, where (η_n) is a recording sequence constructed from the time allocation sequence σ, i.e. when the mth bandit was run r^{(m)} times under the (optimal) allocation sequence σ. In particular, Ψ ∈ F(Ψ_r). Hence, p* is a choice sequence for the allocation sequence σ (Definition 4.5). Therefore, (σ, p*) is an admissible allocation strategy. Moreover, observe that ρ* given in the statement of this theorem is the simple form of the allocation strategy (σ, p*).
Next, we will show that n ↦ β^n Γ^{(ρ_{n−1})}(t^ρ_n) is predictable with respect to our observed filtration. We recall that Γ^{(m)}(t) is the running maximum of the index process γ^{(m)} up to time t − 1. Note that, while the process ξ^{(m)}_t is defined on (Ω^{(m)}, F^{(m)}), we can extend it to (Ω, F) by considering an appropriate embedding.
To construct a compensator for a subsequent time N, we consider 'restarting' our system at an orthant time r = (r^{(1)}, ..., r^{(M)}) ∈ S_N ⊆ S (as in Theorem C.12). As F(r) describes the information from all bandits, this needs to be done carefully. Each of our single-bandit filtrations (F^{(m)}_t)_{t≥0} is generated by a discrete-time real-valued process, and F(r) = ⊗_m F^{(m)}_{r^{(m)}}, so the Doob-Dynkin lemma states that any F(r)-measurable random variable can be written as a Borel function of the first r^{(m)} observations of each bandit m. For concreteness, we denote these observations by ω_r.
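For reference, the Doob-Dynkin lemma invoked here is the standard statement (written in generic notation): if a random variable X is σ(Y)-measurable, with Y taking values in a Borel space, then

\[
X = f(Y) \quad \text{for some Borel-measurable function } f.
\]

In our case, Y is the vector of the first r^{(m)} observations of each bandit m.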
We proceed by freezing the values of ω_r and ω_u for u ≤ r. Let (ρ^{*,ω_r}_n)_{n≥N} denote the minimum-Gittins-index strategy given by ρ* defined in the theorem when we restart our analysis at r from a given ω_r. We do not change the Gittins indices γ when we fix ω_r, so the corresponding Γ processes are unchanged and are measurable with respect to ω_r. As discussed in Remark 19, the optimal strategy (ρ^{*,ω_r}_n)_{n≥N} coincides with the strategy ρ* (and is therefore also measurable with respect to ω_r).
We can now unfreeze ω_r and ω_u and, summing over all possible scenarios, show that C^ρ_N(n) := β^n max_{η^{(ρ_{n−1})}_N ≤ θ ≤ t^ρ_n − 1} γ^{(ρ_n)}(θ) is decreasing in N. By applying the same argument as earlier, we also have that n ↦ C^ρ_N(n) is H^ρ_n-predictable. Therefore, (C^ρ_N(n)) defines a subcompensator for the strategy ρ and fully compensates for ρ = ρ*. We set V_N = 0 and observe that (3.2) is satisfied.
Finally, as t ↦ Γ^{(m)}_{ω_r}(t) is increasing for all m and all ω_r, and ρ* is a strategy where the lowest Γ is chosen first, it follows that, for all 1 ≤ N ≤ ∞, if ρ ∼_N ρ*, then

In particular, (3.3) is satisfied and therefore ρ* is C-optimal.
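To summarize the allocation rule whose C-optimality is established above, the following is a minimal sketch of the resulting index policy. Only the rule 'play an arm attaining the smallest current index' is taken from the theorem; the function names, and the assumption that a routine index_fn computing each arm's current (prevailing) index from its own observation history is available (for example, from the single-bandit analysis of Part A), are ours.

    def allocate(indices):
        # indices[m] = current index of bandit m; play an arm with the
        # minimal index (ties broken by the smallest label).
        return min(range(len(indices)), key=lambda m: indices[m])

    def run(histories, horizon, index_fn, play):
        # histories: list of per-arm observation histories (one list per arm).
        # index_fn: maps an arm's history to its current index (assumed given).
        # play: maps an arm label to a newly observed cost.
        for _ in range(horizon):
            m = allocate([index_fn(h) for h in histories])
            histories[m].append(play(m))
        return histories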

C Proof of Other Relevant Results
Definition C.1 (Föllmer and Schied [36]). Let (Ω, G, P) be a probability space and let Y be a family of G-measurable random variables. We say Z is a G-essential infimum of Y, denoted by Z = G-ess inf Y, if
(i) Z is G-measurable;
(ii) Z ≤ Y P-a.s. for all Y ∈ Y;
(iii) for any Z' such that Z' ≤ Y P-a.s. for all Y ∈ Y, we must have Z' ≤ Z P-a.s.
We also define a similar notion for G-essential supremum. We may omit G in front of ess inf if the measurability of the family is obvious.
Theorem C.2 (Existence of the essential infimum). The G-essential infimum exists. Suppose in addition that Y is directed downwards, that is, for Y, Y' ∈ Y, there exists Ỹ ∈ Y such that Ỹ ≤ min(Y, Y'). Then there exists a decreasing sequence (Y_n)_{n∈N} ⊆ Y such that Y_n ↘ ess inf Y P-a.s. A similar result holds for the G-essential supremum.