## The Annals of Applied Probability

### Asymptotically optimal priority policies for indexable and nonindexable restless bandits

I. M. Verloop

#### Abstract

We study the asymptotic optimal control of multi-class restless bandits. A restless bandit is a controllable stochastic process whose state evolution depends on whether or not the bandit is made active. Since finding the optimal control is typically intractable, we propose a class of priority policies that are proved to be asymptotically optimal under a global attractor property and a technical condition. We consider both a fixed population of bandits and a dynamic population where bandits can depart and arrive. As an example of a dynamic population of bandits, we analyze a multi-class $\mathit{M/M/S+M}$ queue for which we show asymptotic optimality of an index policy.

We combine fluid-scaling techniques with linear programming results to prove that when bandits are indexable, Whittle’s index policy is included in our class of priority policies. We thereby generalize a result of Weber and Weiss [J. Appl. Probab. 27 (1990) 637–648] about asymptotic optimality of Whittle’s index policy to settings with (i) several classes of bandits, (ii) arrivals of new bandits and (iii) multiple actions.

Indexability of the bandits is not required for our results to hold. For nonindexable bandits, we describe how to select priority policies from the class of asymptotically optimal policies and present numerical evidence that, outside the asymptotic regime, the performance of our proposed priority policies is nearly optimal.
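To make the notion of a priority policy concrete, the following is a minimal sketch and not the paper's construction. It assumes a hypothetical fixed ordering over states (for multi-class bandits a state could be a `(class, state)` pair) and a hypothetical cap `budget` on the number of simultaneously active bandits. Bandits are activated in priority order until the cap is reached; bandits whose state does not appear in the ordering are assumed to stay passive. The names `priority_policy`, `priority_order` and `budget` are illustrative only.

```python
from typing import Hashable, List, Sequence


def priority_policy(
    states: Sequence[Hashable],
    priority_order: Sequence[Hashable],
    budget: int,
) -> List[int]:
    """Return the indices of the bandits to activate.

    states         -- current state of each bandit (hypothetical encoding)
    priority_order -- states listed from highest to lowest priority
    budget         -- assumed maximum number of active bandits
    """
    # Map each state to its rank; a lower rank means a higher priority.
    rank = {state: r for r, state in enumerate(priority_order)}

    # Assumption: a bandit in a state absent from the ordering is never activated.
    eligible = [i for i, s in enumerate(states) if s in rank]

    # Python's sort is stable, so ties are broken by bandit index.
    eligible.sort(key=lambda i: rank[states[i]])

    return eligible[:budget]


if __name__ == "__main__":
    # Five bandits; state "b" has priority over "a"; "c" is never activated.
    chosen = priority_policy(["a", "b", "c", "b", "a"], ["b", "a"], budget=3)
    print(chosen)  # [1, 3, 0]
```

Which ordering to use is exactly what the paper's analysis addresses; the sketch only shows how a given ordering translates into an activation decision.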

#### Article information

Source
Ann. Appl. Probab. Volume 26, Number 4 (2016), 1947–1995.

Dates
Revised: August 2015
First available in Project Euclid: 1 September 2016

https://projecteuclid.org/euclid.aoap/1472745449

Digital Object Identifier
doi:10.1214/15-AAP1137

Mathematical Reviews number (MathSciNet)
MR3543887

Zentralblatt MATH identifier
1349.90834

#### Citation

Verloop, I. M. Asymptotically optimal priority policies for indexable and nonindexable restless bandits. Ann. Appl. Probab. 26 (2016), no. 4, 1947–1995. doi:10.1214/15-AAP1137. https://projecteuclid.org/euclid.aoap/1472745449.

#### References

• [1] Ahmad, S. H. A., Liu, M., Javidi, T., Zhao, Q. and Krishnamachari, B. (2009). Optimality of myopic sensing in multichannel opportunistic access. IEEE Trans. Inform. Theory 55 4040–4050.
• [2] Ansell, P. S., Glazebrook, K. D., Niño-Mora, J. and O’Keeffe, M. (2003). Whittle’s index policy for a multi-class queueing system with convex holding costs. Math. Methods Oper. Res. 57 21–39.
• [3] Atar, R., Giat, C. and Shimkin, N. (2010). The $c\mu/\theta$ rule for many-server queues with abandonment. Oper. Res. 58 1427–1439.
• [4] Atar, R., Giat, C. and Shimkin, N. (2011). On the asymptotic optimality of the $c\mu/\theta$ rule under ergodic cost. Queueing Syst. 67 127–144.
• [5] Ayesta, U., Erausquin, M. and Jacko, P. (2010). A modeling framework for optimizing the flow-level scheduling with time-varying channels. Performance Evaluation 67 1014–1029.
• [6] Ayesta, U., Erausquin, M., Jonckheere, M. and Verloop, I. M. (2013). Scheduling in a random environment: Stability and asymptotic optimality. IEEE/ACM Transactions on Networking 21 258–271.
• [7] Ayesta, U., Jacko, P. and Novak, V. (2011). A nearly-optimal index rule for scheduling of users with abandonment. In Proceedings of IEEE INFOCOM 2849–2857. IEEE.
• [8] Benaïm, M. and Le Boudec, J.-Y. (2008). A class of mean field interaction models for computer and communication systems. Performance Evaluation 65 823–838.
• [9] Bertsimas, D. and Niño-Mora, J. (2000). Restless bandits, linear programming relaxations, and a primal–dual index heuristic. Oper. Res. 48 80–90.
• [10] Billingsley, P. (1999). Convergence of Probability Measures, 2nd ed. Wiley, New York.
• [11] Cánovas, M. J., López, M. A. and Parra, J. (2005). On the continuity of the optimal value in parametric linear optimization: Stable discretization of the Lagrangian dual of nonlinear problems. Set-Valued Anal. 13 69–84.
• [12] Çinlar, E. (1975). Introduction to Stochastic Processes. Prentice-Hall, Englewood Cliffs, NJ.
• [13] Dai, J. G. and He, S. (2012). Many-server queues with customer abandonment: A survey of diffusion and fluid approximations. Journal of Systems Science and Systems Engineering 21 1–36.
• [14] Ehsan, N. and Liu, M. (2004). On the optimality of an index policy for bandwidth allocation with delayed state observation and differentiated services. In Proceedings of IEEE INFOCOM 1974–1983. IEEE.
• [15] Ethier, S. N. and Kurtz, T. G. (1986). Markov Processes: Characterization and Convergence. Wiley, New York.
• [16] Gast, N. and Gaujal, B. (2010). A mean field model of work stealing in large-scale systems. In Proceedings of ACM SIGMETRICS 13–24. ACM, New York.
• [17] Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. J. R. Stat. Soc. Ser. B. Stat. Methodol. 41 148–177.
• [18] Gittins, J. C. (1989). Multi-Armed Bandit Allocation Indices. Wiley, Chichester.
• [19] Gittins, J. C., Glazebrook, K. D. and Weber, R. R. (2011). Multi-Armed Bandit Allocation Indices. Wiley, Chichester.
• [20] Glazebrook, K. D., Hodge, D. J. and Kirkbride, C. (2011). General notions of indexability for queueing control and asset management. Ann. Appl. Probab. 21 876–907.
• [21] Glazebrook, K. D., Kirkbride, C. and Ouenniche, J. (2009). Index policies for the admission control and routing of impatient customers to heterogeneous service stations. Oper. Res. 57 975–989.
• [22] Glazebrook, K. D. and Mitchell, H. M. (2002). An index policy for a stochastic scheduling model with improving/deteriorating jobs. Naval Res. Logist. 49 706–721.
• [23] Guo, X., Hernández-Lerma, O. and Prieto-Rumeau, T. (2006). A survey of recent results on continuous-time Markov decision processes. TOP 14 177–261.
• [24] Hasenbein, J. and Perry, D. (2013). Special issue on queueing systems with abandonments. Queueing Syst. 75 111–384.
• [25] Hodge, D. J. and Glazebrook, K. D. (2011). Dynamic resource allocation in a multi-product make-to-stock production system. Queueing Syst. 67 333–364.
• [26] Hodge, D. J. and Glazebrook, K. D. (2015). On the asymptotic optimality of greedy index heuristics for multi-action restless bandits. Adv. in Appl. Probab. 47 652–667.
• [27] Jacko, P. (2011). Optimal index rules for single resource allocation to stochastic dynamic competitors. In Proceedings of the 5th International ICST Conference on Performance Evaluation Methodologies and Tools 425–433.
• [28] Larrañaga, M., Ayesta, U. and Verloop, I. M. (2013). Dynamic fluid-based scheduling in a multi-class abandonment queue. Performance Evaluation 70 841–858.
• [29] Larrañaga, M., Ayesta, U. and Verloop, I. M. (2014). Index policies for multi-class queues with convex holding cost and abandonments. In Proceedings of the 2014 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’14) 125–137. ACM, New York.
• [30] Liu, K. and Zhao, Q. (2010). Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Trans. Inform. Theory 56 5547–5567.
• [31] Mahajan, A. and Teneketzis, D. (2007). Multi-armed bandit problems. In Foundations and Application of Sensor Management (A. O. Hero III, D. A. Castanon, D. Cochran and K. Kastella, eds.) 121–308. Springer, New York.
• [32] Meyn, S. and Tweedie, R. L. (2009). Markov Chains and Stochastic Stability, 2nd ed. Cambridge Univ. Press, Cambridge.
• [33] Niño-Mora, J. (2001). Restless bandits, partial conservation laws and indexability. Adv. in Appl. Probab. 33 76–98.
• [34] Niño-Mora, J. (2007). Dynamic priority allocation via restless bandit marginal productivity indices. TOP 15 161–198.
• [35] Niño-Mora, J. (2007). Marginal productivity index policies for admission control and routing to parallel multi-server loss queues with reneging. Lecture Notes in Comput. Sci. 4465 138–149.
• [36] Niño-Mora, J. (2007). Characterization and computation of restless bandit marginal productivity indices. In Proc. 2007 Workshop on Tools for Solving Structured Markov Chains. ACM, New York.
• [37] Ouyang, W., Eryilmaz, A. and Shroff, N. B. (2012). Asymptotically optimal downlink scheduling over Markovian fading channels. In Proceedings of IEEE INFOCOM 1224–1232. IEEE.
• [38] Pandelis, D. G. and Teneketzis, D. (1999). On the optimality of the Gittins index rule for multi-armed bandits with multiple plays. Math. Methods Oper. Res. 50 449–461.
• [39] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York.
• [40] Raghunathan, V., Borkar, V., Cao, M. and Kumar, P. R. (2008). Index policies for real-time multicast scheduling for wireless broadcast systems. In Proceedings of IEEE INFOCOM 1570–1578. IEEE.
• [41] Robert, P. (2003). Stochastic Networks and Queues, French ed. Applications of Mathematics (New York) 52. Springer, Berlin.
• [42] Rybko, A. N. and Stolyar, A. L. (1992). On the ergodicity of random processes that describe the functioning of open queueing networks. Problemy Peredachi Informatsii 28 3–26.
• [43] Tijms, H. C. (2003). A First Course in Stochastic Models. Wiley, Chichester.
• [44] Verloop, I. M. and Núñez-Queija, R. (2009). Assessing the efficiency of resource allocations in bandwidth-sharing networks. Performance Evaluation 66 59–77.
• [45] Weber, R. (2007). Comments on: “Dynamic priority allocation via restless bandit marginal productivity indices” [TOP 15 (2007), no. 2, 161–198] by J. Niño-Mora. TOP 15 211–216.
• [46] Weber, R. R. and Weiss, G. (1990). On an index policy for restless bandits. J. Appl. Probab. 27 637–648.
• [47] Weber, R. R. and Weiss, G. (1991). Addendum to: “On an index policy for restless bandits”. Adv. in Appl. Probab. 23 429–430.
• [48] Weiss, G. (1988). Branching bandit processes. Probab. Engrg. Inform. Sci. 2 269–278.
• [49] Whittle, P. (1981). Arm-acquiring bandits. Ann. Probab. 9 284–292.
• [50] Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 25A 287–298.
• [51] Whittle, P. (1996). Optimal Control: Basics and Beyond. Wiley, Chichester.