## Statistical Science

### Doubly Robust Policy Evaluation and Optimization

#### Abstract

We study sequential decision making in environments where rewards are only partially observed, but can be modeled as a function of the observed context and the action chosen by the decision maker. This setting, known as contextual bandits, encompasses a wide variety of applications, such as health care, content recommendation and Internet advertising. A central task is the evaluation of a new policy given historical data consisting of contexts, actions and received rewards. The key challenge is that past data typically do not faithfully represent the proportions of actions taken by the new policy. Previous approaches rely either on models of rewards or on models of the past policy: the former are plagued by large bias, whereas the latter suffer from large variance.

In this work, we leverage the strengths and overcome the weaknesses of the two approaches by applying the doubly robust estimation technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of the past policy. Extensive empirical comparison demonstrates that doubly robust estimation uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice in policy evaluation and optimization.
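The doubly robust idea described in the abstract combines the two estimators: it starts from a reward model's prediction for the target policy's action (the "direct method" term) and adds an importance-weighted correction whenever the logged action matches the target policy's choice. A minimal sketch is below; the function and argument names are illustrative, not taken from the paper, and this assumes a deterministic target policy and known logging propensities.

```python
def doubly_robust_value(contexts, actions, rewards, propensities,
                        reward_model, target_policy):
    """Doubly robust estimate of a target policy's value from logged data.

    contexts, actions, rewards : logged interaction data
    propensities[i]            : probability the logging policy chose actions[i]
    reward_model(x, a)         : estimated reward for action a in context x
    target_policy(x)           : action the new policy would take in context x
    """
    n = len(rewards)
    total = 0.0
    for x, a, r, p in zip(contexts, actions, rewards, propensities):
        pi_a = target_policy(x)
        # Direct-method term: model-based reward of the target action.
        total += reward_model(x, pi_a)
        # Importance-weighted correction, nonzero only when the logged
        # action matches the target policy's choice. It cancels the
        # reward model's error when the propensities are correct.
        if a == pi_a:
            total += (r - reward_model(x, a)) / p
    return total / n
```

The "doubly robust" property is visible in the structure: if the reward model is accurate, the correction term is near zero and the estimate inherits the direct method's low variance; if instead the propensities are accurate, the correction removes the reward model's bias in expectation.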

#### Article information

**Source**
Statist. Sci., Volume 29, Number 4 (2014), 485–511.

**Dates**
First available in Project Euclid: 15 January 2015

**Permanent link**
https://projecteuclid.org/euclid.ss/1421330544

**Digital Object Identifier**
doi:10.1214/14-STS500

**Mathematical Reviews number (MathSciNet)**
MR3300356

**Zentralblatt MATH identifier**
1331.62059

#### Citation

Dudík, Miroslav; Erhan, Dumitru; Langford, John; Li, Lihong. Doubly Robust Policy Evaluation and Optimization. Statist. Sci. 29 (2014), no. 4, 485–511. doi:10.1214/14-STS500. https://projecteuclid.org/euclid.ss/1421330544

#### References

• Agarwal, D., Chen, B.-C., Elango, P. and Ramakrishnan, R. (2013). Content recommendation on web portals. Comm. ACM 56 92–101.
• Asuncion, A. and Newman, D. J. (2007). UCI machine learning repository. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
• Auer, P., Cesa-Bianchi, N., Freund, Y. and Schapire, R. E. (2002/03). The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32 48–77 (electronic).
• Barto, A. G. and Anandan, P. (1985). Pattern-recognizing stochastic learning automata. IEEE Trans. Systems Man Cybernet. 15 360–375.
• Beygelzimer, A. and Langford, J. (2009). The offset tree for learning with partial labels. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 129–138. Association for Computing Machinery, New York.
• Beygelzimer, A., Langford, J. and Ravikumar, P. (2008). Multiclass classification with filter-trees. Unpublished technical report. Available at http://arxiv.org/abs/0902.3176.
• Beygelzimer, A., Langford, J., Li, L., Reyzin, L. and Schapire, R. E. (2011). Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics (AI&Stats) 19–26. jmlr.org.
• Blum, A., Kalai, A. and Langford, J. (1999). Beating the hold-out: Bounds for $K$-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (Santa Cruz, CA, 1999) 203–208. ACM, New York.
• Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P. and Snelson, E. (2013). Counterfactual reasoning and learning systems: The example of computational advertising. J. Mach. Learn. Res. 14 3207–3260.
• Cassel, C. M., Särndal, C. E. and Wretman, J. H. (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika 63 615–620.
• Chan, D., Ge, R., Gershony, O., Hesterberg, T. and Lambert, D. (2010). Evaluating online ad campaigns in a pipeline: Causal models at scale. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 7–16. ACM, New York.
• Chapelle, O. and Li, L. (2012). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24 (NIPS) 2249–2257. Curran Associates, Red Hook, NY.
• Dudík, M., Langford, J. and Li, L. (2011). Doubly robust policy evaluation and learning. In International Conference on Machine Learning (ICML).
• Dudík, M., Erhan, D., Langford, J. and Li, L. (2012). Sample-efficient nonstationary-policy evaluation for contextual bandits. In Conference on Uncertainty in Artificial Intelligence (UAI) 1097–1104. Association for Computing Machinery, New York.
• Freedman, D. A. (1975). On tail probabilities for martingales. Ann. Probab. 3 100–118.
• Gretton, A., Smola, A. J., Huang, J., Schmittfull, M., Borgwardt, K. and Schölkopf, B. (2008). Dataset shift in machine learning. In Covariate Shift and Local Learning by Distribution Matching (J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer and N. Lawrence, eds.) 131–160. MIT Press, Cambridge, MA.
• Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations 11 10–18.
• Hazan, E. and Kale, S. (2009). Better algorithms for benign bandits. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms 38–47. SIAM, Philadelphia, PA.
• Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47 663–685.
• Kakade, S., Kearns, M. and Langford, J. (2003). Exploration in metric state spaces. In International Conference on Machine Learning (ICML) 306–312. AAAI Press, Palo Alto, CA.
• Kang, J. D. Y. and Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statist. Sci. 22 523–539.
• Kearns, M. and Singh, S. (1998). Near-optimal reinforcement learning in polynomial time. In International Conference on Machine Learning (ICML) 260–268. Morgan Kaufmann, Burlington, MA.
• Lambert, D. and Pregibon, D. (2007). More bang for their bucks: Assessing new features for online advertisers. In International Workshop on Data Mining for Online Advertising and Internet Economy (ADKDD) 100–107. Association for Computing Machinery, New York.
• Langford, J., Strehl, A. L. and Wortman, J. (2008). Exploration scavenging. In International Conference on Machine Learning (ICML) 528–535. Association for Computing Machinery, New York.
• Langford, J. and Zhang, T. (2008). The Epoch–Greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems (NIPS) 817–824. Curran Associates, Red Hook, NY.
• Lewis, D. D., Yang, Y., Rose, T. G. and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res. 5 361–397.
• Li, L., Chu, W., Langford, J. and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web (WWW) 661–670. Association for Computing Machinery, New York.
• Li, L., Chu, W., Langford, J. and Wang, X. (2011). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In ACM International Conference on Web Search and Data Mining (WSDM) 297–306. Association for Computing Machinery, New York.
• Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Stat. Med. 23 2937–2960.
• McAllester, D., Hazan, T. and Keshet, J. (2011). Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems (NIPS) 1594–1602. Curran Associates, Red Hook, NY.
• Murphy, S. A. (2003). Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol. 65 331–366.
• Murphy, S. A., van der Laan, M. J. and Robins, J. M. (2001). Marginal mean models for dynamic regimes. J. Amer. Statist. Assoc. 96 1410–1423.
• Orellana, L., Rotnitzky, A. and Robins, J. M. (2010). Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes. Part I: Main content. Int. J. Biostat. 6 Art. 8, 49.
• Precup, D., Sutton, R. S. and Singh, S. P. (2000). Eligibility traces for off-policy evaluation. In International Conference on Machine Learning (ICML) 759–766. Morgan Kaufmann, Burlington, MA.
• Robbins, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. (N.S.) 58 527–535.
• Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure period—Application to control of the healthy worker survivor effect. Mathematical models in medicine: Diseases and epidemics. Part 2. Math. Modelling 7 1393–1512.
• Robins, J. M. (1998). Marginal structural models. In 1997 Proceedings of the American Statistical Association, Section on Bayesian Statistical Science 1–10. Amer. Statist. Assoc., Alexandria, VA.
• Robins, J. M. and Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. J. Amer. Statist. Assoc. 90 122–129.
• Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89 846–866.
• Rotnitzky, A. and Robins, J. M. (1995). Semiparametric regression estimation in the presence of dependent censoring. Biometrika 82 805–820.
• Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Statist. Plann. Inference 90 227–244.
• Strehl, A., Langford, J., Li, L. and Kakade, S. (2011). Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems (NIPS) 2217–2225. Curran Associates, Red Hook, NY.
• Vansteelandt, S., Bekaert, M. and Claeskens, G. (2012). On model selection and model misspecification in causal inference. Stat. Methods Med. Res. 21 7–30.
• Zhang, B., Tsiatis, A. A., Laber, E. B. and Davidian, M. (2012). A robust method for estimating optimal treatment regimes. Biometrics 68 1010–1018.