Statistical Science

Forecaster’s Dilemma: Extreme Events and Forecast Evaluation

Sebastian Lerch, Thordis L. Thorarinsdottir, Francesco Ravazzolo, and Tilmann Gneiting

Full-text: Open access


In public discussions of the quality of forecasts, attention typically focuses on the predictive performance in cases of extreme events. However, the restriction of conventional forecast evaluation methods to subsets of extreme observations has unexpected and undesired effects, and is bound to discredit skillful forecasts when the signal-to-noise ratio in the data generating process is low. Conditioning on outcomes is incompatible with the theoretical assumptions of established forecast evaluation methods, thereby confronting forecasters with what we refer to as the forecaster’s dilemma. For probabilistic forecasts, proper weighted scoring rules have been proposed as decision-theoretically justifiable alternatives for forecast evaluation with an emphasis on extreme events. Using theoretical arguments, simulation experiments and a real data study on probabilistic forecasts of U.S. inflation and gross domestic product (GDP) growth, we illustrate and discuss the forecaster’s dilemma along with potential remedies.
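The dilemma lends itself to a small numerical illustration. The sketch below is a hypothetical setup loosely modeled on the paper's simulation study (names and constants are illustrative, not the authors' code): an ideal forecaster issues the true predictive distribution N(mu_t, 1), while an "extremist" always shifts the predictive mean upward. Restricting the continuous ranked probability score (CRPS) to extreme observations rewards the extremist, whereas the threshold-weighted CRPS of Gneiting and Ranjan (2011) emphasizes the tail yet restores the correct ranking:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 20_000
mu = rng.standard_normal(n)        # latent signal
y = mu + rng.standard_normal(n)    # observation: y | mu ~ N(mu, 1)

def crps_normal(mean, y):
    # Closed-form CRPS of a N(mean, 1) forecast (Gneiting & Raftery 2007).
    z = y - mean
    return z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi)

def tw_crps_normal(mean, y, r=2.0):
    # Threshold-weighted CRPS with indicator weight w(z) = 1{z >= r}
    # (Gneiting & Ranjan 2011), approximated by a Riemann sum on a grid.
    z = np.arange(r, 8.0, 0.05)
    F = norm.cdf(z[None, :] - mean[:, None])        # forecast CDF on the grid
    ind = (y[:, None] <= z[None, :]).astype(float)  # step function at y
    return ((F - ind) ** 2).sum(axis=1) * 0.05

ideal, extremist = mu, mu + 2.0    # the "extremist" always shifts mass upward
tail = y >= 2.0                    # restrict evaluation to extreme outcomes

# Full sample: the ideal forecaster wins, as propriety guarantees.
print(crps_normal(ideal, y).mean(), crps_normal(extremist, y).mean())
# Extreme cases only: the ranking flips -- the forecaster's dilemma.
print(crps_normal(ideal, y)[tail].mean(), crps_normal(extremist, y)[tail].mean())
# Threshold-weighted CRPS: tail emphasis, yet the ideal forecaster wins again.
print(tw_crps_normal(ideal, y).mean(), tw_crps_normal(extremist, y).mean())
```

The key point is that the weighted score conditions on the forecast's behavior over the tail region rather than on the realized outcome, so it remains proper: no forecaster can improve their expected score by deviating from their true predictive distribution.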

Article information

Statist. Sci. Volume 32, Number 1 (2017), 106–127.

First available in Project Euclid: 6 April 2017

Keywords: Diebold–Mariano test; hindsight bias; likelihood ratio test; Neyman–Pearson lemma; predictive performance; probabilistic forecast; proper weighted scoring rule; rare and extreme events


Lerch, Sebastian; Thorarinsdottir, Thordis L.; Ravazzolo, Francesco; Gneiting, Tilmann. Forecaster’s Dilemma: Extreme Events and Forecast Evaluation. Statist. Sci. 32 (2017), no. 1, 106–127. doi:10.1214/16-STS588.

References


  • Adolfson, M., Lindé, J. and Villani, M. (2007). Forecasting performance of an open economy DSGE model. Econometric Rev. 26 289–328.
  • Albeverio, S., Jentsch, V. and Kantz, H., eds. (2006). Extreme Events in Nature and Society. Springer, Berlin.
  • Amisano, G. and Giacomini, R. (2007). Comparing density forecasts via weighted likelihood ratio tests. J. Bus. Econom. Statist. 25 177–190.
  • Beirlant, J., Goegebeur, Y., Teugels, J. and Segers, J. (2004). Statistics of Extremes: Theory and Applications. Wiley Series in Probability and Statistics. Wiley, Chichester.
  • Bernoulli, J. (1713). Ars Conjectandi. Impensis Thurnisiorum, Basileae. Reproduction of original from Sterling Memorial Library, Yale University. Online edition of Gale Digital Collections: The Making of the Modern World: Part I: The Goldsmiths’-Kress Collection, 1450–1850.
  • Bernoulli, J. (2006). The Art of Conjecturing: Together with “Letter to a friend on sets in court tennis”, Translated from the Latin and with an introduction and notes by Edith Dudley Sylla. Johns Hopkins Univ. Press, Baltimore, MD.
  • Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78 1–3.
  • Clark, T. E. and Ravazzolo, F. (2015). Macroeconomic forecasting performance under alternative specifications of time-varying volatility. J. Appl. Econometrics 30 551–575.
  • Cogley, T. S. M. and Sargent, T. J. (2005). Drifts and volatilities: Monetary policies and outcomes in the post-World War II U.S. Rev. Econ. Dyn. 8 262–302.
  • Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer Series in Statistics. Springer, London.
  • Coles, S., Heffernan, J. and Tawn, J. (1999). Dependence measures for extreme value analyses. Extremes 2 339–365.
  • Cooley, D., Davis, R. A. and Naveau, P. (2012). Approximating the conditional density given large observed values via a multivariate extremes framework, with application to environmental data. Ann. Appl. Stat. 6 1406–1429.
  • Dawid, A. P. (2007). The geometry of proper scoring rules. Ann. Inst. Statist. Math. 59 77–93.
  • Dawid, A. P. and Sebastiani, P. (1999). Coherent dispersion criteria for optimal experimental design. Ann. Statist. 27 65–81.
  • Denrell, J. and Fang, C. (2010). Predicting the next big thing: Success as a signal of poor judgment. Manage. Sci. 56 1653–1667.
  • Diebold, F. X. (2015). Comparing predictive accuracy, twenty years later: A personal perspective on the use and abuse of Diebold–Mariano tests. J. Bus. Econom. Statist. 33 1–9.
  • Diebold, F. X., Gunther, T. A. and Tay, A. S. (1998). Evaluating density forecasts with applications to financial risk management. Internat. Econom. Rev. 39 863–883.
  • Diebold, F. X. and Mariano, R. S. (1995). Comparing predictive accuracy. J. Bus. Econom. Statist. 13 253–263.
  • Diks, C., Panchenko, V. and van Dijk, D. (2011). Likelihood-based scoring rules for comparing density forecasts in tails. J. Econometrics 163 215–230.
  • Easterling, D. R., Meehl, G. A., Parmesan, C., Changnon, S. A., Karl, T. R. and Mearns, L. O. (2000). Climate extremes: Observations, modeling, and impacts. Science 289 2068–2074.
  • Eguchi, S. and Copas, J. (2006). Interpreting Kullback–Leibler divergence with the Neyman–Pearson lemma. J. Multivariate Anal. 97 2034–2040.
  • Ehm, W., Gneiting, T., Jordan, A. and Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. J. R. Stat. Soc. Ser. B. Stat. Methodol. 78 505–562.
  • Ehrman, C. M. and Shugan, S. M. (1995). The forecaster’s dilemma. Mark. Sci. 14 123–147.
  • Embrechts, P., Klüppelberg, C. and Mikosch, T. (1997). Modelling Extremal Events: For Insurance and Finance. Applications of Mathematics 33. Springer, Berlin.
  • Faust, J. and Wright, J. H. (2009). Comparing Greenbook and reduced form forecasts using a large realtime dataset. J. Bus. Econom. Statist. 27 468–479.
  • Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Probability and Mathematical Statistics, Vol. 1. Academic Press, New York–London.
  • Ferro, C. A. T. and Stephenson, D. B. (2011). Extremal dependence indices: Improved verification measures for deterministic forecasts of rare binary events. Weather Forecast. 26 699–713.
  • Feuerverger, A. and Rahman, S. (1992). Some aspects of probability forecasting. Comm. Statist. Theory Methods 21 1615–1632.
  • Fissler, T., Ziegel, J. F. and Gneiting, T. (2016). Expected shortfall is jointly elicitable with value-at-risk: Implications for backtesting. Risk 58–61.
  • Giacomini, R. and White, H. (2006). Tests of conditional predictive ability. Econometrica 74 1545–1578.
  • Gneiting, T. (2008). Editorial: Probabilistic forecasting. J. Roy. Statist. Soc. Ser. A 171 319–321.
  • Gneiting, T. (2011). Making and evaluating point forecasts. J. Amer. Statist. Assoc. 106 746–762.
  • Gneiting, T., Balabdaoui, F. and Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. J. R. Stat. Soc. Ser. B. Stat. Methodol. 69 243–268.
  • Gneiting, T. and Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application 1 125–151.
  • Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc. 102 359–378.
  • Gneiting, T. and Ranjan, R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. J. Bus. Econom. Statist. 29 411–422.
  • Gneiting, T. and Ranjan, R. (2013). Combining predictive distributions. Electron. J. Stat. 7 1747–1782.
  • Good, I. J. (1952). Rational decisions. J. R. Stat. Soc. Ser. B. Stat. Methodol. 14 107–114.
  • Gumbel, E. J. (1958). Statistics of Extremes. Columbia Univ. Press, New York.
  • Haiden, T., Magnusson, L. and Richardson, D. (2014). Statistical evaluation of ECMWF extreme wind forecasts. ECMWF Newsletter 139 29–33.
  • Hall, S. S. (2011). Scientists on trial: At fault? Nature 477 264–269.
  • Held, L., Rufibach, K. and Balabdaoui, F. (2010). A score regression approach to assess calibration of continuous probabilistic predictions. Biometrics 66 1295–1305.
  • Holzmann, H. and Eulert, M. (2014). The role of the information set for forecasting—With applications to risk management. Ann. Appl. Stat. 8 595–621.
  • Jordan, T., Chen, Y.-T., Gasparini, P., Madariaga, R., Main, I., Marzocchi, W., Papadopoulos, G., Yamaoka, K. and Zschau, J. (2011). Operational earthquake forecasting: State of knowledge and guidelines for implementation. Ann. Geophys. 54 315–391.
  • Juutilainen, I., Tamminen, S. and Röning, J. (2012). Exceedance probability score: A novel measure for comparing probabilistic predictions. J. Stat. Theory Pract. 6 452–467.
  • Kahneman, D. (2012). Thinking, Fast and Slow. Penguin Books, London.
  • Katz, R. W., Parlange, M. B. and Naveau, P. (2002). Statistics of extremes in hydrology. Adv. Water Resour. 25 1287–1304.
  • Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses, 3rd ed. Springer Texts in Statistics. Springer, New York.
  • Lerch, S. and Thorarinsdottir, T. L. (2013). Comparison of non-homogeneous regression models for probabilistic wind speed forecasting. Tellus, Ser. A Dyn. Meteorol. Oceanogr. 65 21206.
  • Lerch, S., Thorarinsdottir, T. L., Ravazzolo, F. and Gneiting, T. (2016). Supplement to “Forecaster’s dilemma: Extreme events and forecast evaluation”.
  • Manzan, S. and Zerom, D. (2013). Are macroeconomic variables useful for forecasting the distribution of US inflation? Int. J. Forecast. 29 469–478.
  • Marzban, C. (1998). Scalar measures of performance in rare-event situations. Weather Forecast. 13 753–763.
  • Matheson, J. E. and Winkler, R. L. (1976). Scoring rules for continuous probability distributions. Manage. Sci. 22 1087–1096.
  • McNeil, A. J., Frey, R. and Embrechts, P. (2015). Quantitative Risk Management: Concepts, Techniques and Tools, Revised ed. Princeton Series in Finance. Princeton Univ. Press, Princeton, NJ.
  • Murphy, A. H. and Winkler, R. L. (1987). A general framework for forecast verification. Mon. Weather Rev. 115 1330–1338.
  • Nau, R. F. (1985). Should scoring rules be ‘effective’? Manage. Sci. 31 527–535.
  • Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 231 289–337.
  • Owen, J. (1607). Epigrammatum, Book IV. Hypertext critical edition by Dana F. Sutton, The Univ. California, Irvine (1999).
  • Pelenis, J. (2014). Weighted scoring rules for comparison of density forecasts on subsets of interest. Preprint.
  • Reid, M. D. and Williamson, R. C. (2011). Information, divergence and risk for binary experiments. J. Mach. Learn. Res. 12 731–817.
  • Romer, C. D. and Romer, D. H. (2000). Federal Reserve information and the behavior of interest rates. Am. Econ. Rev. 90 429–457.
  • Stephenson, D. B., Casati, B., Ferro, C. A. T. and Wilson, C. A. (2008). The extreme dependency score: A non-vanishing measure for forecasts of rare events. Meteorol. Appl. 15 41–50.
  • Strähl, C. and Ziegel, J. F. (2015). Cross-calibration of probabilistic forecasts. Preprint.
  • Tay, A. S. and Wallis, K. F. (2000). Density forecasting: A survey. J. Forecast. 19 124–143.
  • Tetlock, P. E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton Univ. Press, Princeton.
  • Timmermann, A. (2000). Density forecasting in economics and finance. J. Forecast. 19 231–234.
  • Tödter, J. and Ahrens, B. (2012). Generalization of the ignorance score: Continuous ranked version and its decomposition. Mon. Weather Rev. 140 2005–2017.
  • Tsyplakov, A. (2013). Evaluation of probabilistic forecasts: Proper scoring rules and moments. Available at SSRN.
  • Williams, R. M., Ferro, C. A. T. and Kwasniok, F. (2014). A comparison of ensemble post-processing methods for extreme events. Q. J. R. Meteorol. Soc. 140 1112–1120.
  • Zou, H. and Yuan, M. (2008). Composite quantile regression and the oracle model selection theory. Ann. Statist. 36 1108–1126.

Supplemental materials