Electronic Journal of Statistics

Cross-calibration of probabilistic forecasts

Christof Strähl and Johanna Ziegel

Full-text: Open access


When providing probabilistic forecasts for uncertain future events, it is common to strive for calibrated forecasts; that is, the predictive distribution should be compatible with the observed outcomes. Often, there are several competing forecasters of different skill. We extend common notions of calibration, where each forecaster is analyzed individually, to stronger notions of cross-calibration, where each forecaster is analyzed with respect to the other forecasters. In particular, cross-calibration distinguishes forecasters with respect to increasing information sets. We provide diagnostic tools and statistical tests to assess cross-calibration. The methods are illustrated in simulation examples and applied to probabilistic forecasts for inflation rates by the Bank of England. Computer code and supplementary material (Strähl and Ziegel, 2017a,b) are available online.
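The distinction drawn in the abstract can be made concrete with a small simulation. The sketch below is not the authors' R code and does not implement their tests; it assumes a simple setup chosen for illustration, in which the outcome is Y ~ N(mu, 1) with mu ~ N(0, 1), an "informed" forecaster issues N(mu, 1), and an "uninformed" forecaster issues the marginal N(0, 2). All function names are hypothetical. Analyzed individually, both forecasters have an approximately uniform probability integral transform (PIT). Once the uninformed forecaster's PIT is examined conditionally on what the informed forecaster knew (here, the event mu > 0), it is clearly non-uniform.

```python
import random
from statistics import NormalDist


def simulate(n=20000, seed=1):
    """Return (mu, PIT of informed forecast, PIT of uninformed forecast) per case."""
    rng = random.Random(seed)
    uninformed = NormalDist(0.0, 2 ** 0.5)  # marginal law of Y: variance 1 + 1
    rows = []
    for _ in range(n):
        mu = rng.gauss(0.0, 1.0)   # information seen only by the informed forecaster
        y = rng.gauss(mu, 1.0)     # realized outcome
        rows.append((mu, NormalDist(mu, 1.0).cdf(y), uninformed.cdf(y)))
    return rows


def histogram(pits, bins=10):
    """Relative frequencies of PIT values in equal-width bins on [0, 1]."""
    counts = [0] * bins
    for p in pits:
        counts[min(int(p * bins), bins - 1)] += 1
    return [c / len(pits) for c in counts]


def max_deviation(pits, bins=10):
    """Largest absolute gap between a bin frequency and the uniform value 1/bins."""
    return max(abs(f - 1.0 / bins) for f in histogram(pits, bins))


if __name__ == "__main__":
    rows = simulate()
    subset = [r for r in rows if r[0] > 0]  # condition on the informed forecaster's knowledge
    print("marginal, informed:    ", max_deviation([r[1] for r in rows]))
    print("marginal, uninformed:  ", max_deviation([r[2] for r in rows]))
    print("given mu > 0, informed:  ", max_deviation([r[1] for r in subset]))
    print("given mu > 0, uninformed:", max_deviation([r[2] for r in subset]))
```

In this assumed setup, only the last of the four printed deviations should be far from zero. This is the sense in which analyzing a forecaster with respect to a better-informed competitor can reveal a deficiency that an individual PIT histogram hides.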

Article information

Electron. J. Statist., Volume 11, Number 1 (2017), 608-639.

Received: November 2016
First available in Project Euclid: 3 March 2017

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1488531637

Digital Object Identifier
doi:10.1214/17-EJS1244

Keywords
Calibration, predictive distribution, prediction space, probability integral transform, proper scoring rule

Creative Commons Attribution 4.0 International License.


Strähl, Christof; Ziegel, Johanna. Cross-calibration of probabilistic forecasts. Electron. J. Statist. 11 (2017), no. 1, 608–639. doi:10.1214/17-EJS1244. https://projecteuclid.org/euclid.ejs/1488531637

References


  • Al-Najjar, N. I. and J. Weinstein (2008). Comparative testing of experts. Econometrica 76, 541–559.
  • Anderson, T. W. and D. A. Darling (1954). A test of goodness of fit. Journal of the American Statistical Association 49, 765–769.
  • Berkowitz, J. (2001). Testing density forecasts, with applications to risk management. Journal of Business and Economic Statistics 19, 465–474.
  • Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society 135(643), 1512–1519.
  • Campbell, S. D. (2005). A review of backtesting and backtesting procedures. Finance and Economics Discussion Series, Federal Reserve 21.
  • Clements, M. P. (2004). Evaluating the Bank of England density forecasts of inflation. The Economic Journal 114, 844–866.
  • Cox, D. D. and J. S. Lee (2008). Pointwise testing with functional data using the Westfall-Young randomization method. Biometrika 95, 621–634.
  • Dawid, A. P. (1984). Statistical theory: The prequential approach. Journal of the Royal Statistical Society: Series A 147, 278–290.
  • Dawid, A. P. (1985). Calibration-based empirical probability. The Annals of Statistics 13, 1251–1274.
  • Dawid, A. P. (1986). Probability forecasting. In S. Kotz, N. L. Johnson, and C. B. Read (Eds.), Encyclopedia of Statistical Sciences, Volume 7, pp. 210–218. Wiley, New York.
  • Diebold, F. X., T. A. Gunther, and A. S. Tay (1998). Evaluating density forecasts with applications to financial risk management. International Economic Review 39, 863–883.
  • Feinberg, Y. and C. Stewart (2008). Testing multiple forecasters. Econometrica 76, 561–582.
  • Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38.
  • Galbraith, J. W. and S. van Norden (2012). Assessing gross domestic product and inflation probability forecasts derived from Bank of England fan charts. Journal of the Royal Statistical Society: Series A 175, 713–727.
  • Gneiting, T., F. Balabdaoui, and A. E. Raftery (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B 69, 243–268.
  • Gneiting, T. and M. Katzfuss (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application 1, 125–151.
  • Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359–378.
  • Gneiting, T. and R. Ranjan (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business and Economic Statistics 29, 411–422.
  • Gneiting, T. and R. Ranjan (2013). Combining predictive distributions. Electronic Journal of Statistics 7, 1747–1782.
  • Hamill, T. M. (2001). Interpretation of rank histograms for verifying ensemble forecasts. Monthly Weather Review 129, 550–560.
  • Heinze, G., M. Ploner, D. Dunkler, and H. Southworth (2013). logistf: Firth’s bias reduced logistic regression. R package version 1.21.
  • Heinze, G. and M. Schemper (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine 21, 2409–2419.
  • Held, L., K. Rufibach, and F. Balabdaoui (2010). A score regression approach to assess calibration of continuous probabilistic predictions. Biometrics 66, 1295–1305.
  • Holzmann, H. and M. Eulert (2014). The role of the information set for forecasting – with applications to risk management. The Annals of Applied Statistics 8, 595–621.
  • Knüppel, M. (2015). Evaluating the calibration of multi-step-ahead density forecasts using raw moments. Journal of Business and Economic Statistics 33, 270–281.
  • Mason, S. J., J. Galpin, L. Goddard, N. Graham, and B. Rajaratnam (2007). Conditional exceedance probabilities. Monthly Weather Review 135, 363–372.
  • Meinshausen, N., M. H. Maathuis, and P. Bühlmann (2011). Asymptotic optimality of the Westfall-Young permutation procedure for multiple testing under dependence. The Annals of Statistics 39(6), 3369–3391.
  • Mitchell, J. and S. G. Hall (2005). Evaluating, comparing and combining density forecasts using the KLIC with an application to the Bank of England and NIESR ‘fan’ charts of inflation. Oxford Bulletin of Economics and Statistics 67S, 995–1033.
  • Mitchell, J. and K. F. Wallis (2011). Evaluating density forecasts: Forecast combinations, model mixtures, calibration and sharpness. Journal of Applied Econometrics 26, 1023–1040.
  • Montgomery, D. C., E. A. Peck, and C. G. Vining (2001). Introduction to linear regression analysis (3rd ed.). John Wiley & Sons, Inc.
  • Murphy, A. H. (1994). A coherent method of stratification within a general framework for forecast verification. Monthly Weather Review 123, 1582–1588.
  • Murphy, A. H. and R. L. Winkler (1987). A general framework for forecast verification. Monthly Weather Review 115, 1330–1338.
  • Murphy, A. H. and R. L. Winkler (1992). Diagnostic verification of probability forecasts. International Journal of Forecasting 7, 435–455.
  • R Core Team (2015). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
  • Ranjan, R. and T. Gneiting (2010). Combining probability forecasts. Journal of the Royal Statistical Society: Series B 72, 71–91.
  • Rüschendorf, L. (2009). On the distributional transform, Sklar’s theorem, and the empirical copula process. Journal of Statistical Planning and Inference 139, 3921–3927.
  • Shapiro, S. S. and M. B. Wilk (1965). An analysis of variance test for normality (complete samples). Biometrika 52, 591–611.
  • Strähl, C. and Ziegel, J. (2017a). Further Examples and the Score Regression Approach. DOI: 10.1214/17-EJS1244SUPPA.
  • Strähl, C. and Ziegel, J. (2017b). Computer Code. DOI: 10.1214/17-EJS1244SUPPB.
  • Tsyplakov, A. (2011). Evaluating density forecasts: A comment. MPRA paper 31233. (Available from http://mpra.ub.uni-muenchen.de/31233).
  • Tsyplakov, A. (2013). Evaluation of probabilistic forecasts: proper scoring rules and moments. Preprint, http://dx.doi.org/10.2139/ssrn.2236605.
  • Wallis, K. F. (2003). Chi-squared tests of interval and density forecasts, and the Bank of England’s fan charts. International Journal of Forecasting 19, 165–175.
  • Westfall, P. and S. Young (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. Wiley, New York.
  • Yap, B. W. and C. H. Sim (2011). Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation 81, 2141–2155.

Supplemental materials

  • Further Examples and the Score Regression Approach. We provide a short discussion of the cross-calibration test suggested by Feinberg and Stewart (2008) and give additional examples of diagnostic plots for cross-calibration. We generalize the test suggested by Held et al. (2010) to a test for cross-ideal forecasters. Finally, we discuss a natural approach for testing marginal cross-calibration, which, unfortunately, is useless in practice.
  • Computer Code. The zip archive contains all R-code used in the paper and supplementary material.