The Annals of Applied Statistics

Probability aggregation in time-series: Dynamic hierarchical modeling of sparse expert beliefs

Ville A. Satopää, Shane T. Jensen, Barbara A. Mellers, Philip E. Tetlock, and Lyle H. Ungar

Most subjective probability aggregation procedures use a single probability judgment from each expert, even though it is common for experts studying real problems to update their probability estimates over time. This paper advances probability aggregation into a dynamic context in which experts can update their beliefs at random intervals. The updates occur very infrequently, resulting in a sparse data set that cannot be modeled by standard time-series procedures. In response to the lack of appropriate methodology, this paper presents a hierarchical model that takes into account each expert's level of self-reported expertise and produces aggregate probabilities that are sharp and well calibrated both in- and out-of-sample. The model is demonstrated on a real-world data set of over 2300 experts making multiple probability forecasts over two years on different subsets of 166 international political events.
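As an illustrative sketch only (not the paper's hierarchical model), the core aggregation problem described in the abstract can be made concrete: experts update infrequently, so a simple baseline carries each expert's most recent forecast forward and averages on the log-odds scale, with an extremizing exponent in the spirit of the simple logit model of Satopää et al. (2014a). The function name, data layout, and exponent value below are assumptions for illustration.

```python
import math

def aggregate_sparse_forecasts(forecasts, t, a=2.0):
    """Aggregate sparse expert probability forecasts at time t.

    forecasts: dict mapping expert id -> list of (time, prob) pairs,
    each list sorted by time. Each expert's most recent forecast at or
    before t is carried forward (experts update infrequently); forecasts
    are averaged on the log-odds scale; exponent a > 1 extremizes
    (sharpens) the aggregate.
    """
    logits = []
    for series in forecasts.values():
        current = None
        for (time, p) in series:
            if time <= t:
                current = p  # carry the latest forecast forward
            else:
                break
        if current is not None:
            p = min(max(current, 1e-6), 1 - 1e-6)  # guard against 0/1
            logits.append(math.log(p / (1 - p)))
    if not logits:
        return 0.5  # no information yet: uniform prior
    mean_logit = sum(logits) / len(logits)
    return 1.0 / (1.0 + math.exp(-a * mean_logit))
```

With `a = 2.0`, an aggregate of two forecasts of 0.6 and 0.7 lands above their simple mean, reflecting the finding (e.g., Baron et al., 2014) that averaged probability forecasts are typically underconfident and benefit from extremizing.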

Article information

Ann. Appl. Stat. Volume 8, Number 2 (2014), 1256–1280.

First available in Project Euclid: 1 July 2014

Keywords: probability aggregation; dynamic linear model; hierarchical modeling; expert forecast; subjective probability; bias estimation; calibration; time series


Satopää, Ville A.; Jensen, Shane T.; Mellers, Barbara A.; Tetlock, Philip E.; Ungar, Lyle H. Probability aggregation in time-series: Dynamic hierarchical modeling of sparse expert beliefs. Ann. Appl. Stat. 8 (2014), no. 2, 1256--1280. doi:10.1214/14-AOAS739.

References

  • Allard, D., Comunian, A. and Renard, P. (2012). Probability aggregation methods in geoscience. Math. Geosci. 44 545–581.
  • Ariely, D., Au, W. T., Bender, R. H., Budescu, D. V., Dietz, C. B., Gu, H., Wallsten, T. S. and Zauberman, G. (2000). The effects of averaging subjective probability estimates between and within judges. Journal of Experimental Psychology: Applied 6 130–147.
  • Baars, J. A. and Mass, C. F. (2005). Performance of national weather service forecasts compared to operational, consensus, and weighted model output statistics. Weather and Forecasting 20 1034–1047.
  • Baron, J., Mellers, B. A., Tetlock, P. E., Stone, E. and Ungar, L. H. (2014). Two reasons to make aggregated probability forecasts more extreme. Decis. Anal. 11. DOI:10.1287/deca.2014.0293.
  • Batchelder, W. H., Strashny, A. and Romney, A. K. (2010). Cultural consensus theory: Aggregating continuous responses in a finite interval. In Advances in Social Computing (S.-K. Chaim, J. J. Salerno and P. L. Mabry, eds.) 98–107. Springer, Berlin.
  • Bier, V. (2004). Implications of the research on expert overconfidence and dependence. Reliability Engineering & System Safety 85 321–329.
  • Bonferroni, C. E. (1936). Teoria Statistica Delle Classi e Calcolo Delle Probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8 3–62.
  • Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review 78 1–3.
  • Bröcker, J. and Smith, L. A. (2007). Increasing the reliability of reliability diagrams. Weather and Forecasting 22 651–661.
  • Buja, A., Stuetzle, W. and Shen, Y. (2005). Loss functions for binary class probability estimation and classification: Structure and applications. Statistics Department, Univ. Pennsylvania, Philadelphia, PA. Available at
  • Carter, C. K. and Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika 81 541–553.
  • Chen, Y. (2008). Learning classifiers from imbalanced, only positive and unlabeled data sets. Project Report for UC San Diego Data Mining Contest. Dept. Computer Science, Iowa State Univ., Ames, IA. Available at
  • Clemen, R. T. and Winkler, R. L. (2007). Aggregating probability distributions. In Advances in Decision Analysis: From Foundations to Applications (W. Edwards, R. F. Miles and D. von Winterfeldt, eds.) 154–176. Cambridge Univ. Press, Cambridge.
  • Cooke, R. M. (1991). Experts in Uncertainty: Opinion and Subjective Probability in Science. Clarendon Press, New York.
  • Erev, I., Wallsten, T. S. and Budescu, D. V. (1994). Simultaneous over- and underconfidence: The role of error in judgment processes. Psychological Review 101 519–527.
  • Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2003). Bayesian Data Analysis. CRC Press, Boca Raton, FL.
  • Gelman, A., Jakulin, A., Pittau, M. G. and Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. Ann. Appl. Stat. 2 1360–1383.
  • Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6 721–741.
  • Genest, C. and Zidek, J. V. (1986). Combining probability distributions: A critique and an annotated bibliography. Statist. Sci. 1 114–148.
  • Gent, I. P. and Walsh, T. (1996). Phase transitions and annealed theories: Number partitioning as a case study. In Proceedings of European Conference on Artificial Intelligence (ECAI 1996) 170–174. Wiley, New York.
  • Gneiting, T. and Ranjan, R. (2013). Combining predictive distributions. Electron. J. Stat. 7 1747–1782.
  • Gneiting, T., Stanberry, L. I., Grimit, E. P., Held, L. and Johnson, N. A. (2008). Rejoinder on: Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds [MR2434318]. TEST 17 256–264.
  • Good, I. J. (1952). Rational decisions. J. R. Stat. Soc. Ser. B Stat. Methodol. 14 107–114.
  • Hastings, C. Jr., Mosteller, F., Tukey, J. W. and Winsor, C. P. (1947). Low moments for small samples: A comparative study of order statistics. Ann. Math. Statistics 18 413–426.
  • Hayes, B. (2002). The easiest hard problem. American Scientist 90 113–117.
  • Karmarkar, N. and Karp, R. M. (1982). The differencing method of set partitioning. Technical Report UCB/CSD 82/113, Computer Science Division, Univ. California, Berkeley, CA.
  • Kellerer, H., Pferschy, U. and Pisinger, D. (2004). Knapsack Problems. Springer, Dordrecht.
  • Lichtenstein, S., Fischhoff, B. and Phillips, L. D. (1977). Calibration of Probabilities: The State of the Art. In Decision Making and Change in Human Affairs (H. Jungermann and G. De Zeeuw, eds.) 275–324. Springer, Berlin.
  • Lubinski, D. and Humphreys, L. G. (1996). Seeing the forest from the trees: When predicting the behavior or status of groups, correlate means. Psychology, Public Policy, and Law 2 363.
  • Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., Scott, S. E., Moore, D., Atanasov, P. and Swift, S. A. et al. (2014). Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science 25. DOI:10.1177/0956797614524255.
  • Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21 1087–1092.
  • Migon, H. S., Gamerman, D., Lopes, H. F. and Ferreira, M. A. R. (2005). Dynamic models. In Bayesian Thinking: Modeling and Computation. Handbook of Statist. 25 553–588. Elsevier/North-Holland, Amsterdam.
  • Mills, T. C. (1991). Time Series Techniques for Economists. Cambridge Univ. Press, Cambridge.
  • Morgan, M. G. (1992). Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge Univ. Press, Cambridge.
  • Murphy, A. H. and Winkler, R. L. (1987). A general framework for forecast verification. Monthly Weather Review 115 1330–1338.
  • Neal, R. M. (2003). Slice sampling. Ann. Statist. 31 705–767.
  • Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford Statistical Science Series 28. Oxford Univ. Press, Oxford.
  • Primo, C., Ferro, C. A., Jolliffe, I. T. and Stephenson, D. B. (2009). Calibration of probabilistic forecasts of binary events. Monthly Weather Review 137 1142–1149.
  • Raftery, A. E., Gneiting, T., Balabdaoui, F. and Polakowski, M. (2005). Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review 133 1155–1174.
  • Ranjan, R. and Gneiting, T. (2010). Combining probability forecasts. J. R. Stat. Soc. Ser. B Stat. Methodol. 72 71–91.
  • Sanders, F. (1963). On subjective probability forecasting. Journal of Applied Meteorology 2 191–201.
  • Satopää, V. A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E. and Ungar, L. H. (2014a). Combining multiple probability predictions using a simple logit model. International Journal of Forecasting 30 344–356.
  • Satopää, V. A., Jensen, S. T., Mellers, B. A., Tetlock, P. E. and Ungar, L. H. (2014b). Supplement to “Probability aggregation in time-series: Dynamic hierarchical modeling of sparse expert beliefs.” DOI:10.1214/14-AOAS739SUPP.
  • Shlyakhter, A. I., Kammen, D. M., Broido, C. L. and Wilson, R. (1994). Quantifying the credibility of energy projections from trends in past data: The US energy sector. Energy Policy 22 119–130.
  • Tanner, J., Wilson, P. and Swets, J. A. (1954). A decision-making theory of visual detection. Psychological Review 61 401–409.
  • Tetlock, P. E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton Univ. Press, Princeton, NJ.
  • Ungar, L., Mellers, B., Satopää, V., Tetlock, P. and Baron, J. (2012). The good judgment project: A large scale test of different methods of combining expert predictions. In The Association for the Advancement of Artificial Intelligence 2012 Fall Symposium Series, Univ. Pennsylvania, Philadelphia, PA.
  • Vislocky, R. L. and Fritsch, J. M. (1995). Improved model output statistics forecasts through model consensus. Bulletin of the American Meteorological Society 76 1157–1164.
  • Wallace, B. C. and Dahabreh, I. J. (2012). Class probability estimates are unreliable for imbalanced data (and how to fix them). In IEEE 12th International Conference on Data Mining (ICDM) 695–704. IEEE Computer Society, Washington, DC.
  • Wallsten, T. S., Budescu, D. V. and Erev, I. (1997). Evaluating and combining subjective probability estimates. Journal of Behavioral Decision Making 10 243–268.
  • Wilson, A. G. (1994). Cognitive factors affecting subjective probability assessment. Discussion Paper 94-02, Institute of Statistics and Decision Sciences, Duke Univ., Durham, NC.
  • Wilson, P. W., D’Agostino, R. B., Levy, D., Belanger, A. M., Silbershatz, H. and Kannel, W. B. (1998). Prediction of coronary heart disease using risk factor categories. Circulation 97 1837–1847.
  • Winkler, R. L. and Jose, V. R. R. (2008). Comments on: Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds [MR2434318]. TEST 17 251–255.
  • Winkler, R. L. and Murphy, A. H. (1968). “Good” probability assessors. Journal of Applied Meteorology 7 751–758.
  • Wright, G., Rowe, G., Bolger, F. and Gammack, J. (1994). Coherence, calibration, and expertise in judgmental probability forecasting. Organizational Behavior and Human Decision Processes 57 1–25.
