Statistical Science
- Statist. Sci.
- Volume 25, Number 3 (2010), 289-310.
To Explain or to Predict?
Abstract
Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process.
Article information
Source
Statist. Sci. Volume 25, Number 3 (2010), 289-310.
Dates
First available in Project Euclid: 4 January 2011
Permanent link to this document
http://projecteuclid.org/euclid.ss/1294167961
Digital Object Identifier
doi:10.1214/10-STS330
Mathematical Reviews number (MathSciNet)
MR2791669
Zentralblatt MATH identifier
1329.62045
Keywords
Explanatory modeling; causality; predictive modeling; predictive power; statistical strategy; data mining; scientific research
Citation
Shmueli, Galit. To Explain or to Predict? Statist. Sci. 25 (2010), no. 3, 289–310. doi:10.1214/10-STS330. http://projecteuclid.org/euclid.ss/1294167961.
