Statistical Science

To Explain or to Predict?

Galit Shmueli
Source: Statist. Sci. Volume 25, Number 3 (2010), 289-310.

Abstract

Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process.

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.ss/1294167961
Digital Object Identifier: doi:10.1214/10-STS330
Mathematical Reviews number (MathSciNet): MR2791669


2013 © Institute of Mathematical Statistics
