Statistical Science

Comment: Strengthening Empirical Evaluation of Causal Inference Methods

David Jensen



This is a contribution to the discussion of the paper by Dorie et al. (Statist. Sci. 34 (2019) 43–68), which reports the lessons learned from the 2016 Atlantic Causal Inference Conference Competition. My comments strongly support the authors’ focus on empirical evaluation, drawing on examples and experience from machine learning research and particularly on the problem of algorithmic complexity. I argue that researchers who study causal inference should undertake even broader and deeper empirical evaluation. Finally, I highlight a few key conclusions that suggest where future research should focus.

Article information

Statist. Sci., Volume 34, Number 1 (2019), 77-81.

First available in Project Euclid: 12 April 2019

Keywords: causal inference; empirical evaluation; machine learning; algorithmic complexity; constructed observational studies; alignment


Jensen, David. Comment: Strengthening Empirical Evaluation of Causal Inference Methods. Statist. Sci. 34 (2019), no. 1, 77--81. doi:10.1214/18-STS690.



References

  • Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30 1145–1159.
  • Cohen, P. R. and Howe, A. E. (1988). How evaluation guides AI research: The message still counts more than the medium. AI Mag. 9 35.
  • DARPA (2017). Ground Truth (GT). Broad agency announcement. Defense Sciences Office. Defense Advanced Research Projects Agency, U.S. Dept. Defense. HR001117S0031.
  • Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository.
  • Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29 103–130.
  • Dorie, V., Hill, J., Shalit, U., Scott, M. and Cervone, D. (2019). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statist. Sci. 34 43–68.
  • Garant, D. and Jensen, D. (2016). Evaluating causal models by comparing interventional distributions. ArXiv Preprint arXiv:1608.04698.
  • Gelman, A. and Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Dept. Statistics, Columbia Univ., New York, NY.
  • Guyon, I., Janzing, D. and Schölkopf, B. (2010). Causality: Objectives and assessment. In Causality: Objectives and Assessment 1–42.
  • Guyon, I., Statnikov, A. and Batu, B. (2019). Cause-Effect Pairs in Machine Learning. Springer Series on Challenges in Machine Learning. Springer. To appear.
  • Guyon, I., Aliferis, C., Cooper, G., Elisseeff, A., Pellet, J.-P., Spirtes, P. and Statnikov, A. (2008). Design and analysis of the causation and prediction challenge. In Causation and Prediction Challenge 1–33.
  • Hahn, P. R., Dorie, V. and Murray, J. S. (2018). Atlantic Causal Inference Conference (ACIC) data analysis challenge 2017.
  • Hill, J. L., Reiter, J. P. and Zanutto, E. L. (2004). A comparison of experimental and observational data analyses. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives. Wiley Ser. Probab. Stat. 49–60. Wiley, Chichester.
  • LaLonde, R. and Maynard, R. (1987). How precise are evaluations of employment and training programs: Evidence from a field experiment. Evaluation Review 11 428–451.
  • Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In European Conference on Machine Learning 4–15. Springer.
  • Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge Univ. Press, Cambridge.
  • Provost, F. J., Fawcett, T., Kohavi, R. et al. (1998). The case against accuracy estimation for comparing induction algorithms. In Proceedings of the International Conference on Machine Learning 98 445–453.
  • Schaffter, T., Marbach, D. and Floreano, D. (2011). GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods. Bioinformatics 27 2263–2270.
  • Shadish, W. R., Clark, M. H. and Steiner, P. M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. J. Amer. Statist. Assoc. 103 1334–1343.
  • Shimoni, Y., Yanover, C., Karavani, E. and Goldschmidt, Y. (2018). Benchmarking framework for performance-evaluation of causal inference analysis. ArXiv Preprint arXiv:1802.05046.
  • Spirtes, P., Glymour, C. and Scheines, R. (2000). Causation, Prediction, and Search, 2nd ed. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA.
  • van’t Veer, A. E. and Giner-Sorolla, R. (2016). Pre-registration in social psychology—A discussion and suggested template. J. Exp. Soc. Psychol. 67 2–12.

See also

  • Main article: Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition.