## The Annals of Statistics

### CAM: Causal additive models, high-dimensional order search and penalized regression

#### Abstract

We develop estimation for potentially high-dimensional additive structural equation models. A key component of our approach is to decouple order search among the variables from feature or edge selection in a directed acyclic graph encoding the causal structure. We show that the former can be done with nonregularized (restricted) maximum likelihood estimation while the latter can be efficiently addressed using sparse regression techniques. Thus, we substantially simplify the problem of structure search and estimation for an important class of causal models. We establish consistency of the (restricted) maximum likelihood estimator for low- and high-dimensional scenarios, and we also allow for misspecification of the error distribution. Furthermore, we develop an efficient computational algorithm which can deal with many variables, and the new method's accuracy and performance are illustrated on simulated and real data.
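The two-stage idea described above can be sketched on a toy three-variable chain. Everything below is illustrative, not the paper's method: the smooth additive components are replaced by degree-3 polynomial least-squares fits, the greedy order search by exhaustive enumeration (feasible only for tiny graphs), and the penalized edge selection by a crude residual-variance-ratio prune; the data-generating model and the threshold 2.0 are invented for the example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 2000

def standardize(v):
    return (v - v.mean()) / v.std()

# Toy additive SEM with causal chain X0 -> X1 -> X2 (cubic edge functions).
x0 = standardize(rng.normal(size=n))
x1 = standardize(x0 ** 3 + 0.3 * rng.normal(size=n))
x2 = standardize(x1 ** 3 + 0.3 * rng.normal(size=n))
X = np.column_stack([x0, x1, x2])

def basis(col):
    # Degree-3 polynomial basis for one predictor: a crude stand-in
    # for the smooth additive components used by CAM.
    return np.column_stack([col, col ** 2, col ** 3])

def resid_var(y, preds):
    # Residual variance of an additive least-squares fit of y on
    # polynomial bases of the predictor columns listed in `preds`.
    if not preds:
        return y.var()
    D = np.column_stack([np.ones_like(y)] + [basis(X[:, k]) for k in preds])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    return np.var(y - D @ coef)

def order_score(order):
    # Gaussian log-likelihood score (up to constants): the sum of the
    # log residual variances of each variable given its predecessors.
    return sum(np.log(resid_var(X[:, j], list(order[:i])))
               for i, j in enumerate(order))

# Stage 1: order search -- exhaustive here; CAM uses a greedy search
# to scale to many variables.
best = min(itertools.permutations(range(3)), key=order_score)

# Stage 2: sparse edge selection along the estimated order, pruning a
# predecessor whenever dropping it barely changes the fit (a stand-in
# for sparse additive regression per node).
edges = set()
for i, j in enumerate(best):
    preds = list(best[:i])
    full = resid_var(X[:, j], preds)
    for k in preds:
        without = resid_var(X[:, j], [p for p in preds if p != k])
        if without / full > 2.0:  # keep k only if dropping it clearly hurts
            edges.add((k, j))

print("estimated order:", best)
print("estimated edges:", sorted(edges))
```

Because the edge functions are nonlinear with additive Gaussian noise, the true order attains the smallest score, and the pruning step then removes the spurious candidate edge X0 -> X2 that the full order would otherwise allow.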

#### Article information

**Source:** Ann. Statist., Volume 42, Number 6 (2014), 2526–2556.

**Dates:** First available in Project Euclid: 12 November 2014

**Permanent link:** https://projecteuclid.org/euclid.aos/1415801782

**Digital Object Identifier:** doi:10.1214/14-AOS1260

**Mathematical Reviews number (MathSciNet):** MR3277670

**Zentralblatt MATH identifier:** 1309.62063

#### Citation

Bühlmann, Peter; Peters, Jonas; Ernest, Jan. CAM: Causal additive models, high-dimensional order search and penalized regression. Ann. Statist. 42 (2014), no. 6, 2526--2556. doi:10.1214/14-AOS1260. https://projecteuclid.org/euclid.aos/1415801782
