Institute of Mathematical Statistics Collections

An ensemble approach to improved prediction from multitype data

Jennifer Clarke, David Seo

Abstract

We have developed a strategy for the analysis of newly available binary data to improve outcome predictions based on existing data (binary or non-binary). Our strategy involves two modeling approaches for the newly available data, one combining binary covariate selection via LASSO with logistic regression and one based on logic trees. The results of these models are then compared to the results of a model based on existing data with the objective of combining model results to achieve the most accurate predictions. The combination of model predictions is aided by the use of support vector machines to identify subspaces of the covariate space in which specific models lead to successful predictions. We demonstrate our approach in the analysis of single nucleotide polymorphism (SNP) data and traditional clinical risk factors for the prediction of coronary heart disease.
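The combination step can be illustrated with a minimal sketch, which is one plausible reading of the abstract rather than the authors' implementation. Everything in it is an assumption made for illustration: `model_a` and `model_b` are hypothetical stand-ins for already-fitted base models (for example, a clinical risk-factor model and a SNP-based model), `true_outcome` is a toy ground truth, and a linear classifier fitted by subgradient steps on the hinge loss stands in for the support vector machine. The gate is trained to recognize the region of the covariate space in which model A, rather than model B, predicts correctly, and the ensemble routes each case to the model the gate favors.

```python
from itertools import product

# Hypothetical stand-ins for two already-fitted base models (assumptions, not from the paper).
def model_a(x):
    """Stand-in for the existing-data model: predicts from covariate x[1]."""
    return x[1]

def model_b(x):
    """Stand-in for the new binary-data model: predicts from covariate x[2]."""
    return x[2]

def true_outcome(x):
    """Toy ground truth: x[1] drives the outcome when x[0] == 0, x[2] otherwise."""
    return x[1] if x[0] == 0 else x[2]

def fit_hinge_gate(points, labels, step=1.0, max_epochs=100):
    """Fit linear gate weights by subgradient steps on the hinge loss.

    Labels are +1 (model A right, model B wrong) or -1 (the reverse).
    Training stops once every point has margin >= 1 or max_epochs is reached.
    This is a simplified stand-in for an SVM, without regularization or kernels.
    """
    w = [0.0] * (len(points[0]) + 1)  # last weight is the intercept
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(points, labels):
            xb = list(x) + [1.0]
            margin = y * sum(wj * xj for wj, xj in zip(w, xb))
            if margin < 1:
                w = [wj + step * y * xj for wj, xj in zip(w, xb)]
                updated = True
        if not updated:
            break
    return w

def gate_score(w, x):
    """Signed score of the gate; positive values favor model A."""
    return sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))

def ensemble_predict(w, x):
    """Use model A where the gate score is positive, model B elsewhere."""
    return model_a(x) if gate_score(w, x) > 0 else model_b(x)

def accuracy(predict, data):
    return sum(predict(x) == true_outcome(x) for x in data) / len(data)

# All combinations of three binary covariates serve as the toy data set.
data = list(product([0, 1], repeat=3))

# Train the gate only where the models disagree: routing matters only there.
train = [x for x in data if model_a(x) != model_b(x)]
labels = [1 if model_a(x) == true_outcome(x) else -1 for x in train]
gate = fit_hinge_gate(train, labels)

acc_a = accuracy(model_a, data)
acc_b = accuracy(model_b, data)
acc_ens = accuracy(lambda x: ensemble_predict(gate, x), data)
print(acc_a, acc_b, acc_ens)
```

In this toy setting each base model is right on three quarters of the cases, while the gated combination is right on all of them, because the gate recovers that `x[0]` separates the two regions of success. The sketch evaluates on the same cases used to train the gate; with real data the gate and the base models would need to be assessed on held-out cases.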

Primary Subjects: 62M20, 62H30
Secondary Subjects: 62P10
Keywords: model ensembles; prediction; single nucleotide polymorphism (SNP); support vector machines; variable selection
Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.imsc/1209398476
Digital Object Identifier: doi:10.1214/074921708000000219

2012 © Institute of Mathematical Statistics
