The Annals of Applied Statistics

Variable prioritization in nonlinear black box methods: A genetic association case study

Lorin Crawford, Seth R. Flaxman, Daniel E. Runcie, and Mike West

Full-text: Open access

Abstract

The central aim in this paper is to address variable selection questions in nonlinear and nonparametric regression. Motivated by statistical genetics, where nonlinear interactions are of particular interest, we introduce a novel and interpretable way to summarize the relative importance of predictor variables. Methodologically, we develop the “RelATive cEntrality” (RATE) measure to prioritize candidate genetic variants that are not just marginally important, but whose associations also stem from significant covarying relationships with other variants in the data. We illustrate RATE through Bayesian Gaussian process regression, but the methodological innovations apply to other “black box” methods. It is known that nonlinear models often exhibit greater predictive accuracy than linear models, particularly for phenotypes generated by complex genetic architectures. With detailed simulations and two real data association mapping studies, we show that applying RATE enables an explanation for this improved performance.

Article information

Source
Ann. Appl. Stat., Volume 13, Number 2 (2019), 958-989.

Dates
Received: March 2018
Revised: August 2018
First available in Project Euclid: 17 June 2019

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1560758434

Digital Object Identifier
doi:10.1214/18-AOAS1222

Mathematical Reviews number (MathSciNet)
MR3963559

Zentralblatt MATH identifier
07094842

Keywords
Nonlinear regression; Gaussian processes; centrality measures; variable prioritization; genome-wide association studies; statistical genetics

Citation

Crawford, Lorin; Flaxman, Seth R.; Runcie, Daniel E.; West, Mike. Variable prioritization in nonlinear black box methods: A genetic association case study. Ann. Appl. Stat. 13 (2019), no. 2, 958–989. doi:10.1214/18-AOAS1222. https://projecteuclid.org/euclid.aoas/1560758434



