The Annals of Applied Statistics

Discovering influential variables: A method of partitions

Herman Chernoff, Shaw-Hwa Lo, and Tian Zheng

Full-text: Open access

Abstract

A trend in all scientific disciplines, driven by advances in technology, is the increasing availability of high-dimensional data in which important information is buried. An urgent current challenge for statisticians is to develop effective methods of extracting the useful information from the vast amounts of messy and noisy data available, most of which are noninformative. This paper presents a general computer-intensive approach, based on a method pioneered by Lo and Zheng, for detecting which of many potential explanatory variables have an influence on a dependent variable Y. The approach is suited to detecting influential variables whose causal effects depend on the confluence of values of several variables. It has the advantage of avoiding a difficult direct analysis, involving possibly thousands of variables, by dealing with many randomly selected small subsets from which smaller subsets are selected, guided by a measure of influence I. The main objective is to discover the influential variables rather than to measure their effects. Once they are detected, the problem of dealing with a much smaller group of influential variables should be amenable to appropriate analysis. In a sense, we are confining our attention to locating a few needles in a haystack.
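
The screening loop described in the abstract can be illustrated with a minimal sketch. The abstract does not give the form of the influence measure I, so the score used here (over the cells of the partition induced by a subset of discrete variables, the sum of n_j^2 times the squared deviation of the cell mean of Y from the overall mean) is an assumption, as are the greedy one-at-a-time dropping rule, the `min_score` cutoff, and every function and parameter name. It is not the authors' implementation.

```python
import random
from collections import defaultdict


def influence(rows, y, subset):
    """Assumed influence score for a subset of discrete variables.

    The variables in `subset` partition the observations into cells;
    the score is sum_j n_j^2 * (mean_j - overall_mean)^2.
    """
    overall = sum(y) / len(y)
    cells = defaultdict(list)
    for row, yi in zip(rows, y):
        cells[tuple(row[v] for v in subset)].append(yi)
    return sum(len(c) ** 2 * (sum(c) / len(c) - overall) ** 2
               for c in cells.values())


def backward_drop(rows, y, subset):
    """Drop one variable at a time (hypothetical greedy rule) and
    return the highest-scoring subset seen along the way."""
    current = list(subset)
    best, best_score = list(current), influence(rows, y, current)
    while len(current) > 1:
        # Score every candidate removal; keep the one leaving the highest score.
        scored = [(influence(rows, y, [v for v in current if v != d]), d)
                  for d in current]
        score, drop = max(scored)
        current = [v for v in current if v != drop]
        if score > best_score:
            best, best_score = list(current), score
    return best, best_score


def screen(rows, y, n_vars, subset_size=5, n_draws=200, min_score=0.0, seed=0):
    """Draw many random small subsets, reduce each one, and count how
    often each variable survives with a score above `min_score`."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    for _ in range(n_draws):
        start = rng.sample(range(n_vars), subset_size)
        kept, score = backward_drop(rows, y, start)
        if score > min_score:
            for v in kept:
                counts[v] += 1
    # Most frequently retained variables first.
    return sorted(counts.items(), key=lambda kv: -kv[1])
```

In this sketch, variables that act only jointly (for example, Y equal to the exclusive-or of two binary variables) score zero individually but score highly as a pair, which is the kind of confluence effect the abstract refers to.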

Article information

Source
Ann. Appl. Stat., Volume 3, Number 4 (2009), 1335-1369.

Dates
First available in Project Euclid: 1 March 2010

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1267453943

Digital Object Identifier
doi:10.1214/09-AOAS265

Mathematical Reviews number (MathSciNet)
MR2752137

Zentralblatt MATH identifier
1185.62185

Keywords
Partition; variable selection; influence; marginal influence; retention; impostor; resuscitation

Citation

Chernoff, Herman; Lo, Shaw-Hwa; Zheng, Tian. Discovering influential variables: A method of partitions. Ann. Appl. Stat. 3 (2009), no. 4, 1335–1369. doi:10.1214/09-AOAS265. https://projecteuclid.org/euclid.aoas/1267453943


References

  • Amos, C. I., Chen, W. V., Lee, A., Li, W., Kern, M., Lundsten, R., Batliwalla, F., Wener, M., Remmers, E., Kastner, D. A., Criswell, L. A., Seldin, M. F. and Gregersen, P. K. (2006). High-density SNP analysis of 642 Caucasian families with rheumatoid arthritis identifies two new linkage regions on 11p12 and 2q33. Genes Immun. 7 277–286.
  • Bache, I., Nielsen, N. M., Rostgaard, K., Tommerup, N. and Frisch, M. (2007). Autoimmune diseases in a Danish cohort of 4,866 carriers of constitutional structural chromosomal rearrangements. Arthritis Rheum. 56 2402–2409.
  • Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate—a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300.
  • Breiman, L. (2001). Random forests. Machine Learning 45 5–32.
  • Chernoff, H., Lo, S.-H. and Zheng, T. (2009). Supplement to “Discovering influential variables: A method of partitions.” DOI: 10.1214/09-AOAS265SUPP.
  • Cordell, H., de Andrade, M., Babron, M.-C., Bartlett, C., Beyene, J., Bickeboller, H., Culverhouse, R., Cupples, A. L., Daw, W. E., Dupuis, J., Falk, C., Ghosh, S., Goddard, K., Goode, E., Hauser, E., Martin, L., Martinez, M., North, K., Saccone, N., Schmidt, S., Tapper, W., Thomas, D., Tritchler, D., Vieland, V., Wijsman, E., Wilcox, M., Witte, J., Yang, Q., Ziegler, A., Almasy, L. and MacCluer, J. (2007). Genetic analysis workshop 15: Gene expression analysis and approaches to detecting multiple functional loci. BMC Proceedings 1 S1.
  • Cornélis, F., Faure, S., Martinez, M., Prud’homme, J. F., Fritz, P., Dib, C., Alves, H., Barrera, P., de Vries, N., Balsa, A., Pascual-Salcedo, D., Maenaut, K., Westhovens, R., Migliorini, P., Tran, T. H., Delaye, A., Prince, N., Lefevre, C., Thomas, G., Poirier, M., Soubigou, S., Alibert, O., Lasbleiz, S., Fouix, S., Bouchier, C., Liote, F., Loste, M. N., Lepage, V., Charron, D., Gyapay, G., Lopes-Vaz, A., Kuntz, D., Bardin, T. and Weissenbach, J. (1998). New susceptibility locus for rheumatoid arthritis suggested by a genome-wide linkage study. Proc. Natl. Acad. Sci. USA 95 10746–10750.
  • Dash, M. and Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis 1 131–156.
  • Ding, Y., Cong, L., Ionita-Laza, I., Lo, S. H. and Zheng, T. (2007). Constructing gene association networks for rheumatoid arthritis using the backward genotype-trait association (BGTA) algorithm. BMC Proceedings 1 S13.
  • Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. J. Mach. Learn. Res. 3 1157–1182.
  • Ionita, I. and Lo, S. H. (2005). Multilocus linkage analysis of affected sib pairs. Hum. Hered. 60 227–240.
  • Jawaheer, D., Seldin, M. F., Amos, C. I., Chen, W. V., Shigeta, R., Etzel, C., Damle, A., Xiao, X., Chen, D., Lum, R. F., Monteiro, J., Kern, M., Criswell, L. A., Albani, S., Nelson, J. L., Clegg, D. O., Pope, R., Schroeder, H. W., Jr., Bridges, S. L., Jr., Pisetsky, D. S., Ward, R., Kastner, D. L., Wilder, R. L., Pincus, T., Callahan, L. F., Flemming, D., Wener, M. H. and Gregersen, P. K. (2003). Screening the genome for rheumatoid arthritis susceptibility genes: A replication study and combined analysis of 512 multicase families. Arthritis Rheum. 48 906–916.
  • John, S., Shephard, N., Liu, G., Zeggini, E., Cao, M., Chen, W., Vasavda, N., Mills, T., Barton, A., Hinks, A., Eyre, S., Jones, K. W., Ollier, W., Silman, A., Gibson, N., Worthington, J. and Kennedy, G. C. (2004). Whole-genome scan, in a complex disease, using 11,245 single-nucleotide polymorphisms: Comparison with microsatellites. Am. J. Hum. Genet. 75 54–64.
  • Koller, D. and Sahami, M. (1996). Toward optimal feature selection. In Proceedings of the International Conference on Machine Learning 284–292. Morgan Kaufmann Publishers, Inc., San Francisco, CA.
  • Kuroki, K., Tsuchiya, N., Shiroishi, M., Rasubala, L., Yamashita, Y., Matsuta, K., Fukazawa, T., Kusaoi, M., Murakami, Y., Takiguchi, M., Juji, T., Hashimoto, H., Kohda, D., Maenaka, K. and Tokunaga, K. (2005). Extensive polymorphisms of LILRB1 (ILT2, LIR1) and their association with HLA-DRB1 shared epitope negative rheumatoid arthritis. Hum. Mol. Genet. 14 2469–2480.
  • Lo, S. H. and Zheng, T. (2002). Backward haplotype transmission association (BHTA) algorithm—a fast multiple-marker screening method. Hum. Hered. 53 197–215.
  • Lo, S. H. and Zheng, T. (2004). A demonstration and findings of a statistical approach through reanalysis of inflammatory bowel disease data. Proc. Natl. Acad. Sci. USA 101 10386–10391.
  • Lo, S. H., Chernoff, H., Cong, L., Ding, Y. and Zheng, T. (2008). Discovering interactions among BRCA1 and other candidate genes associated with sporadic breast cancer. Proc. Natl. Acad. Sci. USA 105 12387–12392.
  • Osorio y Fortéa, J., Bukulmez, H., Petit-Teixeira, E., Michou, L., Pierlot, C., Cailleau-Moindrault, S., Lemaire, I., Lasbleiz, S., Alibert, O., Quillet, P., Bardin, T., Prum, B., Olson, J. M. and Cornelis, F. (2004). Dense genome-wide linkage analysis of rheumatoid arthritis, including covariates. Arthritis Rheum. 50 2757–2765.
  • Queiroz, R. G., Tamia-Ferreira, M. C., Carvalho, I. F., Petean, F. C. and Passos, G. A. (2001). Association between EcoRI fragment-length polymorphism of the immunoglobulin lambda variable 8 (IGLV8) gene family with rheumatoid arthritis and systemic lupus erythematosus. Braz. J. Med. Biol. Res. 34 525–528.
  • Ritchie, M. D., Hahn, L. W., Roodi, N., Bailey, L. R., Dupont, W. D., Parl, F. F. and Moore, J. H. (2001). Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet. 69 138–147.
  • Thompson, S. D., Moroldo, M. B., Guyer, L., Ryan, M., Tombragel, E. M., Shear, E. S., Prahalad, S., Sudman, M., Keddache, M. A., Brown, W. M., Giannini, E. H., Langefeld, C. D., Rich, S. S., Nichols, W. C. and Glass, D. N. (2004). A genome-wide scan for juvenile rheumatoid arthritis in affected sibpair families provides evidence of linkage. Arthritis Rheum. 50 2920–2930.
  • Yekutieli, D. and Benjamini, Y. (1999). Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. J. Statist. Plann. Inference 82 171–196.
  • Zheng, T., Wang, H. and Lo, S. H. (2006). Backward genotype-trait association (BGTA)-based dissection of complex traits in case-control designs. Hum. Hered. 62 196–212.
