Statistics Surveys

Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

Yulan Liang and Arpad Kelemen


Recent advances in information technology in the biomedical sciences and other applied areas have produced numerous large, diverse data sets with high dimensional feature spaces, which provide a tremendous amount of information and new opportunities for improving the quality of human life. At the same time, the continuous arrival of new data creates great challenges, since researchers must convert these raw data into scientific knowledge in order to benefit from them. Association studies of complex diseases using SNP data have become increasingly popular in biomedical research in recent years. In this paper, we present a review of recent statistical advances and challenges in analyzing correlated high dimensional SNP data in genomic association studies of complex diseases. The review covers both general feature reduction approaches for high dimensional correlated data and more specific approaches for SNP data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection using statistical testing/scoring, statistical modeling, and machine learning methods, with an emphasis on how to identify interacting loci.

Article information

Statist. Surv., Volume 2 (2008), 43-60.

First available in Project Euclid: 28 March 2008

Keywords: Complex disease; High dimensional data; Single Nucleotide Polymorphism; Statistical methods


Liang, Yulan; Kelemen, Arpad. Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases. Statist. Surv. 2 (2008), 43-60. doi:10.1214/07-SS026.

