Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

Recent advances of information technology in biomedical sciences and other applied areas have created numerous large diverse data sets with a high dimensional feature space, which provide us a tremendous amount of information and new opportunities for improving the quality of human life. Meanwhile, great challenges are also created driven by the continuous arrival of new data that requires researchers to convert these raw data into scientific knowledge in order to benefit from it. Association studies of complex diseases using SNP data have become more and more popular in biomedical research in recent years. In this paper, we present a review of recent statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic association studies for complex diseases. The review includes both general feature reduction approaches for high dimensional correlated data and more specific approaches for SNPs data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNPs selection using statistical testing/scoring, statistical modeling and machine learning methods with an emphasis on how to identify interacting loci.


Introduction
Correlating genetic variations in DNA sequences with phenotypic differences has been one of the grand challenges in biomedical research.Substantial efforts have been made to obtain all common genetic variations in humans, including single nucleotide polymorphisms (SNPs), deletions and insertions [13].SNPs are single base pair positions in genomic DNA at which different sequence alternatives (alleles) exist in normal individuals in some population(s), wherein the least frequent allele has an abundance of 1% or greater [13].In practice, the term "SNP" is used more loosely.Restricting the attention to common SNPs with minor allele frequency bigger than a certain cutoff, e.g.1% will help to filter out some "recent" mutations.SNPs are believed to alter the risk for developing particular diseases.It is, however, very unlikely that individual SNPs play an important role in the development of complex diseases.Instead, high-order interactions of SNPs are supposed to explain the differences between low and high risk population groups.
The HapMap Project has collected genotypes of millions of SNPs from populations with ancestry from Africa, Asia and Europe and makes this information freely available in the public domain [93][94][95].To find evidence of association in this huge data set is a grand challenge now.Therefore, there is a great need, conceptually as well as computationally, to develop advanced robust algorithms and analytical methods for characterizing genetic variations that are non-redundant and identify the target SNPs that are most likely to affect the phenotypes and ultimately contribute to disease development.
Exploiting information redundancy due to associations between SNP markers potentially reduces the efforts in terms of time and cost for genetic association studies [75].However, the efficacy of searching for an optimal set of SNPs has not been as successful as expected in theory.One primary cause is the high dimensionality with highly correlated features/SNPs that can hinder the power of the identification of small to moderate genetic effects in complex diseases.The need to incorporate covariates of other environmental risk factors as effect modifiers or confounders further worsens the "curse of dimensionality" problem in mapping genes for complex diseases [16].Therefore, feature selection for massive genomic data in high dimensions has become one of the main tasks to be tackled with statistical and computational efforts in the past decade.

Feature selection methods for high dimensional problems
The computational and statistical methods that address the "curse of dimensionality" problem in genomic research can be grouped into three categories: filtering, wrapper, and embedded methods.Filtering methods select feature subsets independently from the learning classifiers and do not incorporate learning.Therefore, filtering methods are fast [10; 60; 69; 109].A weakness of filtering methods is that they only consider the individual features in isolation and ignore the possible interaction among them.Yet, the combination of these features may have a combined effect that does not necessarily follow from the individual performances of features in the group [73].One of the consequences of filtering methods is that we may end up with many highly correlated features/SNPs with highly redundant information that worsens the classification and prediction performance.If there is a limit on the number of features to be chosen, then we may not be able to include all the informative ones.
To address this problem in filtering methods, wrapper methods wrap around a particular learning algorithm that can assess the selected feature subsets in terms of the estimated classification errors and then build the final classifier [44].Wrapper methods use a learning machine to measure the quality of subsets of features.One of the well-known wrapper methods for feature selection is Support Vector Machine Recursive Feature Elimination, which refines the optimum feature set by using a Support Vector Machine, [33].The idea of SVMRFE is that the orientation of the separating hyper-plane found by the SVM can be used to select informative features: if the plane is orthogonal to a particular feature dimension, then that feature is informative, and vice versa.SVMRFE uses the weights of a SVM classifier to produce a feature ranking, and then eliminates the feature with smallest weight magnitude recursively.
Wrapper methods can be used with arbitrary classifiers and can notably reduce the number of features and significantly improve the classification accuracy [63; 79].However, wrapper methods have the drawback that they do not incorporate knowledge about the specific structure of the classification or regression function [52].Moreover, they are more computationally expensive since they need to evaluate a cross-validation scheme at each iteration.
With much better computational efficiency and similar performance to wrapper methods, a relatively new class of approaches for feature selection called "embedded methods" has become available in the literature.Lal et al. [52] provide the detailed mathematical formulations of embedded methods.Embedded methods process feature selection simultaneously with the learning classifier and the feature selection can not be separated from the learning.For example, Weston et al. [107] measure the importance of a feature using a bound that is valid for Support Vector Machines only, thus it is not possible to use this method with, for example, decision trees.
Therefore the structure of the class of functions under consideration plays a crucial role.For an embedded method, every subset of features is modeled by a vector σ ∈ {0, 1} n of indicator variables, σ i := 0 indicating that a feature is present in a subset and σ i := 1 indicating that a feature is absent (i = 1,. ..,n).
A parameterized family of classification or regression functions are given as follows: f : Λ × ℜ n → ℜ, (α, x) ∝ f (α, x).The goal of an embedded method is to find a vector of indicator variables σ * ∈ {0, 1} n and α * ∈ Λ that minimize the expected risk R(α, σ) = L[f (α, σ * x), y]dF (x, y), where * denotes the pointwise product, L is a loss function and P is a measure on the domain of the training data (X; Y ).
One may impose some additional constraints for penalty or regularizations to achieve sparseness: s(σ) ≤ σ 0 , where s : [0, 1] n → ℜ + measures the sparsity of a given indicator vector σ.For example, s could be defined as: s(σ) := l 0 (σ) ≤ σ 0 , that is to bound the zero "norm" l 0 (σ), which counts the number of nonzero entries in σ.The L1-norm, L2-norm, and L∞-norm or the elastic-net penalty, a mixture of the L2-norm and the L1-norm penalties [105] are also proposed to achieve automatic feature selections by shrinking the fitted coefficients toward zero.These automatic feature selection methods also benefit from the reduction in the fitted coefficients' variance.
One of the merits of an embedded method is that it intends to find the feature subset of a certain size that leads to the best possible generalization or equivalently to minimal risk, which can be seen from the above formulation.Therefore, the function that measures the quality of a scaling factor can be evaluated faster than a cross-validation error estimation procedure.Moreover, they turn the multiple testing problems for feature selection into an optimization problem in the nonparametric setting.Some recent studies [90; 105] have shown that they are more computational efficient and asymptotically optimal for high dimensional data.
Embedded methods tend to have higher capacity than filtering methods and are therefore more likely to overfit.We thus expect filtering methods to perform better if only a small amount of training data is available.Embedded methods eventually outperform filtering methods as the number of training samples increase.LASSO proposed by Tibshirani [97; 98], logic regression with the regularized Laplacian prior [51] and Bayesian regularized neural network with automatic relevance determination [56] are examples of embedded methods.
Note that the three feature reduction methods, filter, wrapper and embedded methods discussed in this section may perform differently when applied to categorical SNP data instead of continuous gene expression data, in which there are only three genotypes, two homozygous genotypes and one heterozygous genotype.Next we will focus on the review of the recently developed categorical SNP data reduction methods in genome wide association studies.

SNP selections in genome-wide association studies
A major aim of association studies is the identification of polymorphisms, usually single nucleotide polymorphisms (SNPs) associated with a trait or disease status.There are several major computational and statistical tracks for SNP selections, which we will review next [3; 18; 25; 34; 35; 40; 46; 48; 59; 115].

Statistical measures and testing for SNP-disease association
Specifically, in genome-wide disease association studies, various statistical measures and testing based approaches have been proposed for selecting a subset of SNPs [17; 30; 36; 57; 85; 89].These include Linkage Disequilibrium (LD) based SNP selection and supervised SNP selection.Linkage Disequilibrium based methods for selecting a maximally informative set of SNPs for association analyses were developed first [24; 92; 101; 102; 108].For instance, Zhang and Jin [114] identified tagSNPs from haplotype data in two steps; first, they identified haplotype blocks and then identified tagSNPs that best distinguish the haplotypes within a haplotype block.This method is applicable for all types of association studies.Anderson and Novembre [1] and Mannila et al. [61] proposed finding haplotype block boundaries using minimum description length.Entropy-based measure for SNP selections were proposed by Hampe, Schreiber, and Krawczak [36] and Zhao, Boerwinkle, and Xiong [117].Beckmann et al. [8] presented Mantel statistics for SNP selections and disease mapping purposes by using haplotype sharing to correlate temporal and spatial distributions of cancer in a generalized regression model.
A sliding window approach developed by Neale and Sham [68] combines pvalues from multiple independent tests using Here, p i is the p-value of association between SN P i and presence of disease, and m is the number of SNPs in the sliding window.The test statistic χ 2 has a chi-square distribution with 2m degrees of freedom.The sliding window incorporates the ordering of SNPs on the chromosome and merges results across adjacent windows to detect chromosome regions with significant associations [27; 84; 113].However, it does not consider the distance between them and the implicit assumption is that the SNPs are equally spaced.
The scan statistic [26; 39; 54; 91; 104] does account for the spacing and ordering of SNPs on the chromosome, but it does not consider gene-gene interactions.For instance, Sun, et al. [91] developed a chromosomal scan statistic approach, which includes two parts: (i) Identifying SNP clusters; (ii) Identifying SNP clusters with significant disease association.This scan method assumes the position of each SNP is randomly determined by a Poisson process.The lengths between two adjacent SNPs have an exponential distribution and the sum-of-lengths between SNPs has a Gamma distribution.Under the above assumptions, the clusters of SNPs are first identified by testing the hypothesis that whether the observed length between a set of SNPs (combined interval between these SNPs) is equal (null hypothesis) or less than (alternative hypothesis) the expected length.Rejection of the null hypothesis identifies this group of SNPs as cluster.To further identify SNP clusters with significant disease association, disease outcomes are incorporated and Pearson Chi-square p-values are computed for associations of significance.
Other test statistic approaches, such as the score statistic [81; 82], and weightedaverage statistic [87] for disease mapping in case-control studies were also proposed for SNP selection in genetic association studies.Cheng et al. [20] propose using the expectation maximization (EM) algorithm to estimate haplotype fre-quencies of multiple linked SNPs, and follow this by constructing a contingency table statistic S for LD analysis based on the estimated haplotype frequencies.An empirical p-value is obtained based on the null distribution of the maximum of S (S*) from a large number (e.g., 1,000 or more) of randomized permutations.This method is developed for mapping functional sites or regions from case-control data using haplotypes of multiple linked SNPs.
All these conventional test based filter approaches estimate the association between each SNP (or multiple SNPs) and a phenotype, and then use the corresponding p-values to prioritize the results.One drawback is that one may end up with many highly correlated SNPs or genes with high redundancy information, which can be hurdles for further classifications and predictions.Also the test based approaches can not incorporate many environmental factors to account for gene-environment interactions.Furthermore, the non-independence of SNPs in physical proximity (Linkage Disequilibrium) may cause problems for multiple testing scenarios with correlated tests [6; 7; 23; 70; 80; 112].Simple corrections may lead to either conservative p-values if Bonferroni correction is used or become computationally expensive, if permutation is used [84].Nyholt [70] proposed a method for efficiently accounting for multiple testing of many SNPs in an association study that involves estimating an "effective number" of independent tests, and then adjusting the smallest observed p-value using Sidak's formula based on this number of tests.Salyakina et al. [80] further evaluated this method.
Note that the "multiple testing problem" discussed here differs from the "curse of dimensionality problem", so it poses different challenges."Multiple testing problem" is caused by the high dimensionality of the predictors (including features plus possible interactions of features) and the complex correlation structures of the predictors, while the "curse of dimensionality problem" arises when considering the interaction of many features, i.e., there are not enough observations in each combination of those features.
Last, but not least, these existing testing based approaches ignore some information about the SNPs, such as sub-structures of the underlying population (admixture problem).This may lead to spurious results as well as suffer from low power.This may explain why reproducibility has become a major issue in genomic association studies for complex diseases.The same data set can show a highly significant association with one method, whereas a different method shows no or only a marginal association.Also, given the low prior probability of causality for each SNP in the genome, rigorous standards of statistical significance are needed for genome-wide association studies in order to avoid a flood of false-positive results.Multiple replications in large samples may provide the most straightforward path in identifying robust and broadly relevant associations.

Supervised statistical models and statistical learning algorithms
In order to incorporate environmental factors and other covariates/confounders into the genomic association studies, various model based approaches have been developed.He and Zelikovsky [37] proposed tagSNPs for unphased genotypes based on multiple linear regressions.Durrant et al. [28] adopted a logisticregression model applicable to whole-genome screens using sliding windows; it controls for other (continuous) confounders and gene-environment interactions.Yet, they have to make assumptions on the disease model, which is usually unknown in practice.Moreover, the effects of violations of these assumptions are unpredictable in general.Baker [5] applied a simple loglinear model for haplotype effects in a case-control study involving two unphased genotypes.
The haplotype trend regression, developed by Zaykin et al. [111], fits a model of additive effects of haplotypes that takes a series of marker genotypes, computes haplotype probabilities for each observation using the composite haplotype method (CHM), and forms a linear regression on the response using the haplotype probabilities as the regression matrix.A nonparametric method called Haplotype Pattern Mining (HPM) was proposed to identify disease associated haplotype patterns from case-control data.HPM has two steps: In step I, given the data-markers, haplotypes, and phenotypes, the goal is to output all haplotype patterns that are strongly associated with the disease status for a given value of the association threshold; In step II, it is to find the "gene location", by counting frequency that one marker appears in the haplotype patterns identified in the first step.Since the HPM method utilizes the disease status (case/control), it is a supervised mining approach.Toivonen et al. [99] showed that HPM does not require any assumptions on the inheritance patterns and has good localization power, even when the number of phenocopies is large.Knorr-Held and Rue [49] developed Markov random field models on block updating for disease mapping.Other model-based approaches that can take into account the spatial correlation between markers were also proposed [14; 31; 32; 42; 96; 100; 103; 106].
Recently, Schwender and Ickstadt [83] demonstrated logic regression based identification of SNP interactions for the disease status in case-control study and proposed two measures for quantifying the importance of feature interactions for classification.In comparison with some well-known classification methods such as CART [12], Random Forests [11], and other regression procedures [108], logic regression has shown a good classification performance when applied to SNP data.When fitting with categorical features/variables in the model based approaches, i.e. the genotype measurements with two homozygous genotypes and one heterozygous genotype, we often define a set of dummy variables that represent a single categorical feature/variable.In order to select the set/group of dummy variables that represent a single categorical feature/variable/SNP simultaneously, Yuan and Lin [110] proposed the group-Lars and the group-Lasso methods.Park and Hastie [72] proposed several regularization path algorithms with grouped feature/variable selection for modeling gene-gene interactions.Multifactor dimensionality reduction has been proposed and implemented for SNPs data reduction by Coffey et al. [22], Ritchie et al. [77] and Moore et al. [64].

Unsupervised haplotype mapping approaches
Haplotype density based clustering algorithms and clustering techniques based on the degree of haplotype sharing in affected individuals for haplotype mapping were developed recently.These approaches have advantages of robustness since they are nonparametric and require fewer assumptions in modeling.Fu et al. [29] and Zhang et al. [116] proposed Bayesian models for the analysis of genetic structure when populations are correlated.Liu et al. [58] employed a Bayesian approach to model positions of the historical recombinations and mutation events that produced the observed haplotypes from an initial set of founders by accounting for all sources of uncertainties.They employed Monte Carlo Markov Chain (MCMC) method for parameter estimation and assigned haplotypes to clusters representing allele heterogeneity.Molitor et al. [62] modeled haplotype risks using clusters obtained from a probability model, but their method does not take phenocopies into consideration.Both methods were developed mainly for haplotype fine mapping and do not scale up for whole-genome screens very well.
Other algorithms for SNPs are hierarchical clustering and graph methods [2; 55].Principal Component Analysis with multiple genotype frequencies was also applied to select a subset of correlated SNPs that capture multiple genotype variability in the region [9; 57].However, whether Principal Component Analysis is a suitable tool for categorical SNPs information is arguable, since it is more appropriate for continuous scale data.The related correspondence analysis may be more suitable, but the interpretation of the results from correspondence analysis reveals many challenges.

Computational intelligence approaches
Computational intelligence systems [47; 74] hold a great promise for tackling the tasks and challenges posed by large, diverse, genomic data for complex diseases.Some of these challenges are the identification of gene-gene and geneenvironment interactions [4; 43; 50; 66; 78], dealing with the notorious "curse of dimensionality", the uncertainty, and unclear, fuzzy boundaries of phenotypes for complex diseases [76; 88].Techniques include neural networks [71], genetic algorithms [21], genetic programming [65], evolutionary trees [53], evolutionary algorithms [41] and various hybrid approaches.For instance, Moore [64] developed a hybrid genetic programming (GP) with a multifactor dimensionality reduction method to pick SNPs for epistasis.
Motsinger, et al. [67] applied a genetic programming neural network (GPNN) approach for detecting epistasis in case-control studies for SNPs data.They evaluated the power of GPNN for identifying high-order gene-gene interactions and applied GPNN to a real data on Parkinson's disease.They developed a Grammatical Evolution Neural Network (GENN), a machine-learning approach to detect gene-gene and gene-environment interactions in high dimensional genetic epidemiological data.Furthermore, they proposed an Ensemble Learning Approach for Set association (ELAS) to detect a set of interacting loci that predicts the complex trait.An important advantage of the hybrid approach is that any form of expert knowledge could be used to guide the stochastic search algorithm to identify epistatic SNPs in the absence of marginal effects.

Other challenges in genetic association studies of complex diseases
An important challenge that faces molecular association studies in the post genomic era is to understand the interconnections from a network of genes and their products that are modified by a variety of environmental factors [15; 45].The variety of phenotype definitions leads to a multiplicity of tests that involve a large number of comparisons that often result in less power.The need for adequate algorithms and models for reducing biological and statistical redundancy from thousands of SNPs and finding an optimal set of SNPs associated with diseases are pressing for common complex diseases.Dealing with many dependent association tests is one of the emerging issues on the statistical/computational side.
For SNP-disease data, in addition to being large, redundant, diverse and distributed, three important characteristics pose challenges for data analysis and modeling: (1) heterogeneity, (2) a constantly evolving biological nature and (3) complexity.Firstly, there is the heterogeneity of SNP data, in the sense that i) the population data involves the population substructure or admixture problem and there is locus heterogeneity where a large fraction of the prevalence is due to phenocopies; and ii) there is a wide array of data types, including categorical, continuous, sequence data, as well as temporal, incomplete and missing data.Such data sets are large with a lot of redundancy in SNP and haplotype databases.Secondly, they are very dynamic and continuously evolving, which means that special knowledge is required when designing the modeling techniques.Lastly, but most importantly, these SNP and haplotype data are complex with intrinsic features and subtle patterns, in the sense that they are very rich in associated complex phenotype traits.
The difficulty in a SNPs association study is increased by the nature of complex diseases [38].Typically, the contribution of single genes as well as of single environmental risk factors is small to moderate.Furthermore, most complex diseases result from gene-gene and gene-environment interactions [19].By disregarding interactions, relative risks of individual genetic variants are expected to be small.Disregarding gene-environment interaction also weakens exposuredisease and gene-disease associations.In complex diseases, it is likely that a combination of genes predisposes for the disease and environmental factors aggravate the impact of these genes and therefore are jointly responsible for disease development in populations (known as epistatic effect).In addition, environmental factors, which seem to have only a moderate impact at the population level might have larger relative risks in subpopulations with certain genetic predispositions.There are major methodological challenges in the study of gene-gene and gene-environment interactions.
Other open questions and challenges for new computational approaches in analyzing the associations between genetic markers, such as SNPs in complex diseases involve several hierarchical levels.First level of complexity: How to analyze multiple SNPs in a single gene?How to analyze interactions among multiple SNPs in a single gene?Second level of complexity: How to analyze multiple SNPs in multiple genes?How to analyze interactions among multiple SNPs in multiple genes?Third level of complexity: How to analyze interactions among multiple SNPs in multiple genes and environmental factors?Fourth level of complexity: How to analyze associations between SNPs in single or multiple genes and quantitative traits?How to identify and quantify the percentage of the association between genes and diseases explained by the association between the same gene and quantitative traits, taking into consideration single genes, multiple genes and environmental factors.Lastly, the ultimate goal in genetic/genomic analysis is to build direct or indirect causal association between genetic variants and phenotypes/disease status, but the difficulty here is that we do not know if there is association between the SNPs and the disease.However, with the development of computational/statistical approaches, we may be able to identify these causal associations and construct the pathways related to complex diseases.

Discussion
New advances in human genome research have drawn tremendous attention of researchers from multiple fields, including both theoretical scientists and applied researchers, especially in the statistical field.Huge amounts of continuously growing large-scale genomic, proteomic and clinical data for complex diseases and phenotype traits have posed ever greater challenges for the computational field.Multiple whole genome wide association studies have already been completed and have resulted in novel and promising genetic variants for various diseases.In this paper we presented a survey of recent advances and some promises of designing, developing and implementing statistical/computational methods for identifying SNP markers responsible for common, complex, chronic diseases, such as diabetes, cancer, multiple sclerosis, and cardiovascular disease and for tackling the challenges, such as gene-gene and gene-environment interactions along with the notorious "curse of dimensionality" problem.Success in identifying SNPs and haplotypes conferring susceptibility or resistance to common diseases will provide a deeper understanding of the architecture of the disease, the risks, and offer a more powerful diagnostic tool and predictive treatment.