Whole and targeted sequencing of human genomes is a promising, increasingly feasible tool for discovering genetic contributions to risk of complex diseases. A key step is calling an individual’s genotype from the multiple aligned short read sequences of his DNA, each of which is subject to nucleotide read error. Current methods are designed to call genotypes separately at each locus from the sequence data of unrelated individuals. Here we propose likelihood-based methods that improve calling accuracy by exploiting two features of sequence data. The first is the linkage disequilibrium (LD) between nearby SNPs. The second is the Mendelian pedigree information available when related individuals are sequenced. In both cases the likelihood involves the probabilities of read variant counts given genotypes, summed over the unobserved genotypes. Parameters governing the prior genotype distribution and the read error rates can be estimated either from the sequence data itself or from external reference data. We use simulations and synthetic read data based on the 1000 Genomes Project to evaluate the performance of the proposed methods. An R-program to apply the methods to small families is freely available at http://med.stanford.edu/epidemiology/PHGC/.
"Improving sequence-based genotype calls with linkage disequilibrium and pedigree information." Ann. Appl. Stat. 6 (2) 457 - 475, June 2012. https://doi.org/10.1214/11-AOAS527