The Annals of Applied Statistics

Assignment of endogenous retrovirus integration sites using a mixture model

David R. Hunter, Le Bao, and Mary Poss

Structural variation occurs in the genomes of individuals because of the different positions occupied by repetitive genome elements like endogenous retroviruses, or ERVs. The presence or absence of ERVs can be determined by identifying the junction with the host genome using high-throughput sequencing technology and a clustering algorithm. The resulting data give the number of sequence reads assigned to each ERV-host junction sequence for each sampled individual. Because the number of reads from an individual integration site is variable, it is difficult to determine whether a site is present when its read count is low. We present a novel two-component mixture of negative binomial distributions to model these counts and assign a probability that a given ERV is present in a given individual. We explain how our approach is superior to existing alternatives, including another form of two-component mixture model and the much more common approach of selecting a threshold count for declaring the presence of an ERV. We apply our method to a data set of ERV integrations in mule deer (Odocoileus hemionus), a species for which no genomic resources are available, and demonstrate that the discovered patterns of shared integration sites contain information about animal relatedness.
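
To make the idea of assigning a presence probability concrete, the following is a minimal sketch of how a generic two-component negative binomial mixture converts a read count into a posterior probability. It is not the authors' model or code (their R code is in the supplemental materials). The function names, the mean/size parameterization, and all numeric parameter values are hypothetical, chosen only for illustration.

```python
import math


def nb_log_pmf(k, mean, size):
    """Log pmf of a negative binomial parameterized by its mean and size.

    The variance is mean + mean**2 / size, so a smaller size means more
    overdispersion. This parameterization is an assumption of the sketch.
    """
    p = size / (size + mean)  # success probability implied by mean and size
    return (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
            + size * math.log(p) + k * math.log1p(-p))


def prob_present(k, weight_present, present, absent):
    """Posterior probability that a count k came from the 'present' component.

    present and absent are (mean, size) pairs; weight_present is the
    mixing proportion of the 'present' component.
    """
    log_a = math.log(weight_present) + nb_log_pmf(k, *present)
    log_b = math.log1p(-weight_present) + nb_log_pmf(k, *absent)
    m = max(log_a, log_b)  # subtract the max for numerical stability
    a, b = math.exp(log_a - m), math.exp(log_b - m)
    return a / (a + b)


# Hypothetical parameters: a low-mean "absent" (background) component and
# a high-mean, overdispersed "present" component, mixed equally.
ABSENT = (0.5, 1.0)
PRESENT = (30.0, 2.0)

if __name__ == "__main__":
    for count in (0, 1, 2, 5, 10, 20):
        print(count, round(prob_present(count, 0.5, PRESENT, ABSENT), 4))
```

Unlike a fixed threshold, which turns each count into a hard yes or no, this kind of calculation yields a graded probability for intermediate counts. In practice the mixture parameters would be estimated from the data rather than fixed by hand as they are here.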

Ann. Appl. Stat., Volume 11, Number 2 (2017), 751-770.

Received: September 2015
Revised: November 2016
First available in Project Euclid: 20 July 2017

Keywords: Mixture model; negative binomial; read count data


Hunter, David R.; Bao, Le; Poss, Mary. Assignment of endogenous retrovirus integration sites using a mixture model. Ann. Appl. Stat. 11 (2017), no. 2, 751-770. doi:10.1214/16-AOAS1016.

Supplemental materials

  • Datasets and R code. We provide all data used in the article along with code written in R [R Core Team (2016)] that can be used to duplicate all analyses and figures.