Open Access
Assignment of endogenous retrovirus integration sites using a mixture model
David R. Hunter, Le Bao, Mary Poss
Ann. Appl. Stat. 11(2): 751-770 (June 2017). DOI: 10.1214/16-AOAS1016


Structural variation occurs in the genomes of individuals because of the different positions occupied by repetitive genome elements like endogenous retroviruses, or ERVs. The presence or absence of ERVs can be determined by identifying the junction with the host genome using high-throughput sequencing technology and a clustering algorithm. The resulting data give the number of sequence reads assigned to each ERV-host junction sequence for each sampled individual. Variability in the number of reads from an individual integration site makes it difficult to determine whether a site is present when read counts are low. We present a novel two-component mixture of negative binomial distributions to model these counts and assign a probability that a given ERV is present in a given individual. We explain how our approach is superior to existing alternatives, including another form of two-component mixture model and the much more common approach of selecting a threshold count for declaring the presence of an ERV. We apply our method to a data set of ERV integrations in mule deer (Odocoileus hemionus), a species for which no genomic resources are available, and demonstrate that the discovered patterns of shared integration sites contain information about animal relatedness.
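
The idea of assigning a presence probability from a two-component negative binomial mixture, rather than applying a fixed count threshold, can be illustrated with a minimal sketch. This is not the authors' model, which the abstract does not specify in detail; it only shows the generic Bayes-rule step for a single read count. The function names, the mean/size parameterization, and every numeric value below are hypothetical and chosen for illustration only.

```python
import math


def nb_log_pmf(k, mean, size):
    """Log pmf of a negative binomial with the given mean and size (dispersion).

    Parameterization (an assumption of this sketch): variance = mean + mean**2 / size.
    """
    p = size / (size + mean)  # probability attached to the 'size' term
    return (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(p) + k * math.log1p(-p)
    )


def posterior_present(k, weight, absent, present):
    """Posterior probability that a site is present given read count k.

    weight  : prior probability of the 'present' component (hypothetical)
    absent  : (mean, size) of the low-count 'absent' component (hypothetical)
    present : (mean, size) of the high-count 'present' component (hypothetical)
    """
    log_absent = math.log1p(-weight) + nb_log_pmf(k, *absent)
    log_present = math.log(weight) + nb_log_pmf(k, *present)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_absent, log_present)
    num = math.exp(log_present - m)
    return num / (math.exp(log_absent - m) + num)


if __name__ == "__main__":
    # Illustrative parameters only; a real analysis would estimate them from data.
    absent_params = (0.5, 1.0)
    present_params = (50.0, 2.0)
    for count in (0, 2, 5, 10, 40):
        prob = posterior_present(count, 0.3, absent_params, present_params)
        print(count, round(prob, 4))
```

Unlike a hard threshold, this kind of calculation returns a graded probability for intermediate counts, which is the behavior the abstract contrasts with threshold-based calls. In practice the mixture parameters would be estimated from the data, for example by an EM algorithm, which this sketch omits.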



David R. Hunter, Le Bao, Mary Poss. "Assignment of endogenous retrovirus integration sites using a mixture model." Ann. Appl. Stat. 11 (2) 751-770, June 2017.


Received: 1 September 2015; Revised: 1 November 2016; Published: June 2017
First available in Project Euclid: 20 July 2017

zbMATH: 06775891
MathSciNet: MR3693545
Digital Object Identifier: 10.1214/16-AOAS1016

Keywords: mixture model, negative binomial, read count data

Rights: Copyright © 2017 Institute of Mathematical Statistics

Vol. 11 • No. 2 • June 2017