The Annals of Applied Statistics

Assignment of endogenous retrovirus integration sites using a mixture model

David R. Hunter, Le Bao, and Mary Poss


Structural variation occurs in the genomes of individuals because of the different positions occupied by repetitive genome elements like endogenous retroviruses, or ERVs. The presence or absence of ERVs can be determined by identifying the junction with the host genome using high-throughput sequencing technology and a clustering algorithm. The resulting data give the number of sequence reads assigned to each ERV-host junction sequence for each sampled individual. Because the number of reads from an individual integration site varies, it is difficult to determine whether a site with a low read count is actually present. We present a novel two-component mixture of negative binomial distributions to model these counts and assign a probability that a given ERV is present in a given individual. We explain how our approach is superior to existing alternatives, including another form of two-component mixture model and the much more common approach of selecting a threshold count for declaring the presence of an ERV. We apply our method to a data set of ERV integrations in mule deer (Odocoileus hemionus), a species for which no genomic resources are available, and demonstrate that the discovered patterns of shared integration sites contain information about animal relatedness.
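
As a rough illustration of the idea in the abstract, the sketch below computes the posterior probability that a read count came from the "present" component of a two-component negative binomial mixture. It is not the authors' R code or their fitted model: the function names, the mixing weight and the (size, mean) values are hypothetical, chosen only so that the "absent" component concentrates on small counts. The parameters are taken as given here rather than estimated from data.

```python
import math

def nb_log_pmf(k, size, mean):
    """Log pmf of a negative binomial with dispersion `size` and mean `mean`."""
    p = size / (size + mean)  # success probability in the (size, p) form
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log1p(-p))

def prob_present(k, pi_present, absent, present):
    """Posterior probability that count k came from the 'present' component."""
    log_a = math.log1p(-pi_present) + nb_log_pmf(k, *absent)
    log_b = math.log(pi_present) + nb_log_pmf(k, *present)
    m = max(log_a, log_b)  # subtract the max to avoid underflow
    return math.exp(log_b - m) / (math.exp(log_a - m) + math.exp(log_b - m))

# Hypothetical parameters, for illustration only: (size, mean) per component.
ABSENT = (0.5, 1.0)    # spurious reads at a site that is not really there
PRESENT = (2.0, 50.0)  # reads from a genuine integration site
PI = 0.3               # assumed prior probability that a site is present

for k in (0, 2, 5, 10, 40):
    print(k, round(prob_present(k, PI, ABSENT, PRESENT), 3))
```

Unlike a fixed threshold, which returns only a yes or no at a chosen cutoff, this gives each count a graded probability of presence.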

Article information

Ann. Appl. Stat., Volume 11, Number 2 (2017), 751-770.

Received: September 2015
Revised: November 2016
First available in Project Euclid: 20 July 2017

Keywords: mixture model; negative binomial; read count data


Hunter, David R.; Bao, Le; Poss, Mary. Assignment of endogenous retrovirus integration sites using a mixture model. Ann. Appl. Stat. 11 (2017), no. 2, 751–770. doi:10.1214/16-AOAS1016.

References


  • Akagi, K., Li, J., Stephens, R. M., Volfovsky, N. and Symer, D. E. (2008). Extensive variation between inbred mouse strains due to endogenous L1 retrotransposition. Genome Res. 18 869–880.
  • Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control 19 716–723.
  • Baillie, J. K., Barnett, M. W., Upton, K. R., Gerhardt, D. J., Richmond, T. A., De Sapio, F., Brennan, P. M., Rizzu, P., Smith, S., Fell, M., Talbot, R. T., Gustincich, S., Freeman, T. C., Mattick, J. S., Hume, D. A., Heutink, P., Carninci, P., Jeddeloh, J. A. and Faulkner, G. J. (2011). Somatic retrotransposition alters the genetic landscape of the human brain. Nature 479 534–537.
  • Bao, L., Elleder, D., Malhotra, R., DeGiorgio, M., Maravegias, T., Horvath, L., Carrel, L., Gillin, C., Hron, T., Fábryová, H., Hunter, D. R. and Poss, M. (2014). Computational and statistical analyses of insertional polymorphic endogenous retroviruses in a non-model organism. Comput. 2 221–245.
  • Böhne, A., Brunet, F., Galiana-Arnoux, D., Schultheis, C. and Volff, J.-N. (2008). Transposable elements as drivers of genomic and biological diversity in vertebrates. Chromosome Res. 16 203–215.
  • Bourque, G. (2009). Transposable elements in gene regulation and in the evolution of vertebrate genomes. Curr. Opin. Genet. Dev. 19 607–612.
  • Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30 1145–1159.
  • Burns, K. H. and Boeke, J. D. (2012). Human transposon tectonics. Cell 149 740–752.
  • Contreras-Galindo, R., Kaplan, M. H., He, S., Contreras-Galindo, A. C., Gonzalez-Hernandez, M. J., Kappes, F., Dube, D., Chan, S. M., Robinson, D., Meng, F., Dai, M., Gitlin, S. D., Chinnaiyan, A. M., Omenn, G. S. and Markovitz, D. M. (2013). HIV infection reveals widespread expansion of novel centromeric human endogenous retroviruses. Genome Res. 23 1505–1513.
  • Cordaux, R. and Batzer, M. A. (2009). The impact of retrotransposons on human genome evolution. Nat. Rev. Genet. 10 691–703.
  • Cullingham, C. I., Nakada, S. M., Merrill, E. H., Bollinger, T. K., Pybus, M. J. and Coltman, D. W. (2011). Multiscale population genetic analysis of mule deer (Odocoileus hemionus hemionus) in western Canada sheds new light on the spread of chronic wasting disease. Can. J. Zool. 89 134–147.
  • Elleder, D., Kim, O., Padhi, A., Bankert, J. G., Simeonov, I., Schuster, S. C., Wittekindt, N. E., Motameny, S. and Poss, M. (2012). Polymorphic integrations of an endogenous gammaretrovirus in the mule deer genome. J. Virol. 86 2787–2796.
  • Evrony, G. D., Cai, X., Lee, E., Hills, L. B., Elhosary, P. C., Lehmann, H. S., Parker, J. J., Atabay, K. D., Gilmore, E. C., Poduri, A., Park, P. J. and Walsh, C. A. (2012). Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell 151 483–496.
  • Evrony, G. D., Lee, E., Park, P. J. and Walsh, C. A. (2016). Resolving rates of mutation in the brain using single-neuron genomics. eLife 5 e12966.
  • Faircloth, B. C. and Glenn, T. C. (2012). Not all sequence tags are created equal: Designing and validating sequence identification tags robust to indels. PLoS ONE 7 e42543.
  • Fedoroff, N. V. (2012). Transposable elements, epigenetics, and genome evolution. Science 338 758–767.
  • Hunter, D. R., Bao, L. and Poss, M. (2017). Supplement to “Assignment of endogenous retrovirus integration sites using a mixture model.” DOI:10.1214/16-AOAS1016SUPP.
  • Iskow, R. C., McCabe, M. T., Mills, R. E., Torene, S., Pittard, W. S., Neuwald, A. F., Van Meir, E. G., Vertino, P. M. and Devine, S. E. (2010). Natural mutagenesis of human genomes by endogenous retrotransposons. Cell 141 1253–1261.
  • Kapusta, A., Kronenberg, Z., Lynch, V. J., Zhuo, X., Ramsay, L., Bourque, G., Yandell, M. and Feschotte, C. (2013). Transposable elements are major contributors to the origin, diversification, and regulation of vertebrate long noncoding RNAs. PLoS Genet. 9 e1003470.
  • Kazazian, H. H. (2004). Mobile elements: Drivers of genome evolution. Science 303 1626–1632.
  • Kokošar, J. and Kordiš, D. (2013). Genesis and regulatory wiring of retroelement-derived domesticated genes: A phylogenomic perspective. Mol. Biol. Evol. 30 1015–1031.
  • Latch, E. K., Reding, D. M., Heffelfinger, J. R., Alcalá-Galván, C. H. and Rhodes, O. E. (2014). Range-wide analysis of genetic structure in a widespread, highly mobile species (Odocoileus hemionus) reveals the importance of historical biogeography. Mol. Ecol. 23 3171–3190.
  • Malhotra, R., Elleder, D., Bao, L., Hunter, D. R., Acharya, R. and Poss, M. (2016). Clustering pipeline for determining consensus sequences in targeted next-generation sequencing. In Proceedings of the 8th International Conference on Bioinformatics and Computational Biology (BICOB 2016).
  • Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80 267–278.
  • O’Donnell, K. A. and Burns, K. H. (2010). Mobilizing diversity: Transposable element insertions in genetic variation and disease. Mob. DNA 1 21.
  • Powell, J. H., Kalinowski, S. T., Higgs, M. D., Ebinger, M. R., Vu, N. V. and Cross, P. C. (2013). Microsatellites indicate minimal barriers to mule deer Odocoileus hemionus dispersal across Montana, USA. Wildl. Biol. 19 102–110.
  • Richardson, S. R., Morell, S. and Faulkner, G. J. (2014). L1 retrotransposons and somatic mosaicism in the brain. Annu. Rev. Genet. 48 1–27.
  • R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Wittekindt, N. E., Padhi, A., Schuster, S. C., Qi, J., Zhao, F., Tomsho, L. P., Kasson, L. R., Packard, M., Cross, P. and Poss, M. (2010). Nodeomics: Pathogen detection in vertebrate lymph nodes using meta-transcriptomics. PLoS ONE 5 e13432.

Supplemental materials

  • Datasets and R code. We provide all data used in the article along with code written in R [R Core Team (2016)] that can be used to duplicate all analyses and figures.