The Annals of Applied Statistics

Detecting simultaneous variant intervals in aligned sequences

David Siegmund, Benjamin Yakir, and Nancy R. Zhang


Given a set of aligned sequences of independent noisy observations, we are concerned with detecting intervals where the mean values of the observations change simultaneously in a subset of the sequences. The intervals of changed means are typically short relative to the length of the sequences, the subset where the change occurs, the “carriers,” can be relatively small, and the sizes of the changes can vary from one sequence to another. This problem is motivated by the scientific problem of detecting inherited copy number variants in aligned DNA samples. We suggest a statistic based on the assumption that for any given interval of changed means there is a given fraction of samples that carry the change. We derive an analytic approximation for the false positive error probability of a scan, which is shown by simulations to be reasonably accurate. We show that the new method usually improves on methods that analyze a single sample at a time and on our earlier multi-sample method, which is most efficient when the carriers form a large fraction of the set of sequences. The proposed procedure is also shown to be robust with respect to the assumed fraction of carriers of the changes.
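The abstract does not give the statistic explicitly, so the following is only a minimal sketch of one way a scan of this kind could be set up, under stated assumptions. It assumes the observations are standardized to mean 0 and variance 1 when no change is present, and it assumes a mixture-type score per sample, log(1 - p0 + p0 exp(U^2/2)), where U is the standardized sum of one sample over a candidate interval and p0 is the assumed fraction of carriers. The function name `mixture_scan` and the parameters `p0` and `max_width` are hypothetical names chosen for illustration.

```python
import numpy as np


def mixture_scan(y, p0=0.1, max_width=20):
    """Scan all intervals of width 1..max_width and return the best one.

    y         : array of shape (n_samples, length); assumed standardized
                (mean 0, variance 1 in the absence of a change).
    p0        : assumed fraction of carriers (an illustrative tuning value).
    max_width : largest interval width considered; must not exceed length.

    Returns (statistic, start, end) with the interval half-open: [start, end).
    """
    n, _ = y.shape
    # Per-sample cumulative sums with a leading zero, so that any interval
    # sum is a difference of two columns.
    csum = np.concatenate([np.zeros((n, 1)), np.cumsum(y, axis=1)], axis=1)
    best = (-np.inf, 0, 0)
    for w in range(1, max_width + 1):
        # Standardized interval sums for every start position and sample.
        u = (csum[:, w:] - csum[:, :-w]) / np.sqrt(w)
        # Assumed mixture score log(1 - p0 + p0 * exp(u^2 / 2)), written with
        # log1p/expm1 for accuracy near zero, then summed over samples.
        z = np.sum(np.log1p(p0 * np.expm1(u ** 2 / 2.0)), axis=0)
        s = int(np.argmax(z))
        if z[s] > best[0]:
            best = (float(z[s]), s, s + w)
    return best
```

Calling `mixture_scan(y, p0=0.2, max_width=20)` on an array with a mean shift in a few rows over a short stretch of columns should return an interval overlapping that stretch. Choosing a threshold for the returned statistic is exactly what the false positive approximation described above is for, and it is not reproduced here.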

Article information

Ann. Appl. Stat., Volume 5, Number 2A (2011), 645-668.

First available in Project Euclid: 13 July 2011

Keywords: Scan statistics, change-point detection, segmentation, DNA copy number


Siegmund, David; Yakir, Benjamin; Zhang, Nancy R. Detecting simultaneous variant intervals in aligned sequences. Ann. Appl. Stat. 5 (2011), no. 2A, 645–668. doi:10.1214/10-AOAS400.
