The Annals of Applied Statistics

Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing

Jeremy J. Shen and Nancy R. Zhang

Full-text: Open access

Abstract

We propose a flexible change-point model for inhomogeneous Poisson processes, which arise naturally in next-generation DNA sequencing, and derive score and generalized likelihood ratio statistics for shifts in the intensity functions. We construct a modified Bayesian information criterion (mBIC) to guide model selection, and point-wise approximate Bayesian confidence intervals to assess confidence in the segmentation. The model is applied to DNA copy number profiling with sequencing data and evaluated on simulated spike-in and real data sets.
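As a rough illustration of the kind of statistic the abstract refers to, the sketch below scans binned counts for a single shift in Poisson rate using a generalized likelihood ratio. It is a simplified, assumed setting (equal-width bins, one sample, at most one change point) and not the authors' model, which works with the point processes directly. The function names `poisson_loglik` and `glr_scan` are hypothetical.

```python
import math


def poisson_loglik(count, exposure):
    """Poisson log-likelihood at the MLE rate count/exposure.

    Terms that do not depend on the rate (the log-factorials) are dropped,
    since they cancel in the likelihood ratio.
    """
    if count == 0:
        return 0.0  # limit of count*log(rate) - rate*exposure as count -> 0
    rate = count / exposure
    return count * math.log(rate) - rate * exposure


def glr_scan(counts):
    """Scan for a single change in rate across equal-width bins.

    Returns (k, stat): the split index k (change between bins k-1 and k)
    that maximizes the generalized likelihood ratio statistic, and that
    maximum. k is None if no split gives a positive statistic.
    """
    n_bins = len(counts)
    total = sum(counts)
    null_ll = poisson_loglik(total, n_bins)  # one common rate for all bins

    best_k, best_stat = None, 0.0
    left = 0
    for k in range(1, n_bins):
        left += counts[k - 1]
        # Two-rate fit: separate MLE rates to the left and right of the split.
        alt_ll = poisson_loglik(left, k) + poisson_loglik(total - left, n_bins - k)
        stat = 2.0 * (alt_ll - null_ll)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat


if __name__ == "__main__":
    # Toy counts with an obvious jump in rate after the fifth bin.
    k, stat = glr_scan([2, 3, 2, 3, 2, 10, 11, 9, 10, 12])
    print(k, round(stat, 2))
```

Because the split is chosen by maximizing over all candidate positions, the maximum cannot be compared against a plain chi-square threshold. Deciding how many change points to keep is the role the abstract assigns to the mBIC, which this sketch does not implement.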

Article information

Source
Ann. Appl. Stat., Volume 6, Number 2 (2012), 476-496.

Dates
First available in Project Euclid: 11 June 2012

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1339419604

Digital Object Identifier
doi:10.1214/11-AOAS517

Mathematical Reviews number (MathSciNet)
MR2976479

Zentralblatt MATH identifier
1243.62112

Keywords
Copy number; CNV; change point; inhomogeneous Poisson process; point-wise confidence interval

Citation

Shen, Jeremy J.; Zhang, Nancy R. Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing. Ann. Appl. Stat. 6 (2012), no. 2, 476–496. doi:10.1214/11-AOAS517. https://projecteuclid.org/euclid.aoas/1339419604


