The Annals of Applied Statistics

Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing

Jeremy J. Shen and Nancy R. Zhang
Source: Ann. Appl. Stat. Volume 6, Number 2 (2012), 476-496.

Abstract

We propose a flexible change-point model for inhomogeneous Poisson processes, which arise naturally from next-generation DNA sequencing, and derive score and generalized likelihood ratio statistics for shifts in intensity functions. We construct a modified Bayesian information criterion (mBIC) to guide model selection, and point-wise approximate Bayesian confidence intervals for assessing the confidence in the segmentation. The model is applied to DNA copy number profiling with sequencing data and evaluated on simulated spike-in and real data sets.
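
To illustrate the kind of statistic the abstract refers to, the following is a minimal sketch of a generalized likelihood ratio scan for a single change point. It is not the authors' statistic: it assumes, for illustration only, that reads from a case and a control sample have been pooled and ordered along the genome, that each read carries a 0/1 label (1 for case), and that a shift in relative intensity appears as a change in the probability of a case label. The function names `bernoulli_loglik` and `glr_single_change` are hypothetical.

```python
import math


def bernoulli_loglik(successes, trials):
    """Bernoulli log-likelihood evaluated at the maximum-likelihood rate."""
    # Boundary cases (no trials, all zeros, or all ones) have log-likelihood 0.
    if trials == 0 or successes == 0 or successes == trials:
        return 0.0
    p = successes / trials
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def glr_single_change(labels):
    """Scan all split points for one change in the probability of a 1 label.

    labels: sequence of 0/1 values ordered by position (assumed input format).
    Returns (best_split, statistic), where best_split = k means the change
    falls between labels[k - 1] and labels[k]. Returns (None, 0.0) when no
    split improves on the no-change model.
    """
    n = len(labels)
    total = sum(labels)
    null_ll = bernoulli_loglik(total, n)

    best_split, best_stat = None, 0.0
    left = 0  # number of 1 labels among the first k observations
    for k in range(1, n):
        left += labels[k - 1]
        alt_ll = bernoulli_loglik(left, k) + bernoulli_loglik(total - left, n - k)
        stat = 2.0 * (alt_ll - null_ll)  # twice the log-likelihood ratio
        if stat > best_stat:
            best_split, best_stat = k, stat
    return best_split, best_stat


if __name__ == "__main__":
    # Toy input: ten control reads followed by ten case reads.
    split, stat = glr_single_change([0] * 10 + [1] * 10)
    print(split, round(stat, 3))
```

A scan of this kind only locates the strongest single shift. Deciding how many change points to keep is a separate model selection step, which the abstract addresses with the mBIC.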

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1339419604
Digital Object Identifier: doi:10.1214/11-AOAS517
Zentralblatt MATH identifier: 06062727
Mathematical Reviews number (MathSciNet): MR2976479

2013 © Institute of Mathematical Statistics
