Annals of Applied Statistics

Estimating the occurrence rate of DNA palindromes

I-Ping Tu, Shao-Hsuan Wang, and Yuan-Fu Huang


Abstract

A DNA palindrome is a segment of letters along a DNA sequence with inversion symmetry: one strand is identical to its complementary strand read in the opposite direction. Searching for nonrandom clusters of DNA palindromes, an interesting bioinformatics problem, relies on estimating the null palindrome occurrence rate. The most commonly used approach for estimating this rate is the average rate method. However, we observed that the average rate could exceed the actual rate by 50% when 5000 bp hot-spot regions with a 15-fold rate were inserted into a simulated 150,000 bp genome sequence. Here, we propose a Markov-based estimator that avoids counting the number of palindromes directly and thus reduces the impact of the hot-spots. Our simulation shows that this method is more robust against the hot-spot effect than the average rate method. Furthermore, this method can be generalized to either a higher order Markov model or a segmented Markov model, and extended to calculate the occurrence rate for palindromes with gaps. We also provide a $p$-value approximation for various scan statistics to test for nonrandom palindrome clusters under a Markov model.
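
To make the contrast in the abstract concrete, the following Python sketch compares a direct-count (average rate) estimate with a model-based one. It is an illustration under stated assumptions, not the authors' estimator, whose exact form is given in the paper rather than in this abstract. The function names `is_palindrome`, `average_rate`, and `markov_rate` are hypothetical. The sketch assumes an uppercase sequence over A, C, G, T, a first-order Markov chain, palindromes of a fixed even length `2 * half_len` with no gap, and empirical base frequencies as a stand-in for the stationary distribution.

```python
from collections import Counter

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_palindrome(segment):
    """True if the segment equals its own reverse complement."""
    return all(COMPLEMENT[a] == b for a, b in zip(segment, reversed(segment)))


def average_rate(seq, half_len):
    """Fraction of windows of width 2 * half_len that are palindromes."""
    width = 2 * half_len
    n_windows = len(seq) - width + 1
    hits = sum(is_palindrome(seq[i:i + width]) for i in range(n_windows))
    return hits / n_windows


def markov_rate(seq, half_len):
    """Palindrome probability per window under a fitted first-order chain."""
    base_counts = Counter(seq)
    pair_counts = Counter(zip(seq, seq[1:]))
    # Assumption: empirical base frequencies stand in for the stationary law.
    pi = {a: base_counts[a] / len(seq) for a in BASES}
    # Maximum-likelihood transitions: P[a, b] = P(next = b | current = a).
    P = {}
    for a in BASES:
        row_total = sum(pair_counts[a, b] for b in BASES)
        for b in BASES:
            P[a, b] = pair_counts[a, b] / row_total if row_total else 0.0
    # v[a]: probability that the rest of the window completes a palindrome,
    # given the current left-arm letter a. At the centre, a must be followed
    # by its complement.
    v = {a: P[a, COMPLEMENT[a]] for a in BASES}
    for _ in range(half_len - 1):
        # One letter outward: the step a -> b on the left arm is mirrored by
        # the step comp(b) -> comp(a) on the right arm.
        v = {
            a: sum(P[a, b] * P[COMPLEMENT[b], COMPLEMENT[a]] * v[b] for b in BASES)
            for a in BASES
        }
    return sum(pi[a] * v[a] for a in BASES)


toy = "ACGT" * 50
print(average_rate(toy, 2), markov_rate(toy, 2))
```

The direct count depends on how many palindromes happen to fall in the sequence, so a localized hot-spot inflates it. The model-based value depends only on base and dinucleotide frequencies, which a short hot-spot region perturbs much less. That is the intuition behind the robustness claim, though the sketch does not reproduce the paper's simulations.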

Article information

Source
Ann. Appl. Stat., Volume 7, Number 2 (2013), 1095-1110.

Dates
First available in Project Euclid: 27 June 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1372338480

Digital Object Identifier
doi:10.1214/12-AOAS622

Mathematical Reviews number (MathSciNet)
MR3113502

Zentralblatt MATH identifier
1288.62170

Keywords
DNA palindrome; genome sequence; hairpin structure; higher order Markov model; hot-spot; Markov model; occurrence rate; Poisson process; power; $p$-value; segmented Markov model

Citation

Tu, I-Ping; Wang, Shao-Hsuan; Huang, Yuan-Fu. Estimating the occurrence rate of DNA palindromes. Ann. Appl. Stat. 7 (2013), no. 2, 1095–1110. doi:10.1214/12-AOAS622. https://projecteuclid.org/euclid.aoas/1372338480




Supplemental materials

  • Supplementary material A: Appendix for the paper “Estimating the occurrence rate of DNA palindromes”. Supplement A contains the technical proofs of the theorems and corollaries in this paper.
  • Supplementary material B: Matlab scripts for the paper “Estimating the occurrence rate of DNA palindromes”. The Matlab scripts for calculating the thresholds derived in Theorem 4 are provided. Instructions are in the file “README.txt”.