The Annals of Statistics

Approximate $p$-values for local sequence alignments

David Siegmund and Benjamin Yakir


Abstract

Assume that two sequences from a finite alphabet are optimally aligned according to a scoring system that rewards similarities according to a general scoring scheme and penalizes gaps (insertions and deletions). Under the assumption that the letters in each sequence are independent and identically distributed and the two sequences are also independent, approximate $p$-values are obtained for the optimal local alignment when either (i) there are at most a fixed number of gaps, or (ii) the gap initiation cost is sufficiently large. In the latter case the approximation can be written in the same form as the well-known case of ungapped alignments.
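As a rough illustration of the quantities in the abstract, the following Python sketch computes an optimal local alignment score with a gap initiation cost and a per-letter gap cost, then evaluates a $p$-value of the form $1 - \exp(-K m n e^{-\lambda x})$ that is standard for ungapped alignments. The function names, the default scores, and the affine form of the gap cost are assumptions made for illustration. The constants `lam` and `K` are treated as given inputs; the abstract does not state them, and this sketch does not reproduce the authors' derivation.

```python
import math


def local_alignment_score(a, b, match=1.0, mismatch=-1.0,
                          gap_open=3.0, gap_extend=1.0):
    """Best local alignment score of sequences a and b.

    Illustrative dynamic program (Smith-Waterman with affine gaps):
    a gap of length k costs gap_open + k * gap_extend (an assumed form).
    """
    n, m = len(a), len(b)
    neg = float("-inf")
    h_prev = [0.0] * (m + 1)   # best scores for alignments ending in row i-1
    f_prev = [neg] * (m + 1)   # same, but ending with a gap in b
    best = 0.0
    for i in range(1, n + 1):
        h_cur = [0.0] * (m + 1)
        f_cur = [neg] * (m + 1)
        e = neg                # best score ending with a gap in a
        for j in range(1, m + 1):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open - gap_extend)
            f_cur[j] = max(f_prev[j] - gap_extend,
                           h_prev[j] - gap_open - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            h_cur[j] = max(0.0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def approx_p_value(score, m, n, lam, K):
    """Evaluate 1 - exp(-K * m * n * exp(-lam * score)).

    lam and K are assumed to be supplied by the caller; they depend on
    the scoring scheme and letter distribution and are not derived here.
    """
    return -math.expm1(-K * m * n * math.exp(-lam * score))
```

For example, `approx_p_value(local_alignment_score(x, y), len(x), len(y), lam, K)` would give an approximate $p$-value for two sequences `x` and `y`, provided `lam` and `K` are appropriate for the scoring scheme in use.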

Article information

Source
Ann. Statist. Volume 28, Number 3 (2000), 657-680.

Dates
First available in Project Euclid: 12 March 2002

Permanent link to this document
http://projecteuclid.org/euclid.aos/1015951993

Digital Object Identifier
doi:10.1214/aos/1015951993

Mathematical Reviews number (MathSciNet)
MR1792782

Zentralblatt MATH identifier
1105.62377

Subjects
Primary: 62M40: Random fields; image analysis
Secondary: 92D10: Genetics {For genetic algebras, see 17D92}

Keywords
Sequence alignment; $p$-value; gaps; large deviations

Citation

Siegmund, David; Yakir, Benjamin. Approximate $p$-values for local sequence alignments. Ann. Statist. 28 (2000), no. 3, 657-680. doi:10.1214/aos/1015951993. http://projecteuclid.org/euclid.aos/1015951993.



References

  • Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool. J. Molecular Biol. 215 403-410.
  • Altschul, S. F. and Gish, W. (1996). Local alignment statistics. Methods in Enzymology 266 460-480.
  • Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25 3389-3402.
  • Arratia, R., Goldstein, L. and Gordon, L. (1989). Two moments suffice for Poisson approximation: the Chen-Stein method. Ann. Probab. 17 9-25.
  • Arratia, R., Gordon, L. and Waterman, M. S. (1990). The Erdős-Rényi Law in distribution for coin tossing and sequence matching. Ann. Statist. 18 539-570.
  • Asmussen, S. (1989). Risk theory in a Markovian environment. Scand. Actuarial J. 69-100.
  • Athreya, K. B., McDonald, D. and Ney, P. (1978). Limit theorems for semi-Markov processes and renewal theory for Markov chains. Ann. Probab. 6 788-797.
  • Chung, K. L. (1974). A Course in Probability Theory. Academic Press, New York.
  • Dembo, A., Karlin, S. and Zeitouni, O. (1994). Limit distribution of maximal non-aligned two-sequence segmental score. Ann. Probab. 22 2022-2039.
  • Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998). Biological Sequence Analysis. Cambridge Univ. Press.
  • Durrett, R. (1990). Probability: Theory and Examples. Duxbury Press, Belmont, CA.
  • Hogan, M. and Siegmund, D. (1986). Large deviations for the maximum of some random fields. Adv. in Appl. Math. 7 2-22.
  • Karlin, S. and Dembo, A. (1992). Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. in Appl. Probab. 24 113-140.
  • Lezaud, P. (1998). Chernoff-type bound for finite Markov chains. Ann. Appl. Probab. 8 849-867.
  • Mott, R. and Tribe, R. (1999). Approximate statistics of gapped alignments. J. Comput. Biol. 6 91-112.
  • Neuhauser, C. (1994). A Poisson approximation for sequence comparisons with insertions and deletions. Ann. Statist. 22 1603-1629.
  • Pearson, W. R. (1995). Comparison of methods for searching protein databases. Protein Sci. 4 1145-1160.
  • Siegmund, D. (1985). Sequential Analysis: Tests and Confidence Intervals. Springer, New York.
  • Siegmund, D. and Yakir, B. (2000). Tail probabilities for the null distribution of scanning statistics. Bernoulli 6 191-213.
  • Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J. Molecular Biol. 147 195-197.
  • Waterman, M. (1995). Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman and Hall, London.
  • Waterman, M. and Vingron, M. (1994). Sequence comparison and Poisson approximation. Statist. Sci. 9 367-381.
  • Williams, D. (1991). Probability and Martingales. Cambridge Univ. Press.
  • Woodroofe, M. (1982). Nonlinear Renewal Theory in Sequential Analysis. SIAM, Philadelphia.
  • Yakir, B. and Pollak, M. (1998). A new representation for a renewal-theoretic constant appearing in asymptotic approximations of large deviations. Ann. Appl. Probab. 8 749-774.