The Annals of Statistics

Estimating the Gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times

Yonil Park, Sergey Sheetlin, and John L. Spouge


Abstract

The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.
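
As a rough illustration of importance sampling with a stopping time, the following Python sketch treats a much simpler problem than the one in the article: a one-dimensional random walk with Gaussian steps N(-mu, 1), not gapped alignment. The walk, the exponentially tilted trial distribution N(+mu, 1), and all function names are assumptions chosen for illustration; the article's ladder variates, mapping theorem, and dynamic programming algorithm are not reproduced here. Each simulated walk stops when it first reaches level y, and its weight is the likelihood ratio exp(-2 mu S_tau). For this toy walk the probability of ever reaching y decays like exp(-lambda y) with lambda = 2 mu, which plays the role of the scale parameter.

```python
import math
import random


def tilted_hitting_weight(y, mu, rng):
    """Run one walk under the trial (tilted) law until it first reaches y.

    Target law: steps ~ N(-mu, 1), so the walk drifts down and rarely hits y.
    Trial law:  steps ~ N(+mu, 1), so the walk hits y with probability 1.
    Per-step likelihood ratio target/trial is exp(-2*mu*x), so the weight
    at the stopping time tau is exp(-2*mu*S_tau).
    """
    theta = 2.0 * mu
    s = 0.0
    while s < y:                 # stopping time: first step with S >= y
        s += rng.gauss(mu, 1.0)  # sample the step from the trial law
    return math.exp(-theta * s)


def estimate_tail(y, mu, n_samples, seed=0):
    """Importance sampling estimate of P(walk ever reaches y) under the target law."""
    rng = random.Random(seed)
    total = sum(tilted_hitting_weight(y, mu, rng) for _ in range(n_samples))
    return total / n_samples


def estimate_scale(y_low, y_high, mu, n_samples, seed=0):
    """Estimate the exponential decay rate from tail estimates at two levels."""
    p_low = estimate_tail(y_low, mu, n_samples, seed)
    p_high = estimate_tail(y_high, mu, n_samples, seed + 1)
    return -(math.log(p_high) - math.log(p_low)) / (y_high - y_low)


if __name__ == "__main__":
    mu = 0.5  # hypothetical drift, chosen only for the demonstration
    print("tail estimate at y=10:", estimate_tail(10.0, mu, 2000))
    print("scale estimate:", estimate_scale(5.0, 10.0, mu, 2000))
```

With mu = 0.5, the slope estimated from two levels such as y = 5 and y = 10 should fall near 2 mu = 1, up to Monte Carlo error. Because every weight is at most exp(-2 mu y), the estimator has bounded relative error, which is the usual reason for tilting toward the rare event.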

Article information

Source
Ann. Statist. Volume 37, Number 6A (2009), 3697-3714.

Dates
First available in Project Euclid: 17 August 2009

Permanent link to this document
http://projecteuclid.org/euclid.aos/1250515402

Digital Object Identifier
doi:10.1214/08-AOS663

Mathematical Reviews number (MathSciNet)
MR2549575

Zentralblatt MATH identifier
05644295

Subjects
Primary: 62M99: None of the above, but in this section
Secondary: 92-08: Computational methods

Keywords
Gumbel scale parameter estimation; gapped sequence alignment; importance sampling; stopping time; Markov renewal process; Markov additive process

Citation

Park, Yonil; Sheetlin, Sergey; Spouge, John L. Estimating the Gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times. Ann. Statist. 37 (2009), no. 6A, 3697–3714. doi:10.1214/08-AOS663. http://projecteuclid.org/euclid.aos/1250515402.


