An Extreme Value Theory for Sequence Matching
Richard Arratia, Louis Gordon, Michael Waterman
Ann. Statist. 14(3): 971-993 (September, 1986). DOI: 10.1214/aos/1176350045


Consider finite sequences $X_1, X_2,\cdots, X_m$ and $Y_1, Y_2,\cdots, Y_n$, where the letters $\{X_i\}$ and $\{Y_i\}$ are chosen i.i.d. on a countable alphabet with $p=P(X_1=Y_1)\in(0,1)$. We study the distribution of the longest contiguous run of matches between the $X$'s and $Y$'s, allowing at most $k$ mismatches. The distribution is closely approximated by that of the maximum of $(1-p)mn$ i.i.d. negative binomial random variables. The latter distribution is in turn shown to behave like the integer part of an extreme value distribution. The expectation is approximately $\log(qmn)+k\log\log(qmn)+k\log(q/p)-\log(k!)+\gamma\log(e)-\frac{1}{2}$, where $q = 1-p$, $\log$ denotes logarithm base $1/p$, and $\gamma$ is the Euler constant. The variance is approximated by $(\pi\log(e))^2/6+\frac{1}{12}$. The paper concludes with an example in which we compare segments taken from the DNA sequence of the bacteriophage lambda.
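
The approximate expectation quoted in the abstract can be evaluated numerically. The following is a minimal sketch, not code from the paper: the function name `expected_longest_match` and the inputs in the usage line (a uniform four-letter alphabet with $p = 1/4$ and $m = n = 1000$) are illustrative assumptions. It only transcribes the formula, with logarithms taken to base $1/p$ as the abstract specifies.

```python
import math

# Euler's constant (gamma in the abstract's formula).
EULER_GAMMA = 0.5772156649015329


def expected_longest_match(m, n, p, k=0):
    """Approximate expected length of the longest matching run between
    sequences of lengths m and n, allowing at most k mismatches.

    Hypothetical helper: it transcribes the approximation stated in the
    abstract and does not check how accurate that approximation is.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if k < 0:
        raise ValueError("k must be a non-negative integer")

    q = 1.0 - p
    ln_base = math.log(1.0 / p)  # natural log of the base 1/p

    def log(x):
        # Logarithm to base 1/p, as in the abstract.
        return math.log(x) / ln_base

    qmn = q * m * n
    if qmn <= 1.0:
        # log(log(qmn)) is undefined unless log(qmn) > 0.
        raise ValueError("q*m*n must exceed 1 for the formula to apply")

    return (
        log(qmn)
        + k * log(log(qmn))
        + k * log(q / p)
        - log(math.factorial(k))
        + EULER_GAMMA * log(math.e)
        - 0.5
    )


# Illustrative inputs only; these are not the values used in the paper.
print(expected_longest_match(1000, 1000, 0.25, k=0))
print(expected_longest_match(1000, 1000, 0.25, k=1))
```

Because the leading term is $\log(qmn)$, the expected match length grows only logarithmically in the product of the sequence lengths, and each permitted mismatch adds roughly $\log\log(qmn)+\log(q/p)$ to it.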



Richard Arratia. Louis Gordon. Michael Waterman. "An Extreme Value Theory for Sequence Matching." Ann. Statist. 14 (3) 971 - 993, September, 1986.


Published: September, 1986
First available in Project Euclid: 12 April 2007

zbMATH: 0602.62015
MathSciNet: MR856801
Digital Object Identifier: 10.1214/aos/1176350045

Primary: 62E20
Secondary: 62P10

Rights: Copyright © 1986 Institute of Mathematical Statistics


Vol.14 • No. 3 • September, 1986