Abstract
Consider finite sequences $X_1, X_2,\cdots, X_m$ and $Y_1, Y_2,\cdots, Y_n$ where the letters ${X_i}$ and ${Y_i}$ are chosen i.i.d. on a countable alphabet with $p=P(X_1=Y_1)\in(0,1)$ We study the distribution of the longest contiguous run of matches between the X's and Y's allowing at most k mismatches. The distribution is closely approximated by that of the maximum of (1 - p)mn i.i.d. negative binomial random variables. The latter distribution is in turn shown to behave like the integer part of an extreme value distribution. The expectation is approximately $\log(qmn)+k\log\log(qmn)+k\log(q/p)-\log(k!)+\gamma\log(e)-\frac{1}{2}$, where q = 1 - p, log denotes logarithm base 1/p, and y is the Euler constant. The variance is approximated by $(\pi\log(e))2/6+\frac{1}{2}$. The paper concludes with an example in which we compare segments taken from the DNA sequence of the bacteriophage lambda.
Citation
Richard Arratia. Louis Gordon. Michael Waterman. "An Extreme Value Theory for Sequence Matching." Ann. Statist. 14 (3) 971 - 993, September, 1986. https://doi.org/10.1214/aos/1176350045
Information