Open Access
An Extreme Value Theory for Sequence Matching
Richard Arratia, Louis Gordon, Michael Waterman
Ann. Statist. 14(3): 971-993 (September, 1986). DOI: 10.1214/aos/1176350045

Abstract

Consider finite sequences $X_1, X_2, \cdots, X_m$ and $Y_1, Y_2, \cdots, Y_n$, where the letters $\{X_i\}$ and $\{Y_i\}$ are chosen i.i.d. on a countable alphabet with $p = P(X_1 = Y_1) \in (0,1)$. We study the distribution of the longest contiguous run of matches between the $X$'s and $Y$'s allowing at most $k$ mismatches. The distribution is closely approximated by that of the maximum of $(1-p)mn$ i.i.d. negative binomial random variables. The latter distribution is in turn shown to behave like the integer part of an extreme value distribution. The expectation is approximately $\log(qmn)+k\log\log(qmn)+k\log(q/p)-\log(k!)+\gamma\log(e)-\frac{1}{2}$, where $q = 1 - p$, $\log$ denotes logarithm base $1/p$, and $\gamma$ is the Euler constant. The variance is approximated by $(\pi\log(e))^2/6+\frac{1}{12}$. The paper concludes with an example in which we compare segments taken from the DNA sequence of the bacteriophage lambda.
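
As an illustration only, the expectation formula quoted in the abstract can be evaluated numerically, and the matching statistic can be computed by brute force for short sequences. The sketch below is not taken from the paper: the function names `expected_longest_match` and `longest_match` are hypothetical, and it assumes that the length of a run counts every position in the aligned window, mismatched positions included. The formula is an asymptotic approximation, so small inputs should not be expected to agree closely with it.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant, the gamma in the abstract


def expected_longest_match(m, n, p, k=0):
    """Evaluate the approximate expectation stated in the abstract.

    All logarithms are taken to base 1/p, as the abstract specifies.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in (0, 1)")
    if k < 0:
        raise ValueError("k must be a non-negative integer")
    q = 1.0 - p
    if q * m * n <= 1.0:
        raise ValueError("need q*m*n > 1 so that log log(qmn) is defined")
    ln_base = math.log(1.0 / p)

    def log_b(x):
        # Logarithm to base 1/p.
        return math.log(x) / ln_base

    main = log_b(q * m * n)
    mean = main + EULER_GAMMA * log_b(math.e) - 0.5
    if k > 0:
        mean += k * log_b(main) + k * log_b(q / p) - log_b(math.factorial(k))
    return mean


def longest_match(x, y, k=0):
    """Length of the longest aligned window x[i:i+L], y[j:j+L] that
    contains at most k mismatches (assumption: mismatches count
    toward the length). Uses a sliding window on each diagonal."""
    m, n = len(x), len(y)
    best = 0
    for d in range(-(n - 1), m):      # d = i - j indexes a diagonal
        i = max(d, 0)
        j = i - d
        length = min(m - i, n - j)
        start = 0                     # left edge of the current window
        mismatches = 0
        for t in range(length):
            if x[i + t] != y[j + t]:
                mismatches += 1
            while mismatches > k:     # shrink until the window is valid
                if x[i + start] != y[j + start]:
                    mismatches -= 1
                start += 1
            best = max(best, t - start + 1)
    return best
```

For example, `expected_longest_match(1000, 1000, 0.25, k=1)` evaluates the approximation for two hypothetical sequences of length 1000 over a uniform four-letter alphabet, and `longest_match("ABCDE", "ABXDE", k=1)` scans two short strings directly.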

Citation

Richard Arratia, Louis Gordon, Michael Waterman. "An Extreme Value Theory for Sequence Matching." Ann. Statist. 14(3): 971-993, September, 1986. https://doi.org/10.1214/aos/1176350045

Information

Published: September, 1986
First available in Project Euclid: 12 April 2007

zbMATH: 0602.62015
MathSciNet: MR856801
Digital Object Identifier: 10.1214/aos/1176350045

Subjects:
Primary: 62E20
Secondary: 62P10

Keywords: DNA sequences, extreme value, inclusion-exclusion, matching, Poisson process

Rights: Copyright © 1986 Institute of Mathematical Statistics

Vol.14 • No. 3 • September, 1986