Approximate word matches between two random sequences



The Annals of Applied Probability

Conrad J. Burden, Miriam R. Kantorovitz, and Susan R. Wilson

Source: Ann. Appl. Probab. Volume 18, Number 1 (2008), 1-21.

Abstract

Given two sequences over a finite alphabet $\mathcal{L}$, the $D_2$ statistic is the number of $m$-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the $D_2$ statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For $k<m$, we look at the count of $m$-letter word matches with up to $k$ mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance and prove its distribution is asymptotically normal.
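
The statistics described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `d2`, `d2_approx` and `expected_d2_approx` are hypothetical, matches are counted over all pairs of word positions (overlapping words included), and the expectation assumes both sequences are independent with i.i.d. letters drawn from the same distribution, so that each aligned letter pair matches independently with probability equal to the sum of squared letter probabilities. A strand symmetric distribution, with P(A) = P(T) and P(C) = P(G), is one case this covers.

```python
from collections import Counter
from math import comb


def d2(seq_a: str, seq_b: str, m: int) -> int:
    """Count pairs of positions (i, j) whose m-letter words match exactly."""
    words_a = Counter(seq_a[i:i + m] for i in range(len(seq_a) - m + 1))
    words_b = Counter(seq_b[j:j + m] for j in range(len(seq_b) - m + 1))
    # A word seen c_a times in seq_a and c_b times in seq_b gives c_a * c_b pairs.
    return sum(count * words_b[word] for word, count in words_a.items())


def d2_approx(seq_a: str, seq_b: str, m: int, k: int) -> int:
    """Count pairs (i, j) whose m-letter words differ in at most k positions.

    Brute force over all position pairs; intended for illustration only.
    """
    total = 0
    for i in range(len(seq_a) - m + 1):
        word_a = seq_a[i:i + m]
        for j in range(len(seq_b) - m + 1):
            word_b = seq_b[j:j + m]
            mismatches = sum(x != y for x, y in zip(word_a, word_b))
            if mismatches <= k:
                total += 1
    return total


def expected_d2_approx(n_a: int, n_b: int, m: int, k: int,
                       letter_probs: dict[str, float]) -> float:
    """Expected count under the i.i.d. letter assumption stated above.

    p2 is the chance that two independent letters agree; the number of
    mismatches in one word pair is then Binomial(m, 1 - p2).
    """
    p2 = sum(p * p for p in letter_probs.values())
    p_word = sum(comb(m, t) * p2 ** (m - t) * (1 - p2) ** t
                 for t in range(k + 1))
    return (n_a - m + 1) * (n_b - m + 1) * p_word


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(d2("ACGT", "ACGA", 2))
    print(d2_approx("ACGT", "ACGA", 2, 1))
    print(expected_d2_approx(4, 4, 2, 1, uniform))
```

Setting `k = 0` in `d2_approx` reproduces the exact-match count from `d2`. The brute-force loop costs on the order of the product of the two sequence lengths times `m`, which is adequate for short examples but not for database-scale searches.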

Primary Subjects: 60F17, 92D20
Keywords: DNA sequences; sequence comparison; word matches

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoap/1199890013
Digital Object Identifier: doi:10.1214/07-AAP452
Mathematical Reviews number (MathSciNet): MR2380889

2009 © Institute of Mathematical Statistics