Given two sequences over a finite alphabet, the D2 statistic is the number of m-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the D2 statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For k<m, we look at the count of m-letter word matches with up to k mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance, and prove that its distribution is asymptotically normal.
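To make the definition concrete, the following is a minimal Python sketch of the generalized count, not the authors' implementation. The function names `d2_mismatch_count` and `expected_count` are hypothetical. The expectation helper assumes that both sequences are i.i.d. (Bernoulli) text drawn from the same letter distribution, so that two letters agree with probability p = sum of squared letter probabilities, and it follows from linearity of expectation alone; it is an illustration under that assumption rather than the paper's stated formula.

```python
from math import comb


def d2_mismatch_count(a: str, b: str, m: int, k: int) -> int:
    """Count pairs (i, j) such that the m-letter words a[i:i+m] and
    b[j:j+m] differ in at most k positions. With k = 0 this is the
    ordinary D2 statistic (exact m-word matches)."""
    if not 0 <= k < m:
        raise ValueError("require 0 <= k < m")
    count = 0
    for i in range(len(a) - m + 1):
        for j in range(len(b) - m + 1):
            # Hamming distance between the two m-letter words
            mismatches = sum(x != y for x, y in zip(a[i:i + m], b[j:j + m]))
            if mismatches <= k:
                count += 1
    return count


def expected_count(n1: int, n2: int, m: int, k: int, probs: dict) -> float:
    """Expected count for two independent i.i.d. sequences of lengths
    n1 and n2 with letter probabilities `probs` (assumption: both
    sequences share the same letter distribution)."""
    p = sum(q * q for q in probs.values())  # P(two random letters agree)
    # P(at most k mismatches among m independent letter comparisons)
    tail = sum(comb(m, t) * (1 - p) ** t * p ** (m - t) for t in range(k + 1))
    return max(0, n1 - m + 1) * max(0, n2 - m + 1) * tail


if __name__ == "__main__":
    print(d2_mismatch_count("ACGT", "ACGA", m=2, k=0))  # 2
    print(d2_mismatch_count("ACGT", "ACGA", m=2, k=1))  # 3
    # Strand symmetric letter distribution: P(A) = P(T) and P(C) = P(G)
    probs = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}
    print(expected_count(100, 100, m=8, k=2, probs=probs))
```

The double loop costs O(n1 · n2 · m) time, which is fine for illustrating the definition but too slow for database-scale searches.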