Advances in Applied Probability

Exact distribution of word counts in shuffled sequences

Einar Andreas Rødland
Source: Adv. in Appl. Probab. Volume 38, Number 1 (2006), 116-133.

Abstract

In DNA sequences, specific words may take on biological functions as marker or signalling sequences. These may often be identified by frequent-word analyses as being particularly abundant. Accurate statistics are needed to assess the statistical significance of these word frequencies. The set of shuffled sequences (letter sequences having the same k-word composition, for some choice of k, as the sequence being analysed) is considered the most appropriate sample space for analysing word counts. However, little is known about these word counts. Here we present exact formulae for word counts in shuffled sequences.
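
As an illustration of what "the same k-word composition" means, the following minimal Python sketch counts the overlapping k-words of a sequence and checks whether two sequences share those counts. It is not taken from the paper: the function names `kword_composition` and `same_kword_composition` are hypothetical, and counting overlapping words along a linear (non-circular) sequence is an assumption made here for illustration.

```python
from collections import Counter


def kword_composition(seq, k):
    """Count every overlapping k-letter word in seq (assumed linear, not circular)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # A sequence shorter than k contains no k-words, so the Counter is empty.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def same_kword_composition(a, b, k):
    """Return True if a and b contain each k-word the same number of times."""
    return kword_composition(a, k) == kword_composition(b, k)


# "AACA" and "ACAA" both contain AA, AC and CA exactly once,
# so each is a shuffled version of the other for k = 2.
print(same_kword_composition("AACA", "ACAA", 2))
# Their 3-word counts differ (AAC, ACA versus ACA, CAA).
print(same_kword_composition("AACA", "ACAA", 3))
```

Under these assumptions, the sample space for a given sequence and a chosen k would be the set of all sequences for which `same_kword_composition` returns True. The sketch only tests membership; it does not enumerate that set or compute the word-count distribution the paper derives.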

Primary Subjects: 60C05
Secondary Subjects: 05A15, 60J10, 60J20, 62E15
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aap/1143936143
Digital Object Identifier: doi:10.1239/aap/1143936143
Zentralblatt MATH identifier: 05033689
Mathematical Reviews number (MathSciNet): MR2213967

2012 © Applied Probability Trust
