Source: Adv. in Appl. Probab. Volume 38, Number 1 (2006), 116-133.
In DNA sequences, specific words may take on biological functions
as marker or signalling sequences. Such words can often be identified
by frequent-word analyses because they are particularly abundant.
Accurate statistics are needed to assess the significance of
these word frequencies. The set of shuffled sequences (letter
sequences having the same k-word composition, for some
choice of k, as the sequence being analysed) is considered
the most appropriate sample space for analysing word counts.
However, little is known about word counts in this sample space.
Here we present exact formulae for word counts in shuffled sequences.
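
To make the sample space concrete, the following is a minimal brute-force sketch in Python. It is not the exact formulae of the paper: it enumerates every rearrangement of a short sequence and keeps those with the same k-word composition, which is feasible only for very short sequences. The function names (kword_composition, shuffled_sequences, word_count) and the example sequence are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import permutations


def kword_composition(seq, k):
    """Count every overlapping k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shuffled_sequences(seq, k):
    """Return all distinct rearrangements of seq that share its k-word composition.

    Brute force over all permutations, so only usable for short sequences.
    """
    target = kword_composition(seq, k)
    candidates = {"".join(p) for p in permutations(seq)}
    return sorted(s for s in candidates if kword_composition(s, k) == target)


def word_count(seq, word):
    """Count overlapping occurrences of word in seq."""
    return sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))


if __name__ == "__main__":
    seq, k, word = "ACAACG", 2, "CAC"
    for s in shuffled_sequences(seq, k):
        # Each s has the same 2-word composition as seq; the count of word may differ.
        print(s, word_count(s, word))
```

For the sequence ACAACG with k = 2, the sample space contains just two sequences, AACACG and ACAACG. The word CAC occurs once in the first and not at all in the second, so the count of a longer word can vary across shuffled sequences even though the 2-word composition is fixed.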