finite dictionary in random sequences of letters

In this paper we study a classical model concerning the occurrence of words in a random sequence of letters from an alphabet. The problem can be studied as a game among $(m+1)$ words: the winning word in this game is the one that occurs first. We prove that the knowledge of the first $m$ words results in an advantage in the construction of the last word, as has been shown in the literature for the cases $m=1$ and $m=2$ [CZ1,CZ2]. The last word can in fact be constructed so that its probability of winning is strictly larger than $1/(m+1)$. For the latter probability we give an explicit lower bound. Our method is based on rather general probabilistic arguments that allow us to consider an arbitrary cardinality for the alphabet, an arbitrary value for $m$ and different mechanisms generating the random sequence of letters.


Introduction
The theme of the occurrence of words in random sequences of letters from an alphabet is a rather classical one in discrete probability. The related literature has a long tradition, and papers with new insights and deep results continue to appear from time to time.
This topic has, among others, the following interesting aspects: it has a number of important applications, and it is characterized by surprising results which, at first glance, can sometimes appear even contradictory. Feller's book is a starting point for the study of the occurrence of words in a Bernoulli scheme [3]. Different types of interesting problems arise in this field and many important papers have appeared in the related literature; see in particular [4,1,2,5,6,7] and references cited therein.
One interesting problem considers a finite set of given words, a dictionary, and concerns the probability that a fixed word occurs first. This problem can be seen as a game among different words, where the winner is the word which occurs first.
In this respect, given $m$ words $w_1, \dots, w_m$, we construct a word $w_{m+1}$ such that its probability of winning is larger than $1/(m+1)$.
In [1] the case of two competing words (i.e. $m = 1$) on a binary alphabet has been considered. In [2], the analysis has been extended in a thorough way to the case of three words (i.e. $m = 2$).
Provided that the length of the words is sufficiently large, and by introducing suitable probabilistic arguments, we solve the problem for an arbitrary value of $m$ and for an alphabet of arbitrary size. In particular we provide an explicit lower bound for the probability of first occurrence for the constructed word $w_{m+1}$.
In the next section, we introduce some useful notation to formalize our result in Theorem 2.3. Then we give our constructive proof after presenting the preliminary Lemmas 2.4-2.6. Section 3 is devoted to a short discussion containing some comments and concluding remarks.

Construction of efficient words and probability of winning
Let $A_N = \{a_1, \dots, a_N\}$ be an alphabet composed of $N$ distinct letters. We consider $(m+1)$ words $w_1, \dots, w_{m+1}$ of a fixed length $k$, i.e. $(m+1)$ elements of $A_N^k$, and let $W_{m+1} = \{w_1, \dots, w_{m+1}\}$. We write $w_{i,l}$ for the $i$-th letter of the word $w_l$ and we say that the word $(w_{i,l}, \dots, w_{j,l})$, for $1 \le i \le j \le k$, is a sub-word of $w_l$.
At any instant $n = 1, 2, \dots$ a letter is drawn from the alphabet $A_N$. Drawings are supposed to be independent and uniformly distributed over $A_N$. We define the space $\Omega = A_N^{\mathbb{N}}$; for $\omega = (\omega_1, \omega_2, \dots) \in \Omega$, we refer to $\omega_n$ as the letter at time $n \in \mathbb{N}$. The probability measure on $\Omega$ is then the product measure that, at any drawing, assigns probability $1/N$ to each letter of $A_N$:
$$P(\omega_n = a_i) = \frac{1}{N}, \qquad i = 1, \dots, N, \quad n \in \mathbb{N}.$$
We now consider a game that ends at the random time $R_1$, where
$$R_1 = \inf\{n \ge k : (\omega_{n-k+1}, \dots, \omega_n) \in W_{m+1}\}, \tag{2.1}$$
and the winner is the word $w_l$ such that
$$(\omega_{R_1-k+1}, \dots, \omega_{R_1}) = w_l. \tag{2.2}$$
Next we define the events
$$E_l = \{(\omega_{R_1-k+1}, \dots, \omega_{R_1}) = w_l\}, \qquad l = 1, \dots, m+1. \tag{2.3}$$
Hence, the event $E_l$ means that $w_l$ occurs first within $W_{m+1}$. We assume that $w_{m+1}$ can be chosen as a function of the other words $w_1, \dots, w_m$ and we show that this can be done in such a way that the winning probability of $w_{m+1}$ is greater than $\frac{1}{m+1}$. For this purpose it is convenient to assume that drawings of letters go on indefinitely also beyond time $R_1$, so that, a.s., we will have an infinite number of games.
Let us introduce the random variables $V_{l,n}$, for each $l = 1, \dots, m+1$ and $n \in \mathbb{N}$, as the number of wins of the word $w_l$ up to time $n$. Obviously the probability law of $V_{l,n}$ also depends on the ordered sequence $(w_1, \dots, w_{m+1})$; however, these words are fixed once and for all and, for brevity, we omit this dependence. Recursively define
$$R_{h+1} = \inf\{n \ge R_h + k : (\omega_{n-k+1}, \dots, \omega_n) \in W_{m+1}\} \tag{2.4}$$
for $h = 1, 2, \dots$, where $R_1$ is the random variable defined in (2.1). Thus, by using this notation, the random variables $V_{1,n}, \dots, V_{m+1,n}$ can be more formally defined as
$$V_{l,n} = \sum_{h \ge 1} \mathbf{1}\{R_h \le n,\ (\omega_{R_h-k+1}, \dots, \omega_{R_h}) = w_l\} \tag{2.5}$$
for $l = 1, \dots, m+1$ and $n = 1, 2, \dots$. Let moreover $R = \{R_h : h \in \mathbb{N}\}$ and define the random variable $N_{l,n}$ as the number of times the word $w_l$ occurs inside the interval $[0, n]$, i.e.
$$N_{l,n} = \sum_{s=k}^{n} \mathbf{1}\{(\omega_{s-k+1}, \dots, \omega_s) = w_l\}. \tag{2.6}$$
Furthermore we consider the random times
$$T_h = \inf\Big\{n : \sum_{l=1}^{m+1} N_{l,n} = h\Big\}, \qquad h \in \mathbb{N},$$
and we let $T = \{T_h : h \in \mathbb{N}\}$. We present a remark that will be useful for the proof of Lemmas 2.5-2.6.
Remark 2.1. Let $n \ge k$ be fixed. An event of the form $\{\omega \in \Omega : (\omega_{n-k+1}, \dots, \omega_n) = (w_{1,l}, \dots, w_{k,l})\}$, for $l = 1, \dots, m+1$, implies the event $\{n \in T\}$ but does not imply the event $\{n \in R\}$. In order to guarantee $\{n \in R\}$ it is sufficient (but not necessary), see definition (2.4), to exclude that, for some $s \in \{n-k+1, \dots, n-1\}$, the event $\{s \in T\}$ occurs. Notice on the other hand that, again by (2.4), the event $\{s \in R\}$ excludes the event $\{n \in R\}$ for $n = s+1, \dots, s+k-1$. In fact we cannot observe two wins at a distance less than $k$.
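The distinction between occurrences (the set $T$) and wins (the set $R$) in Remark 2.1 can be checked on a finite sequence of letters. The following is only a minimal sketch: the function name `occurrence_and_win_times` is hypothetical, and it assumes the reading of (2.4) under which a new game starts right after each win, so that two wins are at distance at least $k$.

```python
def occurrence_and_win_times(letters, words):
    """Return (T, R, N_counts, V_counts) for a finite sequence of letters.

    T: times n (1-based) at which some word of the dictionary ends.
    R: the times in T at which a game ends, i.e. the wins.
    N_counts[l]: number of occurrences of words[l] (as N_{l,n} in the text).
    V_counts[l]: number of wins of words[l] (as V_{l,n} in the text).
    """
    k = len(words[0])
    assert all(len(w) == k for w in words), "all words must share the length k"
    index = {tuple(w): l for l, w in enumerate(words)}
    T, R = [], []
    N_counts = [0] * len(words)
    V_counts = [0] * len(words)
    game_start = 0  # time of the last win (0 before the first game ends)
    for n in range(k, len(letters) + 1):
        l = index.get(tuple(letters[n - k:n]))
        if l is None:
            continue
        T.append(n)
        N_counts[l] += 1
        # Assumption: an occurrence is a win only if it lies entirely
        # inside the current game, i.e. n >= (last win) + k.
        if n - k >= game_start:
            R.append(n)
            V_counts[l] += 1
            game_start = n
    return T, R, N_counts, V_counts


T, R, N_counts, V_counts = occurrence_and_win_times("ababab", ["aba", "bab"])
print(T, R, N_counts, V_counts)
```

In this toy run every time from 3 to 6 belongs to $T$, but only the times 3 and 6 are wins, since the occurrences at times 4 and 5 overlap the game that ended at time 3.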
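By a renewal-theorem, or an ergodic-theorem, argument the limit $\lim_{n\to\infty} V_{l,n}/n$ exists almost surely for $l = 1, \dots, m+1$ and it is constant. Thus we define the quantities $q_l$ as follows:
$$q_l = \lim_{n\to\infty} \frac{V_{l,n}}{n} \quad \text{a.s.} \tag{2.8}$$
By taking into account the latter equation we see that
$$\lim_{n\to\infty} \frac{V_{l,n}}{\sum_{j=1}^{m+1} V_{j,n}} = \frac{q_l}{\sum_{j=1}^{m+1} q_j} \quad \text{a.s.} \tag{2.9}$$
Concerning the probabilities $P(E_l)$, we can also write
$$P(E_l) = \frac{q_l}{\sum_{j=1}^{m+1} q_j}. \tag{2.10}$$
The above identity is obtained by recalling (2.9) and by noticing that, by the renewal theorem or by the ergodic theorem, one must have
$$\lim_{n\to\infty} \frac{V_{l,n}}{\sum_{j=1}^{m+1} V_{j,n}} = P(E_l) \quad \text{a.s.} \tag{2.11}$$
As the main achievement of our paper we can state the following result. It is convenient first to introduce the notation
$$L = L(N, m, k) = \lceil \log_N(mk) \rceil + 1 \tag{2.12}$$
for given $N$, $m$, $k$.

Theorem 2.3. Let $k$ be such that (2.13) holds, and let $w_1, \dots, w_m \in A_N^k$ be any $m$ distinct words. Then there exists a word $w_{m+1} \in A_N^k$ such that $P(E_{m+1}) > \frac{1}{m+1}$.

The proof of Theorem 2.3 will be presented at the end of this section as a direct consequence of Lemmas 2.4, 2.5, 2.6 below.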
The interest of $P(E_{m+1}) > \frac{1}{m+1}$ becomes clear when we consider the following case: each of $(m+1)$ players bets one dollar on a word of length $k$ and the one who has chosen the winning word receives $(m+1)$ dollars. Even if the drawings are independent and the letters are equiprobable, the word $w_{m+1}$ can be constructed, for any given $w_1, \dots, w_m$, in such a way that the game is unfair, namely it is favorable to the $(m+1)$-th player.
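For small dictionaries the winning probabilities $P(E_l)$ can be computed exactly, which makes the unfairness of the betting game concrete. The sketch below is not the method of this paper: it uses a standard absorbing Markov chain on the prefixes of the words and solves the resulting linear system with exact fractions, assuming independent and uniformly distributed letters. The function name `winning_probabilities` is hypothetical.

```python
from fractions import Fraction


def winning_probabilities(words, alphabet):
    """Exact P(E_l) for distinct words of equal length, i.i.d. uniform letters."""
    k = len(words[0])
    assert all(len(w) == k for w in words) and len(set(words)) == len(words)
    # Transient states: proper prefixes of the words (including the empty one).
    prefixes = sorted({w[:i] for w in words for i in range(k)})
    pos = {s: i for i, s in enumerate(prefixes)}
    n, m1 = len(prefixes), len(words)
    p = Fraction(1, len(alphabet))
    # Augmented system (I - Q) X = B, with one right-hand side per word.
    A = [[Fraction(int(i == j)) for j in range(n)] + [Fraction(0)] * m1
         for i in range(n)]
    for s in prefixes:
        for a in alphabet:
            t = s + a
            if t in words:  # the word t has just occurred: absorption
                A[pos[s]][n + words.index(t)] += p
                continue
            while t not in pos:  # longest suffix that is still a prefix
                t = t[1:]
            A[pos[s]][pos[t]] -= p
    # Gauss-Jordan elimination with exact arithmetic.
    for c in range(n):
        piv = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[piv] = A[piv], A[c]
        A[c] = [x / A[c][c] for x in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                f = A[r][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    return A[pos[""]][n:]


print(winning_probabilities(["HHH", "THH"], "HT"))
```

On the binary alphabet with $m = 1$, the second player who answers `HHH` with `THH` wins with probability $7/8$, well above $1/2$.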
It is intuitive that, for fixed $N$ and $m$, the length $k$ of the words should be large enough. We shall see in Remark 2.7, as a consequence of (2.13), that $\log_N m$ is the appropriate order.
As mentioned, the word $w_{m+1}$ will be obtained by means of a constructive procedure. We roughly anticipate that the word $w_{m+1}$ can be constructed according to the following steps:
Step 1. The second part of $w_{m+1}$, of a suitable length $r$, must coincide with the initial part of the word $w_1$.
Step 2. The first $k - r$ letters of $w_{m+1}$ must give rise to a sub-word which does not coincide with any sub-word drawn from $w_1, \dots, w_m$ (see Lemma 2.4).
We will discover that a suitable value for $r$ is $2L = 2\lceil \log_N(mk) \rceil + 2$.
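Steps 1 and 2 can be sketched in code. This is only an illustration under stated assumptions, not the exact word of (2.14): the function name `construct_last_word` is hypothetical, $L$ is taken as the smallest integer with $N^L > mk$ (which is all the counting argument needs), and the layout of the first $k - 2L$ letters (a padding letter different from $v_1$, repeated $L$ times, then a block $(v_1, \dots, v_L)$ that is not a sub-word of $w_1, \dots, w_m$, then padding again) is a guess that requires $k \ge 4L$.

```python
from itertools import product


def construct_last_word(words, alphabet):
    """Return (w_{m+1}, L) following Steps 1-2, under the assumptions above."""
    m, k, N = len(words), len(words[0]), len(alphabet)
    assert N >= 2 and all(len(w) == k for w in words)
    # Assumption: smallest L with N**L > m*k, so an unused block must exist.
    L = 1
    while N ** L <= m * k:
        L += 1
    if k < 4 * L:
        raise ValueError("k is too small for this sketch (needs k >= 4L)")
    # All sub-words of length L of w_1, ..., w_m (fewer than m*k of them).
    seen = {w[i:i + L] for w in words for i in range(k - L + 1)}
    block = next("".join(c) for c in product(alphabet, repeat=L)
                 if "".join(c) not in seen)
    # Assumption: padding letter different from the first letter of the block.
    pad = next(a for a in alphabet if a != block[0])
    first_part = pad * L + block + pad * (k - 4 * L)   # Step 2
    return first_part + words[0][:2 * L], L            # Step 1


words = ["a" * 24, "ab" * 12]
w_last, L = construct_last_word(words, "ab")
print(L, w_last)
```

Since the first part contains a block of length $L$ that never appears in $w_1, \dots, w_m$, the first part itself cannot coincide with any sub-word of those words, as Step 2 requires.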
(2.14) Concerning such a choice, the following lemma shows that any possible matching between an initial sub-word of w m+1 and a final sub-word of any word in W m+1 must be sufficiently short.
Proof. First we prove the case $l = m+1$. For $i = k-L, \dots, k-1$, we compare the $(i+L-k+1)$-th letter of the two sub-words in (2.15). The $(i+L-k+1)$-th letter of the word on the l.h.s. is $w_{L+1,m+1} = v_1$, while the $(i+L-k+1)$-th letter of the word on the r.h.s. is $v$. Then the validity of (2.15) excludes the possibility that $k-L \le i \le k-1$.
For the words $w_{m+1}$ and $w_1$ we now respectively consider the ratios
$$\frac{N_{m+1,n}}{V_{m+1,n}}, \qquad \frac{N_{1,n}}{V_{1,n}}. \tag{2.16}$$
Proof. For $n \ge k$, let us define the events (2.17)-(2.20). Clearly, for $n \ge k$, the event $H_n \cap F_n$ means that the word $w_{m+1}$ wins at $n$. Moreover the probabilities $P(G_n \cap F_n)$ and $P(F_n)$ do not depend on $n$ and the events $G_n$, $F_n$ are independent. Now, in view of Lemma 2.4 and the above definition (2.20), we will show the inclusion (2.21) whenever $n \ge 2k$. First notice that the event $H_n \cap F_n$ is equivalent to $\{V_{m+1,n} - V_{m+1,n-1} = 1\}$. In order to prove inclusion (2.21), we can argue as follows.
Remark 2.1 says that On the other hand

At this point, we can use Lemma 2.4 to ensure that the above argument is still valid if we replace with $G_n$. Now notice that $P(F_{2k+i}) = P(F_{2k})$, $P(G_{2k+i}) = P(G_{2k})$, and $P(F_{2k+i} \cap G_{2k+i}) = P(F_{2k} \cap G_{2k})$ for $i \ge 0$. We set $p_F = P(F_{2k})$, $p_G = P(G_{2k})$, and $p_{F \cap G} = P(F_{2k} \cap G_{2k})$. Independence between $F_{2k}$ and $G_{2k}$ immediately yields $p_{F \cap G} = p_F \, p_G$. By the ergodic theorem the following equalities hold almost surely:

By following the same type of argument as in the previous proof we can now obtain an asymptotic upper bound for the ratio $V_{1,n}/N_{1,n}$.
Lemma 2.6. (2.24)
Proof. We consider $n \ge 3k$ and define the events (2.28). We will also use $H_n^c$, where $H_n$ is defined in (2.18). Concerning the event $H_n^c$, we can write, for $n \ge 3k$, that the event depend on $n$ for $n \ge 3k$; the events $G_n$, $F_n$ and $K_n$ are independent. Concerning the event $G_n$, we can say that Lemma 2.4 allows us to limit the range of the index $i$ in formula (2.28), i.e. $i = 1, \dots, 2L$.
We can now use the arguments in Remark 2.1 as follows: in view of Lemma 2.4 and the above definition of $G_n$, we will show the inclusion (2.29) whenever $n \ge 3k$. In fact, $F_n \cap K_n$ implies that the special word $w_{m+1}$ appears at time $n - 2L$, which in particular means $n - 2L \in T$; moreover, the concomitant occurrence of the event $G_n$ guarantees that $n - 2L \in R$ by an argument similar to the one in the proof of Lemma 2.5. The event $\{n - 2L \in R\}$ implies $\{n \in R\}$, whence (2.29) easily follows. Now notice that, for $i \ge 0$, $P(F_{3k+i}) = P(F_{3k})$. The same holds for $K_{3k+i}$, $G_{3k+i}$, and $F_{3k+i} \cap K_{3k+i} \cap G_{3k+i}$, and we set $p_F = P(F_{3k})$, $p_K = P(K_{3k})$, $p_G = P(G_{3k})$, and By ergodic theorem we claim (2.30) As to the r.h.s. of previous formula, we can write $N^k$.
We are now in a position to give the proof of our main result.
Proof of Theorem 2.3. Since the probability of the occurrence in a given position of a single word $(a_1, \dots, a_k)$ is a constant, equal to $1/N^k$, we have that, for $l = 1, \dots, m+1$, thus (2.34) gives a lower bound for the probability of winning for $w_{m+1}$.
Finally, by (2.32), we obtain that the last inequality following from the hypothesis (2.13). This ends the proof.
Remark 2.7. In the proof of Theorem 2.3 we obtain the explicit bound We notice furthermore that, in the statement of Theorem 2.3, the condition (2.13) can be replaced by the simpler inequality $k \ge 22(1 + \log_N m)$. In fact, at the cost of elementary but rather tedious manipulations, one can show that the latter implies (2.13). We notice in this respect that the latter estimate is of the right order. In fact, for given $N$ and $k$, the number of distinct words of length $k$ on the alphabet $A_N$ is obviously $N^k$; then, since we need to find at least $(m+1)$ distinct words, we must necessarily have $k \ge \log_N(m+1)$.
On the other hand
$$\lim_{m\to\infty} \frac{22(1 + \log_N m)}{\log_N(m+1)} = 22.$$
This limit does not depend on $N$ and shows that the condition in Theorem 2.3 is quite efficient.

Discussion and final remarks
We conclude the paper with some comments about our result and about the method that we use. First of all, our construction is based on rather general probabilistic arguments. This has allowed us to formulate a general result. In fact, in Theorem 2.3, we have no limitation on the choice of $N$ and $m$, provided that $k$ is large enough. Notice, for instance, that the methods used in [1,2] are specific to the cases $N = 2$, $m = 1, 2$.
Our procedure could be easily extended to deal with cases where stochastic independence among the letters fails.
Let us denote by $\pi_w$ the probability that a word $w$ of length $k$ occurs as soon as possible, namely in the first $k$ drawings. In the independence setting this probability is $1/N^k$. Now we point out that, essentially, we have used three hypotheses to prove our result. They can be conveniently summarized as follows.
a) The process of generation of random sequences of letters is ergodic.
b) For any word $w$ of length $k$ the probability $\pi_w$ belongs to $\{0, C_k\}$, where $C_k$ is a positive constant, i.e. some words are forbidden and all the remaining words share the same probability $C_k$.
Concerning item c), we point out that $w_{m+1}$ should be composed of two different parts. The second part is fixed (it coincides with the initial part of the word $w_1$). On the contrary, several choices are possible for the first part of $w_{m+1}$. We only need, in fact, that the first letters of $w_{m+1}$ give rise to a sub-word, of a convenient length, which does not coincide with any sub-word extracted from $w_1, \dots, w_m$. In the case of independence, which we considered throughout the paper, condition c1) is trivially satisfied and there are many different words that satisfy item c). When independence is dropped, c1) is not trivial anymore. We can still expect, however, that one can find different words that satisfy c1) besides satisfying c2), c3). It is just this possibility of different employable words which could be useful in a possible extension of our work beyond the case of independence.
An instance where all the conditions a), b) and c) hold is the model of random drawings with delayed replacement of letters from an urn: two consecutive letters cannot be equal. For such a model, validity of b) is obvious with $C_k = \frac{1}{N(N-1)^{k-1}}$, and a) holds when $N \ge 3$. In fact, in such a case, the sequence of letters drawn is an irreducible, aperiodic Markov chain. Finally, some words $w_{m+1}$ satisfying condition c) can be constructed, provided $N \ge 4$.
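Condition b) for the urn model with delayed replacement can be verified by brute force on a small alphabet. The sketch below assumes that the first letter is drawn uniformly from the $N$ letters and that each later letter is uniform among the $N-1$ letters different from the previous one; the function name `word_probabilities` is hypothetical. Under this reading every admissible word of length $k$ has probability $1/(N(N-1)^{k-1})$ and every word with two equal consecutive letters has probability $0$.

```python
from fractions import Fraction
from itertools import product


def word_probabilities(alphabet, k):
    """Probability that the first k drawings spell each word of length k."""
    N = len(alphabet)
    probs = {}
    for w in product(alphabet, repeat=k):
        p = Fraction(1, N)  # assumption: uniform first letter
        for prev, cur in zip(w, w[1:]):
            # A repeated letter is forbidden; otherwise N - 1 equally
            # likely choices remain in the urn.
            p *= Fraction(0) if prev == cur else Fraction(1, N - 1)
        probs["".join(w)] = p
    return probs


probs = word_probabilities("abc", 3)
print(sorted(set(probs.values())))
```

For $N = 3$ and $k = 3$ there are $12$ admissible words, each with probability $1/12$, so $\pi_w \in \{0, C_k\}$ as required by b).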

Now, we proceed to explain how to explicitly construct the word $w_{m+1}$. Let us consider the set $W_{L,m}$ of all the words $(w_{i+1,l}, w_{i+2,l}, \dots, w_{i+L,l}) \in A_N^L$ with $l = 1, \dots, m$, $i = 0, \dots, k-L$ and with $L$ defined in (2.12). Hence we are considering all the sub-words of length $L$ of the words $w_1, \dots, w_m$. Clearly $|A_N^L| = N^L$ and $|W_{L,m}| < mk$, therefore $|A_N^L| > |W_{L,m}|$; thus, the set $A_N^L \setminus W_{L,m}$ is not empty, and we can choose a word $(v_1, \dots, v_L) \notin W_{L,m}$. Let us arbitrarily take a letter $v \neq v_1$ and consider (2.14).

The latter ratio in (2.16) expresses the ratio between the number of times the word $w_1$ appears within the first $n$ drawings and the corresponding number of wins of the same word. Similarly for $N_{m+1,n}/V_{m+1,n}$ concerning $w_{m+1}$. The following lemma provides an almost sure lower bound for the limit of the ratio $V_{m+1,n}/N_{m+1,n}$. Notice that the existence of such a limit can be guaranteed by ergodicity arguments. An upper bound for $\lim_{n\to\infty}$

First occurrences of words. EJP 17 (2012), paper 25. ejp.ejpecp.org