Closeness to the Diagonal for Longest Common Subsequences in Random Words

The nature of the alignment with gaps corresponding to a longest common subsequence (LCS) of two independent iid random sequences drawn from a finite alphabet is investigated. It is shown that such an optimal alignment typically matches pieces of similar, short length. This is of importance in understanding the structure of optimal alignments of two sequences. Moreover, it is also shown that any property common to two subsequences typically holds in most parts of the optimal alignment whenever this same property holds, with high probability, for strings of similar, short length. Our results should, in particular, prove useful for simulations, since they imply that the rescaled two-dimensional representation of a LCS gets uniformly close to the diagonal as the length of the sequences grows without bound.


Introduction
Let x and y be two finite strings. A common subsequence of x and y is a sequence which is a subsequence of x as well as of y. A Longest Common Subsequence (LCS) of x and y is a common subsequence of x and y of maximal length.
Throughout, let X = X_1 ... X_n and Y = Y_1 ... Y_n be two random strings, and let LC_n denote the length of the LCS of X and Y. As is well known, common subsequences can be represented as alignments with gaps; this is illustrated next with some examples. First, take the binary strings x = 0010 and y = 0110. A common subsequence is 01. We represent this common subsequence as an alignment with gaps. We allow only for alignments which align identical letters, or letters with gaps: we represent a common subsequence by aligning the letters of the subsequence from each word, while the letters which do not appear in the common subsequence get aligned with a gap. Several alignments can represent the same common subsequence. In this first example, an alignment corresponding to the common subsequence 01 is given by

x: 0 0 1 0 - -
y: 0 - 1 - 1 0     (1.1)

However, the LCS is not 01, but 010. We call an alignment corresponding to a LCS an optimal alignment. Hence, (1.1) is not an optimal alignment, but an optimal alignment is given by:

x: 0 0 1 - 0
y: 0 - 1 1 0

Here, the LCS is LCS(x, y) = 010, and its length is 3, a fact that we denote by |LCS(x, y)| = 3. Let us consider another example: let x = christian and y = krystyan. In this situation, the LCS of x and y is LCS(x, y) = rstan, and the alignment with gaps representing the LCS is:

x: c h - r i - s t i - a n
y: - - k r - y s t - y a n     (1.2)

Again, all the letters which are part of the LCS get aligned with each other, while the other letters get aligned with gaps. Often, we say that, in a given alignment, a part of x gets aligned with a part of y, and here is what is meant: in the above example, x_8 x_9 = an is aligned with y_7 y_8 = an, while x_5 x_6 x_7 x_8 x_9 = stian is aligned (with gaps) with y_4 y_5 y_6 y_7 y_8 = styan. In this situation, we say that [5, 9] is aligned with [4, 8]. (We will also sometimes say that [5, 9] gets mapped onto [4, 8] by the alignment we consider.) This means that the following two conditions are satisfied:
i) the letters x_5 x_6 x_7 x_8 x_9 are all aligned exclusively with gaps or with letters from the string y_4 y_5 y_6 y_7 y_8;
ii) the letters from y_4 y_5 y_6 y_7 y_8 are all aligned with gaps or with letters from the substring x_5 x_6 x_7 x_8 x_9.
For further clarification, note that in the alignment (1.2) the interval [1, 4] is aligned with [1, 2]. In other words, we say that, in an alignment, a piece of x gets aligned with a piece of y if and only if those letters of the piece of x which get aligned with letters get aligned only with letters from the piece of y, and vice versa.
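An LCS and one optimal alignment can be computed with the classical dynamic program. The following is a minimal sketch (our own illustration, not part of the paper's argument) which recovers one LCS for the examples above; since several alignments can represent the same common subsequence, the traceback simply picks one of them:

```python
def lcs(x: str, y: str) -> str:
    """Return one longest common subsequence of x and y via the
    classical O(len(x)*len(y)) dynamic program."""
    m, n = len(x), len(y)
    # L[i][j] = length of an LCS of x[:i] and y[:j]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Trace back through the table to recover one optimal common subsequence.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("0010", "0110"))                 # -> 010
print(len(lcs("christian", "krystyan")))   # -> 5 (e.g. "rstan")
```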
Longest Common Subsequences (LCS) and Optimal Alignments (OA) are important tools used in Computational Biology and Computational Linguistics for string matching; see, e.g., [10], [3]. It is known, and due to Chvátal and Sankoff [4], that the expected length of the LCS divided by n converges to a constant γ. But, even for the simplest distributions, the exact value of γ is not known. In several special cases (e.g., [5], [6], [7]), the long-open problem of finding the order of the variance of the LCS and of the OA has been solved. Such is the case, for example, with binary sequences in which 0 and 1 have very different probabilities from each other. In all these cases, it turned out that the variance of the LCS is of linear order in the length of the sequences. This is the order conjectured by Waterman [9], for which Steele [8] has an upper bound of such order, whilst Alexander [1] determined the speed of convergence. The most important cases, like, for example, i.i.d. sequences with equiprobable letters, remain, however, open as far as this order is concerned.
In [6] and [7], in order to determine the linear order of the variance of the LCS, we used ad hoc (somewhat intricate) combinatorial arguments, despite the fact that the situation there is less involved than in the general case. In this paper, we present a general method for proving that certain properties of the optimal alignment hold, given that, typically, the property holds for alignments of short strings.
We investigate here the nature of the optimal alignments of random strings, i.e., of the alignments corresponding to LCSs. For this, and throughout, we take two independent random strings X = X_1 ... X_n and Y = Y_1 ... Y_n and further assume that X and Y are both iid sequences drawn from a finite ordered alphabet. We denote by LC_n the length of the LCS of X and Y.
To do so, we partition X into pieces of length k, with k fixed as n goes to infinity, and prove that, typically, in any optimal alignment most of these pieces get aligned with pieces of Y of similar length. More precisely, we assume throughout that n is a multiple of k and set m := n/k. Assume that the integers r_0, r_1, ..., r_m are such that

0 = r_0 < r_1 < ... < r_m = n,     (1.3)

and

LC_n = sum_{i=1}^m |LCS( X_{(i-1)k+1} ... X_{ik} ; Y_{r_{i-1}+1} ... Y_{r_i} )|.     (1.4)

The above condition just indicates that there exists an optimal alignment which maps each interval [(i-1)k+1, ik] onto the corresponding interval [r_{i-1}+1, r_i]. The first goal of the present paper is to show that, for k fixed and n large enough, any generic optimal alignment is such that the vast majority of the intervals [r_{i-1}+1, r_i] are close in length to k. Another goal is to show that, if a certain property P holds with high probability for optimal alignments of strings of (short) length of order k, then typically any optimal alignment has a large proportion of parts of order k having the property P. This is proven in Section 4. Let us get back to the first goal of this paper: we will show that, with high probability, if the integers r_0, r_1, ..., r_m satisfy (1.3) and (1.4), then most of the lengths r_i - r_{i-1} are close to k.
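For intuition, a vector r = (r_0, ..., r_m) as in (1.3) and (1.4) can be extracted from one optimal alignment computed by dynamic programming. Below is a hedged sketch (the function names and the tie-breaking are ours, and degenerate cases are not handled) which picks one admissible choice of the r_i and inspects the lengths r_i - r_{i-1}:

```python
import random

def lcs_pairs(x, y):
    """Matched index pairs (1-based) of one optimal alignment,
    recovered by the standard dynamic-programming traceback."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            L[i][j] = (L[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                       else max(L[i - 1][j], L[i][j - 1]))
    pairs, i, j = [], m, n
    while i and j:
        if x[i - 1] == y[j - 1]:
            pairs.append((i, j)); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def block_endpoints(x, y, k):
    """One admissible choice of 0 = r_0 < r_1 < ... < r_m = n:
    r_i is the Y-index of the last match inside blocks 1..i of X
    (forced up by 1 whenever that would not be strictly increasing)."""
    n = len(y)
    m = len(x) // k
    pairs = lcs_pairs(x, y)
    r = [0] * (m + 1)
    r[m] = n
    for i in range(1, m):
        matched = [b for a, b in pairs if a <= i * k]
        r[i] = max(matched[-1] if matched else 0, r[i - 1] + 1)
    return r

random.seed(0)
n, k = 1000, 50
X = "".join(random.choice("01") for _ in range(n))
Y = "".join(random.choice("01") for _ in range(n))
r = block_endpoints(X, Y, k)
lengths = [r[i] - r[i - 1] for i in range(1, len(r))]
print(lengths)  # the lengths typically cluster around k = 50
```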
Of course, we need to quantify what is meant by "close to k". To do so, we first need a definition. For p > 0, let

γ*(p) := lim_{k→∞} E|LCS( X_1 ... X_k ; Y_1 ... Y_⌊pk⌋ )| / ( (k + ⌊pk⌋)/2 ),

that is, the expected length of the LCS of two independent strings whose lengths have ratio p, rescaled by the average of the two lengths. This function γ* is just a new parametrization of the usual function γ, expressed in terms of the relative length difference q = (p − 1)/(p + 1), so that γ*(p) = γ(q(p)). A subadditivity argument, as in Chvátal and Sankoff [4], shows that the above limits do exist. When X and Y are identically distributed, the function γ is symmetric about the origin, while a further subadditivity argument shows that it is concave, and so it reaches its maximum at q = 0. In general, it is not clear whether or not it is strictly concave at q = 0. From simulations, it seems almost certain that the function γ* is strictly concave at p = 1. This, however, might be very difficult to prove. (The LCS problem is a last passage percolation problem with correlated weights. In general, proving for first/last passage percolation that the shape of the wet zone is strictly concave seems difficult, and in many cases it has not been done yet.) Note that q(p) = (p − 1)/(p + 1) = 1 − 2/(p + 1) is strictly increasing in p and equals 0 for p = 1. So, if γ(·) is strictly concave at q = 0, then it reaches a strict maximum at that point. In that case, γ*(·) would reach a strict maximum at p = 1. Without the strict concavity of γ(·), p = 1 would not necessarily be the unique point of maximal value.
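The function γ*(p) can be estimated by simulation for finite k, which is how the strict concavity at p = 1 is usually guessed at. A minimal sketch (parameter choices and names are our own, and the finite-k value only approximates the limit):

```python
import random

def lcs_len(x, y):
    """LCS length via the classical O(|x||y|) two-row dynamic program."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def gamma_star(p, k=200, trials=30, alphabet="01", seed=1):
    """Monte Carlo proxy for gamma*(p): the expected LCS length of a
    length-k string against a length-round(p*k) string, rescaled by
    the average of the two lengths (finite k only approximates the limit)."""
    rng = random.Random(seed)
    l = round(p * k)
    tot = 0
    for _ in range(trials):
        x = "".join(rng.choice(alphabet) for _ in range(k))
        y = "".join(rng.choice(alphabet) for _ in range(l))
        tot += lcs_len(x, y)
    return tot / trials / ((k + l) / 2)

for p in (0.5, 0.8, 1.0, 1.25, 2.0):
    print(p, round(gamma_star(p), 3))  # the estimate peaks near p = 1
```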
So assume that we know p_1 and p_2 such that

p_1 < 1 < p_2  and  max( γ*(p_1), γ*(p_2) ) < γ*,  where γ* := γ*(1).

The main result of this paper, Theorem 1.1, is that, if n is large enough (we take k fixed and let n go to infinity), then, typically, for every optimal alignment most of the intervals [r_{i−1}+1, r_i] (for i = 1, 2, ..., m) have their lengths between kp_1 and kp_2. By most, we mean that, taking k fixed and n large enough, that proportion typically gets as close to 100% as we want. More precisely, let A^n_{ε,p_1,p_2} denote the event that every optimal alignment aligns a proportion at least 1 − ε of the substrings X_{(i−1)k+1} ... X_{ik} with substrings of Y of length in [p_1 k, p_2 k]; Theorem 1.1 then asserts that, for every δ > 0 strictly less than min(γ* − γ*(p_1), γ* − γ*(p_2)) and for all n large enough,

P( (A^n_{ε,p_1,p_2})^c ) ≤ exp( −n( −ln(ek)/k + δ²ε²/16 ) ).
Before proving the above theorem, let us mention that the results presented here are yet another step in our attempt at finding the order of the variance of LC_n (see [5], [7], [6] and the references therein), and they will prove useful towards our ultimate goal ([2]).

Proof of the main theorem

In this section, we prove our main Theorem 1.1. To do so, we will need a few definitions. So far, we have looked at the intervals onto which an optimal alignment maps the intervals [k(i−1)+1, ki]; we are now going to take the opposite stand: we will fix non-random integers r_0 = 0 < r_1 < r_2 < ... < r_m = n and request that the alignment aligns X_{(i−1)k+1} ... X_{ik} with Y_{r_{i−1}+1} ... Y_{r_i} for every i = 1, ..., m. In general, such an alignment is not optimal, and the best score an alignment can reach under the above constraint is given by

L_n(r) := sum_{i=1}^m |LCS( X_{(i−1)k+1} ... X_{ik} ; Y_{r_{i−1}+1} ... Y_{r_i} )|,

where r = (r_0, r_1, ..., r_m). Hence, the quantity L_n(r) = L_n(r_1, ..., r_m) represents the maximum number of aligned identical letter pairs under the constraint that the string X_{(i−1)k+1} ... X_{ik} gets aligned with Y_{r_{i−1}+1} ... Y_{r_i}. Note that, for non-random r = (r_0, r_1, ..., r_m), the partial scores are independent of each other, and concentration inequalities will prove handy when dealing with L_n(r). Here, condition (1.9) requires that at least a proportion 1 − ε of the lengths r_i − r_{i−1}, i = 1, ..., m, lie in [kp_1, kp_2]. Let R_{ε,p_1,p_2} denote the (non-random) set of all integer vectors r = (r_0, r_1, ..., r_m) satisfying the conditions (1.3) and (1.9). Let R^c_{ε,p_1,p_2} denote the (non-random) set of all integer vectors r = (r_0, r_1, ..., r_m) satisfying the condition (1.3), but not (1.9).
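Since the blockwise scores are independent and L_n(r) is their sum, it can be computed block by block; the following sketch (our own naming, not the paper's) also illustrates that L_n(r) ≤ LC_n for any admissible r:

```python
import random

def lcs_len(x, y):
    """LCS length via the classical two-row dynamic program."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def L_n(x, y, r, k):
    """Best score under the constraint that block i of X, namely
    X_{(i-1)k+1..ik}, is aligned with Y_{r_{i-1}+1..r_i}: by the block
    structure, it is the sum of the blockwise LCS lengths."""
    m = len(x) // k
    return sum(lcs_len(x[(i - 1) * k:i * k], y[r[i - 1]:r[i]])
               for i in range(1, m + 1))

random.seed(2)
n, k = 600, 30
X = "".join(random.choice("01") for _ in range(n))
Y = "".join(random.choice("01") for _ in range(n))
r = list(range(0, n + 1, k))           # the "diagonal" choice r_i = i*k
print(L_n(X, Y, r, k), lcs_len(X, Y))  # constrained score <= LC_n
```

Concatenating the blockwise common subsequences yields a common subsequence of X and Y, which is why the constrained score can never exceed LC_n.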
To determine an element of R^c_{ε,p_1,p_2}, we need to pick m elements from the set {1, 2, ..., n}. Hence, we get the following upper bound for the number of elements in the set:

|R^c_{ε,p_1,p_2}| ≤ C(n, m) ≤ (en/m)^m = (ek)^m,     (2.1)

by a well-known and simple bound on binomial coefficients. Now, let δ* := min(γ* − γ*(p_1), γ* − γ*(p_2)), and let δ ∈ (0, δ*) be as in Theorem 1.1. By definition, LC_n is always larger than or equal to L_n(r). For r to "define an optimal alignment", we need to have

L_n(r) = LC_n.     (2.2)

Hence, for the event A^n_{ε,p_1,p_2} not to hold, there needs to be at least one element r in R^c_{ε,p_1,p_2} for which (2.2) is satisfied. This means that

P( (A^n_{ε,p_1,p_2})^c ) ≤ sum_{r ∈ R^c_{ε,p_1,p_2}} P( L_n(r) − LC_n ≥ 0 ).     (2.3)

Moreover, for every r ∈ R^c_{ε,p_1,p_2} and all n large enough, E[L_n(r) − LC_n] ≤ −δεn/2. (The proof of this fact is given in Lemma 2.1.) With the last inequality above, we find that

P( L_n(r) − LC_n ≥ 0 ) ≤ P( L_n(r) − LC_n − E[L_n(r) − LC_n] ≥ δεn/2 ).     (2.4)

Note that the quantity L_n(r) − LC_n changes by less than 2 units when we change any one of the i.i.d. entries X_1, X_2, ..., X_n; Y_1, Y_2, ..., Y_n. Hence, we can apply Hoeffding's martingale inequality to the right-hand side of (2.4) to obtain

P( L_n(r) − LC_n ≥ 0 ) ≤ exp( −2(δεn/2)² / (2n · 2²) ) = exp( −nδ²ε²/16 ).

(Recall that Hoeffding's inequality ensures that, if f is a map of l entries such that changing any one single entry affects the value by less than a, then

P( f − E[f] ≥ t ) ≤ exp( −2t² / (l a²) ),

provided the variables W_1, W_2, ..., W_l are independent.) Combining the last inequality above with (2.3), one obtains

P( (A^n_{ε,p_1,p_2})^c ) ≤ |R^c_{ε,p_1,p_2}| exp( −nδ²ε²/16 ).     (2.5)

But, by equation (2.1), the set R^c_{ε,p_1,p_2} contains fewer than (ek)^m elements, so that, out of (2.5), we get

P( (A^n_{ε,p_1,p_2})^c ) ≤ (ek)^m exp( −nδ²ε²/16 ) = exp( −n( −ln(ek)/k + δ²ε²/16 ) ).

This will finish the proof of Theorem 1.1, provided we prove:

Lemma 2.1 Assume that r = (r_0, ..., r_m) ∈ R^c_{ε,p_1,p_2}. Then, for n large enough,

E[L_n(r) − LC_n] ≤ −δεn/2.

Proof. Assume that we compute the LCS of the string X_{(i−1)k+1} X_{(i−1)k+2} ... X_{ik} and of the string Y_{r_{i−1}+1} ... Y_{r_i}. If the length r_i − r_{i−1} is not between p_1 k and p_2 k, then the rescaled expected value is below γ* by at least δ* (rescaled by the average of the lengths of the two strings). Hence,

E|LCS( X_{(i−1)k+1} ... X_{ik} ; Y_{r_{i−1}+1} ... Y_{r_i} )| ≤ (γ* − δ*) (k + r_i − r_{i−1}) / 2.

One of the strings has length k; hence, not rescaled, we are below γ* times the average length by at least δ*k/2. By definition, any r belonging to R^c_{ε,p_1,p_2} has a proportion of at least ε of the intervals X_{(i−1)k+1} X_{(i−1)k+2} ... X_{ik} which get matched with strings of length not in [p_1 k, p_2 k]. This corresponds to a total number of at least εm such intervals. For each of these intervals, the expected score is below γ* times the average length by at least δ*k/2. Hence, the expected score of any alignment with block boundaries r satisfies

E[L_n(r)] ≤ nγ* − εm δ*k/2 = n( γ* − εδ*/2 ),     (2.8)

as soon as r = (r_0, r_1, ..., r_m) is in R^c_{ε,p_1,p_2}. Now, E[LC_n]/n → γ* as n goes to infinity. Note that, by definition, δ* − δ > 0, and so, for all n large enough, we will have

E[LC_n] ≥ n( γ* − ε(δ* − δ)/2 ).     (2.9)

Combining (2.8) and (2.9) yields that, for all n large enough,

E[L_n(r) − LC_n] ≤ −δεn/2,

as soon as r ∈ R^c_{ε,p_1,p_2}. The proof is now completed.
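The bounded-differences property behind the use of Hoeffding's inequality (changing one letter moves the LCS length by at most 1, hence L_n(r) − LC_n by less than 2) can be checked empirically; a small sketch:

```python
import random

def lcs_len(x, y):
    """LCS length via the classical two-row dynamic program."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

rng = random.Random(3)
n = 120
X = [rng.choice("01") for _ in range(n)]
Y = [rng.choice("01") for _ in range(n)]
base = lcs_len(X, Y)

# Flip every entry of X in turn and record how much the LCS length moves.
max_change = 0
for i in range(n):
    X2 = X.copy()
    X2[i] = "1" if X2[i] == "0" else "0"
    max_change = max(max_change, abs(lcs_len(X2, Y) - base))
print(max_change)  # never exceeds 1
```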

Closeness to the diagonal
Let us start by explaining how we can represent alignments in two dimensions, by considering an example. Take the two related words: the English X = mother and the German Y = mutter. The longest common subsequence is mter, and hence LC_6 = 4. As mentioned, we represent any common subsequence as an alignment with gaps. The letters appearing in the common subsequence are aligned one on top of the other; the letters which are not aligned with the same letter in the other text get aligned with a gap. In the present case, the common subsequence mter corresponds to the following alignment:

X: m o - t h - e r
Y: m - u t - t e r     (3.1)

An alignment corresponding to a LCS is called an optimal alignment. The optimal alignment is, in general, not unique. For example, to the same common subsequence mter there corresponds also the following optimal alignment:

X: m o - - t h e r
Y: m - u t t - e r

In the following, we represent alignments in two dimensions. For this, we view alignments as subsets of R², in the following manner: if the i-th letter of X gets aligned with the j-th letter of Y, then the set representing the alignment contains the point (i, j). For example, the alignment (3.1) can be represented by the points

(1, 1), (3, 3), (5, 5), (6, 6),

one point (i, j) for each pair of aligned letters. We say that these points represent the optimal alignment. The main result of the previous section implies that the optimal alignment must remain close to the diagonal. This is the content of the next theorem.

Theorem 3.1 Let p_1 < 1 < p_2 and ε, δ > 0 be as in Theorem 1.1, and let D^n_{ε,p_1,p_2} denote the event that every optimal alignment of X_1 ... X_n with Y_1 ... Y_n, viewed as a subset of R², lies above the line y(x) = p_1 x − p_1 nε − p_1 k and below the corresponding upper line obtained by exchanging the roles of X and Y. Then, for n large enough,

P( (D^n_{ε,p_1,p_2})^c ) ≤ 2 exp( −n( −ln(ek)/k + δ²ε²/16 ) ).
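The two-dimensional representation is easy to produce from a dynamic-programming traceback; a sketch (the traceback picks just one of the possibly several optimal alignments, so for mother/mutter it may return either variant of the letter t's alignment):

```python
def lcs_points(x, y):
    """Points (i, j), 1-based, of one optimal alignment: the
    two-dimensional representation described in the text."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            L[i][j] = (L[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                       else max(L[i - 1][j], L[i][j - 1]))
    pts, i, j = [], m, n
    while i and j:
        if x[i - 1] == y[j - 1]:
            pts.append((i, j)); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pts[::-1]

pts = lcs_points("mother", "mutter")
print(pts)  # four strictly increasing points, one per letter of "mter"
```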
Proof. Let D^n_a be the event that every optimal alignment of X_1 X_2 ... X_n with Y_1 Y_2 ... Y_n is above the line y(x) = p_1 x − p_1 nε − p_1 k, and let D^n_b be the corresponding event, with the roles of X and Y exchanged, that every optimal alignment stays below the symmetric upper line. Then D^n_{ε,p_1,p_2} = D^n_a ∩ D^n_b, and thus

P( (D^n_{ε,p_1,p_2})^c ) ≤ P( (D^n_a)^c ) + P( (D^n_b)^c ).
By symmetry, we have that P((D^n_a)^c) = P((D^n_b)^c). The last inequality above then yields

P( (D^n_{ε,p_1,p_2})^c ) ≤ 2 P( (D^n_a)^c ).     (3.4)

Next, we are going to prove that

A^n_{ε,p_1,p_2} ⊂ D^n_a.     (3.5)

Let x ∈ [0, n] be a multiple of k, and let a be the natural number such that ak = x.
Let us first consider the case where x ≤ εn. In this situation, we have that p_1 x − p_1 εn is non-positive. But any alignment (and, in particular, any optimal alignment) we consider maps any x ∈ [0, n] into [0, n]. Hence, for every x ≤ εn, the condition is automatically verified; that is, any optimal alignment aligns x onto a point which is no less than p_1 x − p_1 εn.
Let us next consider the case where x ≥ εn. When the event A^n_{ε,p_1,p_2} holds, any optimal alignment aligns all but a proportion ε of the intervals [(i − 1)k + 1, ik], i ∈ {1, ..., m}, onto intervals of length longer than or equal to p_1 k. The maximum number of intervals which could be matched onto intervals of length less than p_1 k is thus εm. In the interval [0, x], there are a intervals from the partition [(i − 1)k + 1, ik], i ∈ {1, ..., m}. Hence, at least a − εm of these intervals are matched onto intervals of length no less than p_1 k. This implies that, when the event A^n_{ε,p_1,p_2} holds, the point x gets matched by the optimal alignment onto a value no less than (a − εm)kp_1.
Noting that ak = x and that mk = n, the above bound becomes

(a − εm)kp_1 = p_1 x − p_1 εn.     (3.6)

If x is not a multiple of k, let x_1 denote the largest multiple of k which is less than x; by definition, x − k ≤ x_1 < x. The two-dimensional alignment curve cannot go down; hence, x gets aligned to a point which cannot be below where x_1 gets aligned. Now, x_1, being a multiple of k, gets aligned onto a point which is larger than or equal to p_1 x_1 − p_1 εn. Using (3.6), we find

p_1 x_1 − p_1 εn ≥ p_1 (x − k) − p_1 εn = p_1 x − p_1 εn − p_1 k.

We have just proven that, when the event A^n_{ε,p_1,p_2} holds, the point x gets aligned above or on the point p_1 x − p_1 εn − p_1 k. This finishes proving that the event A^n_{ε,p_1,p_2} is a subevent of D^n_a. Since A^n_{ε,p_1,p_2} ⊂ D^n_a, we get P((D^n_a)^c) ≤ P((A^n_{ε,p_1,p_2})^c). But, by Theorem 1.1, the last probability above is upper bounded by exp(−n(−ln(ek)/k + δ²ε²/16)), so that

P( (D^n_a)^c ) ≤ exp( −n( −ln(ek)/k + δ²ε²/16 ) ).

Hence, (3.4) becomes

P( (D^n_{ε,p_1,p_2})^c ) ≤ 2 exp( −n( −ln(ek)/k + δ²ε²/16 ) ).

Theorem 3.1 allows one to reduce the time needed to compute the LCS of two random sequences. First note that, by rescaling the two-dimensional representation of optimal alignments by n, it implies that, with high probability, up to a distance of order ε > 0, any optimal alignment lies above the line x ↦ p_1 x and below the line x ↦ p_2 x. Moreover, in the theorem, we can take ε > 0 as small as we want (leaving it fixed, though, as n goes to infinity). Simulations seem to indicate that the mean curve γ* is strictly concave at p = 1. If this is indeed true, then we can take p_1 as close to one as we want and it will still satisfy the conditions of the theorem. That is, we could then take ε as close to 0 as we want, and p_1 as close to 1 as we want.
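As a quick numerical illustration of this closeness to the diagonal (a sketch only: the traceback samples a single optimal alignment, not all of them):

```python
import random

def lcs_points(x, y):
    """Matched index pairs (1-based) of one optimal alignment."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            L[i][j] = (L[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                       else max(L[i - 1][j], L[i][j - 1]))
    pts, i, j = [], m, n
    while i and j:
        if x[i - 1] == y[j - 1]:
            pts.append((i, j)); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pts[::-1]

rng = random.Random(4)
n = 400
X = "".join(rng.choice("01") for _ in range(n))
Y = "".join(rng.choice("01") for _ in range(n))
dev = max(abs(j - i) for i, j in lcs_points(X, Y))
print(dev, "out of n =", n)  # typically a small fraction of n
```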
Hence, the rescaled two-dimensional representation of the optimal alignments gets uniformly as close to the diagonal as we want when n goes to infinity. Figure 1, below, is the graph of a simulation with two i.i.d. binary sequences of length n = 1000. All the optimal alignments are contained between the two lines in the graph, and we see that they all stay extremely close to the diagonal.

Proving a property of the optimal alignment

Let P be a map which assigns to every pair of strings (x, y) the value 1 if the pair (x, y) has a certain property, and 0 otherwise. Hence, if A is the alphabet we consider, then P : A* × A* → {0, 1}, where A* denotes the set of finite strings over A. If P(x, y) = 1, we say that the string pair (x, y) has the property P. Let ε > 0 be any fixed number strictly larger than 0, and let r = (r_0, r_1, ..., r_m) be an integer vector satisfying condition (1.3). Let B^n_P(r, ε) denote the event that a proportion of at least 1 − ε of the aligned string pairs

( X_{(i−1)k+1} ... X_{ik} ; Y_{r_{i−1}+1} ... Y_{r_i} ),  i = 1, ..., m,     (4.1)

satisfy the property P. Let B^n_P(ε) denote the event that, for every optimal alignment, the proportion of aligned string pairs (4.1) satisfying the property P is more than 1 − ε. Hence, the event B^n_P(ε) holds if and only if, for every vector r = (r_0, r_1, ..., r_m) satisfying (1.3) and such that LC_n = L_n(r), the event B^n_P(r, ε) holds. Most of the time, the properties we want for string pairs only hold with high probability when the two strings have lengths not too far from each other. Let q ∈ [0, 1] be a (small) constant. Assume that, as soon as r_i − r_{i−1} ∈ [kp_1, kp_2], the probability that the string pair (4.1) has the required property is above 1 − q; hence, assume that, for every r_1 ∈ [kp_1, kp_2], we have

P( the pair (X_1 ... X_k ; Y_1 ... Y_{r_1}) has the property P ) ≥ 1 − q.

We are going to investigate next how small q = q(k) needs to be in order to ensure that a large proportion of the aligned string pairs (4.1) have the property P (for every optimal alignment). Recall that A^n_{ε,p_1,p_2} denotes the event that every optimal alignment aligns a proportion larger than or equal to 1 − ε of the substrings X_{(i−1)k+1} ... X_{ik} to substrings of Y with length in [p_1 k, p_2 k]. Also, recall that R_{ε,p_1,p_2} denotes the set of integer vectors r = (r_0, r_1, ..., r_m) satisfying (1.3) and such that more than a proportion 1 − ε of the lengths r_i − r_{i−1} lie in [kp_1, kp_2]. We will need a small modification of the event B^n_P(r, ε): let B̃^n_P(r, ε) denote the event that, among the aligned string pairs (4.1), there are no more than mε which do not satisfy the property P and yet have their length r_i − r_{i−1} in [kp_1, kp_2]. A union bound over the choices of the faulty pairs controls the probability of the complement of B̃^n_P(r, ε); noting that the binomial coefficient C(m, ε₂m) is bounded above by exp(H_e(ε₂)m), where H_e is the base-e entropy function, given by

H_e(x) = −x ln x − (1 − x) ln(1 − x),  0 < x < 1,

we obtain a bound which is exponentially small in m. We can now combine this with a union bound over the constrained alignments: note that the set R_{ε,p_1,p_2} contains fewer than (3k)^m elements, as follows from the counting argument of Section 2.
Thus we obtain the bound (4.5). Taking q(k) less than or equal to 1/(6k)^{1/ε₂}, note that H_e(ε₂) < ln 2 as soon as ε₂ < 0.5. So, if we assume that ε₂ < 0.5, then the expression exp((H_e(ε₂) − ln 2)m) is an exponentially small quantity in m. We already learned how to bound the probability of the event (A^n_{ε₂,p_1,p_2})^c in the previous sections. Hence, inequality (4.5) allows one to show that a high percentage of the aligned string pairs (4.1) have property P in any optimal alignment. For this, we just need to show that, for pairs (4.1) of similar lengths, the probability q(k) is less than or equal to 1/(12k)^{1/ε₂}, where

q(k) := max_{l ∈ [kp_1, kp_2]} P( the pair (X_1 ... X_k ; Y_1 ... Y_l) does not have the property P ).

This is the content of the next theorem, which is obtained by putting ε₁ = ε₂ = ε/2.

Theorem 4.1 Assume that p_1 and p_2 are such that p_1 < 1 < p_2. Let δ > 0 be strictly less than min(γ* − γ*(p_1), γ* − γ*(p_2)), and let ε > 0. Assume that there is a natural number k ≥ 1 satisfying condition (4.6) and such that condition (4.7) holds for any l ∈ [kp_1, kp_2]. Then, for any optimal alignment r (i.e., such that LC_n = L_n(r)), the proportion of string pairs (X_{(i−1)k+1} ... X_{ik} ; Y_{r_{i−1}+1} ... Y_{r_i}) which have property P is above 1 − ε, with probability bounded below as given in (4.8).

The above theorem is very useful for showing that, when a certain property holds for aligned string pairs with similar lengths of order k, then the property typically holds in most parts of the optimal alignment. From our experience, for most properties we are interested in, when p_1 and p_2 are close to 1 but fixed, the probability that the pair (X_1 ... X_k ; Y_1 ... Y_l) does not have the property is about the same for all l ∈ [kp_1, kp_2]. In other words, how the alignment of X_1 ... X_k with Y_1 ... Y_l behaves does not depend very much on l, as soon as l is close to k and k is fixed. (We are not necessarily able to prove this formally in many situations, though.) Looking at (4.7), we see that we need a bound for the probability on the left-hand side of (4.7) which is smaller than any negative polynomial order in k (at least if we want to be able to take ε as close to 0 as we want; if we just want ε > 0 small but fixed, then a negative polynomial bound with a huge exponent will do). So, if that probability is, for example, of order k^{−ln k} or e^{−k^α} for a constant α > 0, then condition (4.7) is satisfied by taking k large enough. Similarly, condition (4.6) is always satisfied for k large enough.
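The binomial-entropy bound C(m, xm) ≤ exp(H_e(x)m) used above is easy to sanity-check numerically:

```python
from math import comb, exp, log

def H_e(x):
    """Base-e entropy: H_e(x) = -x ln x - (1 - x) ln(1 - x), 0 < x < 1."""
    return -x * log(x) - (1 - x) * log(1 - x)

m, eps2 = 200, 0.1
count = comb(m, int(eps2 * m))   # C(200, 20)
bound = exp(H_e(eps2) * m)
print(count <= bound)  # -> True: C(m, xm) <= exp(H_e(x) m)
```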
On the other hand, we could envision using Monte Carlo simulation in order to find a likely bound for the probability on the left-hand side of (4.7). Things then become much more difficult. Assume that you want ε to be 0.2 and take δ = 0.1. Then, by condition (4.6), you find that k must satisfy

k ≥ (ln(64) + ln(25) + ln(100)) · 64 · 25 · 100 ≈ 1260000.
The probability of not having property P for strings of length approximately k must then (see (4.7)) be less than (6k)^{−10}, so that, with our previous bound on k, we would get a requirement of less than 10^{−66}.
The above number is way too small for Monte Carlo simulations! Indeed, to show that a probability is as small as 10^{−66}, one would need to run on the order of 10^{66} simulations.
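This infeasibility can be made concrete with the usual rule of thumb that certifying a probability p by Monte Carlo needs on the order of 1/p runs; the function and its parameters below are our own illustration, not from the paper:

```python
def trials_needed(p, rel_err=0.5, z=2.0):
    """Rough number of Monte Carlo runs needed to estimate a small
    probability p with relative error rel_err at about z standard
    deviations: the indicator has variance ~ p, so
    N ~ z^2 / (rel_err^2 * p)."""
    return z ** 2 / (rel_err ** 2 * p)

print(f"{trials_needed(1e-66):.1e} runs")  # hopeless
print(f"{trials_needed(1e-5):.1e} runs")   # routine on a desktop
```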

Further improvements
There are several ways to improve on our bounds. First, we took n^m as an upper bound for the cardinality of R_{ε,p_1,p_2}. One can find a better upper bound as follows: first note that, if r = (r_0, r_1, ..., r_m) ∈ R_{ε,p_1,p_2}, then at least a proportion 1 − ε of the increments r_i − r_{i−1} must lie in [kp_1, kp_2], which leaves far fewer choices for these increments. The typical situation is that ε₁ + ε₂ should be of a given order. So, we will try to choose ε₁ and ε₂, under the constraint ε = ε₁ + ε₂, so that the bound on the right-hand side of (4.16) is as sharp as possible. For this, note first that the power 1/ε₂ has much more effect on making the bound small than the expression ε₁² above the fraction bar. Note that exp(H_e(ε₁) + H_e(ε₂)) is just going to be a value between 1 and 2 for small ε, and so will not have a lot of influence. Also, ((3k)/ε₁)^{ε₁} is somewhat negligible compared to 3k. When dealing only with inequality (4.15), things look somewhat better: take k = 1000 and (p_2 − p_1)k = 100; then the order of the required bound for q(k) is about 10^{−5}, which is feasible with Monte Carlo! So, if we could find a method, other than the one described here, to make sure that most of the pieces of strings X_{(i−1)k+1} X_{(i−1)k+2} ... X_{ik} are aligned with pieces of similar length, we would be in business!
[Figure 1: two-dimensional representations of the optimal alignments of two i.i.d. binary sequences of length 1000 (X-sequence horizontal, Y-sequence vertical); longest vertical distance = 26, longest horizontal length = 112.]