Bernoulli, Volume 20, Number 3 (2014), 1292-1343.

Optimal alignments of longest common subsequences and their path properties

Jüri Lember, Heinrich Matzinger, and Anna Vollmer

Abstract

We investigate the behavior of optimal alignment paths for homologous (related) and independent random sequences. An alignment between two finite sequences is optimal if it corresponds to a longest common subsequence (LCS). We prove the existence of lowest and highest optimal alignments and study their differences. A large difference between the extremal alignments implies a large variety of optimal alignments. We present several simulations indicating that, for homologous sequences (those sharing a common ancestor), the distance between the extremal alignments is typically much smaller than for independent sequences. In particular, the simulations suggest that for homologous sequences the distance between the extremal alignments grows logarithmically. The main theoretical results of the paper prove that, under some assumptions, this is indeed the case. The paper suggests that the properties of the optimal alignment paths characterize the relatedness of the sequences.
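
The notions in the abstract can be illustrated with a minimal sketch. It assumes the standard dynamic-programming table for the LCS and recovers two optimal alignments by backtracking with opposite tie-breaking rules. These two alignments only stand in for the lowest and highest optimal alignments; the paper's precise definitions may differ. The function names (`lcs_table`, `backtrack`, `alignment_gap`) and the gap measure are hypothetical and chosen for illustration, not taken from the paper.

```python
def lcs_table(x, y):
    """table[i][j] = length of an LCS of x[:i] and y[:j]."""
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def backtrack(table, drop_x_first):
    """Recover one optimal alignment as a list of matched index pairs (i, j).

    When both a letter of x and a letter of y could be skipped without
    losing LCS length, drop_x_first decides which one is skipped. The two
    choices give two (possibly different) optimal alignments.
    """
    i = len(table) - 1
    j = len(table[0]) - 1
    pairs = []
    while i > 0 and j > 0:
        can_drop_x = table[i - 1][j] == table[i][j]
        can_drop_y = table[i][j - 1] == table[i][j]
        if can_drop_x and (drop_x_first or not can_drop_y):
            i -= 1
        elif can_drop_y:
            j -= 1
        else:
            # Neither letter can be skipped, so x[i-1] == y[j-1] is matched.
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return pairs


def alignment_gap(a, b):
    """Hypothetical gap measure (an assumption, not the paper's distance):
    largest coordinate difference between the k-th matched pairs."""
    return max(
        (max(abs(i1 - i2), abs(j1 - j2)) for (i1, j1), (i2, j2) in zip(a, b)),
        default=0,
    )


x, y = "ab", "ba"
table = lcs_table(x, y)
first = backtrack(table, drop_x_first=True)    # matches the letter 'a'
second = backtrack(table, drop_x_first=False)  # matches the letter 'b'
print(table[-1][-1], first, second, alignment_gap(first, second))
```

For identical sequences the two backtracking rules return the same alignment and the gap is zero, while for `"ab"` and `"ba"` there are two distinct optimal alignments of length one.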

Article information

Source
Bernoulli, Volume 20, Number 3 (2014), 1292-1343.

Dates
First available in Project Euclid: 11 June 2014

Permanent link to this document
https://projecteuclid.org/euclid.bj/1402488941

Digital Object Identifier
doi:10.3150/13-BEJ522

Mathematical Reviews number (MathSciNet)
MR3217445

Zentralblatt MATH identifier
1312.60004

Keywords
longest common subsequence; optimal alignments; homologous sequences

Citation

Lember, Jüri; Matzinger, Heinrich; Vollmer, Anna. Optimal alignments of longest common subsequences and their path properties. Bernoulli 20 (2014), no. 3, 1292–1343. doi:10.3150/13-BEJ522. https://projecteuclid.org/euclid.bj/1402488941

