Journal of Applied Probability

Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata

Grégory Nuel
Source: J. Appl. Probab. Volume 45, Number 1 (2008), 226-243.

Abstract

In the framework of patterns in random texts, Markov chain embedding techniques turn the occurrences of a pattern in an order-m Markov sequence into those of a subset of states in an order-1 Markov chain. In this paper we use the theory of languages and automata to provide space-optimal Markov chain embedding using the new notion of pattern Markov chains (PMCs), and we give explicit constructive algorithms to build the PMC associated with any given pattern problem. The interest of PMCs is then illustrated through the exact computation of P-values, whose complexity is discussed and compared to that of classical asymptotic approximations. Finally, we consider two examples of highly degenerate pattern problems (structured motifs and PROSITE signatures), which further illustrate the usefulness of our approach.
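
The embedding idea summarized in the abstract can be illustrated with a minimal sketch. It is not the PMC construction of the paper: it assumes a single word as the pattern and an i.i.d. (order-0) text, and the function names `build_dfa` and `prob_at_least_one` are hypothetical. A deterministic finite automaton tracks how much of the pattern has been matched, and the automaton state then evolves as an order-1 Markov chain whose final state marks an occurrence.

```python
def build_dfa(pattern, alphabet):
    """Transition table of a word-matching DFA (illustrative, not optimized).

    State q is the length of the longest prefix of `pattern` that is a
    suffix of the text read so far; state len(pattern) marks an occurrence.
    """
    m = len(pattern)
    delta = [{} for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            s = pattern[:q] + a          # relevant suffix of the text after reading a
            k = min(m, len(s))
            while k > 0 and s[-k:] != pattern[:k]:
                k -= 1                   # shrink until a prefix of the pattern matches
            delta[q][a] = k
    return delta


def prob_at_least_one(pattern, probs, n):
    """P(pattern occurs at least once in an i.i.d. text of length n).

    `probs` maps each letter to its probability (assumed to sum to 1).
    The final DFA state is made absorbing, so its mass after n steps is
    the probability of at least one occurrence.
    """
    m = len(pattern)
    delta = build_dfa(pattern, list(probs))
    dist = [0.0] * (m + 1)
    dist[0] = 1.0                        # start in the empty-match state
    for _ in range(n):
        new = [0.0] * (m + 1)
        for q, p in enumerate(dist):
            if q == m:
                new[m] += p              # absorbing: an occurrence was already seen
                continue
            for a, pa in probs.items():
                new[delta[q][a]] += p * pa
        dist = new
    return dist[m]


if __name__ == "__main__":
    print(prob_at_least_one("ab", {"a": 0.5, "b": 0.5}, 3))
```

For an order-m text with m ≥ 1 the automaton state alone does not always determine the last m letters, so this sketch would need extra states; handling that case with as few states as possible is the question the paper addresses.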

Primary Subjects: 65C40
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.jap/1208358964
Digital Object Identifier: doi:10.1239/jap/1208358964
Zentralblatt MATH identifier: 1142.65010
Mathematical Reviews number (MathSciNet): MR2409323

References

Antzoulakos, D. L. (2001). Waiting times for patterns in a sequence of multistate trials. J. Appl. Prob. 38, 508--518.
Mathematical Reviews (MathSciNet): MR1834757
Digital Object Identifier: doi:10.1239/jap/996986759
Project Euclid: euclid.jap/996986759
Zentralblatt MATH: 0985.60072
Biggins, J. D. and Cannings, C. (1987). Markov renewal processes, counters and repeated sequences in Markov chains. Adv. Appl. Prob. 19, 521--545.
Mathematical Reviews (MathSciNet): MR903536
Digital Object Identifier: doi:10.2307/1427406
Zentralblatt MATH: 0629.60100
Chadjiconstantinidis, S., Antzoulakos, D. L. and Koutras, M. V. (2000). Joint distribution of successes, failures and patterns in enumeration problems. Adv. Appl. Prob. 32, 866--884.
Mathematical Reviews (MathSciNet): MR1788099
Digital Object Identifier: doi:10.1239/aap/1013540248
Project Euclid: euclid.aap/1013540248
Zentralblatt MATH: 0963.60003
Chryssaphinou, O. and Papastavridis, S. (1990). The occurrence of a sequence of patterns in repeated dependent experiments. Theory Prob. Appl. 35, 167--173.
Mathematical Reviews (MathSciNet): MR1050068
Crochemore, M. and Hancart, C. (1997). Automata for matching patterns. In Handbook of Formal Languages, Vol. 2, Linear Modeling: Background and Application, Springer, Berlin, pp. 399--462.
Mathematical Reviews (MathSciNet): MR1470014
Crochemore, M. and Stefanov, V. T. (2003). Waiting time and complexity for matching patterns with automata. Inf. Proc. Lett. 87, 119--125.
Mathematical Reviews (MathSciNet): MR1986775
Digital Object Identifier: doi:10.1016/S0020-0190(03)00271-0
Zentralblatt MATH: 1161.68760
Fu, J. C. (1996). Distribution theory of runs and patterns associated with a sequence of multi-state trials. Statistica Sinica 6, 957--974.
Mathematical Reviews (MathSciNet): MR1422413
Fu, J. C. and Chang, Y. M. (2002). On probability generating functions for waiting time distributions of compound patterns in a sequence of multistate trials. J. Appl. Prob. 39, 70--80.
Mathematical Reviews (MathSciNet): MR1895144
Digital Object Identifier: doi:10.1239/jap/1019737988
Project Euclid: euclid.jap/1019737988
Zentralblatt MATH: 1008.60031
Fu, J. C. and Koutras, M. V. (1994). Distribution theory of runs: a Markov chain approach. J. Amer. Statist. Assoc. 89, 1050--1058.
Mathematical Reviews (MathSciNet): MR1294750
Digital Object Identifier: doi:10.2307/2290933
Zentralblatt MATH: 0806.60011
Fu, J. C. and Lou, W. Y. W. (2003). Distribution Theory of Runs and Patterns and Its Applications: A Finite Markov Chain Approach. World Scientific, Singapore.
Mathematical Reviews (MathSciNet): MR2000533
Zentralblatt MATH: 1030.60063
Gasteiger, E., Jung, E. and Bairoch, A. (2001). SWISS-PROT: Connecting biological knowledge via a protein database. Current Issues Molec. Biol. 3, 47--55.
Glaz, J., Kulldorff, M., Pozdnyakov, V. and Steele, J. M. (2006). Gambling teams and waiting times for patterns in two-state Markov chains. J. Appl. Prob. 43, 127--140.
Mathematical Reviews (MathSciNet): MR2225055
Digital Object Identifier: doi:10.1239/jap/1143936248
Project Euclid: euclid.jap/1143936248
Zentralblatt MATH: 1105.60051
Guibas, L. J. and Odlyzko, A. M. (1981). String overlaps, pattern matching and transitive games. J. Combinatorial Theory A 30, 183--208.
Mathematical Reviews (MathSciNet): MR611250
Digital Object Identifier: doi:10.1016/0097-3165(81)90005-4
Zentralblatt MATH: 0454.68109
Hopcroft, J. E., Motwani, R. and Ullman, J. D. (2001). Introduction to Automata Theory, Languages, and Computation, 2nd edn. ACM Press, New York.
Zentralblatt MATH: 0980.68066
Hulo, N. et al. (2006). The PROSITE database. Nucleic Acids Res. 34, D227--D230.
Lehoucq, R. B., Sorensen, D. C. and Yang, C. (1998). ARPACK Users' Guide. Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA.
Mathematical Reviews (MathSciNet): MR1621681
Li, S.-Y. R. (1980). A martingale approach to the study of occurrence of sequence patterns in repeated experiments. Ann. Prob. 8, 1171--1176.
Mathematical Reviews (MathSciNet): MR602390
Digital Object Identifier: doi:10.1214/aop/1176994578
Project Euclid: euclid.aop/1176994578
Zentralblatt MATH: 0447.60006
Lou, W. Y. W. (1996). On runs and longest run tests: a method of finite Markov chain imbedding. J. Amer. Statist. Assoc. 91, 1595--1601.
Mathematical Reviews (MathSciNet): MR1439099
Digital Object Identifier: doi:10.2307/2291585
Zentralblatt MATH: 0881.62086
Marsan, L. and Sagot, M.-F. (2000). Algorithms for extracting structured motifs using a suffix tree with an application to promoter consensus identification. J. Comput. Biol. 7, 345--362.
Nicodème, P., Salvy, B. and Flajolet, P. (2002). Motifs statistics. Theoret. Comput. Sci. 287, 593--617.
Mathematical Reviews (MathSciNet): MR1930238
Digital Object Identifier: doi:10.1016/S0304-3975(01)00264-X
Zentralblatt MATH: 1061.68118
Nuel, G. (2006a). Effective p-value computations using finite Markov chain imbedding (FMCI): application to local score and to pattern statistics. Algo. Molec. Biol. 1.
Nuel, G. (2006b). Numerical solutions for patterns statistics on Markov chains. Statist. Appl. Gen. Molec. Biol. 5.
Mathematical Reviews (MathSciNet): MR2306489
Digital Object Identifier: doi:10.2202/1544-6115.1219
Zentralblatt MATH: 1166.62324
Nuel, G. (2007). Cumulative distribution function of a geometric Poisson distribution. J. Statist. Comput. Simul. 77.
Mathematical Reviews (MathSciNet): MR2432252
Zentralblatt MATH: 1136.62018
Digital Object Identifier: doi:10.1080/10629360600997371
Reinert, G., Schbath, S. and Waterman, M. (2000). Probabilistic and statistical properties of words, an overview. J. Comput. Biol. 7, 1--46.
Robin, S. and Daudin, J.-J. (1999). Exact distribution of word occurrences in a random sequence of letters. J. Appl. Prob. 36, 179--193.
Mathematical Reviews (MathSciNet): MR1699643
Digital Object Identifier: doi:10.1239/jap/1032374240
Project Euclid: euclid.jap/1032374240
Zentralblatt MATH: 0945.60008
Robin, S. and Daudin, J.-J. (2001). Exact distribution of the distances between any occurrences of a set of words. Ann. Inst. Statist. Math. 53, 895--905.
Mathematical Reviews (MathSciNet): MR1880819
Digital Object Identifier: doi:10.1023/A:1014633825822
Zentralblatt MATH: 1006.60012
Robin, S. et al. (2002). Occurrence probability of structured motifs in random sequences. J. Comput. Biol. 9, 761--773.
Stefanov, V. T. (2000). On some waiting time problems. J. Appl. Prob. 37, 756--764.
Mathematical Reviews (MathSciNet): MR1782451
Digital Object Identifier: doi:10.1239/jap/1014842834
Project Euclid: euclid.jap/1014842834
Zentralblatt MATH: 0969.60021
Stefanov, V. T. (2003). The intersite distances between pattern occurrences in strings generated by general discrete- and continuous-time models: an algorithmic approach. J. Appl. Prob. 40, 881--892.
Mathematical Reviews (MathSciNet): MR2012674
Digital Object Identifier: doi:10.1239/jap/1067436088
Project Euclid: euclid.jap/1067436088
Zentralblatt MATH: 1054.60022
Stefanov, V. T. and Pakes, A. G. (1997). Explicit distributional results in pattern formation. Ann. Appl. Prob. 7, 666--678.
Mathematical Reviews (MathSciNet): MR1459265
Digital Object Identifier: doi:10.1214/aoap/1034801248
Project Euclid: euclid.aoap/1034801248
Zentralblatt MATH: 0893.60005
Stefanov, V. T. and Pakes, A. G. (1999). Explicit distributional results in pattern formation. II. Austral. N. Z. J. Statist. 41, 79--90, 254.
Mathematical Reviews (MathSciNet): MR1705486
Stefanov, V. T., Robin, S. and Schbath, S. (2007). Waiting times for clumps of patterns and for structured motifs in random sequences. Discrete Appl. Math. 155, 868--880.
Mathematical Reviews (MathSciNet): MR2309853
Digital Object Identifier: doi:10.1016/j.dam.2005.07.016
Zentralblatt MATH: 1112.60055

2013 © Applied Probability Trust
