The Annals of Applied Statistics

Context tree selection and linguistic rhythm retrieval from written texts

Antonio Galves, Charlotte Galves, Jesús E. García, Nancy L. Garcia, and Florencia Leonardi

Full-text: Open access


The starting point of this article is the question “How to retrieve fingerprints of rhythm in written texts?” We address this problem in the case of Brazilian and European Portuguese. These two dialects of Modern Portuguese share the same lexicon, and most of the sentences they produce are superficially identical. Yet they are conjectured, on linguistic grounds, to implement different rhythms. We show that this linguistic question can be formulated as a problem of model selection in the class of variable length Markov chains. To carry out this approach, we compare texts from European and Brazilian Portuguese. These texts are first encoded according to some basic rhythmic features of the sentences that can be automatically retrieved. This is an entirely new approach from the linguistic point of view. Our statistical contribution is the introduction of the smallest maximizer criterion, which is a constant-free procedure for model selection. As a by-product, this provides a solution to the problem of optimal choice of the penalty constant when using the BIC to select a variable length Markov chain. Besides proving the consistency of the smallest maximizer criterion when the sample size diverges, we also conduct a simulation study comparing our approach with both the standard BIC selection and the Peres–Shields order estimation. Applied to the linguistic sample constituted for our case study, the smallest maximizer criterion assigns different context-tree models to the two dialects of Portuguese. The features of the selected models are compatible with current conjectures discussed in the linguistic literature.
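To make the role of the penalty constant concrete, the following is a minimal Python sketch of BIC-type context tree selection for a variable length Markov chain. It is not the authors' smallest maximizer criterion and not a port of their Perl scripts. The function names, the toy binary chain, the maximal depth, and the per-leaf penalty c(|A| − 1) log n are all assumptions made for illustration. Contexts are written with the oldest symbol first and the most recent symbol last.

```python
import math
import random
from collections import defaultdict


def count_transitions(seq, max_depth):
    """Count how often each symbol follows each context of length <= max_depth.

    Counting starts at position max_depth so that the counts of a context
    equal the summed counts of its one-symbol extensions into the past.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(max_depth, len(seq)):
        for k in range(max_depth + 1):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts


def bic_context_tree(seq, max_depth, c):
    """Return the leaves of the tree maximizing log-likelihood minus penalty.

    Assumed penalty: c * (|A| - 1) * log(n) per leaf (hypothetical choice).
    """
    alphabet = sorted(set(seq))
    counts = count_transitions(seq, max_depth)
    penalty = c * (len(alphabet) - 1) * math.log(len(seq))

    def leaf_value(w):
        # Maximized log-likelihood of the symbols following w, minus penalty.
        nxt = counts[w]
        total = sum(nxt.values())
        return sum(k * math.log(k / total) for k in nxt.values()) - penalty

    def best(w):
        own = leaf_value(w)
        if len(w) == max_depth:
            return own, [w]
        # Extend the context one symbol further into the past.
        children = [b + w for b in alphabet if b + w in counts]
        value, leaves = 0.0, []
        for child in children:
            child_value, child_leaves = best(child)
            value += child_value
            leaves += child_leaves
        # Keep the longer contexts only if they pay for their penalty.
        if value > own:
            return value, leaves
        return own, [w]

    return best("")[1]


def simulate(n, seed=0):
    """Toy binary chain whose contexts are "1", "10" and "00" (illustrative)."""
    rng = random.Random(seed)
    p_one = {"1": 0.2, "10": 0.8, "00": 0.4}
    seq = ["0", "0"]
    while len(seq) < n:
        ctx = "1" if seq[-1] == "1" else seq[-2] + seq[-1]
        seq.append("1" if rng.random() < p_one[ctx] else "0")
    return "".join(seq)


sample = simulate(5000)
for c in (0.0, 0.1, 1.0, 10.0, 1e6):
    print(c, sorted(bic_context_tree(sample, 4, c)))
```

Running the loop shows the selected tree shrinking as c grows, until only the empty context remains. This sensitivity to c is the difficulty that a constant-free procedure is meant to remove.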

Article information

Ann. Appl. Stat., Volume 6, Number 1 (2012), 186-209.

First available in Project Euclid: 6 March 2012


Keywords: variable length Markov chains; model selection; BIC; smallest maximizer criterion; linguistic rhythm; European and Brazilian Portuguese


Galves, Antonio; Galves, Charlotte; García, Jesús E.; Garcia, Nancy L.; Leonardi, Florencia. Context tree selection and linguistic rhythm retrieval from written texts. Ann. Appl. Stat. 6 (2012), no. 1, 186–209. doi:10.1214/11-AOAS511.



  • Abercrombie, D. (1967). Elements of General Phonetics. Aldine, Chicago.
  • Bühlmann, P. (2000). Model selection for variable length Markov chains and tuning the context algorithm. Ann. Inst. Statist. Math. 52 287–315.
  • Bühlmann, P. and Wyner, A. J. (1999). Variable length Markov chains. Ann. Statist. 27 480–513.
  • Csiszár, I. and Talata, Z. (2006). Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans. Inform. Theory 52 1007–1016.
  • Cuesta-Albertos, J. A., Fraiman, R., Galves, A., Garcia, J. and Svarc, M. (2007). Identifying rhythmic classes of languages using their sonority: A Kolmogorov–Smirnov approach. J. Appl. Stat. 34 749–761.
  • Dalevi, D. and Dubhashi, D. (2005). The Peres–Shields order estimator for fixed and variable length Markov models with applications to DNA sequence similarity. In Algorithms in Bioinformatics. Lecture Notes in Computer Science 3692 291–302. Springer, Berlin.
  • Dauer, R. (1983). Stress-timing and syllable-timing reanalyzed. Journal of Phonetics 11 51–62.
  • de Carvalho, B. J. (1988). Réduction vocalique, quantité et accentuation: Pour une explication structurale de la divergence entre portugais lusitanien et portugais brésilien. Boletim de Filologia 32 5–26.
  • Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability 57. Chapman & Hall, New York.
  • Frota, S. and Vigário, M. (2001). On the correlates of rhythm distinctions: The European/Brazilian Portuguese case. Probus 13 247–275.
  • Galves, A. and Leonardi, F. (2008). Exponential inequalities for empirical unbounded context trees. In In and Out of Equilibrium. 2. Progress in Probability 60 257–269. Birkhäuser, Basel.
  • Galves, A. and Löcherbach, E. (2008). Stochastic chains with memory of variable length. In Festschrift in Honour of Jorma Rissanen on the Occasion of His 70th Birthday (Grünwald et al., eds.). TICSP Series 38 117–133. Tampere Univ. Technology, Tampere, Finland.
  • Galves, A., Galves, C., Garcia, J. E., Garcia, N. L. and Leonardi, F. (2011). Supplement to “Context tree selection and linguistic rhythm retrieval from written texts.” DOI:10.1214/11-AOAS511SUPP.
  • Garivier, A. (2006). Modèles contextuels et alphabets infinis en théorie de l’information. Ph.D. thesis, Univ. Paris Sud.
  • Kleinhenz, U. (1997). Domain typology at the phonology-syntax interface. In Interfaces in Linguistic Theory (G. Matos et al., eds.) 201–220. APL/Colibri, Lisboa.
  • Kolmogorov, A. N. and Rychkova, N. G. (2000). Analysis of Russian verse rhythm, and probability theory. Theory Probab. Appl. 44 375–385.
  • Lloyd, J. (1940). Speech Signals in Telephony. Pitman, London.
  • Mehler, J., Dupoux, E., Nazzi, T. and Dehaene-Lambertz, G. (1996). Coping with linguistic diversity: The infant’s viewpoint. In Signal to Syntax: Bootstrapping from Speech to Grammar in Early Acquisition (J. Morgan and K. Demuth, eds.) 101–116. LEA, Hillsdale, NJ.
  • Nespor, M. and Vogel, I. (1986). Prosodic Phonology. Foris, Dordrecht.
  • Peres, Y. and Shields, P. (2005). Two new Markov order estimators. Unpublished manuscript. Available at arXiv:math/0506080v1.
  • Pike, K. L. (1945). The Intonation of American English. University of Michigan Press, Ann Arbor.
  • Ramus, F. (2002). Acoustic correlates of linguistic rhythm: Perspectives. In Proc. First International Conference on Speech Prosody (B. Bel and I. Marlien, eds.) 323–326. Laboratoire Parole et Langage, Aix-en-Provence.
  • Ramus, F., Nespor, M. and Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition 73 265–292.
  • Rissanen, J. (1983). A universal data compression system. IEEE Trans. Inform. Theory 29 656–664.
  • Ron, D., Singer, Y. and Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning 25 117–149.
  • Sândalo, F., Abaurre, M. B., Mandel, A. and Galves, C. (2006). Secondary stress in two varieties of Portuguese and the sotaq optimality based computer program. Probus 18 97–125.
  • Vigário, M. (2003). The Prosodic Word in European Portuguese. de Gruyter, Berlin.
  • Willems, F. M. J., Shtarkov, Y. M. and Tjalkens, T. J. (1995). The context-tree weighting method: Basic properties. IEEE Trans. Inform. Theory 41 653–664.

Supplemental materials

  • Supplementary material: Data set and scripts. The directory SUPPLEMENT [Galves et al. (2011)] contains two subdirectories DATA and SCRIPTS. The directory named DATA contains the samples used in our linguistic case study. A Readme file describing the data sources as well as the linguistic preprocessing and encoding procedure is included in this directory. The directory named SCRIPTS contains the three Perl scripts used in this paper and three associated Readme files explaining how to use the scripts.