Posterior Representations for Bayesian Context Trees: Sampling, Estimation and Convergence

We revisit the Bayesian Context Trees (BCT) modelling framework for discrete time series, which was recently found to be very effective in numerous tasks including model selection, estimation and prediction. A novel representation of the induced posterior distribution on model space is derived in terms of a simple branching process, and several consequences of this are explored in theory and in practice. First, it is shown that the branching process representation leads to a simple variable-dimensional Monte Carlo sampler for the joint posterior distribution on models and parameters, which can efficiently produce independent samples. This sampler is found to be more efficient than earlier MCMC samplers for the same tasks. Then, the branching process representation is used to establish the asymptotic consistency of the BCT posterior, including the derivation of an almost-sure convergence rate. Finally, an extensive study is carried out on the performance of the induced Bayesian entropy estimator. Its utility is illustrated through both simulation experiments and real-world applications, where it is found to outperform several state-of-the-art methods.


Introduction
The statistical modelling and analysis of discrete time series are important scientific and engineering tasks, with a very wide range of applications. Numerous Markovian model classes have been developed in connection with these and related problems, including mixture transition distribution (MTD) models (Raftery, 1985; Berchtold and Raftery, 2002), variable-length Markov chains (VLMC) (Bühlmann and Wyner, 1999; Bühlmann, 2000; Mächler and Bühlmann, 2004) and sparse Markov chains (Jääskinen et al., 2014; Xiong et al., 2016). Alternative approaches also include the use of multinomial logit or probit regression (Zeger and Liang, 1986), categorical regression models (Fokianos and Kedem, 2003), and conditional tensor factorisation (Sarkar and Dunson, 2016).
A popular and useful class of relevant models for discrete time series are the context-tree sources, introduced by Rissanen (1983a,b, 1986) as descriptions of variable-memory Markov chains, a flexible and rich class of chains that admit parsimonious representations. Their key feature is that the memory length of the chain is allowed to depend on the most recently observed symbols, providing a richer model class than ordinary Markov chains. Context-tree sources have been very successful in information-theoretic applications in connection with data compression (Weinberger et al., 1994; Willems et al., 1995), and the celebrated context tree weighting (CTW) algorithm (Willems et al., 1995; Willems, 1998), also based on context-tree sources, has been used widely as an efficient compression method with extensive theoretical guarantees and justifications.
Recently, Kontoyiannis et al. (2022) revisited context-tree models and the CTW algorithm from a Bayesian inference point of view. A general modelling framework, called Bayesian Context Trees (BCT), was developed for discrete time series, along with a collection of efficient algorithmic tools both for exact inference and for posterior sampling via Markov chain Monte Carlo (MCMC). The BCT methods were found to be very effective in important statistical tasks, including model selection, estimation, prediction and change-point detection (Papageorgiou et al., 2021; Lungu et al., 2022b,a); see also the R package BCT (Papageorgiou et al., 2020).
In this work we derive an alternative representation of the posterior distribution induced by the BCT framework, and explore several ways in which it facilitates inference, both in theory and in practice. Our first main contribution is described in Sections 3.1 and 3.2, where we show that both the prior and posterior distributions on model space admit explicit representations as simple branching processes. In particular, sampling tree models from the prior or the posterior is shown to be equivalent to generating trees via an appropriate Galton-Watson process (Athreya and Ney, 2004;Harris, 1963), stopped at a given depth. Therefore, in some sense the BCT model prior acts as a 'conjugate' prior for variable-memory Markov chains. An immediate first practical consequence of this representation is that it facilitates direct Monte Carlo (MC) sampling from the posterior, where independent and identically distributed (i.i.d.) samples can be efficiently obtained from the joint posterior distribution on models and parameters. This variable-dimensional sampler and its potential utility in a wide range of applications (including model selection, parameter estimation and Markov order estimation) are described in Section 3.3.
The Bayesian perspective adopted in this work is neither purely subjective nor purely objective. For example, we think of the model posterior distribution as a summary of the most accurate, data-driven representation of the regularities present in a given time series, but we also examine the frequentist properties of the resulting inferential procedures (Gelman et al., 1995;Chipman et al., 2001;Bernardo and Smith, 2009). Indeed, in Section 4 we employ the branching process representation to show that the posterior asymptotically almost surely concentrates on the "true" underlying model (Theorem 4.1), and in Theorem 4.2 we derive an explicit rate for this convergence as a function of the sample size. Analogous results are established in Theorems 4.3 and 4.4 in the case of out-of-class modelling, when the data are generated by a model outside the BCT class. Importantly, the limiting model is explicitly identified in this case. The branching process representation is also used in Proposition 4.1 to provide a simple, explicit representation of the posterior predictive distribution. These theoretical results are the second main contribution of this work.
Our last contribution, in Section 5, is a brief experimental evaluation of the utility of the MC sampler of Section 3, and a careful examination of the performance of the induced Bayesian entropy estimator. In Section 5.1, the new i.i.d. sampler is compared with the MCMC samplers introduced by Kontoyiannis et al. (2022) on simulated data. As expected, the i.i.d. sampler is found to have superior performance, both in terms of estimation accuracy and in terms of mixing.
Finally, in Section 5.2 we consider the important problem of estimating the entropy rate of a discrete time series. Starting with the original work of Shannon (1951), many different approaches have been developed for this task, including Lempel-Ziv (LZ) estimators (Ziv and Lempel, 1977; Wyner and Ziv, 1989), prediction by partial matching (PPM) (Cleary and Witten, 1984), the CTW algorithm (Gao et al., 2008), and block sorting methods (Cai et al., 2004); for an extensive review of the relevant literature, see Verdú (2019). Entropy estimation has also received a lot of attention in the neuroscience literature (Strong et al., 1998; London et al., 2002; Nemenman et al., 2004; Paninski, 2003), in an effort to describe and quantify the amount of information transmitted by neurons.
In contrast with most earlier work, here we adopt a fully-Bayesian approach. Since the entropy rate is a functional of the model and associated parameters, the branching process sampler of Section 3 makes it possible to effectively sample from (and hence estimate) the actual posterior distribution of the entropy rate. This of course provides a much richer picture than the simple point estimates employed in most applications. The performance of the BCT entropy estimator is illustrated both on simulated data and real-world applications from neuroscience, finance and animal communication, where it is seen to outperform several of the state-of-the-art methods.
In closing this introduction, we mention that there are, of course, numerous other approaches to the problem of inference for discrete time series. In addition to the extensive review given by Kontoyiannis et al. (2022), those include the class of reversible variable-memory chains examined by Bacallado (2011) and Bacallado et al. (2013, 2016), and the Bayesian analyses of discrete models with priors that encourage sparse representations developed in Heiner et al. (2019) and Heiner and Kottas (2022).

Bayesian context trees
In this section, we briefly review the BCT model class, the associated prior structure, and some relevant properties and results that will be needed in subsequent sections.
The BCT model class consists of variable-memory Markov chains, where the memory length of the process may depend on the values of the most recently observed symbols. Variable-memory Markov chains admit natural representations as context trees. Let {X_n} be a dth order Markov chain, for some d ≥ 0, taking values in the alphabet A = {0, 1, . . . , m − 1}. The model describing {X_n} as a variable-memory chain is represented by a proper m-ary tree T as in the example in Figure 1, where a tree T is called proper if any node in T that is not a leaf has exactly m children.
Each leaf of the tree T corresponds to a string s determined by the sequence of symbols along the path from the root node λ to that leaf. At each leaf s, there is an associated set of parameters θ_s that form a probability vector, θ_s = (θ_s(0), θ_s(1), . . . , θ_s(m − 1)). At every time n, the conditional distribution of the next symbol X_n, given the past d observations (x_{n−1}, . . . , x_{n−d}), is given by the vector θ_s associated with the unique leaf s of T that is a suffix of (x_{n−1}, . . . , x_{n−d}). Throughout this paper, every variable-memory Markov chain is described by a tree model T and a set of associated parameters θ = {θ_s ; s ∈ T}, where T is viewed as the collection of its leaves.

Model prior. Given a maximum depth D ≥ 0, let T(D) denote the collection of all proper m-ary trees with depth no greater than D. In Kontoyiannis et al. (2022), the following prior distribution is introduced on T(D),

π(T) = π_D(T; β) := α^{|T|−1} β^{|T|−L_D(T)},    (2.1)

where β ∈ (0, 1) is a hyperparameter, α is given by α = (1 − β)^{1/(m−1)}, |T| is the number of leaves of T, and L_D(T) is the number of leaves of T at depth D.
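For readers who prefer code, the context-lookup rule just described can be sketched as follows; the dictionary representation, function name and numerical values are our own illustrative choices, not from the paper.

```python
# A proper binary (m = 2) tree model, stored as the set of its leaves:
# each leaf context is written with the most recent symbol first, and
# carries a probability vector theta_s over A = {0, 1}.
theta = {
    "0":  [0.9, 0.1],   # previous symbol was 0
    "10": [0.2, 0.8],   # previous symbol 1, preceded by 0
    "11": [0.5, 0.5],   # previous symbol 1, preceded by 1
}

def next_symbol_dist(past):
    """Return theta_s for the unique leaf s matching the past.

    `past` lists earlier symbols with the most recent first, so that a
    context is read backwards in time from the present."""
    s = ""
    for x in past:
        s += str(x)
        if s in theta:
            return theta[s]
    raise ValueError("past too short to determine a context")

# The context "0" applies whenever the previous symbol was 0, whatever
# came before it; the context "10" needs the two previous symbols.
print(next_symbol_dist([0, 1]))  # [0.9, 0.1]
print(next_symbol_dist([1, 0]))  # [0.2, 0.8]
```

Note how the memory length used at each step is determined by the data itself: one symbol suffices after a 0, while two are needed after a 1.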
This prior clearly penalises larger trees by an exponential amount, and larger values of β make the penalisation more severe. We adopt the default value β = 1 − 2^{−m+1}; see Kontoyiannis et al. (2022) for an extensive discussion of the properties of this prior and the choice of the hyperparameter β.
Given a model T, the associated parameters θ = {θ_s ; s ∈ T} are taken to be independent, with each θ_s assigned a Dirichlet prior,

π(θ|T) = ∏_{s∈T} π(θ_s),  with each θ_s ∼ Dirichlet(1/2, . . . , 1/2).    (2.2)

Let x = (x_{−D+1}, . . . , x_0, x_1, . . . , x_n) denote a time series with values in A. For each i ≤ j, we write x_i^j for the segment (x_i, x_{i+1}, . . . , x_j), so that x consists of the observations x_1^n along with an initial context x_{−D+1}^0 of length D. One of the main observations of Kontoyiannis et al. (2022) is that the prior predictive likelihood,

P(x) = ∑_{T∈T(D)} π(T) ∫ P(x|T, θ) π(θ|T) dθ,    (2.3)

averaged over both models and parameters, can be computed exactly and efficiently by a version of the CTW algorithm (where P(x|T, θ) denotes the probability of x under model T with parameters θ), which of course facilitates numerous important statistical tasks. For a given time series x = x_{−D+1}^n, the CTW algorithm uses the estimated probabilities P_e,s defined as follows. For any tree model T ∈ T(D) and any context s (not necessarily a leaf),

P_e,s := [∏_{j=0}^{m−1} (1/2)(3/2) · · · (a_s(j) − 1/2)] / [(m/2)(m/2 + 1) · · · (m/2 + M_s − 1)],    (2.4)

where a_s(j), the jth element of the count vector a_s = (a_s(0), a_s(1), . . . , a_s(m − 1)), is the number of occurrences of symbol j in x that immediately follow an occurrence of the context s, and M_s = a_s(0) + a_s(1) + · · · + a_s(m − 1).
CTW: The context tree weighting algorithm.
1. Build the tree T_MAX, which is the smallest proper tree that contains all the contexts x_{i−D+1}^i, i = 1, 2, . . . , n, as leaves. Compute P_e,s as given in (2.4) for each node s of T_MAX.
2. Starting at the leaves and proceeding recursively towards the root, for each node s of T_MAX compute the weighted probabilities P_w,s, given by,

P_w,s := β P_e,s + (1 − β) ∏_{j=0}^{m−1} P_w,sj,  if the depth of s is strictly smaller than D,    (2.5)

P_w,s := P_e,s,  if s is at depth D,    (2.6)

where sj is the concatenation of context s and symbol j.
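For concreteness, the two steps above can be sketched in a few lines; this is our own illustrative implementation, computed in the log domain for numerical stability. For simplicity it recurses over the full m-ary tree of depth D rather than T_MAX (unobserved nodes have empty counts, so P_e,s = 1 and they contribute a factor of one; this is only sensible for small D).

```python
import math
from collections import defaultdict

def log_Pe(counts, m):
    """log P_e,s for a count vector a_s, as in (2.4):
    prod_j (1/2)(3/2)...(a_s(j) - 1/2) / [(m/2)(m/2+1)...(m/2+M_s-1)],
    written via log-gamma functions for stability."""
    M = sum(counts)
    return (sum(math.lgamma(a + 0.5) - math.lgamma(0.5) for a in counts)
            - (math.lgamma(M + m / 2) - math.lgamma(m / 2)))

def ctw_log_Pw(x, D, m=2, beta=0.5):
    """Return {context: log P_w,s} for the series x, whose first D
    entries serve as the initial context; contexts are tuples written
    with the most recent symbol first."""
    counts = defaultdict(lambda: [0] * m)
    for i in range(D, len(x)):
        ctx = tuple(reversed(x[i - D:i]))
        for d in range(D + 1):   # each prefix of ctx is a context of x[i]
            counts[ctx[:d]][x[i]] += 1
    log_Pw = {}
    def rec(s):
        le = log_Pe(counts[s], m)
        if len(s) == D:          # at maximal depth: P_w,s = P_e,s
            log_Pw[s] = le
            return le
        kids = sum(rec(s + (j,)) for j in range(m))
        a = math.log(beta) + le          # beta * P_e,s
        b = math.log(1 - beta) + kids    # (1 - beta) * prod_j P_w,sj
        hi = max(a, b)
        log_Pw[s] = hi + math.log(math.exp(a - hi) + math.exp(b - hi))
        return log_Pw[s]
    rec(())
    return log_Pw

# With D = 0 the weighted and estimated probabilities coincide at the
# root: exp(root) below equals P_e,lambda = 3/128 for the counts (2, 2).
root = ctw_log_Pw([0, 1, 1, 0], 0)[()]
```

The log-sum-exp step is the recursion (2.5); one pass of `rec` from the root visits every node once, mirroring the leaves-to-root sweep of step 2.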
As shown in Kontoyiannis et al. (2022), the weighted probability P_w,λ produced by the CTW algorithm at the root λ is exactly equal to the prior predictive likelihood in (2.3). A family of Markov chain Monte Carlo (MCMC) samplers for the posterior π(T|x) or π(T, θ|x) was also introduced in Kontoyiannis et al. (2022). However, as will be seen below, the representation of Section 3 leads to a simple i.i.d. MC sampler that typically outperforms these MCMC samplers.

Branching process representations
In this section we show that the BCT model prior π(T ) = π D (T ; β) of (2.1) and the resulting posterior π(T |x) both admit natural and easily interpretable representations in terms of simple branching processes. We also discuss how the posterior representation leads to efficient samplers that can be used for model selection and estimation.

The prior branching process
Given D ≥ 0 and β ∈ (0, 1), let T = {λ} consist of only the root node λ and consider the following procedure: • If D = 0, stop.
• If D > 0, then, with probability β, mark the root as a leaf and stop, or, with probability (1 − β), add all m children of λ at depth 1 to T . If D = 1, stop.
• If D > 1, examine each of the m new nodes, and either mark a node as a leaf with probability β, or add all m of its children to T with probability (1 − β), independently from node to node.
• Continue recursively, at each step examining all non-leaf nodes at depths strictly smaller than D, until no more eligible nodes remain to be examined.
• Output the resulting tree T ∈ T (D).
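In code, the construction above reads as follows (a sketch in our own notation, with contexts represented as tuples of symbols):

```python
import random

def sample_prior_tree(m, D, beta, rng=random):
    """Draw a model from the prior pi_D(T; beta) via the Galton-Watson
    construction: each examined node at depth < D becomes a leaf with
    probability beta, and otherwise spawns all m children; nodes at
    depth D are always leaves. Returns the set of leaf contexts."""
    leaves, stack = set(), [()]
    while stack:
        s = stack.pop()
        if len(s) == D or rng.random() < beta:
            leaves.add(s)
        else:
            stack.extend(s + (j,) for j in range(m))
    return leaves

# Sanity check: for D >= 1 the root-only tree has probability beta.
rng = random.Random(0)
frac = sum(sample_prior_tree(2, 5, 0.5, rng) == {()}
           for _ in range(10000)) / 10000
# frac should be close to beta = 0.5
```

Every tree returned is proper by construction, since a node either remains a leaf or receives all m of its children.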
The above construction is a simple Galton-Watson process (Athreya and Ney, 2004; Harris, 1963) with offspring distribution (β, (1 − β)) on {0, m}, stopped at generation D. The following proposition states that the distribution of a tree T generated by this process is exactly the prior π_D(T; β). Note that this also implies that the expression for π_D(T; β) given in (2.1) indeed defines a probability distribution on T(D), giving an alternative proof of (Kontoyiannis et al., 2022, Lemma 2.1).
Proposition 3.1. For any D ≥ 0 and any β ∈ (0, 1), the probability that the above branching process produces any particular tree T ∈ T (D) is given by π D (T ; β) as in (2.1).
Proof. When D = 0, T (D) consists of a single tree, T = {λ}, which has probability 1 under both π D and the branching process construction. Assume D ≥ 1. Note that every tree T ∈ T (D) can be viewed as a collection of a number, k, say, of m-branches, since every node in T has either zero or m children. The proposition is proven by induction on k. The result is trivial for k = 0, since the only tree with no m-branches is T = {λ} and its probability under both π D and the branching process construction is equal to β.
Suppose the claim of the proposition is true for all trees with k m-branches, and let T ∈ T(D) consist of (k + 1) m-branches. Then T can be obtained from some T′ ∈ T(D) that has k m-branches, by adding a single m-branch to one of its leaves, s, say. Two cases are considered.
(i) If s is at depth D − 2 or smaller, then the probability π_b(T) of T under the branching process construction is,

π_b(T) = π_b(T′) · (1/β) · (1 − β) β^m = π_D(T′; β) · (1 − β) β^{m−1} = α^{|T′|−1} β^{|T′|−L_D(T′)} · (1 − β) β^{m−1},

where the second equality follows from the inductive hypothesis and the third from the definition of π_D(T′; β); the first equality holds because the leaf s of T′ (which contributed a factor β) becomes internal (a factor 1 − β), and its m children become leaves at depth at most D − 1 (a factor β each). Therefore, using α^{m−1} = 1 − β,

π_b(T) = α^{|T|−1} β^{|T|−L_D(T)} = π_D(T; β),

since |T| = |T′| + m − 1 and no leaves are added at depth D, so that L_D(T) = L_D(T′), as required.
(ii) If s is at depth D − 1, we similarly find that,

π_b(T) = π_b(T′) · (1/β) · (1 − β) = π_D(T′; β) · (1 − β)/β = α^{|T|−1} β^{|T|−L_D(T)} = π_D(T; β),

since in this case the m new children of s are at depth D and are therefore leaves automatically, while now |T| = |T′| + m − 1 and L_D(T) = L_D(T′) + m, completing the proof.
Apart from being aesthetically appealing, this representation also offers a simple and practical way of sampling from π D (T ; β). Moreover, using well-known properties of the Galton-Watson process we can perform some direct computations that offer better insight into the nature and specific properties of the BCT prior.
Interpretation and choice of β.
The branching process description of π_D(T; β) further clarifies the role of the hyperparameter β: it is exactly the probability that, when a node is examined during the construction of the tree T, it is marked as a leaf, so that its children are not included in T.
In terms of choosing the value of the hyperparameter β appropriately, recall that, for a Galton-Watson process, the expected number of children of each node, in this case ρ = m(1 − β), governs the probability of extinction P_e: if ρ ≤ 1 we have P_e = 1, whereas if ρ > 1, P_e is strictly less than one. Therefore, in the binary case m = 2, the original choice β = 1/2 used in the CTW algorithm gives an expected number of children equal to the critical value ρ = 1. This suggests that a reasonable choice for general alphabets could be β = 1 − 1/m, which keeps ρ = 1, so that the resulting prior has similar qualitative characteristics to the well-studied binary case. This is also in line with the observation of Kontoyiannis et al. (2022) that β should decrease with m.

Now suppose T is a random model generated by the prior, and let L_d(T) denote the number of nodes of T at depth d = 0, 1, . . . , D. Then, standard Galton-Watson theory (Athreya and Ney, 2004; Harris, 1963) provides the useful expressions,

E[L_d(T)] = ρ^d,  and  Var(L_d(T)) = σ² ρ^{d−1} (ρ^d − 1)/(ρ − 1) if ρ ≠ 1,  with Var(L_d(T)) = d σ² if ρ = 1,

where σ² = m²β(1 − β).
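These moments are easy to check numerically; the sketch below (names and set-up are ours) simulates only the generation sizes L_d(T), using the fact that each of the L_d nodes at depth d < D branches independently with probability 1 − β:

```python
import random

def depth_profile(m, D, beta, rng):
    """One draw of (L_0(T), ..., L_D(T)) under the prior: each node at
    depth d < D contributes m nodes at depth d + 1 with prob. 1 - beta."""
    sizes = [1]
    for _ in range(D):
        branching = sum(rng.random() < 1 - beta for _ in range(sizes[-1]))
        sizes.append(m * branching)
    return sizes

# At the critical choice beta = 1 - 1/m (so rho = 1), E[L_d(T)] = 1
# at every depth d.
rng = random.Random(1)
avg_LD = sum(depth_profile(3, 4, 1 - 1 / 3, rng)[4]
             for _ in range(20000)) / 20000
# avg_LD should be close to E[L_4(T)] = 1
```

The variance dσ² at criticality explains why individual draws of L_D(T) can still be far from 1 even though their mean is exactly 1.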

The posterior branching process
Given a time series x = x_{−D+1}^n, a maximum depth D, and β ∈ (0, 1), for any context s with length strictly smaller than D we define the branching probabilities P_b,s as,

P_b,s := β P_e,s / P_w,s,    (3.1)

where the estimated and weighted probabilities, P_e,s and P_w,s, are defined in (2.4)-(2.6), and with the convention that P_b,s = β for all contexts s that do not appear in x. Starting with T = {λ}, the following construction produces a sample model T ∈ T(D) from the model posterior π(T|x): • If D = 0, stop.
• If D > 0, then, with probability P b,λ , mark the root as a leaf and stop, or, with probability (1 − P b,λ ), add all m children of λ at depth 1 to T . If D = 1, stop.
• If D > 1, examine each of the m new nodes and either mark a node s as a leaf with probability P b,s , or add all m of its children to T with probability (1 − P b,s ), independently from node to node.
• Continue recursively, at each step examining all non-leaf nodes at depths strictly smaller than D, until no more eligible nodes remain.
• Output the resulting tree T ∈ T (D).
Proposition 3.2. For any D ≥ 0 and any β ∈ (0, 1), the probability that the above branching process produces any particular tree T ∈ T (D) is given by π(T |x).
The proof, which follows along the same lines as that of Proposition 3.1, is given in Section A of the supplementary material. It is perhaps somewhat remarkable that the posterior π(T |x) on the vast model space T (D) admits such a simple description. Indeed, the posterior branching process is of exactly the same form as that of the prior, which can then naturally be viewed as a conjugate prior on T (D).
Model posterior probabilities. Proposition 3.2 allows us to write an exact expression for the posterior of any model T ∈ T(D) in terms of the branching probabilities P_b,s,

π(T|x) = ∏_{s∈T^o} (1 − P_b,s) ∏_{s∈T} P_b,s,    (3.2)

where T^o denotes the set of all internal nodes of T, and with the convention that P_b,s = 1 for all leaves of T at depth d = D. This expression will be the starting point in the proofs of the asymptotic results of Section 4 for π(T|x).
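In code, this expression amounts to a product over the nodes of T; the sketch below (our own notation) takes the branching probabilities as a given callable:

```python
import math

def log_model_posterior(T_leaves, p_b, D):
    """log pi(T | x) as the product of (1 - P_b,s) over internal nodes
    and of P_b,s over leaves, with P_b,s taken to be 1 for leaves at
    depth D. `T_leaves` is the set of leaf contexts (tuples, most
    recent symbol first) of a proper tree; `p_b(s)` returns P_b,s."""
    internal = {s[:k] for s in T_leaves for k in range(len(s))}
    lp = sum(math.log(1 - p_b(s)) for s in internal)
    lp += sum(math.log(p_b(s)) for s in T_leaves if len(s) < D)
    return lp

# With the constant choice p_b(s) = beta this recovers the prior:
# e.g. the root-only model {lambda} has probability beta.
```

Working in the log domain matters here, since for large models the product contains one factor per node and can underflow.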
In terms of inference, the main utility of Proposition 3.2 is that it offers a practical way of obtaining exact i.i.d. samples directly from the model posterior, as described in the next section.

Sampling from the posterior
The branching process representation of π(T|x) readily leads to a simple way of obtaining i.i.d. samples {T^(i)} from the posterior on model space. And since the full conditional density of the parameters, π(θ|T, x), is explicitly identified by Kontoyiannis et al. (2022) as a product of Dirichlet densities, for each T^(i) we can draw a conditionally independent sample θ^(i) ∼ π(θ|T^(i), x), producing a sequence of exact i.i.d. samples {(T^(i), θ^(i))} from the joint posterior π(T, θ|x). This facilitates numerous applications. For example, effective parameter estimation can be performed by simply keeping the samples {θ^(i)}, which come from the marginal posterior distribution π(θ|x). Similarly, Markov order estimation can be performed by collecting the sequence of maximum depths of the models {T^(i)}. And in model selection tasks, the model posterior can be extensively explored, offering better insight and deeper understanding of the underlying structure and dependencies present in the data.
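A sketch of one such joint draw is given below; it uses the Galton-Watson growth of Section 3.2 with a callable p_b(s) in place of the constant β, and then draws each θ_s from its Dirichlet full conditional (with the Dirichlet(1/2, . . . , 1/2) prior, the posterior parameters are a_s(j) + 1/2). The interface is our own; in practice p_b and the counts would come from one CTW pass over the data.

```python
import random

def sample_joint(p_b, counts, m, D, rng):
    """One i.i.d. draw (T, theta) from pi(T, theta | x).

    `p_b(s)` returns the branching probability P_b,s of a context s
    (a tuple, most recent symbol first), and `counts[s]` its count
    vector a_s; contexts missing from `counts` get zero counts."""
    leaves, stack = set(), [()]
    while stack:                          # grow the model T
        s = stack.pop()
        if len(s) == D or rng.random() < p_b(s):
            leaves.add(s)
        else:
            stack.extend(s + (j,) for j in range(m))
    theta = {}
    for s in leaves:                      # then draw theta | T, x
        a = counts.get(s, [0] * m)
        # Dirichlet via normalised independent gamma draws
        g = [rng.gammavariate(a[j] + 0.5, 1.0) for j in range(m)]
        tot = sum(g)
        theta[s] = [v / tot for v in g]
    return leaves, theta
```

Each call is one exact, independent draw from the joint posterior; repeated calls give the i.i.d. samples {(T^(i), θ^(i))} used throughout this section.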
Although a family of MCMC samplers was introduced and successfully used for the same tasks in Kontoyiannis et al. (2022), MCMC sampling has well-known limitations and drawbacks, including potentially slow mixing, high correlation between samples, and the need for convergence diagnostics (Gelman and Rubin, 1992; Cowles and Carlin, 1996; Robert and Casella, 2004). Partly for these reasons, being able to obtain i.i.d. samples from the posterior is generally much more desirable, as illustrated in Section 5.1.

Estimation of general functionals.
Consider the general Bayesian estimation problem, where the goal is to estimate an arbitrary functional F = F(T, θ) of the underlying variable-memory chain, based on data x. Using the above sampler, the entire posterior distribution of the statistic F can be explored, by considering the i.i.d. samples F^(i) = F(T^(i), θ^(i)), distributed according to the desired posterior π(F|x).
In connection with classical estimation techniques, and in order to evaluate estimation performance more easily in practice, several reasonable point estimates can also be obtained. The most common choices are the empirical average approximation to the posterior mean, or the posterior mode, i.e., the maximum a posteriori probability (MAP) estimate, F_MAP. In cases where the conditional mean F̂(T) = E[F(T, θ)|x, T] can be computed for any model T (as, e.g., in the case of parameter estimation), a lower-variance Rao-Blackwellised estimate (Blackwell, 1947; Gelfand and Smith, 1990) for the posterior mean can also be obtained as,

F̂_RB = (1/N) ∑_{i=1}^N F̂(T^(i)).

Importantly, as this posterior sampler provides access to the entire posterior distribution π(F|x) of the statistic of interest F, standard Bayesian methodology can be applied to quantify the resulting uncertainty of any estimator F̂, for example by obtaining credible intervals in terms of the posterior π(F|x).
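For instance, with i.i.d. posterior samples of F in hand, the posterior mean and an equal-tailed credible interval can be read off directly (a generic sketch, with our own function name):

```python
def posterior_summaries(F_samples, level=0.95):
    """Point estimate and equal-tailed credible interval from i.i.d.
    posterior samples of a scalar statistic F: the empirical mean,
    and the empirical (1-level)/2 and (1+level)/2 quantiles."""
    xs = sorted(F_samples)
    n = len(xs)
    mean = sum(xs) / n
    lo = xs[int(n * (1 - level) / 2)]
    hi = xs[min(n - 1, int(n * (1 + level) / 2))]
    return mean, (lo, hi)

# e.g. posterior_summaries(F_list) -> (posterior-mean estimate, 95% interval)
```

Because the samples are i.i.d., no effective-sample-size corrections are needed, in contrast with the MCMC setting.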
An interesting special case of particular importance in practice is the estimation of the entropy rate H of the underlying process, F (T, θ) = H(T, θ). The performance of all methods discussed above, with emphasis on the estimation of the entropy rate, is illustrated through simulated experiments and real-world applications in Section 5.2.
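As a simple illustration for this special case, recall that the entropy rate of an ergodic variable-memory chain is the stationary average of the leaf conditional entropies, H(T, θ) = −∑_s π(s) ∑_j θ_s(j) log θ_s(j). The sketch below (our own set-up) approximates this for one posterior sample (T, θ) by simulating the chain and averaging the entropies of the visited leaves; an exact evaluation would instead solve for the stationary distribution of the induced first-order chain.

```python
import math, random

def entropy_rate_mc(theta, n_steps, rng, burn=1000):
    """Monte Carlo approximation of the entropy rate of the chain
    (T, theta): simulate, and average the conditional entropies of the
    active leaves. `theta` maps leaf contexts (tuples, most recent
    symbol first) to probability vectors over A = {0, ..., m-1}."""
    D = max(len(s) for s in theta)
    past = [0] * D                 # arbitrary start, discarded by burn-in
    total = count = 0
    for t in range(burn + n_steps):
        s = ()
        while s not in theta:      # find the leaf matching the past
            s = s + (past[len(s)],)
        p = theta[s]
        if t >= burn:
            total += -sum(q * math.log(q) for q in p if q > 0)
            count += 1
        u, x, acc = rng.random(), 0, p[0]
        while u > acc:             # draw the next symbol from theta_s
            x += 1
            acc += p[x]
        past = [x] + past[:-1]
    return total / count

# An i.i.d. fair coin has entropy rate log 2: every leaf entropy is log 2.
h = entropy_rate_mc({(): (0.5, 0.5)}, 200, random.Random(3))
```

Applying this to each posterior sample (T^(i), θ^(i)) yields draws from the posterior of the entropy rate, in the spirit of the Bayesian estimator studied in Section 5.2.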

Theoretical results
Using the branching process representation of Proposition 3.2, we show how to derive precise results on the asymptotic behaviour of the BCT posterior π(T |x) on model space, and provide an explicit, useful expression for the posterior predictive distribution.
Let {X n } be a variable-memory chain with model T ∈ T (D). The specific model T that describes the chain is typically not unique, for the same reason, e.g., that every i.i.d. process can also trivially be described as a first-order Markov chain: Adding m children to any leaf of T which is not at maximal depth, and giving each of them the same parameters as their parent, leaves the distribution of the chain unchanged.
The natural main goal in model selection is to identify the "minimal" model, i.e., the smallest model that can fully describe the distribution of the chain. A model T ∈ T(D) is called minimal if every m-tuple of leaves {sj ; j = 0, 1, . . . , m − 1} in T contains at least two with non-identical parameters, i.e., if there are j ≠ j′ such that θ_sj ≠ θ_sj′. It is easy to see that every Dth order Markov chain {X_n} has a unique minimal model T* ∈ T(D).
A variable-memory chain {X_n} with model T ∈ T(D) and associated parameters θ = {θ_s ; s ∈ T} is ergodic if the corresponding first-order chain {Z_n := X_{n−D+1}^n ; n ≥ 1}, taking values in A^D, is irreducible and aperiodic. In order to avoid uninteresting technicalities, in most of our results we will assume that the data are generated by a positive-ergodic chain {X_n}, namely one for which all the parameters θ_s(j) are nonzero, so that its unique stationary distribution π gives strictly positive probability to all finite contexts s.

Posterior consistency and concentration
Our first theorem is a strong consistency result, which states that, if the data x = x_{−D+1}^n are generated by an ergodic chain with minimal model T* ∈ T(D), then the model posterior asymptotically almost surely (a.s.) concentrates on T*.

Theorem 4.1. Let X_{−D+1}^n be a time series generated by a positive-ergodic variable-memory chain {X_n} with minimal model T* ∈ T(D). Then, for any value of the prior hyperparameter β ∈ (0, 1), π(T*|X_{−D+1}^n) → 1, a.s., as n → ∞.

Theorem 4.1 both strengthens and generalises a weaker result on the asymptotic behaviour of the MAP model established in (Willems et al., 1993, Theorem 8).
Proof. Recalling the posterior representation in (3.2), it suffices to show that, as n → ∞, P b,s → 0 a.s. for all internal nodes of T * , and P b,s → 1 a.s. for all leaves of T * . These two claims are established in Lemmas 4.2 and 4.3 below.
Lemma 4.2. Under the assumptions of Theorem 4.1, for every internal node s of T * , the branching probability P b,s → 0, a.s., as n → ∞.
Proof. As the complete proof is quite involved, only the main and more interesting part of the argument is given here, with the remaining details given in Section B.1 of the supplementary material. We begin by observing that, for some constant c_0,

P_b,s = β P_e,s / P_w,s ≤ β P_e,s / [(1 − β) ∏_j P_w,sj] ≤ c_0 P_e,s / ∏_j P_e,sj,

where in the last step we used that either P_w,sj ≥ βP_e,sj or P_w,sj = P_e,sj. Therefore, it suffices to show that P_e,s / ∏_j P_e,sj → 0, a.s., as n → ∞. Let s be a fixed finite context, and let X and J denote the random variables corresponding to the symbols that follow and precede s, respectively, under the stationary distribution of {X_n}. Using Lemma 4.1 and the ergodic theorem for Markov chains, it is shown in Section B.1 of the supplementary material that,

log P_e,s − ∑_j log P_e,sj = −n I(X; J|s) π(s) + o(n), a.s.,    (4.5)

where I(X; J|s) is the conditional mutual information between X and J given s (Cover and Thomas, 2012, Ch. 2). This mutual information is always nonnegative, and it is zero if and only if X and J are conditionally independent given s. For any internal node s that is a parent of leaves of T*, the minimality of T* implies that X and J are not conditionally independent given s, as there exist j ≠ j′ such that θ_sj ≠ θ_sj′, so that θ_sj depends on j. Therefore, I(X; J|s) > 0 and π(s) > 0 by assumption, so (4.5) implies that log P_e,s − ∑_j log P_e,sj → −∞, a.s., as required.
For the general case of internal nodes that may not be parents of leaves, a simple iterative argument given in Section B.1 of the supplementary material establishes the same result in that case as well, completing the proof of the lemma.
Lemma 4.3. Under the assumptions of Theorem 4.1, for every leaf s of T * , the branching probability P b,s → 1, a.s., as n → ∞.
Proof. As with the previous lemma, we only give an outline of the main interesting steps in the proof here; complete details are provided in Section B.2 of the supplementary material.
By definition, for any leaf s of T* (and also for any 'external' node s, that is, any context s not in T*), we have I(X; J|s) = 0 because of conditional independence. Therefore, we need to consider the higher-order terms in the asymptotic expansion of (4.5). Using Lemma 4.1, the ergodic theorem, and the law of the iterated logarithm (LIL) for Markov chains, it is shown in Section B.2 of the supplementary material that here,

∑_j log P_e,sj − log P_e,s = −((m − 1)²/2) log n + O(log log n), a.s.    (4.6)

This implies that ∑_j log P_e,sj − log P_e,s → −∞, so that ∏_j P_e,sj / P_e,s → 0, a.s. The last step of the proof, namely that for any leaf s of T*, ∏_j P_e,sj / P_e,s → 0 also implies that ∏_j P_w,sj / P_e,s → 0, a.s., so that, by (4.3), P_b,s → 1, a.s., is given in Section B.2 of the supplementary material.
Our next result is a refinement of Theorem 4.1, which characterises the rate at which the posterior probability of T * converges to 1.
Theorem 4.2. Let X_{−D+1}^n be a time series generated by a positive-ergodic variable-memory chain {X_n} with minimal model T* ∈ T(D). For any value of the prior hyperparameter β ∈ (0, 1) and any ε > 0, we have, as n → ∞:

1 − π(T*|X_{−D+1}^n) = O(n^{−((m−1)²/2)+ε}), a.s.

The proof of Theorem 4.2 is given in Section B.3 of the supplementary material. In fact, as discussed at the end of Section B.3, the proof also reveals that a stronger statement can be made about the rate in the case of full Dth order Markov chains:

Corollary 4.1. If {X_n} is a genuinely Dth order chain, in that its minimal model T* is the complete tree of depth D, then its posterior probability π(T*|X_{−D+1}^n) almost surely converges to 1 at an exponential rate.

Out-of-class modelling
In this section we consider the behaviour of the posterior distribution π(T |x) on models T ∈ T (D) when the time series x is not generated by a chain from the model class T (D), but from a general stationary and ergodic process {X n } with possibly infinite memory. We first give an explicit description of the "limiting" model T ∞ ∈ T (D) on which the posterior π(T |x) concentrates when the observations are generated by a general process outside T (D), and then we give conditions under which T ∞ is structurally "as close as possible" to the true underlying model.
Recall that, for any context s, we write X and J for the random variables corresponding to the symbols that follow and precede s, respectively. The limiting tree T ∞ ∈ T (D) corresponding to a general stationary process {X n } with values in A can be constructed via the following procedure: • Take T ∞ to be the empty tree.
• Starting with the nodes at depth d = D − 1, for each such s, if I(X; J|s) > 0, then add s to T ∞ along with all its children and all its ancestors; that is, add the complete path from the root λ to the children of s.
• After all nodes s at depth d = D − 1 have been examined, examine all possible nodes s at depth d = D − 2 that are not already included in T ∞ , and repeat the same process.
• Continue recursively towards the root, until all nodes at all depths 0 ≤ d ≤ D − 1 have been examined.
• For any node already in T ∞ at depth d ≤ D − 1, such that only some but not all m of its children are included in T ∞ , add the missing children to T ∞ so that it becomes proper.
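Given an oracle for whether I(X; J|s) > 0 (in practice this would have to be estimated from data), the construction can be sketched as follows; representing the tree by its set of internal nodes makes the final completion step automatic, since the leaves are then just the non-internal children of internal nodes (names and interface are ours):

```python
from itertools import product

def limiting_tree_internal(mi_positive, m, D):
    """Internal nodes of T_infinity: working upwards from depth D - 1,
    whenever I(X; J|s) > 0, the node s and all its ancestors become
    internal, i.e. their children belong to the tree. `mi_positive(s)`
    is the oracle; contexts are tuples, most recent symbol first."""
    internal = set()
    for d in range(D - 1, -1, -1):
        for s in product(range(m), repeat=d):   # all contexts at depth d
            if s not in internal and mi_positive(s):
                for k in range(d + 1):          # s and all its ancestors
                    internal.add(s[:k])
    return internal

# Dependence detected only at the root: T_infinity is the full
# first-order tree, whose only internal node is the root ().
```

Enumerating all m^d contexts at each depth is only feasible for small D; it mirrors the exhaustive examination in the description above.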
In order to state our results we need to impose two additional conditions on the underlying data-generating process. Suppose {X_n} is stationary. Without loss of generality (by Kolmogorov's extension theorem) we may consider the two-sided version of the process, {X_n ; n ∈ Z}. Its α-mixing coefficients (Ibragimov, 1962; Philipp and Stout, 1975) are defined as,

α(n) := sup { |P(A ∩ B) − P(A)P(B)| : A ∈ A, B ∈ B },

where A and B denote the σ-algebras σ(. . . , X_{−1}, X_0) and σ(X_n, X_{n+1}, . . .), respectively. We will need a mixing condition and a positivity condition, stated jointly as (4.8), for our results.

Theorem 4.3. Let X_{−D+1}^n be a time series generated by a stationary ergodic process {X_n} satisfying the assumptions (4.8), and let T∞ ∈ T(D) be given by the above construction. Then, for any value of the prior hyperparameter β ∈ (0, 1), we have: π(T∞|X_{−D+1}^n) → 1, a.s., as n → ∞.

The proof of Theorem 4.3 follows along exactly the same lines as the earlier proof of Theorem 4.1. Instead of the ergodic theorem for Markov chains (Chung, 1967, p. 92) we now use Birkhoff's ergodic theorem (Breiman, 1992, Ch. 6), and instead of the LIL for Markov chains we apply the general LIL for functions of blocks of an ergodic process, which follows, as usual, from the almost-sure invariance principle (Philipp and Stout, 1975; Rio, 1995; Zhao and Woodroofe, 2008). The mixing condition in (4.8) was chosen as one of the simplest that guarantee this general version of the LIL.
Regarding the overall structure of the proof, all the earlier asymptotic expansions of the branching probabilities still remain valid, including (4.5) and (4.6). The only possible difference might be at the boundary conditions that are required as a starting point for the iterative argument in the proof of Lemma 4.2, since the actual "leaves" of the true underlying model of {X n } are not necessarily at depth d ≤ D here. But, as before, all branching probabilities P b,s tend either to 0 or to 1, depending on whether the mutual information condition that appears in the description of T ∞ holds or not. Specifically, starting from nodes s at depth d = D − 1, if I(X; J|s) = 0 then we are in the same situation as in Lemma 4.3, so that P b,s → 1, and all children of s are pruned. On the other hand, if I(X; J|s) > 0, then P b,s → 0, and by the same iterative argument, P b,u → 0 for all ancestors u of node s as well.
Analogous comments apply to the proof of Theorem 4.4, which is again a refinement characterising the rate at which the posterior probability of T ∞ converges to 1.
Theorem 4.4. Let X^n_{−D+1} be a time series generated by a stationary ergodic process {X_n} satisfying the assumptions (4.8), and let T_∞ be its limiting model in T(D). Then, for any β ∈ (0, 1) and any ε > 0, as n → ∞ we have:

In general, it is natural to expect that T_∞ should be "as close as possible" in some sense to the true underlying model T*, and this is indeed what is most often observed in applications: T_∞ is the same as T* truncated to depth D.
But this is not always the case. For example, recall the 3rd order chain {X_n} considered in Example 5.2 of Kontoyiannis et al. (2022), also described as having a "bimodal posterior" in Section 5.2. There, T* is the complete m-ary tree of depth 3 (with m = 6), but X_n depends on (X_{n−1}, X_{n−2}, X_{n−3}) only via X_{n−3}. For that reason, the limiting model T_∞ in T(D) with D = 1 or D = 2 is not T* truncated at depth D, but rather the empty tree {λ} consisting of only the root node λ.
Fortunately, a simple necessary and sufficient condition for the "expected" behaviour to occur can be read off from the definition of T_∞ itself. Let {X_n} be a stationary and ergodic process on the finite alphabet A. From the construction of Csiszár and Talata (2006), it is easy to see that there is a unique minimal context tree model T* for {X_n}, of possibly infinite depth. Let T*|_D denote T* truncated at depth D, and write N_{D−1}(T*) for the set of all internal nodes of T* at depth d = D − 1 whose children exist in T* (at depth D) but are not leaves of T*.

Proof. The result follows directly from the definition of T_∞, combined with the observation that the condition I(X; J|s) > 0 is already satisfied for all nodes s of T* at depth d = D − 1 whose children are leaves of T*.
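The pruning criterion throughout this section hinges on whether the conditional mutual information I(X; J|s) vanishes. As a purely illustrative sketch (the function name and the toy joint laws below are ours, not from the paper), the quantity can be computed from any joint pmf of the next symbol X and the preceding symbol J, conditional on a context s:

```python
import numpy as np

def conditional_mutual_information(p_xj):
    """I(X; J) for a joint pmf given as a 2-D array p_xj[x, j].

    In the text, p_xj would be the joint law of the next symbol X and the
    preceding symbol J conditional on the context s, derived from the
    stationary distribution pi; here it is just any joint pmf (in nats).
    """
    p_xj = np.asarray(p_xj, dtype=float)
    p_x = p_xj.sum(axis=1, keepdims=True)   # marginal of X
    p_j = p_xj.sum(axis=0, keepdims=True)   # marginal of J
    mask = p_xj > 0
    return float((p_xj[mask] * np.log(p_xj[mask] / (p_x @ p_j)[mask])).sum())

# Independent X and J: I(X; J) = 0, so the corresponding node is pruned.
indep = np.outer([0.3, 0.7], [0.5, 0.5])
# Perfectly dependent X and J: I(X; J) = log 2 > 0, so the node branches.
dep = np.array([[0.5, 0.0], [0.0, 0.5]])
```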

The posterior predictive distribution
The branching process representation of the posterior can also be used to facilitate practically useful computations. In Proposition 4.1, an exact expression is given for the posterior predictive distribution P(x_{n+1} | x^n_{−D+1}) in terms of the branching probabilities P_{b,s}.
Proposition 4.1. The posterior predictive distribution is given by,

P(x_{n+1} | x^n_{−D+1}) = Σ_{i=0}^{D} γ_i (a_{s(i)}(x_{n+1}) + 1/2) / (M_{s(i)} + m/2),

where, for 0 ≤ i ≤ D, the string s(i) is the context of length i preceding x_{n+1}, and γ_i is the posterior probability that node s(i) is a leaf, given by,

γ_i = P_{b,s(i)} Π_{j=0}^{i−1} (1 − P_{b,s(j)}),   0 ≤ i ≤ D.   (4.10)

Proof. Writing x = x^n_{−D+1}, the posterior predictive distribution can be expressed as,

P(x_{n+1} | x) = Σ_{T ∈ T(D)} P(x_{n+1} | T, x) π(T | x).   (4.11)

For any tree T ∈ T(D), exactly one of the contexts s(i), 0 ≤ i ≤ D, is a leaf of the tree. For every 0 ≤ i ≤ D, define the subset T_i(D) ⊂ T(D) to be the collection of trees T ∈ T(D) such that the context of x_{n+1} that is a leaf of T is s(i); these T_i(D) are disjoint and their union is T(D).
The key observation here is that P(x_{n+1} | T, x) is the same for all trees T ∈ T_i(D), since,

P(x_{n+1} | T, x) = (a_{s(i)}(x_{n+1}) + 1/2) / (M_{s(i)} + m/2),

where we used the full conditional density of the parameters in (3.3). The proof is then completed from (4.11), upon noticing that the sum Σ_{T ∈ T_i(D)} π(T | x) is exactly the posterior probability that node s(i) is a leaf, namely, γ_i as in (4.10).
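The resulting predictive distribution is a γ-weighted mixture over the D + 1 candidate contexts. A minimal sketch, assuming the Dirichlet(1/2, …, 1/2) prior on the parameters; the function and the example numbers are hypothetical, not the paper's implementation:

```python
import numpy as np

def posterior_predictive(symbol, gammas, counts, m):
    """Posterior predictive probability of `symbol`, as a gamma-weighted
    mixture over the D+1 candidate contexts s(0), ..., s(D).

    gammas[i] -- posterior probability that context s(i) is a leaf
    counts[i] -- count vector a_{s(i)} of symbols following s(i) in the data
    m         -- alphabet size

    Each candidate leaf contributes the Dirichlet(1/2,...,1/2) predictive
    (a_s(x) + 1/2) / (M_s + m/2); a sketch, not a definitive implementation.
    """
    prob = 0.0
    for g, a in zip(gammas, counts):
        a = np.asarray(a, dtype=float)
        prob += g * (a[symbol] + 0.5) / (a.sum() + m / 2.0)
    return prob

# Hypothetical example: binary alphabet (m = 2), maximum depth D = 2.
gammas = [0.2, 0.5, 0.3]                 # leaf probabilities, summing to 1
counts = [[30, 10], [12, 8], [3, 1]]     # a_{s(i)} for i = 0, 1, 2
p0 = posterior_predictive(0, gammas, counts, m=2)
p1 = posterior_predictive(1, gammas, counts, m=2)
```

Since each per-context predictive is a proper pmf and the γ_i sum to one, the mixture is automatically normalised.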

Experimental results
Being able to obtain exact i.i.d. samples from the posterior is generally more desirable and typically leads to more efficient estimation than using approximate MCMC samples. In Section 5.1 we offer empirical evidence justifying this statement in the present setting through a simple simulation example. Then in Section 5.2 we present the results of a careful empirical study of the natural entropy estimator induced by the BCT framework, compared against a number of the most common alternative estimators, on three simulated and three real-world data sets.

Comparison with MCMC
Consider n = 1000 observations generated from a 5th order, ternary chain, with model given by the context tree of Figure 1 in Section 2 (the values of the parameters θ = {θ_s ; s ∈ T} are given in Section C of the supplementary material). A simple and effective convergence diagnostic here (which can also be viewed as an example of an estimation problem) is the examination of the frequency with which the MAP model, T*_1, appears in the i.i.d. or the MCMC sample trajectory. The model T*_1 can be identified by the BCT algorithm and its posterior probability π(T*_1|x) can be computed, as in Kontoyiannis et al. (2022). In each case, the five graphs correspond to five independent repetitions of the experiment with N = 1000 simulated samples. The horizontal line is the limiting frequency, π(T*_1|x). As shown in Figure 2, the estimates based on the i.i.d. sampler of Section 3.3 converge quite quickly, with the corresponding estimates based on the random-walk MCMC sampler of Kontoyiannis et al. (2022) converging significantly more slowly. In 50 independent repetitions of the same experiment (with N = 1000 simulated samples in each run), the estimated variance of the MCMC estimates (0.0084) was found to be larger than that of the i.i.d. estimates (1.4 × 10^−4) by a factor of around 60. Figure 3 shows the trace plots (Roy, 2020) obtained from N = 10000 simulated samples from the MCMC and i.i.d. samplers, which can be used to monitor the log-posterior in each case. It is immediately evident that the i.i.d. sampler is more efficient in exploring the effective support of the posterior. As expected, the i.i.d. sampler has superior performance compared to the MCMC sampler, both in terms of estimation and in terms of mixing. Also, although the two types of samplers have comparable complexity in terms of computation time and memory requirements, the structure of the i.i.d. sampler is much simpler, making it much easier to implement.
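The i.i.d. sampler rests on the branching-process representation: a tree is grown from the root, with each node either stopping (becoming a leaf) or spawning all m of its children. The following is a minimal sketch with the node-wise stopping probabilities abstracted into a user-supplied function; in the actual sampler these are the P_{b,s} computed from the CTW quantities, which we do not reproduce here, and all names are ours:

```python
import random

def sample_tree(p_leaf, m, D, s=(), rng=random):
    """Draw one context tree from the branching-process representation.

    Starting at the root, each node s at depth < D is made a leaf with
    probability p_leaf(s) (standing in for the P_{b,s} of the text);
    otherwise all m children are generated and the construction recurses.
    Nodes at depth D are always leaves.  Contexts are tuples of symbols,
    most recent symbol last; the set of leaf contexts is returned.
    """
    if len(s) == D or rng.random() < p_leaf(s):
        return {s}
    leaves = set()
    for j in range(m):
        leaves |= sample_tree(p_leaf, m, D, (j,) + s, rng)
    return leaves

random.seed(1)
# Hypothetical stopping probabilities: branch readily at the root,
# prune aggressively below it.
tree = sample_tree(lambda s: 0.1 if len(s) == 0 else 0.9, m=2, D=3)
```

By construction every internal node has all m children, so each sampled tree is a proper (complete) context tree and satisfies the Kraft equality over its leaves.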
In view of these observations, in the following section we only employ the i.i.d. sampler for the purposes of entropy estimation.

Entropy estimation
Estimating the entropy rate from empirical data -in this case, a discrete time series -is an important and timely problem that has received a lot of attention in the recent literature, in connection with questions in many areas including neuroscience (Timme and Lapish, 2018), natural language modelling (Willems et al., 2016), animal communication (Kershenbaum, 2014), and cryptography (Simion, 2020), among others; see, e.g., the recent literature reviews by Verdú (2019) and Feutrill and Roughan (2021). The well-known difficulties of entropy estimation stemming from the nonlinear nature of the entropy rate functional and its dependence on the entire process distribution are discussed in the references listed above.
For a general process {X_n} on a finite alphabet, the entropy rate H̄ is defined as the limit H̄ = lim_{n→∞} (1/n) H(X^n_1), whenever the limit exists, where H(X^n_1) denotes the usual Shannon entropy (in nats rather than bits, as we take logarithms to base e) of the discrete random vector X^n_1. For an ergodic, first-order Markov chain {X_n}, H̄ can be expressed as,

H̄ = − Σ_{i∈S} π(i) Σ_{j∈S} P_{ij} log P_{ij},

where S is the state space of {X_n}, and (P_{ij}) and (π(i)) denote its transition matrix and its stationary distribution, respectively.
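A minimal sketch of this computation for a first-order chain; the stationary distribution is obtained as the left eigenvector of P for eigenvalue 1, and the function name is ours:

```python
import numpy as np

def entropy_rate_markov(P):
    """Entropy rate (in nats) of an ergodic first-order Markov chain with
    transition matrix P:  H = -sum_i pi(i) sum_j P_ij log P_ij,
    where pi is the stationary distribution (left eigenvector of P
    associated with eigenvalue 1, normalised to sum to one)."""
    P = np.asarray(P, dtype=float)
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        logs = np.where(P > 0, np.log(P), 0.0)   # 0 log 0 := 0
    return float(-(pi[:, None] * P * logs).sum())

# Symmetric binary chain with flip probability 0.1:
# H = -(0.9 log 0.9 + 0.1 log 0.1) ~ 0.325 nats.
H = entropy_rate_markov([[0.9, 0.1], [0.1, 0.9]])
```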
An analogous formula can be written for the entropy rate of any ergodic variable-memory chain with model T ∈ T(D), by viewing it as a full Dth order chain and considering blocks of length (D + 1), as usual; cf. Cover and Thomas (2012). This means that H̄ can be expressed as an explicit function H̄ = H(T, θ) of the model and parameters. Therefore, given a time series x, using the MC sampler of Section 3.3 to produce i.i.d. samples (T^(i), θ^(i)) from π(T, θ|x), we can obtain i.i.d. samples H^(i) = H(T^(i), θ^(i)) from the posterior π(H|x) of the entropy rate. The calculation of each H^(i) = H(T^(i), θ^(i)) is straightforward and only requires the computation of the stationary distribution π of the induced first-order chain that corresponds to taking blocks of size [depth(T^(i)) + 1]. The only potential difficulty arises if either the depth of T^(i) or the alphabet size m is so large that the computation of π becomes computationally expensive. In such cases, H^(i) can be computed approximately by including an additional Monte Carlo step: Generate a sufficiently long random sample Y^M_{−D+1} from the chain (T^(i), θ^(i)), and calculate the per-sample log-loss,

Ĥ^(i) = −(1/M) log P_{(T^(i), θ^(i))}(Y^M_1 | Y^0_{−D+1}).   (5.2)

The ergodic theorem and the central limit theorem for Markov chains (Chung, 1967; Meyn and Tweedie, 2012) then guarantee the accuracy of (5.2).
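This additional Monte Carlo step can be sketched as follows, with a first-order chain standing in for the block chain induced by (T^(i), θ^(i)); the function is a hypothetical illustration, not the paper's code:

```python
import numpy as np

def mc_entropy_estimate(P, M, rng):
    """Monte Carlo approximation of the entropy rate: simulate a path
    y_1, ..., y_M from the chain with transition matrix P and average
    the negative log conditional probabilities,
        -(1/M) sum_t log P(y_t | y_{t-1}).
    By the ergodic theorem this converges to the entropy rate as M grows;
    a first-order chain stands in for the induced block chain."""
    P = np.asarray(P, dtype=float)
    state = 0
    total = 0.0
    for _ in range(M):
        nxt = rng.choice(len(P), p=P[state])
        total -= np.log(P[state, nxt])
        state = nxt
    return total / M

rng = np.random.default_rng(0)
# True entropy rate of this chain is ~0.325 nats.
H_hat = mc_entropy_estimate([[0.9, 0.1], [0.1, 0.9]], M=20000, rng=rng)
```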
In the remainder of this section, the BCT estimator (with maximum model depth D = 10) is compared with the state-of-the-art approaches, as identified by Gao et al. (2008) and Verdú (2019) and summarised below. The BCT estimator is found to generally give the most reliable estimates on a variety of different types of simulated and real-world data. Moreover, compared to most existing approaches, which give simple point estimates (sometimes accompanied by confidence intervals), the BCT estimator has the additional advantage that it provides the entire posterior distribution π(H|x).

Plug-in estimator. Motivated by the definition of the entropy rate, the simplest and one of the most commonly used estimators of the entropy rate is the per-sample entropy of the empirical distribution of k-blocks. Letting p̂_k(y^k_1), y^k_1 ∈ A^k, denote the empirical distribution of k-blocks induced by the data on A^k, the plug-in or maximum-likelihood estimator is simply Ĥ_k = (1/k) H(p̂_k). The main advantage of this estimator is its simplicity. Well-known drawbacks include its high variance due to undersampling, and the difficulty of choosing appropriate block lengths k effectively.

Lempel-Ziv estimator. Among the numerous match-length-based entropy estimators that have been derived from the Lempel-Ziv family of data compression algorithms, we consider the increasing-window estimator of Gao et al. (2008), identified there as the most effective one. For every position i in the observed data, let ℓ_i denote the length of the longest segment x^{i+ℓ_i−1}_i starting at i which also appears somewhere in the window x^{i−1}_0 preceding i. Writing L_i = 1 + ℓ_i for each i, the relevant estimator is,

CTW estimator.
This uses the prior predictive likelihood P(x) computed by the CTW algorithm to define Ĥ_CTW = −(1/n) log P(x^n_1). This estimator was found by Gao et al. (2008) and Verdú (2019) to achieve the best performance in practice. Its consistency and asymptotic normality follow easily from standard results, and its (always positive) bias is of O((log n)/n), which can be shown to be as small as possible in a minimax sense. In all experiments we take the maximum depth of the CTW to be D = 10.

PPM estimator. Using a different adaptive probability assignment Q(x), this method forms an estimate of the same type as the CTW estimator, Ĥ_PPM = −(1/n) log Q(x^n_1), where prediction by partial matching (PPM) (Cleary and Witten, 1984) is used to fit the model that leads to Q(x^n_1). We use the interpolated smoothing variant of PPM introduced by Bunton (1996), which is implemented in the R package available at: https://rdrr.io/github/pmcharrison/ppm/.
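Of the benchmarks above, the plug-in estimator is simple enough to sketch in full; this toy version (names ours) also illustrates why the estimate behaves well only for small block lengths k relative to the data size:

```python
import math
import random
from collections import Counter

def plugin_entropy_rate(x, k):
    """Plug-in estimate of the entropy rate: the per-symbol Shannon
    entropy (in nats) of the empirical distribution of k-blocks,
        H_k = (1/k) H(p_hat_k).
    For large k relative to len(x) the blocks are undersampled and the
    estimate becomes strongly biased, as noted in the text."""
    blocks = Counter(tuple(x[i:i + k]) for i in range(len(x) - k + 1))
    n = sum(blocks.values())
    return -sum((c / n) * math.log(c / n) for c in blocks.values()) / k

# Fair-coin data: every k-block is roughly equally likely, so for small
# k the estimate should be close to log 2 nats per symbol.
random.seed(0)
x = [random.randint(0, 1) for _ in range(100000)]
est = plugin_entropy_rate(x, 3)
```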

A ternary chain.
We consider the same n = 1000 observations generated from the 5th order, ternary chain examined in Section 5.1. The entropy rate of this chain is H̄ = 1.02. In Figure 4 we show MC estimates of the prior distribution π(H), and of the posterior π(H|x) based on n = 100 and on n = 1000 observations from the chain. After n = 1000 observations, the posterior is close to a Gaussian with mean µ = 1.005 and standard deviation σ = 0.017. Figure 5 shows the performance of the BCT estimator compared with the other four estimators described above, as a function of the length n of the available observations x. For BCT we plot the posterior mean; for the plug-in we plot estimates with block lengths k = 5, 6, 7. It is easily observed that the BCT estimator outperforms all the alternatives, and converges faster and closer to the true value of H̄.

A third order binary chain.
Here, we consider n = 1000 observations generated from an example of a third order binary chain from Berchtold and Raftery (2002). The underlying model is the complete binary tree of depth 3 pruned at node s = 11; the tree model T and the parameter values θ = {θ_s ; s ∈ T} are given in Section C of the supplementary material. The entropy rate of this chain is H̄ = 0.4815. Figure 6 shows the performance of all five estimators, where the BCT estimator (which again uses the posterior mean) is found to have the best performance. The histogram of the BCT posterior after n = 1000 observations, shown in Section C of the supplementary material, is close to a Gaussian with mean µ = 0.4806 and standard deviation σ = 0.0405.

A bimodal posterior.

We re-examine a simulated time series x from Kontoyiannis et al. (2022), which consists of n = 1450 observations generated from a 3rd order chain {X_n} with alphabet size m = 6 and with the property that each X_n depends on past observations only via X_{n−3}. The complete specification of the chain is given in Section C of the supplementary material. Its entropy rate is H̄ = 1.355. An interesting aspect of this data set is that the model posterior is bimodal, with one mode corresponding to the empty tree (describing i.i.d. observations) and the other consisting of tree models of depth 3.
As shown in Figure 7a, the posterior of the entropy rate is also bimodal here, with two separated, approximately Gaussian modes corresponding to the two modes of the model posterior. The dominant mode is the one corresponding to models of depth 3; it has mean µ_1 = 1.406, standard deviation σ_1 = 0.031, and relative weight w_1 = 0.91. The second mode, corresponding to the empty tree, has mean µ_2 = 1.632, standard deviation σ_2 = 0.020, and a much smaller weight w_2 = 1 − w_1 = 0.09. In this case, the mode of π(H|x) gives a more reasonable choice for a point estimate than the posterior mean. As in the previous two examples, the BCT entropy estimator performs better than most benchmarks, as illustrated in Section C of the supplementary material.

Neural spike trains.

We consider n = 1000 binary observations from a spike train recorded from a single neuron in region V4 of a monkey's brain. The BCT posterior is shown in Figure 7b: Its mean is µ = 0.0234, its standard deviation is σ = 0.0108, and it is skewed to the right. This dataset is the first part of a long spike train of length n = 3,919,361 from Gregoriou et al. (2009, 2012). Although there is no "true" value of the entropy rate here, for the purposes of comparison we use the estimate obtained by the CTW estimator (identified as the most effective method by Gao et al. (2008) and Verdú (2019)) when all n = 3,919,361 samples are used, giving H̄ = 0.0241. The resulting estimates for all five methods (with the posterior mean given for BCT) are summarised in Table 1, verifying again that BCT outperforms all the other methods.

Financial data.

Here, we consider n = 2000 observations from the financial dataset F.2 of Kontoyiannis et al. (2022). This consists of tick-by-tick price changes of the Facebook stock price, quantised to three values: x_i = 0 if the price goes down, x_i = 1 if it stays the same, and x_i = 2 if it goes up.
The BCT entropy-rate posterior is shown in Figure 8a: It has mean µ = 0.921 and standard deviation σ = 0.028. Once again, as the "true" value of the entropy rate we take the estimate produced by the CTW estimator on a longer sequence with n = 10^4 observations, giving H̄ = 0.916. The results of all five estimators are summarised in Table 2, where for the BCT estimator we once again give the posterior mean.

Pewee birdsong.
The last data set examined is a time series x describing the twilight song of the wood pewee bird (Craig, 1943; Sarkar and Dunson, 2016). It consists of n = 1327 observations from an alphabet of size m = 3. The BCT posterior is shown in Figure 8b: It is approximately Gaussian with mean µ = 0.258 and standard deviation σ = 0.024. The small standard deviation is significant, as it suggests "confidence" in the resulting estimates; this is especially valuable here since (as in most real applications) there is no knowledge of a "true" underlying value. Table 3 shows all the resulting estimates; the posterior mean is shown for the BCT estimator.

Summary.

The main conclusion from the results on the six data sets examined in this section is that the BCT estimator gives the most accurate and reliable results among the five estimators considered. In addition to the fact that the BCT point estimates typically outperform those produced by other methods, the BCT estimator is accompanied by the entire posterior distribution π(H|x) of the entropy rate, induced by the observations x. As usual, this distribution can be used to quantify the uncertainty in estimating H̄, and it contains significantly more information than simple point estimates and their associated confidence intervals.

Concluding remarks
In this work, we revisited the Bayesian Context Trees (BCT) modelling framework, which was recently found to be very effective for a range of statistical tasks in the analysis of discrete time series. We showed that the prior and posterior distributions on model space admit simple and easily interpretable representations in terms of branching processes, and we demonstrated their utility both in theory and in practice.
The branching process representation was first employed to develop an efficient Monte Carlo sampler that provides i.i.d. samples from the joint posterior on models and parameters, thus facilitating effective Bayesian inference with empirical time series data. Then, it was used to establish strong theoretical results on the asymptotic consistency of the BCT posterior on model space, which provide important theoretical justifications for the use of the BCT framework in practice. Finally, the performance of the proposed Monte Carlo sampler was examined extensively in the context of entropy estimation. The resulting fully-Bayesian entropy estimator was found to outperform several of the state-of-the-art approaches, on simulated and real-world data.
Although the BCT framework was originally developed for modelling and inference of discrete-valued time series, it was recently used to develop general mixture models for real-valued time series, along with a collection of associated algorithmic tools for inference. Extending the results presented in this work to that setting presents an interesting direction of further research, motivated by important practical applications.

Supplementary material

A Proof of Proposition 3.2
We need the following representation of the marginal likelihood, from Kontoyiannis et al. (2022).
Lemma A.1. The marginal likelihood P(x|T) of the observations x given a model T is,

P(x|T) = Π_{s∈T} P_{e,s},

where the product is over the leaves s of T and the P_{e,s} are the estimated probabilities in (2.4).
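The estimated probabilities P_{e,s} in (2.4) are the Krichevsky–Trofimov (KT) probabilities induced by the Dirichlet(1/2, …, 1/2) prior in the BCT framework. Assuming that standard form, they depend on the data only through the count vectors, and can be sketched as follows (function name ours):

```python
import math

def kt_log_prob(counts, m=None):
    """Log of the Krichevsky-Trofimov estimated probability P_e for a
    count vector a = (a_0, ..., a_{m-1}), i.e. the form the P_{e,s} of
    the text take under the Dirichlet(1/2, ..., 1/2) prior.  Built up
    sequentially: each new occurrence of symbol j contributes a factor
    (a_j + 1/2) / (M + m/2), where a_j and M are the counts seen so far.
    The result depends only on the counts, not on the symbol order."""
    m = len(counts) if m is None else m
    logp = 0.0
    seen = [0] * m
    total = 0
    for j, c in enumerate(counts):
        for _ in range(c):
            logp += math.log((seen[j] + 0.5) / (total + m / 2.0))
            seen[j] += 1
            total += 1
    return logp

# Sanity check: for a single observed symbol the probability is 1/m.
```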
Proof of Proposition 3.2. The proof parallels that of Proposition 3.1. When D = 0, T(D) consists of a single tree, T = {λ}, which has probability 1 under both the BCT posterior π(·|x) and the distribution π_b(·) induced by the branching process construction. Suppose D ≥ 1.
As before, we view every tree T ∈ T(D) as a collection of k m-branches, and we proceed by induction on k. For k = 0, i.e., for T = {λ}, by the definitions, and using Lemma A.1 and the fact that P_{w,λ} is exactly the normalising constant P(x), Now assume the result of the proposition holds for all trees with k m-branches, and suppose T ∈ T(D) contains (k + 1) m-branches and is obtained from some T′ ∈ T(D) by adding a single m-branch to one of its leaves, s. Again, consider two cases.
(i) If s is at depth D − 2 or smaller, then by construction, and therefore, using the inductive hypothesis, Using the definitions of π D and P b,s , as well as Lemma A.1, we can express the posterior odds and from the definitions of P w,s and P b,s we obtain, Substituting (A.2) into (A.1) yields, π b (T ) = π(T |x), as claimed.
(ii) Similarly, if s is at depth D − 1, from the inductive hypothesis, where the posterior odds can be expressed as, where in the second equality we used that P_{w,sj} = P_{e,sj}, as all nodes sj are at depth d = D in this case. Substituting (A.3) above yields π_b(T) = π(T|x), which completes the proof.
B Proofs of results from Section 4

B.1 Proof of Lemma 4.2

Here we establish the two missing steps in the proof of the lemma given in Section 4.1 of the main text.
Proof of (4.5). Using the upper bound of Lemma 4.1 for a fixed context s, and the corresponding lower bound for the context sj, we obtain the upper bound, for some constant C. Since M_s and M_{sj} both tend to infinity a.s. as n → ∞ by positive ergodicity, the last two terms above both vanish a.s. For the first two terms, we first note that, by the ergodic theorem for Markov chains (e.g., Chung, 1967, p. 92), where, for the stationary distribution π, the notation si denotes the concatenation of the context s followed by the symbol i, moving 'forward' in time. Recalling the definitions of X and J, we have,

P(X = i|s) = π(si)/π(s) = π(i|s),    P(J = j|s) = π(js)/π(s),   (B.3)

so that for the first term of (B.1), as n → ∞, Similarly, for the second term of (B.1), from the ergodic theorem,

Proof of final step in Lemma 4.2.
As already noted, (4.5) implies P_{b,s} → 0 a.s. for nodes whose children are leaves of T*. The same holds for all internal nodes s of T* for which I(X; J|s) > 0. The only remaining case is that of internal nodes u for which I(X; J|u) = 0. The fact that again P_{b,u} → 0 a.s. is an immediate consequence of the result given as Lemma B.1 in Section B.3.

B.2 Proof of Lemma 4.3
Here we provide proofs for the two missing steps in the proof of the lemma given in Section 4.1 of the main text.

Proof of (4.6). We can rewrite (B.9) as, We write p̂_s and π_s for the empirical and stationary conditional distributions of the symbol following s, Let D(p‖q) denote the relative entropy (or Kullback-Leibler divergence) between two probability mass functions p, q on the same discrete alphabet (Cover and Thomas, 2012, Ch. 2). By the nonnegativity of the relative entropy we have, Adding and subtracting the term Σ_{i=0}^{m−1} a_s(i) log π_s(i) in (B.11) and using the last inequality, We examine the first and second terms in (B.12) separately. For the first (and main) term, since for all count vectors a_s(i) = Σ_{j=0}^{m−1} a_{sj}(i), we can express, where the last equality holds because s is either a leaf or an external node of T*, so that I(X; J|s) = 0 and π_{sj} = π_{sj′} = π_s, for all j, j′.
In order to bound the relative entropy between the empirical and the stationary conditional distributions, we first recall that the relative entropy is bounded above by the χ²-distance, e.g., (Gibbs and Su, 2002). From the law of the iterated logarithm (LIL) for Markov chains (Chung, 1967, p. 106), we have, a.s. as n → ∞,

a_{sj}(i) = nπ(jsi) + O(√(n log log n)),    M_{sj} = nπ(js) + O(√(n log log n)),   (B.14)

so that, Substituting in (B.13) yields,

Proof of final step in Lemma 4.3. As discussed in the proof of Lemma 4.3 in the main text, the asymptotic relation (4.6) implies that Σ_j P_{e,sj}/P_{e,s} → 0, a.s., for all leaves and external nodes s of T*. The next proposition states that it also implies that Σ_j P_{w,sj}/P_{e,s} → 0, so that by (4.3) P_{b,s} → 1, a.s., completing the proof of Lemma 4.3. Note that it suffices to consider leaves at depths d ≤ D − 1, since for leaves s at depth D we already have P_{b,s} = 1.

Proof. Note that, since the stationary distribution is positive on all finite contexts, the tree T_MAX is eventually a.s. the complete tree of depth D, so we need not separately consider contexts s that do not appear in the data. Let s be a leaf or external node of T* at depth 0 ≤ d ≤ D − 1. The proof is by induction on d.
For d = D − 1, the claim is satisfied trivially, as P_{w,sj} = P_{e,sj} since the nodes sj are at depth D. For the inductive step, we assume that the claim holds for all leaves and external nodes s of T* at some depth d ≤ D − 1, and consider a leaf or external node s of T* at depth d − 1. We have, as n → ∞, since Σ_{t=0}^{m−1} P_{w,sjt}/P_{e,sj} → 0 by the inductive hypothesis, and Σ_{j=0}^{m−1} P_{e,sj}/P_{e,s} → 0 for a node s which is either a leaf or an external node of T*. This establishes the inductive step and completes the proof.

B.3 Proof of Theorem 4.2
The starting point of the proof is the representation of the posterior given in equation (3.2) of the main text. In particular, we examine the asymptotic behaviour of the branching probabilities P_{b,s} separately for leaves and for internal nodes.

Leaves. Let s be a leaf or an external node of T*. We already have a strong upper bound for the estimated probabilities of s in (4.6). Write r = (m − 1)²/2. Since the bound (4.6) holds a.s., a straightforward sample-path-wise computation immediately implies that, for all ε > 0, , a.s.
Proof. The proof is similar to that of Proposition B.1, by induction on d.
If d = D − 1, then P_{w,sj} = P_{e,sj} and the claim follows from (B.19). For the inductive step, assume the claim holds for all leaves and external nodes at some depth d ≤ D − 1, and consider a leaf or external node s at depth d − 1. Then substituting (B.19) into (B.18), and noting that Σ_t P_{w,sjt}/P_{e,sj} → 0 a.s. by Proposition B.1, completes the proof.
Combining Proposition B.2 with equation (4.3), we get that, for any leaf or external node s at depth d, a.s. as n → ∞: Next we establish a corresponding bound for all internal nodes; indeed, for any node u that is a suffix of a node s with I(X; J|s) > 0.

Proof. Let ∆ℓ = ℓ(s) − ℓ(u) ≥ 0; the proof is by induction on ∆ℓ. For ∆ℓ = 0, the claim is satisfied trivially, as u = s, for which I(X; J|s) > 0, corresponding to the previous case. For the inductive step, we assume that the claim holds for the context uj which is the suffix of s with ∆ℓ = k ≥ 0, and prove that it also holds for the context u, which is the suffix of s with ∆ℓ = k + 1. For the node u, which is at depth d < D − 1, from (4.4) and the definition of the branching probabilities, Finally, the same argument in the case when T* is the full tree of depth D shows that P_{b,s} = 1 for all leaves s, so the rate is determined by the exponential bounds in (B.24), as claimed in Corollary 4.1.

C Entropy estimation
This section contains additional details associated with the entropy estimation experiments of Section 5.2 of the main text.

The tree model for the third order binary chain, along with its associated parameters, is shown in Figure 9a. The posterior π(H|x) of the entropy rate based on n = 1000 observations x is shown in Figure 9b: It is approximately Gaussian with mean µ = 0.4806 and standard deviation σ = 0.0405. The histogram was constructed using N = 10^5 i.i.d. samples from the entropy rate posterior.
A bimodal posterior.
The distribution of the third-order chain {X_n} in this example is given by,

Pr(X_n = j | X_{n−1} = a, X_{n−2} = b, X_{n−3} = i) = Q_{ij},    i, j, a, b ∈ A,

where the alphabet is A = {0, 1, 2, 3, 4, 5} and the transition matrix Q = (Q_{ij}) is, Viewed as a variable-memory chain, the model of {X_n} is the complete tree of depth 3, but the dependence of each X_n on its past is only via X_{n−3}. So meaningful dependence is detected only at memory lengths of at least three: X_n is independent of the two most recent symbols. This is why the MAP model T*_1 identified by the BCT algorithm of Kontoyiannis et al. (2022) based on the n = 1450 observations x is the empty tree T*_1 = {λ}, corresponding to i.i.d. data. Its posterior probability is π(T*_1|x) = 0.09, which corresponds exactly to the weight of the secondary mode in the entropy rate posterior, as shown in Figure 7a of the main text. All other trees identified by the k-BCT algorithm are complex trees of depth 3, with posterior probabilities close to that of T*_1; e.g., the second a posteriori most likely model T*_2 has posterior probability π(T*_2|x) = 0.08. These trees of depth 3 form the other mode of the bimodal posterior on model space, which corresponds to the primary mode of the entropy rate posterior in Figure 7a. Table 4 shows the entropy rate estimates obtained by all five methods in this example; the MAP value of the posterior π(H|x) is given for BCT. Note that, although the plug-in estimator with block length k = 5 gives a slightly better estimate than BCT here, given the well-known high variability of the plug-in, this is likely a coincidence rather than an indication of the accuracy of the plug-in.