Modeling of time series using random forests: theoretical developments

In this paper we study asymptotic properties of random forests within the framework of nonlinear time series modeling. While random forests have been successfully applied in various fields, their theoretical justification in a time series setting has so far received little attention. Under mild conditions, we prove a uniform concentration inequality for regression trees built on nonlinear autoregressive processes and, subsequently, we use this result to prove consistency for a large class of random forests. The results are supported by various simulations.


Introduction
Random forests, originally introduced by Breiman [8], constitute an ensemble learning algorithm for classification and regression which produces predictions by first growing a large number of randomized decision trees [9] and then aggregating the results. Since its introduction, the algorithm has been applied in various fields such as object recognition [25], bioinformatics [12], ecology [10,22] and finance [15,18], and the evidence is strong: with very little tuning, random forests deliver a flexible prediction tool which is fully comparable with other state-of-the-art algorithms. In fact, Howard and Bowles [17] claim that random forests have been the most successful general-purpose algorithm in recent times. While many successful applications indicate the wide applicability of random forests, little theoretical work exists to support this impression. Among the components that make the random forests of Breiman [8] difficult to analyze are the operation of bagging randomized predictors [7] as well as the highly data-dependent partitions associated with the so-called CART regression trees [9], which form the forest. Other types of random forests have been proposed; see, for example, [2,14].
While the bagging step is often discarded in theoretical work, or replaced by another resampling method such as subsampling, asymptotic results for random forests in the (classical) nonparametric regression setting, where (X_1, Y_1), . . . , (X_T, Y_T) are i.i.d. observations from the model

Y = f(X) + ε, (1.1)

have been established under rather weak assumptions on the structure of the underlying regression trees. In (1.1), f is a suitable smooth function and ε is a mean-zero square integrable noise term which is independent of X. To mention a few significant results in this setup, Scornet et al. [24] prove L^2 consistency of Breiman's random forests when f is additive (i.e., f(x) = Σ_{i=1}^p f_i(x_i)) and ε is Gaussian, Wager and Walther [27] establish pointwise consistency of similar forests with larger leaves in a high-dimensional setting, and Wager and Athey [26] prove (pointwise) asymptotic normality of a particular random forest algorithm. Under more restrictive assumptions, valuable insights about performance (i.e., convergence rates) in sparse settings and lower bounds on the mean squared error were provided by [4,5,19]. For a nice overview of existing theoretical work on random forests within the regression setting as well as further references, see the survey in Biau and Scornet [6].
In some applications, particularly financial, the underlying data correspond to observations from a time series, and the aim is to predict future values by feeding a number of the most recent observations into the algorithm. While the problem is often treated precisely as in the regression setting from a practical point of view, by forming pairs (X_1, Y_1), . . . , (X_T, Y_T) where X_t = (Y_{t−1}, . . . , Y_{t−p}) for some integer p ≥ 1, things change dramatically on the theoretical side. Indeed, observations can no longer be assumed to be i.i.d. draws from (1.1) and, instead, the entire process (Y_t)_{t≥1} is necessarily defined recursively by the equation

Y_t = f(Y_{t−1}, . . . , Y_{t−p}) + ε_t, t ≥ 1, (1.2)

given initial data ξ = (Y_0, Y_{−1}, . . . , Y_{1−p}). Processes satisfying (1.2) are often referred to as nonlinear autoregressive processes of order p (or, in short, NLAR(p) processes). For further detail on these processes, see [3,16]. In such a framework, the dependence structure, across pairs (X_1, Y_1), . . . , (X_T, Y_T) as well as between entries in X_t, is determined within the model. Consequently, in contrast to the regression setup, it is often only appropriate to impose assumptions on f and (ε_t)_{t≥1}. In fact, even if one accepts an implicit model assumption, e.g., the typical assumption that X_t admits a copula density which is bounded away from zero and infinity, it turns out to be rather restrictive. Indeed, if (Y_t)_{t≥1} is Gaussian and p ≥ 2, such an assumption is satisfied only if f = 0 almost everywhere. It follows that other types of assumptions and techniques are needed to guarantee the validity of random forests in the time series setting. In this paper we rely on the principal ideas of [27] to obtain a uniform concentration inequality which applies simultaneously across all regression trees satisfying a mild condition on their minimum leaf size k, when data are generated by the NLAR(p) model (1.2).
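To fix ideas, the recursion (1.2) and the pairing of lagged inputs with outputs can be sketched in a few lines; the specific f below (a clipped-linear map, bounded and Lipschitz in line with the assumptions imposed later) and the Gaussian noise are illustrative choices, not the paper's:

```python
import numpy as np

def simulate_nlar(f, T, p, burn_in=200, seed=0):
    """Simulate Y_t = f(Y_{t-1}, ..., Y_{t-p}) + eps_t with i.i.d. standard
    normal noise, dropping a burn-in so the path is close to stationarity."""
    rng = np.random.default_rng(seed)
    y = np.zeros(burn_in + T + p)
    for t in range(p, len(y)):
        # f receives the lags ordered as (Y_{t-1}, ..., Y_{t-p})
        y[t] = f(y[t - p:t][::-1]) + rng.standard_normal()
    return y[burn_in + p:]

def lagged_pairs(y, p):
    """Form the input-output pairs (X_t, Y_t) with X_t = (Y_{t-1}, ..., Y_{t-p})."""
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    return X, y[p:]

# A bounded, Lipschitz continuous f; clipped-linear, for illustration only.
f = lambda x: 0.5 * np.clip(x[0], -10.0, 10.0)
y = simulate_nlar(f, T=500, p=1)
X, Y = lagged_pairs(y, p=1)   # X.shape == (499, 1), Y.shape == (499,)
```

The pairs (X, Y) can then be passed to any off-the-shelf regression forest, which is exactly the practical treatment described above.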
While it is required that k increases in the sample size, the growth rate may be very slow and trees are allowed to be grown adaptively (the partitions of the trees can be highly data-dependent). As an application of the established concentration inequality, we prove that all random forests respecting a number of conditions are pointwise consistent estimators for f when the data generating process is (1.2). The assumptions we impose in the model (1.2) are explicit in terms of f and the distribution of (ε t ) t≥1 , and they are not difficult to check. For instance, all our results are applicable if f is bounded and Lipschitz continuous, and (ε t ) t≥1 is an i.i.d. sequence with ε 1 having a suitably light-tailed distribution. (As pointed out in Section 2, the assumption of f being bounded is no stronger than what is usually imposed in the regression setting.) Our techniques rely on, among other things, the theory of Markov processes as well as various Bernstein type concentration inequalities. To the best of our knowledge, theoretical properties of random forests within the framework of time series have not been fully addressed.
The paper is laid out as follows. Section 2 introduces the model as well as the regression trees of interest and establishes uniform concentration of these around their so-called partition-optimal counterparts (Theorem 2.1). In Section 3 we translate this result into a concentration inequality for random forests (Corollary 3.1) and provide sufficient conditions ensuring that they are pointwise consistent estimators of f (Theorem 3.2). Subsequently, we carry out a simulation study in Section 4 which considers the performance of random forests within the NLAR(p) model for various specifications of f . Finally, Section 5 contains proofs of all statements as well as a number of auxiliary results.

Concentration of regression trees around partition-optimal counterparts
Let (ε_t)_{t≥1} be a sequence of i.i.d. random variables with E[ε_1] = 0 and E[ε_1^2] < ∞, and fix an integer p ≥ 1. Given a vector ξ = (Y_0, Y_{−1}, . . . , Y_{1−p}) of initial data independent of (ε_t)_{t≥1} and a measurable function f : R^p → R, define the process

Y_t = f(X_t) + ε_t, where X_t := (Y_{t−1}, . . . , Y_{t−p}), t ≥ 1. (2.1)

In addition to the initial data ξ, suppose that we have T observations Y_1, . . . , Y_T from the model (2.1) available and that we group them in input-output pairs,

D_T := {(X_1, Y_1), . . . , (X_T, Y_T)}.

The aim of this section is to establish uniform concentration inequalities for regression trees built on D_T. We start by recalling the associated concept of recursive partitions [9], which is used to construct regression trees. Define a sequence of partitions P_1, P_2, . . . by starting from P_1 = {R^p} and then, for each n ≥ 1, construct P_{n+1} from P_n by replacing one set (node) A ∈ P_n by

A_L := {x ∈ A : x_i ≤ τ} and A_R := {x ∈ A : x_i > τ},

where the split direction i ∈ {1, . . . , p} and split position τ ∈ {x_i : x ∈ A} are chosen in accordance with some set of rules. Here x_i refers to the i-th entry of x ∈ R^p. In this context, we will say that A is the parent node of A_L and A_R, while A_L and A_R are the child nodes of A. A given partition Λ of R^p is called recursive if Λ = P_n for some n ≥ 1, where P_1, . . . , P_n are obtained as above. Note that the rules determining how to choose node, direction and position of a split may depend on the data D_T as well as some injected randomness Θ. For instance, in Breiman's random forests a node is split as soon as it contains at least a certain number of observations, while the position and direction are determined by maximizing impurity decrease (or, equivalently, minimizing the total mean-corrected sum of squares of the outputs Y over the resulting two child nodes; see also [9]), but only over a randomly chosen subset of directions in {1, . . . , p}.
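The impurity-decrease criterion described above can be made concrete in a few lines; what follows is a naive exhaustive search over candidate split positions (not the CART implementation itself), and the data and the supplied subset of directions are illustrative:

```python
import numpy as np

def best_split(X, Y, directions):
    """Exhaustively choose the split (i, tau) minimizing the total
    mean-corrected sum of squares of Y over the two child nodes,
    searching only over the supplied subset of directions."""
    best_i, best_tau, best_sse = None, None, np.inf
    for i in directions:
        for tau in np.unique(X[:, i])[:-1]:   # candidates from {x_i : x in A}
            left = X[:, i] <= tau
            sse = (np.sum((Y[left] - Y[left].mean()) ** 2)
                   + np.sum((Y[~left] - Y[~left].mean()) ** 2))
            if sse < best_sse:
                best_i, best_tau, best_sse = i, tau, sse
    return best_i, best_tau, best_sse

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
Y = np.where(X[:, 0] <= 0.0, -1.0, 1.0) + 0.1 * rng.standard_normal(200)
i, tau, _ = best_split(X, Y, directions=[0, 1])   # recovers the jump along direction 0
```

Restricting `directions` to a random subset of {1, . . . , p} mimics the direction randomization used in Breiman's forests.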
To any recursive partition Λ we associate the corresponding regression tree T_Λ defined by

T_Λ(x) := (1 / #{t ≤ T : X_t ∈ A_Λ(x)}) Σ_{t ≤ T : X_t ∈ A_Λ(x)} Y_t, x ∈ R^p. (2.2)

Here the notation A_Λ(x) is used to refer to the unique set A ∈ Λ with the property that x ∈ A. Our interest will be in regression trees defined by k-valid partitions (k ≥ 1). We will say that a partition Λ is k-valid, and write Λ ∈ V_k, if Λ is recursive and each set in Λ (sometimes called a leaf of the corresponding tree T_Λ) contains at least k data points. Note that, since Λ is recursive, it can depend on both the data D_T and a random mechanism Θ, while V_k depends only on D_T. Setting a minimum number k of observations in each leaf of a tree is the default in most practical implementations of random forests. Besides, such an assumption is natural since it ensures that X_t ∈ A_Λ(x) for some t ∈ {1, . . . , T}, and this implies that the regression tree (2.2) is well-defined for all x ∈ R^p. In this section we will be working under the following set of assumptions:

(A1) The random variable ε_1 admits a density h_ε which is positive almost everywhere on R and, for some c ∈ (0, ∞),

E[|ε_1|^q] ≤ (q!/2) c^{q−2} E[ε_1^2] for all integers q ≥ 3, (2.3)

and, with F_ε denoting the cumulative distribution function of ε_1,

lim inf_{x→−∞} F_ε(x − τ)/F_ε(x) > 0 (2.4)

for any τ ∈ (0, ∞).
(A2) The function f in (2.1) is bounded: for some M ∈ (0, ∞),

|f(x)| ≤ M for all x ∈ R^p. (2.5)

(A3) The minimum leaf size k satisfies k/(log T)^4 → ∞ as T → ∞.
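As a toy illustration of the tree (2.2) and of k-validity, a partition can be encoded as a list of axis-aligned cells (lo, hi] (an encoding of ours, not one used in the paper), with the tree prediction given by the leaf average:

```python
import numpy as np

def cell_of(x, partition):
    """Return the unique cell (lo, hi] of the partition containing x."""
    for lo, hi in partition:
        if np.all(x > lo) and np.all(x <= hi):
            return lo, hi
    raise ValueError("partition does not cover x")

def tree_predict(x, partition, X, Y):
    """Evaluate the regression tree (2.2): average Y_t over the cell containing x."""
    lo, hi = cell_of(x, partition)
    in_cell = np.all(X > lo, axis=1) & np.all(X <= hi, axis=1)
    return Y[in_cell].mean()

def is_k_valid(partition, X, k):
    """Check that every leaf contains at least k of the inputs X_t."""
    return all((np.all(X > lo, axis=1) & np.all(X <= hi, axis=1)).sum() >= k
               for lo, hi in partition)

# One split of R at 0: two leaves (-inf, 0] and (0, inf).
partition = [(np.array([-np.inf]), np.array([0.0])),
             (np.array([0.0]), np.array([np.inf]))]
X = np.array([[-1.0], [-2.0], [1.0], [2.0]])
Y = np.array([1.0, 3.0, 5.0, 7.0])
pred = tree_predict(np.array([-0.5]), partition, X, Y)   # (1 + 3) / 2 == 2.0
```

Here each leaf holds two inputs, so the partition is 2-valid but not 3-valid.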
In contrast to k, the quantities c, M and p will be kept fixed, and hence we will not keep track of the dependence on these in the following results. In particular, the introduced constants can depend on c, M and p, but not on T and k. When ε_1 admits a density which is positive almost everywhere and f satisfies (2.5) (in particular, when (A1) and (A2) are imposed), it follows by [3, Theorem 3.1] that the distribution of ξ can be chosen such that (Y_t)_{t≥1} is strictly stationary, and thus this will be assumed throughout the paper. This means that (X_1, Y_1), . . . , (X_T, Y_T) are identically distributed. Before turning to the results, let us comment on the assumptions stated above. The assumption in (A1) that ε_1 has a positive density is convenient, since it ensures that the p-th order Markov chain (Y_t)_{t≥1} can reach any state in one time step. When combined with (A2), the assumption ensures geometric ergodicity in addition to strict stationarity of the chain.
While it is not required that f is bounded to prove such properties of (Y_t)_{t≥1}, we need boundedness to apply Bernstein type inequalities for weakly dependent processes and to obtain good estimates on the dependence between entries of the input vector X_1. The boundedness assumption (2.5) is implicitly imposed in essentially all theoretical work on random forests, as one usually assumes that the input vector is transformed so that it belongs to the unit cube [0, 1]^p and then requires continuity of f on this domain. The assumption on the moments of ε_1 in (2.3) is known as Bernstein's condition and implies that ε_1 is sub-exponential in the sense that

P(|ε_1| ≥ x) ≤ γ_1 e^{−γ_2 x}, x ≥ 0, (2.6)

for suitably chosen γ_1, γ_2 ∈ (0, ∞). It is a standard assumption to impose when proving concentration inequalities and is often needed when ε_1 cannot be assumed bounded. Among distributions satisfying Bernstein's condition (2.3) are (sub-)Gaussian distributions, but also those with a slightly heavier tail such as the Laplace distribution. The assumption (2.4) is used in conjunction with (2.5) to estimate probabilities involving the input vector X_1 (see Lemma 5.1 for details). Ultimately, it is an assumption on the left tail of ε_1, and a sufficient condition for it to hold is that the limit of F_ε(x − τ)/F_ε(x) as x → −∞ exists and is non-zero for all τ ∈ (0, ∞). It is straightforward to verify that this, as well, is satisfied for both Gaussian and Laplace distributions. Together with (2.5), (2.4) ensures that we do not need to impose conditions on the copula density of the input vector X_1, as is usually done in the regression setting, and this is convenient since such conditions can be both difficult to verify and even rather restrictive in a time series setting. Finally, we impose (A3), which in particular implies that k → ∞ as T → ∞. Although k is allowed to tend to infinity at a slow rate, the assumption contrasts with the trees used in the random forests of Breiman [8], where k is some fixed and often small number.
On the other hand, (A3) is very similar to assumptions imposed in most theoretical work within the regression setting (see, e.g., [4,24,27]). In fact, to the best of our knowledge, the only asymptotic result for random forests built on trees with fixed k is [24, Theorem 2]. The logarithmic factor (log T ) 4 is related to the fact that the established bound applies uniformly across all trees (see Remark 2.2) and that we use a Bernstein type inequality for strongly mixing processes which is slightly weaker than the classical one for the independent case.
While a couple of additional assumptions are needed to establish consistency of random forests in Section 3, (A1)-(A3) are sufficient to prove that regression trees of the form (2.2) concentrate around their so-called partition-optimal counterparts

T*_Λ(x) := E_Λ[Y_1 | X_1 ∈ A_Λ(x)], x ∈ R^p, (2.7)

where E_Λ denotes expectation with respect to the conditional probability measure under which the partition Λ is non-random and, hence, the right-hand side of (2.7) simply means that the map x ↦ E[Y_1 | X_1 ∈ A] is evaluated in A = A_Λ(x). Our setting is very similar to that of Wager and Walther [27], but besides requiring partitions to be k-valid they impose an additional assumption that excludes too "unbalanced" splits (see also the trees constructed in Section 3).
Theorem 2.1. Suppose that (A1)-(A3) hold. Then there exists a constant β ∈ (0, ∞) such that

sup_{x ∈ R^p} sup_{Λ ∈ V_k} |T_Λ(x) − T*_Λ(x)| ≤ β (log T)^2 / √k (2.8)

with probability at least 1 − 4T^{−1} for all sufficiently large T.
Remark 2.2. For any fixed pair (x, Λ), the quantity |T_Λ(x) − T*_Λ(x)| is the deviation of the sample average over at least k observations from its theoretical counterpart within a specific leaf L. Some of the leaves, which can be obtained by varying (x, Λ), contain only k observations and for these, the error is of order 1/√k. This is almost the same upper bound as in (2.8) apart from the logarithmic factor (log T)^2, which reflects the fact that the deviation is controlled simultaneously across all feasible pairs (x, Λ) as well as the sub-exponential tail of ε_1.

Remark 2.3. In Theorem 2.1, and the remaining results of this paper, it is assumed that one is able to select a suitable p ≥ 1 such that (2.1) is correctly specified. If it is not possible to identify such p, one may consider a sequence of models (indexed by T) where p increases as more data become available. Eventually, if (Y_t)_{t≥1} is an NLAR(p*) process for some p* ≥ 1, this will ensure that the model is correctly specified for large samples. Under suitable assumptions, Theorem 2.1 can in fact be adjusted to allow for such a setting by adapting the ideas of [27] and keeping track of how the constants depend on p. However, the resulting upper bound on the uniform deviation of regression trees from their partition-optimal counterparts seems to be rather sensitive to the value of p and, thus, effectively demands that p increases very slowly in T.

Concentration and consistency of forests
We start by translating the concentration inequality of Theorem 2.1 into the framework of random forests, which are constructed by averaging a number of trees. To this end, let W_k := {Λ ⊆ V_k : |Λ| < ∞} be the family of all finite collections of k-valid partitions. In line with Wager and Walther [27], given an element Λ = {Λ_1, . . . , Λ_B} ∈ W_k, we define the corresponding forest

H_Λ(x) := (1/B) Σ_{b=1}^B T_{Λ_b}(x), x ∈ R^p. (3.1)

The associated partition-optimal forest H*_Λ is given by

H*_Λ(x) := (1/B) Σ_{b=1}^B T*_{Λ_b}(x), x ∈ R^p.

As an immediate consequence of Theorem 2.1, we obtain the following concentration inequality, which applies uniformly across all k-valid forests (the result is stated without proof):

Corollary 3.1. Suppose that (A1)-(A3) hold. Then, with β given as in Theorem 2.1,

sup_{x ∈ R^p} sup_{Λ ∈ W_k} |H_Λ(x) − H*_Λ(x)| ≤ β (log T)^2 / √k

with probability at least 1 − 4T^{−1} for all sufficiently large T.

Note that all trees T_{Λ_1}, . . . , T_{Λ_B} in (3.1) are based on the same data set D_T (the partitions Λ_1, . . . , Λ_B as well as the averages within the relevant leaves are all computed from D_T). In contrast, in the random forests of Breiman [8], an initial bootstrap step is performed before growing each tree, meaning that trees are built on a bootstrap sample from D_T (with replacement) rather than on D_T itself. Once we have a concentration inequality as in Theorem 2.1 (or Corollary 3.1) at our disposal, it is not difficult to design trees in such a way that the corresponding random forests are consistent estimators of f. Roughly speaking, given that f is smooth, and since each tree in a forest is close to its partition-optimal counterpart with high probability, it is sufficient to design the recursive partitioning scheme such that the maximal diameter of each leaf shrinks to zero as T becomes large. Below we demonstrate how to refine the collection of k-valid partitions V_k in a suitable way and, subsequently, prove consistency of the corresponding forests. The construction will be similar to those of [20,26,27]. We emphasize that the refinement considered here does not result in one particular random forest estimator; rather, a number of rules is outlined, and these ensure consistency of any random forest estimator which is built in line with them.
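The aggregation in (3.1), with all trees grown on the same sample D_T and no bootstrap step, can be mimicked with single-split "trees" whose split direction and position are drawn at random, loosely in the spirit of the extremely randomized trees of [14]; the step-function data below are purely illustrative:

```python
import numpy as np

def grow_random_stump(X, Y, rng):
    """A single-split tree: direction and position drawn at random,
    grown on the full sample (no bootstrap step)."""
    i = rng.integers(X.shape[1])
    tau = rng.uniform(X[:, i].min(), X[:, i].max())
    left = X[:, i] <= tau
    if left.all() or (~left).all():          # degenerate split: global mean
        mu = Y.mean()
        return lambda x: mu
    mL, mR = Y[left].mean(), Y[~left].mean()
    return lambda x, i=i, tau=tau, mL=mL, mR=mR: mL if x[i] <= tau else mR

def forest_predict(x, trees):
    """The forest (3.1): the plain average of the B tree predictions."""
    return float(np.mean([t(x) for t in trees]))

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 1))
Y = np.where(X[:, 0] <= 0.0, -1.0, 1.0)
trees = [grow_random_stump(X, Y, rng) for _ in range(200)]
pred = forest_predict(np.array([-0.8]), trees)   # negative, toward the true value -1
```

Each individual stump is a crude estimator, but their average already tracks the sign of the target function; this is exactly the averaging effect discussed around Remark 3.3.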
With α ∈ (0, 1/2), k ≥ 1 and m ≥ 2k, we call Λ an (α, k, m)-valid partition, and write Λ ∈ V α,k,m , if it is recursive and obeys the following rules: (i) Any currently unsplit node with at least m data points will eventually be split.
(ii) The probability ρ i = ρ i (D T ) that a given (feasible) node is split along the i-th direction is bounded from below for all i = 1, . . . , p by a strictly positive constant.
(iii) The split position is chosen such that each child node contains at least a fraction α ∈ (0, 1/2) of the data points in its parent node.
(iv) All leaves of the tree contain at least k data points.
The corresponding (α, k, m)-valid forest is given by (3.1) with Λ_1, . . . , Λ_B ∈ V_{α,k,m}. Let us now briefly address the rules outlined in (i)-(iv). Clearly, (iv) ensures V_{α,k,m} ⊆ V_k, and thus (α, k, m)-valid forests form a subclass of k-valid forests. Rule (i) controls the maximal number of observations in each leaf of a tree, and m = 2k corresponds to a situation where one keeps splitting until placing another split would violate (iv). In general, if m is not too large relative to T, this condition ensures that the number of leaves becomes large and, hence, the partition becomes fine. Concerning (ii), it ensures that, eventually, a split will be placed along any of the p (canonical) directions of the input space R^p. Such a condition makes sense for us when p is thought of as being fixed and rather small, but it will not be reasonable in sparse settings where p → ∞, where one would instead design the algorithm to detect important directions with high probability. On the other hand, ρ_i is indeed allowed to depend on D_T, so one may use the data to identify which directions are most important and then, based on this, form the probabilities ρ_1, . . . , ρ_p. In a time series setting, it may be advantageous to favor splits along the first direction, which corresponds to the observation that is likely to be most strongly dependent with the observed value of Y. Finally, (iii) is a balancing condition which prohibits "edge splits". This is a technical condition imposed to track the distribution of data points among leaves. In theoretical work on random forests within the regression setting, it is typical to impose assumptions similar to (i)-(iv); see [20,26,27]. On the other hand, standard implementations, such as the RandomForestRegressor from the sklearn library in Python and the ranger package in R, incorporate only (i), (ii) and (iv). Since consistency will be established by relying on Theorem 2.1, we require that (A1)-(A3) are satisfied.
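Rules (i), (iii) and (iv) are simple counting conditions, so checking whether a proposed split respects them takes only a few lines; the numbers below are illustrative:

```python
def split_allowed(n_parent, n_left, alpha, k):
    """Rules (iii)-(iv): each child keeps at least a fraction alpha of the
    parent's data points, and every leaf keeps at least k points."""
    n_right = n_parent - n_left
    smaller = min(n_left, n_right)
    return smaller >= alpha * n_parent and smaller >= k

def must_split(n_node, m):
    """Rule (i): a node holding at least m data points must eventually be split."""
    return n_node >= m

# With alpha = 0.2, k = 5 and m = 2 * k = 10:
assert split_allowed(100, 30, alpha=0.2, k=5)      # 30/70: balanced enough
assert not split_allowed(100, 10, alpha=0.2, k=5)  # 10 < 0.2 * 100: an "edge split"
assert must_split(12, m=10) and not must_split(8, m=10)
```

Rule (ii) is not a per-split check but a property of the randomization: each direction must carry a split probability ρ_i bounded away from zero.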
Moreover, the following assumptions are imposed:

(A4) The maximal leaf size m satisfies m/T → 0 as T → ∞.

(A5) The function f is Lipschitz continuous; that is, |f(x) − f(y)| ≤ C‖x − y‖ for all x, y ∈ R^p,

with C ∈ (0, ∞) being a suitable constant and ‖·‖ some norm on R^p.
With assumptions (A1)-(A5) in hand, we can now state the following consistency result for (α, k, m)-forests applied to nonlinear autoregressive processes:

Theorem 3.2. Let f̂_T be an (α, k, m)-forest and suppose that (A1)-(A5) are satisfied. Then the following statements hold:

(a) f̂_T is a pointwise consistent estimator of f in the sense that f̂_T(x) → f(x) in probability as T → ∞ for any x ∈ R^p.
(b) f̂_T(X) is a consistent estimator of the conditional mean E[Y | X] in the sense that f̂_T(X) − E[Y | X] → 0 in probability as T → ∞.
Remark 3.3. It should be emphasized that, since consistency is obtained through Theorem 2.1, the averaging effect gained by considering (3.1) rather than a single tree is not exploited in this setting. In particular, for the regression trees to concentrate around their partition-optimal counterparts, the number of observations in each leaf is required to approach infinity as T becomes large (cf. (A3)). If this is not the case, averages within leaves do not converge, meaning that individual trees will be inconsistent estimators of f. In this case, consistency of f̂_T would have to come from the improved accuracy gained by averaging trees.

A simulation study
To keep things simple, we consider initially p = 1, so that f is one-dimensional and (Y_t)_{t≥1} is a first order Markov chain. Within this setting, we choose four different specifications of f, jointly referred to as (4.1), the last of which is

f(x) = min{|x|, 0.75} min{|x|, 10}. (4.1)

The first specification of f satisfies f(x) = 0.5x when x ∈ [−10, 10], and is constant outside of [−10, 10], and hence the corresponding process (Y_t)_{t≥1} is intended to mimic the classical linear AR(1) process. Indeed, it is very unlikely that |Y_t| exceeds 10, which means that there is only little practical difference between the two processes. The second specification is an example of an exponential AR model (see, e.g., [3]), while the last two specifications of f correspond to an oscillating function and a particular spline, respectively. In Figure 1, we have simulated a sample path Y_1, . . . , Y_400 for each of these specifications of f. We consider estimation of f by a random forest f̂_T across different sample sizes T, and we will be using the ranger package of R with B = 500 and k = ⌊0.04(log T)^4 log log T⌋. To obtain diverse trees, we will use the extremely randomized trees of Geurts et al. [14], which corresponds to setting the parameters replace = FALSE, sample.fraction = 1 and splitrule = "extratrees". Effectively, this means that split positions are chosen at random and that we build each tree using the entire sample D_T (no initial bootstrap step). Note that, while this implementation aligns with the (α, k, m)-valid forests treated in Section 3, α is not a prespecified parameter in the ranger package, yet in principle its value can be implicitly determined. In Figure 2 we display the resulting estimates of f for the various sample sizes T. Furthermore, we note that choosing the parameter k in finite samples is not a trivial task, and the choice used above is rather arbitrary (the assumption of (A3) concerns only its asymptotic behavior). Nevertheless, its value can have a significant impact on performance as it controls the bias-variance tradeoff of the estimator.
While optimal tuning of k is outside the scope of this paper, we illustrate its effect on f̂_T in Figure 4, where we estimate two of the functions in (4.1) for different values of k using a sample of size T = 1600. For comparison, the value used for k in Figure 2 when T = 1600 was 236. We conclude this section by indicating consistency of random forests in a more challenging setting. In particular, we consider p = 2 and a two-dimensional specification of f. We rely on the ranger package once again with the same specifications as were used to obtain Figure 2, but we pass in the additional parameter split.select.weights = (1/2, 1/2), so that the probability of splitting along a given direction is the same for both directions (that is, ρ_1 = ρ_2 = 1/2). To evaluate the performance, we compute the mean squared error over the grid X := {−2, −1.75, . . . , 1.75, 2}^2 for different values of T. In Figure 5, the MSE is depicted as a function of 10^{−4}T.
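The R setup above has a rough Python analogue: scikit-learn's ExtraTreesRegressor with bootstrap=False mirrors replace = FALSE and sample.fraction = 1, and min_samples_leaf plays the role of k. The sketch below reproduces the choice k = ⌊0.04(log T)^4 log log T⌋ (which indeed gives 236 at T = 1600) on a clipped-linear NLAR(1) path; it is an approximation of, not identical to, the ranger configuration used in the paper:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(3)
T = 1600
# NLAR(1) path with a clipped-linear f mimicking the paper's first specification
y = np.zeros(T + 1)
for t in range(1, T + 1):
    y[t] = 0.5 * np.clip(y[t - 1], -10.0, 10.0) + rng.standard_normal()
X, Y = y[:-1].reshape(-1, 1), y[1:]

k = int(0.04 * np.log(T) ** 4 * np.log(np.log(T)))   # the paper's leaf-size rule
forest = ExtraTreesRegressor(n_estimators=500, bootstrap=False,
                             min_samples_leaf=k, random_state=0).fit(X, Y)

# Mean squared error against the (known) f on an evaluation grid
grid = np.linspace(-2.0, 2.0, 17).reshape(-1, 1)
mse = float(np.mean((forest.predict(grid) - 0.5 * grid.ravel()) ** 2))
```

Rerunning this for increasing T makes the MSE shrink, which is the empirical counterpart of the consistency statement in Theorem 3.2.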

Proofs
It will be convenient to transform the input vector X_t = (Y_{t−1}, . . . , Y_{t−p}) so that it takes values in [0, 1]^p. Effectively, this can be done by applying a cumulative distribution function

F_h(x) := ∫_{−∞}^x h(y) dy, x ∈ R, (5.1)

with h : R → [0, ∞) being a probability density which is strictly positive almost everywhere. We extend the domain of F_h to R̄ := R ∪ {±∞} by using the conventions F_h(−∞) = 0 and F_h(∞) = 1, so that the mapping ι_h(x) := (F_h(x_1), . . . , F_h(x_p)) is one-to-one between R̄^p and [0, 1]^p. The transformed input vector is defined by Z_t = ι_h(X_t). While there are no further restrictions on the choice of h, we will pick one that leads to good estimates on the density h_Z of Z_1.
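The transform ι_h acts coordinatewise through F_h; a minimal sketch with the logistic CDF standing in for F_h (an illustrative choice of h, not the one used in the proofs) looks as follows:

```python
import numpy as np

def F_logistic(x):
    """A strictly increasing CDF with everywhere-positive density; the
    conventions F(-inf) = 0 and F(inf) = 1 come out automatically."""
    return 1.0 / (1.0 + np.exp(-x))

def iota(X):
    """Coordinatewise transform of input vectors to the unit cube [0, 1]^p."""
    return F_logistic(X)

X = np.array([[-np.inf, -1.0], [0.0, 2.5], [np.inf, 0.0]])
Z = iota(X)   # Z[0, 0] == 0.0, Z[2, 0] == 1.0, all entries in [0, 1]
```

Because F_logistic is strictly increasing, the transform is one-to-one, so counts of points falling in a rectangle are unchanged under the transformation; this is what makes it harmless for the tree-based arguments below.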
Proof. It follows from (A1) that ε_1 admits a density h_ε which is strictly positive almost everywhere, and hence h = h_ε is a valid density to use for defining Z_t = ι_h(X_t). Moreover, by (2.4) in (A1) it holds that

F_ε(x + M) ≤ ζ̄ F_ε(x) and F_ε(x − M) ≥ ζ̄^{−1} F_ε(x), x ∈ R, (5.3)

for a suitable constant ζ̄ ∈ (1, ∞). Since ε_i − M ≤ Y_i ≤ ε_i + M by (A2), it follows from the independence of ε_1, . . . , ε_p and the monotonicity of F_ε that, in order to show that h_Z meets (5.2) with ζ = ζ̄^p, we only need to show that both F_ε(F_h^{−1}(z) + M) ≤ ζ̄ z and F_ε(F_h^{−1}(z) − M) ≥ ζ̄^{−1} z for an arbitrary z ∈ [0, 1], where F_h is the cumulative distribution function defined by (5.1). Observe that both inequalities are immediate from (5.3) and the definition of F_h (with h = h_ε we have F_h = F_ε, so F_ε(F_h^{−1}(z)) = z), and this completes the proof.
In all of the following, Z_t = ι_h(X_t) for some h such that (5.2) holds, and we will be using the notation #R := |{t ∈ {1, . . . , T} : Z_t ∈ R}|, µ(R) := P(Z_1 ∈ R), and η(R) := E[Y_1 | Z_1 ∈ R] for any given measurable set R ⊆ [0, 1]^p. Note that if Λ ∈ V_k, the partition Λ̄ of [0, 1]^p obtained by the exact same sequence of consecutive splits, applied to the transformed inputs Z_1, . . . , Z_T, is again a k-valid partition and T_Λ(x) = T_Λ̄(ι_h(x)) for all x ∈ R^p. Moreover, for any x ∈ R^p we have that

|T_Λ(x) − T*_Λ(x)| ≤ sup_{L ∈ L_k} |G_T(L)|, where G_T(L) := (1/#L) Σ_{t : Z_t ∈ L} Y_t − η(L), (5.7)

and where L_k consists of all sets which are members of k-valid partitions of [0, 1]^p. In particular, it suffices to prove uniform concentration inequalities for empirical averages over rectangles in L_k. Still, there are infinitely many rectangles in L_k, so one cannot simply analyze |G_T(L)| and then rely on a union bound. We will follow the ideas of Wager and Walther [27], who demonstrated that one only needs to understand the concentration over a much smaller set of approximating rectangles. In particular, we will make use of one of their results, which states that there exists a rather small collection of rectangles in [0, 1]^p containing good approximations to any non-negligible rectangle in terms of Lebesgue measure. Since their result is more general than what is needed here (e.g., it can be used in situations where p → ∞), we state a rather simplified version in Theorem 5.2 below. To avoid introducing too many non-informative constants in the following, we introduce some convenient notation. For two sequences (a_t)_{t≥1} and (b_t)_{t≥1} we will write a_t ≲ b_t if there exists a constant c ≥ 1 such that a_t ≤ c b_t for all t. If both a_t ≲ b_t and b_t ≲ a_t, we write a_t ≍ b_t.
Theorem 5.2 (Wager and Walther [27]). Let ε ≍ k^{−1/2} and w ≍ k/T. Then there exists a collection of rectangles R_{ε,w} with the following two properties:

i) For any rectangle L ⊆ [0, 1]^p with Leb(L) ≥ w there exists a rectangle L_−^ε ∈ R_{ε,w} such that

L_−^ε ⊆ L and Leb(L \ L_−^ε) ≤ ε Leb(L). (5.8)

ii) The cardinality |R_{ε,w}| of R_{ε,w} satisfies the bound log |R_{ε,w}| ≲ log T.
Let ε, w ∈ (0, 1) be given as in Theorem 5.2. It follows that any given leaf L ∈ L_k^w := {L ∈ L_k : Leb(L) ≥ w} can be inner ε-approximated by a rectangle L_−^ε from R_{ε,w} in the sense of (5.8). Moreover, the deviation G_T(L) can be decomposed accordingly, which results in the inequality (5.9) bounding |G_T(L)| by a sum of three terms. Thus, to obtain a concentration inequality for (5.7) it suffices to show that, for all large T and with high probability, the three terms on the right-hand side of the inequality (5.9) are small and L_k = L_k^w. Bounding the first term of (5.9) is the easiest task.
Proof. Consider an arbitrary leaf L ∈ L_k^w. The claim follows by combining the boundedness of f imposed in (A2) with Lemma 5.1 and the approximation property (5.8), and this concludes the proof.
The key to obtaining estimates of the second and third term of (5.9), as well as showing that L_k = L_k^w with high probability, is to establish good concentration inequalities for the counts #L and #L_−^ε which apply across all L ∈ L_k^w. As we will see in later proofs, by relying on Theorem 5.2 and ideas similar to [27, Theorem 10 and Lemma 13], it suffices to understand the concentration of #R across all rectangles in R_{ε,w} of non-negligible volume. This is the motivation for the following result, which relies on a Bernstein type inequality for weakly dependent processes.
Proof. Note that, by a union bound, it suffices to establish the bound (5.11) for any fixed R ∈ R_{ε,w} with µ(R) ≥ w. To this end, observe that (Y_t)_{t≥1} forms a stationary geometrically ergodic p-th order Markov chain (cf. [3, Theorem 3.1]). It is well-known that any such chain is exponentially α-mixing (see, e.g., [13, p. 89]); in particular, the t-th α-mixing coefficient α(t) decays exponentially fast in t, cf. (5.12). Moreover, the α-mixing coefficients of (1_R(Z_t))_{t≥1} are obviously bounded by (α(t))_{t≥0} (which do not depend on R), and thus we can rely on a Bernstein type inequality for weakly dependent sequences [21, Theorem 2] to establish the concentration bound (5.13). It is easy to see that |Cov(1_R(Z_{t+1}), 1_R(Z_1))| ≤ min{α(t), µ(R)}. From this inequality and the fact that α(t) ≤ µ(R) as long as t ≳ log(T/k), which follows from (5.12) and µ(R) ≳ k/T, we deduce a suitable bound on the variance appearing in (5.13). By combining this variance bound with the inequality (5.13) and using that µ(R) ≳ 1/T, we arrive at (5.17); to put it differently, we may choose a sufficiently large constant γ̃ such that (5.17) holds for any fixed τ ∈ (0, ∞). Since µ(R) ≳ k/T, the maximum in (5.17) is equal to its last term if k ≥ κ(log T)^3 log τ for a suitable constant κ. Moreover, if τ = |R_{ε,w}|T, Theorem 5.2(ii) shows that log τ ≲ log T, so if k is chosen in accordance with (A3), the last term of the maximum in (5.17) is the dominating one when T is large. Consequently, by choosing x to be the right-hand side of (5.17) with τ = |R_{ε,w}|T, we obtain the desired tail estimate. By using Theorem 5.2(ii) once again, it follows that (5.11) is satisfied for a suitable constant γ, and this verifies that (5.10) holds with probability at least 1 − T^{−1} for all sufficiently large T.
The next result shows how the inequality (5.10) impacts the magnitude of the third term of (5.9).
By combining this with (5.22) we conclude that (5.23) holds, where it is implicitly understood that the supremum only runs over rectangles in [0, 1]^p. Now, if R is a rectangle with µ(R) < 2ζw, we may expand it along one or more of the p directions to obtain a new rectangle R̃ with R ⊆ R̃ ⊆ [0, 1]^p and µ(R̃) = 2ζw. Thus, by (5.23) this yields an upper bound on #R. By (A3), the last term in the parenthesis goes to zero and e^{ζ^2 ε} goes to one as T approaches infinity, so we establish that #R < k as long as T exceeds a certain threshold (which does not depend on R). To put it differently, as long as T is sufficiently large, the following implication holds for any rectangle R ⊆ [0, 1]^p:

#R ≥ k implies µ(R) ≥ 2ζw. (5.25)

Consider now any leaf L ∈ L_k. By (5.25) it must be the case that µ(L) ≥ 2ζw, and thus (5.19) is an immediate consequence of (5.23). Moreover, the µ-measure of the inner ε-approximation L_−^ε of L is bounded from below as

µ(L_−^ε) ≥ (1 − ζ^2 ε)µ(L) ≥ k/(4T), (5.26)

where the last inequality applies as long as T is large enough. Thus, (5.21) is implied by (5.10). In order to prove (5.20), first note that (5.27) holds by (5.21). By dividing both sides of (5.27) by Tµ(L_−^ε) and using that Tµ(L_−^ε) ≥ k/4 when T is large (by (5.26)), we obtain (5.28). Now, by using the bound (5.28) for the last term in (5.27) and rearranging terms, we arrive at a bound involving the factor 2γ log T/√k. By (A3), 2γ log T/√k ≤ 1 when T is sufficiently large, and this proves (5.20). Now we use (5.19)-(5.21) to bound (#L − #L_−^ε)/#L uniformly across L ∈ L_k^w. For an arbitrary leaf L ∈ L_k^w, (5.19) provides an upper bound on the difference #L − #L_−^ε, and due to (5.20) the resulting estimate applies as long as T exceeds a certain threshold (which does not depend on L). Moreover, (5.20) implies a corresponding lower bound on the count #L and, as in (5.26), the µ-measure of L_−^ε is bounded from below. In view of (5.18), this finishes the proof.
Remark 5.6. Suppose that we are in the setting of Lemma 5.5. In its proof it is in fact established that L_k = L^w_k when (5.10) holds and T is large; in particular, this is an immediate consequence of (5.25).
In a similar way, we use (5.10) to bound the second term of (5.9); this is detailed in the following lemma.
Proof. For any given rectangle R we have the bound (5.32). When (5.10) is satisfied and T is large enough, it follows from (5.26) (which holds under (A1)-(A3)) that Moreover, for any R ∈ R′, (5.10) immediately implies that and hence also that as soon as T exceeds a certain threshold (which is independent of R). By combining (5.32)-(5.35) we obtain the result.
Since R_{ε,w} is a rather small collection of sets, Lemmas 5.5 and 5.7 indicate that the only missing ingredient for the proof of Theorem 2.1 is a bound on max_{t=1,…,T} for any R ∈ R_{ε,w} with µ(R) ≥ ζw. The first term is easy to handle, since it is a maximum of i.i.d. random variables satisfying Bernstein's condition (2.3). The last two terms can be handled by relying on Bernstein-type inequalities for martingale differences and weakly dependent random variables. We go through the details below.
Proof of Theorem 2.1. The proof proceeds by defining four events E_1, E_2, E_3 and E_4 and arguing that (i) inequality (2.8) holds on E_1 ∩ E_2 ∩ E_3 ∩ E_4, and (ii) each event E_i occurs with probability at least 1 − T^{-1}. With ε = k^{-1/2} and w = k/(4ζT), where ζ ∈ (1, ∞) is given as in Lemma 5.1, the events that we consider are the following: Here γ is the constant from Lemma 5.4, while c_1, c_2 and c_3 will be introduced during the proof. Moreover, E^c_1 denotes the complement of E_1. Proof of (i): Suppose that the event E_1 ∩ E_2 ∩ E_3 ∩ E_4 has occurred. Then, by using (5.9) and Lemmas 5.3, 5.5 and 5.7, it follows that In view of this inequality, (5.7) and Remark 5.6, we conclude that (2.8) is satisfied on E_1 ∩ E_2 ∩ E_3 ∩ E_4 for a suitably chosen constant β.
Proof of (ii): The content of Lemma 5.4 is exactly that P(E_1) ≥ 1 − T^{-1}. Since the moments of ε_1 meet (2.3), its distribution is sub-exponential and (2.6) holds. Moreover, ε_1, …, ε_T are i.i.d. random variables, so by applying a union bound we obtain the estimate In other words, max_{t=1,…,T} |ε_t| ≤ x with probability at least 1 − T^{-1} if x ≥ log(γ_1 T²)/γ_2, and this shows P(E_2) ≥ 1 − T^{-1} for some c_1. Next, consider any rectangle R ∈ R_{ε,w} with µ(R) ≥ ζw and observe that (ε_t 1_R(Z_t))_{t≥1} is a martingale difference sequence with respect to the filtration F_t = σ(Y_s : s ≤ t). Since ε_t is independent of F_{t−1} and its moments satisfy (2.3), In particular, these observations show that we can rely on a Bernstein (Freedman) type inequality for unbounded summands to obtain for any x, y > 0. Such a result can, e.g., be found in [11, Theorem 8.2.2]. Let γ be the constant from Lemma 5.4, consider specifically and note that y ≤ 2Tµ(R) when T is large (by (A3)). By using (5.36) with this choice of y it follows that From this inequality we deduce the existence of a constant κ such that for any τ > 0, for a suitable constant c_2. By rearranging terms and using that Tµ(R) ≥ k/4 it follows that Consequently, by relying on a union bound over all rectangles in R_{ε,w}, we establish that P(E_3) ≥ 1 − T^{-1}. To show P(E_4) ≥ 1 − T^{-1}, we consider again an arbitrary rectangle R ∈ R_{ε,w} with µ(R) ≥ ζw. The sequence (f(X_t) 1_R(Z_t))_{t≥1} is bounded and α-mixing, and its mixing coefficients are bounded by those of (X_t)_{t≥1}, which we denote by (α(t))_{t≥1}. In particular, by (5.12) it follows that the mixing coefficients of (f(X_t) 1_R(Z_t))_{t≥1} are bounded by an exponentially decaying sequence of numbers with a decay rate which does not depend on R.
Consequently, as in the proof of Lemma 5.4, we can again rely on [21, Theorem 2] to obtain that Thus, by using the same arguments as in the proof of Lemma 5.4 (in relation to (5.14)), we establish ν²_R ≲ µ(R) log T, meaning that (5.40) implies

log P(| (1/T) Σ_{t : Z_t ∈ R} f(X_t) − E[f(X) 1_R(Z)] | > x) ≲ − x²T / max{µ(R) log T, x (log T)²}.   (5.41)
Since the right-hand side of (5.41) is the same as in (5.15), we can use the exact same arguments to verify the existence of a constant c_3 such that when T exceeds a certain threshold (not depending on R). In particular, from which it follows by a union bound over rectangles in R_{ε,w} that P(E_4) ≥ 1 − T^{-1}.
We have now argued that both (i) and (ii) outlined at the beginning of the proof hold true, and hence we obtain the desired result.
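The union-bound step used for the event E_2 in the proof above is simple enough to check numerically. The sketch below (γ_1, γ_2 and T are illustrative values) verifies that with a sub-exponential tail P(|ε_1| > x) ≤ γ_1 e^{−γ_2 x}, the threshold x = log(γ_1 T²)/γ_2 makes the union bound over t = 1, …, T come out at exactly 1/T.

```python
import math

# Assumed sub-exponential tail bound: P(|eps_1| > x) <= gamma1 * exp(-gamma2 * x).
gamma1, gamma2, T = 3.0, 0.7, 10_000

# Threshold from the proof: x >= log(gamma1 * T^2) / gamma2.
x = math.log(gamma1 * T**2) / gamma2

# Union bound over t = 1, ..., T:
#   P(max_t |eps_t| > x) <= T * gamma1 * exp(-gamma2 * x) = 1/T at this x.
union_bound = T * gamma1 * math.exp(-gamma2 * x)
print(x, union_bound, 1 / T)
```

The cancellation is exact by construction: e^{−γ_2 x} = 1/(γ_1 T²), so the factor γ_1 cancels and one power of T survives, giving the failure probability T^{-1} claimed for E_2.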
We now turn to the task of proving Theorem 3.2. To do so, we will make use of an auxiliary result, which is presented in Lemma 5.8 below. In this formulation, diam(A) := sup_{x,x′ ∈ A} ‖x′ − x‖ is the diameter of A ⊆ R^p.
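For the rectangular cells produced by the partitioning scheme, this diameter has a closed form worth keeping in mind: the supremum is attained at opposite corners, so diam(A) is the Euclidean norm of the vector of side lengths. A minimal sketch (the helper name is hypothetical):

```python
import math

# For a rectangle A = prod_i [a_i, b_i], the two points furthest apart are
# opposite corners, so diam(A) = sqrt(sum_i (b_i - a_i)^2).
def diam(rect):
    return math.sqrt(sum((b - a) ** 2 for (a, b) in rect))

print(diam([(0.0, 0.3), (0.0, 0.4)]))  # side lengths 0.3 and 0.4, diameter ~ 0.5
```

In particular, diam(A_Λ(x)) → 0 exactly when every side length of the cell containing x shrinks to zero, which is why the proof of Lemma 5.8 argues coordinate by coordinate.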
Proof. Let us represent the rectangle A_Λ(x) in Λ containing x ∈ R^p as A_Λ(x) = A^1_Λ(x) × ⋯ × A^p_Λ(x). Then, it suffices to show that with probability one for i = 1, …, p. To this end, imagine the tree illustrating how Λ is obtained by the recursive partitioning scheme and consider the path that x takes down the tree from its root to the leaf A_Λ(x). Let d denote the depth of the tree at x (that is, x traverses exactly d − 1 nodes before it reaches A_Λ(x)), and let A^l be the node containing x at depth l. In particular, (A^l)_l is a decreasing sequence of sets with A^1 = R^p and A^d = A_Λ(x), and A^l_j ≠ A^{l+1}_j for exactly one j (with the notation A = A_1 × ⋯ × A_p). We let S^i_l = 1_{A^l_i ≠ A^{l+1}_i} indicate whether the node containing x at depth l is split along the i-th direction, and τ^i_l = min{j ∈ {τ^i_{l−1} + 1, …, d − 1} : S^i_j = 1} the depth at which x experiences the l-th split along the i-th direction (τ^i_0 ≡ 0 and, say, τ^i_l = ∞ if the set is empty). For an illustration of these definitions, see Figure 6. By the construction of the tree (specifically, rules (i) and (iii) outlined in Section 3) it holds that m ≥ Tα^{d−1}, and hence

d ≥ 1 + log(T/m) / log(α^{−1}).   (5.43)
Recall also that the tree is constructed in such a way that there exists a strictly positive constant ρ which is a lower bound for the probability ρ_i of splitting along the i-th direction at any given node. Suppose for simplicity (but without loss of generality) that, in fact, ρ_i = ρ. Then, (S^i_l)_{l≥1} is a sequence of i.i.d. Bernoulli random variables and thus, with probability one, Σ^n_{l=1} S^i_l → ∞ as n → ∞.
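The two quantitative ingredients of this argument, the deterministic depth bound (5.43) and the divergence of the split counts Σ_l S^i_l, can both be illustrated numerically (α, ρ, T and m below are illustrative values, not choices made in the paper):

```python
import math
import random

random.seed(0)

# Depth bound (5.43): rule (iii) forces m >= T * alpha^(d-1), hence
#   d >= 1 + log(T/m) / log(1/alpha).
alpha, T, m = 0.1, 100_000, 50
d_min = 1 + math.log(T / m) / math.log(1 / alpha)
# Consistency check: the bound saturates m >= T * alpha^(d_min - 1).
assert m >= T * alpha ** (d_min - 1) - 1e-9

# Split counts: each node is split along direction i with probability rho,
# so the number of splits along direction i down to depth n is a sum of
# i.i.d. Bernoulli(rho) variables, which diverges as n -> infinity.
rho = 0.3
splits = [1 if random.random() < rho else 0 for _ in range(2_000)]
total = sum(splits)
print(d_min, total)
```

Since (A5) makes the right-hand side of (5.43) grow with T, the depth at x, and hence the number of splits along every direction, tends to infinity, which is what drives diam(A_Λ(x)) → 0.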
Since the right-hand side of (5.43) tends to infinity by (A5), it follows that for suitable Λ_1, …, Λ_B ∈ V_{α,k,m}. Thus, it suffices to show that T*_Λ(x) → f(x) and T*_Λ(X) → f(X) in probability as T → ∞ when Λ ∈ V_{α,k,m} for all T. By using Lemma 5.8 together with the inequality which holds by (A4), it follows that T*_Λ(x) → f(x) almost surely and, in particular, in probability. Here, as in the proof of Lemma 5.8, the subscript T indicates that we are conditioning on the randomness related to the partition Λ. The last part follows immediately from Tonelli's theorem, as this implies that, on an event with probability one, T*_Λ(x) → f(x) for (Lebesgue) almost all x ∈ R^p.