Random forest estimation of conditional distribution functions and conditional quantiles

We propose a theoretical study of two realistic estimators of conditional distribution functions and conditional quantiles using random forests. The estimation process uses the bootstrap samples generated from the original dataset when constructing the forest. Bootstrap samples are reused to define the first estimator, while the second requires only the original sample once the forest has been built. We prove that both proposed estimators of the conditional distribution function are uniformly consistent almost surely. To the best of our knowledge, this is the first proof of consistency including the bootstrap part. We also illustrate the estimation procedures on a numerical example.


Introduction
The estimation of conditional distribution functions and conditional quantiles is an important task in several domains, including environment, insurance and industry. It is also an important tool for Quantile Oriented Sensitivity Analysis (QOSA), see e.g. Fort et al. (2016); Maume-Deschamps and Niang (2018); Browne et al. (2017). Various methods exist to estimate conditional quantiles, such as kernel based estimation or quantile regression (Koenker and Hallock, 2001), but they present some limitations. Indeed, the performance of kernel methods strongly depends on the bandwidth selection and quickly breaks down as the number of covariates increases. On the other hand, quantile regression is not adapted to a non-Gaussian setting, since the true conditional quantile is not necessarily a linear combination of the input variables (Maume-Deschamps et al., 2017). To overcome these issues, we propose to explore the random forest estimation of conditional quantiles (Meinshausen, 2006).
Random forest algorithms allow a flexible modeling of interactions in high dimension by building a large number of regression trees and averaging their predictions. The most famous random forest algorithm is that of Breiman (2001), whose construction is based on the seminal work of Amit and Geman (1997); Ho (1998); Dietterich (2000). Breiman's random forest estimate is a combination of two essential components: bagging and the Classification And Regression Trees (CART)-split criterion (Breiman et al., 1984). Bagging, for bootstrap-aggregating, was proposed by Breiman (1996) in order to improve the performance of weak or unstable learners.
Random forests are also related to some local averaging algorithms, such as nearest neighbors methods (Lin and Jeon, 2006; Biau and Devroye, 2010) or kernel estimators (Scornet, 2016c). More precisely, following Lin and Jeon (2006), the random forest method can be seen as an adaptive neighborhood regression procedure, and the prediction (estimation of the conditional mean) can therefore be formulated as a weighted average of the observed response variables.
Based on that approach, we develop a Weighted Conditional Empirical Cumulative Distribution Function (W_C_ECDF) approximating the Conditional Cumulative Distribution Function (C_CDF). Then, α-quantile estimates are obtained by using W_C_ECDF instead of C_CDF. Meinshausen (2006) defined a W_C_ECDF with weights using the original dataset, whereas we allow the weights to be constructed from the bootstrap samples, as is done in practice in regression random forests. We prove the almost sure consistency of these estimators. Both estimators have several advantages over methods such as kernel methods (Nadaraya, 1964; Watson, 1964). Due to the intrinsic tree building process, random forest estimators can easily handle both univariate and multivariate data with few parameters to tune. Besides, these methods have good predictive power and can outperform standard kernel methods (Davies and Ghahramani, 2014; Scornet, 2016c). Lastly, being based on the random forest algorithm, they are also easily parallelizable and can handle large datasets. An implementation of both algorithms is made available within a Julia package called ConditionalDistributionForest (Fabrège and Maume-Deschamps, 2020) as well as a python package named qosa-indices (Elie-Dit-Cosaque, 2020).
The C_CDF can be seen as a regression function. On this basis, we reviewed the literature dealing with the consistency of random forest estimates in order to show the convergence of our estimators. Several authors, such as Breiman (2004); Biau (2012); Wager and Walther (2015); Scornet et al. (2015b); Mentch and Hooker (2016); Wager and Athey (2018); Goehry (2019), have established asymptotic properties of particular variants and simplifications of Breiman's original random forest algorithm. Facing theoretical issues with the bootstrap, most studies replace it by subsampling, assuming that each tree is grown with s_n < n observations randomly chosen without replacement from the original dataset. Most of the time, in order to ensure the convergence of the simplified model, the subsampling rate s_n/n is assumed to tend to zero at some prescribed rate, an assumption that excludes the bootstrap mode. Besides, consistency is generally shown by assuming that the number of trees goes to infinity, which is not fully relevant in practice. Under some conditions, Scornet (2016a) showed that if the infinite random forest regression estimator is L2 consistent, then so is the finite random forest regression estimator when the number of trees goes to infinity in a controlled way. Recent attempts to bridge the gap between theory and practice provide some results on random forest algorithms, at the price of fairly strong conditions. For example, Scornet et al. (2015b) showed the L2 consistency of random forests in an additive regression framework by replacing the bootstrap step by subsampling. Their result rests on a fundamental lemma developed in Scornet et al. (2015a), which studies theoretical random forests. As highlighted by a counterexample developed in Section 4, additional assumptions are required to get out of the additive framework. Furthermore, consistency and asymptotic normality of the whole algorithm were recently proved under strong conditions by Wager and Athey (2018), replacing the bootstrap by subsampling and simplifying the splitting step.
One of the strong conditions used in Theorem 3.1 of Wager and Athey (2018) is that the individual trees satisfy a condition called honesty. An example of an honest tree given by the authors is one where the tree is grown using one subsample, while the predictions at the leaves of the tree are estimated using a different subsample. Due to this assumption, the authors admit that their theorems are not valid for most practical applications, because almost all implementations of random forests use the training sample twice.
Thus, despite an active investigation during the last decade, the consistency of the original (i.e. with the bootstrap samples) Breiman random forest method is not fully proved. This motivated our work. Our major contribution is the proof of the almost everywhere uniform convergence of the estimator W_C_ECDF, both using the bootstrap samples (Theorem 4.1) or the original one (Theorem 4.2). To the best of our knowledge, this is the first consistency result under realistic assumptions for a method based on bootstrap samples in the random forest field. Remark that Meinshausen (2006) gave a proof of the consistency in probability of the W_C_ECDF for a simplified model where the weights are considered as constant, while they are in fact random variables that depend heavily on the data.
The paper is organized as follows. Breiman's random forest algorithm is detailed in Section 2, where notations are stated. The random forest estimators of the C_CDF, based either on the bootstrap samples or on the original dataset, are introduced in Section 3 as a natural generalization of regression random forests. The main consistency results are presented in Section 4 and their proofs are gathered in Section 5. Section 6 is devoted to a short simulation study and a conclusion is given in Section 7.

Breiman's random forest
The aim of this section is to present Breiman's random forest algorithm and to state the notations used throughout this paper.
Random forest is a generic term for an aggregation scheme of decision trees allowing to deal with both supervised classification and regression tasks. We are only concerned with the regression task. The general framework is nonparametric regression estimation, where an input random vector X ∈ X ⊂ R^d is observed and a response Y ∈ R is predicted by estimating the regression function m(x) = E[Y | X = x]. We assume that we are given a training sample D_n = (X_j, Y_j)_{j=1,...,n} of independent random variables distributed as the prototype pair (X, Y), which is a (d+1)-dimensional random vector. The purpose is to use the dataset D_n to construct an estimator m_n : X → R of the function m.
Random forests proposed by Breiman (2001) build a predictor consisting of a collection of k randomized regression trees grown with the CART algorithm. The CART-split criterion of Breiman et al. (1984) is used in the construction of the individual trees to recursively partition the input space X in a dyadic manner. More precisely, at each step of the partitioning, a part of the space is divided into two sub-parts according to the best cut perpendicular to the axes. This best cut is selected in each node of the tree by optimizing the CART-split criterion over the d variables, i.e. by minimizing the prediction squared error in the two child nodes. The trees are grown until a stopping rule is reached. Several rules exist, but the one generally used is that the tree construction continues as long as leaves contain at least min_samples_leaf elements. This criterion is implemented in the RandomForestRegressor class of the python package Scikit-Learn (Pedregosa et al., 2011) and in the build_forest function of the Julia (Bezanson et al., 2017) package DecisionTree.
Building several different trees from a single dataset requires randomizing the tree building process. Breiman (2001) proposed to inject some randomness both in the dataset and in the tree construction. First of all, prior to the construction of each tree, a resampling step is done by bootstrapping (Efron, 1979) from the original dataset, that is, by choosing uniformly at random n times among the n observations, with replacement. Only these bootstrap observations are taken into account in the tree building. Accordingly, the min_samples_leaf hyperparameter introduced previously refers, in the random forest method, to the minimum number of bootstrap observations contained in each leaf of a tree. Secondly, at each step of the tree construction, instead of optimizing the CART-split criterion over the d variables, a number of variables called max_features is selected uniformly at random among the d variables. Then, the best split is chosen as the one optimizing the CART-split criterion only along the max_features preselected variables in each node.
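As an illustration of these two randomization hyperparameters, here is a minimal sketch using the RandomForestRegressor class mentioned above; the toy data and the parameter values are purely illustrative and do not correspond to the packages cited in this paper.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: an assumed model, not the example used later in the paper.
rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.uniform(size=(n, d))
Y = X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=n)

forest = RandomForestRegressor(
    n_estimators=500,     # k: number of trees
    min_samples_leaf=20,  # minimum number of (bootstrap) observations in each leaf
    max_features=2,       # number of candidate split variables drawn in each node
    bootstrap=True,       # each tree is fit on n observations drawn with replacement
    random_state=0,
)
forest.fit(X, Y)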
For any query point x ∈ X, the ℓ-th tree estimates m(x) as follows:

m_n(x; Θ_ℓ, D_n) = (1 / N^b_n(x; Θ_ℓ, D_n)) Σ_{(X_j, Y_j) ∈ D_n(Θ_ℓ)} Y_j 1_{X_j ∈ A_n(x; Θ_ℓ, D_n)},    (2.1)

where:
• Θ_ℓ, ℓ = 1, ..., k are independent random vectors, distributed as a generic random vector Θ = (Θ^1, Θ^2) and independent of D_n. Θ^1 contains the indexes of the observations that are used to build each tree, i.e. the bootstrap sample, and Θ^2 the indexes of the splitting candidate variables in each node.
• D_n(Θ_ℓ) is the bootstrap sample selected prior to the tree construction, the sum in (2.1) counting repeated observations with their multiplicity.
• A_n(x; Θ_ℓ, D_n) is the cell (leaf) of the ℓ-th tree containing x.
• N^b_n(x; Θ_ℓ, D_n) is the number of elements of D_n(Θ_ℓ) that fall into A_n(x; Θ_ℓ, D_n).
The trees are then combined to form the finite forest estimator:

m_{k,n}(x; Θ_1, ..., Θ_k, D_n) = (1/k) Σ_{ℓ=1}^k m_n(x; Θ_ℓ, D_n).    (2.2)

We may now present the conditional distribution function estimators.

Conditional Distribution Forests
We aim to estimate F(y|x) = P(Y ≤ y | X = x). Two estimators may be defined: one uses the bootstrap samples both in the forest construction and in the estimation, while the other uses the original sample in the estimation part. Once the distribution function has been estimated, the conditional quantiles may be estimated straightforwardly.

Bootstrap samples based estimator
First of all, let us define the random variable B_j(Θ^1_ℓ, D_n) as the number of times that the observation (X_j, Y_j) has been drawn from the original dataset for the ℓ-th tree construction. Thanks to it, the conditional mean estimator in Equation (2.2) may be rewritten as:

m_{k,n}(x; Θ_1, ..., Θ_k, D_n) = Σ_{j=1}^n w^b_{n,j}(x; Θ_1, ..., Θ_k, D_n) Y_j,    (3.1)

where the weights are defined by:

w^b_{n,j}(x; Θ_1, ..., Θ_k, D_n) = (1/k) Σ_{ℓ=1}^k B_j(Θ^1_ℓ, D_n) 1_{X_j ∈ A_n(x; Θ_ℓ, D_n)} / N^b_n(x; Θ_ℓ, D_n).    (3.2)

Note that the weights w^b_{n,j}(x; Θ_1, ..., Θ_k, D_n) are nonnegative random variables, as functions of Θ_1, ..., Θ_k, D_n, and that their sum over j = 1, ..., n equals 1.
The random forest estimator (3.1) can be seen as a local averaging estimate. Indeed, as mentioned by Scornet (2016b), regression trees average the observations located in a neighborhood of x, this neighborhood being defined as the leaf of the tree containing x. The forest, which aggregates several trees, also operates by computing a weighted average of the observations in a neighborhood of x. However, in the case of forests, this neighborhood results from the superposition of the neighborhoods of each tree, and therefore has a more complex shape. Several works have studied the random forest algorithm from this point of view (local averaging estimates), such as Lin and Jeon (2006), who were the first to point out the connection between random forests and adaptive nearest-neighbors methods, further developed by Biau and Devroye (2010). Other works, such as Scornet (2016c), have also studied random forests through their link with kernel methods.
We are interested in the Conditional Cumulative Distribution Function (C_CDF) of Y given X = x in order to obtain the conditional quantiles. Pairing the equality

F(y|x) = P(Y ≤ y | X = x) = E[1_{Y ≤ y} | X = x]

with the weighted approach described above, we propose to estimate the C_CDF as follows:

F^b_{k,n}(y | X = x) = Σ_{j=1}^n w^b_{n,j}(x; Θ_1, ..., Θ_k, D_n) 1_{Y_j ≤ y}.    (3.3)

Hence, given a level α ∈ ]0, 1[, the conditional quantile estimator q_α(Y | X = x) is defined as follows:

q^b_α(Y | X = x) = inf{ y ∈ R : F^b_{k,n}(y | X = x) ≥ α }.    (3.4)

A short illustration of this estimation scheme is sketched below, before we turn to the estimator using the original sample.
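The following minimal sketch is only an illustration of Equations (3.2) to (3.4), not the qosa-indices or ConditionalDistributionForest implementations; it fits each tree on an explicit bootstrap sample so that the counts B_j(Θ^1_ℓ, D_n) are available, assuming Scikit-Learn's DecisionTreeRegressor for the individual trees.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, Y, k=100, min_samples_leaf=20, max_features="sqrt", seed=0):
    # Grow k CART trees, each on its own bootstrap sample, and keep the bootstrap counts.
    rng = np.random.default_rng(seed)
    n = len(Y)
    trees, boot_counts = [], []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)          # bootstrap: n draws with replacement
        counts = np.bincount(idx, minlength=n)    # B_j(Theta^1, D_n), j = 1, ..., n
        tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf,
                                     max_features=max_features,
                                     random_state=int(rng.integers(2**31 - 1)))
        tree.fit(X[idx], Y[idx])
        trees.append(tree)
        boot_counts.append(counts)
    return trees, boot_counts

def bootstrap_weights(x, X, trees, boot_counts):
    # Weights w^b_{n,j}(x) of Equation (3.2).
    n, k = X.shape[0], len(trees)
    w = np.zeros(n)
    for tree, counts in zip(trees, boot_counts):
        leaf_x = tree.apply(x.reshape(1, -1))[0]  # leaf A_n(x; Theta, D_n) containing x
        in_leaf = tree.apply(X) == leaf_x         # original observations falling in that leaf
        n_boot = counts[in_leaf].sum()            # N^b_n(x; Theta, D_n)
        if n_boot > 0:                            # convention 0/0 = 0
            w[in_leaf] += counts[in_leaf] / n_boot
    return w / k

def conditional_quantile(x, X, Y, trees, boot_counts, alpha):
    # W_C_ECDF (3.3) evaluated at the sorted responses, then inverted as in (3.4).
    w = bootstrap_weights(x, X, trees, boot_counts)
    order = np.argsort(Y)
    cdf = np.cumsum(w[order])                     # F^b_{k,n}(Y_(j) | X = x)
    j = min(int(np.searchsorted(cdf, alpha)), len(Y) - 1)
    return Y[order][j]                            # inf{y : F^b_{k,n}(y | X = x) >= alpha}

Fitting each tree separately keeps the bootstrap counts explicit, at the price of not reusing Scikit-Learn's internal bagging; a call such as trees, counts = fit_forest(X, Y) followed by conditional_quantile(X[0], X, Y, trees, counts, alpha=0.7) returns the estimated 0.7-quantile at the first training point.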

Original sample based estimator
Trees are still grown with their respective bootstrap samples D_n(Θ_ℓ), ℓ = 1, ..., k, but instead of considering them in the estimation, we may use the original sample D_n. Consider the weights:

w^o_{n,j}(x; Θ_1, ..., Θ_k, D_n) = (1/k) Σ_{ℓ=1}^k 1_{X_j ∈ A_n(x; Θ_ℓ, D_n)} / N^o_n(x; Θ_ℓ, D_n),    (3.5)

where N^o_n(x; Θ_ℓ, D_n) is the number of points of D_n that fall into A_n(x; Θ_ℓ, D_n). As previously, the weights w^o_{n,j}(x; Θ_1, ..., Θ_k, D_n) are nonnegative random variables, as functions of Θ_1, ..., Θ_k, D_n, and their sum over j = 1, ..., n equals 1.
It was proposed in Meinshausen (2006) to estimate the C_CDF with:

F^o_{k,n}(y | X = x) = Σ_{j=1}^n w^o_{n,j}(x; Θ_1, ..., Θ_k, D_n) 1_{Y_j ≤ y}.    (3.6)

The conditional quantiles are then estimated by plugging F^o_{k,n} instead of F^b_{k,n} into (3.4). A complete description of the procedure for computing conditional quantile estimates via the C_CDF with both previous estimators can be found in Algorithm 1 (a sketch of the original-sample weights is also given below). A python library named qosa-indices has been developed to perform the numerical estimation of conditional distributions and quantiles for both methods. It is available at Elie-Dit-Cosaque (2020) and uses Scikit-Learn, Numpy and Numba. Both approaches are also implemented in a Julia package based on the library DecisionTree, available at Fabrège and Maume-Deschamps (2020).
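For comparison with the bootstrap-weight sketch above, the original-sample weights (3.5) only require the leaf memberships of the original observations; the following short sketch reuses the trees fitted previously and is, again, only an illustration under the same assumptions.

import numpy as np

def original_sample_weights(x, X, trees):
    # Weights w^o_{n,j}(x) of Equation (3.5): each original observation falling in the
    # leaf of x receives weight 1/N^o_n, regardless of its bootstrap multiplicity.
    n, k = X.shape[0], len(trees)
    w = np.zeros(n)
    for tree in trees:
        leaf_x = tree.apply(x.reshape(1, -1))[0]
        in_leaf = tree.apply(X) == leaf_x      # original observations in A_n(x; Theta, D_n)
        n_orig = in_leaf.sum()                 # N^o_n(x; Theta, D_n)
        if n_orig > 0:
            w[in_leaf] += 1.0 / n_orig
    return w / k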
It has to be noted that a package called quantregForest is available in R (R Core Team, 2019) and can be found at Meinshausen (2019). The estimation method currently implemented in quantregForest differs from the method described in Meinshausen (2006). It does the following: for a new observation x and the ℓ-th tree, one element of D_n = (X_j, Y_j)_{j=1,...,n} falling into the leaf node A_n(x; Θ_ℓ, D_n) is chosen at random. This gives k values of Y and allows to estimate the conditional distribution function with the classical Empirical Cumulative Distribution Function associated with the empirical measure. The performance of this method seems weak and no theoretical guarantees are available.

Consistency results
In this section, we state our main results, which are the uniform a.s. consistency of both estimators F^b_{k,n} and F^o_{k,n} of the conditional distribution function. This constitutes the most interesting result of this paper, because it handles the bootstrap component and gives almost sure uniform convergence. Indeed, most studies (Scornet et al., 2015b; Wager and Athey, 2018; Goehry, 2019) replace the bootstrap by subsampling without replacement in order to avoid the mathematical difficulties it induces, and therefore differ slightly from the procedure used in practice. Meinshausen (2006) showed the uniform convergence in probability of a simplified version of the estimator F^o_{k,n}: in Meinshausen (2006), the weights w^o_{n,j}(x; Θ_1, ..., Θ_k, D_n) are in fact considered to be non-random, while they are indeed random variables depending on (Θ_ℓ)_{ℓ=1,...,k} and D_n.

Algorithm 1: Conditional Distribution Forest algorithm
Input:
• Training sample: D_n
• Number of trees: k ∈ N*
• Number of features to be considered for the best split in a node: max_features ∈ {1, ..., d}
• Minimum number of samples required in a leaf node: min_samples_leaf ∈ {1, ..., n}
• Point where the conditional distribution function or the conditional quantile is required: x ∈ X
• Order of the conditional quantile: α ∈ ]0, 1[
Output: Estimated value of the α-order conditional quantile at the point x.
1 For ℓ = 1, ..., k:
2 Select uniformly with replacement n data points among D_n. Only these observations will be used in the tree construction.
3 Consider the whole space X as root node.
4 For each node to be split:
5 Select max_features variables uniformly at random among the d variables.
6 Select the best split optimizing the CART-split criterion along the preselected variables.
7 Cut the current node at the selected split in two child nodes.
8 Repeat the lines (5)-(7) for the two resulting nodes until each node of the tree contains at least min_samples_leaf observations.
9 Compute at the point x the weights (3.2) or (3.5), the corresponding C_CDF estimate (3.3) or (3.6), and deduce the conditional quantile estimate of order α via (3.4).

Overall, proving the consistency of forest methods whose construction depends both on the X_j's and on the Y_j's is a difficult task. This feature makes the resulting estimate highly data-dependent, and therefore difficult to analyze. A simplification widely used by most authors from a theoretical point of view is to work with random forest estimates whose tree shape depends only on the X_j's, which Devroye et al. (2013) called the X-property; the Y_j's are still used to compute the prediction, be it the conditional mean or the conditional distribution function for example. One of the first results dealing with a data-dependent random forest estimator of the regression function is Scornet et al. (2015b), who showed the L2 consistency in an additive regression framework by replacing the bootstrap by subsampling. Thanks to the following assumptions, we go further by showing the consistency of our estimators in a general framework and not only in the additive regression scheme.
Assumption 4.1. For all ℓ ∈ {1, ..., k}, we assume that the variation of the conditional cumulative distribution function within any cell goes to 0, that is, for all x ∈ X and all y ∈ R,

sup_{z, z' ∈ A_n(x; Θ_ℓ, D_n)} |F(y|z) − F(y|z')| → 0 a.s., as n → ∞.

We shall discuss Assumption 4.1 further on, but let us remark that it is satisfied, for example, provided that the diameter of each tree cell goes to zero and, for all y, F(y|·) is continuous.

Assumption 4.2. We shall make the following assumptions on k (the number of trees) and N^b_n(x; Θ, D_n) (the number of bootstrap observations in a leaf node):
1. k = O(n^α) for some α > 0, i.e. the number of trees grows at most polynomially in n;

Remark 4.1. In order to prove our main consistency result, either item 2. or item 3. of Assumption 4.2 is needed. Item 2. may seem much stronger than item 3., but it has to be noted that the number of bootstrap observations in a tree leaf is a construction parameter of the forest, so that it can be controlled. Using item 2. simplifies the proof, but item 3. is sufficient.
Assumption 4.3. For every x ∈ X, the conditional cumulative distribution function F(y | X = x) is continuous and strictly increasing in y.
The two theorems below give the uniform a.s. consistency of our two estimators.

Theorem 4.1. Consider a random forest which satisfies Assumptions 4.1 to 4.3. Then, for all x ∈ X,

sup_{y ∈ R} | F^b_{k,n}(y | X = x) − F(y | X = x) | → 0 a.s., as n → ∞.

Theorem 4.2. Consider a random forest which satisfies Assumptions 4.1 to 4.3. Then, for all x ∈ X,

sup_{y ∈ R} | F^o_{k,n}(y | X = x) − F(y | X = x) | → 0 a.s., as n → ∞.

Using standard arguments, the consistency of the quantile estimates stems from Assumption 4.3 together with the uniform convergence of the conditional distribution function estimators obtained above.
Let us make some comments on the assumptions above.
Assumption 4.1 ensures a control on the approximation error of the estimators. It is drawn from Proposition 2 of Scornet et al. (2015b), which shows the consistency of Breiman's random forest estimate in an additive regression framework. Their Proposition 2 allows to manage the approximation error of the estimator by showing that the variation of the regression function m within a cell of a random empirical tree is small provided n is large enough. This result is based on the fundamental Lemma 1 of Scornet et al. (2015b), which states that the variation of the regression function m within a cell of a random theoretical tree goes to zero for an additive regression model. A random theoretical tree is grown as a random empirical tree, except that the theoretical equivalent of the empirical CART-split criterion (Biau and Scornet, 2016), defined in any node A by

L_A(i, z) = V[Y | X ∈ A] − P(X^(i) < z | X ∈ A) V[Y | X^(i) < z, X ∈ A] − P(X^(i) ≥ z | X ∈ A) V[Y | X^(i) ≥ z, X ∈ A],

is used to choose the best split. Hence, a theoretical tree is obtained thanks to the best consecutive cuts (i*, z*) optimizing the previous criterion L_A(·, ·).
General results on standard partitioning estimators whose construction is independent of the labels in the training set (see Chapter 4 in Györfi et al. (2006) or Chapter 6 in Devroye et al. (2013)) state that a necessary condition to prove consistency is that the diameter of the cells tends to zero as n → ∞. Instead of such a geometrical assumption, Proposition 2 in Scornet et al. (2015b) ensures that the variation of m inside a node is small, thanks to their Lemma 1. But the cornerstone of Lemma 1 is the Technical Lemma 1 of Scornet et al. (2015a), recalled below for completeness.
Technical Lemma. Assume that the regression model is additive, i.e. Y = Σ_{i=1}^d m_i(X^(i)) + ε, and that L_A(i, z) = 0 for all i ∈ {1, ..., d} and all z ∈ [a_i, b_i]. Then the regression function m is constant on A.
This lemma states that if the theoretical split criterion is zero for all cuts in a node, then the regression function m is constant on this node, i.e. its variation within the cell is zero. But we will see in the sequel a counterexample where L_A(i, z) = 0 for all i and all z ∈ [a_i, b_i], and yet the regression function is not constant.

Within a standard cell of the form A = Π_{i=1}^d [a_i, b_i], and for a generic model of the type Y = m(X) + ε, with X a vector of independent random inputs and ε a centered noise independent of X, the condition above is equivalent to:

E[ m(X) | X^(i) < z, X ∈ A ] = E[ m(X) | X^(i) ≥ z, X ∈ A ] for all i and all z ∈ [a_i, b_i].

This result is obtained by differentiation with respect to z. Let us consider a two-dimensional example, with ε a centered noise independent of X. It can be shown for this model that, within the node A, L_A(i, z) = 0 for all i ∈ {1, 2} and all z ∈ [a_i, b_i], and yet the regression function m is not constant.
Accordingly, the technical lemma above is well-designed for an additive regression framework, but this context is far from reality for many concrete examples. Outside this framework, and as highlighted by our counterexample, an additional assumption is necessary in order to control the approximation error of the estimator. Theorem 1 in Meinshausen (2006) handles the approximation error of his estimator, based on a simplified random forest model, thanks to a restrictive assumption on the proportion of observations selected in each split and for each direction (see Assumption 3 in Meinshausen (2006)) and a Lipschitz assumption on the conditional distribution function, which is not always true, as mentioned in Biau and Scornet (2016). We use Assumption 4.1 instead.
On the other hand, Assumption 4.2 allows us to control the estimation error of our estimators: it expresses that cells should contain a sufficiently large number of points so that averaging among the observations is effective.
Finally, Assumption 4.3 is necessary to get uniform convergence of the estimators.
Hence, thanks to all these suitable assumptions we get the consistency of our estimators in Theorems 4.1 and 4.2. As far as we know, Theorem 4.1 is the first consistency result for a method based on the original Breiman's random forest algorithm (i.e. using the bootstrap samples).
The next section is devoted to the proofs of Theorems 4.1 and 4.2.

Proofs of the main theorems
The proofs of Theorems 4.1 and 4.2 are close. We begin with that of Theorem 4.1 and then sketch that of Theorem 4.2 which is a bit simpler.

Proof of Theorem 4.1
The main ingredient of the proof is to use a second sample D'_n in order to deal with the data-dependent aspect. Thus, we first define a dummy estimator based on two samples D_n and D'_n, which will be used below. The trees are grown as in Algorithm 1 using D_n, but we consider another sample D'_n = (X'_j, Y'_j)_{j=1,...,n} (independent of D_n and Θ) which is used to define a dummy estimator:

F_{k,n}(y | X = x) = Σ_{j=1}^n w_{n,j}(x; Θ_1, ..., Θ_k, X'_1, ..., X'_n, D_n) 1_{Y'_j ≤ y},    (5.1)

where the weights are:

w_{n,j}(x; Θ_1, ..., Θ_k, X'_1, ..., X'_n, D_n) = (1/k) Σ_{ℓ=1}^k 1_{X'_j ∈ A_n(x; Θ_ℓ, D_n)} / N_n(x; Θ_ℓ, X'_1, ..., X'_n, D_n), j = 1, ..., n,

with N_n(x; Θ_ℓ, X'_1, ..., X'_n, D_n) the number of elements of D'_n that fall into A_n(x; Θ_ℓ, D_n). Throughout this section, we shall use the convention 0/0 = 0 in case N_n(x; Θ_ℓ, X'_1, ..., X'_n, D_n) = 0 and thus 1_{X'_j ∈ A_n(x; Θ_ℓ, D_n)} = 0 for j = 1, ..., n. The weights w_{n,j}(x; Θ_1, ..., Θ_k, X'_1, ..., X'_n, D_n) are nonnegative random variables, as functions of Θ_1, ..., Θ_k, X'_1, ..., X'_n, D_n. To lighten the notation in the sequel, we will simply write F_{k,n}(y | X = x) = Σ_{j=1}^n w_j(x) 1_{Y'_j ≤ y} instead of (5.1).
Let x ∈ X and y ∈ R. We have:

|F^b_{k,n}(y | X = x) − F(y | X = x)| ≤ |F^b_{k,n}(y | X = x) − F_{k,n}(y | X = x)| + |F_{k,n}(y | X = x) − F(y | X = x)|.

The convergence of the two right-hand terms is handled separately in the following Proposition 5.1 and Lemma 5.2.
Hence, Proposition 5.1 establishes the consistency of a random forest estimator based on a second sample D'_n independent of D_n and Θ. Wager and Athey (2018) proved that estimators built from honest forests are asymptotically Gaussian. Remark that in Wager and Athey (2018), it is also required to control the proportion of chosen observations at each split and in each direction. In our case, going through a kind of honest tree is just a theoretical tool. We go one step further with the following lemma, by showing that the estimators built with honest and non-honest trees are close.
Lemma 5.2. Consider a random forest which satisfies Assumption 4.2. Then, for all x ∈ X and y ∈ R,

|F^b_{k,n}(y | X = x) − F_{k,n}(y | X = x)| → 0 a.s., as n → ∞.

Hence, according to Proposition 5.1 and Lemma 5.2, we get, for all x ∈ X and y ∈ R,

F^b_{k,n}(y | X = x) → F(y | X = x) a.s., as n → ∞.    (5.2)

Now, thanks to Dini's second theorem, let us sketch how to obtain the almost sure uniform convergence relative to y of the estimator.
Note that {Y_j ≤ y} = {U_j ≤ F_{Y|X=x}(y)} under Assumption 4.3, with U_j = F_{Y|X=x}(Y_j), j = 1, ..., n, which are i.i.d. random variables. Then, (5.2) is equivalent to: for all s ∈ ]0, 1[,

Σ_{j=1}^n w^b_j(x) 1_{U_j ≤ s} → s a.s., as n → ∞.

As in the proof of Glivenko-Cantelli's Theorem, using that s ↦ Σ_{j=1}^n w^b_j(x) 1_{U_j(ω) ≤ s} is increasing and Dini's second theorem, we get the uniform convergence almost everywhere, which concludes the proof of the theorem.
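The argument above uses the following standard form of Dini's second theorem, recalled here for completeness:

\[
(G_n)_{n\ge 1} \text{ nondecreasing on } [0,1],\quad G_n(s) \xrightarrow[n\to\infty]{} G(s) \ \forall s \in [0,1],\quad G \text{ continuous}
\;\Longrightarrow\; \sup_{s\in[0,1]} \lvert G_n(s)-G(s)\rvert \xrightarrow[n\to\infty]{} 0 .
\]

It is applied, for ω in a set of probability one, with G_n(s) = Σ_{j=1}^n w^b_j(x) 1_{U_j ≤ s}, which is nondecreasing in s, and G(s) = s, which is continuous on [0, 1].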
We now turn to the proofs of Proposition 5.1 and Lemma 5.2. To that aim, the following lemma, based on Vapnik-Chervonenkis classes (Vapnik and Chervonenkis, 1971) is a key tool.
Lemma 5.3. Consider D_n and D'_n, two independent datasets of n independent samples of (X, Y). Build a tree using D_n with the bootstrap and bagging procedure driven by Θ. As before, N^b(A_n(Θ)) = N^b_n(x; Θ, D_n) is the number of bootstrap observations of D_n that fall into A_n(Θ) = A_n(x; Θ, D_n), and N(A_n(Θ)) = N_n(x; Θ, X'_1, ..., X'_n, D_n) is the number of observations of D'_n that fall into A_n(Θ). Then:
Proof of Lemma 5.3.
Let ε > 0 and x ∈ X. We have: The last two right-hand terms are handled thanks to a direct application of the theorem of Vapnik and Chervonenkis (1971) over the class B, whose Vapnik-Chervonenkis dimension is 2d. This class is nothing more than an extension of the class R of rectangles in R^d. Following the lines of the proof of Theorem 13.8 in Devroye et al. (2013), one sees that the classes R and B have the same Vapnik-Chervonenkis dimension.
Special attention should be given to the first right-hand term. The bootstrap component is represented by the random vector (B_j(Θ^1, D_n))_{j=1,...,n}, referring to the number of times each observation (X_j, Y_j) has been chosen from the original dataset. Conditionally on D_n, this random vector has a multinomial distribution M(n; 1/n, ..., 1/n). As stated in Arenal-Gutiérrez et al. (1996), the bootstrap component can also be represented thanks to the variables selected with replacement from the set D_n = {(X_1, Y_1), ..., (X_n, Y_n)}. Let Z_1, ..., Z_n be these elements, which are distributed as Z = (Z^1, Z^2), whose distribution is the discrete uniform distribution over D_n conditionally on D_n. The first right-hand term is rewritten as: By applying the Vapnik-Chervonenkis Theorem under the conditional distribution given D_n, we get: Therefore, Finally, we get the overall upper bound: Lemma 5.3 is the main ingredient of the proof of Proposition 5.1.

Proof of Proposition 5.1.
We aim to prove: ∀x ∈ X, ∀y ∈ R, P(ω ∈ Ω : F_{k,n}(y | X = x)(ω) → F(y | X = x)) = 1. Let x ∈ X and y ∈ R. We have: We first show that (W_n)_{n≥1} goes to 0 a.s. in the case of Assumption 4.2 item 2. This is achieved by adapting Hoeffding's inequality proof to our random weighted sum context.
For any ε > 0 and t > 0, we have: We shall make use of the folklore lemma below.

Lemma. Let X be a centered random variable, a.s. bounded by 1. Then, for any t ∈ R, E[e^{tX}] ≤ e^{t²/2}.
Thus, using Assumption 4.2 item 1., k = O(n^α), the right-hand side is summable and we conclude that W_n goes to 0 almost surely.
Let us now show that (W_n)_{n≥1} goes to 0 a.s. in the case where Assumption 4.2 item 3. is satisfied.
Let us first show that (W_{n²})_{n≥1} goes to 0 a.s.
We have: Using the Bienaymé-Tchebychev inequality, we get: Finally, thanks to Lemma 5.3 and Assumption 4.2 items 1. and 3., there exist positive constants C, K and M such that: The trick of using a second sample D'_n, independent of the first one and of the random variable Θ, is really important to handle the J_n term. Indeed, we have J_n = 0, while the equivalent term encountered in the proof of Theorem 2 of Scornet et al. (2015b) is handled using a conjecture regarding the correlation behavior of the CART algorithm that is difficult to verify (cf. assumption (H2) of Scornet et al. (2015b)). Indeed: Finally: Hence, since Σ_{n≥1} I_{n²} < ∞, the Borel-Cantelli Lemma gives: Let us now show that (W_n)_{n≥1} converges almost surely to 0.

3. Finally, we show that (V_n)_{n≥1} goes to 0 a.s.
Hence, by using Assumption 4.1, we have: This allows us to conclude the proof of Proposition 5.1. Let us now turn to the proof of Lemma 5.2, which shows that the dummy estimator F_{k,n} is close to the estimator of interest, F^b_{k,n}.
Proof of Lemma 5.2.
Let us consider, for some ℓ ∈ {1, ..., k}: We have: We continue the proof below in the case where Assumption 4.2 item 3. is satisfied. The case where item 2. is verified is easier and follows the same lines. Let ε > 0.

We are now going to show the almost everywhere convergence to 0 of each term G_1 and G_2.
1. Let us start with G_1.
Again, thanks to the Bienaymé-Tchebychev inequality: Now, using Lemma 5.3 and Assumption 4.2, we get: Then, thanks to the Borel-Cantelli Lemma:
2. Now, consider the G_2 term: We are going to bound the first term by using again the Vapnik-Chervonenkis theory, by considering the class B. Let us make some comments on the Vapnik-Chervonenkis dimension of the class B. If the element c were free, it could be shown, by calculations similar to those of Theorem 13.8 in Devroye et al. (2013), that the Vapnik-Chervonenkis dimension is 2d + 1. But in our case, the element c is fixed as y, so that all the possibilities to break the points are related to the elements a_i, b_i, which gives a Vapnik-Chervonenkis dimension equal to 2d. Therefore, the first two right-hand terms are handled thanks to a direct application of the theorem of Vapnik and Chervonenkis (1971) over the class B. The last term deserves special attention.
The third term is treated as in the proof of Lemma 5.3: we use, as before, the representation of the bootstrap component with the random variables Z_1, ..., Z_n. We apply the Vapnik-Chervonenkis Theorem under the conditional distribution given D_n and get: Therefore, Hence, at last, As a consequence, Equation (5.3) and Assumption 4.2 lead to: Thanks to the Borel-Cantelli Lemma, we get: We conclude that G_ℓ goes to 0 for all ℓ, thus, for all x ∈ X and y ∈ R, |F_{k,n}(y | X = x) − F^b_{k,n}(y | X = x)| → 0 a.s., as n → ∞. In the case where Assumption 4.2 item 2. is verified, there exists K > 0 such that So that P(|G_1| > ε) and P(|G_2| > ε) are bounded above respectively by A simple application of Lemma 5.3, and an adaptation of it to N_J(A_n(Θ_ℓ)), show that G_1 and G_2 go to 0 a.s.
This concludes the proof of Theorem 4.1. We now sketch the proof of Theorem 4.2, which is a bit simpler.

Proof of Theorem 4.2
The different steps are similar to those of the proof of Theorem 4.1, and the dummy estimator F_{k,n}(y | X = x) introduced in the proof of Theorem 4.1 will be reused. Let x ∈ X and y ∈ R. We have: As in the proof of Theorem 4.1, the first right-hand term is handled thanks to Proposition 5.1, which gives: ∀x ∈ X, ∀y ∈ R, F_{k,n}(y | X = x) → F(y | X = x) a.s., as n → ∞. The convergence of the second right-hand term is handled in the following Lemma 5.4.
Lemma 5.4. Consider a random forest which satisfies Assumption 4.2. Then, for all x ∈ X and y ∈ R,

|F^o_{k,n}(y | X = x) − F_{k,n}(y | X = x)| → 0 a.s., as n → ∞.

This allows us to conclude the pointwise almost sure convergence of F^o_{k,n}(y | X = x) to F(y | X = x). As in the proof of Theorem 4.1, the almost sure uniform convergence relative to y of the estimator is then obtained using Dini's second theorem, which concludes the proof.

Proof of Lemma 5.4.
The proof follows the same steps as the proof of Lemma 5.2, with the same notations. Let x ∈ X and y ∈ R; we have: For any ℓ ∈ {1, ..., k}: We have: We prove that G_1 and G_2 go to 0 a.s. in the case where Assumption 4.2 item 3. is verified. The case where item 2. is satisfied is easier and follows the same lines as in the proof of Lemma 5.2. Let ε > 0.

1. Let us start by proving the a.s. convergence to 0 of G_1.
The first two right-hand terms will be bounded by using the Vapnik-Chervonenkis theory with the class B. Let us start with the first right-hand term: This term is handled as in the proof of Lemma 5.3, by rewriting the bootstrap component thanks to the variables selected with replacement from the set D_n = {(X_1, Y_1), ..., (X_n, Y_n)} instead of the random vector (B_j(Θ^1, D_n))_{j=1,...,n}.

2. Now, consider the G_2 term: The middle term is treated as in Equation (5.4); we get: As a consequence, Equation (5.3) and Assumption 4.2 give: We conclude that G_ℓ goes to 0 a.s. for all ℓ, which ends the proof of Lemma 5.4.


Numerical example
In order to illustrate the theoretical results, we provide a numerical example.
The accuracy of the conditional distribution function estimators will be evaluated first, then that of the conditional quantile estimators.

Conditional distribution function
Let us start by assessing the performance of the C_CDF estimators using the Kolmogorov-Smirnov distance recalled below:

sup_{y ∈ R} | F̂(y | x) − F(y | x) |,

with F̂(y | x) being either F^b_{k,n}(y | x) or F^o_{k,n}(y | x) here. In our example, F(y | x) has a closed-form expression involving Φ, the distribution function of the standard normal distribution N(0, 1).
For two randomly chosen points x, the C_CDF estimates are built with a sample of size n = 10^4, a forest grown with n_trees = 500 trees and the minimum number of samples required in a leaf node set to min_samples_leaf = ⌊√n · log(n)^1.5 / 250⌋ for each tree. These experiments are replicated s = 500 times. Figure 1 below shows the C_CDF approximations computed with the estimator using the original dataset on the left-hand side and those of the estimator calculated with the bootstrap samples on the right-hand side. On each graph, the orange plain line is the true C_CDF, while the blue ones represent the 95% quantiles of the replications. Figure 1 therefore allows to display and compare the approximated curves visually.
From a quantitative perspective, the quality of the estimators is measured at each point x using the following average Kolmogorov-Smirnov distance over the s replications:

KS(x) = (1/s) Σ_{j=1}^s sup_{y ∈ R} | F̂_j(y | x) − F(y | x) |,

where F̂_j denotes the estimator built on the j-th replication. According to the numerical results displayed in Figure 1, the estimators perform well for points that are well represented in the training sample, but the performance decreases for extreme points. In order to reflect the overall performance of the estimators, we define in the sequel an averaged version of the previous metric and compute it with p = 5 × 10^4 randomly chosen points x:

MKS = (1/p) Σ_{i=1}^p KS(x_i).

We get MKS = 0.1344 for the estimator F^b_{k,n}(y | x) and MKS = 0.1295 for F^o_{k,n}(y | x). Thus, it seems that both estimators have a good accuracy for estimating the C_CDF at most points x.
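As an illustration of how such a distance can be evaluated in practice, here is a hedged sketch computing the Kolmogorov-Smirnov distance between a weighted conditional ECDF (given by weights w_j and responses Y_j, as in the estimators above) and a true conditional distribution function; F_true_at stands for the true C_CDF of the example, which is not reproduced here.

import numpy as np

def ks_weighted_ecdf(w, Y, F_true_at):
    # Kolmogorov-Smirnov distance between the step function sum_j w_j 1_{Y_j <= y}
    # and a continuous CDF F_true_at; the supremum is attained at a jump of the step
    # function, so both sides of each jump are checked.
    order = np.argsort(Y)
    cdf_hat = np.cumsum(w[order])                        # ECDF value just after each jump
    cdf_true = F_true_at(Y[order])                       # true CDF at the jump points
    cdf_before = np.concatenate(([0.0], cdf_hat[:-1]))   # ECDF value just before each jump
    return max(np.max(np.abs(cdf_hat - cdf_true)),
               np.max(np.abs(cdf_before - cdf_true)))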
Let us now assess the performance of the conditional quantile estimators.

Conditional quantiles
The analytic value of the α-quantile conditionally on x = (x_1, x_2, x_3) is easy to calculate and involves z_α, the α-quantile of the standard normal distribution N(0, 1). Figure 2 shows, for two specific points x and for several levels α ranging from 0.1 to 0.9, the distribution of the estimators of the conditional quantiles computed with the original dataset on the left-hand side and with the bootstrap samples on the right-hand side. The estimates have been calculated with the following setting: a sample of size n = 10^4, a forest grown with n_trees = 500 trees and the minimum number of samples required in a leaf node set to min_samples_leaf = ⌊√n · log(n)^1.5 / 250⌋. In order to assess the quality of the estimators at these points, the following indicators are computed by repeating the experiment s = 500 times.

The RMSE, the bias and the variance of the estimates are computed over the s replications, where q̂_j denotes the estimator computed on the j-th sample, j = 1, ..., s. Based on the graphs obtained in Figure 2, it seems difficult to say whether one estimator is better than the other. It appears that the performance of the two estimators differs a bit depending on the observation x.
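The following hedged sketch gives the natural way of computing such pointwise indicators from the s replicated estimates; the exact formulas of the paper are not reproduced here, so this is only an illustration.

import numpy as np

def quantile_indicators(q_hat, q_true):
    # Pointwise RMSE, bias and variance of s replicated quantile estimates q_hat
    # against the analytic value q_true at a given point x and level alpha.
    q_hat = np.asarray(q_hat, dtype=float)
    rmse = np.sqrt(np.mean((q_hat - q_true) ** 2))  # root mean squared error
    bias = np.mean(q_hat) - q_true                  # average deviation from the true quantile
    variance = np.var(q_hat)                        # dispersion of the replicated estimates
    return rmse, bias, variance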
In order to get global measures for both estimators, we define averaged versions of the previous indicators, computed with p = 5 × 10^4 randomly chosen points x (averaging the pointwise RMSE, bias and variance over these points). Using the same setting as previously for the estimators, the numerical results for these three measures are listed in Table 1 for several levels α. First of all, both estimators have an equivalent RMSE, whereas the original dataset based estimator is the one with the smallest variance, which may seem surprising. Indeed, the random forest method is an ensemble learning method that begins with bagging (the bootstrap aggregation of regression tree predictions) in order to reduce the variance of the prediction function; for the particular case of the conditional quantile approximation, the opposite phenomenon is observed in our example. In contrast, the bootstrap samples based estimator has the smallest bias for all α. The bias-variance tradeoff of both methods could be handled by tuning the hyperparameters of the random forest, such as min_samples_leaf, which would also allow to improve their accuracy. Finally, it has to be noted that the performance of the two estimators depends on the level α.

Conclusion
This article proposes two approximations of conditional distribution functions and conditional quantiles based on random forests. The former is a natural generalization of the random forest estimator of the regression function, making use of the bootstrap samples, while the latter is based on a variant using only the original dataset.
The consistency of the bootstrap samples based estimator is shown under realistic assumptions and constitutes the major contribution of this paper. Indeed, this is the first consistency result handling the bootstrap component in a random forest method, whereas it is usually replaced by subsampling. As for the second estimator, the consistency proof established in Meinshausen (2006) for a simplified random forest model is extended to a realistic one by taking into account all the randomness used in the procedure. The two estimators have close performances on our toy example. A specific interest of the bootstrap estimator is that the Out-Of-Bag samples could be used for cross-validation and/or back-testing procedures.
The estimators developed in this paper rest on trees grown with the CART-split criterion, but the assumptions providing the consistency results are detached from the split procedure used. Thus, the theoretical tools developed here could be useful for a large class of methods, by just changing the splitting scheme. An ambitious additional work would be to develop a theoretical analysis for obtaining convergence rates, and also to construct confidence intervals.