Consistent Regression using Data-Dependent Coverings

In this paper, we introduce a novel method to generate interpretable regression function estimators. The idea is based on so-called data-dependent coverings. The aim is to extract from the data a covering of the feature space instead of a partition. The estimator predicts the empirical conditional expectation over the cells of the partitions generated from the coverings. Thus, such an estimator has the same form as those produced by data-dependent partitioning algorithms. We give sufficient conditions to ensure consistency, avoiding the sufficient condition of shrinkage of the cells that appears in the earlier literature. In doing so, we reduce the number of covering elements. We show that such coverings are interpretable and that each element of the covering is tagged as significant or insignificant. The proof of consistency is based on a control of the error of the empirical estimation of conditional expectations, which is of independent interest.


Introduction
We consider the following regression setting: $(X, Y)$ is a couple of random variables in $\mathbb{R}^d \times \mathbb{R}$ of unknown distribution $Q$ such that
$Y = g^*(X) + Z$, (1)
where $\mathbb{E}[Z] = 0$, $\mathbb{V}(Z) = \sigma^2$ and $g^*$ is a measurable function from $\mathbb{R}^d$ to $\mathbb{R}$. We make the following common assumptions:
• $Z$ is independent of $X$ and $\sigma^2 \ge 0$ is known; (H1)
• $Y$ is bounded: $Q(S) = 1$ with $S = \mathbb{R}^d \times [-L, L]$, for some $L > 0$ (unknown). (H2)
Given a sample $D_n = ((X_1, Y_1), \dots, (X_n, Y_n))$, we aim at predicting $Y$ conditionally on $X$. The observations $(X_i, Y_i)$ are independent and identically distributed (i.i.d.) from the distribution $Q$. The accuracy of a regression function $g : \mathbb{R}^d \to \mathbb{R}$ is measured by its quadratic risk, defined as
$L(g) = \mathbb{E}_Q[(g(X) - Y)^2]$.
Thanks to Hypothesis (H1), we have $g^*(X) = \mathbb{E}[Y \mid X]$ almost surely, so that
$g^* \in \arg\min_g L(g)$,
where the arg min is taken over the class of all measurable regression functions. The regression functions generated from the data $D_n$ by a learning algorithm are called estimators of $g^*$. We consider a set of regression functions $G_n$ that contains all such estimators. Let $Q_n$ be the empirical distribution of the sample $D_n$. We define the empirical risk, the empirical risk minimizer and the minimizer of the risk over $G_n$ as, respectively,
$L_n(g) = \frac{1}{n} \sum_{i=1}^{n} (g(X_i) - Y_i)^2$, $g_n = \arg\min_{g \in G_n} L_n(g)$ and $\tilde{g}_n = \arg\min_{g \in G_n} L(g)$. (2)
The aim of this paper is to provide interpretable learning algorithms (see Section 1.2 for a discussion on the notion of interpretability) that generate $G_n$ so that the associated empirical risk minimizer $g_n$ is consistent, i.e. $g_n$ converges to $g^*$ as $n \to \infty$. More precisely, we show the weak consistency of the estimator $g_n$, i.e. its excess risk satisfies $L(g_n) - L(g^*) = \mathbb{E}[(g_n(X) - g^*(X))^2] = o_P(1)$.

Rule-based algorithms using partitions and coverings
In this paper we consider algorithms generating interpretable models that are rule-based, such as CART Breiman et al. [1984], ID3 Quinlan [1986], C4.5 Quinlan [1993], FORS Karalič and Bratko [1997], M5 Rules Holmes et al. [1999]. In these models, the regression function is explained by the realization of a simple condition, an If-Then statement of the form: is the i th coordinate of X and c i ⊆ R. The If part, called the condition of the rule, or simply the rule, is composed of the conjunction of k ≤ d tests, each of which checking whether a feature (a coordinate of X) satisfies a specified property or not and k is called the length of the rule. The Then part, called the conclusion of the rule, is the estimated value when the rule is activated, i.e. when the condition in the If part is satisfied. The rules are easy to understand and allow an interpretable decision process when k is small. For a review of the best-known algorithms for descriptive and predictive rule learning, see Zhao and Bhowmick [2003] and Fürnkranz and Kliegr [2015].
Formally, the models generated by such algorithms are defined by a corresponding data-dependent partition P n of R d . Each element of the partition is named a cell and the empirical risk minimizer associated to P n satisfies Those algorithms use the dataset D n twice; first, the partition P n = P n (D n ) is chosen according to the dataset, second, this partition and the data are used to compute g n (x) as in (4). Note that g n is the empirical risk minimizer among the class of all piecewise constant functions over P n denoted G c • P n . The major issue for these algorithms is the model interpretability, which requires a small value for the length k of the rule, whereas the consistency of the estimator is usually proved for conditions implying that k = d, i.e. a high model complexity (see Section 1.2). In order to reduce the complexity of the model, we present a novel method of generating a partition. The idea is to generate a data-dependent covering C n = C n (D n ) of R d rather than a partition. To do so, the dataset D n is used to identify subsets of R d that fulfil coverage and significance conditions (see Definition 2.1). As elements of coverings can overlap, the construction of the subsets fulfilling these conditions can be done separately, which is not doable for the cells of partitions. Using a covering instead of a partition we ensure consistency without a condition on shrinkage of the cells. Moreover, each subset of the covering defines a rule with a small length k. Thus, we obtain a regression function described by a covering formed by simple rules rather than a partition formed by complex rules:

IF
(X ∈ r 1 ) And (X ∈ r 2 ) And . . . And (X ∈ r l ) THEN g n (X) = p where, for j = 1, . . . , l To estimate the value p, a partition P(C n ) is generated from the covering C n as an intermediate calculation. Formally, we define the partition generated from any collection of subsets C using the power set 2 C gathering all subsets of C: Definition 1.1. Let C be a finite collection of subsets of R d and let c = r∈C r.
We define the activation function as Then P(C), the partition of c generated from C, is defined as We illustrate this transformation P on an example of four elements in Figures  1 and 2.  Remark 1. If C is a covering of R d , then P(C) is a partition of R d . The relation C = P(C) holds if and only if C is a partition of Im(ϕ C ).
For each element r of C, the cells of the partition generated by C that are included in r are gathered in P(r) := {A ∈ P(C) : A ⊆ r} .
We also introduce the maximal (resp. minimal) redundancy of C on a subset r ∈ C: We shorten M (C, c) in M (C) and m(C, c) in m(C).
Remark 2. If C is a partition then for any r ∈ C, we have P(r) = {r} and M (C, r) = m(C, r) = 1.
By using this transformation on a data-dependent covering C n , we get the partition P(C n ) and the associated estimator (4). The major difference compared to an estimator defined on a data-dependent partition is its interpretability (see Section 1.2). Moreover, using a partition generated from a data-dependent covering in place of a data-dependent partition yields a more complex partition whose cells are not necessarily conjunctions of tests as in (3). We illustrate this in Figure 3.
As the construction of a partition from a covering is time-consuming, it is important to note that the partition P(C n ) does not need to be constructed. The trick is to identify the unique cell of P(C n ) that contains the point x ∈ R d at which a prediction is required. By creating binary vectors of size #C n , whose j-th value is 1 if x fulfils the condition of the j-th rule and 0 otherwise, this cell identification becomes a simple sequence of vector operations. Figure 3 is an illustration of this process (cf. Margot et al. [2018] for more details).
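The cell identification described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of Margot et al. [2018]: the representation of a rule as a hypothetical `(lower, upper)` pair of hyperrectangle bounds, the function names `activation` and `predict`, and the toy data are all assumptions made for the example. Two points lie in the same cell of P(C n ) exactly when they activate the same rules, so the prediction at x is the mean response of the training points sharing its activation vector.

```python
import numpy as np

def activation(X, rules):
    """Binary activation matrix: entry (i, j) is 1 if X[i] lies in rule j.

    Each rule is a hypothetical (lower, upper) pair of bounds describing a
    hyperrectangle; this representation is an assumption of the sketch.
    """
    X = np.atleast_2d(X)
    cols = [np.all((X >= lo) & (X <= hi), axis=1) for lo, hi in rules]
    return np.column_stack(cols).astype(int)

def predict(x, X_train, y_train, rules):
    """Mean of y over the training points that share the activation vector of x."""
    a_train = activation(X_train, rules)
    a_x = activation(x, rules)[0]
    same_cell = np.all(a_train == a_x, axis=1)
    if not same_cell.any():
        return None  # empty cell: this sketch makes no prediction there
    return float(y_train[same_cell].mean())

# Toy covering of [0, 1]^2 by three overlapping hyperrectangles (illustrative only).
rules = [([0.0, 0.0], [0.5, 1.0]),   # r1: left half
         ([0.0, 0.5], [1.0, 1.0]),   # r2: upper half
         ([0.5, 0.0], [1.0, 0.5])]   # r3: lower-right quadrant
X_train = np.array([[0.2, 0.8], [0.3, 0.6], [0.2, 0.2], [0.8, 0.8], [0.8, 0.2]])
y_train = np.array([1.0, 3.0, 10.0, 20.0, 30.0])
prediction = predict([0.1, 0.7], X_train, y_train, rules)
```

In this toy example the point (0.1, 0.7) activates r1 and r2 but not r3, so only the first two training points contribute to the prediction. The partition itself is never materialised; only the activation vectors are compared.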
All the estimators generated by the data-dependent covering algorithm belong to the class of piecewise constant functions on the partition P(C n ) such that ∀g ∈ G n , ∀x ∈ R d , |g(x)| ≤ L. Hence, from definitions (2) we have and the risk minimizer over G n is The functions g n and g̃ n are indeed both in G n , although the latter is not computable from the data only.
Remark 3. The definition (6) of g n guarantees that ∀x ∈ R d , |g n (x)| ≤ L so that L doesn't need to be known.
In the following subsection we discuss the important notion of interpretability.

Interpretability
In many fields, such as healthcare, marketing or asset management, decision makers prefer interpretable models to models that are more accurate but uninterpretable. As mentioned in Lipton [2016], there are several meanings of the term 'interpretable' and no rigorous mathematical foundation of the concept. In this paper, interpretability corresponds to a parsimonious characterisation of the estimators of g * generated by a given algorithm, i.e. the ease with which the generated model can be described in human words. Nowadays, the most popular and efficient algorithms for regression, such as Support Vector Machines, Neural Networks and Random Forests, are uninterpretable. The lack of interpretability comes from the complexity of the models they generate. We refer to them as black-box models. Usually, these black-box models have an optimal accuracy. We assert that the novel family of covering algorithms described here can achieve a better interpretability-accuracy trade-off by reducing the complexity of the generated models while keeping accuracy guarantees, i.e. weak consistency.
Figure 3: Evaluation steps of the cell containing x = (0.1, 0.7) of the partition generated from the covering of [0, 1] 2 , C = {r 1 , r 2 , r 3 }. Using a partition generated from a covering makes it possible to obtain complex cells with a simple interpretation (r 1 And r 2 ), which a classical partitioning algorithm cannot do. Note that the condition "x satisfies (r 1 And r 2 )" implicitly implies that x does not satisfy r 3 .
There exist two ways of constructing interpretable models. The first one is to create black-box models and then to summarise them. For example, recent research proposes to use explanation models, such as LIME Ribeiro et al. [2016], DeepLIFT Shrikumar et al. [2017] or SHAP Lundberg and Lee [2017], to interpret black-box models. These explanation models try to measure the importance of a feature (a coordinate of X) in the prediction process (see Guidotti et al. [2018] for a survey of existing methods). The second way to achieve interpretability is to use algorithms that only generate interpretable models, such as rule-based algorithms.
The interpretability of the rule-based algorithms of type (3) is achieved when the length k of each rule is small. But in order to prove the consistency of the estimator g n , one usually applies results such as Theorem 13.1 in Györfi et al. [2006] under the condition of shrinkage of the cells (Condition 13.10 in Györfi et al. [2006]). Each rule (3) must have a length k = d in order to fulfil this sufficient condition without extra condition on the feature space. Then, for large d, the condition becomes uninterpretable. Moreover, as illustrated in Figure 4, the number of cells necessary to have an accurate model is very large as the more precise the partition, the more complex the model.
For an estimator defined on a data-dependent covering, each prediction is explained by a small set of fulfilled rules which are easy to understand, see Table  1 in Section 4 for an example. Even if the partition generated may be finer and more complex than a classical data-dependent partition, the explanation of the prediction is given by the covering and not the partition, and it remains understandable by humans, as illustrated in Figure 3.
Despite the fact that the parsimony of the selected set of rules is not theoretically guaranteed, the redundancy conditions (10) and (11) described below are heading in the right direction.
We obtain a consistent estimator g n by carefully constructing the covering elements. None of the classical approaches applies here: those based on Stone's theorem Stone [1977] fail because the covering is data-dependent, and Theorem 13.1 in Györfi et al. [2006] cannot be used because Condition 13.10 in Györfi et al. [2006] forces rules to be complex (k = d).
The key notion of this paper is the notion of suitable data-dependent covering introduced in Section 2. Proposition 3.2 provides the main tool to prove the weak consistency of suitable data-dependent covering estimators stated in Theorem 2.1. This result of independent interest is given in Section 3. Finally we apply our approach on covering elements using Random Forest as rule generator in Section 4. Supplementary material gathers the proof of Proposition 3.2.

Main result
We denote P n the empirical distribution associated to the sample X 1 , . . . , X n . For any r ⊆ R d such that P n (r) > 0, we also denote In the same way,

Significance and coverage conditions
We introduce some conditions on each element of the covering. We use the classical notation x + = max{x, 0} for any x ∈ R.
Definition 2.1. We call a sequence (C n ) n≥1 of data-dependent coverings of R d suitable if it satisfies the two following conditions:
1. the coverage condition: (H3) ∃α ∈ [0, 1/2), ∀r ∈ C n , P n (r) > n −α a.s., for n sufficiently large;
2. the significance condition: there exist two sequences β n → 0 and ε n → 0 such that: for n sufficiently large, where the significant subsets C s n are defined by the insignificant subsets C i n are defined by and their redundancies satisfy and
The coverage condition (H3) guarantees that the empirical within-group expectation is a good estimate of the within-group expectation. To our knowledge, the definitions of significant and insignificant elements of a covering in (H4) are new. An element fulfils the significance condition (10) if its conditional expectation is sufficiently different from the unconditional expectation. It ensures, in some sense, that the within-group variances of coverings with significant elements are controlled by the between-group variances. The insignificant condition (11) guarantees that the conditional variance of the insignificant elements shrinks to the noise variance. Both conditions (H3) and (H4) can be checked for each element of the covering separately. Thus the construction of such subsets can be parallelized, which opens the way to algorithms that are less complex than the usual ones.
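Because the conditions are checked element by element, each candidate subset can be screened independently. The sketch below illustrates only the coverage condition P n (r) > n −α together with the empirical within-group mean and variance; it does not implement the significance thresholds. The hyperrectangle representation and the names `in_rule` and `rule_summary` are assumptions made for the illustration.

```python
import numpy as np

def in_rule(X, lower, upper):
    """Boolean mask of the sample points falling in the hyperrectangle [lower, upper]."""
    return np.all((X >= lower) & (X <= upper), axis=1)

def rule_summary(X, y, lower, upper, alpha):
    """Empirical coverage check and within-group statistics for one candidate rule."""
    n = len(y)
    mask = in_rule(X, lower, upper)
    coverage = mask.mean()                      # P_n(r)
    inside = y[mask]
    return {
        "n_points": int(mask.sum()),
        "coverage": float(coverage),
        "covered_enough": bool(coverage > n ** (-alpha)),   # coverage condition
        "mean": float(inside.mean()) if inside.size else float("nan"),
        "variance": float(inside.var()) if inside.size else float("nan"),
    }
```

Each call touches a single rule, so a collection of candidate rules can be processed in any order or in parallel, for instance with a process pool.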
Remark 4. An easy way to ensure (12) and (13) is to avoid inclusion between elements of the covering. Let (C n ) be a sequence of coverings that fulfills (H3). We consider 1 ≤ i ≤ #C n any ordering of the covering. If then the cardinality of C n is upper bounded by $n^{\alpha}/(1-\gamma)$ for every n sufficiently large. Indeed, by the inclusion-exclusion principle we get Thus (12) and (13) can be checked for any α ∈ [0, 1/4), using the fact that M (C s n ) and M (C i n ) are smaller than $n^{\alpha}/(1-\gamma)$ and setting $\beta_n = o_P(n^{1/4-\alpha/2})$ and $\varepsilon_n = o_P(n^{1/4-\alpha/2})$.
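The cardinality bound in Remark 4 can be recovered by a short counting argument. The sketch below assumes that the condition elided in the remark is the overlap test used in Algorithm 1, namely that each element shares at most a fraction γ of its empirical mass with the union of the preceding elements; this is an assumption, not a statement taken from the text, and the argument uses a disjoint decomposition rather than the inclusion-exclusion principle mentioned in the remark.

```latex
% Assumption: P_n(r_i \cap \bigcup_{j<i} r_j) \le \gamma P_n(r_i) for every i.
\begin{align*}
P_n\Big(r_i \setminus \bigcup_{j<i} r_j\Big)
  &\ge (1-\gamma)\, P_n(r_i) > (1-\gamma)\, n^{-\alpha}
  && \text{by the assumption and (H3)},\\
1 \ge P_n\Big(\bigcup_{i} r_i\Big)
  &= \sum_{i=1}^{\# C_n} P_n\Big(r_i \setminus \bigcup_{j<i} r_j\Big)
  > \# C_n \,(1-\gamma)\, n^{-\alpha},\\
\# C_n &< \frac{n^{\alpha}}{1-\gamma}.
\end{align*}
```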
Example 1. The significant condition (10) can hold for a subset r with arbitrary diameter that does not satisfy Condition 13.10 of Györfi et al. [2006]. For instance, consider the case g * = 1 x∈A for some Borel set A such that 0 < P(X ∈ A) < 1. Then r = A is a significant subset as it satisfies the condition (10) with high probability for any β n such that n −1/4 = o(β n ) and n sufficiently large. Indeed, from the Strong Law of Large Numbers k n := #{X i ∈ A} ∼ nP(X ∈ A) a.s. as n → ∞. On the one hand, we obtain thanks to several applications of the Central Limit Theorem On the other hand, we obtain Thus (10) holds for r = A with high probability for n sufficiently large. Note that for similar reasons (10) also holds with high probability for r = A c , n −1/4 = o(β n ) and n sufficiently large. Finally, conditions (8), (12) and (13) are easily checked on the partitions C n = P n = {A, A c } that constitute a suitable coverings sequence with high probability for n large enough.
Remark 5. The significant condition (10) does not follow from a condition on the diameter of the subset. On the contrary, the insignificant condition (11) can follow from a condition on the diameter of the subset, see Proposition 3.3.

Partitioning number
To control the complexity of families of partitions, some tools introduced in [Nobel, 1996, Sec. 1.2] are recalled (see also [Györfi et al., 2006, Def 13.1]).
Definition 2.2. Let Π be a family of partitions of R d .
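1. The maximal number of cells in a partition of Π is denoted by $M(\Pi) := \sup\{\#P : P \in \Pi\}$.
2. For a set $x_1^n = \{x_1, \dots, x_n\} \subseteq \mathbb{R}^d$, let $\Delta(x_1^n, \Pi)$ be the number of distinct partitions of $x_1^n$ induced by elements of Π.
3. The partitioning number $\Delta_n(\Pi)$ of Π is defined by $\Delta_n(\Pi) := \max\{\Delta(x_1^n, \Pi) : x_1, \dots, x_n \in \mathbb{R}^d\}$. The partitioning number is the maximal number of different partitions of any n points set that can be induced by elements of Π.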
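To make the counting concrete, the following sketch counts the distinct partitions that a finite family induces on one fixed point set. The family used here (two-cell partitions of the real line split at a threshold) and the function names are hypothetical choices for the illustration; the partitioning number itself is the maximum of this count over all point sets of size n, which the sketch does not compute.

```python
def induced_partition(points, cell_of):
    """Partition of the point indices induced by a cell-labelling function."""
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(cell_of(p), []).append(i)
    # A partition is a set of blocks, so order must not matter.
    return frozenset(frozenset(g) for g in groups.values())

def count_induced_partitions(points, family):
    """Number of distinct partitions of `points` induced by the members of `family`."""
    return len({induced_partition(points, cell_of) for cell_of in family})

# Hypothetical family: partitions {(-inf, t], (t, +inf)} of the real line.
thresholds = [-1.0, 0.5, 1.5, 3.0]
family = [lambda p, t=t: p <= t for t in thresholds]
points = [0.0, 1.0, 2.0]
n_distinct = count_induced_partitions(points, family)
```

Here the thresholds -1 and 3 both leave the three points in a single block, so four partitions of the line induce only three distinct partitions of the sample.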

Consistency of data-dependent covering algorithms
In the following, we use the classical notion of Donsker class that is discussed in detail in Section 3.
Theorem 2.1. Assume that Q satisfies (H1) and (H2). Let (C n ) be a suitable data-dependent covering sequence (i.e. it satisfies (H3) and (H4)) fulfilling the two following conditions: where Π n := {P(C n (d n )) : d n ∈ S n } for any n ∈ N * ; (H5) where B is a Q-Donsker class. (H6) Then the predictor g n defined by (6) is weakly consistent:
The proof of this theorem is postponed to Section 3. This theorem gives conditions on data-dependent covering algorithms which ensure that the generated empirical risk minimizer g n converges in probability to the regression function g * defined in (1). The condition (H5) is a classical one (e.g. [Györfi et al., 2006, Conditions (13.7) and (13.8)]) used to ensure that the family of partitions Π n is not too "complex". It means that the maximal number of cells in a partition, and the logarithm of the partitioning number, are small compared to the sample size. This condition guarantees that the estimation error tends to 0. The conditions (H3), (H4) and (H6) guarantee that the approximation error tends to 0 without any condition on the diameter of the cells.

Proof of Theorem 2.1
In order to prove the main theorem, we need some preliminary results based on notions of Q-Donsker class and outer probability.
The outer probability, defined for A ⊆ Ω by P Remark 7. It can be checked that if (Z n ) n∈N is a sequence of non-negative random variables, (a n ) n∈N ∈ (R + ) N such that a n = o P (1) and (M n ) n∈N is a sequence of maps such that M n = O P * (1) and Z n ≤ a n M n for any n, then Z n P −→ n→+∞ 0.
The usual notion of boundedness in probability for sequences of random variables needs to be generalized because sequences of maps are to be considered, with values in metric spaces which are not Euclidean spaces (thus bounded and closed sets need not be compact) and which are not guaranteed to be measurable. We therefore need to involve the outer probability P * .

Empirical estimation of conditional expectations
We shall also use the following proposition, which is inspired by Proposition 3.2 of Grunewalder [2018].
Proof of Proposition 3.2. Let ε > 0. First, for any f ∈ F and A ∈ B n , since Q n (A) > 0 and then Q(A) > 0, Now, according to Proposition 3.1, Thus, According to Remark 8, there exists M > 0 such that for any n large enough, Then (15) which, together with Remark 8 again, proves the proposition.
Corollary 3.1. Let B ⊆ B S be a Q-Donsker class. If Y is bounded then for any i ∈ N and any α ∈ [0, 1/2), with B n := {A ∈ B, Q n (A) ≥ n −α } we have and Proof of Corollary 3.1 (16). Let L = ess sup Y , i ∈ N, and f i ∈ L 1 (Q) be defined by f i is bounded and {f i } is finite thus Donsker. The result is then a straightforward application of Proposition 3.2.

Proof of Corollary 3.1 (17). This part follows from (17) since Y is bounded and
It seems that the result of Corollary 3.1, which is of independent interest, does not appear as such in the existing literature. As a first application of Corollary 3.1, we show that any partition with shrinking cell diameters is a suitable covering. We define the diameter of a cell r as $\mathrm{Diam}(r) = \sup_{x, x' \in r} \|x - x'\|$, where $\|\cdot\|$ is any norm on $\mathbb{R}^d$.
then the sequence (P n ) is suitable.
Proof. Let us show that each cell is significant or insignificant. Thanks to Condition (8), Corollary 3.1 Eq. (17) and Remark 7, Moreover V(Y | X ∈ r) = V(g * (X) | X ∈ r) + σ 2 . Thus, as the redundancy condition (13) is automatically satisfied for cells of a partition, the desired result will follow if we check that For all n, if r ∈ P n , then r×R ∈ B S . We denote X r and X r two independent variables distributed as X given that X ∈ r. We obtain Thus, if we denote w the modulus of continuity of g * , we get V(g * (X) | X ∈ r) ≤ 2 −1/2 w(Diam(r)).
Thus, from (11), each cell which is not significant is insignificant and the corresponding covering sequence is suitable.
Remark 9. The condition of uniform continuity of g * in Proposition 3.3 can simply be dropped. Indeed, from [Györfi et al., 2006, Corollary A.1], g * can be approximated arbitrarily closely in L 2 (Q X ) by functions of C ∞ 0 (R d ) where Q X is the marginal distribution of X.
The estimation error (20) controls the distance between the best function in G n and g n . The approximation error (21) is the smallest error for a function of G n . The two terms have opposite behaviors. Indeed, if G n is not too complex the empirical risk will be close to the risk uniformly over G n . Thus, the error due to the minimization of the empirical risk instead of the risk will be small. On the other hand, the risk cannot be better than for the best function of G n . So, G n must be complex enough. It is the classical Bias/Variance or Approximation/Estimation trade-off.
The function g̃ n is in G n , thus to prove (21), it suffices to show that W n = o P (1) where W n := E (g n (X) − g * (X)) 2 From (7), which shows that W n is a within-group variance for the variable g * (X) and the groups P(C n ).
First we use the decomposition of the total variance into the sum of the within-group and the between-group variances: where Let's consider B n and replace the summation over the partition P(C n ) by a summation over the covering C n . We have, from the definition of M (C n , r), where we last applied Jensen's inequality. Now, we focus on the set C s n of significant elements of the covering. Since C n = C s n ∪ C i n , we have where the empirical counterpart of U n , ∆ n,r := V 2 n,r − U 2 r and ∆ n := sup r∈Cn {∆ n,r } .
In order to control B n with its empirical counterpart, we shall make use of the outer probability P * defined in Section 3. Using hypotheses (H2) and (H6) and Corollary 3.1 (with B n = {c × [−L, L], c ∈ C n }) we have : Continuing (24), By definition of C s n , ∀r ∈ C s n , Using again Corollary 3.1 leads to Thus, By independence between Z and X, we have Hence we have Thus, by definition of m(C s n ), We remark that and W i n := Continuing (28), From (22) and (29), we conclude: (25) and (26) and (H4). Regarding the insignificant part of the within group variance and assuming that C i n is not empty, we have Using (27) we have Then, (H4) Hence, (21) is proved.
Recall from (5) that G n is the set of piecewise constant functions with values in [−L, L] on the elements of the partition P(C n (D n )). Then, with the definition of Π n in (H5) in mind, The following is based on the same idea as [Györfi et al., 2006, Theorem 13.1].
According to [Györfi et al., 2006, Theorem 9.1 and Problem 10.4] we have, where X n 1 = {X 1 , . . . , X n }. Here N 1 (ε, G c • Π n , X n 1 ) is the random variable corresponding to the minimal number N ∈ N such that there exist functions g 1 , . . . , g N : R d → [−L, L] with the property that for every g ∈ G c • Π n there is a j ∈ {1, ..., N } such that This number is called the ε-covering number of G c •Π n . It can be interpreted as the complexity of the class. Then using [Györfi et al., 2006, Lemma 13.1] we have , According to [Györfi et al., 2006, Lemma 9.2] for any set of function G and any sample z m 1 we have for all 1 ≤ j < k ≤ N . It is called L 1 ε-packing of G on z m 1 . See [Györfi et al., 2006, Definition 9.4 (c)]. Now, from the definition of G c , sup z1,...,zm∈{X1,...,Xn},m≤n Finally, sup z1,...,zm∈{X1,...,Xn},m≤n . (31) According to (30) and (31) we have: this concludes the proof of (20) and of Theorem 2.1.

Application
In this section we propose a simple algorithm to generate a suitable sequence of data-dependent coverings using the Random Forests algorithm (RF) Breiman [2001] as rule generator. The interest is twofold: first, it shows that suitable sequences of data-dependent coverings exist in practice; second, it could prove the consistency of an estimator generated from RF. So far there are few results about the consistency of estimators generated by RF; we may cite Denil et al. [2013] and Scornet et al. [2015].
Let C be the set of all hyperrectangles of R d : The following result ensures that any covering C n such that C n ⊆ C satisfies (H6).
The proof is given for completeness in the Appendix.

Algorithm
The proposed algorithm is an easy way to generate an estimator using data-dependent coverings. It can be decomposed into four steps.
1. The generation of an RF with m tree fully grown trees (without pruning).
3. The selection of a minimal set of rules using Algorithm 1. The redundancy is controlled recursively as described in Remark 4. A rule is added to the current set of rules if and only if it has at least a rate γ ∈ (0, 1) of points not covered by the current set of rules.
4. If the selected set of rules does not form a covering, the generation of a unique no-rule, which is one of the smallest hyperrectangles satisfying (H3) and containing the remaining points.
Remark 10. The sequence (ε n ) of the no-rule condition (11) is not controlled by this algorithm. Indeed, the no-rule is added to ensure a covering without any control on its variance.

Simulation
We generate n = 2000 artificial observations following the regression setting where Z ∼ N (0, 1), X 1 , X 2 ∼ U(−1, 1), η 1 = 3.5 and η 2 = −2.5. We set α = 1/4 − 1/100. The data are randomly split into a training set and a test set, with a ratio of 80% / 20%, respectively. We use RF with m tree = 50, generating 100808 rules among which 73 are significant according to (8) and (10) and 18 are insignificant according to (8) and (11). Then, the selection process, with γ = 0.95, extracts a set of 5 significant rules which cover 57% of the training data and adds 5 insignificant rules to generate a covering (see Table 1). It is not necessary to add a no-rule.
Algorithm 1: Selection of minimal set of rules Input: • the rate 0 < γ < 1; • a set of significant rules S; • a set of insignificant rules I; Output: • a minimal set of rules C n ; 1 C n ← arg max r∈S P n (r); 2 S ← S \ C n ; 3 while r∈Cn P n (r) < 1 do 4 r * ← arg max r∈S P n (r); 5 if P n (r * ∩ {∪ r∈Cn r}) ≤ γ P n (r * ) then 10 end 11 if r∈Cn P n (r) < 1 then 12 while r∈Cn P n (r) < 1 do 13 r * ← arg min r∈I V n (Y |X ∈ r); 14 if P n (r * ∩ {∪ r∈Cn r}) ≤ γ P n (r * ) then 15 C n ← C n ∪ r * ;
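The selection loop of Algorithm 1 can be sketched as a greedy pass over the candidate rules. This is an illustrative reading, not the authors' implementation: steps 6 to 9 of the listing are not available here, so the sketch assumes that a tested rule is discarded once examined, that the stopping test means "the selected rules cover every sample point", and that rules are represented by boolean masks over the sample. The names `select_rules`, `sig`, `insig` and `insig_var` are hypothetical.

```python
import numpy as np

def select_rules(sig, insig, insig_var, gamma):
    """Greedy selection of rules given as boolean masks over the n sample points.

    sig, insig : lists of masks for significant / insignificant rules.
    insig_var  : empirical variances V_n(Y | X in r) of the insignificant rules.
    gamma      : maximal fraction of a rule's points already covered (assumed reading).
    """
    sig = [np.asarray(m, dtype=bool) for m in sig]
    insig = [np.asarray(m, dtype=bool) for m in insig]
    covered = np.zeros(sig[0].size, dtype=bool)
    chosen = []

    # Significant rules are visited by decreasing empirical coverage P_n(r).
    order_sig = sorted(range(len(sig)), key=lambda j: -sig[j].sum())
    # Insignificant rules are visited by increasing empirical variance.
    order_insig = sorted(range(len(insig)), key=lambda j: insig_var[j])

    for kind, masks, order in (("sig", sig, order_sig),
                               ("insig", insig, order_insig)):
        for j in order:
            if covered.all():
                break  # the selected rules already form a covering of the sample
            overlap = np.sum(masks[j] & covered)
            # Overlap test of the listing: P_n(r* ∩ ∪ r) <= gamma * P_n(r*).
            if masks[j].any() and overlap <= gamma * masks[j].sum():
                chosen.append((kind, j))
                covered |= masks[j]
    return chosen, covered
```

If `covered` is not all true on return, the sample is not covered and a no-rule would have to be added, as in step 4 of the algorithm description.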

Results
Regarding the accuracy on the test set, RF has an MSE score of 1.32 and the data-dependent covering estimator has an MSE score of 1.57 (see Figure 5). This loss of accuracy is the price of turning a black-box model into an interpretable one. Figure 5 is composed of four graphics: the dataset (upper left), the model generated by RF (upper right), the model generated by the selected set of rules (lower left) and the 10 selected rules (lower right). It is interesting to note that the cells of the partition generated by the covering formed by the 10 selected rules are represented in the lower left graphic.

Comments
The presented algorithm has no control over the sequence (ε n ) of the no-rule. So, it cannot guarantee that the generated sequence of data-dependent coverings C n is suitable. However, this application emphasizes that data-dependent coverings are very efficient at generating an interpretable estimator without a significant loss of accuracy. With 0.01% of the rules from RF, it constructs an estimator whose MSE is 1.19 times that of RF.
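Both figures quoted above follow directly from the numbers reported in the simulation (test MSE of 1.57 and 1.32, and 10 selected rules out of 100808); the subscripts below are labels introduced for readability only:

```latex
\[
\frac{\mathrm{MSE}_{\text{covering}}}{\mathrm{MSE}_{\text{RF}}}
  = \frac{1.57}{1.32} \approx 1.19,
\qquad
\frac{10}{100808} \approx 9.9 \times 10^{-5} \approx 0.01\%.
\]
```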

Conclusion and perspectives
In this paper we provide a general setting for studying the consistency of interpretable rule-based estimators. The novelty is to introduce the notion of covering composed of two kinds of sets, the significant and the insignificant ones. The significant sets are thought of as interpretable sets by construction. The insignificant ones are thought of as small sets whose variances tend to zero. We provide an algorithm that extracts a data-dependent covering from any rule generator, and we apply it to Random Forest. This very effective approach calls for an algorithm that generates significant and insignificant rules and a suitable sequence of data-dependent coverings on its own. By generating insignificant rules with shrinking diameters as in Proposition 3.3, a control on the sequence (ε n ) of the insignificant condition (11) seems possible. It is a subject of research for future work. The theoretical setting could also be refined: unbounded Y may be considered by introducing a truncation operator as in Györfi et al. [2006]; strong consistency and rates of convergence of the data-dependent covering estimators may be established under slightly stronger assumptions. Finally, the scope could be broadened from the regression setting to the classification one by adapting the significant condition.
According to Theorem 5.1, this guarantees that I C is a Q-Donsker class.