On Variance Estimation of Random Forests with Infinite-Order U-statistics

Infinite-order U-statistics (IOUS) have been used extensively to quantify the uncertainty of subbagging ensemble learning algorithms such as random forests. While normality results for IOUS have been studied extensively, variance estimation approaches and their theoretical properties remain mostly unexplored. Existing approaches mainly utilize the leading-term dominance property of the Hoeffding decomposition. However, this view usually leads to biased estimation when the kernel size is large or the sample size is small. On the other hand, while several unbiased estimators exist in the literature, their relationships and theoretical properties, especially ratio consistency, have never been studied. These limitations leave the performance of the resulting confidence intervals unguaranteed. To bridge these gaps in the literature, we propose a new view of the Hoeffding decomposition for variance estimation that leads to an unbiased estimator. Instead of leading-term dominance, our view utilizes the dominance of the peak region. Moreover, we establish the connection and equivalence of our estimator with several existing unbiased variance estimators. Theoretically, we are the first to establish the ratio consistency of such a variance estimator, which justifies the coverage rate of confidence intervals constructed from random forests. Numerically, we further propose a local smoothing procedure to improve the estimator's finite-sample performance. Extensive simulation studies show that our estimators enjoy lower bias and achieve target coverage rates.


Introduction
Given a set of $n$ i.i.d. observations $\mathcal{D}_n = \{X_i\}_{i=1}^n$ and an unbiased estimator $h(X_1, \dots, X_k)$ of the parameter of interest $\theta$ with $k \le n$, the U-statistic [13] defined as
$$U_n = \binom{n}{k}^{-1} \sum_{i} h(S_i)$$
is a minimum-variance unbiased estimator of $\theta$, where each $S_i$ is a subset of $k$ samples from the original $\mathcal{D}_n$, and $k$ is called the kernel size. When $k$ grows with $n$, $U_n$ becomes an infinite-order U-statistic (IOUS) [10]. Without risk of ambiguity, we drop $k$ in the notation. In recent years, there has been increasing interest in statistical inference with IOUS, with applications to subbagging ensemble approaches such as random forests [3,12]. It is easy to see that a large $\binom{n}{k}$ renders it computationally infeasible to exhaust all subsamples. Instead, random forests sample $B$ subsamples from $\mathcal{D}_n$ to build trees and average them. This leads to incomplete U-statistics [15]. Further incorporating randomness in the kernel function $h$, Mentch and Hooker [16] first show the asymptotic normality of random forests under the U-statistics framework when $k$ grows at the rate $o(\sqrt{n})$. DiCiccio and Romano [6] further relax these assumptions. Zhou, Mentch and Hooker [30] establish the connection between U- and V-statistics. Peng, Coleman and Mentch [17] extend the kernel size to $k = o(n)$ under a generalized U-statistic framework. We also note, but mainly omit, a large literature outside the applications of random forests. For example, for incomplete high-dimensional U-statistics, where $h \in \mathbb{R}^d$, Chen and Kato [4] and Song, Chen and Kato [22] study the asymptotic normality for fixed and growing $k$, respectively.
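As a quick concrete illustration (our own sketch, not from the paper), take $k = 2$ and the kernel $h(x_1, x_2) = (x_1 - x_2)^2/2$, whose complete U-statistic is exactly the usual unbiased sample variance:

```python
from itertools import combinations
import numpy as np

def complete_u_statistic(data, h, k):
    """Average the kernel h over all C(n, k) size-k subsamples."""
    subs = list(combinations(data, k))
    return sum(h(*s) for s in subs) / len(subs)

# Kernel h(x1, x2) = (x1 - x2)^2 / 2 unbiasedly estimates the variance,
# so the complete U-statistic equals the ddof-1 sample variance.
rng = np.random.default_rng(0)
x = rng.normal(size=12)
u = complete_u_statistic(x, lambda a, b: (a - b) ** 2 / 2, k=2)
assert np.isclose(u, np.var(x, ddof=1))
```

For kernels of larger size $k$, the same function applies but the $\binom{n}{k}$ enumeration quickly becomes infeasible, which is exactly the motivation for the incomplete (subsampled) version discussed next.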
With the normality of random forest estimators established under the U-statistics framework [16] or other frameworks [23,1], another line of work on this topic is variance estimation. Wager, Hastie and Efron [24] propose to use the jackknife and the infinitesimal jackknife (IJ, Efron [7]). Mentch and Hooker [16] use Monte Carlo methods to estimate the leading term in the Hoeffding decomposition of $\mathrm{Var}(U_n)$. Recent developments include Zhou, Mentch and Hooker [30], who propose a computationally efficient approach and establish its connection with the IJ estimator. Peng, Mentch and Stefanski [18] further study the bias and consistency of the IJ estimator.
However, an essential practical issue is that these estimators can display a significant amount of bias when the sample size $n$ is small or $k$ is large compared to $n$. In practice, it is common to use a fixed proportion of the total sample size [12] as the kernel size $k$. Variance estimators in the aforementioned literature often suffer from this bias issue because they all rely on some form of leading-term dominance. However, when $k$ is large compared to $n$, such dominance is weak. Several unbiased estimators have been proposed in the literature, in different forms and from different perspectives, based on the U-statistics view. Some of them can handle a subsampling size $k$ as large as $n/2$. Folsom [9] proposes a variance estimator of complete U-statistics following a sequence of literature on sampling design [14,29,20]. Schucany and Bankson [19] propose to estimate all terms in the Hoeffding decomposition [13] of the variance of an order-2 complete U-statistic. However, they do not extend the estimator to the general case with $k \le n/2$. Note that Folsom [9] and Schucany and Bankson [19] do not consider the incomplete case; hence their estimators are computationally infeasible for large $k$ or large $n$. More recently, Wang and Lindsay [26] propose partition-based, unbiased variance estimators of both complete and incomplete U-statistics, motivated by the second-moment expression $E(U_n^2) - E^2(U_n)$. Wang and Wei [28] further apply this estimator to random forest variance estimation. However, there is a lack of theoretical justification for these estimators in terms of ratio consistency, which is crucial for achieving a proper coverage rate of the derived confidence interval. Moreover, there is a lack of understanding of their connections and differences with the estimators mentioned previously.
To address these limitations in the literature, the major contribution of our paper is three-fold. First, we reanalyze the Hoeffding decomposition and propose a peak-region dominance view of the variance estimation of U-statistics to address the bias issue. This leads to a class of unbiased estimation approaches for both complete and incomplete U-statistics, called the Matched Sample Variance Estimator, which can handle a subsampling size $k$ as large as $n/2$. Computationally, our incomplete variance estimator is efficient and can be directly applied to random forests. Besides, we discuss two extensions of our estimators: one is a local smoothing strategy to mitigate negative variance estimates [19,26], and the other extends our method to $k > n/2$. Secondly, we are the first to establish the connection and equivalence of the three existing estimators [9,19,26]. We show that our proposed estimator coincides with each under specific settings (see Section 3.5 for a detailed discussion). Thirdly, we establish the ratio consistency of our complete variance estimator under $k = o(\sqrt{n})$. To the best of our knowledge, this is the first such result, even for fixed $k$. This is a crucial step toward achieving the nominal coverage level when the variance estimator is plugged into a confidence interval. To this end, we fill a significant gap in the literature by proposing a set of interpretable conditions. We proceed with additional notation and preliminaries of U-statistics to motivate the proposed variance estimator and establish the peak-region dominance view.

Variance of U-statistics
Our analysis starts with a classical result on the variance of U-statistics. We first review the Hoeffding decomposition of the variance of a complete U-statistic. Then, we present the connection between the complete and incomplete versions. In particular, the variance of an order-$k$ complete U-statistic is given by Hoeffding [13]:
$$\mathrm{Var}(U_n) = \binom{n}{k}^{-1} \sum_{d=1}^{k} \binom{k}{d}\binom{n-k}{k-d}\, \xi^2_{d,k}, \tag{2}$$
where $\xi^2_{d,k}$ is the covariance between two kernels $h(S_1)$ and $h(S_2)$ with $S_1$ and $S_2$ sharing $d$ overlapping observations, i.e., $\xi^2_{d,k} = \mathrm{Cov}(h(S_1), h(S_2))$ with $|S_1 \cap S_2| = d$. Here both $S_1$ and $S_2$ are size-$k$ subsamples. Alternatively, we can represent $\xi^2_{d,k}$ as [15]
$$\xi^2_{d,k} = \mathrm{Var}\left[E\big(h(S) \mid X_1, \dots, X_d\big)\right]. \tag{3}$$
This form will be utilized later.
When $k$ grows with $n$, it is computationally infeasible to exhaust all subsamples due to the large $\binom{n}{k}$. Instead, it is typical in random forests and other ensemble algorithms to build incomplete infinite-order U-statistics [15] by sampling $B$ of the $S_i$'s, which gives
$$U_{n,B} = \frac{1}{B} \sum_{b=1}^{B} h(S_b).$$
The gap between the variances of an incomplete U-statistic and its complete counterpart can be understood as
$$\mathrm{Var}(U_{n,B}) = \mathrm{Var}\left[E(U_{n,B} \mid \mathbf{X}_n)\right] + E\left[\mathrm{Var}(U_{n,B} \mid \mathbf{X}_n)\right] = \mathrm{Var}(U_n) + E\left[\mathrm{Var}(U_{n,B} \mid \mathbf{X}_n)\right], \tag{5}$$
where the additional term $E[\mathrm{Var}(U_{n,B} \mid \mathbf{X}_n)]$ depends on the subsampling scheme. In particular, when all subsamples are drawn with replacement from the collection of all such subsamples [15], we have
$$\mathrm{Var}(U_{n,B}) = \Big(1 - \frac{1}{B}\Big)\mathrm{Var}(U_n) + \frac{1}{B}\,\xi^2_{k,k}. \tag{6}$$
This suggests that we can close the gap by using a large $B$. Hence, we will first discuss the complete U-statistics setting and then propose the incomplete one. We also note that for applications to random forests, random kernels (trees) are involved. However, the difference is negligible when a large $B$ is used [16].
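The with-replacement variance identity above can be checked by brute force on a toy case (our own sketch under assumed choices: a linear mean kernel, Bernoulli(1/2) data, and tiny $n$, $k$, $B$, so that every dataset and every subsample draw can be enumerated exactly):

```python
from itertools import combinations, product
import numpy as np

n, k, B = 4, 2, 2
subsets = list(combinations(range(n), k))          # all C(4,2) = 6 subsamples
N = len(subsets)
h = lambda x, S: np.mean([x[i] for i in S])        # linear "mean" kernel

# Enumerate every Bernoulli(1/2) dataset and every with-replacement draw of
# B subsamples; all (2^n) * (N^B) outcomes are equally likely.
vals_UnB, vals_Un, vals_h = [], [], []
for bits in product([0.0, 1.0], repeat=n):
    x = np.array(bits)
    hvals = [h(x, S) for S in subsets]
    vals_Un.append(np.mean(hvals))                 # complete U-statistic
    vals_h.extend(hvals)
    for draw in product(range(N), repeat=B):       # incomplete U-statistic
        vals_UnB.append(np.mean([hvals[i] for i in draw]))

var_UnB = np.var(vals_UnB)                         # exact Var(U_{n,B})
var_Un = np.var(vals_Un)                           # exact Var(U_n)
xi_kk = np.var(vals_h)                             # xi^2_{k,k} = Var(h(S))
assert np.isclose(var_UnB, (1 - 1/B) * var_Un + xi_kk / B)
```

Because the enumeration is exhaustive, the check is exact up to floating-point rounding; it also makes visible how the $\xi^2_{k,k}/B$ term shrinks as $B$ grows.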

Methodology
The main technical challenge in estimating the variance arises when $k$ is relatively large compared with $n$. Besides the obvious computational issue in the complete version mentioned above, most existing methods also incur a significant bias because they estimate only the leading term of the Hoeffding decomposition. By establishing a peak-region dominance view, we develop a new unbiased estimator of $\mathrm{Var}(U_n)$ in both complete and incomplete forms whenever $k \le n/2$. Its connection with existing methods is discussed in Section 3.5. Its extension to the $n/2 < k < n$ setting is presented in Section 5.1. We demonstrate the application to random forests in Section 5, where we also introduce a locally smoothed version for better numerical performance.

Existing Methods and Limitations
Continuing from the decomposition of $\mathrm{Var}(U_n)$ in Equation (2), we define $\gamma_{d,k,n} = \binom{n}{k}^{-1}\binom{k}{d}\binom{n-k}{k-d}$ for convenience. Then $\mathrm{Var}(U_n) = \sum_{d=1}^{k} \gamma_{d,k,n}\, \xi^2_{d,k}$. It is easy to see that $\gamma_{d,k,n}$ corresponds to the probability mass function of a hypergeometric distribution with parameters $n$, $k$ and $d$. A graphical demonstration of these coefficients under different $k$ and $d$ settings, with $n = 100$, is provided in Figure 1. Many existing methods [16,6] rely on the asymptotic approximation of $\mathrm{Var}(U_n)$ when $k$ is small, e.g., $k = o(n^{1/2})$. Under such settings, the first coefficient $\gamma_{1,k,n} = [1 + o(1)]\, k^2/n$ dominates all remaining ones, as we can see in Figure 1 when $k = 10$. In this case, to estimate $\mathrm{Var}(U_n)$, it suffices to estimate the leading covariance term $\xi^2_{1,k}$ if $\xi^2_{k,k}/(k\xi^2_{1,k})$ is bounded. However, as $k$ becomes larger, say $k = \beta n$, the density of the hypergeometric distribution concentrates around $d = \beta^2 n$ instead of $d = 1$. Hence, the variance is mainly determined by terms in a range of large $d$ values, which we refer to as the peak region. In comparison, estimating just $\xi^2_{1,k}$ introduces a significant bias even if we are able to exhaust all possible subsamples.
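The coefficients $\gamma_{d,k,n}$ can be computed directly; the following sketch (with $n = 100$, as in Figure 1) locates the peak of the hypergeometric weights, illustrating the shift from $d = 1$ for small $k$ to the peak region $d \approx k^2/n$ when $k$ is a fixed proportion of $n$:

```python
from math import comb

def gamma(d, k, n):
    """Hypergeometric weight in the Hoeffding decomposition of Var(U_n)."""
    return comb(k, d) * comb(n - k, k - d) / comb(n, k)

n = 100
for k in (10, 50):
    g = [gamma(d, k, n) for d in range(k + 1)]
    peak = max(range(k + 1), key=lambda d: g[d])
    print(k, peak)  # small k: mass near d = 1; k = n/2: mass near d = k^2/n

# Sanity check: over d = 0, ..., k the weights form a valid pmf.
assert abs(sum(gamma(d, 50, 100) for d in range(51)) - 1) < 1e-12
```

For $k = 10$ the mode sits at $d = 1$, while for $k = 50$ it sits at $d = k^2/n = 25$, which is the peak region the text refers to.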
Another source of bias when using the leading-term dominance property is the lack of samples to realistically estimate $\xi^2_{1,k}$. Note that the definition involves approximating the $\mathrm{Var}$ and $E$ operations in Equation (3) [16,30]. A natural strategy is to hold one shared sample, e.g., $X_{(1)}$, fixed and vary the remaining samples in $S$ among the existing observations $\mathcal{D}_n$ to approximate $E(h(S) \mid X_1)$. However, this causes trouble for the variance estimator, since we will not have enough samples to independently produce estimators of $E(h(S) \mid X_i)$ with varying $X_i$ once $k$ becomes slightly larger. Overall, a new strategy is needed to better utilize the Hoeffding decomposition.
We also note that another theoretical strategy, proposed by Wager and Athey [23] and Peng, Coleman and Mentch [17], can be used for $k = o(n)$ if the U-statistic can be understood through the Hájek projection with additional regularity conditions. In this case, the variance of a U-statistic can be well approximated by the variance of a linearized version, while the infinitesimal jackknife procedure of Efron and Stein [8] provides a valid estimator. However, it is difficult to assess whether the kernel function satisfies these assumptions. In practice, a significant bias can still occur, as seen in the simulation section.

An Alternative View
At this point, estimating the $\xi^2_{d,k}$'s for some $d$ values seems inevitable. However, we may utilize the law of total variance to change the estimation procedure, which yields a significant computational advantage. Note that for any given $d$,
$$\xi^2_{d,k} = \mathrm{Var}\left[E\big(h(S) \mid X_1, \dots, X_d\big)\right] = \mathrm{Var}\big(h(S)\big) - E\left[\mathrm{Var}\big(h(S) \mid X_1, \dots, X_d\big)\right],$$
where we define $\bar{\xi}^2_{d,k} := E[\mathrm{Var}(h(S) \mid X_1, \dots, X_d)]$. In this representation, $V^{(h)} := \mathrm{Var}(h(S))$ is equivalent to $\xi^2_{k,k}$, the variance of a single kernel. It is also equivalent to $\bar{\xi}^2_{0,k}$ since $\xi^2_{0,k} = 0$. Incorporating these into the decomposition formula in Equation (2), we obtain an interesting connection:
$$\mathrm{Var}(U_n) = V^{(h)} - V^{(s)},$$
where we define $V^{(s)} := \sum_{d=0}^{k} \gamma_{d,k,n}\, \bar{\xi}^2_{d,k}$. Note that in the second equation, we add and subtract a term with $d = 0$ to complete the hypergeometric distribution coefficients.
While this alternative view is valid for all $k$, the difficulty lies in finding a computationally feasible estimator, especially when we have to deal with incomplete U-statistics instead of the complete version. In particular, when $k \le n/2$, both terms can be unbiasedly estimated with a proper sampling design. In the following, we first present a straightforward formula for estimating $V^{(h)}$ and $V^{(s)}$ in the complete U-statistics version. The main result is Proposition 3.1, which shows that $V^{(s)}$ can be estimated using a sample variance of all trees. Section 3.4 extends these estimators to incomplete versions.

Variance Estimation for Complete U-statistics
Our goal is to create estimators of $V^{(s)}$ and $V^{(h)}$ that can be computed directly from the trees (kernels) fitted in the random forest itself. This seems challenging given that we are estimating the infinite sum $V^{(s)}$. However, the fundamental idea we utilize is to estimate $\mathrm{Var}(h(S) \mid X_1, \dots, X_d)$ using pairs of trees. We proceed with the complete case, when all trees are already available.
3.3.1 Joint Estimation of the Infinite Sum $V^{(s)}$. Suppose we pair subsamples $S_i$ and $S_j$ among the $\binom{n}{k}$ subsamples and let $d = |S_i \cap S_j| = 0, 1, \dots, k$. Then for each $d$, there exist $N_{d,k,n} = \binom{n}{k}^2 \gamma_{d,k,n} = \binom{n}{k}\binom{k}{d}\binom{n-k}{k-d}$ pairs of subsamples $S_i, S_j$ such that $|S_i \cap S_j| = d$. Note that for any such pair, $(h(S_i) - h(S_j))^2/2$ is an unbiased estimator of $\bar{\xi}^2_{d,k} = E[\mathrm{Var}(h(S) \mid X_1, \dots, X_d)]$. We may then construct an unbiased estimator of $\bar{\xi}^2_{d,k}$ by averaging them:
$$\hat{\bar{\xi}}^2_{d,k} = \frac{1}{N_{d,k,n}} \sum_{|S_i \cap S_j| = d} \frac{\big(h(S_i) - h(S_j)\big)^2}{2}. \tag{9}$$
This motivates us to combine all such terms in the infinite sum, which surprisingly leads to the sample variance of all kernels. The result is given in the following proposition, with its proof collected in Appendix E.

PROPOSITION 3.1. Given a complete U-statistic $U_n$ and the estimator $\hat{\bar{\xi}}^2_{d,k}$ defined in Equation (9), when $k \le n/2$, we have the following unbiased estimator of $V^{(s)}$:
$$\hat{V}^{(s)} = \sum_{d=0}^{k} \gamma_{d,k,n}\, \hat{\bar{\xi}}^2_{d,k} = \binom{n}{k}^{-1} \sum_{i} \big(h(S_i) - U_n\big)^2. \tag{10}$$
Furthermore, when $k > n/2$, the first $2k - n$ terms in the summation $\sum_{d=0}^{k}$ are removed, since the corresponding $\gamma_{d,k,n}$ terms are zero.
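The sample-variance form in Proposition 3.1 ultimately rests on the elementary identity that averaging $(h_i - h_j)^2/2$ over all $N^2$ ordered pairs of kernel values equals their population (ddof-0) variance; a quick numerical check on synthetic stand-in kernel values:

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
hvals = rng.normal(size=20)     # stand-ins for the kernel values h(S_i)
N = len(hvals)

# Sum of (h_i - h_j)^2 over unordered pairs; dividing by N^2 is the same as
# averaging (h_i - h_j)^2 / 2 over all N^2 ordered pairs (i = j contributes 0),
# and equals the population variance of the h_i's.
pair_sum = sum((a - b) ** 2 for a, b in combinations(hvals, 2))
assert np.isclose(pair_sum / N**2, np.var(hvals))  # ddof = 0
```

This is why combining the pairwise estimators $\hat{\bar{\xi}}^2_{d,k}$ across all $d$ collapses into a single pass over the fitted kernel values, with no pair bookkeeping needed.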
Since $\hat{V}^{(s)}$ has a sample variance form, its incomplete version is also easy to calculate. The advantage is that it can be computed without any hassle, because all the $h(S_i)$'s are already available when we calculate $U_n$. However, additional consideration may facilitate the estimation of $V^{(h)}$, so that both $V^{(s)}$ and $V^{(h)}$ can be estimated using the same set of $h(S_i)$'s.

3.3.2 Estimation of the Kernel Variance $V^{(h)}$. Estimating $V^{(h)}$ may follow the same idea, using $[h(S_i) - h(S_j)]^2/2$ if the pair $S_i$ and $S_j$ are disjoint. However, this is only possible when $k \le n/2$, given a finite sample. In this case, following Equation (9), we have an unbiased estimator of $V^{(h)}$:
$$\hat{V}^{(h)} = \frac{1}{N_{0,k,n}} \sum_{|S_i \cap S_j| = 0} \frac{\big(h(S_i) - h(S_j)\big)^2}{2}. \tag{11}$$
Therefore, we combine the estimators $\hat{V}^{(s)}$ (10) and $\hat{V}^{(h)}$ (11) to get an unbiased estimator of $\mathrm{Var}(U_n)$:
$$\widehat{\mathrm{Var}}(U_n) = \hat{V}^{(h)} - \hat{V}^{(s)}.$$
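For intuition, the combined estimator can be verified to be exactly unbiased by exhaustive enumeration on a toy case (our own sketch under assumed choices: a max kernel, Bernoulli(1/2) data, $n = 6$, $k = 2$, and the ddof-0 sample-variance form for the $V^{(s)}$ part):

```python
from itertools import combinations, product
import numpy as np

n, k = 6, 2
subsets = list(combinations(range(n), k))
disjoint = [(i, j) for i, j in combinations(range(len(subsets)), 2)
            if not set(subsets[i]) & set(subsets[j])]

def var_estimate(x):
    """Combined estimator V_hat_h - V_hat_s for one dataset x (max kernel)."""
    hv = np.array([max(x[i] for i in S) for S in subsets])
    v_h = np.mean([(hv[i] - hv[j]) ** 2 / 2 for i, j in disjoint])  # kernel variance
    v_s = np.var(hv)                                                # ddof = 0
    return v_h - v_s, np.mean(hv)                                   # (estimate, U_n)

# Exhaustive check over all 2^6 equally likely Bernoulli(1/2) datasets:
# E[V_hat] equals Var(U_n) exactly (unbiasedness), up to float rounding.
ests, us = zip(*(var_estimate(np.array(b)) for b in product([0.0, 1.0], repeat=n)))
assert np.isclose(np.mean(ests), np.var(us))
```

The disjoint-pair average plays the role of (11) and the ddof-0 sample variance the role of (10); the enumeration averages out both the data randomness and the estimator, so the match is exact rather than Monte Carlo.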

Variance Estimation for Incomplete U-statistics
In random forests and other ensemble learning models, we often construct incomplete U-statistics by drawing random subsamples instead of exhausting all $\binom{n}{k}$ subsamples. This creates difficulties in calculating $\hat{V}^{(h)}$, since very few of these subsamples would be mutually exclusive ($d = 0$). Hence, a new subsampling strategy is needed to provide sufficiently many pairs of subsamples for estimating both $V^{(h)}$ and $V^{(s)}$.
The following "matched sample" sampling scheme is proposed to obtain enough disjoint subsamples to estimate $V^{(h)}$. For any $2 \le M \le \lfloor n/k \rfloor$, we can sample a matched sample group that consists of $M$ mutually exclusive subsamples $\{S_1, \dots, S_M\}$ from $\mathcal{D}_n$. This enables us to estimate $V^{(h)}$ by the sample variance of $\{h(S_1), \dots, h(S_M)\}$. Then, we repeat this procedure $B$ times and average the estimators. To be precise, denote the subsamples in the $b$-th matched sample group as $S^{(b)}_1, \dots, S^{(b)}_M$, and define
$$U_{n,B,M} = \frac{1}{MB} \sum_{b=1}^{B} \sum_{m=1}^{M} h\big(S^{(b)}_m\big).$$
This differs from the conventional incomplete U-statistic due to the new sampling scheme. Though $M = 2$ is enough for estimating $V^{(h)}$, we recommend using $M = \lfloor n/k \rfloor$ for a smaller variance. This is guaranteed by the following proposition.
For an incomplete U-statistic with $M \cdot B$ subsamples obtained using the matched sample sampling scheme, the variance satisfies $\mathrm{Var}(U_{n,B,M}) \le \mathrm{Var}(U_{n,B})$. The proof is collected in Appendix E.1. We note that when fixing the total number of kernels $M \cdot B$ and letting $M \ge 2$, the variance of $U_{n,B,M}$ is always smaller than the variance of $U_{n,B}$ given in (6). However, the two are identical when $M = 1$.
Based on this new sampling scheme, we can define the estimator $\hat{V}^{(h)}_{B,M}$ as the average, over $b = 1, \dots, B$, of the within-group sample variances of $\{h(S^{(b)}_1), \dots, h(S^{(b)}_M)\}$. Similarly, with some algebra, we can define $\hat{V}^{(s)}_{B,M}$. Note that $\hat{V}^{(h)}_{B,M}$ is still an unbiased estimator of $V^{(h)}$, while $\hat{V}^{(s)}_{B,M}$ introduces a small bias when estimating $V^{(s)}$, because these subsamples are not drawn completely at random: there is an over-representation of non-overlapping pairs. The following proposition quantifies this bias.
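The matched-sample scheme itself is easy to implement; a minimal sketch (the function name and the shuffle-and-cut construction are ours) that draws $B$ groups of $M = \lfloor n/k \rfloor$ mutually disjoint size-$k$ subsamples:

```python
import numpy as np

def matched_sample_groups(n, k, B, rng):
    """Draw B groups, each containing M = n // k mutually disjoint
    size-k subsamples (a sketch of the matched-sample scheme)."""
    M = n // k
    groups = []
    for _ in range(B):
        perm = rng.permutation(n)            # shuffle, then cut into M blocks
        groups.append([perm[m * k:(m + 1) * k] for m in range(M)])
    return groups

rng = np.random.default_rng(2)
groups = matched_sample_groups(n=20, k=6, B=5, rng=rng)
for g in groups:                             # subsamples within a group are disjoint
    flat = np.concatenate(g)
    assert len(flat) == len(set(flat.tolist()))
```

Each $h(S^{(b)}_m)$ then feeds both $U_{n,B,M}$ and the within-group sample variances used for $\hat{V}^{(h)}_{B,M}$, so no extra tree fits are required beyond those of the ensemble itself.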

Unifying Existing Unbiased Estimators
To conclude this section, we discuss the relationships and differences between our view of the variance decomposition and existing approaches. As noted in the introduction, various variance estimators have appeared in the literature to correct the bias when the leading term does not dominate.
where $P_c = \{(S_i, S_j) \text{ s.t. } |S_i \cap S_j| \le c\}$ and $N_c$ is the cardinality of $P_c$. Motivated by this formulation, they further propose an ANOVA form of the estimator and its corresponding incomplete version.
Although various unbiased estimators exist in the literature, they are motivated by entirely different perspectives. The unique motivation of our estimator is the peak-region dominance phenomenon and the corresponding conditional variance view, which enables unbiased estimation. Interestingly, the connections among the existing estimators have never been investigated. To complete our analysis, we further establish several connections. In Appendices E.3 and E.4, we show that all existing unbiased complete estimators are essentially the same estimator, presented in different formats and settings. In particular, we show that Folsom [9]'s formula is identical to our complete version and also equivalent to that of Wang and Lindsay [26]. We further restrict to the setting $k = 2$ for a direct comparison with Schucany and Bankson [19]. In Appendix E.4, we show the equivalence between our incomplete estimators and those of Wang and Lindsay [26].

Theoretical Results
To the best of our knowledge, the ratio consistency of variance estimators in the context of infinite-order U-statistics has not been investigated. In this section, we attempt to fill this gap in the literature by establishing results for our proposed estimator; that is, we want to show
$$\frac{\widehat{\mathrm{Var}}(U_n)}{\mathrm{Var}(U_n)} \xrightarrow{P} 1,$$
where $\xrightarrow{P}$ denotes convergence in probability. The notion of ratio consistency is important here, since the variance of a U-statistic naturally converges to 0 as $n$ grows; hence any variance estimator that converges to 0 is consistent. However, a consistent estimator does not guarantee correct normal coverage. In the following, we shall rewrite $\widehat{\mathrm{Var}}(U_n)$ as $\hat{V}_u$ and show a sufficient condition for the above:
$$\frac{\mathrm{Var}(\hat{V}_u)}{E^2(\hat{V}_u)} = \frac{\mathrm{Var}(\hat{V}_u)}{\mathrm{Var}^2(U_n)} \to 0, \quad \text{as } n \to \infty.$$
We note that such a result under general $k$ settings is likely impossible without strong assumptions or knowledge of the specific form of the kernel $h$. The main difficulty in the proof is caused by the fourth-order term of the form $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$, which naturally appears in the variance of $\hat{V}_u$. Untangling the dependencies of this fourth-order term under large $k$ is a difficult task. Hence, we focus on the $k = o(\sqrt{n})$ setting, in which the result is more attainable, although computationally the estimator can still be applied whenever $k \le n/2$. Even though the $k = o(\sqrt{n})$ setting is somewhat restrictive, our result is still the first in the literature in this context, and further investigations may build on the proposed strategy by extending it to higher orders.
Our main strategy can be summarized as follows. First, we observe that the proposed estimator $\hat{V}_u = \hat{V}^{(h)} - \hat{V}^{(s)}$ can be written as an order-$2k$ U-statistic:
$$\hat{V}_u = \binom{n}{2k}^{-1} \sum_{S^{(2k)} \subseteq \mathcal{D}_n} \psi\big(S^{(2k)}\big), \tag{19}$$
where $S^{(2k)}$ is a size-$2k$ subsample set and $\psi(S^{(2k)})$ is the corresponding size-$2k$ kernel, defined as
$$\psi\big(S^{(2k)}\big) = \sum_{k'=0}^{k} \psi_{k'}\big(S^{(2k)}\big). \tag{20}$$
Here $\psi_{k'}(S^{(2k)})$, for $k' = 0, 1, 2, \dots, k$, is defined through the normalizing constants $N_{n,k,k'} = \binom{n}{2k}\binom{n}{k}^{-1}\binom{n-k+k'}{k}^{-1}$ and $N_d = \binom{n-2k+d}{d}$, the number of different size-$2k$ sets whose two size-$k$ subsets $S_1$ and $S_2$ share $d$ overlaps. We remark that in this paper, $S$ refers to a size-$k$ set, and $S^{(2k)}$ refers to a size-$2k$ set. Similar to a regular U-statistic, the variance of the order-$2k$ U-statistic $\hat{V}_u$ can be decomposed as
$$\mathrm{Var}(\hat{V}_u) = \binom{n}{2k}^{-1} \sum_{c=1}^{2k} \binom{2k}{c}\binom{n-2k}{2k-c}\, \sigma^2_{c,2k}, \tag{21}$$
where $\sigma^2_{c,2k}$ is the covariance between $\psi(S^{(2k)}_1)$ and $\psi(S^{(2k)}_2)$ with $|S^{(2k)}_1 \cap S^{(2k)}_2| = c$, for $c = 1, 2, \dots, 2k$. If we followed the existing literature, it would be common to impose high-level assumptions on the kernel $\psi$ and to bound the ratio of the last term $\sigma^2_{2k,2k}$ over the first term $\sigma^2_{1,2k}$ [6]. However, not only are such assumptions difficult to verify and possibly violated (see the discussion in Appendix B.4), but $\psi$ is also treated as a "black box", which does not help in analyzing the convergence of $\mathrm{Var}(\hat{V}_u)/\mathrm{Var}^2(U_n)$. Therefore, we have to establish a connection between $\psi$ and the original kernel $h$.
Hence, the key strategy of our approach is to avoid direct assumptions on $\hat{V}_u$'s kernel $\psi$ and instead only impose assumptions on a fourth-order term of $h$: $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$. This leads to the main technical challenge of this work: since $\mathrm{Var}(\hat{V}_u)$ is the variance of a variance estimator of a U-statistic, it becomes inevitable to study a fourth-order term of $h$ instead of a second-order term. To be specific, $\xi^2_{d,k} = \mathrm{Cov}(h(S_1), h(S_2))$ in classical U-statistics only involves the overlap between $S_1$ and $S_2$, while $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$ involves the 4-way overlaps among $S_1, S_2, S_3, S_4$, although it shares a similar intuition with $\xi^2_{d,k}$. We first establish the Double U-statistic notion of $\hat{V}_u$ in Section 4.1. The Double U-statistic structure in Proposition 4.2 reveals a cancellation effect (see Appendix F) inside $\hat{V}_u$, which helps accelerate the convergence rate of $\mathrm{Var}(\hat{V}_u)$. Using this structure, we can further decompose each $\sigma^2_{c,2k}$ in the Hoeffding decomposition (21) into $\eta^2_{c,2k}(d_1, d_2)$ terms (see Proposition 4.3). Then, we bound all the $\eta^2_{c,2k}(d_1, d_2)$'s by decomposing each into basic covariance terms $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$. Hence, it suffices to impose primitive assumptions on $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$ to analyze the behavior of $\hat{V}_u$. We highlight the challenge that 11 parameters are needed to describe the 4-way overlap among $S_1, S_2, S_3, S_4$; details are left to the discussion in the assumption section (Section 4.2). We also remark that the difficulties and strategies related to the Double U-statistic structure are easier to understand through a simplified example, the linear average kernel, presented in Appendix I. Finally, in Section 4.3 we present the ratio consistency, and Section 4.4 summarizes a roadmap of the proof.

Double U-statistic Structure
We define the notion of a Double U-statistic to facilitate our discussion and show that $\hat{V}_u$ is a Double U-statistic. The advantage of this tool is that it breaks our variance estimator down into lower-order terms, which alleviates the difficulty involved in analyzing $\sigma^2_{c,2k}$. Essentially, a Double U-statistic is a "U-statistic of U-statistics". By (19), $\hat{V}_u = \binom{n}{2k}^{-1} \sum_{S^{(2k)} \subseteq \mathcal{D}_n} \psi(S^{(2k)})$, so $\hat{V}_u$ involves a size-$2k$ kernel $\psi$. However, by Equation (20), the kernel $\psi$ has a complicated form. The following proposition shows that we can further decompose $\psi$ into linear combinations of the $\phi_d$'s, which are still U-statistics.

PROPOSITION 4.2 ($\hat{V}_u$ is a Double U-statistic). The order-$2k$ U-statistic $\hat{V}_u$ defined in Equation (19) is a Double U-statistic. Its kernel $\psi(S^{(2k)})$ can be represented as a weighted average of U-statistics, such that
$$\psi\big(S^{(2k)}\big) = \sum_{d=0}^{k} w_d\, \phi_d\big(S^{(2k)}\big). \tag{23}$$
Here, for $d = 0, 1, \dots, k$, $\phi_d$ is the U-statistic with a size-$(2k-d)$ asymmetric kernel defined by averaging over all pairs $S_1, S_2 \subset S^{(2k)}$ s.t. $|S_1 \cap S_2| = d$; and
$$w_d := \binom{n}{2k}\binom{n}{k}^{-2}\binom{2k}{d}\binom{2k-d}{d}\binom{2k-2d}{k-d}\Big/\binom{n-2k+d}{d}, \quad \forall d \ge 1,$$
$$w_0 = \left(\binom{n}{k}^{-1} - \binom{n-k}{k}^{-1}\right)\binom{n}{k}^{-1}\binom{n}{2k}\binom{2k}{k}.$$
The $w_d$'s defined above satisfy $\sum_{d=0}^{k} w_d = 0$ and $w_d > 0$ for all $d > 0$.
The proof is collected in Appendix D.1. We observe that, given $k = o(\sqrt{n})$, $w_d$ decays in $d$ even faster than a geometric series. In our later analysis, we show that the first term, $w_1\big[\phi_1(S^{(2k)}) - \phi_0(S^{(2k)})\big]$, can be the dominating term in $\psi(S^{(2k)})$. Moreover, with the kernels $\phi_d(S^{(2k)})$, we introduce the following decomposition of $\sigma^2_{c,2k}$.
This proposition can be directly concluded by combining the alternative form of $\hat{V}_u$'s kernel $\psi$ in Equation (23) and the definition of $\eta^2_{c,2k}(d_1, d_2)$. With the help of the Double U-statistic structure, upper bounding $\sigma^2_{c,2k}$ boils down to analyzing $\eta^2_{c,2k}(d_1, d_2)$. A detailed analysis of this connection is provided in Section 4.4 and Appendix F. Note that we can further decompose $\eta^2_{c,2k}(d_1, d_2)$ (see Appendix G.5), so $\sigma^2_{c,2k}$ can be viewed as a weighted sum of terms of the form $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$.

Assumptions
Assumption 1 limits the kernel size $k$ to a lower order of $\sqrt{n}$, while Assumption 2 controls the growth rate of $\xi^2_{d,k}$ with $d$. Assumptions 3, 4, and 5 are related to $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$. As previously mentioned, $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$ can be viewed as an extension of $\xi^2_{d,k} = \mathrm{Cov}[h(S_1), h(S_2)]$, the classical covariance of two kernels. While $\xi^2_{d,k}$ depends on only one parameter, $d = |S_1 \cap S_2|$, 11 parameters are needed to fully determine $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$, since it involves a 4-way overlapping structure. This can be visualized in Figure 4 in the Appendix. We refer to the number of parameters as the "degrees of freedom" (DoF) of the covariance. Essentially, Assumptions 3, 4, and 5 reduce this DoF and control the growth of $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$ with the number of overlapping samples.
In Appendix B, we provide further discussion and examples of our assumptions. In Appendix H, we propose a relaxation of Assumption 3 and present the proof of the main results under the new assumptions.

ASSUMPTION 1. There exists a constant $\epsilon \in (0, 1/2)$ such that the growth rate of the kernel size $k$ with respect to the sample size $n$ is bounded as $k = O(n^{1/2 - \epsilon})$.

ASSUMPTION 2. For all $k \in \mathbb{N}^+$, $\xi^2_{1,k} > 0$ and $\xi^2_{k,k} < \infty$. Moreover, there exists a universal constant $a_1 \ge 1$, independent of $k$, satisfying
$$\sup_{d = 2, \dots, k} \frac{\xi^2_{d,k}}{d^{a_1}\, \xi^2_{1,k}} < \infty.$$

Note that a smaller $a_1$ in Assumption 2 implies a stronger assumption. Since it is well known that $k\,\xi^2_{d,k} \le d\,\xi^2_{k,k}$ [15], the smallest possible value of $a_1$ is 1, which is used in the existing literature [16,6,30,17]. Hence, if we force $a_1 = 1$ and only focus on the upper bound of $\mathrm{Var}(U_n)$, the growth rate of $k$ in Assumption 1 can be relaxed to $o(n)$. However, this trade-off between Assumptions 1 and 2 cannot be applied to ratio consistency directly. To motivate our other assumptions, we provide a brief discussion of the 4-way overlap in $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$. As mentioned before, the goal is to avoid direct assumptions on $\psi(S^{(2k)})$ and its covariances $\sigma^2_{c,2k}$, and to instead study the fourth-moment term $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$. To simplify the notation, we let
$$\rho := \mathrm{Cov}\left[h(S_1)h(S_2),\, h(S_3)h(S_4)\right].$$
The covariance $\rho$ involves 11 different overlap counts; hence, 11 parameters are needed to describe $\rho$. We refer to the number of these parameters as the "degrees of freedom" (DoF) of $\rho$. Furthermore, there are two types of parameters: $d_1 = |S_1 \cap S_2|$ and $d_2 = |S_3 \cap S_4|$ describe the overlaps within the pairs $(S_1, S_2)$ and $(S_3, S_4)$, while the remaining 9 parameters describe the overlaps across the two pairs. We collect these 9 overlap counts in a 9-dimensional vector $\mathbf{r}$, whose definition is given in Appendix B.1. Hence, the 11 DoF can be denoted by the tuple $(\mathbf{r}, d_1, d_2)$.
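To make the $(\mathbf{r}, d_1, d_2)$ bookkeeping concrete, the following sketch computes the 11 overlap counts for four index sets, encoding $\mathbf{r}$ as the $3 \times 3$ cross-tabulation between the partition blocks $(S_1 \setminus S_2,\, S_1 \cap S_2,\, S_2 \setminus S_1)$ and $(S_3 \setminus S_4,\, S_3 \cap S_4,\, S_4 \setminus S_3)$ (a hypothetical encoding consistent with the indexing $r_{ij}$, $i, j = 0, 1, 2$; the paper's exact convention is in Appendix B.1 and may differ):

```python
def overlap_dof(S1, S2, S3, S4):
    """Return (r, d1, d2): d1, d2 are the within-pair overlaps, and r is a
    3x3 table of cross-pair overlap counts (a hypothetical encoding of the
    11 degrees of freedom)."""
    A = [S1 - S2, S1 & S2, S2 - S1]          # partition of S1 | S2
    Bp = [S3 - S4, S3 & S4, S4 - S3]         # partition of S3 | S4
    r = [[len(a & b) for b in Bp] for a in A]
    return r, len(S1 & S2), len(S3 & S4)

S1, S2 = {1, 2, 3}, {3, 4, 5}
S3, S4 = {2, 3, 6}, {3, 6, 7}
r, d1, d2 = overlap_dof(S1, S2, S3, S4)
assert (d1, d2) == (1, 2)
# The 9 entries of r account for every element shared across the two pairs.
assert sum(map(sum, r)) == len((S1 | S2) & (S3 | S4))
```

Classical $\xi^2_{d,k}$ needs only the single count $d$; the jump from 1 to 11 parameters is exactly what makes the fourth-order analysis hard.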
However, it may not be necessary to know all of $\mathbf{r}, d_1, d_2$ to calculate the covariance $\rho$. For example, for the linear average kernel (Example 2 in Appendix B.2), $\rho$ depends only on $\mathbf{r}$. This can be expected for an estimator that is approximately linear. Hence, we propose the following assumption.

ASSUMPTION 3. $\rho$ depends only on the 9-DoF vector $\mathbf{r}$. Hence, without risk of ambiguity, we define a function $\rho(\mathbf{r})$ with
$$\rho = \rho(\mathbf{r}). \tag{29}$$

The assumption implies that the within-pair overlap counts have no impact on $\rho$. This simplifies a cancellation pattern when analyzing $\eta^2_{c,2k}(d_1, d_2)$ (27). A comprehensive discussion of this assumption can be found in Appendix B. We first demonstrate that the assumption is valid for the linear average kernel, as previously mentioned. Next, we provide an example illustrating the challenges of reducing the DoF below 9 by considering only two-way overlaps, indicating that further simplification of this assumption may require specific assumptions about the kernel function. In addition, in Appendix H, we suggest a relaxation of this assumption and provide an alternative proof of the main results under the relaxed assumption.

ASSUMPTION 4 (Ordinal Covariance). For all size-$k$ subsets $S_1, S_2, S_3, S_4$ and $S'_1, S'_2, S'_3, S'_4$, let $\rho$ and $\rho'$ denote the corresponding covariances as defined in Equation (29), with DoF vectors $\mathbf{r}$ and $\mathbf{r}'$ (defined in Appendix B.1), respectively. Then $\rho \ge \rho'$ whenever $r_{ij} \ge r'_{ij}$, $\forall i, j = 0, 1, 2$. Moreover, given size-$k$ sets $S_1, S_2$ and a vector $\mathbf{r}$ such that $|S_1 \cap S_2| = c$ and $|\mathbf{r}| = c$, we have $\rho(\mathbf{r}) \le F^{(k)}_c := \mathrm{Cov}[h(S_1)^2, h(S_2)^2]$.

Assumption 4 implies that more overlap leads to a larger $\rho$, which is a reasonable property to expect. For every $c = |S^{(2k)}_1 \cap S^{(2k)}_2|$, it also provides an upper bound on $\rho$, where $F^{(k)}_c$ refers to $\rho$ with the "maximum possible overlap" given $c$, namely $S_1 = S_2$ and $S_3 = S_4$. The overlap structure associated with $F^{(k)}_c$ is visualized in Figure 5 in Appendix B. It is easy to see that $\rho \ge 0$, an analogue of $\xi^2_{d,k} \ge 0$ in the regular U-statistics setting [15].
ASSUMPTION 5. For size-$k$ sets $S_1$ and $S_2$ with $|S_1 \cap S_2| = 1$,
$$\frac{\mathrm{Cov}\left[h(S_1)^2, h(S_2)^2\right]}{\left[\mathrm{Cov}\left(h(S_1), h(S_2)\right)\right]^2} = O(1). \tag{32}$$
In addition, there exists a universal constant $a_2 \ge 1$, independent of $k$, satisfying
$$\sup_{c = 2, 3, \dots, 2k} \frac{F^{(k)}_c}{c^{a_2}\, F^{(k)}_1} < \infty. \tag{33}$$

Equation (32) states that the fourth-moment term cannot exceed (the square of) the second-moment term $\xi^2_{1,k}$. This can be verified for the linear average kernel under basic moment conditions. Similarly to the polynomial growth rate of $\xi^2_{d,k}$ specified in Assumption 2, Equation (33) controls a polynomial growth rate of $\rho$ with respect to $c$, since $F^{(k)}_c$ is an upper bound of $\rho$. It is worth noting that Assumption 5 can be implied by Assumption 2 for certain specific kernels (see Example 4 in Appendix B).

Main Results
We now present our main results. As a direct consequence of the following theorem, the ratio consistency property is provided in Corollary 4.6.

THEOREM 4.4 (Asymptotic variances of $U_n$ and $\hat{V}_u$). Under Assumptions 1-5, we can bound $\mathrm{Var}(U_n)$ (2) and $\mathrm{Var}(\hat{V}_u)$ (21) as
$$\mathrm{Var}(U_n) = \big(1 + o(1)\big)\, \frac{k^2}{n}\, \xi^2_{1,k}, \tag{34}$$
$$\mathrm{Var}(\hat{V}_u) \asymp \frac{(2k)^2}{n}\, \bar{\sigma}^2_{1,2k}, \tag{35}$$
where $\bar{\sigma}^2_{1,2k}$ is the upper bound of $\sigma^2_{1,2k}$ given by Proposition F.2 in Appendix F. Here, "$f \asymp g$" means $f = O(g)$ and $g = O(f)$.
The proofs of these results are provided in Appendix C.2. The calculation of Var(U_n) in (34) and Var(V̂_u) in (35) requires controlling the growth of ξ²_{d,k} and σ²_{c,2k}. In particular, (34) can be derived from a general proposition (Proposition 4.5) provided below. However, the proof of (35) is more involved, as it relies on the double U-statistic structure of V̂_u. A proof roadmap is presented in Section 4.4, and technical lemmas to upper bound η²_{c,2k}(d_1, d_2) (27) and σ²_{c,2k} are provided in Appendix F.

PROPOSITION 4.5 (Leading covariance domination). For a complete U-statistic U_n with a size-k kernel and k = o(√n), assume that ξ²_{1,k} > 0 and that there exists a nonnegative constant C such that ξ²_{d,k}/ξ²_{1,k} ≤ C·d! for d = 2, ..., k. Then Var(U_n)/(k²ξ²_{1,k}/n) → 1.
The proof of this proposition can be found in Appendix C.3. The proposition relaxes the conditions of Theorem 3.1 in DiCiccio and Romano [6] and provides a foundation for our approach to bounding Var(V̂_u). Specifically, our condition allows the ratio ξ²_{d,k}/ξ²_{1,k} to grow at a factorial rate in d, whereas the conditions in [16, 30, 6] only allow linear growth. A comparison between our assumption on ξ²_{d,k}/ξ²_{1,k} and the existing literature is provided in Section 4.2.
COROLLARY 4.6 (Ratio consistency). Under the conditions of Theorem 4.4, V̂_u/Var(U_n) → 1 in probability.

This result is a corollary of Theorem 4.4 and establishes the ratio consistency of the variance estimator V̂_u. The proof can be found in Appendix C.1. To the best of our knowledge, this is the first proof of ratio consistency for an unbiased variance estimator of growing-order U-statistics.

Proof Roadmap
The roadmap for upper bounding Var(V̂_u) (35) in Theorem 4.4 is provided in Equation (36). The relevant technical lemmas are summarized in Appendix F.
The quantity v_c := C(n, 2k)^{-1} C(2k, c) C(n−2k, 2k−c) represents the coefficients in the Hoeffding decomposition of Var(V̂_u) (21); σ̃²_{c,2k} is the upper bound of σ²_{c,2k} given by Propositions F.2 and F.3; and f ≍ g means that f = O(g) and g = O(f). The inequalities in (36) should be interpreted as follows.
• The first asymptotic relation ≍ (marked with *) follows from Lemma F.4. The value of T_1 = ⌊1/ε⌋ + 1 depends only on the growth rate of k, not on n, since we assume k = o(n^{1/2−ε}) in Assumption 1.
• The second asymptotic relation ≍ (marked with †) results from comparing the finitely many σ̃²_{c,2k} terms for c = 1, 2, ..., T_1.
• The last asymptotic relation ≍ (marked with ‡) follows from Lemma F.2.
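As a quick sanity check on the coefficients v_c above: they form a hypergeometric probability mass function, namely the probability that two independently drawn size-2k subsets of {1, ..., n} overlap in exactly c elements, so they sum to 1 and concentrate on small c when 2k ≪ n. A minimal Python illustration (ours, not part of the paper's code):

```python
from math import comb

def v(n, k2, c):
    # v_c = C(n, 2k)^{-1} * C(2k, c) * C(n - 2k, 2k - c):
    # hypergeometric probability that two size-2k subsets share c elements.
    return comb(k2, c) * comb(n - k2, k2 - c) / comb(n, k2)

n, k = 200, 10          # illustrative sizes with 2k << n
k2 = 2 * k
vs = [v(n, k2, c) for c in range(k2 + 1)]
print(sum(vs))          # the v_c sum to 1
print(vs[0] > vs[k2])   # small overlaps dominate when 2k << n
```

This is why the "peak region" of small c carries the dominant weight in the decomposition.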

Application to Random Forests
Random forests can be viewed as an incomplete infinite-order U-statistic with a random kernel [16]. The purpose of this section is to present a comprehensive algorithm, as well as two extensions: one for the case k > n/2, and another that uses local smoothing to address the issue of negative estimated values.
Notation-wise, we present the algorithm in the context of regression, where we observe a vector of covariates x_i ∈ R^p and a response y_i ∈ R for observation i. Hence, define X_i = (x_i, y_i), and the kernel function h(S_i) can be viewed as the tree prediction at a given target point x* with subsample S_i. The implementation of the variance estimator is straightforward in this setting and is summarized in Algorithm 1. We make a few comments. First, the original random forest [3] uses bootstrap samples, i.e., sampling with replacement, to build each tree. However, sampling without replacement [12] is also prevalent and achieves similar performance. Secondly, most random forest models utilize a random kernel instead of a fixed one, mainly due to the random feature selection [3] and random splitting points [12] used when fitting each tree. Mentch and Hooker [16] show that a U-statistic with a random kernel converges in probability to its fixed-kernel counterpart by viewing the fixed-kernel version as the expectation of the random version. Under suitable conditions, given a large enough B, the theoretical analysis of the random U-statistic can be reasonably reduced to analyzing the non-random counterpart, allowing our method to be applied. It is possible that both our estimators of V^(h) and V^(s) are inflated by the randomness through their U-statistic representation. However, such inflations are likely canceled out by taking the difference, and our simulation results in Section 6 confirm this speculation by showing that the estimator is mostly unbiased.
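Algorithm 1 is stated in the paper; as a sanity check on the matched-group idea, the following Python sketch (our own simplified construction with a toy mean kernel, not the paper's code) uses the total-variance identity Var(U_n) = Var(h(S))/M − E[Var(group mean | data)], which holds for groups of M mutually disjoint size-k subsamples and is in the spirit of the V^(h) − V^(s) formulation:

```python
import numpy as np

def matched_group_variance(data, kernel, k, M, B, rng):
    """Simplified matched-group variance estimator for a subbagging
    U-statistic.  Each of the B groups holds M mutually disjoint
    size-k subsamples, so the kernel values within a group are
    independent draws of h(S)."""
    n = len(data)
    group_means = np.empty(B)
    within_vars = np.empty(B)
    for b in range(B):
        idx = rng.permutation(n)[: M * k]                  # M disjoint subsamples
        vals = np.array([kernel(data[idx[m * k:(m + 1) * k]]) for m in range(M)])
        group_means[b] = vals.mean()
        within_vars[b] = vals.var(ddof=1)                  # unbiased for Var(h(S))
    v_h = within_vars.mean()                               # tree variance estimate
    v_s = group_means.var(ddof=1)                          # sampling variance given data
    return v_h / M - v_s

# Toy check with the linear average kernel h(S) = mean(S): the complete
# U-statistic is the overall sample mean, so Var(U_n) = sigma^2 / n.
rng = np.random.default_rng(0)
data = rng.normal(size=100)                                # sigma^2 = 1, n = 100
print(matched_group_variance(data, np.mean, k=20, M=3, B=200, rng=rng))
```

With the mean kernel, a single estimate fluctuates around 1/n = 0.01, and averaging over repeated datasets recovers it closely, illustrating the unbiasedness claimed above.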

Extension to k ą n{2
[Algorithm 1 labels: within-group average; tree variance (15); tree sample variance (16).]

The previous estimator Var̂(U_{n,B,M}) (18) is restricted to k ≤ n/2 due to the sampling scheme. However, this does not prevent the application of formulation (8), Var(U_n) = V^(h) − V^(s). To the best of our knowledge, the existing literature does not provide further discussion for k > n/2 with general kernels, though some theoretical strategies, such as Wang and Lindsay [27], simplify the kernel into a low-order approximation. Alternatively, the infinitesimal jackknife [24] has been shown to be almost equivalent to the leading-term estimator in V-statistics [30]. Here, we discuss a generalization of our formulation for k > n/2. Re-applying Proposition 3.2 and formulation (8) with M = 1, we obtain the variance of an incomplete U-statistic sampled randomly with replacement. By Proposition 3.3, V̂^(s)_{B,M=1} is still an unbiased estimator of V^(s). However, V^(h) has to be estimated with a different approach, since any pair of size-k subsamples must share at least some samples. A simple strategy is bootstrapping: we generate another set of size-k samples, sampled with replacement, evaluate the kernel on each, and use their sample variance as an estimator of V^(h). We remark that this bootstrap procedure introduces an additional computational burden.
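The bootstrap step described above can be sketched as follows (our illustrative code with a toy mean kernel; in the paper the kernel is a tree prediction):

```python
import numpy as np

def bootstrap_tree_variance(data, kernel, k, B, rng):
    """When k > n/2, no two size-k subsamples can be disjoint, so V^(h)
    is estimated by evaluating the kernel on B size-k samples drawn
    with replacement and taking their sample variance.  Resampling with
    replacement differs from the without-replacement kernel, which is
    one source of the extra bias discussed in the text."""
    n = len(data)
    vals = [kernel(data[rng.integers(0, n, size=k)]) for _ in range(B)]
    return float(np.var(vals, ddof=1))

rng = np.random.default_rng(0)
data = rng.normal(size=100)
# With the mean kernel, the conditional variance of a size-k
# with-replacement mean is roughly (sample variance of data) / k.
print(bootstrap_tree_variance(data, np.mean, k=80, B=2000, rng=rng))
```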

Locally Smoothed Variance Estimator for Random Forest
Even though the proposed estimator is unbiased, a large variance of the estimator may still result in under-coverage of the corresponding confidence interval (CI). Note that, due to this variation, our variance estimate can be negative, although this rarely happens in our simulations. A similar phenomenon is noticed by Schucany and Bankson [19] and Wang and Lindsay [26]. To alleviate this issue, we propose a local smoothing estimator, the Matched Sample Smoothing Variance Estimator (MS-s). The improvement is especially effective when the number of trees is small, as demonstrated in the simulation study; see, e.g., Table 1 and Figure 2.
In Algorithm 2, d(·,·) can be the Euclidean distance for continuous covariates and other metrics for categorical covariates. In practice, we can pre-process the data before fitting random forest models, for example by performing standardization and feature selection. Due to the averaging over local target samples, there is naturally a bias-variance trade-off in choosing D_min and the neighbors. This is a rather classical topic, and there are various ways to improve such an estimator based on the literature; our goal here is to provide a simple illustration. In the simulation section, we consider generating 10 neighbors on an ℓ2 ball centered at x*. The radius of the ball is set to the Euclidean distance from x* to the closest training sample. We found that the performance is not very sensitive to the choice of neighbor distance. Moreover, the computational cost of this smoothing estimator only involves new predictions, which is minor compared to fitting the random forests.
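The neighbor-averaging step can be sketched as follows (a simplified stand-in for Algorithm 2; the uniform-on-sphere sampling and the function names are our illustration, with the radius taken as the distance from x* to its nearest training point):

```python
import numpy as np

def smoothed_variance(x_star, X_train, var_at, n_neighbors=10, rng=None):
    """MS-s sketch: average a pointwise variance estimator var_at(z)
    over n_neighbors points drawn uniformly on the l2 sphere centered
    at x_star, with radius equal to the Euclidean distance from x_star
    to the closest training sample."""
    rng = rng if rng is not None else np.random.default_rng()
    radius = float(np.min(np.linalg.norm(X_train - x_star, axis=1)))
    dirs = rng.normal(size=(n_neighbors, x_star.shape[0]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)    # unit directions
    neighbors = x_star + radius * dirs                     # points on the sphere
    return float(np.mean([var_at(z) for z in neighbors]))

rng = np.random.default_rng(0)
X_train = rng.uniform(size=(50, 6))
x_star = np.full(6, 0.5)
# var_at stands in for the unsmoothed variance estimator at a point z.
print(smoothed_variance(x_star, X_train, var_at=lambda z: float(np.sum(z ** 2)), rng=rng))
```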

A Discussion on Existing Normality Theories of Random Forests
Before presenting the simulation results, we briefly discuss the normality theories of random forests. The main concern is that there is no universal guarantee of normality for random forests, so a variance estimator alone may not ensure the desired coverage rate. Hence, any variance estimator should be used with a reasonable understanding of the random forest itself, especially the impact of its tuning parameters.
Many existing works have studied the asymptotic normality of U_n for k ranging from o(√n) to o(n) under various regularity conditions [16, 23, 6, 30, 17, 1]. Existing empirical studies also show that normality usually holds when k is small but begins to break down in certain cases [30, Table 2]. As we will see below, there are both examples and counterexamples for the asymptotic normality of U_n with a large k, depending on the specific form of the kernel.
Essentially, when a kernel h(·) is very adaptive to local observations without much randomness, e.g., 1-nearest neighbors, and the kernel size is of the same order as n, there is too much dependency across the different h(S_i)'s, which prevents the normality of U_n. On the other hand, when the kernel size is relatively small, there is enough variation across the kernel evaluations to establish normality. This is the main strategy used in the literature. The following example demonstrates these ideas.

EXAMPLE 1. Given covariate-response pairs Z_1 = (x_1, Y_1), ..., Z_n = (x_n, Y_n) as training samples, where the x_i's are unique and deterministic numbers and the Y_i's are i.i.d. from F with E(Y_i) = µ > 0 and Var(Y_i) = σ², for i = 1, 2, ..., n, we want to predict the response at a given testing sample x*.
Suppose we have two size-k (k = βn) kernels: 1) a simple (linear) average kernel, h(S) = (1/k) Σ_{Z_j ∈ S} Y_j; and 2) a 1-nearest-neighbor (1-NN) kernel, which predicts using the training sample in S closest to x* in the distance of x. Without loss of generality, assume the x_i's are ordered such that x_i is the i-th nearest sample to x*. Denote the corresponding subbagging estimators by U_mean and U_1-NN, respectively. It is straightforward to show that U_1-NN = Σ_i a_i Y_i, where a_i = C(n−i, k−1)/C(n, k) and Σ_{i=1}^{n−k+1} a_i = 1. Accordingly, Var(U_mean) = σ²/n and Var(U_1-NN) ≥ a_1² Var(Y_1) = (k²/n²)σ² = β²σ². Since U_mean is a sample average, we still obtain asymptotic normality after scaling by √n. However, when β = k/n > 0, a_1 makes up a significant proportion of the sum of all a_i's, and Var(U_1-NN) does not decay to 0 as n grows. Hence, asymptotic normality does not hold for U_1-NN.
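The weights and variances in Example 1 can be computed exactly (an illustrative script of ours, with a_i = C(n−i, k−1)/C(n, k) as above):

```python
from math import comb

def one_nn_weights(n, k):
    # a_i = P(the i-th nearest training point is the closest one in the
    # subsample S) = C(n - i, k - 1) / C(n, k), for i = 1, ..., n - k + 1.
    return [comb(n - i, k - 1) / comb(n, k) for i in range(1, n - k + 2)]

sigma2, beta = 1.0, 0.5
for n in (40, 200, 1000):
    k = int(beta * n)
    a = one_nn_weights(n, k)
    var_mean = sigma2 / n                       # linear average kernel
    var_1nn = sigma2 * sum(w * w for w in a)    # 1-NN kernel: sigma^2 * sum a_i^2
    print(n, var_mean, var_1nn)
# var_mean decays like 1/n, while var_1nn stays bounded below by
# a_1^2 * sigma2 = beta^2 * sigma2 = 0.25 for every n.
```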
In practice, it is difficult to know a priori what type of dependence structure these h(S_i)'s satisfy. Thus, the normality of a random forest with a large subsampling size is still an open question and requires further understanding of its kernel. In our simulation study, we observe that confidence intervals constructed with normal quantiles work well when the data are generated with Gaussian noise (see Section 6.1).

Simulation Study
We present simulation studies to compare our variance estimator with existing methods [30, 23] on random forests. We consider both the smoothed and non-smoothed versions, denoted "MS-s" and "MS", respectively. The balance estimator and its bias-corrected version in Zhou, Mentch and Hooker [30] are denoted "BM" and "BM-cor". The infinitesimal jackknife in Wager and Athey [23] is denoted "IJ". Our simulation does not include the Internal and External Estimators of Mentch and Hooker [16], since the BM method has been shown to be superior to them [30]. Note that the BM estimator works for both U-statistics and V-statistics [30, Section 4, paragraph 1]. However, the V-statistics version is almost equivalent to IJ [30, Theorems 3.3 and 3.4]. Hence, our simulation only includes the U-statistics version.

Simulation Settings
We consider two regression settings. The MARS model was proposed by Friedman [11] for multivariate adaptive regression splines and has been used previously by Biau [2] and Mentch and Hooker [16]. The second model is a simple multivariate linear regression (MLR). In both settings, features x = (x_1, ..., x_6) are generated uniformly from the feature space, and responses are generated as f(x) + ε, where ε ~iid N(0, 1). We use n = 200 as the total training sample size and pick different subsample sizes: k = 100, 50, 25 for k ≤ n/2, and k = 160 for k > n/2. The numbers of trees are nTrees = B·M = 2000, 10000, 20000. For tuning parameters, we set mtry to 3, which is half of the dimension, and set the nodesize parameter to 2⌊log(n)⌋ = 8. We repeat the simulation N_mc = 1000 times to evaluate the performance of the different estimators. Our proposed methods (MS, MS-s) and the BM and BM-cor estimators are implemented using the RLT package available on GitHub. The IJ estimators are implemented using grf and ranger. Each estimation method and its corresponding ground truth (see details below) are generated by the same package. Note that we do not use the honest tree setting of Wager and Athey [23], since it is not essential for estimating the variance; however, it may affect the coverage rate through the normality behavior.
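For concreteness, the standard form of Friedman's MARS benchmark function (with x6 entering as a pure noise feature) and the data-generating step can be sketched as follows; this is the commonly used form of the benchmark, and the paper's exact specification is given in its simulation settings:

```python
import numpy as np

def mars(x):
    # Friedman's MARS benchmark; x has 6 columns, the last of which is noise.
    return (10 * np.sin(np.pi * x[:, 0] * x[:, 1])
            + 20 * (x[:, 2] - 0.5) ** 2
            + 10 * x[:, 3] + 5 * x[:, 4])

def make_data(n, rng):
    X = rng.uniform(size=(n, 6))           # features uniform on [0, 1]^6
    y = mars(X) + rng.normal(size=n)       # additive N(0, 1) noise
    return X, y

rng = np.random.default_rng(0)
X, y = make_data(200, rng)
print(X.shape, y.shape)
```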
The performance of the variance estimators is evaluated in terms of bias and the coverage rate of the corresponding confidence intervals. We denote the random forest estimator by f̂(x*) and evaluate coverage with respect to the mean of the random forest estimator, E[f̂(x*)], instead of the true model value f(x*), since our focus is the variance estimation of f̂(x*) and the random forest itself may be a biased model. To obtain the ground truth of the variance, we generate the training dataset 10000 times and fit a random forest to each, using the mean and variance of the 10000 forest predictions as approximations of E[f̂(x*)] and Var[f̂(x*)]. The relative bias and the confidence interval (CI) coverage are the evaluation criteria, with the relative bias defined as the ratio of the bias to the ground-truth variance. The 1−α CI is constructed as f̂ ± Z_{α/2}·sqrt(Var̂), where Var̂ is the estimated variance. We evaluate the variance estimation on two types of testing samples for both the MARS and MLR data. The first is a central sample, x* = (0.5, ..., 0.5); the second includes 50 random samples whose coordinates are independently sampled from a uniform distribution on [0, 1]. These testing samples are fixed for all experiments. The central sample is used to show the distribution of the variance estimators over the 1000 simulations, while the 50 random samples are used to evaluate the average bias and CI coverage rate. The evaluation results are presented in Figure 2 and Tables 1 and 2. A small difference in the ground truth generated by different packages, due to subtle implementation differences, is noted in Appendix J.
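The two evaluation criteria can be written down directly (a sketch of ours; the constant 1.6449 is the normal quantile z_{0.05} for a 90% CI, and the coverage target is E[f̂(x*)], as in the text):

```python
import numpy as np

def relative_bias(var_est, var_true):
    # average (estimate - truth) / truth over simulation repetitions
    return float(np.mean((var_est - var_true) / var_true))

def coverage(f_hat, var_est, target, z=1.6449):   # z_{0.05} for a 90% CI
    # fraction of repetitions whose CI f_hat +/- z * sqrt(var_est)
    # covers the target (here E[f_hat], not the true f, per the text)
    half = z * np.sqrt(var_est)
    return float(np.mean(np.abs(f_hat - target) <= half))

rng = np.random.default_rng(0)
true_var = 0.04
f_hat = rng.normal(loc=1.0, scale=np.sqrt(true_var), size=1000)
var_est = np.full(1000, true_var)        # an exact, noiseless estimator
print(coverage(f_hat, var_est, target=1.0))   # close to 0.90
```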
The computational cost of the different methods is similar, as the main cost lies in fitting the trees of the random forest rather than calculating the estimator from the tree predictions. While the MS-s method incurs additional cost for obtaining tree predictions at the target sample's neighboring points, this added cost is low, as it only involves making predictions with existing trees rather than fitting new ones.

FIG 2. A comparison of different methods on the MARS data. Each column represents a different tree size: k = n/2, n/4, n/8, respectively. The first row displays boxplots of the relative variance estimators on a central test sample, evaluated over 1000 simulations; the mean is represented by the red diamond in each boxplot. The second row displays boxplots of the 90% CIs. The third row displays the average coverage rate over 50 testing samples, with nTrees = 20000; the black reference line, y = x, represents the desired coverage rate.

The coverage for each method is calculated as the average over 50 testing samples, and the standard deviation, indicated in brackets, reflects the variation among these samples. Our simulation results show that the random forest estimators are approximately normally distributed, as the CIs constructed using the true variance achieve the desired confidence level (see Appendix J). In summary, MS and MS-s demonstrate consistently better performance than the other methods, especially when the tree size k is large, i.e., k = n/2. The improved performance can be seen in terms of accurate CI coverage and reduced bias.
First, the third row of Figure 2 shows that the MS-s method achieves the best CI coverage for every k, i.e., its line is nearest to the reference line y = x. The MS method performs second best when k = n/2 and n/8. Furthermore, the CI coverages of the proposed methods are stable over different testing samples, with a small standard deviation (less than 3%), as seen in Table 1. Secondly, regarding the bias of the variance estimation, our methods show a much smaller bias than all other approaches (Figure 2, first row). More details on the relative bias are summarized in Table 2. The average bias of MS is smaller than 0.5%, with a small standard deviation mainly due to Monte Carlo error. The MS-s method has a slightly positive average bias (0% to 6.2%), but it is still much smaller than that of the competing methods. The standard deviation of the bias for MS-s is around 4.3% to 13.6%, comparable to IJ.
On the other hand, the performance of the competing methods varies. When the tree size is k = n/2, the BM, BM-cor, and IJ methods show a large bias, but their performance improves for smaller tree sizes. It is worth noting that these methods are theoretically designed for small k. BM and BM-cor tend to underestimate the variance in most settings, while IJ tends to overestimate it. In Table 2, on the MARS data with 20000 nTrees, the bias of both BM and BM-cor is below −50%, resulting in severe under-coverage (65.4%, 59.8%), while IJ leads to over-coverage. Even when the tree size is as small as k = n/8, these methods still display a noticeable bias, and the proposed methods still outperform them when more trees (nTrees = 20000) are used, as shown in the last column of Table 2.
The results indicate that the number of trees has a significant effect on the performance of the estimators. This is to be expected, due to the influence of the random kernels, the variation involved in incomplete U-statistics, and other theoretical aspects. As the number of trees increases, the variation of all estimators decreases, as seen in the first row of Figure 2. Our estimators, being mostly unbiased, benefit from larger nTrees values. For instance, the 90% CI coverages of the MS method on the MARS data increase from 81.2% (k = n/2) and 81.8% (k = n/8) with nTrees = 2000 to 85.8% and 88.1%, respectively, with nTrees = 20000. On the other hand, the performance of the competing methods does not necessarily improve with nTrees. For example, BM shows over-coverage with nTrees = 2000 but under-coverage with nTrees = 20000 when k = n/4 or n/8. This phenomenon, known as estimation inflation, has been discussed in Zhou, Mentch and Hooker [30] and is addressed by the BM-cor method, which reduces the bias. When k = n/8, the gap between BM and BM-cor decreases as nTrees increases. However, this trend is no longer evident when k is large, as the dominating term used in their theory no longer applies.
Finally, we would like to emphasize the relationship between the bias of the estimator and the coverage rate of the confidence interval. Even when the random forest predictor is normally distributed and the variance estimator is unbiased, large fluctuations of the variance estimator can still lead to under-coverage. The same applies to the IJ estimator: for example, on the MARS data with k = n/8 and nTrees = 20000, IJ has a positive bias (11.5%), yet its confidence interval still under-covers, even more severely than the proposed methods. Increasing the number of trees can improve this performance to some extent. An alternative strategy is to perform local averaging, as implemented in the MS-s method, especially when nTrees is relatively small. The heights of the boxplots in the figure clearly demonstrate the variance-reduction effect. As a result, the MS-s method with 2000 trees shows better coverage than the MS method with 20000 trees when k = n/2 (see Table 1). However, this may come at the cost of a larger bias. Hence, we still recommend using a larger number of trees whenever computationally feasible.

Results for k ą n{2
As discussed in Section 5.1, when n/2 < k < n, we cannot jointly estimate V^(h) and V^(s), and additional computational cost is introduced by the bootstrap approach for estimating V^(h). In this simulation study, we fit an additional nTrees trees on bootstrap (with-replacement) subsamples to estimate V^(h), and we denote the resulting proposed estimator and its smoothed version by "MS(bs)" and "MS-s(bs)". We note that the grf package does not provide the IJ estimator when k > n/2, so we generate the IJ estimator and the corresponding ground truth with the ranger package.
As seen in Table 3, all methods suffer from severe bias, but our methods and IJ are comparable and better than BM and BM-cor. More specifically, our proposed method generally over-covers due to overestimating the variance. The IJ method shows good accuracy on the MARS data but more severe over-coverage than our methods on MLR. Overall, to obtain reliable statistical inference, we recommend avoiding k > n/2. This is a reasonable restriction when n is relatively large, since k = n/2 can already provide an accurate model.

Real Data Illustration
We use the Seattle Airbnb Listings dataset, obtained from Kaggle. The purpose of this analysis is to predict the price of Airbnb units in Seattle. The dataset consists of 7515 samples and nine covariates: latitude, longitude, room type, number of bedrooms, number of bathrooms, number of accommodates, number of reviews, presence of a rating, and the rating score. Further information about the dataset, including the missing-value processing, can be found in Appendix K.

FIG 3. Random forest predictions on the Airbnb testing data. The 95% confidence error bars are generated with our variance estimator, the Matched Sample Variance Estimator. "2B1B" denotes a unit with two bedrooms and one bathroom.
Given the large sample size, we fit 40000 trees to obtain the variance estimator. The tree size is fixed at half the sample size: k = 3757. We construct 12 testing samples at 3 locations: Seattle-Tacoma International Airport (SEA Airport), Seattle downtown, and Mercer Island, with four bedroom/bathroom settings: 1B1B, 2B1B, 2B2B, and 3B2B. The latitude and longitude of these locations and the other covariates are described in Appendix K. The price predictions, along with 95% confidence intervals, are presented in Figure 3. Overall, the predictions match our intuition. In particular, the confidence interval of 1B1B units at SEA Airport does not overlap with those of the same unit type at the other two locations. This is plausible because accommodations around an airport usually have lower prices due to stronger competition. We also observe that 2-bathroom units at SEA Airport and downtown have higher prices than 1-bathroom units, although the difference between 2B2B and 3B2B units at SEA Airport is insignificant.

From the perspective of U-statistics, we have proposed a new framework of variance estimation for infinite-order U-statistics. Instead of utilizing the leading-term dominance property, we establish the peak-region dominance notion, which addresses the bias issue under a large subsampling size k or a small sample size n. Additionally, new tools and strategies have been developed to study the ratio consistency behavior, which is crucial for obtaining a proper coverage rate. Here, we discuss several open issues and possible extensions for future research.
First, our current methods are computationally valid for k ≤ n/2. The difficulty in extending to the k > n/2 region lies in estimating the tree variance V^(h). We proposed using bootstrapped trees for this purpose; however, this can introduce additional bias and large variation, as seen in the simulation study. We suspect that bootstrapping may be sensitive to the randomness involved in fitting trees: since we estimate V^(h) and V^(s) separately, the randomness of the tree kernel can introduce different added variances to the two terms, leading to non-negligible bias. For k > n/2, Wang and Lindsay [27] propose an asymptotically unbiased variance estimator for the U-statistic estimator of a Kullback-Leibler risk in k-fold cross-validation; however, it depends on a specific approximation of the Kullback-Leibler risk kernel. The problem remains open for general kernels.
Secondly, we developed a new double U-statistic tool to prove ratio consistency. This is the first work that analyzes the ratio consistency of a uniformly minimum-variance unbiased estimator (UMVUE) of a U-statistic's variance. The tool can potentially be applied to theoretical analyses of a general family of U-statistic problems. However, our ratio consistency result is still limited to k = o(n^{1/2−ε}), leaving a gap between the theoretical and practical versions.
This limitation comes from the procedure we used to derive the Hoeffding decomposition of the variance estimator's variance. In particular, we need the leading term to dominate the variance while allowing a super-linear growth rate of each σ²_{c,2k} in c. The extension to the k = βn setting therefore remains open and may require further assumptions on the overlapping structures of double U-statistics.
Thirdly, in our smoothed estimator, the choice of testing-sample neighbors can be made data-dependent, relying on the forest-defined distance. It is worth considering more robust smoothing methods in future work.
Lastly, this paper focuses on the regression problem using random forests. The variance estimator can also be applied to the general family of subbagging estimators. We may further investigate uncertainty quantification for variable importance, confidence intervals for classification probabilities, confidence bands in survival analysis, etc.

A Notation Table

U_n, h: U_n is the U-statistic with size-k kernel h.
V̂_u, ψ: V̂_u denotes the estimator (19) of Var(U_n), which is a U-statistic with size-2k kernel ψ.
S: S denotes the size-k subsample set associated with the kernel h.
S^(2k): S^(2k) denotes the size-2k subsample set associated with the kernel ψ.
w_d, w̃_d: w̃_d = O(k^{2d}/(d! n^d)) is the upper bound of w_d given by Equation (25).
r, |r|: r is a 9-dimensional vector defined in (38), describing the 4-way overlapping among S_1, S_2, S_3, S_4; |r| is the ℓ1 norm of r.
ρ(r, d_1, d_2): a notation emphasizing the 11 DoF of ρ, used in Appendix H.
ρ(r): the 9-DoF benchmark of ρ used in Assumption 6.

Influential Overlaps
The samples in S

B Discussion of Assumptions
In this section, we present discussion and validating examples for Assumptions 3-5, which concern the covariance term ρ = Cov[h(S_1)h(S_2), h(S_3)h(S_4)] (29). Before the discussion, we first illustrate the definition of the 9-dimensional vector r, which plays an important role in quantifying ρ.

B.1 Definition of r
We present the definition of the 9-dimensional vector r, which characterizes the overlaps among the size-k subsample sets S_1, S_2, S_3, S_4. Partition S_1 ∪ S_2 into T_0 = S_1 ∩ S_2, T_1 = S_1\S_2, T_2 = S_2\S_1, and similarly partition S_3 ∪ S_4 into T'_0 = S_3 ∩ S_4, T'_1 = S_3\S_4, T'_2 = S_4\S_3 (see Figure 4). Based on the above, we denote R_ij := T_i ∩ T'_j and r_ij := |R_ij|, for i, j = 0, 1, 2. Then the 9-dimensional vector r is defined as r := (r_00, r_01, r_02, r_10, r_11, r_12, r_20, r_21, r_22)^T.
Thirdly, we define the norm of r as |r| = Σ_{i,j} r_ij. Each sample in (S_1 ∪ S_2) ∩ (S_3 ∪ S_4) is counted exactly once in r, so |r| ≤ c. We remark that the value of r_ij is naturally bounded by the size of the corresponding overlapping set; for example, r_0* = |(S_1 ∩ S_2) ∩ (S_3 ∪ S_4)| ≤ |S_1 ∩ S_2| = d_1. Note that the 11 DoF of ρ can also be read off the left panel of Figure 4: there are 15 blocks, but the constraints |S_1| = |S_2| = |S_3| = |S_4| = k leave 11 = 15 − 4 DoF.
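The bookkeeping behind r can be made concrete with plain set operations (an illustrative script of ours, following the partition just defined):

```python
def overlap_vector(S1, S2, S3, S4):
    """Compute the 9-dimensional overlap vector r with r_ij = |T_i n T'_j|,
    where T_0 = S1 & S2, T_1 = S1 - S2, T_2 = S2 - S1 and
    T'_0 = S3 & S4, T'_1 = S3 - S4, T'_2 = S4 - S3."""
    T = [S1 & S2, S1 - S2, S2 - S1]
    Tp = [S3 & S4, S3 - S4, S4 - S3]
    return [len(T[i] & Tp[j]) for i in range(3) for j in range(3)]

S1, S2 = {1, 2, 3}, {3, 4, 5}     # size-k sets with k = 3
S3, S4 = {2, 3, 6}, {3, 6, 7}
r = overlap_vector(S1, S2, S3, S4)
c = len((S1 | S2) & (S3 | S4))
print(r, sum(r), c)    # |r| = sum(r), which never exceeds c
```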
Hence, given the above 3 cases of Cov(X_1 X_2, X_3 X_4), it is easy to verify that ρ can be represented as a weighted average a_{1,n} cov_1 + a_{2,n} cov_2 + a_{3,n} cov_3, where cov_3 = 0. Moreover, by the definitions of cov_1 and cov_2, it is easy to verify that a_{1,n} only depends on the r_ij with (i, j) ≠ (0, 0) and a_{2,n} only depends on r_00. This verifies Assumption 3 for the linear average kernel. Besides, we can also show that F^(k) in (31) (see Assumption 4) is a quadratic function of c for this kernel.
In addition to the above example, the following discussion shows that we may not be able to further reduce the DoF from 9 to 4 under stronger assumptions. When E[h(S)] = 0, it is natural to consider the fourth cumulant of ρ:

cum_4[h(S_1), h(S_2), h(S_3), h(S_4)] = ρ − Cov[h(S_1), h(S_3)]·Cov[h(S_2), h(S_4)] − Cov[h(S_1), h(S_4)]·Cov[h(S_2), h(S_3)].
As long as Var(X_1²) − 2Var²(X_1) > 0, which is common for non-Gaussian X_1, Equation (40) does not vanish asymptotically. This implies that the fourth cumulant is not always of lower order than ρ.
We can further verify that when the kernel function is the simple quadratic average kernel h(S_1) = h(X_1, ..., X_k) = (1/k²)[Σ_{i=1}^k X_i]², even if the X_i's are i.i.d. standard Gaussian, the fourth cumulant is still not a lower-order term of ρ.

B.3 Discussion of Assumption 4
In Equation (31), F^(k)_c is defined as an upper bound of ρ for a given c. As illustrated by Figure 5, the more samples S_1 and S_2 share, the closer h(S_1)h(S_2) becomes to h(S_1)². The assumption is trivial to verify for the linear average kernel h(S_l) = (1/k) Σ_{i=1}^k X_i^(l), l = 1, 2, 3, 4. In particular, for (S_1, S_2, S_3, S_4) such that |S^(2k)_1 ∩ S^(2k)_2| = c and (S_1, S_2) such that |S_1 ∩ S_2| = c, the equality in Equation (31) is attained. It is also straightforward to verify the assumption for the simple quadratic average kernel h(S_1) = (1/k²)[Σ_{i=1}^k X_i]².

B.4 Discussion of Assumption 5
Assumption 2 gives a polynomial growth rate of the second-moment term ξ²_{d,k}, while Assumption 5 gives a polynomial growth rate of the fourth-moment term F^(k). Assumption 5 also assumes that F^(k)_1 is not of higher order than ξ⁴_{1,k}. To better illustrate the idea of Assumption 5, we consider an oversimplified setting in which the cum_4 term in Equation (39) is 0; the linear average kernel with i.i.d. standard Gaussian X_i's satisfies this setting. Hence, even if ξ²_{d,k} has a linear growth rate in d, F^(k)_c can still grow at a quadratic rate in c (see (42)). Therefore, we cannot assume σ²_{2k,2k}/(2k σ²_{1,2k}) = O(1), which is the common assumption on the counterpart ξ²_{1,2k} [16, 30, 6]. That is one reason why Assumptions 3-5 are imposed on the primitive term ρ instead of on σ²_{c,2k}.

We first present a technical proposition to be used shortly.

PROOF OF PROPOSITION C.1. This proof follows DiCiccio and Romano [6]. We first write the combinatorial numbers C(n,k)^{-1} C(k,c) C(n−k, k−c) in terms of factorials; it then suffices to upper bound the quantities inside the two square brackets separately.

The bound (50) follows from the bounded growth rate of F^(k) in Assumption 5 and the finiteness of T_1. Denoting (k²/n²)F^(k)_1 by σ̃²_{1,2k}, we conclude that Var(V̂_u) = O((k²/n)σ̃²_{1,2k}). Therefore, it suffices to show that the remaining part of Var(U_n) is dominated by the leading term. By Proposition C.1, the numerator can be bounded with b_n = k²/(n−k−1). Noticing that n < 2(n−k+1), we can bound C(n,k)^{-1} Σ_{d=2}^k C(k,d) C(n−k, k−d) ξ²_{d,k}. By Assumption 2, the growth rate of ξ²_{d,k} is bounded: there exists a uniform constant C such that ξ²_{d,k}/ξ²_{1,k} ≤ C·d! for d = 2, 3, ..., k. Therefore, the RHS of Equation (53) is bounded as in (54), and the RHS of (54) goes to 0 as n → ∞, since b_n = k²/(n−k−1) → 0. This completes the proof.
Hence, there is a cancellation for h(S_1)h(S_2) with d = |S_1 ∩ S_2| = 0. For the RHS of the resulting equation, multiply and divide by M_{d,k} (24) inside Σ_{d=1}^k, and denote the resulting weights w_d accordingly. Given that Σ_{d=0}^k w_d = 0 (to be proved shortly), we have w_0 = −Σ_{d=1}^k w_d.

Proof of Equation (25)
First, we show that $\sum_{d=0}^{k} w_d = 0$. As discussed above, $w_d$ is a product of three normalization constants, $A_1$, $M_{d,k}$, and $N_d^{-1}$. Since $A_{1,0} > A_1 > 0$, $M_{d,k} > 0$, and $N_d > 0$, we have $w_d > 0$ for all $d \ge 1$ and $w_0 < 0$. Then we show $\sum_{d=0}^{k} w_d = 0$. Though this can be justified by direct calculation, we present a more intuitive proof. Recall $\hat{V}_u = Q(k) - Q(0)$. By its definition, $Q(k)$ can be represented as a weighted sum of $h(S_i)h(S_j)$, i.e., $\sum_{1 \le i < j \le n} a_{ij}\, h(S_i)h(S_j)$ with $\sum_{1 \le i < j \le n} a_{ij} = 1$. Thus, $Q(k) - Q(0)$ can be represented in a similar way, with coefficients $a'_{ij}$ satisfying $\sum_{1 \le i < j \le n} a'_{ij} = 0$. Therefore $\psi_k(S^{(2k)}) - \psi_0(S^{(2k)})$, as the kernel of the U-statistic $Q(k) - Q(0)$, can also be represented as a weighted sum (57), where $\sum_{1 \le i < j \le n} b_{ij} = 0$ since $\psi_k(S^{(2k)}) - \psi_0(S^{(2k)})$ is an unbiased estimator of $Q(k) - Q(0)$. On the other hand, for $d = 0, 1, 2, \ldots, k$, $\phi_d(S^{(2k)})$ is still a U-statistic, which can be represented as a weighted sum (58), where $\sum_{1 \le i < j \le n} c^{(d)}_{ij} = 1$. Since $\psi_k(S^{(2k)}) - \psi_0(S^{(2k)}) = \sum_{d=0}^{k} w_d\, \phi_d(S^{(2k)})$, comparing Equations (57) and (58) with $\sum_{1 \le i < j \le n} c^{(d)}_{ij} = 1$ and $\sum_{1 \le i < j \le n} b_{ij} = 0$, we can take $h(S_i) = 1$ for $i = 1, 2, \ldots, n$ and conclude that $\sum_{d=0}^{k} w_d = 0$.

Secondly, we present the details to bound $w_d = A_1 M_{d,k}/N_d$, for $d = 1, 2, \ldots, k$. Plugging in the expressions for $A_1$, $M_{d,k}$, and $N_d$ and directly cancelling the common factorials yields (59). For Part I in (59), the last equality holds because for any $k = o(\sqrt{n})$ and $d \le k$, the stated bounds hold; combining the $\le$ and $\ge$ directions, Part I equals $[1 + o(1)]\,\frac{1}{n^d}$. For Part II in (59), $\frac{k!\,k!}{(k-d)!\,(k-d)!} = [k(k-1)\cdots(k-d+1)]^2 \le k^{2d}$.
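The Part II computation can be checked mechanically. The snippet below (a sketch we added, not from the paper) verifies the identity $\frac{k!\,k!}{(k-d)!\,(k-d)!} = [k(k-1)\cdots(k-d+1)]^2$ and the bound by $k^{2d}$ for one choice of $k$ and $d$:

```python
from math import factorial, prod

def part_two(k, d):
    # k!k! / ((k-d)!(k-d)!), as it appears in the bound of w_d
    return factorial(k)**2 // factorial(k - d)**2

k, d = 20, 4
falling = prod(range(k - d + 1, k + 1))      # k(k-1)...(k-d+1)
print(part_two(k, d) == falling**2)          # the factorial identity
print(part_two(k, d) <= k**(2 * d))          # the k^{2d} upper bound
```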
In particular, when $d$ is fixed, we have $\frac{k!\,k!}{(k-d)!\,(k-d)!} = [k(k-1)\cdots(k-d+1)]^2 = [1 + o(1)]\,k^{2d}$. Combining Parts I, II, and III in (59) yields the claimed bound.

Comparing $\mathrm{Var}(U_{n,B,M}) = (1 - \frac{1}{B})\mathrm{Var}(U_n) + \frac{1}{MB}V^{(h)}$ (14) with $\mathrm{Var}(U_{n,B}) = \mathrm{Var}(U_n) + E[\mathrm{Var}(U_{n,B} \mid X_n)]$ (5), it suffices to show the identity below. Here we adopt an alternative view of a complete U-statistic $U_n$ with $k \le n/2$, due to Wang and Lindsay [26]. Following our notation of "matched group", we can always take $M = \lfloor n/k \rfloor$ mutually disjoint subsamples $S_1, \ldots, S_M$ from $(X_1, \ldots, X_n)$, such that $|S_i \cap S_j| = 0$ for $1 \le i < j \le M$. Wang and Lindsay [26] take integer $M = n/k$, while we allow $2 \le M \le \lfloor n/k \rfloor$. Recall that for such groups, $E\big[(\bar{h}^{(b)} - E(U_n))(U_n - E(U_n))\big] = \mathrm{Var}(\bar{h}^{(1)}) - \mathrm{Var}(U_n) = \frac{1}{M}V(h) - \mathrm{Var}(U_n)$.
In the above equations, the first equality is the conclusion of Wang [25]; the next-to-last equality holds since:

E.2 Unbiasedness of Variance Estimators
PROOF OF PROPOSITION 3.1. First, we restrict the discussion to $k \le n/2$. We first show that $\hat{V}^{(s)} = \sum_{d=0}^{k} \gamma_{d,k,n}\,\hat{\xi}^2_{d,k}$. By the discussion in Section 3.3.1, we have $N_{d,k,n} = \binom{n}{k}^2 \gamma_{d,k,n}$. For a complete U-statistic with $k \le n/2$, $N_{d,k,n} = \binom{n}{k}\binom{n-k}{k-d}\binom{k}{d}$, and we denote $N = \binom{n}{k} > 0$. The claimed identity then follows. Here, the second equality holds by plugging in the definition of $N_{d,k,n}$ and interchanging the finite summation $\sum_{d \in D}$ with $\sum_{1 \le i < j \le B}$. The third equality omits the cases with $i = j$, where $h(S_i) - h(S_j) = 0$. The second-to-last equality holds because the sample variance is essentially an order-2 U-statistic, with kernel $(h(S_i) - h(S_j))^2/2$.
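The last step identifies the sample variance with an order-2 U-statistic whose kernel is $(h(S_i) - h(S_j))^2/2$. A minimal check of that identity (our illustration, with arbitrary numbers standing in for the $h(S_i)$'s):

```python
import itertools
import statistics

def u_stat_variance(vals):
    """Order-2 U-statistic with kernel (v_i - v_j)**2 / 2:
    averaging over all unordered pairs recovers the unbiased sample variance."""
    pairs = list(itertools.combinations(vals, 2))
    return sum((a - b)**2 / 2 for a, b in pairs) / len(pairs)

vals = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # stand-ins for h(S_i)
print(u_stat_variance(vals), statistics.variance(vals))  # identical values
```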
We can define $\hat{V}^{(s)} = \sum_{d \in D} \gamma_{d,k,n}\,\hat{\xi}^2_{d,k}$, similar to the previous proof. Since each $\hat{\xi}^2_{d,k}$ is still an unbiased estimator of $\xi^2_{d,k}$, we similarly have $E(\hat{V}^{(s)}_1) = V^{(s)}$. Remark that the summation in Equation (62) is over $d \in D$ instead of $d = 0, 1, 2, \ldots, k$. This is because $\gamma_{d,k,n}$ is 0 for small $d$ when $k > n/2$. In other words, when $k > n/2$, several terms $\gamma_{d,k,n}\,\xi^2_{d,k}$ in the Hoeffding decomposition (2) are already 0.
We remark that Schucany and Bankson [19] consider an alternative estimator of $\xi^2_{1,2}$, denoted $\hat{\zeta}^2_1$ [21]. However, that estimator connects to the work of Mentch and Hooker [16] and Zhou, Mentch and Hooker [30] (see Section 3.5) and is not the focus of this appendix.
Thirdly, note that Wang and Lindsay [26]'s estimator involves their own partitioning scheme; we use the notation of our matched group and assume $M = n/k$ to present their estimator (see $G_{n,k,M}$ (60) in Appendix E.1). Here we present Wang and Lindsay [26]'s estimator in their ANOVA form, which is the alternative to their second-moment view. This alternative form uses the within- and between-group variances [see 26, page 1122]. However, the form is still different from ours. To simplify the notation, we denote $B = |G_{n,k,M}|$ and write $\bar{h}^{(b)}$ for the average of $h$ over the $b$-th group; their estimator can then be represented as follows.
where $\hat{V}^{(s)} = \binom{n}{k}^{-1}\sum_{i=1}^{\binom{n}{k}} (h(S_i) - U_n)^2$ (10); $\hat{V}^{(h)} = \big[\binom{n}{k}\binom{n-k}{k}\big]^{-1}\sum_{|S_i \cap S_j| = 0} [h(S_i) - h(S_j)]^2/2$ (11); and $\hat{\sigma}^2_{WP}$, $\hat{\sigma}^2_{BP}$ are the within- and between-group variances.

PROPOSITION E.1. Our complete variance estimator $\hat{V}_u$ is equivalent to the estimator $\hat{V}^{(W\&L)}_u$.

PROOF. To simplify the notation, we denote $A_1$, $A_2$, $A_3$ as below. To show the equivalence between $\hat{V}_u$ and $\hat{V}^{(W\&L)}_u$, we will show that they are the same linear combination of $A_1, A_2, A_3$. First, it is trivial to verify that $\hat{\sigma}^2_{WP}$ and $\hat{\sigma}^2_{BP}$ are linear combinations of $A_1, A_2, A_3$. Secondly, we show that $\hat{V}^{(h)} = \hat{\sigma}^2_{WP}$ by showing that $\hat{V}^{(h)}$ also equals $\frac{M}{M-1}(A_1 - A_3)$, by considering the summation normalized by $\binom{n}{k}^{-1}\binom{n-k}{k}^{-1}$. Thirdly, we show that $\hat{V}^{(s)}$ is also a linear combination of $A_1, A_2, A_3$. We start with $\hat{V}^{(s)} = \binom{n}{k}^{-1}\sum_{i=1}^{\binom{n}{k}} h(S_i)^2 - U_n^2$. Due to the definition of $G_{n,k,M}$, the collection $\{S_i^{(b)}\}_{i,b}$ is essentially a set of replications of $\{S_i\}_{i=1}^{\binom{n}{k}}$. Hence, $\binom{n}{k}^{-1}\sum_{i=1}^{\binom{n}{k}} h(S_i)^2 = A_1$, which implies that $\hat{V}^{(s)} = A_1 - A_2$.
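The step $\hat{V}^{(s)} = A_1 - A_2$ is the "mean of squares minus squared mean" identity over all $\binom{n}{k}$ subsamples. A brute-force check on a toy complete U-statistic (our sketch; the kernel `max` is an arbitrary placeholder for a symmetric kernel):

```python
import itertools
import random

random.seed(0)
n, k = 8, 3
X = [random.gauss(0, 1) for _ in range(n)]
h = max  # arbitrary symmetric kernel, for illustration only

vals = [h(c) for c in itertools.combinations(X, k)]   # all C(n,k) subsamples
U_n = sum(vals) / len(vals)
V_s = sum((v - U_n)**2 for v in vals) / len(vals)     # \hat V^{(s)}, cf. (10)
A1 = sum(v**2 for v in vals) / len(vals)              # mean of h(S_i)^2
A2 = U_n**2
print(abs(V_s - (A1 - A2)) < 1e-12)                   # V^{(s)} = A1 - A2
```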

E.4 Equivalence of Incomplete Variance Estimators
Only Wang and Lindsay [26] and our paper propose variance estimators for incomplete U-statistics. Similar to the analysis in Appendix E.3, we will show that our incomplete variance estimator (18) is equivalent to the counterpart in Wang and Lindsay [26, page 1124]. Given $B$ matching groups and $M$ subsamples in each group, we denote the two estimators by $\hat{V}^{(inc)}_u$ and $\hat{V}^{(inc,W\&L)}_u$, respectively. Similarly, it is trivial to verify the linear-combination representations of both. By plugging these into Equation (69), we conclude that $\hat{V}^{(inc)}_u = \hat{V}^{(inc,W\&L)}_u$.

F Technical Propositions and Lemmas
In this section, we present the technical propositions and lemmas. The proofs of these results are collected in Appendix G.
PROPOSITION F.1. The value of $\psi(S^{(2k)})$ does not depend on $E[h(X_1, \ldots, X_k)]$. Therefore, WLOG, we can assume the kernel is zero-mean, i.e., $E[h(X_1, \ldots, X_k)] = 0$.

The proof of this proposition is collected in Appendix G.1. The proofs of the above propositions are collected in Appendix G.2 and Appendix G.3, respectively. Note that the upper bound in Proposition F.2 actually holds for any fixed and finite $c$, but it suffices to restrict $c \le T_1$ to show our main results. These two propositions depend on the further decomposition of $\sigma^2_{c,2k}$ into a weighted sum of $\eta^2_{c,2k}(d_1, d_2)$'s, which is discussed in Appendix F.2. In particular, we can show that $\eta^2_{c,2k}(1,1)$ dominates $\sigma^2_{c,2k}$ for $c = 1, 2, \ldots, T_1$. As a corollary of the above results, we can show the following lemma. Here, we denote by $\check{\sigma}^2_{c,2k}$ the upper bound of $\sigma^2_{c,2k}$ given by Proposition F.2.
The proof of Lemma F.4 is collected in Appendix G.4. This implies that to bound $\mathrm{Var}(\hat{V}_u)$, it suffices to bound the weighted average of the first $T_1$ terms of $\sigma^2_{c,2k}$, instead of all $2k$ terms. Note that we use the truncated variance $\mathrm{Var}^{(T_1)}$ instead of $\mathrm{Var}$ in the denominator of (70). Here $T_1 = \lfloor 1/\epsilon \rfloor + 1$ does not grow with $n$. It only relies on $\epsilon$, which quantifies the growth rate of $k$ with respect to $n$ (see Assumption 1). For example, if $k = n^{1/3}$, i.e., $\epsilon = 1/6$, then we can choose $T_1 = 7$. Hence, to show that $\sigma^2_{1,2k}$ dominates $\mathrm{Var}(\hat{V}_u)$ as $n \to \infty$, it suffices to show that $\sigma^2_{1,2k}$ dominates the $T_1$-truncated $\mathrm{Var}^{(T_1)}(\hat{V}_u)$ as $n \to \infty$.
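A small numeric illustration (ours, not from the paper; it assumes the reading $T_1 = \lfloor 1/\epsilon \rfloor + 1$ with $k \asymp n^{1/2-\epsilon}$, and borrows $b_n = 4k^2/(n-2k+1)$ from Appendix G): the truncation level is finite, and $n^2\, b_n^{T_1} \to 0$, which is the property the truncation argument needs.

```python
def T1(eps):
    # T1 = floor(1/eps) + 1: finite and independent of n (assumed reading)
    return int(1 / eps) + 1

eps = 1 / 6            # corresponds to k = n**(1/2 - eps) = n**(1/3)
t1 = T1(eps)
print(t1)              # 7, matching the k = n^{1/3} example in the text

ratios = []
for n in [10**4, 10**6, 10**8]:
    k = round(n ** (0.5 - eps))
    b_n = 4 * k**2 / (n - 2 * k + 1)
    ratios.append(n**2 * b_n**t1)   # should shrink: b_n^{T1} = o(1/n^2)
print(ratios)
```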
With the above bounds on $\eta^2_{c,2k}(d_1, d_2)$, we introduce the following truncated $\sigma^2_{c,2k}$ and show Lemma F.8, which implies that to bound $\sigma^2_{c,2k}$, it suffices to bound the first finitely many $\eta^2_{c,2k}(d_1, d_2)$ terms in its decomposition (28).
DEFINITION F.7 (Truncated $\sigma^2_{c,2k}$). Let $T_2 = \lfloor 1/\epsilon \rfloor + 1$. Hence, given two size-$2k$ subsamples, (28) only involves the sum of $T_2^2$ terms, i.e., $\sigma^2_{c,2k,(T_2)}$, rather than $k^2$ terms. Here $T_2$ is again finite and does not grow with $k$. Though $T_1$ and $T_2$ take the same value, we note that $T_1$ is the truncation constant for $\mathrm{Var}(\hat{V}_u)$ in (71), while $T_2$ is the truncation constant for $\sigma^2_{c,2k}$ in (74).

G.1 Proof of Proposition F.1
PROOF OF PROPOSITION F.1. Suppose $E(h(S_1)) = \mu$ and rewrite $h(S_1) = h^{(0)}(S_1) + \mu$, where $E(h^{(0)}(S_1)) = 0$. Then $\phi_d(S^{(2k)})$ defined in (24) can be written as Equation (75). Plug Equation (75) into Equations (19) and (23). By the fact that $\sum_{d=0}^{k} w_d = 0$ and the symmetry of the U-statistic, the terms involving $\mu$ and $\mu^2$ cancel. Consequently, $\hat{V}_u$ does not depend on $\mu$, so WLOG we assume that $\mu = 0$.

G.4 Proof of Lemma F.4
PROOF. Recall $T_1 = \lfloor 1/\epsilon \rfloor + 1$. We first present the intuition of this lemma. $\mathrm{Var}(\hat{V}_u)$ is a weighted sum of the $\sigma^2_{c,2k}$'s, where the coefficient of $\sigma^2_{c,2k}$ decays in $c$ at a rate even faster than geometric. If the growth rate of $\sigma^2_{c,2k}$ is not too fast, then the tail terms are negligible. This involves both the precise upper bound of $\sigma^2_{1,2k}$ (Proposition F.2) and bounds on the remaining terms.
PROOF. The proof of Proposition G.1 is similar to the proof of Proposition 4.5. The idea is that the tail coefficients form a geometric sum whose decay dominates the growth rate of the moments.
First, we consider the coefficients. By Proposition C.1 and our analysis in Equations (52) and (53), let $b_n = 4k^2/(n - 2k + 1)$, which is the common ratio of the geometric sequence, so the sum over $c$ up to $2k$ can be controlled term by term. Second, combining this with (80), it is again a geometric series with common ratio $b_n = o(1)$. We obtain the stated bound, where the last equality follows from the sum of a geometric series.
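The geometric-series argument can be sanity-checked numerically (our sketch, with an arbitrary ratio `b` standing in for $b_n$): the tail beyond $c = T$ is at most $b^{T+1}/(1-b)$, so the first retained term dominates once $b = o(1)$.

```python
def geometric_tail(b, T, C):
    # sum_{c=T+1}^{C} b**c: the tail left after truncating at T
    return sum(b**c for c in range(T + 1, C + 1))

b, T, C = 0.05, 3, 50
tail = geometric_tail(b, T, C)
bound = b**(T + 1) / (1 - b)      # closed-form geometric bound
print(tail <= bound)
print(tail / b**(T + 1))          # close to 1: first tail term dominates
```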
In Equation (82), the upper bound of the $c = T_1 + 1$ term in the numerator is of lower order than the denominator.
We first show Equation (83), i.e., we bound the ratio of coefficients. Similar to the analysis for Equation (53), by Proposition C.1, we have the bound below, where $b_n = 4k^2/(n - 2k + 1)$; this controls the ratio of coefficients. It remains to show $b_n^{T_1} = o(\frac{1}{n^2})$, which follows from $T_1 = \lfloor 1/\epsilon \rfloor + 1$. Second, we show (84), i.e., we bound the ratio of moments. By Lemmas F.8 and F.2, $\sigma^2_{1,2k} = O(\frac{k^2}{n^2}F^{(k)}_1)$ and $\check{\sigma}^2_{1,2k} \asymp \frac{k^2}{n^2}F^{(k)}_1$.

We will decompose $\eta^2_{c,2k}(d_1, d_2)$ as a finite weighted sum (87), where $i$ is the summation index to be specified later. Based on this form, we will show $a_i = O(\frac{1}{k^2})$ and $b_i = O(F^{(k)}_c)$ for each $i$. Since (87) is a finite sum, we can conclude that $\eta^2_{c,2k}(d_1, d_2) = O\big(\frac{1}{k^2}F^{(k)}_c\big)$. Details of (87) will be presented later. We remark that it is straightforward to upper bound $\eta^2_{1,2k}(d_1, d_2)$ by enumerating all possible 4-way overlapping cases of $S_1, S_2, S_3, S_4$ given $c, d_1, d_2$ for small $c$. However, as $c$ grows from 1 to $2, 3, 4, \ldots, T_2$, such enumeration becomes infeasible.
We start the proof by reviewing the definition of $\eta^2_{c,2k}(d_1, d_2)$ (27) and of $\phi_d(S^{(2k)})$, the normalized sum $\sum_{S_1, S_2 \subset S^{(2k)},\, |S_1 \cap S_2| = d} h(S_1)h(S_2)$ (24). The following proof is organized in two parts. First, we propose an alternative representation of the covariance $\mathrm{Cov}[\phi_{d_1}(S_1^{(2k)}), \phi_{d_2}(S_2^{(2k)})]$, which helps discover the cancellation pattern of $\eta^2_{c,2k}(d_1, d_2)$ (27). Secondly, we derive Equation (87) and specify the $a_i$'s and $b_i$'s. First, we notice that $\phi_d(S^{(2k)})$ (24) is a weighted average of products of two kernels $h(S_1)h(S_2)$. Denote by $\sum_{P_{12}}$ the summation over all pairs $S_1, S_2$ satisfying the constraint below.
Here $p_{(r_0^*, r_1^*, r_2^*, d_1, c)}$ and $p_{(r_{*0}, r_{*1}, r_{*2}, d_2, c)}$ are non-negative and satisfy the following: for non-negative integers $x_0, x_1, x_2$ with $x_0 + x_1 + x_2 \le c$, $g(r^*, d_1, d_2)$ is the following weighted average of $\rho$, where the weight is some constant $p(r, r^*)$ satisfying $\sum_{r^*} p(r, r^*) = 1$. The proof of Lemma G.3 is deferred to the end of Appendix G.5. Under Assumption 3, $\rho$ does not depend on $d_1, d_2$. Hence, $g(r^*, d_1, d_2)$ also does not depend on $d_1, d_2$. We further denote $G(r^*) = g(r^*, d_1, d_2)$. The covariance then equals $\sum p_{(r_0^*, r_1^*, r_2^*, d_1, c)}\, p_{(r_{*0}, r_{*1}, r_{*2}, d_2, c)}\, G(r^*)$.
The sum over feasible $r^*$ can be written as two sequential sums (95). We have two observations on Equation (95). First, $\eta^2_{c,2k}(d_1, d_2) = \sum_i a_i b_i$ is a finite summation because $c \le T_1$. Hence, to show $\eta^2_{c,2k} = O(F^{(k)}_c/k^2)$, it suffices to bound every term $a_i b_i$. Secondly, by Lemma G.3, $b_i = G(r^*)$ is a weighted average of $\rho$ with non-negative weights satisfying $\sum_{r^*} p(r, r^*) = 1$. Hence, each $b_i$ is naturally bounded by the upper bound of $\rho$, and we conclude that $b_i = O(F^{(k)}_c)$. Therefore, it remains to show that $a_i = O(k^{-2})$ for every $i$, which is provided by the following lemma.
REMARK G.5. This proof has proceeded under Assumption 3. It can be adapted to a weaker assumption, Assumption 6. The corresponding proof under this weaker assumption is presented in Appendix H, where we cannot exactly cancel two $\rho$'s with the same $r$ but different $d$.
We now present the proofs of two important technical facts used in the above proof, beginning with Lemma G.3. Given $c$ (the overlap of the two size-$2k$ subsamples), $d_1 = |S_1 \cap S_2|$, and $d_2 = |S_3 \cap S_4|$, suppose we randomly sample a feasible $(S_1, S_2, S_3, S_4)$ from all possible cases. We can use a 9-dimensional random variable $R$ to denote the 4-way overlapping structure of $S_1, S_2, S_3, S_4$. Hence, the coefficient $p_{(r, d_1, d_2, c)}$ in (90) is $P(R = r \mid d_1, d_2, c)$. Then, denote by $R^* = (R_0^*, R_1^*, R_2^*, R_{*0}, R_{*1}, R_{*2})$ a 6-dimensional random variable taking all possible values of $r^*$ given $d_1, d_2, c$. By the Bayes rule, $P(R = r \mid d_1, d_2, c) = P(R = r \mid R^* = r^*, d_1, d_2, c)\, P(R^* = r^* \mid d_1, d_2, c)$.
For Part III, an analogous bound holds. Combining Parts I, II, and III, we obtain the claimed bound. This completes the proof.
We decompose $\sigma^2_{c,2k}$ into three parts, $A$, $B$, and $C$. Similarly, denote by $\check{A}$ the corresponding upper bound, where $\check{\eta}_{c,2k}(d_1, d_2)$ is the upper bound given in Lemma F.5. To prove this lemma, it suffices to bound $\check{A}$, $B$, and $C$, where $\tilde{w}_d = O(\frac{k^{2d}}{d!\,n^d})$ is the rough upper bound of $w_d$ in (25). We need to quantify two parts: the coefficients $w_{d_1} w_{d_2}$ and the covariance $\eta^2_{c,2k}(d_1, d_2)$. Let us fix one $c \le T_1$ and first quantify $\eta^2_{c,2k}(d_1, d_2)$. By Lemmas F.5 and F.6, we have $\eta^2_{c,2k}(d_1, d_2) = O(F^{(k)}_c)$ for $c \le 2k$ and $d_1, d_2 \le k$.
By Proposition F.2, we have $A = \sigma^2_{c,2k,(T_2)} = O(\frac{k^2}{n^2}F^{(k)}_c)$. Since $\check{A}$ is the upper bound of $A$ given in Proposition F.2, $\check{A} = O(\frac{k^2}{n^2}F^{(k)}_c)$. For $B$ and $C$, we upper bound $\eta^2_{c,2k}(d_1, d_2)$ by $O(F^{(k)}_c)$ in Equation (108). Hence, we can reduce the analysis of both the coefficients and the covariance to an analysis of the coefficients $w_d$ alone. To be more specific, it remains to bound the coefficient sums, where $\tilde{w}_d = O(\frac{k^{2d}}{d!\,n^d})$ is the rough upper bound of $w_d$ in (25). For $\sum_{d=1}^{T_2} w_d$: by Equation (25), each $w_d = [1 + o(1)]\frac{k^{2d}}{d!\,n^d} \le [1 + o(1)]\frac{k^{2d}}{n^d}$. The common ratio of the geometric decay is $k^2/n = o(1)$. Therefore, the first term $w_1$ dominates $\sum_{d=1}^{T_2} w_d$. For the tail, $O(\frac{k^{2d}}{n^d}) = O\big((\frac{k^2}{n})^{T_2+1}\big)$.

I Low Order Kernel h Illustration
The purpose of this appendix is two-fold. First, we present a simple analysis of ratio consistency under an oversimplified kernel $h$, which illustrates the basic idea of utilizing the double U-statistic structure. Secondly, we discuss the gap between a simple kernel and a general kernel $h$ in Appendix I.3, which motivates imposing our assumptions on $\mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$ (see the assumptions in Section 4.2).
LEMMA I.2. Under the conditions in Theorem I.1, we have $E(\hat{V}_u) = \Omega(\frac{1}{n})$ and $\mathrm{Var}(\hat{V}_u) = O(\frac{1}{n^3})$, where $\Omega$ indicates an asymptotic lower bound (see Table 4).
We remark that we are able to generalize Lemma I.2 from the linear-average kernel $h$ to intrinsically low-order kernels $h$ (Definition I.5). After the proof of Lemma I.2, we summarize the benefits of the low-order structure of the kernel $h$. Then, we present the potential difficulties and solutions when the low-order kernel assumption no longer holds.

REMARK I.4. An alternative approach to bound $\mathrm{Var}(\hat{V}_u)$ is to further simplify $\hat{V}_u$ as an order-2 U-statistic: $\hat{V}_u = [1 + o(1)]\,\frac{1}{n}\big(\overline{X^2}_{(n)} - \bar{X}^2_{(n)}\big)$. We present the analysis in Appendix I.4. The insight is that the low-order structure of the kernel $h$ implies the low-order structure of $\hat{V}_u$.
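For the linear-average kernel, the collapse that drives Lemma I.2 can be verified directly: the complete U-statistic equals the sample mean for any $k$, so $\mathrm{Var}(U_n) = \sigma^2/n$. A sketch we added (small $n$ and $k$ so that all $\binom{n}{k}$ subsamples can be enumerated):

```python
import itertools
import random

random.seed(1)
n, k = 9, 4
X = [random.gauss(0, 1) for _ in range(n)]

def h(S):
    return sum(S) / len(S)   # the linear-average kernel

vals = [h(c) for c in itertools.combinations(X, k)]
U_n = sum(vals) / len(vals)
print(abs(U_n - sum(X) / n) < 1e-12)  # complete U-statistic == sample mean
```

Each $X_i$ appears in $\binom{n-1}{k-1}$ subsamples, and $\binom{n-1}{k-1}/\binom{n}{k} = k/n$, which is why the average of subsample means is exactly the sample mean.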
The remaining part of this section is organized as follows. In Section I.2, we discuss the generalization of this 3-step analysis to low-order kernels $h$ (see Definition I.5). In Section I.3, we show the issues of adapting the 3-step analysis to a general kernel $h$ (without any low-order structure); then, we discuss the motivation for new tools to overcome this difficulty.

I.2 Kernel h with Low Order Structure
The analysis in step 1 depends only on the double U-statistic structure of $\hat{V}_u$, so it works with any kernel $h$. To generalize Theorem I.1, it remains to discuss the analysis in steps 2 and 3.
For example, if we fix $L = 1$ and $g^{(1)}$ as the identity map, then $h$ becomes the linear-average kernel discussed in the last section: $h(X_1, \ldots, X_k) = \frac{1}{k}\sum_{i=1}^{k} X_i$. Suppose $h(X_1, \ldots, X_k)$ has the structure in Definition I.5. In step 2, similar to Equations (122) and (123), we can specify the form of $(\phi_d - \phi_0)(S^{(2k)})$ by plugging in $h$'s low-order structure. Therefore, we are still able to obtain cancellations resulting in the order $\frac{\mathrm{poly}(d)}{k^2}$. In step 3, the additional $O(\frac{1}{n})$ rate comes from the variance of a fixed-order U-statistic. Similarly, since the summand in $\hat{V}_u$ is of the form $h(S_1)h(S_2)$, we can verify that $\hat{V}_u$ is a linear combination of U-statistics with kernel order up to $2L$, where $L$ is the low-order parameter in Definition I.5. Therefore, we reduce an infinite-order U-statistic to finite-order U-statistics. Note that we only discuss the intrinsically low-order kernel cases without proof because they are trivial extensions of the previous arguments.

I.3 Discussion about General Kernel h
For a general kernel $h$ without low-order structure, we show the difficulties in the 3-step analysis, which motivate our new strategies, including the assumptions (Appendix B) and technical lemmas (Appendix F).
The first difficulty is that $\phi_d(S^{(2k)}) - \phi_0(S^{(2k)})$ no longer has a simple expression, since $h(S)$ does not have an explicit low-order form. However, we still expect that a smaller $d$ implies a smaller difference between $\phi_d$ and $\phi_0$. Our remedy is to quantify the implicit cancellation in the covariance $\eta^2_{c,2k}(d_1, d_2) := \mathrm{Cov}[\phi_{d_1}(S_1^{(2k)}), \phi_{d_2}(S_2^{(2k)})]$ (see Equation (27)). The further decomposition of $\eta^2_{c,2k}(d_1, d_2)$ involves the following covariance (29): $\rho := \mathrm{Cov}[h(S_1)h(S_2), h(S_3)h(S_4)]$.
As demonstrated in Section 4.2, $\rho$ is determined by 11 free parameters. Assumption 3 reduces the degrees of freedom from 11 to 9, which enables us to further reduce them to 6 in the proof of Lemma F.5 and to bound $\eta^2_{c,2k}(d_1, d_2) = O(F^{(k)}_c/k^2)$ for finite $c, d_1, d_2$. Note that a relaxation of Assumption 3 is presented in Appendix B.
Here $|S^{(2k)}| = 2k$, and we borrow the proof of Proposition I.6 by replacing $n$ with $2k$. In other words, we get the coefficient $O(\frac{1}{k})$ because this is the variance of a fixed-order kernel in a $2k$-sample U-statistic. We then have the displayed bound.

J.1 Ground Truth in the Simulation
As mentioned in Section 6.1, we simulate the ground truth of the expectation of forest predictions, $E(f(x^*))$, and the variance of forest predictions, $\mathrm{Var}(f(x^*))$, using 10000 simulations. Since the variance estimators are produced by different packages, we use the corresponding package to generate each ground truth. The results for the central testing sample (see Section 6) are presented in Tables 6 and 7. We observe a small difference between packages even though similar tuning parameters are used to train the random forests.
In addition, we present the "oracle" CI coverage rate in Table 8, which matches $1 - \alpha$. To construct these CIs, we still use the random forest prediction over 1000 simulations but replace the estimated variance with the "true variance" $\mathrm{Var}(f(x^*))$. This result also supports the normality of the random forest predictor. Figure 6 shows the performance of different methods on the MLR model; it is the counterpart of Figure 2 in Section 6. Table 9 describes the covariates of the Airbnb data in Section 7. We use the samples with price falling in the interval $(0, 500]$ dollars. The missing values (NA) in the rating score and bathroom number are imputed. The "having rating" covariate is created based on the "review number".

K Additional Information and Results on the Real Data
To train the random forest model, we set mtry (the number of variables randomly sampled as candidates at each split) to 3 and the nodesize parameter to 36. Here we also present the details of the testing samples. The latitude and longitude of SEA Airport, Seattle downtown, and Mercer Island are (47.4502, -122.3088), (47.6050, -122.3344), and (47.5707, -122.2221), respectively. The "room type", "accommodates", and "having a rating" covariates are fixed as "Entire home/apt", double the "bedroom number", and 1, respectively. We use the training-data averages as the values of "reviews number" and "rating score".

FIG 6. A comparison of different methods on MLR data. Each column of the figure panel corresponds to one tree size: k = n/2, n/4, n/8. First row: boxplots of relative variance estimates for the central test sample over 1000 simulations; the diamond symbol indicates the mean. Second row: boxplots of 90% CI coverage for 50 testing samples; for each method, three side-by-side boxplots represent nTrees of 2000, 10000, 20000. Third row: the coverage rate averaged over 50 testing samples with 20000 nTrees, with the confidence level (x-axis) ranging from 80% to 95%; the black reference line y = x indicates the desired coverage rate.

Table 9 covariate descriptions (fragment): "reviews number": the number of reviews of the unit. "having a rating": 1 if the number of reviews is greater than 0, and 0 otherwise. "rating score": the average rating score; NA is replaced by the average score.