Kaplan-Meier V- and U-statistics

In this paper, we study Kaplan-Meier V- and U-statistics respectively defined as $\theta(\widehat{F}_n)=\sum_{i,j}K(X_{[i:n]},X_{[j:n]})W_iW_j$ and $\theta_U(\widehat{F}_n)=\sum_{i\neq j}K(X_{[i:n]},X_{[j:n]})W_iW_j/\sum_{i\neq j}W_iW_j$, where $\widehat{F}_n$ is the Kaplan-Meier estimator, $\{W_1,\ldots,W_n\}$ are the Kaplan-Meier weights and $K:(0,\infty)^2\to\mathbb R$ is a symmetric kernel. As in the canonical setting of uncensored data, we differentiate between two asymptotic behaviours for $\theta(\widehat{F}_n)$ and $\theta_U(\widehat{F}_n)$. Additionally, we derive an asymptotic canonical V-statistic representation of the Kaplan-Meier V- and U-statistics. By using this representation we study properties of the asymptotic distribution. Applications to hypothesis testing are given.


Introduction
Let $F$ be a distribution on $(0,\infty)$ of interest. In this paper we study the estimation of parameters of the form
$$\theta(F)=\int_0^{\infty}\int_0^{\infty}K(x,y)\,dF(x)\,dF(y),$$
where $K:(0,\infty)^2\to\mathbb R$ is a measurable and symmetric function, commonly known as a kernel function. Given i.i.d. samples $X_1,\ldots,X_n$ from $F$, the standard estimators of $\theta(F)$ are the canonical V- and U-statistics,
$$\theta(\widetilde F_n)=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^nK(X_i,X_j)\quad\text{and}\quad\theta_U(\widetilde F_n)=\frac{1}{n(n-1)}\sum_{i\neq j}K(X_i,X_j),$$
respectively, where $\widetilde F_n(t)=\frac1n\sum_{i=1}^n1_{\{X_i\le t\}}$ denotes the empirical distribution. Under some regularity conditions, $\theta(\widetilde F_n)\to\theta(F)$ and $\theta_U(\widetilde F_n)\to\theta(F)$ almost surely as $n$ approaches infinity, and it is of interest to study the limit distribution of the differences $\theta(\widetilde F_n)-\theta(F)$ and $\theta_U(\widetilde F_n)-\theta(F)$. The standard theory of V- and U-statistics distinguishes between two asymptotic behaviours for the distribution of these errors as the number of samples tends to infinity. These two asymptotic regimes, widely known as non-degenerate and degenerate, are characterised by the behaviour of the variance of the projection $\varphi:\mathbb R_+\to\mathbb R$, defined as
$$\varphi(x)=\mathbb E(K(x,X_2))=\int_0^{\infty}K(x,y)\,dF(y).\qquad(1.1)$$
We are in the non-degenerate regime if $\operatorname{Var}(\varphi(X_1))>0$, where it holds that
$$\sqrt n\,(\theta_U(\widetilde F_n)-\theta(F))\overset{D}{\to}N\bigl(0,4\operatorname{Var}(\varphi(X_1))\bigr).$$
On the other hand, we are in the degenerate regime if $\operatorname{Var}(\varphi(X_1))=0$, where it holds that
$$n(\theta_U(\widetilde F_n)-\theta(F))\overset{D}{\to}\sum_{i=1}^{\infty}\lambda_i(\xi_i^2-1),$$
where $\lambda_1,\lambda_2,\ldots$ are constants in $\mathbb R$, and $\xi_1,\xi_2,\ldots$ are independent standard Gaussian random variables. Similar results hold for V-statistics under slightly stronger assumptions. Indeed, if the extra condition $\mathbb E(|K(X_1,X_1)|)<\infty$ holds, then the exact same limit is obtained in the non-degenerate regime, while for the degenerate regime,
$$n(\theta(\widetilde F_n)-\theta(F))\overset{D}{\to}\mathbb E(K(X_1,X_1))+\sum_{i=1}^{\infty}\lambda_i(\xi_i^2-1).$$
We refer to the book of Koroljuk and Borovskich [1994] for a comprehensive account of the theory of V- and U-statistics.
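As a concrete illustration (our own sketch, not part of the original development), the following computes both canonical statistics for the kernel $K(x,y)=xy$, for which $\theta(F)=\mathbb E(X)^2$; with $F=\mathrm{Exp}(1)$ the target value is $1$:

```python
import numpy as np

def v_statistic(x, kernel):
    """Canonical V-statistic: (1/n^2) * sum_{i,j} K(X_i, X_j)."""
    n = len(x)
    gram = kernel(x[:, None], x[None, :])      # n x n matrix of K(X_i, X_j)
    return gram.sum() / n**2

def u_statistic(x, kernel):
    """Canonical U-statistic: (1/(n(n-1))) * sum_{i != j} K(X_i, X_j)."""
    n = len(x)
    gram = kernel(x[:, None], x[None, :])
    return (gram.sum() - np.trace(gram)) / (n * (n - 1))

rng = np.random.default_rng(0)
x = rng.exponential(size=2000)
kernel = lambda a, b: a * b                    # K(x, y) = x*y, theta(F) = E(X)^2 = 1
v, u = v_statistic(x, kernel), u_statistic(x, kernel)
print(v, u)                                    # both close to 1; V and U differ by O(1/n)
```

The V- and U-statistics differ only through the diagonal terms, so their gap is of order $1/n$.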
In this paper, we study the analogue of V- and U-statistics in the setting of right-censored data, which usually appears in Survival Analysis applications, in which we observe samples of the form $(X_i,\Delta_i)_{i=1}^n$, where $X_i\in(0,\infty)$ and $\Delta_i\in\{0,1\}$. Here, $\Delta_i=1$ indicates that $X_i$ is an actual sample from $F$, while $\Delta_i=0$ indicates that $X_i$ corresponds to a right-censored observation. Similar to the uncensored setting, we are interested in estimating $\theta(F)$; however, as the data is right-censored, it is not possible to compute the canonical V- and U-statistics, since the empirical distribution $\widetilde F_n$ is not available in this setting. Instead, we propose to replace the empirical distribution by the Kaplan-Meier estimator $\widehat F_n$, which is the standard estimator of $F$ in the setting of right-censored data.
The Kaplan-Meier V- and U-statistics are defined as
$$\theta(\widehat F_n)=\sum_{i=1}^n\sum_{j=1}^nW_iW_jK(X_{[i:n]},X_{[j:n]})\quad\text{and}\quad\theta_U(\widehat F_n)=\frac{\sum_{i\neq j}W_iW_jK(X_{[i:n]},X_{[j:n]})}{\sum_{i\neq j}W_iW_j},$$
respectively, where the $W_i$ are the so-called Kaplan-Meier weights and $X_{[i:n]}$ denotes the $i$-th order statistic. In this paper we conveniently write $\theta(\widehat F_n)$ as
$$\theta(\widehat F_n)=\int_0^{\infty}\int_0^{\infty}K(x,y)\,d\widehat F_n(x)\,d\widehat F_n(y),$$
which is known in the literature as a Kaplan-Meier double integral. Asymptotic properties of Kaplan-Meier integrals, starting from the simplest case of univariate functions, have been studied by several authors; central limit theorems for multiple Kaplan-Meier integrals were studied by Gijbels and Veraverbeke [1991] and by Bose and Sen [2002]. Gijbels and Veraverbeke [1991] studied a simplification of the problem which considers the class of truncated Kaplan-Meier integrals $\int_0^t\cdots\int_0^tK(x_1,\ldots,x_m)\prod_{j=1}^md\widehat F_n(x_j)$, where $t$ is a fixed value, avoiding integration over the whole support of the observations. Then, by using an asymptotic i.i.d. sum representation of the Kaplan-Meier estimator $\widehat F_n$ together with integration by parts, the authors derive an almost sure canonical V-statistic representation of $\theta(\widehat F_n)-\theta(F)$ up to an error of order $O(n^{-1}\log n)$. While this result allows them to derive limiting distributions in the non-degenerate case (by scaling by $\sqrt n$), it is not possible to obtain results for the degenerate case, since the error is too large to be scaled by $n$. Moreover, their representation is restricted to continuous distribution functions $F$. Bose and Sen [2002] analysed Kaplan-Meier double integrals in a more general setting by using a generalisation of the i.i.d. representation of the Kaplan-Meier estimator derived by Stute [1995] for one-dimensional Kaplan-Meier integrals. By using this representation, Bose and Sen [2002] were able to write the Kaplan-Meier double integral as a V-statistic plus some error terms.
Nevertheless, similar to the univariate case, the error terms that appear as a consequence of using this approximation are quite complicated, and dealing with them requires very strong and somewhat artificial conditions which are hard to verify in practice.
An alternative estimator of $\theta(F)$ in the right-censored setting is obtained by using the so-called inverse probability of censoring weighted (IPCW) estimator of $F$ introduced by Satten and Datta [2001], which can be seen as a simplification of the Kaplan-Meier estimator. IPCW U-statistics coincide with Kaplan-Meier U-statistics when the survival times are continuous and the largest sample point is uncensored [Satten and Datta, 2001]. IPCW U-statistics were studied by Datta et al. [2010]; however, they only provide results for the non-degenerate regime.
There are several works that study limit distributions of V- and U-statistics in the setting of dependent data (see Sen [1972/73], Denker and Keller [1986], Yoshihara [1976], Dewan and Prakasa Rao [2002], Dehling and Wendler [2010], Zähle [2012, 2014]). Nevertheless, most of these results are tailored to specific types of dependency, and thus it is either unsuitable or unclear how to translate them into our setting. The recent approach of Beutner and Zähle [2014] provides a general framework which can be applied to right-censored data; however, its application is limited to very well-behaved cases. This is mainly because such an approach is based on an integration-by-parts argument, requiring the function $K$ to be locally of bounded variation, and thus ruling out simple kernels like $K(x,y)=1_{\{x+y>1\}}$. Also, it requires establishing convergence of $\sqrt n(\widehat F_n-F)$ to a limit process (under an appropriate metric), leading to stronger conditions than the ones considered in our approach. Moreover, such a general result is less informative about the limit distribution than ours.
In this paper, we obtain limit results for Kaplan-Meier V-statistics. Our proof is based on two steps. First, we find an asymptotic canonical V-statistic representation of the Kaplan-Meier V-statistic, and second, we use such a representation to obtain limit distributions under an appropriate normalisation. We also obtain similar results for Kaplan-Meier U-statistics. Our results not only provide convergence to the limit distribution, but we also find closed-form expressions for the asymptotic mean and variance.
Applications to goodness-of-fit are provided. In particular, we study a slight modification of the Cramér-von Mises statistic under right-censoring that can be represented as a Kaplan-Meier V-statistic. Under the null hypothesis, we find its asymptotic null distribution, and we obtain closed-form expressions for the asymptotic mean and variance under a specific censoring distribution. Our results agree with those obtained by Koziol and Green [1976]. We also provide an application to hypothesis testing using the Maximum Mean Discrepancy (MMD), a popular distance between probability measures frequently used in the Machine Learning community. Under the null hypothesis, and assuming tractable forms for $F$ and $G$, we obtain the asymptotic limit distribution, as well as the asymptotic mean and variance of the test-statistic.
Our results hold under conditions that are quite reasonable, in the sense that they require integrability of terms that are very close to the variance of the limit distribution. Compared to the closest works to ours, the approach of Bose and Sen [2002] and the IPCW approach [Datta et al., 2010], our conditions are much weaker and easier to verify. We explicitly compare these conditions in Section 2.3.

Notation
We establish some general notation that will be used throughout the paper. We denote $\mathbb R_+=(0,\infty)$. For an arbitrary right-continuous function $f:\mathbb R_+\to\mathbb R_+$, we write $f(x-)$ for its left limit at $x$ and $\Delta f(x)=f(x)-f(x-)$ for its jump at $x$. In this work we make use of standard asymptotic notation Janson [2011] (e.g. $O_p$, $o_p$, $\Theta_p$, etc.) with respect to the number of sample points $n$. Given a sequence of stochastic processes $(W_n(x):x\in\mathcal X)$, depending on the number of observations $n$, and a function $f(x)>0$, we say that $W_n(x)=O_p(f(x))$ uniformly on $x\in A_n$ if and only if $\sup_{x\in A_n}|W_n(x)|/f(x)=O_p(1)$, where $A_n\subseteq\mathcal X$ is a set that may depend on $n$.

Right-censored data
Right-censored data consists of pairs $((X_i,\Delta_i))_{i=1}^n$, where $X_i=\min(T_i,C_i)$ is the minimum between the survival time of interest $T_i\overset{\text{i.i.d.}}{\sim}F$ and the censoring time $C_i\overset{\text{i.i.d.}}{\sim}G$, and $\Delta_i=1_{\{T_i\le C_i\}}$ indicates whether we actually observe the survival time $T_i$ or not; that is, $\Delta_i=1$ if $T_i\le C_i$, and $\Delta_i=0$ otherwise. We assume the survival times $T_i$ are independent of the censoring times $C_i$, which is known as non-informative censoring and is a standard assumption in applications.
We denote by $S(x)=1-F(x)$ and $\Lambda(x)=\int_{(0,x]}S(t-)^{-1}\,dF(t)$, respectively, the survival and cumulative hazard functions associated with the survival times $T_i$. The common distribution function associated with the observed right-censored times $X_i=\min\{T_i,C_i\}$ is denoted by $H$. Note that $1-H(x)=(1-G(x))S(x)$ due to the non-informative censoring assumption. For simplicity, we assume that $F$ and $G$ are measures on $\mathbb R_+$; otherwise, we can apply an increasing transformation to the random variables, e.g. $e^{X_i}$. Notice that we do not impose any further restriction on the distribution functions; in particular, $F$ and $G$ are allowed to share discontinuity points.

Kaplan-Meier estimator
The Kaplan-Meier estimator [Kaplan and Meier, 1958] is the non-parametric maximum likelihood estimator of $F$ in the setting of right-censored data. It is defined as $\widehat F_n(t)=\sum_{i=1}^nW_i1_{\{X_{[i:n]}\le t\}}$, where
$$W_i=\frac{\Delta_{[i:n]}}{n-i+1}\prod_{j=1}^{i-1}\left(\frac{n-j}{n-j+1}\right)^{\Delta_{[j:n]}}$$
are the so-called Kaplan-Meier weights, $X_{[i:n]}$ is the $i$-th order statistic of the sample $X_1,\ldots,X_n$, and $\Delta_{[i:n]}$ is its corresponding censoring indicator. To be precise, ties within uncensored times or within censored times are ordered arbitrarily, while ties among uncensored and censored times are ordered such that censored times appear later. Observe that when all the observations are uncensored, that is, when $\Delta_i=1$ for all $i$, each weight $W_i$ equals $1/n$ and thus $\widehat F_n$ becomes the empirical distribution of $T_1,\ldots,T_n$. Finally, we denote by $\widehat S_n(x)=1-\widehat F_n(x)$ the corresponding estimator of $S(x)$.
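The weights can be computed directly from the product formula above. The following sketch (our own implementation; the tie-breaking convention follows the text) recovers the uniform weights $1/n$ in the uncensored case:

```python
import numpy as np

def km_weights(x, delta):
    """Kaplan-Meier weights: W_i is the jump of the KM estimator at X_[i:n].

    Ties between censored and uncensored times are broken so that censored
    observations appear later, as in the text.
    """
    n = len(x)
    order = np.lexsort((1 - delta, x))   # sort by x; uncensored first at ties
    d = delta[order].astype(float)
    w, surv = np.empty(n), 1.0           # surv = prod_{j<i} ((n-j-1)/(n-j))^d_j
    for i in range(n):
        w[i] = d[i] / (n - i) * surv
        surv *= ((n - i - 1) / (n - i)) ** d[i]
    return x[order], d, w

# Sanity check: with no censoring every weight is 1/n (empirical distribution).
xs, d, w = km_weights(np.array([3.0, 1.0, 2.0]), np.array([1, 1, 1]))
print(w)   # [1/3, 1/3, 1/3]
```

With censoring the weights are no longer uniform; e.g. for observations $(1,\Delta{=}1)$, $(2,\Delta{=}0)$, $(3,\Delta{=}1)$ the jumps are $1/3$, $0$ and $2/3$.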

Counting processes notation
In this work we use standard Survival Analysis/counting processes notation. For each $i\in\{1,\ldots,n\}$ we define the individual and pooled counting processes by $N_i(t)=1_{\{X_i\le t\}}\Delta_i$ and $N(t)=\sum_{i=1}^nN_i(t)$, respectively. Notice that the previous processes are indexed by $t\ge0$. Similarly, we define the individual and pooled risk processes by $Y_i(t)=1_{\{X_i\ge t\}}$ and $Y(t)=\sum_{i=1}^nY_i(t)$. We assume that all our random variables are defined on a common filtered probability space $(\Omega,\mathcal F,(\mathcal F_t)_{t\ge0},\mathbb P)$, where $\mathcal F_t$ is generated by the observed information up to time $t$, $\{(1_{\{X_i\le s,\Delta_i=0\}},1_{\{X_i\le s,\Delta_i=1\}}):s\le t,\,i\in\{1,\ldots,n\}\}$, and the $\mathbb P$-null sets of $\mathcal F$. It is well-known that $N_i(t)$, $N(t)$ and the Kaplan-Meier estimator $\widehat F_n(t)$ are adapted to $(\mathcal F_t)_{t\ge0}$, and that $Y_i(t)$ and $Y(t)$ are predictable processes with respect to $(\mathcal F_t)_{t\ge0}$. Yet another well-known fact is that $N_i(t)$ is increasing and its compensator is given by $\int_{(0,t]}Y_i(s)\,d\Lambda(s)$. We define the individual and pooled $(\mathcal F_t)$-martingales by $M_i(t)=N_i(t)-\int_{(0,t]}Y_i(s)\,d\Lambda(s)$ and $M(t)=\sum_{i=1}^nM_i(t)$. For a martingale $W$, we denote by $\langle W\rangle$ its predictable variation process, and by $[W]$ its quadratic variation process.
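The compensator identity can be illustrated numerically. In the sketch below (our own example: $T\sim\mathrm{Exp}(1)$, so $\Lambda(s)=s$, with censoring $C\sim\mathrm{Exp}(a)$; these distributional choices are assumptions made only for the illustration), the sample mean of $M_1(t)=N_1(t)-\int_0^tY_1(s)\,d\Lambda(s)$ is close to zero, as the martingale property predicts:

```python
import numpy as np

# E[M_1(t)] = 0 for M_1(t) = N_1(t) - int_0^t Y_1(s) dLambda(s).
# With T ~ Exp(1), Lambda(s) = s, so the compensator is Lambda(min(X, t)) = min(X, t).
rng = np.random.default_rng(1)
n, a, t = 100_000, 0.5, 1.5
T = rng.exponential(1.0, n)
C = rng.exponential(1.0 / a, n)
X, Delta = np.minimum(T, C), (T <= C).astype(float)

N_t = Delta * (X <= t)            # N_1(t) = 1{X <= t, Delta = 1}
compensator_t = np.minimum(X, t)  # int_0^t Y_1(s) dLambda(s)
M_t = N_t - compensator_t
print(M_t.mean())                 # close to 0
```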
Due to the simple nature of the processes that appear in this work, i.e. counting processes, checking integrability and/or square-integrability is very simple, and thus we state these properties without explicit proof. For more information about counting processes in survival analysis we refer to Fleming and Harrington [1991].

Interior and Exterior regions
Let $I_H=\{t>0:H(t-)<1\}$ be the interval in which $X_i=\min\{T_i,C_i\}$ takes values. Define $\tau=\tau_H=\sup\{t:1-H(t)>0\}$ and notice that $\tau\in I_H$ if and only if $H$ has a discontinuity at $\tau$. Define $\tau_n=\max\{X_1,\ldots,X_n\}$. We denote by $I=(0,\tau_n]$ the interior region, in which we observe data points, and by $E=I_H\setminus I$ the exterior region. Notice that both $I$ and $E$ depend on $\tau_n$ even if we do not write it explicitly.
In this work, the integral symbol $\int_a^b$ means integration over the interval $(a,b]$, unless we state otherwise. An exception to this rule is when we integrate over the interval $I_H$, in which case, instead of writing $\int_{I_H}$, we write $\int_0^{\tau}$. Lastly, let $g:\mathbb R_+\to\mathbb R$ be an arbitrary function; then $\int_0^{\tau_n}g(x)\,d\widehat F_n(x)$ denotes integration over $(0,\tau_n]$, so that the jump of $\widehat F_n$ at $\tau_n$ (if any) is included. The same holds when integrating with respect to the martingale $M(t)$.

Efron and Johnstone's Forward Operator
We consider the forward operator $A:L^2(F)\to L^2(F)$, independently introduced by Ritov and Wellner [1988] and Efron and Johnstone [1990], defined by
$$(Ag)(x)=g(x)-\frac{1}{S(x)}\int_x^{\tau}g(s)\,dF(s).$$
For a bivariate function $g:\mathbb R_+^2\to\mathbb R$ such that $g\in L^2(F\times F)$, we denote by $A_i$ the operator $A$ applied only to the $i$-th coordinate of the function $g$, e.g. $(A_2g)(x,y)=g(x,y)-\frac{1}{S(y)}\int_y^{\tau}g(x,s)\,dF(s)$. Note that $A_1$ and $A_2$ commute. Similarly, for $g:\mathbb R_+\to\mathbb R$ such that $g\in L^2(F)$, we define $\widehat A:L^2(F)\to L^2(F)$ as
$$(\widehat Ag)(x)=g(x)-\frac{1}{S(x)}\int_x^{\tau_n}g(s)\,dF(s)\quad\text{for }x\le\tau_n,\qquad(\widehat Ag)(x)=g(x)\quad\text{for }x>\tau_n.$$
Observe that the difference between $\widehat A$ and the forward operator $A$ is the upper limit of integration. For bivariate functions, we define $\widehat A_i$ as the operator $\widehat A$ applied only to the $i$-th coordinate.
Notice that for $x\le\tau_n$,
$$(\widehat Ag)(x)=(Ag)(x)+\frac{1}{S(x)}\int_{\tau_n}^{\tau}g(s)\,dF(s).\qquad(1.2)$$
Also, notice that if $g(x)=1$ is a constant function, then $(\widehat Ag)(x)=S(\tau_n)/S(x)$ for $x\le\tau_n$. Finally, observe that the definitions of $A$ and $\widehat A$ depend only on $F$ and do not involve the censoring distribution $G$.

Main Results
The Kaplan-Meier V-statistic associated with $\theta(F)$ is defined by
$$\theta(\widehat F_n)=\sum_{i=1}^n\sum_{j=1}^nW_iW_jK(X_{[i:n]},X_{[j:n]})=\int_0^{\tau}\int_0^{\tau}K(x,y)\,d\widehat F_n(x)\,d\widehat F_n(y),$$
where the second equality follows from the definition of the Kaplan-Meier estimator $\widehat F_n$. Bose and Sen [1999] proved that
$$\theta(\widehat F_n)\overset{a.s.}{\to}\theta(F;\tau):=\int_0^{\tau}\int_0^{\tau}K(x,y)\,dF(x)\,dF(y)$$
as $n$ approaches infinity. Notice that the limit $\theta(F;\tau)$ has a dependency on $\tau$ since the data is right-censored, and thus we do not observe any survival time beyond the time $\tau$. Consider the difference $\theta(\widehat F_n)-\theta(F;\tau)$, and notice it can be decomposed into two error terms,
$$\theta(\widehat F_n)-\theta(F;\tau)=2\alpha(F,\widehat F_n)+\beta(F,\widehat F_n),$$
where, with the projection $\varphi$ given by
$$\varphi(x)=\int_0^{\tau}K(x,y)\,dF(y),\qquad(2.2)$$
the error terms are
$$\alpha(F,\widehat F_n)=\int_0^{\tau}\varphi(x)\,d(\widehat F_n-F)(x),\qquad\beta(F,\widehat F_n)=\int_0^{\tau}\int_0^{\tau}K(x,y)\,d(\widehat F_n-F)(x)\,d(\widehat F_n-F)(y).\qquad(2.3)$$
Note that in our setting, the definition of $\varphi$ integrates up to $\tau$, instead of over the whole support of $F$ as in Equation (1.1) used in the uncensored setting. To ease notation, we write $\alpha$ and $\beta$ instead of $\alpha(F,\widehat F_n)$ and $\beta(F,\widehat F_n)$, respectively. The error terms $\alpha$ and $\beta$ can be seen as first- and second-order approximations of the difference $\theta(\widehat F_n)-\theta(F;\tau)$. That being said, we expect the error term $\alpha$ to be of a much larger order than $\beta$; indeed, it holds that $\alpha=O_p(n^{-1/2})$ and $\beta=O_p(n^{-1})$. This suggests the use of two different scaling factors, splitting our main result into two cases: the non-degenerate and the degenerate case. In the first case, we will show that, under appropriate conditions, $\sqrt n(\theta(\widehat F_n)-\theta(F;\tau))$ converges in distribution to a zero-mean normal random variable as $n$ approaches infinity. This result will follow from proving a normal limit distribution for the scaled error term $\sqrt n\alpha$ and an in-probability convergence to zero of the scaled error $\sqrt n\beta$. In the degenerate case, the error term $\alpha$ is trivially $0$, and thus we only care about the term $\beta$. We will show that $n\beta$ converges in distribution to a linear combination of (potentially infinitely many) independent $\chi^2$ random variables plus a constant. From these results, we will be able to derive analogues of the results in the canonical V-statistics setting.
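To make the object $\theta(\widehat F_n)$ concrete, here is a small numerical sketch (ours, not from the paper): with $K(x,y)=\min(x,y)$, $T\sim\mathrm{Exp}(1)$ and light exponential censoring, $\theta(F;\tau)=\mathbb E[\min(T_1,T_2)]=1/2$, and the Kaplan-Meier V-statistic should be close to this value:

```python
import numpy as np

def km_weights(x, delta):
    """Jumps of the Kaplan-Meier estimator (censored ties ordered last)."""
    n = len(x)
    order = np.lexsort((1 - delta, x))
    d = delta[order].astype(float)
    w, surv = np.empty(n), 1.0
    for i in range(n):
        w[i] = d[i] / (n - i) * surv
        surv *= ((n - i - 1) / (n - i)) ** d[i]
    return x[order], w

def km_v_statistic(x, delta, kernel):
    """theta(F_hat_n) = sum_{i,j} W_i W_j K(X_[i:n], X_[j:n])."""
    xs, w = km_weights(x, delta)
    gram = kernel(xs[:, None], xs[None, :])
    return w @ gram @ w

rng = np.random.default_rng(2)
n = 4000
T, C = rng.exponential(1.0, n), rng.exponential(5.0, n)   # light censoring
x, delta = np.minimum(T, C), (T <= C).astype(int)
est = km_v_statistic(x, delta, np.minimum)                # K(x, y) = min(x, y)
print(est)   # close to theta(F; tau) = E[min(T1, T2)] = 1/2
```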
To express our results and conditions, we define the kernel $K':\mathbb R_+^2\to\mathbb R$ as $K'(x,y)=(A_1A_2K)(x,y)$, which, by the definition of the operators $A_1$ and $A_2$, is equal to
$$K'(x,y)=K(x,y)-\frac{1}{S(x)}\int_x^{\tau}K(s,y)\,dF(s)-\frac{1}{S(y)}\int_y^{\tau}K(x,s)\,dF(s)+\frac{1}{S(x)S(y)}\int_x^{\tau}\int_y^{\tau}K(s,t)\,dF(s)\,dF(t).\qquad(2.4)$$
We introduce two sets of conditions, one for the non-degenerate case and one for the degenerate case.
Condition 2.1 (non-degenerate case: scaling factor $\sqrt n$). Assume the following conditions hold:

Condition 2.2 (degenerate case: scaling factor $n$). Assume the following conditions hold:

Notice that Condition 2.2 implies Condition 2.1.

Results for Kaplan-Meier V-statistics
i) The non-degenerate case: $\sqrt n$-scaling. Recall that $\alpha$ is defined in terms of the projection $\varphi$ of Equation (2.2). The main result then follows, under Condition 2.1, from a standard application of the central limit theorem derived by Akritas [2000] for univariate Kaplan-Meier integrals, together with proving that $\sqrt n\beta\overset{\mathbb P}{\to}0$. As noticed by Efron and Johnstone [1990], the limit variance $\sigma^2$, given in Equation (2.5), is finite under an integrability condition on $A\varphi$.
ii) The degenerate case: $n$-scaling. The previous result assumes that $\sigma^2>0$. Notice this is not satisfied if $\alpha=0$, as it implies $\sigma^2=0$. In turn, $\alpha=0$ can be deduced from either of the following conditions on the projection $\varphi$ defined in Equation (2.2): i) $\varphi(x)=0$, $F$-a.s., or ii) $\varphi(x)=c$, $F$-a.s., for some non-zero constant $c$, and $\widehat S_n(\tau)=S(\tau)$ a.s. for all $n$ large enough. In the theory of V- and U-statistics, these conditions are known as the degeneracy properties. In such a case, the $\sqrt n$-scaling does not capture the nature of the asymptotic distribution of $\theta(\widehat F_n)-\theta(F;\tau)$, suggesting that we need to consider a larger scaling factor.
Define the kernel $J$ by
$$J((x,r),(y,s))=\int_0^{\tau}\int_0^{\tau}\frac{K'(u,v)}{(1-G(u-))(1-G(v-))}\,dm_{x,r}(u)\,dm_{y,s}(v),$$
where $dm_{x,r}(s)=r\delta_x(s)-1_{\{x\ge s\}}\,d\Lambda(s)$; notice that $dm_{X_i,\Delta_i}=dM_i$, where $M_i$ is the $i$-th individual martingale defined in Section 1.

Theorem 2.4. Under Condition 2.2, $n\beta$ converges in distribution to a limit of the form $c+\Psi$, where $c$ is a constant (the mean of the limit distribution), $\Psi=\sum_{i=1}^{\infty}\lambda_i(\xi_i^2-1)$ with $\xi_1,\xi_2,\ldots\overset{\text{i.i.d.}}{\sim}N(0,1)$, and the $\lambda_i$'s are the eigenvalues associated with the integral operator $T_J:L^2(X,\Delta)\to L^2(X,\Delta)$, defined as
$$(T_Jf)(x,r)=\mathbb E\bigl(J((x,r),(X_1,\Delta_1))f(X_1,\Delta_1)\bigr),$$
where $L^2(X,\Delta)$ is the space of square-integrable functions with respect to the measure induced by $(X_1,\Delta_1)$.

Moreover, $\mathbb E(\Psi)=0$ and $\operatorname{Var}(\Psi)=2\sum_{i=1}^{\infty}\lambda_i^2$. An immediate consequence of the previous theorem is the asymptotic behaviour of the degenerate case for the Kaplan-Meier V-statistic.
Corollary 2.5. Suppose one of the degeneracy conditions i) or ii) above holds. Then, under Condition 2.2, $n(\theta(\widehat F_n)-\theta(F;\tau))$ converges in distribution to the limit described in Theorem 2.4.

As part of the proof, we find an asymptotic representation of $\beta$ as a canonical V-statistic; this representation is as follows.
Theorem 2.6. Under Condition 2.2 it holds that All the results of this section are proved in Section 5 except for Theorem 2.6, which is proved in Section 9.

Kaplan-Meier U-statistics
The Kaplan-Meier U-statistic is defined by
$$\theta_U(\widehat F_n)=\frac{\sum_{i\neq j}W_iW_jK(X_{[i:n]},X_{[j:n]})}{\sum_{i\neq j}W_iW_j}=\frac{\theta(\widehat F_n)-\sum_{i=1}^nW_i^2K(X_{[i:n]},X_{[i:n]})}{\sum_{i\neq j}W_iW_j},$$
where the second equality follows from adding and subtracting the diagonal term $\sum_{i=1}^nW_i^2K(X_{[i:n]},X_{[i:n]})$. Without loss of generality, assume $\theta(F;\tau)=0$. Then, the asymptotic distribution of $\theta_U(\widehat F_n)$ can be related to that of $\theta(\widehat F_n)$ by analysing the asymptotic behaviour of $\sum_{i\neq j}W_iW_j$ and $\sum_{i=1}^nW_i^2K(X_{[i:n]},X_{[i:n]})$. For the first term, Bose and Sen [1999] proved that $\sum_{i\neq j}W_iW_j\overset{a.s.}{\to}F(\tau)^2$. For the second term we state the following result, which is proved in Appendix D.
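The U-statistic differs from the V-statistic only through the removal of the diagonal and the renormalisation by $\sum_{i\neq j}W_iW_j$; a sketch (our own, using the same simulated setting as before, with $K(x,y)=\min(x,y)$ so the target is $1/2$):

```python
import numpy as np

def km_u_statistic(x, delta, kernel):
    """theta_U = (sum_{i!=j} W_i W_j K_ij) / (sum_{i!=j} W_i W_j)."""
    n = len(x)
    order = np.lexsort((1 - delta, x))
    xs, d = x[order], delta[order].astype(float)
    w, surv = np.empty(n), 1.0
    for i in range(n):                       # Kaplan-Meier weights
        w[i] = d[i] / (n - i) * surv
        surv *= ((n - i - 1) / (n - i)) ** d[i]
    gram = kernel(xs[:, None], xs[None, :])
    num = w @ gram @ w - np.sum(w**2 * np.diag(gram))   # drop the diagonal term
    den = np.sum(w) ** 2 - np.sum(w**2)                 # sum_{i != j} W_i W_j
    return num / den

rng = np.random.default_rng(3)
n = 4000
T, C = rng.exponential(1.0, n), rng.exponential(5.0, n)
x, delta = np.minimum(T, C), (T <= C).astype(int)
u = km_u_statistic(x, delta, np.minimum)     # estimates E[min(T1, T2)] = 1/2
print(u)
```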
The previous Lemma, combined with the results obtained in the previous section, allows us to deduce the following results for Kaplan-Meier U-statistics.
Corollary 2.8. Assume Condition 2.1, together with an additional integrability assumption on the diagonal term.

Corollary 2.9. Suppose one of the degeneracy conditions of Corollary 2.5 holds. Assume Condition 2.2 together with the analogous integrability assumption; the degenerate limit is then expressed in terms of $\Psi$, where $\Psi$ is as in Theorem 2.4.

Analysis of Conditions 2.1 and 2.2, and comparison with related works
In this section we discuss our conditions and compare them with the work of Bose and Sen [2002] and Datta et al. [2010]. We begin by analysing Condition 2.1, used in the non-degenerate regime, under which Theorem 2.3 holds. Recall that Efron and Johnstone [1990] showed that the variance of the limit distribution in Theorem 2.3 is finite under an integrability condition which is very close to the term in Equation (2.7); indeed, there is just one Cauchy-Schwarz inequality gap from the condition of Efron and Johnstone [1990], suggesting little room for improvement. On the other hand, Condition 2.1.ii is a standard condition to deal with the diagonal term that appears in the V-statistic representation. It is only used in Lemma 9.1, and it is usually much simpler to verify due to the multiplicative factor $S(x-)$ that appears in the integral, which makes the tail much lighter. We compare our Condition 2.1 with the conditions of Theorem 1 of Bose and Sen [2002], which establishes the same limit result as our Theorem 2.3 under different conditions. Theorem 1 of Bose and Sen [2002] requires our Condition 2.2.i (which implies our Condition 2.1.i), together with three extra conditions involving the function
$$C(x)=\int_0^{x-}\frac{dG(y)}{(1-H(y))(1-G(y))}.\qquad(2.8)$$
For example, one of the extra conditions required is Equation (2.9), which, compared to our Condition 2.1.i, is much harder to satisfy, as the function $C(x)$ grows much faster than $(1-G(x-))^{-1}$ when $x$ approaches infinity. Indeed, by assuming that $G$ and $F$ are continuous distributions, it is not hard to verify that $C(x)$ dominates $(1-G(x))^{-1}$. Therefore, unless the kernel $K(x,y)$ decays very fast, Equation (2.9) is very hard to satisfy. In Example 2.10 below, we show that $C(x)$ can grow exponentially faster than $(1-G(x))^{-1}$.
We continue by comparing our Condition 2.1 with the conditions of Theorem 1 of Datta et al. [2010]. Theorem 1 of Datta et al. [2010] requires the condition of Efron and Johnstone [1990] for finiteness of the variance; however, it also requires a second condition which is very hard to satisfy, as it involves the function $C(x)$ defined in Equation (2.8).
Example 2.10. Consider the kernel $K(x,y)=xy$ with $S(x)=e^{-x}$ and $1-G(x)=e^{-ax}$ for some $a>0$. In this setting, our Conditions 2.1.i and 2.1.ii reduce to explicit integrals which are satisfied for $a<1$. Hence, our conditions are the best possible in this case, as the variance $\sigma^2$ of the limit distribution is finite if and only if $a<1$. The approach of Bose and Sen [2002] requires the finiteness of the expression in Equation (2.9). In this example, $C(x)=\frac{a(e^{(1+a)x}-1)}{1+a}$, and then Equation (2.9) is infinite for all $a>0$. We deduce that Theorem 1 of Bose and Sen [2002] cannot be applied in this setting.
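The closed form of $C(x)$ in this example can be checked by quadrature, assuming $C$ has the Stute-type form $C(x)=\int_0^{x}dG(y)/((1-H(y))(1-G(y)))$ (our reading of Equation (2.8)); the sketch below is ours:

```python
import numpy as np

# Check C(x) = a(e^{(1+a)x} - 1)/(1+a) for S(y) = e^{-y}, 1 - G(y) = e^{-ay},
# where 1 - H = (1 - F)(1 - G) and dG(y) = a e^{-ay} dy.
a, x = 0.7, 2.0
y = np.linspace(0.0, x, 200_001)
integrand = a * np.exp(-a * y) / (np.exp(-(1 + a) * y) * np.exp(-a * y))
numeric = np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(y))  # trapezoid rule
closed = a * (np.exp((1 + a) * x) - 1) / (1 + a)
print(numeric, closed)   # the two values agree
```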
While the first equation is satisfied for a < 1, the second equation cannot be satisfied for any a > 0. Hence, Theorem 1 of Datta et al. [2010] does not hold in this setting.
In the previous example, note that $C(x)=\frac{a(e^{(1+a)x}-1)}{1+a}$ grows exponentially fast with $x$. Therefore, to use the respective theorems of Bose and Sen [2002] and Datta et al. [2010], we would need kernels that decay exponentially fast.
We continue by analysing Condition 2.2, which is used in the degenerate case. Observe that the integral in Condition 2.2.ii is equal to the first moment of the limit distribution of Theorem 2.4, thus this condition cannot be avoided. The variance of the limit distribution in Theorem 2.4 is given by Equation (2.10), and Condition 2.2.i ensures the finiteness of (2.10). Recall that $K'$, defined in Equation (2.4), is obtained by applying the operators $A_1$ and $A_2$ to $K$. From here, if we consider continuous distributions, we observe that the expression in our condition is similar to the variance given in (2.10). If we consider an appropriate kernel $K$, it may happen that some terms in $K'$ cancel each other, resulting in a kernel $K'$ of much smaller order than $K$. An example of this is the kernel $K(x,y)=(x-c)(y-c)$, where $c=\int_0^{\infty}x\,dF(x)$ and $S(x)=e^{-x}$. Note this kernel is similar to that of the previous example, but we subtract $c$ to make it degenerate. In this setting we have $K'(x,y)=1$, hence it is easier to have finite variance than to satisfy our condition. However, in general we do not expect cancellation between the terms in $K'$, and thus $K$ and $K'$ should be of similar order, making our Condition 2.2.i sufficient and necessary.
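The claim $K'(x,y)=1$ can be verified numerically: $K$ factorises as $g(x)g(y)$ with $g(x)=x-c$, and $c=\mathbb E(T)=1$ for $S(x)=e^{-x}$, so $K'(x,y)=(Ag)(x)(Ag)(y)$. The sketch below (ours) checks by quadrature that $(Ag)(x)=-1$:

```python
import numpy as np

# (A g)(x) = g(x) - (1/S(x)) * int_x^tau g(s) dF(s), with dF = e^{-s} ds,
# S(x) = e^{-x}, tau = infinity (truncated at a large upper limit).
def A_g(x, g, upper=60.0, m=400_000):
    s = np.linspace(x, upper, m)
    f = g(s) * np.exp(-s)
    integral = np.sum((f[1:] + f[:-1]) / 2 * np.diff(s))  # trapezoid rule
    return g(x) - integral / np.exp(-x)

g = lambda s: s - 1.0                 # g(x) = x - c with c = E[T] = 1
vals = [A_g(x, g) for x in (0.5, 1.0, 3.0)]
print(vals)                           # each value close to -1, so K'(x, y) = 1
```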
To the best of our knowledge, the work of Bose and Sen [2002] is the only one that establishes results for the degenerate case in a general setting. Compared to their result, our conditions are better, since their Theorem 2 has the same requirements as their Theorem 1, i.e. conditions involving the function $C(x)$, including Equation (2.9), which, as we saw in our previous example, is very hard to satisfy. Indeed, if we repeat Example 2.10 with the kernel $K(x,y)=(x-c)(y-c)$, the conditions of Theorem 2.4 are satisfied for $a<1$ (in which case the asymptotic variance is well-defined), while the conditions of Bose and Sen are not satisfied.

Applications
We give two examples of applications that motivated us to study Kaplan-Meier V-statistics. First, we analyse a slight variation of the Cramér-von Mises statistic that allows us to treat it as a Kaplan-Meier V-statistic. In our second application, we measure goodness-of-fit via the maximum mean discrepancy (MMD), a popular distance between probability measures frequently used in the Machine Learning community.
Example 3.1 (Cramér-von Mises test-statistic). Consider the problem of testing the null hypothesis $H_0:F=F_0$ against the general alternative $H_1:F\neq F_0$. The Cramér-von Mises statistic measures the closeness between $F$ and $F_0$ by computing
$$\int_0^{\infty}(F(t)-F_0(t))^2\,dF_0(t).\qquad(3.1)$$
When $F$ is a probability distribution function, it can be verified that Equation (3.1) equals
$$\theta(F)=\int_0^{\infty}\int_0^{\infty}K(x,y)\,dF(x)\,dF(y),\quad\text{where}\quad K(x,y)=\int_0^{\infty}(1_{\{x\le t\}}-F_0(t))(1_{\{y\le t\}}-F_0(t))\,dF_0(t).\qquad(3.2)$$
Under the null hypothesis $H_0:F=F_0$, we estimate $\theta(F)=0$ by using Equation (3.2), replacing $F$ by the Kaplan-Meier estimator $\widehat F_n$. Then, our test-statistic is
$$\theta(\widehat F_n)=\sum_{i=1}^n\sum_{j=1}^nW_iW_jK(X_{[i:n]},X_{[j:n]}).\qquad(3.3)$$
Notice that the equality between Equations (3.1) and (3.2) is only valid when $F$ is a probability distribution. Unfortunately, the Kaplan-Meier estimator $\widehat F_n$ is not always a probability distribution; indeed, $\widehat F_n$ is a probability distribution if and only if the largest observation is uncensored, thus $\theta(\widehat F_n)$ is slightly different from the Cramér-von Mises test-statistic. Under the null hypothesis, we observe two different asymptotic behaviours of our test-statistic $\theta(\widehat F_n)$, one for $F_0(\tau)<1$ and the other for $F_0(\tau)=1$. To see this, for $x\in I_H$, consider the projection $\varphi$ defined in Equation (2.2), and notice that if $F_0(\tau)<1$, then $\varphi(x)$ does not satisfy the degeneracy condition of Corollary 2.5. Thus, by Theorem 2.3, it holds that $\sqrt n(\theta(\widehat F_n)-\theta(F_0;\tau))$ is asymptotically normally distributed. On the other hand, if $F_0(\tau)=1$, then $\varphi(x)$ satisfies the degeneracy condition of Corollary 2.5; indeed, we have that $\varphi(x)=0$ for all $x\in I_H$. Hence, under Condition 2.2, Corollary 2.5 applies, concluding that $n\theta(\widehat F_n)$ is asymptotically distributed as a weighted sum of i.i.d. $\chi_1^2$ random variables plus some constant term.
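When $F_0$ is continuous, substituting $u=F_0(t)$ in the kernel of Equation (3.2) yields the closed form $K(x,y)=\tfrac13-\max(a,b)+\tfrac{a^2+b^2}{2}$ with $a=F_0(x)$, $b=F_0(y)$. This closed form is our own derivation, checked here by quadrature:

```python
import numpy as np

# K(x, y) = int_0^1 (1{u >= a} - u)(1{u >= b} - u) du, a = F0(x), b = F0(y).
def K_quad(a, b, m=200_001):
    u = np.linspace(0.0, 1.0, m)
    f = ((u >= a).astype(float) - u) * ((u >= b).astype(float) - u)
    return np.sum((f[1:] + f[:-1]) / 2 * np.diff(u))   # trapezoid rule

def K_closed(a, b):
    return 1.0 / 3.0 - max(a, b) + (a**2 + b**2) / 2.0

for a, b in [(0.2, 0.7), (0.5, 0.5), (0.0, 1.0)]:
    print(K_quad(a, b), K_closed(a, b))   # matching pairs
```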
For comparison purposes, we consider the alternative formulation of the Cramér-von Mises statistic by Koziol and Green [1976]. They consider the random integral $\Phi_n=\int_0^{\infty}(\widetilde F_n(t)-F_0(t))^2\,dF_0(t)$, where $\widetilde F_n$ is exactly the Kaplan-Meier estimator, except that they force $\widetilde F_n(\tau_n)=1$ even if the largest observation is censored. For simplicity of the analysis, Koziol and Green [1976] assumed that the censoring distribution satisfies $1-G(t)=S_0(t)^{\gamma}$ for $\gamma<2$ and that $F_0$ is a continuous distribution. Then, based on Gaussian-process arguments, they proved that $n\Phi_n\overset{D}{\to}\Phi$, where $\Phi$ denotes a (potentially infinite) linear combination of independent $\chi_1^2-1$ random variables plus a constant, with $\mathbb E(\Phi)=\frac{1}{3(2-\gamma)}$ and $\operatorname{Var}(\Phi)=\frac{2}{9(5-\gamma)(2-\gamma)}$.
Using our techniques, we consider $\theta(\widehat F_n)$ as in Equation (3.3). Then, if $\gamma<1$, the conditions of Corollary 2.5 are satisfied, and thus
$$n\theta(\widehat F_n)\overset{D}{\to}\frac{1}{3(2-\gamma)}+\Psi,$$
where $\Psi$ is as in Theorem 2.4. Recall that $\mathbb E(\Psi)=0$; then the asymptotic mean is given by $\frac{1}{3(2-\gamma)}$, and the asymptotic variance admits a closed-form expression in $\gamma$ as well. Our result suggests that our estimator and the one considered by Koziol and Green [1976] have similar behaviours, even when rescaled by $n$. In Figure 1, we show simulations of the empirical distribution of $n\theta(\widehat F_n)$ for different sample sizes $n$ and $\gamma\in\{0.5,1,1.5\}$. For $\gamma=0.5$ we can observe a clear convergence of the distribution functions, as predicted by our results. The plot for $\gamma=1.5$ shows a shift of the distribution functions as the sample size increases, suggesting divergence. The simulations for $\gamma=1$ are, unfortunately, not very revealing.
Example 3.2 (Maximum mean discrepancy). Let $(\mathcal H,\langle\cdot,\cdot\rangle_{\mathcal H})$ be a reproducing kernel Hilbert space of real-valued functions with reproducing kernel denoted by $K$. Denote by $\mathcal P$ the set of all probability distribution functions on $\mathbb R_+$; we define the map $\mu_{\cdot}:\mathcal P\to\mathcal H$ by $\mu_F(\cdot)=\mathbb E_{X\sim F}(K(\cdot,X))$ for any distribution function $F\in\mathcal P$. A reproducing kernel is called characteristic if the map $\mu$ is injective [Sriperumbudur et al., 2010]. It is worth mentioning that most of the standard positive-definite kernels (e.g. Gaussian and Ornstein-Uhlenbeck) are characteristic. In such a case, the map $\mu$ allows us to establish a proper distance between probability measures in terms of the norm of the space $\mathcal H$. That is, given two probability distributions $F$ and $F_0$, we define their distance by
$$d(F,F_0)=\|\mu_F-\mu_{F_0}\|_{\mathcal H}.\qquad(3.4)$$
Also, under the conditions stated above, such a distance coincides with the maximum mean discrepancy with respect to the unit ball of $\mathcal H$, which is defined as
$$MMD(F,F_0)=\sup_{f\in\mathcal H:\|f\|_{\mathcal H}\le1}\bigl|\mathbb E_{X\sim F}(f(X))-\mathbb E_{Y\sim F_0}(f(Y))\bigr|.\qquad(3.5)$$
In the uncensored setting, the maximum mean discrepancy has been used in a variety of testing problems. Indeed, in the simplest case, we can assess whether our data points are generated from a distribution $F_0$ by comparing it with the empirical distribution $\widetilde F_n$. By using the equivalence between Equations (3.4) and (3.5), we deduce that $MMD(\widetilde F_n,F_0)^2$ is a V-statistic. This fact allows us to easily derive the relevant asymptotic results to construct a statistical test.
In the setting of right-censored data, we study $MMD(\widehat F_n,F_0)^2$ using the Kaplan-Meier estimator $\widehat F_n$. By using Equations (3.4) and (3.5), our test-statistic can be written as a Kaplan-Meier double integral, and $MMD(\widehat F_n,F_0)^2$ coincides with $\beta$ defined in Equation (2.3) (with $F$ replaced by $F_0$). Hence, under the null hypothesis $H_0:F=F_0$ and Condition 2.2, Theorem 2.4 yields the limit distribution of $n\,MMD(\widehat F_n,F_0)^2$, expressed in terms of $\Psi$ as in Theorem 2.4. Notice that Theorem 2.4 does not require the degeneracy condition of Corollary 2.5. For the sake of simplicity, let us consider $K$ as the Ornstein-Uhlenbeck kernel given by $K(x,y)=e^{-|x-y|}$, and let $S_0(x)=e^{-x}$ and $1-G(x)=S_0(x)^{\gamma}$ (notice that for this choice of parameters $\tau=\infty$). Then, under the null hypothesis and Condition 2.2, which is satisfied for $\gamma<1$, the limit of Theorem 2.4 holds, and, since $\mathbb E(\Psi)=0$, a tedious computation gives closed-form expressions in $\gamma$ for the asymptotic mean and variance.
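A numerical sketch of this test-statistic (our own illustration; the closed forms $\mu_0(x)=\int K(x,y)\,dF_0(y)=(x+\tfrac12)e^{-x}$ and $\int\!\!\int K\,dF_0\,dF_0=\tfrac12$ are our own computations for the OU kernel and $F_0=\mathrm{Exp}(1)$):

```python
import numpy as np

# MMD(F_hat_n, F0)^2 = sum_{ij} W_i W_j K_ij - 2 sum_i W_i mu0(X_i) + 1/2,
# with K(x, y) = e^{-|x-y|}, F0 = Exp(1), mu0(x) = (x + 1/2) e^{-x}.
def km_weights(x, delta):
    n = len(x)
    order = np.lexsort((1 - delta, x))
    xs, d = x[order], delta[order].astype(float)
    w, surv = np.empty(n), 1.0
    for i in range(n):
        w[i] = d[i] / (n - i) * surv
        surv *= ((n - i - 1) / (n - i)) ** d[i]
    return xs, w

rng = np.random.default_rng(4)
n, gamma = 3000, 0.5
T, C = rng.exponential(1.0, n), rng.exponential(1.0 / gamma, n)
x, delta = np.minimum(T, C), (T <= C).astype(int)

xs, w = km_weights(x, delta)
gram = np.exp(-np.abs(xs[:, None] - xs[None, :]))
mu0 = (xs + 0.5) * np.exp(-xs)
mmd2 = w @ gram @ w - 2 * (w @ mu0) + 0.5
print(mmd2)   # small (order 1/n) under H0: F = Exp(1)
```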
In Figure 2, we compare the empirical mean and variance with the mean and variance of the limit distribution. We repeat this experiment 1000 times for different values of γ and a fixed sample size of 3000 data points. We observe that as γ approaches 1 the empirical estimation starts to get far away from the mean and variance predicted by our result, suggesting a slow convergence rate.

Conclusions and Final Remarks
In this work we studied the limit distribution of Kaplan-Meier V- and U-statistics under two different regimes: the degenerate and the non-degenerate. Our results hold under very simple conditions; in practice, we just need to check the finiteness of two simple integrals. Compared to previous approaches, our results are much simpler to state and the conditions required to apply them are much easier to satisfy and verify. Additionally, our results give more information about the limit distribution; e.g. we give closed-form expressions for the asymptotic mean and variance, as well as an asymptotic canonical V-statistic representation of the Kaplan-Meier V-statistic.
We give a few comments about our results. First, in the canonical case (uncensored data), U-statistics are preferred over V-statistics for several reasons. Arguably, the most important is that U-statistics are unbiased while V-statistics are, in general, biased. The bias of V-statistics means that limit theorems need to deal with the behaviour of the biased part of the estimator, resulting in stronger conditions in the statement of the results. In the right-censored setting, there does not seem to be a major difference between U- and V-statistics, and indeed, V-statistics are easier to work with as they can be represented by an integral with respect to the Kaplan-Meier estimator. Furthermore, due to the complex structure of the Kaplan-Meier weights, Kaplan-Meier U-statistics are usually biased, as opposed to their canonical counterparts, losing their main advantage over V-statistics.
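The computational difference between the two estimators is small. A minimal sketch (our own illustration; `km_weights` and `km_v_u` are hypothetical helper names, and ties are assumed away):

```python
import numpy as np

def km_weights(x, d):
    """Kaplan-Meier jumps W_i at the ordered observations (no ties assumed)."""
    order = np.argsort(x)
    x, d = x[order], d[order]
    n = len(x)
    w, surv = np.zeros(n), 1.0
    for i in range(n):
        w[i] = d[i] * surv / (n - i)        # W_i = Delta_i * S_n(x_{(i)}-) / Y(x_{(i)})
        surv *= 1.0 - d[i] / (n - i)        # product-limit update
    return x, w

def km_v_u(x, d, kernel):
    """Kaplan-Meier V- and U-statistics for a symmetric kernel K."""
    xs, w = km_weights(x, d)
    K = kernel(xs[:, None], xs[None, :])
    WW = np.outer(w, w)
    theta_v = np.sum(K * WW)                          # sum over all pairs (i, j)
    off = ~np.eye(len(xs), dtype=bool)                # pairs with i != j
    theta_u = np.sum(K[off] * WW[off]) / np.sum(WW[off])
    return theta_v, theta_u
```

Without censoring every $W_i = 1/n$, so both reduce to the canonical estimators; with censoring, the normalisation $\sum_{i \ne j} W_i W_j$ is random rather than $(n-1)/n$, which is the source of the bias discussed above.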
Second, we think that our proof can be implemented in the setting of random kernels $K_n$ that depend on the data points $(X_i, \Delta_i)_{i=1}^n$, as long as certain regularity conditions hold, namely: i) $K_n$ is predictable in the sense of Definition 6.6; ii) there exists a deterministic kernel $\bar K$ such that $\sup_{x, y \le \tau_n} |K_n(x, y)/\bar K(x, y)| = O_p(1)$; iii) $K_n$ converges in probability to some deterministic kernel $K$; iv) $\bar K$ and $K$ satisfy Conditions 2.1 or 2.2, depending on the case of interest.
Third, our analysis can be extended to kernels of dimension greater than two by using the same underlying ideas exposed in this work. Nevertheless, the statements and proofs of the results become much more complicated due to the long computations that arise because the core of our proof strategy relies on decomposing the integration region into Interior and Exterior regions; as the number of integrals grows, so does the number of possible combinations of these types of regions. We do not include such results as they do not add much value to the current work, especially because U- and V-statistics of dimension two are the most common in applications.
Finally, after the publication of the first preprint of this paper, a few works have followed the path of using MMD distances for hypothesis testing in the setting of right-censored data. In particular, Matabuena and Padilla [2019] implemented our MMD example for the two-sample problem, and extended it to energy distances, which are a generalisation of the MMD. Their analysis is a direct application of Theorems 2.3 and 2.4, and Corollary 2.5. In a similar direction, in Fernandez and Rivera [2019] the authors studied MMD distances in the context of hazard functions, obtaining as test-statistic a double integral with respect to the Nelson-Aalen estimator. Due to the relationship between the Nelson-Aalen estimator and the counting process martingale $M(t)$, the asymptotic analysis of their test-statistic is carried out by using the techniques of this paper. Somewhat related is the work of Rindt et al. [2019], where an MMD independence test for right-censored data is presented. Their test-statistic is a quadruple Kaplan-Meier integral; however, under their null hypothesis, this test-statistic becomes the product of two Kaplan-Meier double integrals, and thus its asymptotic analysis follows from our results.

Proofs I: Road Map
In order to keep our proofs as tidy as possible and to emphasise the key steps without the distraction of messy computations, we give a list of intermediate steps that are needed to carry out the proof of our main results.
Recall the decomposition of our statistic into the terms $\alpha$ and $\beta$. We analyse $\alpha$ and $\beta$ individually.

Treatment of α
We distinguish between two cases: when $\phi(x)$, defined in Equation (2.2), satisfies the degeneracy condition stated in Corollary 2.5, and when it does not. In the first case, observe that $\alpha = 0$ holds trivially. In the second case, an application of the Central Limit Theorem (CLT) of Akritas [2000] gives us the asymptotic behaviour of $\alpha$.

Treatment of β
Let $R_1, R_2$ be two subsets of $\mathbb{R}_+$, and denote by $\beta_{R_1 \times R_2}$ the restriction of the double integral defining $\beta$ to the region $R_1 \times R_2$. Observe that $\beta = \beta_{I_H \times I_H} = \beta_{I_H^2}$, and that $\beta$ can be decomposed as $\beta = \beta_{I^2} + \beta_{I^{2c}}$. We recall that $I$, $E$ and $I_H$ are defined in Section 1.1.4. To avoid extra parentheses, we write $I^{2c}$ instead of $(I^2)^c$.
In Section 7, we study the asymptotic properties of $\beta_{I^{2c}}$, obtaining as main result the following lemma.
The handling of the term $\beta_{I^2}$ is far more complicated since it contains all the important information about the limit distribution. In Section 8, we transform $\beta_{I^2}$ into a more tractable object by performing a change of measure: instead of integrating with respect to $d(\widehat{F}_n - F)$, we integrate with respect to the measure $dM = dN - Y d\Lambda$. This is done by using Duhamel's equation (Proposition 6.3). The main result of Section 8 is the following.

Lemma 5.4. Under Condition 2.1, Equation (5.1) holds, and under Condition 2.2, Equation (5.2) holds, where the kernel $K'$ is defined in Equation (2.4).

Proof of Theorem 2.3
In order to prove Theorem 2.3, we require the following intermediate result, which is formally proved in Section 9.

Proof of Theorem 2.4
We proceed to prove Theorem 2.4 and Corollary 2.5. Observe that, under Condition 2.2, Lemma 5.3 and Equation (5.2) of Lemma 5.4 yield an expression for $n\beta$. Theorem 2.6 states that $S_n(x-)$ and $Y(x)/n$ in Equation (5.3) can be substituted by their respective limits, $S(x-)$ and $1 - H(x-)$, obtaining Equation (5.4). The proof of Theorem 2.4 follows by noticing that the leading term in Equation (5.4) is a degenerate V-statistic. Proof of Theorem 2.4: Equation (5.4) states that $\beta$ is a canonical V-statistic with kernel $J$, defined in Equation (2.6), up to an error of order $o_p(n^{-1})$. From Condition 2.2, we can deduce that $E(J((X_1, \Delta_1), (X_2, \Delta_2))^2) < \infty$, and that $J$ satisfies the following degeneracy condition: $E(J((X_i, \Delta_i), (x, r))) = 0$ for all $(x, r) \in I_H \times \{0, 1\}$ (see Appendix C). Therefore, by applying standard results for degenerate V-statistics, e.g. [Koroljuk and Borovskich, 1994], we obtain the stated limit, where $\xi_1, \xi_2, \ldots$ are i.i.d. standard normal random variables, and the $\lambda_i$'s are the eigenvalues of the integral operator $T_J: L^2(X, \Delta) \to L^2(X, \Delta)$ associated with $J$.
Finally, note that $E(\Psi) = 0$, and that the asymptotic variance takes the stated closed form, where the last equality is formally verified in Appendix C.
The rest of the paper is devoted to proving Lemmas 5.3, 5.4 and 5.5, and Theorem 2.6.

Proofs II: Preliminary Results
The following results are going to be used several times in this paper.

Some Results for Counting Processes
Proposition 6.1. The following results hold a.s.
Item i. is due to Stute and Wang [1993], item ii. is the Glivenko-Cantelli theorem, and item iii. follows from the two previous items.
Items i. and ii. are due to Gill [1983]. Item iii. is from [Gill, 1980, Theorem 3.2.1], and Item iv. is due to Yang [1994].
Yet another useful result is the so-called Duhamel's Equation.
Proposition 6.3 (Prop. 3.2.1 of [Gill, 1980]). For all $x > 0$ such that $S(x) > 0$, the Duhamel equation holds, where $A$ is the operator defined in Section 1.1. Secondly, for a 2-dimensional kernel $K$, we obtain the analogous bivariate identity.

Some Convergence Theorems
We state, without proof, the following elementary result that is useful to prove that a sequence of (random) integrals converges to zero in probability.
Lemma 6.4. Let $(\mathcal{X}, \mathcal{B}, \mu)$ be a $\sigma$-finite measure space. Let $(R_n(x): x \in \mathcal{X})$ be a sequence of stochastic processes indexed by $\mathcal{X}$, and assume that $R_n(\cdot)$ is measurable with respect to $\mathcal{B}$ (for any fixed realisation of $R_n$). Suppose that i) for each $x \in \mathcal{X}$, $R_n(x) \to 0$ almost surely as $n$ tends to infinity, and ii) there exists a deterministic non-negative function $R$ such that $R_n(x) = O_p(1)R(x)$ uniformly in $x$, and $\int_{\mathcal{X}} R(x)\mu(dx) < \infty$.
Define the sequence of random integrals $I_n = \int_{\mathcal{X}} R_n(x)\mu(dx)$; then $I_n = o_p(1)$.

Some Martingale Results
For a given martingale $W$, we denote by $\langle W \rangle$ and $[W]$, respectively, the predictable and quadratic variation processes associated with $W$. For the counting process martingales $M_i$ and $M$ these processes admit explicit expressions; note in particular that $1 - \Delta\Lambda(y) = S(y)/S(y-)$. In our proofs we will constantly use the Lenglart-Rebolledo inequality [Fleming and Harrington, 1991, Theorem 3.4.1]; in particular, we will use the fact that if $W$ is a submartingale depending on the number of observations $n$, with compensator $R$, then the Lenglart-Rebolledo inequality implies that $\sup_{t \le T} W(t) \overset{P}{\to} 0$ for any stopping time $T$ such that $R(T) \overset{P}{\to} 0$. Here limits are taken as $n$ approaches infinity. Throughout the proofs, we may not explicitly write the dependence on $n$ when writing a stochastic process, e.g. the martingale $M(t)$ depends on all data points.
In this work we will often encounter (sub)martingales with extra parameters, and we will integrate with respect to them. A particular case is stated in the following lemma, whose proof is very simple (and thus omitted).
Lemma 6.5. Let $(\mathcal{X}, \mathcal{B}, \mu)$ be a $\sigma$-finite measure space. Consider the stochastic process $(M_y(t): t \ge 0, y \in \mathcal{X})$, and assume that i) for every fixed $y \in \mathcal{X}$, $M_y(t)$ is a square-integrable $(\mathcal{F}_t)$-martingale, and ii) for every $t \ge 0$, $E\int_{\mathcal{X}} M_y(t)^2 \mu(dy) < \infty$.
Then, for fixed $B \in \mathcal{B}$, the stochastic process $W(t) = \int_B M_y(t)^2 d\mu(y)$ is an $(\mathcal{F}_t)$-submartingale, and its compensator $R$ is given by $R(t) = \int_B \langle M_y \rangle(t) d\mu(y)$.
Another interesting type of stochastic process that appears in our proofs is the double integral with respect to martingales: processes of the form $W(t) = \int_{C_t} h(x, y)\, dM(x) dM(y)$, where $C_t = \{(x, y): 0 < x < y \le t\}$ and $C_0 = \emptyset$. The natural questions are whether $W(t)$ defines a proper martingale with respect to $(\mathcal{F}_t)_{t \ge 0}$ and, if that is the case, what its predictable variation process is (if it exists). We answer these questions below.
Definition 6.6. Define the predictable $\sigma$-algebra $\mathcal{P}$ as the $\sigma$-algebra generated by the sets of the form $(a_1, b_1] \times (a_2, b_2] \times A$, with $A \in \mathcal{F}_{a_2}$ and $0 \le a_1 \le b_1 \le a_2 \le b_2$. A process is called elementary predictable if it can be written as a finite sum of indicator functions of sets belonging to the predictable $\sigma$-algebra $\mathcal{P}$. On the other hand, if a process $h$ is $\mathcal{P}$-measurable then it is the almost sure limit of elementary predictable processes.
Straightforwardly from Definition 6.6 we get the following proposition.
Also, all deterministic functions are P-measurable.
Theorem 6.8. Let $h$ be a $\mathcal{P}$-measurable process, and suppose that a suitable square-integrability condition holds for all $t \ge 0$. Then $W(t) = \int_{C_t} h(x, y)\, dM(x) dM(y)$ is a square-integrable $(\mathcal{F}_t)$-martingale with respect to the filtration, with an explicit predictable variation process $\langle W \rangle$. The proof of Theorem 6.8 is given in Appendix B.

Forward Operators
Lemma 6.9. Under Condition 2.1, Equations (6.6) and (6.7) hold, while under Condition 2.2, Equations (6.8) and (6.9) hold. The proof of Lemma 6.9 is given in Appendix A.
Then, by the Cauchy-Schwarz inequality, the displayed bound holds. Multiplying by $(1 - G(\tau_n))/(1 - G(\tau_n)) = 1$, we get the stated estimate, where the last equality follows from the facts that $n(1 - H(\tau_n)) = O_p(1)$ by Proposition 6.2.iv, and that the double integral tends to 0 since $\tau_n \to \tau$ as $n$ tends to infinity, and by Condition 2.1. Following the same argument, under Condition 2.2, we get a term of order $O_p(1)$, and the double integral tends to 0 by Condition 2.2, together with the fact that $\tau_n \to \tau$.

Proof:
We start by noticing that if $\tau$ is a point of discontinuity of $H$, then $\tau_n = \tau$ almost surely for sufficiently large $n$. Consequently, the set $E \times I$ is empty and the statement above holds trivially. Therefore, we assume that $\tau$ is a continuity point of $H$. Replacing Equation (6.2) in $\beta_{E \times I}$ yields the displayed expression. Recall that Equation (1.2) gives a representation valid for $x \le \tau_n$, where we define $L(\tau_n)$ as the corresponding integral over $(0, \tau_n]$. To verify the last equality, we use Lemma 7.1 and the fact that $L(\tau_n) = O_p(1)$, which follows from Lemma 2.4 of Gill [1983], stating that $L(\tau_n) = 1 - S_n(\tau_n)/S(\tau_n)$, together with Proposition 6.2.i. Therefore, from Equation (7.1), we deduce the stated bound. We proceed to show that $n\int_{\tau_n}^{\tau} M_y(\tau_n) dF(y) = o_p(1)$, which implies that $n\beta_{E \times I} = o_p(1)$.
Notice that for any fixed $y \in \mathbb{R}_+$, $M_y(t)$ is a square-integrable $(\mathcal{F}_t)$-martingale. By applying the Cauchy-Schwarz inequality, we obtain a bound on $n\int_{\tau_n}^{\tau} M_y(\tau_n) dF(y)$ in terms of $n^{1/2} S(\tau_n)^{1/2}$ and $n\int_{\tau_n}^{\tau} M_y(\tau_n)^2 dF(y)$, where the last equality follows from Proposition 6.2.iv. We proceed to prove that the latter integral is $o_p(1)$. Notice that this integral has random integration limits. Our first step is to prove that $\tau_n$ can be replaced by a deterministic value, say $T_n$, without affecting the result we wish to prove. Let $C > 0$ be a large constant, define $T_n = \inf\{t > 0: H(t) \ge 1 - C/n\}$ and the event $B_n = \{1 - C/n \le H(\tau_n)\}$. By Proposition 6.2.iv, it holds that $P(B_n) \ge 1 - e^{-C}$ and, by the definition of $T_n$, we have $\{\tau_n \ge T_n\} \subseteq B_n$. Since $\lim_{C \to \infty} P(B_n) = 1$, it is enough to prove that $n\int_{T_n}^{\tau} \frac{M_y(\tau_n)^2}{1 - G(y-)} dF(y) = o_p(1)$.
Observe that, by Lemma 6.5, the process under study is an $(\mathcal{F}_t)$-submartingale whose compensator, evaluated at $t = \tau_n$, can be bounded as displayed, where the second equality is due to Propositions 6.2.i and 6.2.ii, and the third equality holds by noticing that $T_n \to \tau$ and applying Equation (6.8) of Lemma 6.9 under Condition 2.2. We conclude that the compensator is $o_p(1)$. Since the previous result is valid on the event $B_n$, which can be chosen with arbitrarily large probability, we conclude, finishing our proof.
Proof: Following the same steps as in the proof of Lemma 7.2, the displayed process is an $(\mathcal{F}_t)$-martingale for every fixed $y \in \mathbb{R}_+$. Let $T_n$ be the same deterministic sequence used in the proof of Lemma 7.2. Then, it suffices to show the displayed convergence. By the Cauchy-Schwarz inequality, and by Lemma 6.5, the process $n\int_{T_n}^{\tau} M_y(t)^2 dF(y)$ is an $(\mathcal{F}_t)$-submartingale with compensator, evaluated at $t = \tau_n$, given by the displayed expression, where the second equality holds by Propositions 6.2.i and 6.2.ii. We prove that this compensator converges to zero by noticing that $T_n \to \tau$, together with Equation (6.6) of Lemma 6.9 under Condition 2.1. The previous result implies that $n\int_{T_n}^{\tau} \langle M_y \rangle(\tau_n) dF(y) = o_p(1)$. By the Lenglart-Rebolledo inequality, we deduce $n\int_{T_n}^{\tau} M_y(\tau_n)^2 dF(y) = o_p(1)$, and by substituting this result in Equation (7.3) we get $n^{1/2}\beta_{E \times I} = o_p(1)$.
We proceed to prove that the empirical operators can be replaced in the previous equation by the operators $A_1$ and $A_2$, respectively. After that, Equation (5.2) follows immediately by recalling that $K'(x, y) = (A_1 A_2 K)(x, y)$.
We begin with the following equality; then, by the symmetry of $K$, we just need to prove Equations (8.1) and (8.2). We begin by proving Equation (8.1). From Equation (1.2) we get an expression involving $Y(x)$; then, substituting Equation (8.3) into Equation (8.1) yields the displayed identity, where the third equality holds as $L(\tau_n) = O_p(1)$ (which is proved in Lemma 7.2), and the last equality is exactly Equation (7.2). Hence, by Lemma 7.2, we deduce that Equation (8.1) holds true. For Equation (8.2), a similar computation yields the analogous bound, where the last equality holds by Lemma 7.1.
To finish the proof of Lemma 5.4 we need to check that Equation (5.1) holds under Condition 2.1, which follows from repeating the same steps but replacing the scaling factor $n$ by $\sqrt{n}$.

Proofs V: Double Stochastic Integral
In this section we prove Lemma 5.5 and Theorem 2.6. To begin with, from Lemmas 5.3 and 5.4, we deduce that Equation (9.1) holds for $c = 1/2$ under Condition 2.1, and for $c = 1$ under Condition 2.2. The form of $\beta$ suggests that we need to study the double stochastic integral process $Q(t)$. The strategy to study $Q(t)$ is to consider its decomposition into a diagonal and an off-diagonal term, and to analyse them individually. To this end, we define the sets $D(t) = \{(x, y): x = y, 0 < x \le t\}$ and $C(t) = \{(x, y): 0 < x < y \le t\}$, and define the processes $Q_D(t)$ and $Q_C(t)$ accordingly. Notice that $Q(t) = Q_D(t) + 2Q_C(t)$ follows from the symmetry of $K' = A_1 A_2 K$. The proofs of Lemma 5.5 and Theorem 2.6 are an immediate consequence of the following results concerning the process $Q(t)$.

Proof of Theorem 2.6:
From Equation (9.1) it holds that $n\beta = nQ(\tau_n) + o_p(1) = nQ_D(\tau_n) + 2nQ_C(\tau_n) + o_p(1)$. Then a direct application of Lemmas 9.2 and 9.4 yields the result. It just remains to prove Lemmas 9.1, 9.2, 9.3, and 9.4.

Integral over the Diagonal D(t): Proof of Lemmas 9.1 and 9.2
Observe that $Q_D(t)$ satisfies the displayed identity. The latter can be checked by noticing that the measure $dM(x)dM(y)$ of a small square whose main diagonal goes from $(a, a)$ to $(b, b)$ is $(M(b) - M(a))^2$. When $b$ approaches $a$ from above, we have that $(M(b) - M(a))^2 \to (\Delta M(a))^2$ (the limit is well-defined for $M$). Since $M$ is the difference of two increasing processes, the number of its discontinuities is at most countable. Define the corresponding $(\mathcal{F}_t)$-submartingale $W(t)$; it is enough to prove that $\sqrt{n} W(t) = o_p(1)$. Abusing notation, denote by $\langle W \rangle(t)$ the compensator of $W(t)$. We will prove that $\sqrt{n} \langle W \rangle(\tau_n) = o_p(1)$, and thus, by the Lenglart-Rebolledo inequality, we will get $\sqrt{n} W(\tau_n) = o_p(1)$.

Observe that
where the fourth equality follows from Propositions 6.2.i and 6.2.ii. Finally, we claim that the remaining term vanishes. This is verified by applying the Dominated Convergence Theorem: the integrand tends to zero for each fixed $x \in I_H$, and, by using that $Y(x) \ge 1$ for $x \le \tau_n$, the integrand is bounded by an integrable function due to Condition 2.1.
Proof of Lemma 9.2: Observe that it is enough to prove that the displayed quantity vanishes; note that it corresponds to an $(\mathcal{F}_t)$-submartingale $W(t)$. We prove that $W(\tau_n) = o_p(1)$ by using the Lenglart-Rebolledo inequality. To this end, we have to prove that its compensator, which by abuse of notation we denote by $\langle W \rangle$, satisfies $\langle W \rangle(\tau_n) = o_p(1)$. A simple computation shows the displayed bound, where the third equality follows from Proposition 6.2.iii, and the last equality follows from applying Lemma 6.4, whose conditions we proceed to verify. For the first condition, Proposition 6.1 yields an integrand tending to zero with dominating function $\frac{1}{1 - H(x-)}$, which is integrable by Condition 2.2; for the second, Propositions 6.2.i and 6.2.ii give $U_n(x) = O_p(1)\frac{1}{(1 - G(x-))^2}$, uniformly on $x \le \tau_n$, and thus the required bound holds uniformly on $x \le \tau$.

Integral over the Off-diagonal C(t): Proof of Lemmas 9.3 and 9.4
Proof of Lemma 9.3: From Theorem 6.8, $Q_C(t)$ is a square-integrable $(\mathcal{F}_t)$-martingale with mean 0. Then, by the Lenglart-Rebolledo inequality, it is enough to prove that its predictable variation process, denoted by $\langle Q_C \rangle(t)$, satisfies $n\langle Q_C \rangle(\tau_n) = o_p(1)$.
From Theorem 6.8 we have that $n\langle Q_C \rangle(\tau_n)$ is equal to the displayed expression, where the first equality is due to Propositions 6.2.i and 6.2.ii, and in the second equality we define $M_y(t)$ as the corresponding stochastic integral over $(0, t]$, which is a square-integrable $(\mathcal{F}_t)$-martingale for any fixed $y \in I_H$.
We claim that the previous quantity tends to 0 as $n$ approaches infinity by an application of the Dominated Convergence Theorem. Indeed, the integrand tends to 0 for any fixed $x \in I_H$, and it is dominated by a function which is integrable by Equation (6.7) of Lemma 6.9 (recall that $K' = A_1 A_2 K$).
Proof of Lemma 9.4: The result follows from proving that the displayed quantity is $o_p(1)$, which is equivalent to proving that Equations (9.3) and (9.4) hold. We only prove Equation (9.3), as Equation (9.4) follows by repeating the same steps.
Consider the integrand, which is predictable w.r.t. $(\mathcal{F}_x)_{x \ge 0}$, and define the process $W(t)$ as the corresponding double stochastic integral, which, by Theorem 6.8, is a square-integrable $(\mathcal{F}_t)$-martingale. We just need to prove that $W(\tau_n) = o_p(1)$. By the Lenglart-Rebolledo inequality, it is enough to check that the predictable variation process of $W(t)$, $\langle W \rangle(t)$, satisfies $\langle W \rangle(\tau_n) = o_p(1)$. From Theorem 6.8, we have the displayed identity, where the second equality is due to Propositions 6.2.i and 6.2.ii, and in the third equality we define $M_y(t) = \int_0^t U_n(x)K'(x, y)1_{\{x<y\}} dM(x)$. We proceed to check that Equation (9.5) is $o_p(1)$. Observe that for any fixed $y \in I_H$, $M_y(t)$ is a square-integrable $(\mathcal{F}_t)$-martingale; thus, by Lemma 6.5, the process $Z(t) = \frac{1}{n}\int_0^\tau M_y(t)^2 \frac{S(y)}{1 - H(y-)} dF(y)$ is an $(\mathcal{F}_t)$-submartingale. Note that $\langle W \rangle(\tau_n) = Z(\tau_n)$. We check that $Z(\tau_n) = o_p(1)$ by verifying that the compensator of $Z$, which by abuse of notation we denote by $\langle Z \rangle(t)$, satisfies $\langle Z \rangle(\tau_n) = o_p(1)$. From Lemma 6.5, $\langle Z \rangle(\tau_n)$ is given by Equation (9.6), where the equality follows from Proposition 6.2.iii. We shall verify the conditions of Lemma 6.4 to prove that Equation (9.6) is $o_p(1)$. Set $R(x, y) = \frac{K'(x, y)^2 S(x)S(y)}{(1 - H(y-))(1 - H(x-))}$ and $R_n(x, y)$ as the integrand in Equation (9.6). To verify the first condition of Lemma 6.4, note that $R_n(x, y) \to 0$ for each $(x, y)$, since $U_n(x) \to 0$ by Proposition 6.1. To verify the second condition, Propositions 6.2.i and 6.2.ii yield $U_n(x) = O_p(1)(1 - G(x-))^{-1}$ uniformly on $x \le \tau_n$, thus $R_n(x, y) = O_p(1)R(x, y)$ uniformly in $x$. Finally, the function $R(x, y)$ is integrable due to Equation (6.9) of Lemma 6.9 (recall that $K' = A_1 A_2 K$).
A Proof of Lemma 6.9
In order to prove Lemma 6.9, we introduce the operator $R: L^2(F) \to L^2(F)$ given by the displayed formula. Note that the operator $A$ can be written as $A = \mathrm{Id} - R$, where $\mathrm{Id}$ is the identity operator. Additionally, for bivariate functions $K: \mathbb{R}_+^2 \to \mathbb{R}$, we define $R_1 K$ and $R_2 K$ as the operator $R$ applied to the first and second coordinate of $K$, respectively. Note that $R_1$ and $R_2$ commute. Let $X \sim F$ and $g \in L^2(F)$; then we claim that the operator $R$ satisfies the bound (A.1), which follows from Equation (4.3) of Efron and Johnstone [1990]. Then, by using that $(Rg)^2(X) \le 2(g(X)^2 + (Ag)(X)^2)$, we get the displayed inequality, where in the last step we used the previous equation. Define the auxiliary kernels $\Gamma$ and $\Sigma$, and notice that Conditions 2.1 and 2.2 imply $\Gamma \in L^2(F \times F)$ and $\Sigma \in L^2(F \times F)$, respectively. Assume Condition 2.1 holds; then a simple computation shows Equation (A.3), where the last equation follows from Equation (A.1) and Condition 2.1. Another similar computation shows Equation (A.4), where the first inequality follows from Equation (A.1) applied on $R_2$ (i.e. applied on $y$), and the second inequality is exactly Equation (A.3). Similar computations show that, under Condition 2.2, Equations (A.5) and (A.6) hold. Also, Equation (6.7) follows directly from Equations (A.3) and (A.4). Condition 2.2, together with Equations (A.5) and (A.6), yields Equations (6.8) and (6.9) by following the exact same procedure.
B Proof of Theorem 6.8
We proceed to prove Theorem 6.8. Let $h$ be $\mathcal{P}$-measurable. As $M(x)$ is the difference between two right-continuous increasing processes, we can define the iterated integral $\int\int h(x, y)\, dM(x) dM(y)$.
We proceed to prove that the process $\phi(y) = \int_{(0,y)} h(x, y) dM(x)$ is predictable with respect to the filtration $(\mathcal{F}_y)_{y \ge 0}$. For this, it is enough to verify the claim for elementary functions of $\mathcal{P}$, and then extend the result to general functions in $\mathcal{P}$.
If $h(x, y) = X 1_{\{(x,y) \in (a_1,b_1] \times (a_2,b_2]\}}$ with $X \in \mathcal{F}_{a_2}$ and $0 \le a_1 \le b_1 \le a_2 \le b_2$, then $\int_{(0,y)} h(x, y) dM(x) = X 1_{(a_2,b_2]}(y)\int_{(0,y)} 1_{(a_1,b_1]}(x) dM(x)$, which is predictable with respect to $(\mathcal{F}_y)_{y \ge 0}$ since both processes, $X 1_{(a_2,b_2]}(y)$ and $\int_{(0,y)} 1_{(a_1,b_1]}(x) dM(x)$, are adapted to $(\mathcal{F}_y)_{y \ge 0}$ and left-continuous. For the first process, note that it is important that $X \in \mathcal{F}_{a_2}$ to ensure it is adapted, and for the second one it is key that we are integrating over $(0, y)$ instead of $(0, y]$ to ensure it is left-continuous. Therefore, the double integral of $h$ is the integral of a predictable process, and thus $Z(t)$ is an $(\mathcal{F}_t)$-martingale. By using Equation (6.4) together with the Lebesgue Dominated Convergence Theorem, we extend the result to general functions $h$ of the predictable $\sigma$-algebra $\mathcal{P}$. From Equation (6.5), we get that $Z(t)$ is a square-integrable process, and its predictable variation process is given by $\int \big(\int_{(0,y)} h(x, y) dM(x)\big)^2 \frac{Y(y)S(y)}{S(y-)^2} dF(y)$.
As the term inside the parentheses is a deterministic function of $t$, the previous integral is just a stochastic integral with respect to the zero-mean martingale $M_1$; hence, by the Optional Stopping Theorem, its expected value is 0.

D Proof of Lemma 2.7
Equation (3.41) of Aalen et al. [2008] gives the relation between the Kaplan-Meier jumps and the counting processes; hence a Kaplan-Meier weight $W_i$ for an uncensored observation $X_i$ equals $\Delta \widehat{F}_n(X_i)$ divided by the number of uncensored observations that fall exactly at $X_i$, i.e., the weight $W_i$ associated with $X_i$ equals $\Delta \widehat{F}_n(X_i)/\Delta N(X_i) = S_n(X_i-)/Y(X_i)$. We will first prove that $\sqrt{n} \sum_{i=1}^n K(X_i, X_i)W_i^2 = o_p(1)$ under Condition 2.1. Note that the associated process $Z(t)$ is a submartingale. By the Lenglart-Rebolledo inequality, it is enough to prove that $\sqrt{n} Z(\tau_n) = o_p(1)$. An application of Propositions 6.2.i and 6.2.ii shows the displayed bound, where the equality holds by Proposition 6.2.iii. We claim that Equation (D.3) is $o_p(1)$ by Lemma 6.4, whose conditions we proceed to verify. Set $R(x) = \frac{|K(x,x)|}{1 - G(x-)}$, which is integrable, and set $R_n(x)$ as the integrand in Equation (D.3).
For the first condition of Lemma 6.4, note that $U_n(x) \to 0$ for all $x$, due to Proposition 6.1, hence $R_n(x) \to 0$. For the second condition, Propositions 6.2.i and 6.2.ii yield $U_n(x) = O_p(1)(1 - G(x-))^{-2}$ uniformly in $x \le \tau_n$, and thus $R_n(x) = O_p(1)R(x)$. Finally, by the Lenglart-Rebolledo inequality, we get that Equation (D.2) holds true. Combining (D.1) and (D.2) yields the stated representation; hence, by the Law of Large Numbers we obtain that n
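The weight identity $W_i = \Delta \widehat{F}_n(X_i)/\Delta N(X_i) = S_n(X_i-)/Y(X_i)$ used in the proof of Lemma 2.7 can be checked numerically. A minimal sketch (our own illustration, assuming continuous, tie-free data): it computes the weights via the product-limit recursion and verifies the mass identity $\sum_i W_i = 1 - S_n(\max_i X_i)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
t = rng.exponential(1.0, size=n)        # lifetimes
c = rng.exponential(2.0, size=n)        # censoring times
x = np.minimum(t, c)
d = (t <= c).astype(int)                # 1 = uncensored
order = np.argsort(x)
x, d = x[order], d[order]

# W_i = d_i * S_n(X_i-) / Y(X_i): the KM jump at an uncensored X_i equals the
# left limit of the KM survival curve divided by the at-risk count Y(X_i).
w = np.zeros(n)
surv = 1.0                              # running value of S_n(X_i-)
for i in range(n):
    y = int(np.sum(x >= x[i]))          # at-risk count; equals n - i (no ties)
    w[i] = d[i] * surv / y
    surv *= 1.0 - d[i] / y              # product-limit update
# Loop invariant: accumulated mass + surv = 1 at every step, so the total KM
# mass satisfies sum_i W_i = 1 - S_n(max_i X_i).
total_mass = w.sum()
```

Censored observations receive zero weight, and the missing mass $1 - \sum_i W_i$ is exactly the value of the Kaplan-Meier survival curve at the largest observation.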