Change-point detection based on weighted two-sample U-statistics

We investigate the large-sample behavior of change-point tests based on weighted two-sample U-statistics, in the case of short-range dependent data. Under mild mixing conditions, we establish convergence of the test statistic to an extreme value distribution. A simulation study shows that the weighted tests are superior to the non-weighted versions when the change-point occurs near the boundary of the time interval, while they lose power in the center.


Introduction
In this paper, we study nonparametric tests for change-points in time series that are based on weighted two-sample U-statistics. By a suitable choice of the weights, we obtain tests that are able to detect changes occurring very early or very late during the observation period. Our results cover both the CUSUM test and the Wilcoxon change-point test, as well as many other robust and non-robust tests. We investigate the large-sample behavior of our tests in the case of short-range dependent data, under mild conditions that cover, e.g., ARMA and ARCH processes. By means of a simulation study, we analyze the small-sample behavior of our tests, e.g. regarding robustness and the ability to detect early and late changes. As an application, we analyze a data set of daily stock returns of Wirecard during the weeks prior to the detection of accounting fraud in June 2020.
There is a vast body of literature on change-point tests using U-statistics. Gombay [14], and Kirch and Stoehr [16] investigate sequential change-point tests based on U-statistics. Wang, Volgushev and Shao [27] apply certain U-statistics for change-point detection in high-dimensional time series. Zhang, Wang, and Shao [29] propose an adaptive change-point test based on U-statistics. Račkauskas and Wendler [20] apply U-statistics for the detection of epidemic changes.
We assume that the data are generated by a stochastic process (X_i)_{i≥1} which follows the model

X_i = µ_i + ξ_i, i ≥ 1,

where (µ_i)_{i≥1} is an unknown signal, and where (ξ_i)_{i≥1} is a short-range dependent stationary stochastic process. Given the observations X_1, ..., X_n, we want to test the hypothesis that the signal is constant, i.e.
H_0: µ_1 = ... = µ_n,

against the alternative of a change in the mean at an unknown point k* in time, i.e.

H_1: µ_1 = ... = µ_{k*} ≠ µ_{k*+1} = ... = µ_n, for some k* ∈ {1, ..., n−1}.
If the change-point k* were known, we would have a two-sample problem with the samples X_1, ..., X_{k*} and X_{k*+1}, ..., X_n, respectively, and we could apply standard tests such as the two-sample Student t-test or the Wilcoxon two-sample test for a change in mean. Up to normalization, both are special cases of two-sample U-statistics

∑_{i=1}^{k*} ∑_{j=k*+1}^{n} h(X_i, X_j),

with a suitably chosen kernel function h: R² → R.
In the change-point setting, where a change occurs at an unknown point in time, we have a family of two-sample problems, indexed by the potential change-point k, and thus we are naturally led to the two-sample U-statistic process

U_k = ∑_{i=1}^{k} ∑_{j=k+1}^{n} h(X_i, X_j), 0 ≤ k ≤ n.
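For concreteness, the U-statistic process above can be computed directly from the kernel matrix. The following sketch (function and variable names are ours, not from the paper) implements it for the two kernels used throughout the paper:

```python
import numpy as np

def u_stat_process(x, kernel):
    """Two-sample U-statistic process U_k = sum_{i<=k} sum_{j>k} h(X_i, X_j)
    for k = 1, ..., n-1, via the full n x n kernel matrix (O(n^2) memory,
    adequate for moderate n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    H = kernel(x[:, None], x[None, :])   # H[i, j] = h(x_i, x_j)
    return np.array([H[:k, k:].sum() for k in range(1, n)])

def cusum_kernel(x, y):
    """CUSUM kernel h(x, y) = y - x."""
    return y - x

def wilcoxon_kernel(x, y):
    """Wilcoxon kernel h(x, y) = 1{x <= y} - 1/2."""
    return (x <= y).astype(float) - 0.5
```

For the CUSUM kernel, U_k equals k(n − k) times the difference of the two segment means, which provides a convenient check of the implementation.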
A variety of change-point tests can be derived from this process by taking suitable functionals, such as the weighted maxima

max_{1≤k≤n−1} (1/n^{3/2}) (n²/(k(n−k)))^γ |∑_{i=1}^{k} ∑_{j=k+1}^{n} h(X_i, X_j)|,

where 0 ≤ γ ≤ 1/2 is some parameter to be chosen. For γ = 0, we obtain non-weighted tests, which have been widely studied in the literature, starting with Darkhovsky [5] and Pettitt [19], who studied the special case of a Wilcoxon-type test statistic. Csörgő and Horváth [2] investigated U-statistics with general kernels in the case of independent data. They showed that the asymptotic distribution under the null hypothesis is the Kolmogorov-Smirnov distribution, which is the distribution of the supremum of a Brownian bridge. Dehling, Fried, Garcia and Wendler [7] extended these results to weakly dependent data. Under long-range dependence, the limiting distribution is given by the supremum of a linear combination of Hermite processes; see Dehling, Rooch and Taqqu [10] for the Wilcoxon test, and Dehling, Rooch and Wendler [11] for arbitrary kernels. For 0 < γ < 1/2, the limit distribution under independence is the supremum of the appropriately weighted Brownian bridge; see, e.g., the seminal monograph by Csörgő and Horváth [3]. The moment conditions for such a limit theorem have been relaxed by Csörgő, Szyszkowicz and Wang [4].
In the present paper, we focus on the extreme case γ = 1/2, where we obtain the test statistic

T_n = max_{1≤k≤n−1} (1/√(n k(n−k))) |∑_{i=1}^{k} ∑_{j=k+1}^{n} h(X_i, X_j)|.

Under the null hypothesis, after suitable normalization and centering, T_n converges in distribution to the Gumbel extreme value distribution. This has been derived by Csörgő and Horváth [2] under independence. We will show that the same holds in the case of short-range dependent data. Antoch, Hušková and Prášková [1] have studied the large-sample behavior of weighted versions of the CUSUM test for dependent observations, in particular for linear processes. We also show that the test is consistent against a wide class of alternatives. For independent data, the behavior under the alternative has been studied by Ferger [13] and Gombay [14]. We have conducted an extensive simulation study comparing this test with the corresponding non-weighted test. Our simulations confirm the intuition that the weighted tests have more power against very early and very late changes, while the non-weighted tests are more powerful against changes in the middle of the observation period. By the choice of the kernel function h, the weighted two-sample U-statistics lead to a flexible class of change-point tests. As special examples, we obtain the CUSUM test for h(x, y) = y − x, and the Wilcoxon test for h(x, y) = 1_{x≤y} − 1/2. More generally, one can take the kernels h(x, y) = ψ(y − x) for some antisymmetric function ψ: R → R. Depending on the choice of ψ, one obtains tests with specific properties, such as robustness against outliers, and tests that are powerful against certain alternatives.
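The γ-weighted statistics can be sketched in code as follows; the normalization max_k n^{−3/2} (n²/(k(n−k)))^γ |U_k| is our reading of the formulas above and should be checked against the displayed equations:

```python
import numpy as np

def weighted_u_stat(x, kernel, gamma=0.5):
    """max_k n^(-3/2) (n^2 / (k(n-k)))^gamma |U_k|: gamma = 0 gives the
    unweighted statistic, gamma = 1/2 the fully weighted statistic T_n."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    H = kernel(x[:, None], x[None, :])           # H[i, j] = h(x_i, x_j)
    k = np.arange(1, n)
    U = np.array([H[:j, j:].sum() for j in k])   # U-statistic process
    w = (n * n / (k * (n - k))) ** gamma / n ** 1.5
    return float(np.max(w * np.abs(U)))
```

For γ = 1/2 the weight reduces to (n k(n − k))^{−1/2}, so early and late values of k receive much larger weights than central ones.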
The rest of the paper is organized as follows. In the next section, we present the detailed technical assumptions and state the main theoretical results of our paper. In Section 3, we present the outcomes of a major simulation study comparing weighted and non-weighted tests, as well as the robust Wilcoxon test and the non-robust CUSUM test. Full details of the proofs are presented in the final section.

Main theoretical results
In this section, we analyze the large sample behavior of the suitably normalized and centered test statistic T n under the hypothesis, and under a wide class of alternatives. A major ingredient in the proof is a new Darling-Erdős type limit theorem for the tied-down random walk of dependent random variables, which might also be of independent interest. Before we present our results, we give some definitions. Throughout this paper, the stochastic process (X i ) i≥1 will be assumed to be α-mixing in the sense of Rosenblatt [23].
Definition 2.1. A stationary process (X_i)_{i≥1} is called α-mixing if

α(k) := sup { |P(A ∩ B) − P(A) P(B)| : A ∈ F_1^a, B ∈ F_{a+k}^∞, a ∈ N } → 0

as k → ∞, where F_a^b denotes the σ-field generated by the random variables X_a, ..., X_b. We define the generalized inverse α^{−1}: [0, 1] → N by

α^{−1}(u) := min { k ∈ N : α(k) ≤ u }.

Our theoretical results require assumptions on the rate of decay of the mixing coefficients (α(k))_{k≥1}. We formulate these assumptions using a concept introduced by Rio [21] that is based on the quantile function, which we define below.

Definition 2.2. For a random variable X, the upper quantile function Q_X: (0, 1) → R is defined by

Q_X(u) := inf { t ≥ 0 : P(|X| > t) ≤ u }.

Finally, we will assume that the kernel function h: R² → R satisfies the variation condition, a continuity assumption introduced by Denker and Keller [12], and that the kernel has uniform (2 + δ)-moments.

Definition 2.3. A kernel h satisfies the variation condition if there exist constants L > 0 and ε_0 > 0 such that, for all 0 < ε ≤ ε_0,

E ( sup { |h(x, y) − h(X, Y)| : ||(x, y) − (X, Y)|| ≤ ε } ) ≤ L ε, (2)

where X, Y are independent random variables with the same distribution as X_1, and where ||·|| denotes the Euclidean norm on R².
Definition 2.4. Let (X_i)_{i≥1} be a stationary process. A kernel h has uniform (2 + δ)-moments if there is a constant M such that, for all k ∈ N_0,

E |h(X_1, X_{1+k})|^{2+δ} ≤ M and E |h(X, Y)|^{2+δ} ≤ M,

where X, Y are independent copies of X_1.
The following theorem is the main theoretical result of this paper. Throughout, Q_{|X|} will denote the common quantile function of the |X_k|'s.

Theorem 1. Let (X_i)_{i≥1} be an α-mixing strictly stationary process. Let h(x, y) be a bounded anti-symmetric kernel with uniform (2 + δ)-moments, satisfying the variation condition (2). Moreover, assume that there exist constants p > 2 and ε > 6/δ such that

∫_0^1 (α^{−1}(u))^{p−1} Q_{|X|}^p(u) du < ∞ and α(k) = O(k^{−ε}). (3)

Then, under the null hypothesis H_0, as n → ∞,

√(2 log log n) (T_n / σ_h) − b_n → G_2 in distribution,

where G_2 is the Gumbel extreme value distribution with distribution function G_2(x) = exp(−2 exp(−x)), and where the centering constants b_n and the long-run variance σ_h² are defined by

b_n = 2 log log n + (1/2) log log log n − (1/2) log π, (5)

σ_h² = Var(h_1(X_1)) + 2 ∑_{k=1}^{∞} Cov(h_1(X_1), h_1(X_{1+k})).

Here, h_1(x) denotes the first-order term in the Hoeffding decomposition of h, as defined below.
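The normalizing constants appearing in Theorem 1 are elementary to evaluate; a minimal sketch (function names are ours):

```python
import math

def darling_erdos_constants(n):
    """Normalizing constants of Theorem 1: a_n = sqrt(2 log log n) and
    b_n = 2 log log n + (1/2) log log log n - (1/2) log(pi), as in (5).
    Requires n >= 3 so that log log log n is defined."""
    ll = math.log(math.log(n))
    a_n = math.sqrt(2.0 * ll)
    b_n = 2.0 * ll + 0.5 * math.log(ll) - 0.5 * math.log(math.pi)
    return a_n, b_n

def gumbel2_cdf(x):
    """Distribution function of the limit, G_2(x) = exp(-2 exp(-x))."""
    return math.exp(-2.0 * math.exp(-x))
```

The upper α-quantile of G_2 is g_{2,α} = −log(−log(1 − α)/2); for α = 0.05 this is roughly 3.66.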
The idea of the proof is to apply the Hoeffding decomposition, introduced by Hoeffding [15], and to show that the degenerate part is asymptotically negligible. Thus, it will remain to show that the linear part converges to the extreme value distribution G_2. For a two-sample U-statistic, the Hoeffding decomposition of the kernel h is given by

h(x, y) = θ + h_1(x) + h_2(y) + Ψ(x, y), (7)

where the terms on the right-hand side are defined by

θ = E h(X, Y), h_1(x) = E h(x, Y) − θ, h_2(y) = E h(X, y) − θ, Ψ(x, y) = h(x, y) − h_1(x) − h_2(y) − θ,

and where X, Y are two independent random variables with the same distribution as X_1. Note that in our case, since h is assumed to be anti-symmetric, θ = 0 and h_2(x) = −h_1(x). Applying the Hoeffding decomposition of the kernel h to the test statistic T_n, we obtain

T_n = max_{1≤k≤n−1} (1/√(n k(n−k))) |∑_{i=1}^{k} ∑_{j=k+1}^{n} (h_1(X_i) − h_1(X_j) + Ψ(X_i, X_j))|. (8)

In order to prove Theorem 1, it thus suffices to show that

√(2 log log n) max_{1≤k≤n−1} (1/√(n k(n−k))) |∑_{i=1}^{k} ∑_{j=k+1}^{n} Ψ(X_i, X_j)| → 0 in probability, (9)

and that

(√(2 log log n)/σ_h) max_{1≤k≤n−1} (1/√(n k(n−k))) |∑_{i=1}^{k} ∑_{j=k+1}^{n} (h_1(X_i) − h_1(X_j))| − b_n → G_2 in distribution. (10)

The asymptotic negligibility of the degenerate part, i.e. (9), will be established in Proposition 5.1, while (10) will be a consequence of a suitable Darling-Erdős theorem; see Theorem 3 below.
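For the two example kernels, the first-order Hoeffding terms can be written in closed form. The following sketch (our notation) also checks numerically that the remainder Ψ is degenerate, i.e. E_Y Ψ(x, Y) = 0 for every fixed x:

```python
import numpy as np

def h1_cusum(x, mu):
    """h(x, y) = y - x gives h_1(x) = E h(x, Y) = mu - x."""
    return mu - x

def h1_wilcoxon(x, cdf):
    """h(x, y) = 1{x <= y} - 1/2 gives h_1(x) = 1/2 - F(x) for continuous F."""
    return 0.5 - cdf(x)

def psi_wilcoxon(x, y, cdf):
    """Degenerate remainder Psi(x, y) = h(x, y) - h_1(x) + h_1(y)
    (using h_2 = -h_1 for the anti-symmetric kernel)."""
    h = (x <= y).astype(float) - 0.5
    return h - h1_wilcoxon(x, cdf) + h1_wilcoxon(y, cdf)
```

For the uniform distribution on [0, 1], degeneracy of Ψ can be verified by numerical integration over y for a fixed x.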
In the next theorem, we investigate the large-sample behavior of T_n under the alternative H_1. We define

∆_n = E h(X_1, X'_n),

where X'_n is independent of X_1 and has the same distribution as X_n. Note that ∆_n measures the size of the change in the distribution of the X_i at the change point.
Theorem 2. Assume that the kernel h has uniform (2 + δ)-moments and that it satisfies the variation condition (2).
Moreover, assume that the alternative H_1 holds, and that

(k* (n − k*) ∆_n²) / (n log log n) → ∞, (12)

where k* = k*_n denotes the location of the change. Then T_n/√(log log n) → ∞ in probability.

Corollary 2.5. The test that rejects the null hypothesis H_0 of no change when

√(2 log log n) (T_n / σ_h) − b_n > g_{2,α},

where g_{2,α} denotes the upper α-quantile of the Gumbel distribution G_2, has asymptotic level α. Moreover, the test is consistent against any alternatives H_1 that satisfy (12).
Proof. Under the null hypothesis, the test statistic converges by Theorem 1 to the Gumbel extreme value distribution G_2, and thus the test that rejects the null hypothesis when the statistic exceeds g_{2,α} has asymptotic level α. Regarding the behavior under the alternative, we will show that, under the assumptions of the corollary,

P( √(2 log log n) (T_n / σ_h) − b_n > K ) → 1

for any constant K > 0. Let K > 0 be a given constant; then, for all n large enough,

P( √(2 log log n) (T_n / σ_h) − b_n > K ) ≥ P( T_n ≥ 2 σ_h √(log log n) ),

since (b_n + K)/log log n → 2 as n → ∞. Now, the right-hand side converges to 1 by Theorem 2.
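The decision rule of Corollary 2.5 can be sketched as follows; the order of centering and scaling is our reading of Theorem 1, and the statistic t_n and the long-run variance estimate sigma_hat are assumed to be computed elsewhere:

```python
import math

def weighted_change_test(t_n, sigma_hat, n, alpha=0.05):
    """Reject H_0 when sqrt(2 log log n) * t_n / sigma_hat - b_n exceeds the
    upper alpha-quantile g_{2,alpha} of G_2(x) = exp(-2 exp(-x))."""
    ll = math.log(math.log(n))
    a_n = math.sqrt(2.0 * ll)
    b_n = 2.0 * ll + 0.5 * math.log(ll) - 0.5 * math.log(math.pi)
    g = -math.log(-math.log(1.0 - alpha) / 2.0)   # G_2(g) = 1 - alpha
    return a_n * t_n / sigma_hat - b_n > g
```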
Remark 1. Condition (12) puts restrictions on the time k*_n as well as on the magnitude ∆_n of the change, in order for our test to be consistent. For an early change, i.e. when k*_n = o(n), condition (12) is equivalent to requiring that k*_n ∆_n² grows faster than log log n. Keeping the magnitude of the change ∆_n ≡ ∆ constant, this means that k*_n must grow faster than log log n.

In the next theorem, we derive the asymptotic distribution of the weighted tied-down random walk. This result will be used in the analysis of the asymptotic distribution of the linear part of the test statistic T_n.

Theorem 3. Let (X_i)_{i≥1} be an α-mixing stationary process, let S_n := ∑_{i=1}^{n} X_i denote the partial sum process, and let

σ² := Var(X_1) + 2 ∑_{k=1}^{∞} Cov(X_1, X_{1+k}) (15)

denote the long-run variance. If there exists a 2 < p ≤ 3 such that

∫_0^1 (α^{−1}(u))^{p−1} Q_{|X|}^p(u) du < ∞, (16)

then

lim_{n→∞} P( (√(2 log log n)/σ) max_{1≤k≤n−1} √(n/(k(n−k))) |S_k − (k/n) S_n| − b_n ≤ x ) = G_2(x),

where G_2(x) = exp(−2 exp(−x)), and where b_n is defined as in (5).
The proof of the above theorem follows the ideas of Yao and Davis [28], who showed that for i.i.d. standard normally distributed data the likelihood ratio converges in distribution to a Gumbel extreme value distribution. An important tool in the proof is the celebrated Darling-Erdős theorem on the asymptotic distribution of max_{1≤k≤n} S_k/√k. Theorem 4 below establishes such a result for dependent data. Such theorems have been proved before, see e.g. Shorack [25], but not under the conditions required in the present paper.
Theorem 4. Let (X_i)_{i≥1} be an α-mixing strictly stationary process satisfying (16) for some 2 < p ≤ 3, and let S_n := ∑_{i=1}^{n} X_i denote the partial sum process. Then

lim_{n→∞} P( (√(2 log log n)/σ) max_{1≤k≤n} S_k/√k − b_n ≤ x ) = G(x),

where G(x) = exp(−exp(−x)) denotes the Gumbel extreme value distribution function, and where σ and b_n are defined in (15) and (5), respectively.
The proof of Theorem 4, presented in Section 5.4 below, follows the ideas of Shorack [25] who proved that an almost sure invariance principle together with a suitable maximal inequality implies the Darling-Erdős theorem.

Simulations
In this section we present some simulation results for the weighted and the unweighted test statistics. We compare the power, the empirical size and the critical values, and consider the CUSUM and the Wilcoxon kernel, namely h(x, y) = y − x and h(x, y) = 1_{x<y} − 1/2.

Remark 2. It is easy to see that the Wilcoxon kernel satisfies the variation condition (2).

Let T_n^C denote the CUSUM and T_n^{WC} the weighted CUSUM test statistic, and let T_n^W and T_n^{WW} denote the Wilcoxon and the weighted Wilcoxon test statistic, all properly centered and normalized as in Theorem 1.
Most of the simulation study is based on i.i.d. standard normally distributed observations, in which case the long-run variance is known explicitly. In Figure 3 we will consider dependent observations. In this case the long-run variance has to be estimated. We use a subsampling estimator introduced by Dehling, Fried, Sharipov, Vogel and Wornowizki [8]. To achieve consistency under the alternative, we split the data into three disjoint sub-sequences of similar length and use the median of the resulting three separate estimates; see Dehling, Fried and Wendler [9].

First, we have simulated the critical values c_i(α) and compared them to the asymptotic ones. The results are summarized in Table 1. For the unweighted test statistics, the simulated critical values are almost in agreement with the asymptotic ones, whereas for the weighted test statistics the difference is larger. An overview is also given in Figure 1, which shows the different empirical distribution functions compared to the asymptotic ones. On the left-hand side, the empirical distribution functions of the CUSUM and Wilcoxon test statistics are compared to the distribution function of the Kolmogorov-Smirnov distribution; on the right-hand side, the distribution functions of the weighted CUSUM and weighted Wilcoxon test statistics are compared to the distribution function of the Gumbel distribution with location parameter log(2) and scale parameter 1.

Next, we evaluate the performance of the test statistics by computing the empirical sizes and the power. Table 2 presents the empirical sizes for the unweighted and weighted CUSUM and Wilcoxon test statistics. The empirical sizes are lower than the nominal size in all cases considered. For the unweighted test statistics, the size distortion shrinks to zero as the sample size increases, whereas for the weighted test statistics the difference between the empirical size and the nominal size is much larger for all considered sample sizes. In the power comparisons, the weighted test statistics perform better for early and late changes, and the Wilcoxon test statistics have better power.
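A minimal sketch of a nonoverlapping-block subsampling estimator with the median-of-three-splits modification described above; the block length and other tuning choices here are ours, not those of [8] or [9]:

```python
import numpy as np

def lrv_subsampling(y, block_len):
    """Nonoverlapping-block estimate of the long-run variance: the average of
    squared centered block sums, divided by the block length."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    m = len(y) // block_len
    sums = y[:m * block_len].reshape(m, block_len).sum(axis=1)
    return float(np.mean(sums ** 2) / block_len)

def lrv_median_of_three(y, block_len):
    """Median of the estimates on three contiguous thirds of the sample,
    the modification used to retain consistency under the alternative."""
    thirds = np.array_split(np.asarray(y, dtype=float), 3)
    return float(np.median([lrv_subsampling(t, block_len) for t in thirds]))
```

For i.i.d. standard normal data, both estimators should be close to the true long-run variance 1 for large samples.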
Figure 3 shows analogous power plots, based on AR(1) processes with t(3)-distributed innovations and correlation coefficients ρ = 0.3, 0.5, 0.7. It is easy to see that a higher correlation results in lower power. Again, in the case of early and late changes, the weighted test statistics have better power than the unweighted ones. Compared to the CUSUM test statistics, the Wilcoxon test statistics always have greater power.

Data example
As an application, we analyze the daily absolute log returns of the closing Wirecard stock prices (currency in EUR, downloaded from https://de.finance.yahoo.com/quote/WDI.DU/history?p=WDI.DU on June 14, 2021). We consider the time period February 10, 2020, to June 26, 2020, which comprises 19 weeks and 95 observations (trading days from Monday to Friday). The absolute log returns in this time period are visualized in Figure 4. As the focus of this paper is on robust tests, we apply the weighted and unweighted Wilcoxon tests to the data. The long-run variance is estimated by the same subsampling estimator used in the simulation study. At a significance level of 5%, the weighted Wilcoxon test rejects the null hypothesis of a constant mean in the absolute log returns. In contrast, the unweighted Wilcoxon test does not detect any significant change (cf. Table 3).
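The pipeline of the data example can be sketched end to end. Since we cannot reproduce the downloaded Wirecard series here, the sketch uses a synthetic stand-in with a late volatility increase, and the normalization of the statistic is our reading of Theorem 1:

```python
import numpy as np

def weighted_wilcoxon_stat(r):
    """Weighted Wilcoxon statistic max_k (n k (n-k))^(-1/2) |U_k| with kernel
    h(x, y) = 1{x <= y} - 1/2."""
    r = np.asarray(r, dtype=float)
    n = len(r)
    H = (r[:, None] <= r[None, :]).astype(float) - 0.5
    k = np.arange(1, n)
    U = np.array([H[:j, j:].sum() for j in k])
    return float(np.max(np.abs(U) / np.sqrt(n * k * (n - k))))

# Synthetic stand-in for the 95 absolute log returns (the real series would
# be computed from the downloaded closing prices): a calm regime followed by
# a late burst of volatility.
rng = np.random.default_rng(1)
calm = rng.normal(0.0, 0.01, 85)    # calm daily log returns
burst = rng.normal(0.0, 0.10, 10)   # late high-volatility regime
abs_log_returns = np.abs(np.concatenate([calm, burst]))
t_n = weighted_wilcoxon_stat(abs_log_returns)
```

The value t_n would then be centered and scaled as in Theorem 1 and compared with the Gumbel quantile g_{2,α}, with σ_h estimated by the subsampling estimator.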

Proof of Theorem 1
First, we show in the following proposition that the degenerate part is asymptotically negligible.
Proposition 5.1. Let Ψ be the kernel given by Hoeffding's decomposition of h in (7), satisfying the variation condition, and assume that (11) holds. Then, under the null hypothesis H_0, as n → ∞,

√(2 log log n) max_{1≤k≤n−1} (1/√(n k(n−k))) |∑_{i=1}^{k} ∑_{j=k+1}^{n} Ψ(X_i, X_j)| → 0 in probability.

Proof. Write V_k = ∑_{i=1}^{k} ∑_{j=k+1}^{n} Ψ(X_i, X_j). We split the maximum into the stretch up to √n, the stretch between √n and n − √n, and the stretch after n − √n, and bound the maximum by the sum of the three partial maxima (18), (19) and (20). We can now deal with each maximum separately. To show that these maxima converge to 0 in probability, we use Theorem A of Serfling [24]. To apply that theorem, we need a functional g(F_{a,n}), depending on the joint distribution of a vector (Y_{a+1}, ..., Y_{a+n}) of n random variables, satisfying the superadditivity and moment conditions of that theorem. With the choice g(F_{a,k}) = C_1 k n, where C_1 is a constant, the required conditions are satisfied: we have

g(F_{a,k}) + g(F_{a+k,l}) = C_1 k n + C_1 l n = C_1 n (k + l) = g(F_{a,k+l}).

For k = 0 and l = √n, the resulting bound goes to 0 as n → ∞. Applying Chebyshev's inequality, we obtain that (19) converges to 0 in probability as n → ∞. By stationarity, this also holds for (20). An analogous computation bounds the middle stretch; as this bound goes to zero for n → ∞, and as

max_{√n≤k≤n−√n} (log log n / n^{5/2}) |V_k| ≤ max_{1≤k≤n−√n} (log log n / n^{5/2}) |V_k|,

we obtain that (18) converges to 0 in probability.
Proof of Theorem 1. In the same way as in (8), we apply Hoeffding's decomposition to the kernel h(x, y), splitting the test statistic into a linear part with kernel h_1 and a degenerate part with kernel Ψ(X_i, X_j).
As the variation condition holds for the kernel h, it also holds for Ψ. In order to be able to apply Proposition 5.1, we need to verify assumption (11). From (3), we obtain the required rate for the mixing coefficients; see Remark 2.2 in Merlevède and Rio [17] or Annex C in Rio [22]. Now, we define the auxiliary sequence (q_k). As q_k is nonincreasing and by assumption, we obtain a bound with some constant c_1, so that (11) holds. Hence, by Slutsky's theorem, it remains to show that

(√(2 log log n)/σ_h) max_{1≤k≤n−1} √(n/(k(n−k))) |S_k − (k/n) S_n| − b_n

converges in distribution to the desired extreme value distribution. This follows from Theorem 3 with S_k = ∑_{i=1}^{k} h_1(X_i).

Proof of Theorem 2
Without loss of generality, we assume that ∆ > 0. We obtain the following Hoeffding decomposition, where ∆ is given as in Theorem 2. The linear part is controlled by the law of the iterated logarithm. It remains to show that the remaining term is bounded by some constant C. We obtain this bound as n → ∞, which completes the proof.

Proof of Theorem 3
For the proof of Theorem 3, we need the following lemmas, which all hold under the assumptions of Theorem 3.
For large n and 1 ≤ k ≤ n/log n, the following inequalities hold. By the almost sure invariance principle of Merlevède and Rio [17], one can find a Brownian motion (W_t)_{t≥0} such that the partial sums are approximated almost surely up to an error of order n^{1/p} (log n)^λ, for some λ > 0. Hence the partial sum maxima can be replaced by the corresponding Brownian motion maxima. Set k = nr, r ∈ (0, 1), and consider the first summand on the right-hand side of (22). By the law of the iterated logarithm, the corresponding term vanishes as r → 0. It follows that the claimed limit holds, which completes the proof.
For abbreviation, we define a_n := √(2 log log n). Recall the definition of b_n in (5).
Proof. By Theorem 4, one has the Darling-Erdős limit for the maximum over the first [n/log n] terms. Since a_n/a_{[n/log n]} → 1 as n → ∞, we obtain with Slutsky's theorem that

(a_n/σ) max_{1≤k≤[n/log n]} S_k/√k − b_n

converges in distribution as n → ∞, and hence the lemma follows from Lemma 5.2.
Proof. Follows from Lemma 5.5 and the stationarity under the hypothesis.
Applying the above lemmas, we can now prove Theorem 3.
Proof of Theorem 3. We have to determine the limit of P( (a_n/σ) max_{1≤k≤n−1} √(n/(k(n−k))) |S_k − (k/n) S_n| − b_n ≤ x ). From Lemma 5.5 and Lemma 5.6, we obtain the corresponding limits for the maxima over the initial and the final stretch of indices. Applying Lemma 5.4, we can replace the tied-down partial sums by the plain partial sums on the initial stretch, and applying Lemma 5.4 to the reversed time series, the same holds on the final stretch. Since the underlying process is α-mixing, the maxima over the initial and the final stretch are asymptotically independent. All in all, we get the desired result

lim_{n→∞} P( (a_n/σ) max_{1≤k≤n−1} √(n/(k(n−k))) |S_k − (k/n) S_n| − b_n ≤ x ) = G(x)² = G_2(x).

Proof of Theorem 4
First, we state two lemmas, both valid under the assumptions of Theorem 4. We define the random variables Y_n and Z_n by splitting the maximum of |S_k|/√k at an index threshold depending on some γ > 1/2. Lemma 5.7.
√(2 log log n) Y_n − 2 log log n → −∞ in probability.

Proof. We apply the following maximal inequality for α-mixing processes, due to Rio [22], to the random variables ξ_k = X_k/√k, where Q_k denotes the quantile function of ξ_k. Note that Q_k(u) = Q_{|X|}(u)/√k. Furthermore, we use the inequality of Lemma A.1, i.e.

max_{1≤k≤n} |S_k|/√k ≤ 2 max_{1≤k≤n} |∑_{j=1}^{k} ξ_j|.
Then, for K > 0, we obtain for all n ≥ n_K

P( √(2 log log n) Y_n − 2 log log n ≥ −K ) ≤ P( Y_n ≥ √(log log n) ),

and the right-hand side tends to 0 as n → ∞.

Lemma 5.8. If P(U_n > x) → 0 for every x, then P(max(U_n, V_n) > x) − P(V_n > x) → 0.

Proof. We have the following chain of inequalities:

P(V_n > x) ≤ P(max(U_n, V_n) > x) ≤ P(U_n > x) + P(V_n > x),

and hence |P(max(U_n, V_n) > x) − P(V_n > x)| ≤ P(U_n > x) → 0.
Proof of Theorem 4. Let (W_t)_{t≥0} be a Brownian motion with Var(W_1) = σ². Define analogues of the random variables Y_n and Z_n, replacing the partial sum process by the Brownian motion with increments A_j − A_{j−1}, so that ∑_{j=1}^{k} (A_j − A_{j−1}) = A_k.

Remark 3. This is a special case of an inequality stated as Lemma 1 in Shorack and Smythe [26].
Let g satisfy the variation condition with constant L. Then g_K also satisfies the variation condition with the same constant L. Assume, without loss of generality, that m = i_2 − i_1. Moreover, we can assume that there exists a uniform-on-[0, 1] random variable that is independent of (X_i)_{i≥1}. Using Theorem 1 of Peligrad [18], choose a random variable X'_{i_1}, independent of X_{i_2}, X_{i_3}, X_{i_4} and with the same distribution as X_{i_1}. As g is a degenerate kernel, we have E[g(X'_{i_1}, X_{i_2}) g(X_{i_3}, X_{i_4})] = 0. Then there exists a constant C such that the stated bound holds.

Proof. This was proved for functionals of absolutely regular processes in Lemma 1 of Dehling et al. [7]. The proof makes use of an upper bound for the expectations |E(g(X_{i_1}, X_{i_2}) g(X_{i_3}, X_{i_4}))|. Such a bound for an α-mixing process is stated in Lemma A.2. The rest of the proof is analogous.