Open-end nonparametric sequential change-point detection based on the retrospective CUSUM statistic

The aim of online monitoring is to issue an alarm as soon as there is significant evidence in the collected observations to suggest that the underlying data generating mechanism has changed. This work is concerned with open-end, nonparametric procedures that can be interpreted as statistical tests. The proposed monitoring schemes consist of computing the so-called retrospective CUSUM statistic (or minor variations thereof) after the arrival of each new observation. After proposing suitable threshold functions for the chosen detectors, the asymptotic validity of the procedures is investigated in the special case of monitoring for changes in the mean, both under the null hypothesis of stationarity and relevant alternatives. To carry out the sequential tests in practice, an approach based on an asymptotic regression model is used to estimate high quantiles of relevant limiting distributions. Monte Carlo experiments demonstrate the good finite-sample behavior of the proposed monitoring schemes and suggest that they are superior to existing competitors as long as changes do not occur at the very beginning of the monitoring. Extensions to statistics exhibiting an asymptotic mean-like behavior are briefly discussed. Finally, the application of the derived sequential change-point detection tests is succinctly illustrated on temperature anomaly data.


Introduction
In situations in which observations are continuously collected over time, the aim of sequential or online change-point detection is to issue an alarm as soon as possible if it is thought that the probabilistic properties of the underlying unobservable data generating mechanism have changed. While this problem has a long history in statistical process control (see, e.g., Lai, 2001; Montgomery, 2007, for an overview), we adopt herein the alternative perspective proposed in the seminal work of Chu, Stinchcombe and White (1996) and treat the issue from the point of view of statistical testing. To fix ideas, assume that we have at hand an initial stretch X_1, ..., X_m (frequently referred to as the learning sample) from a univariate stationary time series. As new observations arrive, the aim is to look for evidence against stationarity and to issue an alarm if such evidence is deemed significant. Specifically, once the kth observation has been collected, a positive statistic D_m(k) measuring departure from stationarity is computed from the available observations X_1, ..., X_k and compared to a suitably chosen threshold w(k/m). If D_m(k) > w(k/m), the hypothesis that X_1, ..., X_k is a stretch from a stationary time series is rejected and the monitoring stops. Otherwise, a new observation X_{k+1} is collected and the previous steps are repeated using X_1, ..., X_{k+1}.
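In pseudo-code terms, the generic open-end monitoring loop just described can be sketched as follows. This is a minimal illustration of the compare-detector-to-threshold mechanism only; the detector and threshold below are placeholder choices (a scaled ordinary-CUSUM-type statistic and a constant threshold), not the specific procedures studied in this paper.

```python
from typing import Callable, Iterable, List, Optional

def monitor(learning_sample: List[float],
            new_observations: Iterable[float],
            detector: Callable[[List[float], int], float],
            threshold: Callable[[float], float]) -> Optional[int]:
    """Generic open-end monitoring loop: returns the alarm time k, or None
    if the stream is exhausted without an alarm (closed-end truncation)."""
    m = len(learning_sample)
    data = list(learning_sample)
    for x in new_observations:
        data.append(x)          # observation X_k arrives
        k = len(data)
        if detector(data, m) > threshold(k / m):
            return k            # significant evidence against stationarity
    return None

# Toy illustration: the detector is the scaled absolute difference between
# the learning-sample mean and the post-monitoring mean (the ordinary
# CUSUM idea), and the threshold function is constant.
def toy_detector(data, m):
    k = len(data)
    mean1 = sum(data[:m]) / m
    mean2 = sum(data[m:]) / (k - m)
    return (k - m) / m ** 0.5 * abs(mean1 - mean2)

alarm = monitor([0.0] * 20, [5.0] * 10, toy_detector, lambda t: 3.0)
```

With the toy inputs above, the detector first exceeds the constant threshold after the third post-learning observation.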
Among such procedures, one can distinguish between closed-end approaches, for which the monitoring eventually stops if stationarity is not rejected after the arrival of observation X_n, n > m, and open-end approaches which, in principle, can continue indefinitely. The sequential testing procedures studied in this work pertain to the second category and, for that reason, their null hypothesis is

H_0 : X_1, ..., X_m, X_{m+1}, X_{m+2}, ... is a stretch from a stationary time series.
(1.1)

Their interpretation as statistical tests implies that, given a significance level α ∈ (0, 1/2), the thresholds w(k/m), k ≥ m + 1, need to be chosen such that, ideally, under H_0,

lim_{m→∞} P{D_m(k) > w(k/m) for some k ≥ m + 1} = α. (1.2)

The previous display shows how closely the choice of the so-called detector D_m and the threshold function w are related. Intuitively, under H_0, the sample paths of the stochastic process {D_m(k)/w(k/m)}_{k≥m+1} should fluctuate but not exceed a constant threshold too often while, when H_0 does not hold, they are expected to exceed it rapidly after a change in the data generating process has occurred. The starting point of our investigations is the recent work of Gösmann, Kley and Dette (2021), who study open-end monitoring schemes sensitive to potential changes in a parameter θ (such as the mean) of a time series. A first natural approach to this problem, often referred to as the ordinary CUSUM (cumulative sum) in the sequential change-point detection literature, consists of comparing an estimator θ̂_{1:m} of θ computed from the learning sample X_1, ..., X_m with an estimator θ̂_{m+1:k} of θ computed from the observations X_{m+1}, ..., X_k collected after the monitoring has started; see, e.g., Horváth et al. (2004), Aue and Horváth (2004), Aue et al. (2006), as well as the references given in the recent review by Kirch and Weber (2018). The idea of Gösmann, Kley and Dette (2021) is to define detectors that take into account all of the differences θ̂_{1:j} − θ̂_{j+1:k}, j ∈ {m, ..., k − 1}. This approach, which can be regarded as adapted from retrospective (or offline, a posteriori) change-point detection (where data collection is assumed complete before testing is carried out), treats each j ∈ {m, ..., k − 1} as a potential change-point.
In their experiments, Gösmann, Kley and Dette (2021) found that their approach is not only more powerful than the ordinary CUSUM but also more powerful than the so-called Page CUSUM procedure, which consists of defining a detector from the differences θ̂_{1:m} − θ̂_{j+1:k}, j ∈ {m, ..., k − 1}; see, e.g., Fremdt (2015) and Kirch and Weber (2018).
When computed after the kth observation has been collected, the detector proposed by Gösmann, Kley and Dette (2021) (thus involving all the differences θ̂_{1:j} − θ̂_{j+1:k}, j ∈ {m, ..., k − 1}) does not however coincide with the so-called retrospective CUSUM statistic that could be computed from X_1, ..., X_k (which involves all the differences θ̂_{1:j} − θ̂_{j+1:k}, j ∈ {1, ..., k − 1}; see, e.g., Csörgő and Horváth, 1997; Aue and Horváth, 2013, and the references therein). The reasons that led Gösmann, Kley and Dette (2021) not to consider such an approach are discussed in their Remark 2.1. As shall be explained in the next section, we believe that an open-end monitoring scheme using the retrospective CUSUM statistic as detector could be even more powerful than the procedure of Gösmann, Kley and Dette (2021) as long as changes do not occur at the very beginning of the monitoring.
The aim of this work is to address the theoretical and practical issues associated with defining a nonparametric detector for open-end monitoring that coincides at each k with the retrospective CUSUM statistic. The theoretical issues are mostly related to the choice of the threshold function, while the practical issues stem from the fact that the quantiles of the underlying limiting distributions required to carry out the sequential tests are harder to estimate. This paper is organized as follows. In the second section, we propose three open-end nonparametric monitoring schemes related to the retrospective CUSUM statistic designed to be sensitive to changes in the mean of univariate time series. Their asymptotic behavior as the size m of the learning sample tends to infinity is studied in the third section, both under the null hypothesis of stationarity and under relevant alternatives. Section 4 is concerned with the estimation of high quantiles of the related limiting distributions, which is necessary in practice to carry out the sequential tests. The fifth section presents a summary of extensive numerical experiments demonstrating the good finite-sample properties of the resulting sequential testing procedures. An extension to time series parameters whose estimators admit an asymptotic mean-like linearization, as considered in Gösmann, Kley and Dette (2021), is briefly discussed in Section 6. A short illustration involving temperature anomaly data concludes the work.
All proofs are deferred to the appendices. Throughout the paper, all convergences are for m → ∞ unless mentioned otherwise. A preliminary implementation of the studied tests is available in the package npcp (Kojadinovic, 2020) for the R statistical system (R Core Team, 2020).

The retrospective CUSUM for monitoring changes in the mean
Our aim is to derive open-end nonparametric sequential change-point detection procedures that are particularly sensitive to alternative hypotheses of the form

H_1 : there exists k* ≥ m such that E(X_i) = μ for all i ≤ k* and E(X_i) = μ + δ for all i > k*, for some δ ≠ 0. (2.1)

After the arrival of the kth observation with k > m, the data at hand consist of the stretch X_1, ..., X_k. If we were in the context of retrospective change-point detection, a natural test statistic would be the so-called retrospective CUSUM statistic (see, e.g., Csörgő and Horváth, 1997; Aue and Horváth, 2013, and the references therein) defined by

R_k = max_{1≤j≤k−1} {j(k − j)/k^{3/2}} |X̄_{1:j} − X̄_{j+1:k}|, (2.2)

where X̄_{i:j} = (j − i + 1)^{−1} Σ_{l=i}^{j} X_l denotes the sample mean of X_i, ..., X_j. In the definition of R_k, we see that every j ∈ {1, ..., k − 1} is treated as a potential change-point in the sequence X_1, ..., X_k. The maximum over j then implies that R_k will be large as soon as the difference between X̄_{1:j} and X̄_{j+1:k} is large for some j.
In the sequential context considered in this work, since X_1, ..., X_m is the learning sample known to be a stretch from a stationary time series, a first natural modification of (2.2) is to restrict the maximum over j to j ∈ {m, ..., k − 1}. This is the idea considered by Dette and Gösmann (2019) in a closed-end setting; additionally, these authors replaced the normalizing factor k^{3/2} by m^{3/2} so that the asymptotics of the corresponding monitoring scheme could be studied as the size m of the learning sample tends to infinity. The resulting detector is

R_m(k) = √m max_{m≤j≤k−1} {j(k − j)/m²} |X̄_{1:j} − X̄_{j+1:k}|, k > m. (2.4)

In an open-end setting, Gösmann, Kley and Dette (2021) chose however not to consider the detector R_m (see Remark 2.1 in the latter reference) but suggested instead the detector

E_m(k) = √m max_{m≤j≤k−1} {(k − j)/m} |X̄_{1:j} − X̄_{j+1:k}|, k > m. (2.5)

The difference between R_m and E_m evidently lies in the weighting of the absolute differences of means |X̄_{1:j} − X̄_{j+1:k}|, j ∈ {m, ..., k − 1}. Instead of weighting |X̄_{1:j} − X̄_{j+1:k}| by j(k − j)/m², (2.5) replaces j/m by 1. While this modification may be beneficial in terms of power when k is close to m, it could have a negative impact when k is substantially larger than m because, then, j(k − j)/m² can be substantially larger than (k − j)/m. In other words, we suspect that, for changes not occurring at the beginning of the monitoring, a suitable detection scheme based on R_m could be more powerful than the one proposed by Gösmann, Kley and Dette (2021) based on E_m.

Remark 2.1. As mentioned in the introduction, the simplest detector in an open-end setting is probably the so-called ordinary CUSUM initially considered by Horváth et al. (2004) for investigating changes in the parameters of linear models. With the aim of detecting changes in the mean, following Aue and Horváth (2004), it can be defined by

Q_m(k) = √m {(k − m)/m} |X̄_{1:m} − X̄_{m+1:k}|, k > m. (2.6)

We will use it as a benchmark in our Monte Carlo experiments.

As explained in the introduction, the choice of a detector needs to be accompanied by the choice of a suitable threshold function.
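To make the two weightings concrete, here is a direct, unoptimized computation of the detectors R_m and E_m from a data stretch, following the weightings described above. This is an illustrative sketch of ours, not the npcp implementation.

```python
import math

def mean(x):
    return sum(x) / len(x)

def retro_cusum_detector(data, m):
    """R_m(k): sqrt(m) times the max over j in {m,...,k-1} of
    j*(k-j)/m^2 * |mean(X_1..X_j) - mean(X_{j+1}..X_k)|."""
    k = len(data)
    return math.sqrt(m) * max(
        j * (k - j) / m ** 2 * abs(mean(data[:j]) - mean(data[j:]))
        for j in range(m, k))

def gkd_detector(data, m):
    """E_m(k): same differences of means, but weighted by (k-j)/m."""
    k = len(data)
    return math.sqrt(m) * max(
        (k - j) / m * abs(mean(data[:j]) - mean(data[j:]))
        for j in range(m, k))

# Tiny worked example: m = 2, k = 5, with a clear shift after the third
# point; for R_m the maximum is attained at j = 3.
x = [0.0, 0.0, 0.0, 10.0, 10.0]
r = retro_cusum_detector(x, 2)
e = gkd_detector(x, 2)
```

For this toy stretch, r = 1.5 · 10 · √2 and e = 10 · √2, illustrating that R_m up-weights candidate change-points j that are far from the start of the monitoring.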
To heuristically justify our choice of a suitable threshold function for the detector R_m in (2.4), we momentarily consider the closed-end setting in which monitoring stops at the latest after observation X_n is collected. Following Dette and Gösmann (2019) among others, to be able to study the sequential testing scheme asymptotically, we set n = ⌊mT⌋ for some real T > 1. Then, under H_0 in (1.1), assuming that the functional central limit theorem holds for the underlying stationary time series (X_i)_{i∈Z}, it can be verified using relatively simple arguments (see, e.g., Gösmann, 2019, Section 3, or Kojadinovic and Verdier, 2021, Section 2) that, for any t ∈ [1, T], R_m(⌊mt⌋) converges in distribution to σL(t), where L(t) = sup_{1≤s≤t} |tW(s) − sW(t)|, W is a standard Brownian motion and σ² = Σ_{i∈Z} Cov(X_0, X_i) > 0 is the finite long-run variance of (X_i)_{i∈Z}. For any fixed t ∈ [1, T], by Brownian scaling, t^{−3/2} L(t) is equal in distribution to sup_{1/t≤u≤1} |W(u) − uW(1)|. Hence, for large t, the distribution of t^{−3/2}σ^{−1}L(t) is close to that of the supremum of the absolute value of a Brownian bridge on [0, 1]. As a consequence, under H_0 in (1.1), the distribution of t^{−3/2}σ^{−1}R_m(⌊mt⌋) stabilizes as m and t increase. The latter observation suggests the possibility of an open-end sequential testing scheme based on R_m in (2.4) with a threshold function that is not too different from t ↦ t^{3/2}. As shall become clear from Theorem 3.3 below, in order to ensure that (1.2) is fully meaningful when D_m = R_m, the corresponding threshold function actually needs to diverge to ∞ (as t → ∞) slightly faster than t^{3/2}. We propose to use as threshold function for R_m

w_R(t) = t^{3/2+η} w_γ(t), t > 1, (2.7)

where η > 0 is a real parameter and

w_γ(t) = [max{(t − 1)/t, ε}]^γ, t > 1, (2.8)

with γ ≥ 0 another real parameter and ε > 0 a technical constant that can be taken very small in practice (we used ε = 10^{−10} in our implementation). Let us first explain the role of the parameter η. Following the perspective adopted in the discussion below (1.2), the resulting monitoring can be seen as consisting of computing {w_R(k/m)}^{−1} R_m(k) for k ≥ m + 1.
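As a concrete sketch, one possible implementation of a threshold function of the kind just described, namely t^{3/2+η} multiplied by an ε-truncated weight w_γ, is as follows. The exact placement of the truncation constant ε inside w_γ is our assumption for this illustration.

```python
def w_gamma(t: float, gamma: float, eps: float = 1e-10) -> float:
    """Weight w_gamma(t): ((t - 1)/t)^gamma, truncated away from zero by
    the small technical constant eps (assumed placement of the truncation)."""
    return max((t - 1.0) / t, eps) ** gamma

def w_R(t: float, eta: float, gamma: float, eps: float = 1e-10) -> float:
    """Threshold function for the detector R_m: grows slightly faster
    than t^(3/2) thanks to the extra exponent eta > 0."""
    return t ** (1.5 + eta) * w_gamma(t, gamma, eps)
```

For γ = 0 the weight is identically 1, so w_R(t) reduces to t^{3/2+η}; the role of γ > 0 is discussed next.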
It is elementary that, for any fixed k ≥ m + 1, as we increase η both the mean and the variance of {w_R(k/m)}^{−1} R_m(k) decrease. The top-left plot of Figure 2.1 displays one sample path of {{w_R(k/m)}^{−1} R_m(k)}_{m+1≤k≤m+5000} for m = 100, γ = 0 and η = 0.1 computed from a sequence of independent standard normals (solid line), as well as the sample path computed from the same sequence but with η = 0.001 instead (dotted line). Unsurprisingly, because of the factor (k/m)^{−η} in {w_R(k/m)}^{−1} R_m(k), the sample path with η = 0.1 lies below the sample path with η = 0.001. This effect of η is confirmed by the top-right plot of Figure 2.1, which displays the corresponding empirical standard deviations at k, plotted against k − m, computed from 1000 sample paths. As expected, increasing the parameter η increases the rate of convergence to zero (as k → ∞) of {w_R(k/m)}^{−1} R_m(k) (and of its mean and variance). Intuitively, in the context of open-end monitoring, one would therefore want η to be very small so that there is little reduction in variability as time elapses. The practical choice of the parameter η will be discussed in detail in Section 4.
Let us now explain the role of the parameter γ. The multiplication of a candidate threshold function by w_γ in (2.8) was initially considered in Horváth et al. (2004). From a practical perspective, as we shall see below, this may improve the finite-sample performance of the sequential testing scheme at the beginning of the monitoring (and has a negligible effect later). From a theoretical perspective, by proceeding in this way for the ordinary CUSUM detector Q_m in (2.6), Aue and Horváth (2004) were able, under alternatives related to (2.1), to derive the limiting distribution of the detection delay (the time after which the detector exceeds the threshold function). The use of such a modification of a threshold function turns out to be common in the literature and can also be found, for instance, in Fremdt (2015), Kirch and Weber (2018) and Gösmann, Kley and Dette (2021), among many others. Notice that, unlike what is frequently done in the literature, we do not impose that γ be strictly smaller than 1/2. To provide some further insight, consider the top-left plot of Figure 2.2, which displays the empirical 95% quantile (solid line), empirical standard deviation (dashed line) and sample mean (dotted line) of {w_R(k/m)}^{−1} R_m(k) for m = 100, η = 0.001 and γ = 0, plotted against k − m and computed from 1000 sample paths generated from independent standard normal sequences. As one can see, because of the small value of η, the distribution of {w_R(k/m)}^{−1} R_m(k) appears to stabilize rather quickly as k increases. The speed at which this occurs can be increased by increasing γ. For instance, by comparing the top-left and bottom-left plots of Figure 2.2, one can clearly see that the distribution of {w_R(k/m)}^{−1} R_m(k) stabilizes more quickly as k increases for γ = 0.25 than for γ = 0. However, as one can see from the plot for γ = 0.45, the value of γ should not be taken too large.
In addition to the detector R_m in (2.4), we shall also consider the detectors S_m in (2.9) and T_m in (2.10), with corresponding threshold functions w_S in (2.11) and w_T in (2.12). Roughly speaking, the maximum over j in (2.4) is replaced by an aggregation in an L¹ or L² sense, so that S_m and T_m can be regarded as the L¹ and L² versions, respectively, of R_m in (2.4).
The parameters η and γ in (2.11) and (2.12) play the same role as in (2.7). For η, this can be empirically verified from the second and third rows of graphs in Figure 2.1. For γ, plots similar to Figure 2.2 for S m and T m reveal that values of γ larger than 0.25 seem meaningful for these two detectors. Specifically, it seems possible to improve the finite-sample performance of the corresponding schemes at the beginning of the monitoring by taking γ as large as 0.85 for S m and as large as 0.45 for T m .

Asymptotics of the procedures
To study the asymptotic behavior of the three considered monitoring schemes under H 0 in (1.1), we follow Horváth et al. (2004), Aue et al. (2006), Fremdt (2015) and Gösmann, Kley and Dette (2021), among others, and assume that the observations satisfy the following condition.
Condition 3.1. The data are a stretch from a stationary time series (X_i)_{i∈Z} such that σ² = Σ_{i∈Z} Cov(X_0, X_i), the long-run variance of (X_i)_{i∈Z}, is strictly positive and finite. Furthermore, for all m ∈ N, there exist two independent standard Brownian motions W_{m,1} and W_{m,2} such that, for some 0 < ξ < 1/2,

sup_{k>m} (k − m)^{−ξ} | Σ_{i=m+1}^{k} (X_i − μ) − σ W_{m,1}(k − m) | = O_P(1), (3.1)

where μ = E(X_1), and

max_{1≤j≤m} j^{−ξ} | Σ_{i=1}^{j} (X_i − μ) − σ W_{m,2}(j) | = O_P(1). (3.2)

As mentioned in Remark 2.6 (i) of Gösmann, Kley and Dette (2021), the validity of Condition 3.1 can be verified by proceeding along the lines of Aue et al. (2006) using the strong approximation results of Eberlein (1986). The latter is actually already carried out in Lemma A.1 of Aue et al. (2006) for augmented GARCH processes. Note that the verification of (3.1) is also discussed in Section 2 of Aue and Horváth (2004) for various types of stochastic processes.

Remark 3.2. In the prototypical situation in which (X_i)_{i∈Z} is a sequence of independent normal random variables with variance σ², there exists a probability space on which the above conditions are trivially satisfied with O_P(1) replaced by 0. This is essentially just the statement that the increments of a Brownian motion are standard normal random variables. To be more precise, let (B_j)_{j≥0} be independent standard Brownian bridges that are independent of (X_i)_{i∈Z}. Without loss of generality, we may assume that the X_i are standard normal, and we can define W_{m,1} by first specifying its values at integer times by W_{m,1}(k − m) = Σ_{i=m+1}^{k} X_i, k ≥ m, and then interpolating between these values using the bridges (B_{m+j})_{j≥0}: for t ∈ (k − m, k − m + 1), set

W_{m,1}(t) = W_{m,1}(k − m) + B_k{t − (k − m)} + {t − (k − m)}{W_{m,1}(k − m + 1) − W_{m,1}(k − m)}

(it is an exercise to check that the resulting process W_{m,1} is a standard Brownian motion). Then, the term in absolute values in (3.1) is exactly 0 for each k and m. Similarly, setting W_{m,2}(j) = Σ_{i=1}^{j} X_i for j ≤ m and interpolating with the bridges (B_j)_{0≤j≤m−1} makes (3.2) hold with 0 in the absolute value for each m.
Furthermore, by construction, W 1,m and W 2,m are independent.
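The construction in Remark 3.2 can be mimicked numerically: the Brownian motion is specified at integer times via partial sums of standard normals, and a non-integer time is filled in using an independent Brownian bridge value (for a single interior time u ∈ (0, 1), the bridge value is simply N(0, u(1 − u))). The sketch below is our illustration, not part of the proof; it checks that the interpolated value has the variance of a Brownian motion.

```python
import random

def interpolated_bm_value(t: float, rng: random.Random) -> float:
    """Value at time t of a Brownian motion built from iid N(0,1)
    increments at integer times, interpolated with an independent
    Brownian bridge inside the enclosing unit interval."""
    k = int(t)           # integer part: W(k) is a sum of k iid N(0,1)
    u = t - k            # position inside the interval (k, k + 1)
    w_k = sum(rng.gauss(0.0, 1.0) for _ in range(k))
    increment = rng.gauss(0.0, 1.0)                   # W(k+1) - W(k)
    bridge = rng.gauss(0.0, (u * (1.0 - u)) ** 0.5)   # bridge value at u
    return w_k + u * increment + bridge

rng = random.Random(7)
vals = [interpolated_bm_value(2.5, rng) for _ in range(4000)]
var = sum(v * v for v in vals) / len(vals)
```

The variance of the interpolated value at t = 2.5 is k + u² + u(1 − u) = 2 + 0.25 + 0.25 = 2.5, i.e., exactly t, as a Brownian motion requires; the empirical variance `var` should be close to 2.5.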
The following result is proven in Appendix A.
Theorem 3.3. Under H_0 in (1.1) and Condition 3.1, for any fixed η > 0, ε > 0 and γ ≥ 0,

sup_{k≥m+1} {σ w_R(k/m)}^{−1} R_m(k) ⇝ sup_{1≤s≤t<∞} {w_R(t)}^{−1} |tW(s) − sW(t)|,

and analogous convergences hold for the pairs (S_m, w_S) and (T_m, w_T), where the arrow '⇝' denotes convergence in distribution, the detectors R_m, S_m and T_m are defined in (2.4), (2.9) and (2.10), respectively, the threshold functions w_R, w_S and w_T are defined in (2.7), (2.11) and (2.12), respectively, and W is a standard Brownian motion. In addition, all the limiting random variables are almost surely finite.
Imposing that η is strictly positive in the previous theorem is necessary for ensuring that the limiting random variables are almost surely finite. Recall the definition of the function w γ in (2.8). Since the function t → 1/w γ (t) is bounded from below by 1, the following result, proven in Appendix A, implies that for the monitoring scheme based on R m and w R , this condition is necessary and sufficient.

Proposition 3.4. For any fixed γ ≥ 0 and ε > 0,

sup_{1≤s≤t<∞} {t^{3/2} w_γ(t)}^{−1} |tW(s) − sW(t)| = ∞ almost surely,

where W is a standard Brownian motion.
Remark 3.5. We thus have that sup_{1≤s≤t<∞} {w_R(t)}^{−1} |tW(s) − sW(t)| is almost surely finite for η > 0 and almost surely infinite for η = 0. By the law of the iterated logarithm for Brownian motion, this supremum remains finite if t^η in w_R in (2.7) is replaced by h(t), where h(t) = √(log log t) when t > e^e and h(t) = 1 when t ≤ e^e. We expect that Theorem 3.3 remains valid with such a modification, which could be considered optimal in the sense that, as t → ∞, h diverges to infinity more slowly than t ↦ t^η for any η > 0. The latter implies that the use of h(t) instead of t^η entails the lowest possible variability reduction for the monitoring scheme (in the sense of the discussion on the role of η in the previous section) in the limit as k → ∞.
The next corollary provides an operational version of Theorem 3.3 to carry out the three sequential change-point detection tests in practice.
Corollary 3.6. The statement of Theorem 3.3 remains true if the long-run variance σ² is replaced by an estimator σ²_m of σ² computed from the learning sample X_1, ..., X_m such that σ²_m − σ² = o_P(1).

To fix ideas, let us briefly explain how Corollary 3.6 can be used to carry out the sequential test based on R_m in (2.4). Given a significance level α ∈ (0, 1/2), it is first necessary to accurately estimate q_{R,1−α}, the (1 − α)-quantile of the continuous random variable sup_{1≤s≤t<∞} {w_R(t)}^{−1} |tW(s) − sW(t)| (this aspect will be discussed in detail in Section 4). Then, under H_0 in (1.1), from the Portmanteau theorem,

lim_{m→∞} P[ sup_{k≥m+1} {σ_m w_R(k/m)}^{−1} R_m(k) > q_{R,1−α} ] = α. (3.3)

Hence, for large m, we can expect that, under H_0,

P{ R_m(k) > σ_m w_R(k/m) q_{R,1−α} for some k ≥ m + 1 } ≈ α.

In practice, after the arrival of observation X_k, k ≥ m + 1, the detector R_m(k) is computed and compared with the threshold σ_m w_R(k/m) q_{R,1−α}. If R_m(k) exceeds this threshold, the null hypothesis is rejected and the monitoring stops. Otherwise, the next observation is collected and the previous iteration is carried out with the k + 1 available data points. Corollary 3.6, and (3.3) in particular, guarantee that such a sequential testing procedure will have asymptotic level α. Steps to carry out the tests based on S_m in (2.9) and T_m in (2.10) can be obtained mutatis mutandis.
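Putting the pieces together, the monitoring scheme based on R_m can be sketched as follows. The critical value q and the long-run variance estimator below are placeholders of ours: q = 3.0 is a hypothetical quantile, and the sample standard deviation of the learning sample is only appropriate for serially independent data; the paper instead uses the estimated quantiles of Section 4 and a kernel long-run variance estimator.

```python
import math
import random

def rm_detector(data, m):
    """Retrospective-CUSUM-type detector R_m(k) via prefix sums."""
    k = len(data)
    cum = [0.0]
    for x in data:
        cum.append(cum[-1] + x)          # prefix sums for O(1) means
    best = 0.0
    for j in range(m, k):
        diff = abs(cum[j] / j - (cum[k] - cum[j]) / (k - j))
        best = max(best, j * (k - j) / m ** 2 * diff)
    return math.sqrt(m) * best

def monitor_rm(learning, stream, q, eta=0.001, gamma=0.0, eps=1e-10):
    """Open-end monitoring: alarm time k, or None if the stream ends."""
    m = len(learning)
    mu = sum(learning) / m
    sigma = math.sqrt(sum((x - mu) ** 2 for x in learning) / (m - 1))
    data = list(learning)
    for x in stream:
        data.append(x)
        k = len(data)
        t = k / m
        w_r = t ** (1.5 + eta) * max((t - 1) / t, eps) ** gamma
        if rm_detector(data, m) > sigma * w_r * q:
            return k                     # reject H_0, stop monitoring
    return None

random.seed(42)
learning = [random.gauss(0, 1) for _ in range(100)]
stream = ([random.gauss(0, 1) for _ in range(50)] +
          [random.gauss(5, 1) for _ in range(250)])  # mean shift at k = 150
alarm = monitor_rm(learning, stream, q=3.0)          # q = 3.0 is hypothetical
```

With this large mean shift, the detector overtakes the threshold shortly after the change at k = 150, so the alarm fires with a short detection delay.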
We now turn to the asymptotic behavior of the monitoring schemes under sequences of alternatives related to H 1 in (2.1). We start with the procedure based on R m in (2.4) which we study under a condition similar to the one used by Gösmann, Kley and Dette (2021, Theorem 2.13).
Condition 3.7. The data are a stretch, for some m ∈ N, from a sequence of random variables (X_i^(m))_{i∈N} defined in terms of sequences of random variables (Y_i^(m))_{i∈N} and (Z_i^(m))_{i∈N} defined on the same probability space, with a change occurring at a time k_m ≥ m, and satisfying conditions (i)–(iii) as well as either (iv) or (v).

Remark 3.8. Statement (iv) (resp. (v)) in Condition 3.7 is related to what were called early (resp. late) changes in Gösmann, Kley and Dette (2021). As an illustration, consider sequences (Y_i^(m))_{i∈Z}, m ≥ 0, constructed from a stationary sequence for which the central limit theorem holds. It is then easy to verify that these sequences satisfy (i) and (ii) in Condition 3.7. Let us verify that they also satisfy (iii), (iv) and (v). Let ε > 0. Since the central limit theorem holds, (3.4) holds for any sequence (k_m)_{m∈N}. For c = 1, bounding the quantity on the left-hand side of (3.5) shows that (3.5) holds with c = 1 for any sequence (k_m)_{m∈N}; the quantity on the left-hand side of (3.6) can be bounded similarly for c = 1.

The following result, proven in Appendix B, establishes the consistency of the procedure based on R_m under Condition 3.7.
Assume that, under Condition 3.7, the estimator σ²_m appearing in Corollary 3.6 converges in probability to a strictly positive constant. An immediate corollary of the previous theorem is then that, under Condition 3.7, the quantity sup_{k≥m+1} {σ_m w_R(k/m)}^{−1} R_m(k) diverges in probability to infinity. In other words, under the considered conditions, for any fixed η ≥ 0, ε ≥ 0 and γ ≥ 0, and any threshold κ > 0,

lim_{m→∞} P{ R_m(k) > κ σ_m w_R(k/m) for some k ≥ m + 1 } = 1.

Proving the consistency of the procedures based on S_m in (2.9) and T_m in (2.10) for late changes turns out to be more difficult. The following result, proven in Appendix B, shows that they are consistent for early changes. Note that the considered condition is very similar to those considered, for instance, in Dette and Gösmann (2019, Theorem 3.8) or Kojadinovic and Verdier (2021, Proposition 2.7) in the context of closed-end monitoring.
Condition 3.11. The data are a stretch, for some m ∈ N, from the sequence of random variables (X_i^(m))_{i∈N} defined by

X_i^(m) = Y_i 1(i ≤ k_m) + Z_i 1(i > k_m), i ∈ N,

where k_m = ⌊mc⌋ for some constant c ≥ 1, and (Y_i)_{i∈Z} and (Z_i)_{i∈Z} are stationary sequences defined on the same probability space such that E(Y_1) ≠ E(Z_1) and for which the functional central limit theorem holds.
Theorem 3.12. Under Condition 3.11, for any fixed T > c, the suitably normalized detectors S_m and T_m converge in distribution over the monitoring window {m + 1, ..., ⌊mT⌋}, the expressions of the limits involving ∨ and ∧, the maximum and minimum operators, respectively. Consequently, for any fixed η ≥ 0, ε ≥ 0 and γ ≥ 0, the suprema of the threshold-normalized detectors over this window diverge in probability to infinity (3.8). Since the estimator σ²_m appearing in Corollary 3.6 converges in probability to a strictly positive constant under Condition 3.11, an immediate corollary of Theorem 3.12 is that, under the same conditions, the corresponding sequential testing procedures based on S_m and T_m reject H_0 with probability tending to one.

Estimation of high quantiles of the limiting distributions
From the discussion following Corollary 3.6, we know that accurate estimates of high quantiles of the limiting distributions appearing in Theorem 3.3 are necessary to carry out the sequential tests based on the detectors R_m, S_m and T_m defined in (2.4), (2.9) and (2.10), respectively. Before we present the approach considered in this work, let us briefly explain how Horváth et al. (2004) and Gösmann, Kley and Dette (2021) proceeded for the procedure based on Q_m in (2.6) and the one based on E_m in (2.5), respectively. From the latter references, suitable threshold functions for these two detectors are w_Q and w_E, respectively, both involving the function w_γ defined in (2.8) with γ restricted to the interval [0, 1/2). Furthermore, from Gösmann, Kley and Dette (2021), under H_0 in (1.1) and Condition 3.1, the corresponding normalized detectors converge in distribution to limits that are functionals of a standard Brownian motion W on [0, 1]. Notice that, using the law of the iterated logarithm/local modulus of continuity for Brownian motion, it can be verified, since γ < 1/2, that the limiting random variables are almost surely finite even if ε is taken equal to zero. When γ = 0, it seems particularly natural to estimate high quantiles of these limiting distributions by Monte Carlo simulation using an approximation of W on a fine grid of [0, 1] (one should keep in mind that, although simulating such a process at discrete times always underestimates the true supremum, one can get arbitrarily close with probability as close to 1 as one wishes by taking a sufficiently fine grid). Note that, in practice, Horváth et al. (2004) and Gösmann, Kley and Dette (2021) use ε = 0 even when γ > 0.
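As a minimal illustration of this grid-based Monte Carlo approach (the simulation sizes below are our choices), the following sketch estimates the 95% quantile of sup_{0≤s≤1} |W(s) − sW(1)|, the supremum of the absolute value of a Brownian bridge, whose exact value is known from the Kolmogorov distribution to be approximately 1.358.

```python
import math
import random

def sup_abs_brownian_bridge(n_grid: int) -> float:
    """Supremum over a grid of |W(s) - s W(1)| for s in [0, 1],
    with W simulated from independent Gaussian increments."""
    step = 1.0 / n_grid
    w = [0.0]
    for _ in range(n_grid):
        w.append(w[-1] + random.gauss(0.0, math.sqrt(step)))
    w1 = w[-1]
    return max(abs(w[i] - (i * step) * w1) for i in range(n_grid + 1))

def mc_quantile(level: float, n_rep: int = 2000, n_grid: int = 500) -> float:
    """Simple empirical quantile of the simulated grid suprema."""
    sims = sorted(sup_abs_brownian_bridge(n_grid) for _ in range(n_rep))
    return sims[int(level * n_rep)]

random.seed(1)
q95 = mc_quantile(0.95)   # close to 1.358, slightly biased downwards
```

As noted above, discretization always underestimates the true supremum, so the grid estimate sits slightly below the exact quantile; refining the grid shrinks this bias.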
The following result, proven in Appendix C, suggests that the latter approach could be attempted for the first limiting distribution appearing in Theorem 3.3 related to the procedure based on R m in (2.4).
Proposition 4.1. For any fixed η > 0, ε > 0 and γ ≥ 0, the random variable

sup_{1≤s≤t<∞} {w_R(t)}^{−1} |tW(s) − sW(t)| (4.1)

and the random variable

sup_{0<s≤t≤1} {w_R(1/s)}^{−1} (st)^{−1} |W(t) − W(s)| (4.2)

are equal in distribution, where W is a standard Brownian motion.
Preliminary numerical experiments revealed however that there do not seem to be any advantages to working with expressions of the form (4.2), as these seem to transpose the practical issues due to the presence of ∞ in (4.1) into practical issues near zero. For that reason, we opted for a heuristic estimation approach, which we detail in the rest of this section in the case of the procedure based on R_m in (2.4) (and which we used mutatis mutandis for the procedures based on S_m in (2.9) and T_m in (2.10) as well).
Fix η > 0, γ ≥ 0 and α ∈ (0, 1/2), and let q_{R,1−α} denote the (1 − α)-quantile of sup_{1≤s≤t<∞} {w_R(t)}^{−1} |tW(s) − sW(t)|. To estimate q_{R,1−α}, we propose to take m relatively large and simulate a large number N of sample paths of {{w_R(k/m)}^{−1} R_m(k)}_{m+1≤k≤m+2^p} for p large from sequences of independent standard normal random variables. At this point, it may be tempting to use the (1 − α)-empirical quantile of sup_{m+1≤k≤m+2^p} {w_R(k/m)}^{−1} R_m(k) as an estimate of q_{R,1−α}. Although the latter is expected to be a good estimate of the (1 − α)-quantile of sup_{1≤s≤t≤T} {w_R(t)}^{−1} |tW(s) − sW(t)| for T = 1 + 2^p/m, it may underestimate q_{R,1−α} since the supremum for 1 ≤ s ≤ t < ∞ obviously dominates the supremum for 1 ≤ s ≤ t ≤ T for any T > 1. The size of the resulting underestimation actually depends on the value of η. Indeed, recall from our discussion in Section 2 that, because of the factor (k/m)^{−η}, increasing η decreases the variance of {w_R(k/m)}^{−1} R_m(k) as k increases. This has two consequences as far as the above-mentioned empirical estimation of q_{R,1−α} is concerned: (i) for any fixed η > 0, since the variability of {w_R(k/m)}^{−1} R_m(k) decreases as k increases, if we could set p large enough, we would be able to obtain a good estimate of q_{R,1−α}; unfortunately, our margin of action in that respect is limited by computational resources; (ii) for any fixed p, as η decreases to 0, it is less and less likely that the distribution of sup_{m+1≤k≤m+2^p} {w_R(k/m)}^{−1} R_m(k) is a good approximation of the distribution of sup_{k≥m+1} {w_R(k/m)}^{−1} R_m(k). To try to empirically solve this quantile estimation problem, we propose, for any fixed η > 0, to model the relationship between p and the (1 − α)-empirical quantile of sup_{m+1≤k≤m+2^p} {w_R(k/m)}^{−1} R_m(k), and to use the fitted model to extrapolate to p = ∞. Specifically, we computed such empirical quantiles for p ∈ {10, ..., 18}; let q^(p)_{R,1−α}, p ∈ {10, ..., 18}, denote the resulting estimates. Then, we fitted a so-called asymptotic regression model, of the form x ↦ c + (d − c){1 − exp(−x/e)} with upper asymptote d, to the points (p, q^(p)_{R,1−α}), p ∈ {10, ..., 18} (Ritz et al., 2015).
A candidate estimate of q_{R,1−α}, the (1 − α)-quantile of sup_{1≤s≤t<∞} {w_R(t)}^{−1} |tW(s) − sW(t)|, is then the resulting estimate of the parameter d.
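To illustrate the extrapolation idea on synthetic values (the model form x ↦ c + (d − c){1 − exp(−x/e)} with upper asymptote d, and all numbers below, are our illustrative choices), one can exploit the fact that, for a fixed rate parameter e, the model is linear in (c, d), and fit it by profiling e over a grid.

```python
import math

def fit_asymptotic_regression(xs, ys, e_grid):
    """Least-squares fit of y = c + (d - c) * (1 - exp(-x/e)).
    For fixed e the model is linear in (c, d): y = c*u + d*v with
    u = exp(-x/e) and v = 1 - u, so we profile e over a grid and
    solve the 2x2 normal equations for (c, d) at each e."""
    best = None
    for e in e_grid:
        u = [math.exp(-x / e) for x in xs]
        v = [1.0 - ui for ui in u]
        suu = sum(ui * ui for ui in u)
        svv = sum(vi * vi for vi in v)
        suv = sum(ui * vi for ui, vi in zip(u, v))
        suy = sum(ui * yi for ui, yi in zip(u, ys))
        svy = sum(vi * yi for vi, yi in zip(v, ys))
        det = suu * svv - suv * suv
        if abs(det) < 1e-12:
            continue
        c = (suy * svv - svy * suv) / det
        d = (svy * suu - suy * suv) / det
        rss = sum((yi - (c * ui + d * vi)) ** 2
                  for ui, vi, yi in zip(u, v, ys))
        if best is None or rss < best[0]:
            best = (rss, c, d, e)
    return best  # (rss, c, d, e); d estimates the plateau

# Synthetic quantile estimates following the model exactly, plateau d = 2.0.
ps = list(range(10, 19))
qs = [1.5 + (2.0 - 1.5) * (1 - math.exp(-p / 3.0)) for p in ps]
rss, c_hat, d_hat, e_hat = fit_asymptotic_regression(
    ps, qs, [k / 2 for k in range(1, 21)])
```

Since the synthetic points follow the model exactly and the grid contains the true rate e = 3, the fitted plateau d_hat recovers the target value 2.0; with noisy empirical quantiles, d_hat plays the role of the extrapolated candidate estimate of q_{R,1−α}.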
The first row of graphs in Figure 4.1 shows the scatter plots of the points (p, q^(p)_{R,1−α}), p ∈ {10, ..., 18}, for several values of η, along with the fitted asymptotic regression curves. As one can see from the first two graphs in the first row of Figure 4.1, the scatter plots for η = 0.1 and η = 0.05 reveal the presence of a plateau for larger values of p. The latter is an empirical indication of the fact that the distribution of the random variable sup_{m+1≤k≤m+2^p} {w_R(k/m)}^{−1} R_m(k), say for p = 18, seems to be a good approximation of the distribution of the random variable sup_{m+1≤k<∞} {w_R(k/m)}^{−1} R_m(k) and thus of sup_{1≤s≤t<∞} {w_R(t)}^{−1} |tW(s) − sW(t)|. Since the function η ↦ (k/m)^{−η} is decreasing for any fixed k ≥ m + 1, the estimated quantiles q^(p)_{R,1−α}, p ∈ {10, ..., 18}, and their corresponding estimated asymptotes (which are candidate estimates of q_{R,1−α}) increase as η decreases in the first row of plots in Figure 4.1. As one can see, the rate of increase of the estimated quantiles decreases as η decreases. For instance, there are hardly any differences between the plot for η = 0.001 and the plot for η = 0.0001. Unsurprisingly, this last plot is actually hardly distinguishable from the plot for η = 0. The latter empirical observation implies that, in practice, the proposed estimation technique will provide a finite estimate of q_{R,1−α} even when, according to Proposition 3.4, q_{R,1−α} is known to be infinite. It follows that it is not meaningful in practice to consider values of η that are "very small". Based on the previous approximate variance reduction calculations and the plots given in Figure 4.1, our intuitive recommendation is not to take η smaller than 0.001. Notice that, with m = 100, this value of η induces an approximate standard deviation reduction under H_0 for {w_R(k/m)}^{−1} R_m(k) after 10^10 monitoring steps of less than 2%.
The estimated quantiles for η = 0.001, α ∈ {0.01, 0.05, 0.1}, and γ = 0, as well as for the largest meaningful value of γ (among those that were considered) for the three monitoring schemes, are reported in Tables 4.1 and 4.2 along with standard errors. Estimated quantiles for larger values of η are available in the R package npcp (Kojadinovic, 2020).
Remark 4.2. It is a research project of its own to investigate more thoroughly the estimation of the quantiles both empirically and theoretically. For instance, one may wish to investigate bounds on the probability for a given η > 0 and T > 1 that sup 1≤s≤t≤T {w R (t)} −1 |tW (s) − sW (t)| occurs at some (s, t) with t > T , and whether such occurrences tend to be associated with small suprema or large suprema.

Monte Carlo experiments
To investigate the finite-sample properties of the studied open-end sequential change-point detection procedures, we carried out extensive Monte Carlo experiments. One should however keep in mind that numerical experiments cannot provide a full insight into the finite-sample behavior of open-end approaches as finite computing resources impose that monitoring has to be stopped eventually.
To estimate the long-run variance σ² related to the learning sample, we used the approach of Andrews (1991) based on the quadratic spectral kernel with automatic bandwidth selection, as implemented in the function lrvar() of the R package sandwich (Zeileis, 2004).

We first considered 10 data generating models, denoted M1, …, M10. Models M1, …, M5 are simple AR(1) models with normal innovations whose autoregressive parameter is equal to 0, 0.1, 0.3, 0.5 and 0.7, respectively. These models were chosen, among other reasons, to allow a comparison of our results with those of Gösmann, Kley and Dette (2021, Section 4.1). Model M6 generates independent observations from the Student t distribution with 5 degrees of freedom. Model M7 is a GARCH(1,1) model with normal innovations and parameters ω = 0.012, β = 0.919 and α = 0.072, chosen to mimic S&P 500 daily log-returns following Jondeau, Poon and Rockinger (2007). Models M8 and M9 are, respectively, the nonlinear autoregressive model used in Paparoditis and Politis (2001, Section 3.3) and the exponential autoregressive model considered in Auestad and Tjøstheim (1990) and Paparoditis and Politis (2001, Section 3.3); in both models, the innovations ε_i are independent standard normal. Note that, for all time series models, a burn-in sample of 100 observations was used. Finally, in order to mimic count data, model M10 generates independent observations from a Poisson distribution with parameter 3.

In a first series of experiments, we attempted to assess how well the studied tests hold their level. To do so, we generated 5000 samples from models M1-M10 and used the first m observations of each sample as learning sample. All the sequential tests were carried out at the 5% nominal level using the estimated quantiles available in the R package npcp (see also Tables 4.1 and 4.2).
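As an illustration of this simulation setup, the following sketch generates an AR(1) path with a burn-in period (as in models M1-M5) and estimates the long-run variance with a simple Bartlett-kernel estimator. The Bartlett kernel and the rule-of-thumb bandwidth are stand-ins chosen for brevity; the paper's experiments use the quadratic spectral kernel with automatic bandwidth selection of Andrews (1991) via lrvar() in R.

```python
import numpy as np

def ar1_sample(n, phi, rng, burn_in=100):
    """Generate an AR(1) path X_t = phi * X_{t-1} + eps_t with standard
    normal innovations, discarding a burn-in period."""
    x = np.empty(n + burn_in)
    eps = rng.standard_normal(n + burn_in)
    x[0] = eps[0]
    for t in range(1, n + burn_in):
        x[t] = phi * x[t - 1] + eps[t]
    return x[burn_in:]

def long_run_variance(x, bandwidth=None):
    """Bartlett-kernel estimator of the long-run variance sigma^2
    (a simple stand-in for the quadratic spectral estimator used in the paper)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if bandwidth is None:
        # Rule-of-thumb truncation lag, an assumption for this sketch
        bandwidth = int(np.floor(4 * (n / 100.0) ** (2.0 / 9.0)))
    xc = x - x.mean()
    lrv = np.dot(xc, xc) / n                     # lag-0 autocovariance
    for lag in range(1, bandwidth + 1):
        w = 1.0 - lag / (bandwidth + 1.0)        # Bartlett weight
        lrv += 2.0 * w * np.dot(xc[:-lag], xc[lag:]) / n
    return lrv

rng = np.random.default_rng(42)
x = ar1_sample(100000, 0.5, rng)
# For an AR(1) with unit innovation variance, sigma^2 = 1 / (1 - phi)^2 = 4
print(round(long_run_variance(x), 2))
```

For strongly dependent models such as M5, the truncated kernel estimate is biased downward unless m is large, which is consistent with the level distortions discussed below.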
Because monitoring cannot go on indefinitely, we stopped the sequential testing procedures after the arrival of observation n = m + 10000. The percentages of rejection of H_0 in (1.1) for the procedures based on R_m in (2.4), S_m in (2.9) and T_m in (2.10) with γ = 0 and η ∈ {0.05, 0.01, 0.005, 0.001}, as well as for the procedures based on E_m in (2.5) and Q_m in (2.6) with γ = 0, are reported in Tables 5.1 and 5.2 (to carry out the procedures based on Q_m and E_m, we used the quantiles reported in Table 1 of Horváth et al. (2004) and Gösmann, Kley and Dette (2021), respectively). Given the closed-end nature of the experiments, one should keep in mind that the empirical levels would have been higher had larger values of n been considered. As one can see from Tables 5.1 and 5.2, for most models, the empirical levels drop below the 5% threshold rather quickly as m increases. When the tests are too liberal, this is probably mostly a consequence of the difficulty of estimating the long-run variance σ². It is for instance unsurprising that a large value of m is necessary to obtain a reasonably accurate estimate of σ² for model M5 (the AR(1) model with the strongest serial dependence) or model M9 (an exponential autoregressive model). Notice that, overall, the tests based on E_m and Q_m are less liberal than the tests based on R_m, S_m and T_m when the serial dependence is strong and m is small. The opposite tends to occur as m increases. Among the three proposed procedures, the one based on S_m is the most conservative, followed by the procedure based on T_m. The fact that the empirical levels are higher for large values of η is due to the factor (k/m)^{−η}, which favors the occurrence of false alarms (threshold exceedances) at the beginning of the monitoring.
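The general monitoring recipe used in these experiments — compute a CUSUM-type detector after each new observation and stop as soon as it exceeds a threshold function times an estimated quantile — can be sketched as follows. The detector below has the retrospective CUSUM form appearing in the change-point estimator of the data example (Section 7); the threshold function and quantile are hypothetical placeholders chosen purely for illustration, not the paper's actual w_R, w_S, w_T or estimated quantiles.

```python
import numpy as np

def cusum_detector(x, m, k):
    """Retrospective CUSUM-type detector computed from the first k observations:
    max over j in {m,...,k-1} of j(k-j) m^{-3/2} |mean(X_1..j) - mean(X_{j+1}..k)|."""
    cs = np.cumsum(x[:k])
    j = np.arange(m, k)                       # candidate split points
    mean_left = cs[j - 1] / j                 # mean of X_1, ..., X_j
    mean_right = (cs[k - 1] - cs[j - 1]) / (k - j)
    return np.max(j * (k - j) / m ** 1.5 * np.abs(mean_left - mean_right))

def monitor(x, m, threshold_fn, quantile, sigma):
    """Open-end monitoring loop: after each new observation X_k, reject H_0
    and stop as soon as the normalized detector exceeds the quantile."""
    for k in range(m + 1, len(x) + 1):
        d = cusum_detector(x, m, k) / (sigma * threshold_fn(k / m))
        if d > quantile:
            return k                          # alarm time
    return None                               # no alarm issued

rng = np.random.default_rng(1)
m = 100
x = rng.standard_normal(m + 500)
x[m + 200:] += 2.0                            # mean shift after 200 monitoring steps
# Hypothetical threshold function and quantile, for illustration only
alarm = monitor(x, m, threshold_fn=lambda t: t ** 1.5, quantile=3.0, sigma=1.0)
print(alarm)
```

In practice σ would be replaced by the long-run variance estimate computed from the learning sample, and the quantile by the estimates tabulated in the npcp package.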
In a second series of experiments, we studied the finite-sample behavior of the tests under H_1 in (2.1) using simulation scenarios similar to those considered in Gösmann, Kley and Dette (2021, Section 4.1). We generated 1000 samples of size n = m + 7000 from models M1 with m = 100 and M4 with m = 200 and, for each sample, added a positive offset δ to all observations after position k ∈ {m, m + 500, m + 1000, m + 5000}. We started by investigating the influence of η on the most conservative procedure according to Tables 5.1 and 5.2. The rejection percentages for the test based on S_m with γ = 0 and η ∈ {0.1, 0.01, 0.001}, as well as for the procedures based on E_m and Q_m with γ = 0, are represented in Figure 5.1. Note that only threshold exceedances after position k were taken into account in the computation of rejection percentages; those occurring before (false alarms) were ignored. As one can see, when the change occurs right at the beginning of the monitoring, the larger η, the more powerful the procedure based on S_m. As k increases, the influence of the factor (k/m)^{−η} comes into effect and the opposite tends to occur (although the difference in power does not seem of practical importance for the monitoring periods under consideration). As far as the procedures based on E_m and Q_m are concerned, they are more powerful than the procedure based on S_m when the change occurs at the beginning of the monitoring but, as expected, become less powerful as k increases. The fourth column of plots in Figure 5.1 shows that the difference in terms of power can be substantial. One should however keep in mind that, because of the factor (k/m)^{−η}, the test based on E_m, for instance, might again become more powerful than the procedure based on S_m for extremely large k.
Nevertheless, even if we consider the least favorable setting (η = 0.1), we believe that such a scenario is extremely unlikely to occur in practice given the (relatively) slow decrease of the function t ↦ t^{−η} and the fact that the weighting used in the definition of E_m penalizes, in some sense, late changes, as explained in the discussion below (2.5). From the second row of plots of Figure 5.1, we see that the previous conclusions remain qualitatively the same when model M4 is used although, unsurprisingly, the stronger serial dependence makes the procedures behave as if the values of k were smaller. Figure 5.2 reports the rejection percentages of H_0 in (1.1) for the same experiments but for the procedures based on R_m, S_m and T_m with γ = 0 and η = 0.005, as well as for the procedures based on E_m and Q_m with γ = 0. As one can see, among the three studied procedures, those based on R_m and T_m seem more powerful than the one based on S_m when the change occurs at the beginning of the monitoring (which is in accordance with the fact that the procedure based on S_m is the most conservative, as can be seen from Tables 5.1 and 5.2), while there seems to be very little difference between the three tests when the change occurs later.
The increase in power resulting from taking γ equal to its largest meaningful value is illustrated in Figure 5.3. As expected, the improvement is visible only when changes occur at the beginning of the monitoring. In practice, we suggest increasing the value of γ only when it is believed that the size m of the learning sample permits a reasonably accurate estimation of the long-run variance σ².
As a second-to-last experiment, we considered a longer monitoring period with late changes. Table 5.3 reports the corresponding rejection percentages for samples of size n = 20000 generated from model M1 with m = 100 such that, for each sample, a positive offset of δ = 0.1 was added to all observations after position k = 15000. These results highlight again the role of η and its influence on power through the factor (k/m)^{−η}. Notice that the corresponding rejection percentages of the procedures based on E_m and Q_m are less than 1%.

Finally, we report the results of a similar experiment illustrating the influence of a change in the serial dependence of the data generating mechanism on the power of the tests. Specifically, we considered two additional models: for model M11 (resp. M12), observations consist of a realization of a sequence of m independent standard normals followed by observations generated from an AR(1) model with standard normal innovations and autoregressive parameter evolving linearly from 0 to 0.9 (resp. 0 to −0.9) over the n − m remaining observations. Table 5.4 provides the percentages of rejection of H_0 in (1.1) for the procedures based on R_m, S_m and T_m with γ = 0 and η = 0.001, as well as for the procedures based on E_m and Q_m with γ = 0, estimated from 2000 samples of size n = 20000 from model M1, M11 or M12 with m = 100 such that, for each sample, a positive offset of δ ∈ {0, 0.05, 0.1} was added to all observations after position k = 15000. As one can see from the first horizontal block of the table (corresponding to δ = 0), the tests based on R_m, S_m and T_m reject H_0 in (1.1) very frequently when data are generated from model M11, although the underlying (non-stationary) time series has mean zero. The latter is actually not surprising. On the one hand, the variance estimate σ_m involved in the procedures (see the practical description of the tests below Corollary 3.6) was computed from the initial i.i.d. learning sample of size m and should therefore not be too different from 1.
On the other hand, the progressive increase of the autoregressive parameter in model M11 entails an increase of the variance of the observations as time elapses and makes it more and more probable that observations that are large in absolute value are followed by observations that are large in absolute value and of the same sign. The two previous facts lead to an increase, as time elapses, of the probability that the detectors will exceed their corresponding thresholds. The high rejection percentages observed when data are generated from model M11 confirm that the null hypothesis of the studied procedures is not that the underlying time series has mean zero but that it is stationary, as stated in (1.1). The closeness of the rejection percentages to the 5% nominal level when data are generated from model M12 is a consequence of the fact that, as time elapses, observations that are large in absolute value tend to be followed by observations that are large in absolute value but of the opposite sign. As already explained when describing the first series of experiments, the fact that the rejection percentages are slightly above the 5% nominal level when data are generated from model M1 is due to the relatively small learning sample (m = 100); we expect the empirical levels to improve for larger values of m, as in Tables 5.1 and 5.2. The last two horizontal blocks of Table 5.4 (corresponding to δ ∈ {0.05, 0.1}) confirm that, even in the presence of a change in the serial dependence, the studied procedures have some power when a change in mean occurs.
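Data from models M11 and M12 can be generated along the following lines. This is a minimal sketch under our own naming conventions; the variance inflation it exhibits for M11 is the mechanism invoked above to explain the frequent rejections under δ = 0.

```python
import numpy as np

def evolving_ar1(n, m, phi_end, rng):
    """Models M11/M12: m i.i.d. standard normals followed by an AR(1) process
    with standard normal innovations whose coefficient moves linearly
    from 0 to phi_end over the remaining n - m observations."""
    x = np.empty(n)
    x[:m] = rng.standard_normal(m)
    eps = rng.standard_normal(n)
    phis = np.linspace(0.0, phi_end, n - m)
    for t in range(m, n):
        x[t] = phis[t - m] * x[t - 1] + eps[t]
    return x

rng = np.random.default_rng(0)
x11 = evolving_ar1(20000, 100, 0.9, rng)    # M11: variance inflates over time
x12 = evolving_ar1(20000, 100, -0.9, rng)   # M12: negative serial dependence
# Late-sample variance of M11 grows toward 1 / (1 - 0.9^2), about 5.3,
# while the learning sample has variance close to 1
print(round(np.var(x11[:100]), 2), round(np.var(x11[-2000:]), 2))
```

The same inflation occurs for M12, but the alternating signs of consecutive observations keep the CUSUM partial sums small, which is consistent with the near-nominal rejection rates reported for that model.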
Taking into account all the empirical results summarized in this section, we recommend setting η to 0.005 or 0.001 and using either the procedure based on T_m or the procedure based on S_m, as they are more conservative than the monitoring scheme based on R_m while almost as powerful (except when changes occur at the beginning of the monitoring). If it is believed that the size m of the learning sample permits a reasonably accurate estimation of the long-run variance σ², one can additionally set γ to 0.45 for T_m and 0.85 for S_m to improve the behavior of the procedures at the beginning of the monitoring.

Extensions to parameters whose estimators exhibit a mean-like behavior
In their seminal work, Gösmann, Kley and Dette (2021) actually considered monitoring schemes sensitive to changes in time series parameters whose estimators exhibit an asymptotic mean-like behavior. The aim of this section is to briefly demonstrate that the same type of extension is possible for the sequential procedures studied in this work. To keep the notation simple, we restrict our discussion to univariate time series and univariate parameters.

Let F be a univariate distribution function (d.f.) and let θ = θ(F) be a univariate parameter of F (such as the expectation, the variance, etc.). Let F_{j:k} be the empirical d.f. computed from the stretch X_j, …, X_k of available observations. More formally, for any integers j, k ≥ 1 and x ∈ R, let F_{j:k}(x) = (k − j + 1)^{−1} Σ_{i=j}^{k} 1(X_i ≤ x), and let θ_{j:k} = θ(F_{j:k}) be the corresponding plug-in estimator of θ computed from the stretch X_j, …, X_k. Natural extensions of the detectors R_m, S_m and T_m defined in (2.4), (2.9) and (2.10), respectively, for monitoring changes in the parameter θ are then given, for any k ≥ m + 1, by the detectors R^θ_m in (6.1), S^θ_m in (6.2) and T^θ_m in (6.3). Furthermore, assuming it exists, let IF(x, F, θ) = lim_{t↓0} {θ((1 − t)F + tδ_x) − θ(F)}/t, x ∈ R, denote the influence function of θ, where δ_x is the d.f. of the Dirac measure at x. To be able to study the asymptotic validity of monitoring schemes based on R^θ_m in (6.1), S^θ_m in (6.2) and T^θ_m in (6.3), we follow Gösmann, Kley and Dette (2021) and focus on parameters θ that admit an asymptotic linearization in terms of the influence function, that is, such that θ_{j:k} − θ(F) = (k − j + 1)^{−1} Σ_{i=j}^{k} IF(X_i, F, θ) + R_{j,k}, (6.4) where the remainders R_{j,k} are asymptotically negligible in the sense of the following condition.
Condition 6.1. The remainders in (6.4) satisfy the negligibility requirement stated in (6.5), where the arrow '→_{a.s.}' denotes almost sure convergence.
In the rest of this section, we assume that H_0 in (1.1) holds. Moreover, for any integers j, k ≥ 1, let ĪF_{j:k} = (k − j + 1)^{−1} Σ_{i=j}^{k} IF(X_i, F, θ). If the random variables IF(X_1, F, θ), …, IF(X_m, F, θ), IF(X_{m+1}, F, θ), … were observable, one could naturally consider analogues of the detectors R_m, S_m and T_m in (2.4), (2.9) and (2.10), respectively, defined, for k ≥ m + 1, by replacing sample means with the corresponding averages of influence functions; denote the resulting detectors by R^IF_m, S^IF_m and T^IF_m. Upon assuming that Condition 3.1 holds for the sequence (IF(X_i, F, θ))_{i∈Z}, one immediately obtains an analogue of Theorem 3.3 for the detectors R^IF_m, S^IF_m and T^IF_m. The next result, proven in Appendix C, shows that, if Condition 6.1 is additionally assumed, an analogue of Theorem 3.3 also holds for the computable detectors R^θ_m in (6.1), S^θ_m in (6.2) and T^θ_m in (6.3).

Proposition 6.2. Under H_0 in (1.1) and Condition 6.1, for any η > 0, ε > 0 and γ ≥ 0, the analogue of Theorem 3.3 holds for R^θ_m, S^θ_m and T^θ_m, where the threshold functions w_R, w_S and w_T are defined in (2.7), (2.11) and (2.12), respectively.

Remark 6.3. As mentioned in Gösmann, Kley and Dette (2021), the verification of Condition 6.1 is highly non-trivial. When θ is the variance or a quantile of F, it was shown to hold in probability in Section 4 of Dette and Gösmann (2019). In the multivariate parameter and time series case, it was verified in Gösmann, Kley and Dette (2021, Section 3.2) for a time-dependent linear model. Relatedly, note that an inspection of the proof of Proposition 6.2 reveals that Condition 6.1 could actually be replaced by the weaker requirement that the remainders in (6.4) satisfy the corresponding negligibility condition in probability.
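For a concrete instance of the linearization in (6.4), take θ to be the variance, whose influence function is IF(x, F, θ) = (x − μ)² − σ². The following sketch (our own illustration, with F standard normal) checks numerically that the remainder is negligible; with μ = 0 known, the remainder here equals −X̄² exactly, which is O_P(1/n).

```python
import numpy as np

def theta_var(x):
    """Plug-in variance: theta(F_{j:k}) with theta = Var, computed from a stretch."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 2)

def influence_var(x, mu, sigma2):
    """Influence function of the variance at F: IF(x, F, Var) = (x - mu)^2 - sigma^2."""
    return (x - mu) ** 2 - sigma2

# Numerical check of theta_{1:n} - theta(F) ≈ average of IF(X_i, F, theta)
rng = np.random.default_rng(7)
mu, sigma2 = 0.0, 1.0               # the true F is standard normal
x = rng.standard_normal(5000)
lhs = theta_var(x) - sigma2          # left-hand side of (6.4)
rhs = np.mean(influence_var(x, mu, sigma2))
print(abs(lhs - rhs))                # the remainder, equal to -xbar^2 here
```

Replacing differences of sample means in the detectors by the corresponding differences of plug-in estimators θ_{1:j} − θ_{j+1:k} then inherits, up to the remainders, the mean-like asymptotic behavior exploited in Proposition 6.2.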

Data example
As a small data example, we consider a fictitious scenario consisting of monitoring global temperature anomalies for changes in the mean. Specifically, we use the time series of monthly global (land and ocean) temperature anomalies available at http://www.climate.gov/, which covers the period from January 1880 to May 2020. The time series, in degrees Celsius, is represented in the left panel of Figure 7.1. The solid vertical line marks the beginning of the fictitious monitoring and corresponds to September 1921 (and thus to a learning sample of size m = 500). Note that this monitoring scenario is indeed fully fictitious, among other things because temperature anomalies are computed with respect to the 20th century average (see, e.g., Smith et al., 2008) and, therefore, the corresponding time series would not have been available until the beginning of the current century.
The solid curve in the right panel of Figure 7.1 displays the evolution of the normalized detector σ_m^{−1} {w_T(k/m)}^{−1} T_m(k) with η = 0.001 and γ = 0.45 against k ≥ m + 1. The solid (resp. dashed) horizontal line represents the estimated 0.95-quantile (resp. 0.99-quantile) of the corresponding limiting distribution in Theorem 3.3. The solid vertical line represents the date at which the normalized detector first exceeded the aforementioned 0.95-quantile and corresponds to November 1939. Note that, to estimate the point of change corresponding to an exceedance at position k, one could use ĵ + 1, where ĵ = argmax_{m ≤ j ≤ k−1} j(k − j) m^{−3/2} |X̄_{1:j} − X̄_{j+1:k}|. The date of change corresponding to an exceedance in November 1939 is thus estimated to be April 1925 and is marked by a dashed vertical line in the right panel of Figure 7.1.
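The change-point estimator just described can be implemented directly. The sketch below uses synthetic data and 0-based array indexing (so positions are shifted by one relative to the paper's notation); the sample sizes and the size of the shift are illustrative assumptions.

```python
import numpy as np

def estimate_change_point(x, m, k):
    """Estimate the change point after an exceedance at step k:
    argmax over j in {m,...,k-1} of j(k-j) m^{-3/2} |mean(X_1..j) - mean(X_{j+1}..k)|,
    plus one (positions expressed in the paper's 1-based convention)."""
    cs = np.cumsum(x[:k])
    j = np.arange(m, k)                          # candidate split points
    mean_left = cs[j - 1] / j
    mean_right = (cs[k - 1] - cs[j - 1]) / (k - j)
    stat = j * (k - j) / m ** 1.5 * np.abs(mean_left - mean_right)
    return j[np.argmax(stat)] + 1

rng = np.random.default_rng(3)
m, true_cp, k = 500, 700, 900
x = rng.standard_normal(k)
x[true_cp:] += 2.0        # the mean shifts after observation number true_cp
print(estimate_change_point(x, m, k))
```

As in the temperature example, the estimated change point typically lies well before the alarm time, since evidence must accumulate before the detector crosses its threshold.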

Concluding remarks
This work has demonstrated that it is relevant to define open-end sequential change-point tests such that the underlying detectors coincide with (or are related to) the retrospective CUSUM statistic at each monitoring step. From a practical perspective, when focusing on changes in the mean, such an approach was observed to lead to an increase in power relative to existing competitors, except when changes occur at the very beginning of the monitoring. Given the potentially very long-term nature of open-end monitoring, it can be argued that having extra power for all but very early changes is strongly desirable. The price to pay for the additional power is a more complicated theoretical setting and the fact that the quantiles of the underlying limiting distributions required to carry out the sequential tests in practice are harder to estimate. As far as monitoring for changes in parameters other than the mean is concerned, extensions are possible as long as the underlying estimators exhibit a mean-like asymptotic behavior, as considered in Gösmann, Kley and Dette (2021).
where w R , w S and w T are defined in (2.7), (2.11) and (2.12), respectively, and R m , S m and T m are defined in (2.4), (2.9) and (2.10), respectively.
Proof. Let us first show (A.4). From (2.4) and (A.1), using the reverse triangle inequality for the maximum norm, we have that, for any k ≥ m + 1, using the fact that t ↦ 1/w_γ(t) is bounded by ε^{−1}. From (2.3) and under Condition 3.1, we obtain that, for any k ≥ m + 1 and j ∈ {m, …, k − 1}, Hence, by the triangle inequality, we have that where To prove (A.4), it remains to show that I_m, I′_m, J_m and J′_m converge to zero in probability. For any ξ ∈ (0, 1/2), which, under Condition 3.1, is smaller than a term bounded in probability times m^{ξ−1/2}, which converges to zero because ξ < 1/2. Similarly, Using again Condition 3.1, the latter converges to zero in probability since The fact that J_m in (A.9) and J′_m in (A.10) converge to zero in probability can be checked by proceeding similarly. Hence, we have shown (A.4). It remains to prove (A.5) and (A.6). From (2.9) and (A.2), using the reverse triangle inequality for the L¹ norm, it can be verified that, for any k ≥ m + 1, where U_m(k) is defined in (A.7). Similarly, from (2.10) and (A.3), using the reverse triangle inequality for the Euclidean norm, The fact that (A.5) and (A.6) hold then immediately follows from the two previous displays, (A.8) and the fact that I_m, I′_m, J_m and J′_m converge to zero in probability.
Lemma A.2. For any fixed η > 0, the random function R_η defined by (A.11), where W_1 and W_2 are independent standard Brownian motions, is almost surely bounded and uniformly continuous.
Proof. Fix η > 0, ε > 0 and γ ≥ 0, and let W_1 and W_2 be independent standard Brownian motions. Then, from (A.1), for any m ∈ N, the random variable Next, notice that, for any k ≥ m + 1 and any j ∈ {m, …, k − 1}, there exist 1 ≤ s ≤ t such that k = ⌊mt⌋ and j = ⌊ms⌋. Hence, the previous expression can be rewritten as By Brownian scaling, the latter is equal in distribution to σ sup which, using (2.7), (2.8) and the function R_η defined in (A.11), can be expressed as σ sup Using additionally the fact that, for any functions f, g, since sup_{t∈[1,∞)} |⌊mt⌋/m − t| ≤ 1/m. From Lemma A.2, we know that R_η is almost surely bounded and uniformly continuous on {(s, t) ∈ [1, ∞)² : s ≤ t}. Furthermore, the function t ↦ 1/w_γ(t) being bounded, continuous and converging to zero as t → ∞, it is also uniformly continuous on [1, ∞). The latter facts imply that the function (s, t) ↦ {w_γ(t)}^{−1} R_η(s, t) is almost surely bounded and uniformly continuous on {(s, t) ∈ [1, ∞)² : s ≤ t} and, therefore, that (A.21) converges almost surely to zero and, finally, that (A.17) holds with the limit being almost surely finite.
Let us now show (A.18). First, notice that the limiting random variable is almost surely finite as an immediate consequence of the inequality In the above, we have used the fact that the sign of the increment and its magnitude are independent of the past and each other, and that the increment is Gaussian with mean 0 and variance 2 k−1 .
Proof of Theorem 3.12. We adapt the proof of Proposition 2.7 of Kojadinovic and Verdier (2021). Notice first that