Optimal nonparametric change point analysis

We study change point detection and localization for univariate data in fully nonparametric settings, in which at each time point we acquire an independent and identically distributed sample from an unknown distribution that is piecewise constant over time. The magnitude of the distributional changes at the change points is quantified using the Kolmogorov–Smirnov distance. Our framework allows all the relevant parameters, namely the minimal spacing between two consecutive change points, the minimal magnitude of the changes in the Kolmogorov–Smirnov distance, and the number of sample points collected at each time point, to change with the length of the time series. We propose a novel change point detection algorithm based on the Kolmogorov–Smirnov statistic and show that it is nearly minimax rate optimal. Our theory demonstrates a phase transition in the space of model parameters, separating parameter combinations for which consistent localization is possible from those for which this task is statistically infeasible. We provide extensive numerical experiments to support our theory.


Introduction
Change point analysis is a well-established topic in statistics that aims to detect and localize abrupt changes in the data generating distributions of time series data. Initiated during World War II (e.g. Wald, 1945), the field of change point analysis has since produced a large body of literature as well as a host of methods for statistical inference. These techniques are now widely used to address important real-life problems in a wide range of disciplines, including biology (e.g. Fan et al., 2015; Jewell et al., 2018), speech recognition (e.g. Fox et al., 2011; Chowdhury, Selouani and O'Shaughnessy, 2012), social sciences (e.g. Liu et al., 2013), climatology (e.g. Itoh and Kurths, 2010) and finance (e.g. Preuss, Puchstein and Dette, 2015; Russell and Rambaccussing, 2018).
The theoretical understanding of the statistical challenges associated with change point problems has also progressed considerably. The initial groundbreaking results of Yao (1988), Yao and Au (1989) and Yao and Davis (1986), which studied change point detection for a univariate piecewise-constant signal, have since been extended in several ways. For instance, Fryzlewicz (2014), Frick, Munk and Sieling (2014) and Kovács et al. (2020), among others, proposed computationally efficient methods to deal with potentially multiple mean change points (e.g. Wang, Yu and Rinaldo, 2020). More recently, Pein, Sieling and Munk (2017) constructed a method that can handle mean and variance changes simultaneously. Chan, Yau and Zhang (2014), Cho (2016), Cho and Fryzlewicz (2015) and Wang and Samworth (2018) studied high-dimensional mean change point detection problems. A different line of work, including efforts by Aue et al. (2009), Avanesov and Buzun (2016) and Wang, Yu and Rinaldo (2021), has investigated scenarios where covariance matrices change. Cribben and Yu (2017), Liu et al. (2018) and Wang, Yu and Rinaldo (2018), among others, examined dynamic network change point detection problems.
Most of the existing theoretical frameworks for statistical analysis of change point problems, however, rely heavily on strong modelling assumptions of a parametric nature. Such approaches may be inadequate to capture the inherent complexity of modern, high-dimensional datasets. Indeed, the statistical literature on nonparametric change point analysis is surprisingly limited compared to its parametric counterpart. Among the nonparametric results, Carlstein (1988) considered the scenario where there is at most one change point. Harchaoui and Cappé (2007) utilized a kernel-based algorithm. Hawkins and Deng (2010) proposed a Mann-Whitney-type statistic to conduct online change point detection. Matteson and James (2014) established the consistency of change point estimators based on statistics originally introduced in Rizzo and Székely (2010). Zou et al. (2014) developed a nonparametric multiple change point detection method with consistency guarantees. Li et al. (2019) proposed two computationally efficient kernel-based statistics for change point detection, inspired by B-statistics. Padilla et al. (2018) proposed an algorithm for nonparametric change point detection based on the Kolmogorov-Smirnov statistic. Fearnhead and Rigaill (2018) devised a mean change point detection method which is robust to outliers. Vanegas, Behr and Munk (2019) constructed a multiscale method for detecting changes in a fixed quantile of the distributions. Arlot, Celisse and Harchaoui (2019) considered a kernel version of the cumulative sum (CUSUM) statistic and an ℓ0-type optimization procedure which can be solved by dynamic programming. They also derived an oracle inequality for their estimator, though without guarantees on change point localization.
In this paper we advance both the theory and methodology of nonparametric change point analysis by presenting a computationally-efficient procedure for univariate change point localization. The resulting method is proven to be consistent and in fact nearly minimax rate optimal for estimating the change points. Our analysis builds upon various recent contributions in the literature on parametric and high-dimensional change point analysis but allows for a fully nonparametric change point model. To the best of our knowledge, Zou et al. (2014) is one of very few examples yielding a procedure for nonparametric change point detection with provable guarantees on localization. A detailed comparison between our results and the ones of Zou et al. (2014) will be given in Appendix A, which also includes detailed comparisons of our work with Pein, Sieling and Munk (2017), Vanegas, Behr and Munk (2019) and Garreau and Arlot (2018).

Problem formulation
In this section we describe the change point model that we will consider. Our notation and settings are fairly standard, with one crucial difference from most of the contributions in the field: the changes in the underlying distribution at the change points are not parametrically specified, but are instead quantified through a nonparametric measure of distance between distributions. This feature renders our methods and analysis applicable to a wide range of change point problems.
Furthermore, we set n_min = min_{t=1,...,T} n_t and n_max = max_{t=1,...,T} n_t, and assume that n_max ≍ n_min ≍ n for some n.
Throughout this paper, for any positive sequences a_T and b_T, we write a_T ≳ b_T if there exist T_0 ∈ N and an absolute constant C > 0 such that a_T ≥ C b_T for all T ≥ T_0; we write a_T ≍ b_T if both a_T ≳ b_T and b_T ≳ a_T.

We quantify the magnitude of the distributional changes between the distribution functions at two consecutive change points using the Kolmogorov-Smirnov distance, a natural and widely-used metric for univariate probability distributions. Though stronger than weak convergence, convergence in the Kolmogorov-Smirnov distance is weaker than many other forms of convergence, such as those induced by the total variation distance and, provided that the distributions admit bounded Lebesgue densities, by the L_1-Wasserstein distance. Nonetheless, reliance on the Kolmogorov-Smirnov distance offers significant advantages: (1) it allows for fully nonparametric change point settings and procedures that hold under minimal assumptions on the underlying distribution functions, and (2) it is amenable to statistical analysis. Although the Kolmogorov-Smirnov statistic is known to exhibit low power in goodness-of-fit testing problems (e.g. Duembgen and Wellner, 2014), this issue does not affect our results: as shown below in Lemma 3.2, our Kolmogorov-Smirnov-based change point localization procedure is nearly minimax rate optimal. Our numerical experiments further confirm that our procedure compares favourably with other nonparametric methods.
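As an illustration of the distance being used, the following sketch computes the Kolmogorov-Smirnov distance between the empirical distribution functions of two samples; the function names are ours and not part of the paper.

```python
def ecdf(sample, z):
    """Empirical cumulative distribution function of `sample`, evaluated at z."""
    return sum(1 for y in sample if y <= z) / len(sample)

def ks_distance(sample_a, sample_b):
    """Kolmogorov-Smirnov distance sup_z |F_a(z) - F_b(z)| between two
    empirical CDFs; the supremum is attained at one of the observed points."""
    grid = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, z) - ecdf(sample_b, z)) for z in grid)
```

For instance, `ks_distance([0, 0, 1, 1], [1, 1, 2, 2])` evaluates to 0.5, while two identical samples are at distance 0.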
In Condition 1.1, we allow for multiple observations n t to be collected at each time t. This generalizes the classical change point detection framework where n t = 1 for all t (e.g. Zou et al., 2014). This flexibility is inspired by the recent interest in anomaly detection problems where multiple observations can be measured at a fixed time (e.g. Reinhart, Athey and Biegalski, 2014;Padilla et al., 2018). Our results remain valid even if n t = 1 for all t.
The nonparametric change point model defined above in Condition 1.1 is specified by a few key parameters: the minimal spacing between two consecutive change points δ, the minimal jump size in terms of the Kolmogorov-Smirnov distance κ, and the number n t of data points acquired at each time t. We adopt a high-dimensional framework in which all these quantities are allowed to change as functions of the total number of time points T .
We consider the change point localization problem of constructing consistent estimators {η̂_k}_{k=1}^{K̂} of the true change points. These are measurable functions of the data that return an increasing sequence of random time points η̂_1 < ... < η̂_K̂ such that, as T → ∞, the event {K̂ = K and max_{k=1,...,K} |η̂_k − η_k| ≤ ε} holds with probability tending to 1, for some sequence ε = ε(T). Throughout the rest of the paper, we refer to the quantity ε/δ as the localization rate. Our goal is to obtain change point estimators that, under the weakest possible conditions, are guaranteed to yield the smallest possible ε.

Summary of results
We will show that under Condition 1.1, the hardness of the change point localization task is fully captured by the quantity κδ^{1/2}n^{1/2}, which can be regarded as a signal-to-noise ratio of sorts. We list our contributions as follows.
• We demonstrate the existence of a phase transition for the localization task in terms of the signal-to-noise ratio κδ^{1/2}n^{1/2}. We show that in the low signal-to-noise ratio regime κδ^{1/2}n^{1/2} ≲ 1, no algorithm is guaranteed to be consistent. We also show that if κδ^{1/2}n^{1/2} ≥ ζ_T, where ζ_T is any diverging sequence as T → ∞, then a minimax lower bound on the localization error rate is of order (nκ²T)^{-1}.
• We develop a novel detection procedure, based on the Kolmogorov-Smirnov distance, given in Algorithm 1. We show that under suitable conditions, our method is consistent and nearly minimax rate optimal, up to logarithmic factors, in terms of both the required signal-to-noise ratio and the localization error rate. The specific assumption is given in Condition 3.1, and the localization error rate in Theorem 3.1. Interestingly, our lower bounds on the signal-to-noise ratio and the localization error rate match those derived for the univariate mean change point localization problem under sub-Gaussian noise, e.g. Wang, Yu and Rinaldo (2020).
• We provide extensive comparisons of our algorithm and theoretical guarantees with several competing methods and results from the literature; see Appendix A and Section 4, respectively. In particular, our simulations indicate that our procedure performs well across a variety of scenarios, in terms of both estimating the number of change points and their locations.
We point out that, although in deriving the theoretical guarantees for our methodologies we follow techniques proposed in existing work, namely Venkatraman (1992) and Fryzlewicz (2014), our results deliver improvements in two aspects. Firstly, the extension to nonparametric settings, in which the magnitude of the distributional changes is measured by the Kolmogorov-Smirnov distance, requires novel and nontrivial arguments, especially to quantify the order of the stochastic fluctuations of the associated CUSUM statistics. Secondly, the arguments used in Fryzlewicz (2014) for the theoretical analysis of the performance of the WBS algorithm have to be sharpened in order to allow for all the model parameters to vary as the sample size diverges and in order to yield optimal localization rates.

Methodology
In this section, we detail our Kolmogorov-Smirnov detector procedure, which is based on the CUSUM Kolmogorov-Smirnov statistic defined next. Similar or related procedures based on Kolmogorov-Smirnov statistics have been considered previously in Darkhovski (1994), Boukai and Zhou (1997) and Padilla et al. (2018), among others. In Definition 2.1, n_{s:e} is the total number of observations collected in the integer interval [s, e], and F̂_{s:e} is the empirical cumulative distribution function estimated using the data collected in [s, e].
The proposed procedure applies the wild binary segmentation procedure (Fryzlewicz, 2014), which was originally developed for univariate mean change point detection problems based on the univariate CUSUM statistic; we instead use the CUSUM Kolmogorov-Smirnov statistic. To be specific, we draw a collection of random time intervals and, within each interval, search for the time point that maximizes the CUSUM Kolmogorov-Smirnov statistic. If the corresponding maximal value of the statistic exceeds an appropriate threshold, then that time point is added to the collection of estimated change points. The process is then repeated separately on each of the two resulting time sub-intervals, and it stops when all the CUSUM Kolmogorov-Smirnov statistics fall below the threshold or when the resulting time interval is too narrow. See Algorithm 1 for details. The main input of Algorithm 1 is the threshold τ, a tuning parameter controlling the number of returned change points, with larger values of τ producing fewer estimated change points. Our theory in the next section will shed some light on how to choose τ.
For any integer triplet (s, t, e) with 1 ≤ s < t < e ≤ T, the computational cost of calculating D^t_{s,e} is of order O{(e − s)n_{s:e} log(n_{s:e})}. This can be seen using a naïve calculation based on the merge sort algorithm (e.g. Knuth, 1998) and the fact that the supremum in (2.1) only needs to be taken over z ∈ {Y_{u,i} : s ≤ u ≤ e, i = 1, . . . , n_u}. Algorithm 1 therefore has a worst-case running time of order O{MT n_{1:T} log(n_{1:T})}, where M is the number of randomly drawn time intervals.
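Since Definition 2.1 is referenced only abstractly above, the following sketch makes the scan within a single interval concrete. The square-root weighting of the Kolmogorov-Smirnov distance is our assumption, chosen in analogy with the mean-based CUSUM statistic, and the function names are hypothetical.

```python
import math

def cusum_ks(Y, s, t, e):
    """A CUSUM Kolmogorov-Smirnov statistic for splitting [s, e] at t:
    a weighted KS distance between the data pooled over [s, t] and over
    (t, e].  Y[u] is the list of observations collected at time u."""
    left = [y for u in range(s, t + 1) for y in Y[u]]
    right = [y for u in range(t + 1, e + 1) for y in Y[u]]
    # Weighting assumed by analogy with the mean-based CUSUM statistic.
    weight = math.sqrt(len(left) * len(right) / (len(left) + len(right)))
    def F(sample, z):
        return sum(1 for y in sample if y <= z) / len(sample)
    # The supremum only needs to be evaluated at the observed points.
    grid = sorted(set(left) | set(right))
    return weight * max(abs(F(left, z) - F(right, z)) for z in grid)

def best_split(Y, s, e):
    """The candidate change point in [s, e): the maximizer of the statistic,
    as scanned within one randomly drawn interval of Algorithm 1."""
    return max(range(s, e), key=lambda t: cusum_ks(Y, s, t, e))
```

In Algorithm 1, a split of this kind is kept only when the maximized statistic exceeds the threshold τ, after which the scan recurses on the two resulting sub-intervals.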

The consistency of the Kolmogorov-Smirnov detector algorithm
We first state a condition involving the minimal spacing, the minimal jump size and the total number of time points T, which overall amounts to a minimal signal-to-noise ratio condition.
Condition 3.1. There exists an absolute constant C_SNR > 0 such that κ(δn)^{1/2} ≥ C_SNR φ_T log^{1/2}(n_{1:T}), where φ_T is any diverging sequence as T grows unbounded.

The scaling exhibited in Condition 3.1 covers nearly all combinations of model parameters for which the localization task is feasible: in Section 3.2 we show that no estimator of the change points is guaranteed to be consistent when the model parameters violate Condition 3.1, up to a poly-logarithmic term in T .
We then show that, under Condition 3.1 and with appropriate choices of the input parameters, the Kolmogorov-Smirnov detector procedure will yield, with high probability, the correct number of change points and a vanishing localization rate. In fact, as shown below in Section 3.2, the resulting localization rates are nearly minimax rate optimal (matching the minimax risk up to logarithmic factors; see, e.g. Section 2.1 in Tsybakov, 2009).

Theorem 3.1. Suppose the threshold τ is chosen in the range given in (3.2), where c_{τ,1}, c_{τ,2} > 0 are absolute constants. Let {η̂_k}_{k=1}^{K̂} be the corresponding output of Algorithm 1. Then there exists a constant C > 0 such that the event {K̂ = K and |η̂_k − η_k| ≤ C κ_k^{-2} log(n_{1:T}) n^{-1} for all k = 1, . . . , K} holds with probability at least the bound given in (3.3). If M ≳ log(T/δ)T²δ^{-2}, then the probability in (3.3) approaches 1 as T → ∞, which shows that Algorithm 1 is consistent.
Based on Condition 3.1, the range of tuning parameters τ defined in (3.2) is not empty, and the upper bound on the localization error rate satisfies ε/δ ≤ C κ^{-2} log(n_{1:T}) n^{-1} δ^{-1} ≲ φ_T^{-2}, where the last inequality follows from Condition 3.1. It is important to point out that Theorem 3.1 yields a family of localization rates that depend on how n, κ and δ scale with T. The slow rate φ_T^{-2} above represents the worst-case scenario, corresponding to the weakest possible signal-to-noise ratio afforded by Condition 3.1. In fact, for most combinations of the model parameters, the rate can be much faster. For instance, when κ is bounded away from zero and δ ≍ T, the resulting rate is of order O(log(n_{1:T}) n^{-1} T^{-1} φ_T). Still assuming a nonvanishing κ, and provided that n increases with T, our Kolmogorov-Smirnov detector estimator remains consistent even if δ is as small as log(n_{1:T}), with a localization rate of order O(n^{-1}). Finally, if the change points are evenly spaced with δ ≍ T^{1/2}, then the number of change points satisfies K ≍ T^{1/2}. In this case the Kolmogorov-Smirnov detector procedure will output a consistent estimator of the change points even with κ tending to 0 as T increases, as long as the rate of decay of κ is slower than φ_T log^{1/2}(n_{1:T}) n^{-1/2} T^{-1/4}.

However, if K → ∞ as T grows, then (3.1) is somewhat unsatisfactory, as it assumes some knowledge about the rate of growth of the minimal spacing δ. This may not be available in practice, even though in many cases an educated guess on the minimal spacing is not unreasonable. Though (3.1) does not appear among the assumptions of Theorem 3.2 in Fryzlewicz (2014) about the performance of the wild binary segmentation algorithm, it is implicitly assumed there. More generally, some knowledge of δ, in one form or another, is routinely assumed in order to demonstrate consistency or optimality of wild binary segmentation-like algorithms, see, e.g.
Wang and Samworth (2018), Wang, Yu and Rinaldo (2020), Baranowski, Chen and Fryzlewicz (2019), Anastasiou and Fryzlewicz (2019) and Eichinger and Kirch (2018). For more discussions on this, see Wang, Yu and Rinaldo (2020).
In the proofs, we randomly draw intervals subject to the width constraint in (3.1), discarding draws that violate it; this procedure continues until we have M such intervals. In practice, as demonstrated in Section 4, any given data set contains only a finite number of change points, and we therefore do not need to impose (3.1).
We remark that condition (3.1) is only needed to guarantee that the localization rate of the Kolmogorov-Smirnov detector procedure is nearly minimax rate optimal (see the next section), not to ensure consistency. In fact, (3.1) is not at all necessary for consistency. For instance, assuming that κ(δn)^{1/2}(δ/T) ≳ φ_T log^{1/2}(n_{1:T}), a slightly stronger signal-to-noise ratio condition than the one in Condition 3.1, a simple adaptation of the proof of Theorem 3.1 shows that, even without (3.1), the Kolmogorov-Smirnov detector procedure is consistent with a vanishing localization rate ε/δ ≲ κ^{-2} log(n_{1:T}) n^{-1} δ^{-1}.

Theorem 3.1 delivers a strict improvement upon the guarantees claimed in other nonparametric change point detection papers, since it guarantees localization rates that are local. Indeed, according to Theorem 3.1, each change point η_k possesses its own localization rate, depending on κ_k, the magnitude of the corresponding distributional change. In particular, change points at which the distributional change is more pronounced can be estimated more accurately.
The detailed proof of Theorem 3.1 can be found in Appendix B. Here we provide a brief summary of the technical arguments used there. The first key step is to obtain finite sample concentration inequalities to control the deviation between the CUSUM Kolmogorov-Smirnov statistics {D^t_{s,e} : 1 ≤ s < t ≤ e ≤ T} and their population versions. The rest of the proof is conducted on so-called good events, which hold with probability tending to 1 as T grows unbounded and on which these fluctuations remain within an appropriate range. Next, we show that the population versions of the CUSUM Kolmogorov-Smirnov statistics are maximized at the true change points and, crucially, decay rapidly away from them. Consequently, on the good events, our algorithm correctly detects or rejects the existence of change points and localizes them accurately.

Phase transition and minimax optimality
In this subsection, we prove that Algorithm 1 is optimal, in the sense of guaranteeing nearly minimax optimal localization rates across almost all models for which the localization task is possible. Towards that end, recall that in Theorem 3.1 we have shown that Algorithm 1 provides consistent change point estimators under the assumption that κ(δn)^{1/2} ≳ φ_T log^{1/2}(n_{1:T}). (3.4) In Lemma 3.1 below, we will show that if κ(δn)^{1/2} ≲ 1, (3.5) then no algorithm is guaranteed to be consistent uniformly over all possible change point problems. In light of (3.4), (3.5), Theorem 3.1 and Lemma 3.1 below, we have found a phase transition over the space of model parameters that separates scalings of the model parameters for which the localization task is impossible from those for which Algorithm 1 is consistent. The separation between these two regions is sharp, save for the term φ_T log^{1/2}(n_{1:T}).
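The phase transition can be summarized in a single display; the localization-rate annotation in the second regime is our paraphrase of the bound in Theorem 3.1:

```latex
\[
\underbrace{\kappa\,\delta^{1/2} n^{1/2} \lesssim 1}_{\substack{\text{no consistent localization}\\ \text{(Lemma 3.1)}}}
\qquad \text{versus} \qquad
\underbrace{\kappa\,\delta^{1/2} n^{1/2} \gtrsim \phi_T \log^{1/2}(n_{1:T})}_{\substack{\text{Algorithm 1 consistent, with}\\ |\hat\eta_k - \eta_k| \lesssim \kappa_k^{-2}\, n^{-1}\log(n_{1:T})}}
\]
```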
Lemma 3.1. Let {Y_{t,i} : t = 1, . . . , T, i = 1, . . . , n_t} be a time series satisfying Condition 1.1 with one and only one change point, and let P^T_{κ,n,δ} denote the corresponding joint distribution. Set P = {P^T_{κ,n,δ} : δ = min(2^{-1}κ^{-2}n^{-1}, T/3)} and, for each P ∈ P, let η(P) be the corresponding change point. Then the worst-case risk sup_{P∈P} E_P |η̂ − η(P)| does not vanish relative to the minimal spacing δ, so that no estimator is uniformly consistent over P, where the infimum is over all possible estimators of the change point locations.
In our next result, Lemma 3.2, we derive a minimax lower bound on the localization task, which applies to nearly all combinations of model parameters outside the impossibility region found in Lemma 3.1; as in Lemma 3.1, the infimum in its statement is over all possible estimators of the change point locations.
Note that the condition δ ≤ T/2 automatically holds due to the definition of δ. The above lower bound matches, save for a poly-logarithmic factor, the localization rate we established in Theorem 3.1, thus showing that Algorithm 1 is nearly minimax rate optimal.

Choice of tuning parameters
Algorithm 1 calls for three tuning parameters: the upper bound C_M on the random interval widths, the number of random intervals M and the threshold τ. In this subsection, we provide practical guidance on choosing these tuning parameters.
The constant C_M enters through (3.1) and serves purely theoretical purposes: since we allow the number of change points to diverge, C_M is required in Theorem 3.1 to obtain the nearly minimax optimal rates. In practice, any given data set contains only finitely many change points, hence (3.1) holds automatically and C_M is not a tuning parameter in actual use.
As for the number of random intervals M, Theorem 3.1 requires that M ≳ log(T/δ)T²δ^{-2}. In practice, δ is typically unknown; when the change points are spaced proportionally to T, that is δ ≍ T, the requirement reduces to M ≳ 1. In all the numerical experiments in Section 4, we let M = 120, which works well in practice.
Arguably, the most important tuning parameter in Algorithm 1 is τ , whose value determines whether a candidate time point should be deemed a change point. If we let τ decrease from ∞ to 0, then the procedure produces more and more change points. In particular, if all the other inputs, namely {Y t,i } and {(α m , β m )}, are kept fixed, then it holds that B(τ 1 ) ⊆ B(τ 2 ), for τ 1 ≥ τ 2 , where B(τ ) is the collection of estimated change points returned by Algorithm 1 when a value of τ for the threshold parameter is used. We take advantage of such nesting in order to design a data-driven method for picking τ .
To proceed, we now introduce Algorithm 2. Algorithm 2 is a generic procedure that can be used for merging two collections of estimated change points B 1 and B 2 . Algorithm 2 deletes from B 1 ∪ B 2 potential false positives by checking their validity one by one based on the CUSUM Kolmogorov-Smirnov statistics. However, Algorithm 2 does not scan for potential false positives in the set B 1 ∩B 2 .
The criterion deployed in Algorithm 2 is based on the check displayed in (3.6), where λ > 0 is a specified tuning parameter to be discussed in Theorem 3.2.
In order to have a practical scheme for selecting τ, we now propose Algorithm 3, a fully data-driven change point detection procedure with automated tuning parameter selection. Algorithm 3 requires two independent subsamples, {Y_{t,i}} and {W_{t,i}}. In practice, one can perform sample splitting. If n_t ≥ 2 for all t, then one can partition the data at every time point. If all or some of the n_t's equal 1, then for t ∈ {j : n_j = 1} we randomly assign the associated observation to {Y_{t,i}} or {W_{t,i}} with equal probability. In fact, there is no need to ensure that both subsamples have exactly the same number of sample points n_t for all t: our theoretical guarantees in Theorem 3.2 still hold as long as the numbers of observations at each time point have the same scaling in the two samples {Y_{t,i}} and {W_{t,i}}. When calling (3.6) in Algorithm 3, the empirical distribution functions are constructed from {Y_{t,i}}. To present Algorithm 3, we slightly abuse notation: to emphasize that Algorithm 1 is run on the sample {W_{t,i}}, we include {W_{t,i}} as a formal input to the Kolmogorov-Smirnov detector algorithm, and since the CUSUM Kolmogorov-Smirnov statistics are based on {Y_{t,i}}, we write D^η_{η̂_k,η̂_{k+1}}(z, {Y_{t,i}}) instead. As for the implementation of Algorithm 3, we arrange all the candidate sets in increasing order of their corresponding τ_j values. This ensures a decreasing nesting of the candidate change point sets. We begin with the set corresponding to the smallest τ_j and compare consecutive sets. Unlike Algorithm 2, however, we pick a single element from the difference sets and decide whether to move on or to terminate the procedure. Theorem 3.2 provides suitable conditions guaranteeing that this procedure results in a consistent estimator of the change points.

Theorem 3.2. Suppose the thresholds τ_1 < . . . < τ_J are such that τ_{j*} falls within the range prescribed in (3.2), where c_{τ,1}, c_{τ,2} > 0 are absolute constants, for some j* ∈ {2, . . . , J − 1}. Let B̂ = {η̂_1, . . . , η̂_K̂} be the output of Algorithm 3 with inputs satisfying the conditions above. If λ = C log(n_{1:T}) with a large enough constant C > 0, then Algorithm 3 consistently estimates the number and the locations of the change points.

Algorithm 3 Kolmogorov-Smirnov detector with tuning parameter selection
The proof of Theorem 3.2 can be found in Appendix D. It implicitly assumes that the nested sets {B_j} in Algorithm 3 satisfy |B_j \ B_{j+1}| = 1 for consecutive j. If this condition is not met, then the conclusion of Theorem 3.2 still holds provided that we replace the inequality condition in Algorithm 3 with the analogous inequality in which λ is compared against the maximum of the check statistic over all elements of the difference set. A similar proposal for selecting the threshold tuning parameter for the wild binary segmentation algorithm can be found in Theorem 3.3 of Fryzlewicz (2014); the proof of our Theorem 3.2 delivers a more careful and precise analysis.
Finally, Theorem 3.2 suggests choosing the parameter λ in Algorithm 3 as λ = C log(n_{1:T}). In practice, we set C = 2/3 and find that this choice performs reasonably well. We also include further simulations in Appendix E showing that Algorithm 3 is less sensitive to the parameter C than Algorithm 1 is to the parameter τ.
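A hypothetical sketch of the selection loop behind Algorithm 3: candidate change point sets from Algorithm 1 are nested along the grid of thresholds, and we walk from the smallest τ upward, discarding the single point dropped at each step unless it passes the validity check of (3.6). Here `nested_sets` and `passes_check` are stand-ins for the output of Algorithm 1 on {W_t,i} and for criterion (3.6) computed on {Y_t,i}.

```python
def select_change_points(nested_sets, passes_check):
    """nested_sets: candidate change point sets ordered by increasing tau,
    hence decreasing in size (B(tau_1) contains B(tau_2), and so on).
    passes_check(eta): True when eta survives the CUSUM-type validity check."""
    for current, smaller in zip(nested_sets, nested_sets[1:]):
        dropped = set(current) - set(smaller)
        # If the point dropped at the next (larger) threshold still looks like
        # a genuine change point, terminate and keep the current, larger set.
        if any(passes_check(eta) for eta in dropped):
            return sorted(current)
    # Every dropped point failed the check: keep the smallest candidate set.
    return sorted(nested_sets[-1])
```

For example, with nested candidates {100, 250, 400} ⊇ {100, 400} ⊇ {100} and a check that validates 100 and 400 only, the loop discards 250 and returns [100, 400].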

Simulations
In this section we present the results of various simulation experiments aimed at assessing the performance of our method in a wide range of scenarios and in relation to other competing methods. The R (R Core Team, 2019) code used in our simulations is available at https://github.com/hernanmp/NWBS. We measure the performance of an estimator K̂ of the true number of change points K by the absolute error |K̂ − K|. In all our examples, we report the average absolute errors over 100 repetitions. Furthermore, denoting by C = {η_1, . . . , η_K} the set of true change points, the performance of an estimator Ĉ of C is measured by the one-sided Hausdorff distance d(Ĉ|C) = max_{η∈C} min_{x∈Ĉ} |x − η|. By convention we set it to be infinity when Ĉ is an empty set. Note that if Ĉ = {1, . . . , T}, then d(Ĉ|C) = 0; thus, d(Ĉ|C) can be insensitive to overestimation. To overcome this, we also calculate d(C|Ĉ). In all of our simulations, for a method that produces an estimator Ĉ, we report the median of both d(Ĉ|C) and d(C|Ĉ) over 100 repetitions.
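The evaluation metric just described is straightforward to compute; a minimal sketch (the function name is ours):

```python
def one_sided_hausdorff(est, truth):
    """d(est | truth): for every true change point, the distance to the
    nearest estimated one, maximized over the true change points.
    Returns infinity when the set of estimates is empty."""
    if not est:
        return float("inf")
    return max(min(abs(x - eta) for x in est) for eta in truth)
```

Reporting both d(Ĉ|C) and d(C|Ĉ) penalizes under- and overestimation respectively, since d(Ĉ|C) alone is zero whenever Ĉ covers every time point.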
We start by focusing on the case in which n_t = 1 for all t = 1, . . . , T. The methods that we benchmark against are the following: wild binary segmentation (Fryzlewicz, 2014), Bai and Perron's method (Bai and Perron, 2003; Zeileis et al., 2002), pruned dynamic programming (Rigaill, 2010; Cleynen, Rigaill and Koskas, 2016), the pruned exact linear time algorithm (Killick, Fearnhead and Eckley, 2012; Killick and Eckley, 2014), the simultaneous multiscale change point estimator of Frick, Munk and Sieling (2014), its heterogeneous variant (Pein, Sieling and Munk, 2017), the robust functional pruning optimal partitioning method of Fearnhead and Rigaill (2018) and the kernel change point detection procedure of Celisse et al. (2018) and Arlot, Celisse and Harchaoui (2019). For all the competing methods, we set the respective tuning parameters to their default choices. For the kernel change point detection procedure, we use the function KernSeg MultiD in the R (R Core Team, 2019) package kernseg. The function produces a sequence of candidate models, each associated with a measure of fit; we choose the best model from this sequence using the elbow criterion.
We apply Algorithm 3 with λ = (2/3) log(n_{1:T}), a choice that is guided by Theorem 3.2 and that we find reasonable in practice (see Appendix E for sensitivity simulations comparing Algorithms 1 and 3). Moreover, we construct the samples {Y_{t,i}} and {W_{t,i}} by splitting the data into two time series consisting of the odd and even time indices, respectively. We set the number of random intervals to M = 120.
We now describe the generative models deployed in our simulations. For all scenarios, we consider T ∈ {1000, 4000, 8000}. Moreover, we consider the partition P of {1, . . . , T} induced by the change points η_0 = 1 < η_1 < . . . < η_K < η_{K+1} = T + 1, which are evenly spaced in {1, . . . , T}. The elements of P are A_1, . . . , A_{K+1}, with A_j = [η_{j−1}, η_j − 1]. We consider the following scenarios.

Scenario 1. Let K = 7 for each instance of T. Define F_t to have the probability density function shown in the left panel of Figure 1 for t ∈ A_j with odd j, and that shown in the right panel of Figure 1 for t ∈ A_j with even j. In another scenario, the data are generated as y_t = θ_t + 3^{-1/2} ε_t, t = 1, . . . , T, where the ε_t are independent and identically distributed according to a t-distribution with 3 degrees of freedom.
Scenario 5. Let K = 2 and generate the data as y_t = ε_t for t ∈ A_j with odd j, and y_t = 5^{-1/2} ξ_t otherwise, where the ε_t are independent and identically distributed as N(0, 1) and the ξ_t are independent and identically distributed according to a t-distribution with 2.5 degrees of freedom.
A visualization of the different scenarios is given in Figure 2. These five scenarios capture a broad range of models that allow us to assess the quality of the different methods.
Based on the results in Tables 1 and 2, we can see that, generally, the best performance is attained by the Kolmogorov-Smirnov detector and kernel change point detection. In some cases, in terms of the localization error, kernel change point detection outperforms the Kolmogorov-Smirnov detector, even though, to the best of our knowledge, theoretical guarantees on the localization error rate of kernel change point detection have not yet been established. In Scenario 1, as the sample size grows, the Kolmogorov-Smirnov detector and robust functional pruning optimal partitioning provide the best estimates of K. This is not surprising, as Scenario 1 presents a situation where the distributions are not members of the usual parametric families. In Scenario 2, robust functional pruning optimal partitioning attains the best performance, with the heterogeneous simultaneous multiscale change point estimator, kernel change point detection and the Kolmogorov-Smirnov detector as the closest competitors. This scenario poses a challenge for most methods due to the heavy-tailed nature of the t-distributions. In Scenario 3, the Kolmogorov-Smirnov detector is outperformed by methods such as wild binary segmentation, pruned dynamic programming, the simultaneous multiscale change point estimator and Bai and Perron's method. While the Kolmogorov-Smirnov detector is still competitive in estimating the number of change points K, its localization errors are subpar. This should not come as a surprise, since all these methods are designed to work well in this particular mean change point model with sub-Gaussian errors. In Scenario 4, we see a clear advantage of using one of the nonparametric approaches, such as kernel change point detection, the Kolmogorov-Smirnov detector and nonparametric multiple change point detection.
Methods such as wild binary segmentation, the pruned exact linear time algorithm, Bai and Perron's procedure, simultaneous multiscale change point estimators and heterogeneous simultaneous multiscale change point estimators perform poorly in this scenario. This is especially interesting for the heterogeneous simultaneous multiscale change point estimator, as this method has been shown to be effective in detecting changes in variance even in the presence of concurrent changes in mean. However, Scenario 4 only includes changes in variance, with the mean remaining constant. Finally, Scenario 5 appears to be the most challenging one for all methods. In fact, the Kolmogorov-Smirnov detector and kernel change point detection seem to be the only methods capable of correctly estimating the number of change points, with the kernel change point detection procedure yielding smaller localization rates.

In our second set of simulations we study the case where the number of data points collected at each time point can be more than one. We consider the same five scenarios, the same tuning parameter selection method and the same performance metrics as in the first set of simulations. However, instead of setting n_t = 1, we let n_t be fixed at 5, 15 and 30, or randomly distributed as Poisson(5), Poisson(15) and Poisson(30), for each t. We also set T = 1000. We only present the results of the Kolmogorov-Smirnov detector, because no other method would automatically be able to handle this situation. We omit d(C| C) as it does not provide additional information. The results in Table 3 show the effectiveness of the Kolmogorov-Smirnov detector in estimating the number of change points and their locations.
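To make the variable-n_t setting concrete, the following minimal Python sketch (our own illustration, not the paper's Algorithm 1; a single change point and a plain two-sample scan are assumed) simulates Poisson(5) sample sizes at each time point and locates the change by maximizing a pooled two-sample Kolmogorov-Smirnov statistic:

```python
import numpy as np

rng = np.random.default_rng(0)

def ks_stat(x, y):
    """Two-sample Kolmogorov-Smirnov statistic sup_z |F_x(z) - F_y(z)|."""
    z = np.concatenate([x, y])  # empirical CDFs have knots only here
    Fx = np.searchsorted(np.sort(x), z, side="right") / len(x)
    Fy = np.searchsorted(np.sort(y), z, side="right") / len(y)
    return np.max(np.abs(Fx - Fy))

T, eta = 200, 100                 # one change point at eta, for illustration
# n_t ~ Poisson(5) observations at each time point t; the mean shifts at eta
samples = [rng.normal(0.0 if t < eta else 2.0, 1.0, size=rng.poisson(5))
           for t in range(T)]
# scan over candidate split points, pooling the samples on each side
scores = [ks_stat(np.concatenate(samples[:t]), np.concatenate(samples[t:]))
          for t in range(30, T - 30)]
estimate = 30 + int(np.argmax(scores))
```

Pooling the within-time-point samples is what lets a Kolmogorov-Smirnov scan handle n_t > 1 automatically, which parametric mean-based competitors do not do out of the box.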

Real data analysis
We consider the array comparative genomic hybridization microarray data set from Bleakley and Vert (2011), which consists of data from individuals with bladder tumours. The data set has been processed and can be obtained from the R (R Core Team, 2019) package ecp (Matteson and James, 2013). For the microarray corresponding to the first individual, we consider change point localization using different methods. Figure 3 shows that all the methods seem to recover the important change points in the time series associated with individual 1 in the data set. A potential advantage of the Kolmogorov-Smirnov detector is that it appears less sensitive to potentially spurious change points than wild binary segmentation, nonparametric multiple change point detection and robust functional pruning optimal partitioning.

Appendix A: Comparisons
We compare our rates with those in the univariate mean change point detection problem, which assumes sub-Gaussian data (e.g. Wang, Yu and Rinaldo, 2020). On the one hand, this comparison inherits the usual trade-offs between parametric and nonparametric modelling: under the general assumptions we impose on the underlying distribution functions, we are free from the risk of model mis-specification. On the other hand, perhaps surprisingly, we achieve the same rates as in the univariate mean change point detection case, even though sub-Gaussianity is assumed there. This is in fact to be expected. Our CUSUM Kolmogorov-Smirnov statistic is built from empirical distribution functions, which at every z ∈ R are essentially weighted averages of Bernoulli random variables. Since Bernoulli random variables are sub-Gaussian, and since the empirical distribution functions are step functions with knots only at the sample points, the same rates are indeed to be expected.
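The observation that the empirical distribution functions are step functions with knots only at the sample points also has a computational consequence: a CUSUM-type Kolmogorov-Smirnov statistic only needs to be evaluated at the pooled sample points. The sketch below is our own illustrative implementation (the function name and the standard CUSUM weighting are ours; the exact normalization in Algorithm 1 may differ):

```python
import numpy as np

def cusum_ks(samples, s, e, t):
    """CUSUM-type Kolmogorov-Smirnov statistic for splitting the samples
    collected at time points s, ..., e-1 at the candidate point t.
    The sup over z in R reduces to a max over the pooled sample points,
    because both empirical CDFs are step functions with knots there."""
    left = np.concatenate(samples[s:t])
    right = np.concatenate(samples[t:e])
    n1, n2 = len(left), len(right)
    knots = np.concatenate([left, right])          # step-function knots
    F1 = np.searchsorted(np.sort(left), knots, side="right") / n1
    F2 = np.searchsorted(np.sort(right), knots, side="right") / n2
    return np.sqrt(n1 * n2 / (n1 + n2)) * np.max(np.abs(F1 - F2))
```

The statistic is largest when t coincides with a true change point, which is what the detector exploits.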
Furthermore, the heterogeneous simultaneous multiscale change point estimator from Pein, Sieling and Munk (2017) can also be compared to the Kolmogorov-Smirnov detector. Assuming Gaussian errors, a minimal spacing δ of order T and K = O(1), Theorems 3.7 and 3.9 in Pein, Sieling and Munk (2017) show that the heterogeneous simultaneous multiscale change point estimator consistently estimates the number of change points, with a localization error of order r(T) for any sequence r(T) such that r(T)/log(T) → ∞. This is weaker than our upper bound, which guarantees a localization error of order log(T). Moreover, the Kolmogorov-Smirnov detector can handle changes in variance when the mean remains constant, a setting in which it is unknown whether the heterogeneous simultaneous multiscale change point estimator is consistent.
Another interesting contrast can be made between the Kolmogorov-Smirnov detector and the multiscale quantile segmentation method of Vanegas, Behr and Munk (2019). Both algorithms make no assumptions on the form of the underlying distribution functions. However, multiscale quantile segmentation is designed to identify changes in one known quantile. This is not a requirement for the Kolmogorov-Smirnov detector, which can detect changes of any type without prior knowledge of their nature. As for statistical guarantees, translated into our notation, provided that δ is at least of order log(T), multiscale quantile segmentation can consistently estimate the number of change points with a localization error of order log(T). This matches our theoretical guarantees in Theorem 3.1.

We also compare the theoretical properties of the Kolmogorov-Smirnov detector with those of the nonparametric multiple change point detection procedure of Zou et al. (2014). Both methods guarantee consistent change point localization for univariate sequences in fully nonparametric settings.
• We measure the magnitude κ of the distributional changes at the change points using the Kolmogorov-Smirnov distance, as in (1.1). In contrast, Zou et al. (2014) deploy a weighted Kullback-Leibler divergence (see Assumption A4 in their paper), which is stronger than the Kolmogorov-Smirnov distance and therefore more discriminative. At the same time, the authors require the cumulative distribution functions to be continuous, while our results hold for arbitrary distribution functions.
• We let κ change with T, thus allowing the minimal jump size to decay to 0, and provide localization rates that depend explicitly on κ. In contrast, Zou et al. (2014) constrain the jump sizes to be bounded away from 0, and their localization rate does not involve the jump sizes.
• Zou et al. (2014) impose stronger conditions on the rate at which the number of change points is permitted to diverge as a function of T. For example, assuming equally spaced change points, the assumptions of their Theorem 1 require, in our notation, that K = o(T^{1/4}). In contrast, we allow K to grow as fast as T/log(T).
• If we let the number of change points be bounded in T, as assumed in Theorem 2 of Zou et al. (2014), then, again translating into our notation, their procedure yields a localization error of order log^{2+c}(T) for some c > 0, while we achieve log(T).
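To illustrate why the Kolmogorov-Smirnov distance accommodates arbitrary distribution functions, the following sketch (an illustrative helper of our own, not from either paper) computes the distance between two discrete distributions, a case excluded by the continuity requirement of Zou et al. (2014):

```python
import numpy as np

def ks_distance_discrete(atoms1, probs1, atoms2, probs2):
    """Kolmogorov-Smirnov distance between two discrete distributions,
    each given by its atoms and their probabilities. Valid for arbitrary
    (in particular, discontinuous) distribution functions."""
    def cdf(atoms, probs, z):
        order = np.argsort(atoms)
        cum = np.concatenate(([0.0], np.cumsum(np.asarray(probs)[order])))
        return cum[np.searchsorted(np.asarray(atoms)[order], z, side="right")]
    # both CDFs are constant between atoms, so the sup is attained at an atom
    z = np.union1d(atoms1, atoms2)
    return np.max(np.abs(cdf(atoms1, probs1, z) - cdf(atoms2, probs2, z)))

# Bernoulli(0.3) vs Bernoulli(0.5): sup_z |F_1(z) - F_2(z)| = |0.7 - 0.5| = 0.2
kappa = ks_distance_discrete([0, 1], [0.7, 0.3], [0, 1], [0.5, 0.5])
```

A density-based divergence would not be directly defined here, whereas the Kolmogorov-Smirnov distance is.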
A further comparison can be drawn with the kernel change point detection procedure of Garreau and Arlot (2018).
• Both our work and theirs study a nonparametric change point localization problem, and both achieve parametric rates, in the sense that the complexity of the class of distribution functions does not appear in the results. In our case, this is because we adopt the Kolmogorov-Smirnov distance and transform general distributions into Bernoulli distributions. In Garreau and Arlot (2018), this is because the data are mapped into a reproducing kernel Hilbert space.
• The jump size κ is measured differently in the two papers. In our paper, the jump size is the difference in terms of the Kolmogorov-Smirnov distance, which reflects genuine differences between distributions. In Garreau and Arlot (2018), it is a distance induced by the Hilbert space norm; more specifically, the distance between two Bochner integrals. With this definition, despite its flexibility, and as pointed out in Garreau and Arlot (2018), it is not guaranteed that the detected change points are change points of the original distributions.
• In terms of the signal-to-noise ratio condition and the localization rates, there are two differences. Firstly, we allow all model parameters to vary with the sample size, while the variance parameter in Garreau and Arlot (2018) is assumed to be constant. Secondly, in both their signal-to-noise ratio condition (see Eq. (3.3) in Garreau and Arlot, 2018) and their localization rate (see the final result in Theorem 3.1 of Garreau and Arlot, 2018), there is an additional factor K, the number of change points; this factor does not appear in our results.

Appendix B: Proofs
Proof of Theorem 3.1. Let ε = C κ^{-2} log(n_{1:T}) n_max^7 n_min^{-8}. Since ε is the upper bound on the localization error, by induction it suffices to consider any interval (s, e) ⊂ (1, T) satisfying the conditions of the induction step. It has to be the case that, for any change point η_k ∈ (0, T), either min{|η_k − s|, |η_k − e|} ≤ ε or min{|η_k − s|, |η_k − e|} ≥ δ − ε ≥ 3δ/4.
This means that min{|η_k − s|, |η_k − e|} ≤ ε indicates that η_k was detected as a change point in a previous induction step, even if η_k ∈ (s, e). We refer to η_k ∈ (s, e) as an undetected change point if min{|η_k − s|, |η_k − e|} ≥ 3δ/4. In order to complete the induction step, it suffices to show that we (i) will not detect any new change point in (s, e) if all the change points in that interval have been previously detected, and (ii) will find a point b ∈ (s, e) such that |η_k − b| ≤ ε if there exists at least one undetected change point in (s, e).
For j = 1, 2, define the events
Set γ = C_γ log^{1/2}(n_{1:T}), with a sufficiently large constant C_γ > 0. The rest of the proof is carried out on the event A_1(γ) ∩ A_2(γ) ∩ S, the probability of which can be lower bounded using Lemma B.3 and Lemma 13 in Wang, Yu and Rinaldo (2020).
Step 1. In this step, we show that we consistently detect or reject the existence of undetected change points within (s, e). Let a_m, b_m and m* be defined as in Algorithm 1. Suppose there exists a change point η_k ∈ (s, e) such that min{η_k − s, e − η_k} ≥ 3δ/4. On the event S, there exists a selected interval (α_m, β_m) with α_m ∈ [η_k − 3δ/4, η_k − δ/2] and β_m ∈ [η_k + δ/2, η_k + 3δ/4]. The last inequality follows from the choice of γ, the fact that n_min ≍ n_max, and the fact that c_{τ,2} > 0 is achievable with a sufficiently large C_SNR in Condition 3.1. This means that we accept the existence of undetected change points.
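For intuition on the event S, the following sketch (assuming intervals drawn uniformly at random, in the style of wild binary segmentation; T, M, η and δ are illustrative values) checks empirically that, with enough random intervals, some interval (α_m, β_m) brackets a change point η_k at the required distances:

```python
import numpy as np

rng = np.random.default_rng(2)
T, M, eta, delta = 1000, 5000, 400, 200
# draw M random intervals with endpoints uniform on {0, ..., T}
a = rng.integers(0, T + 1, size=M)
b = rng.integers(0, T + 1, size=M)
alpha, beta = np.minimum(a, b), np.maximum(a, b)
# event S: some interval has alpha in [eta - 3*delta/4, eta - delta/2]
# and beta in [eta + delta/2, eta + 3*delta/4]
covered = np.any((alpha >= eta - 3 * delta / 4) & (alpha <= eta - delta / 2)
                 & (beta >= eta + delta / 2) & (beta <= eta + 3 * delta / 4))
```

Each draw brackets η_k as required with probability of order (δ/T)^2, so on the order of (T/δ)^2 log(1/α) intervals suffice for the event to hold with probability at least 1 − α.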
Suppose instead that there are no undetected change points within (s, e). Then, for any (s_m, e_m), one of the following situations must hold.
(a) There is no change point within (s_m, e_m); (b) there exists only one change point η_k ∈ (s_m, e_m) and min{η_k − s_m, e_m − η_k} ≤ ε; or (c) there exist two change points η_k, η_{k+1} ∈ (s_m, e_m) with η_k − s_m ≤ ε and e_m − η_{k+1} ≤ ε. Observe that if (a) holds, then the maximum of the statistic over t = s_m + 1, ..., e_m − 1 falls below the threshold, so no change points are detected. Cases (b) and (c) are similar, and case (b) is simpler than (c), so we only focus on case (c). It follows from Lemma B.2 that the maximum over t = s_m + 1, ..., e_m − 1 is controlled, so that under (3.2) we always correctly reject the existence of undetected change points.
Step 2. Assume that there exists a change point η_k ∈ (s, e) such that min{η_k − s, e − η_k} ≥ 3δ/4. Let s_m, e_m and m* be defined as in Algorithm 1. To complete the proof it suffices to show that there exists a change point η_k ∈ (s_{m*}, e_{m*}) such that min{η_k − s_{m*}, e_{m*} − η_k} ≥ δ/4 and |b_{m*} − η_k| ≤ ε.
To this end, we verify that the assumptions of Lemma B.9 hold. Note that (B.26) follows from (B.2), (B.27) and (B.28) follow from the definitions of the events A_1(γ) and A_2(γ), and (B.29) follows from Condition 3.1.
Thus, all the conditions of Lemma B.9 are met, and we conclude that there exists a change point η_k satisfying |b_{m*} − η_k| ≤ ε, where the last inequality holds by the choice of γ and Condition 3.1. The proof is completed by noticing that (B.3) and (s_{m*}, e_{m*}) ⊂ (s, e) imply that min{e − η_k, η_k − s} > δ/4 > ε.
As discussed in the argument before Step 1, this implies that η k must be an undetected change point.
Below are a number of auxiliary lemmas. Lemma B.1 plays the role of Lemma 2.2 in Venkatraman (1992). Lemma B.3 controls the deviance between sample and population Kolmogorov-Smirnov statistics. Lemma B.4 is the density version of Lemma 2.4 in Venkatraman (1992). Lemma B.5 plays the role of Lemma 2.6 of Venkatraman (1992). Lemma B.6 is essentially Lemma 17 in Wang, Yu and Rinaldo (2020). Lemma B.8 is Lemma 19 in Wang, Yu and Rinaldo (2020).
we have the following.
The following lemma provides a concentration result for the sample CUSUM statistic around its population version. One natural way to proceed in its proof is to exploit the Dvoretzky-Kiefer-Wolfowitz inequality. However, given the presence of multiple change points, an immediate application of the Dvoretzky-Kiefer-Wolfowitz inequality produces an extra factor K in the upper bound of Lemma B.3. We instead obtain a sharper upper bound with a different proof technique.
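For reference, the Dvoretzky-Kiefer-Wolfowitz inequality states that P(sup_z |F̂_n(z) − F(z)| > ε) ≤ 2 exp(−2nε²). A quick numerical sanity check of this bound (our own illustration, using the uniform distribution so that F(z) = z):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, alpha = 500, 500, 0.05
# DKW: P(sup_z |F_n(z) - F(z)| > eps) <= 2 exp(-2 n eps^2) = alpha
eps = np.sqrt(np.log(2 / alpha) / (2 * n))
exceed = 0
for _ in range(reps):
    x = np.sort(rng.uniform(size=n))       # F(z) = z on [0, 1]
    upper = np.arange(1, n + 1) / n        # F_n at each order statistic
    lower = np.arange(0, n) / n            # F_n just before each one
    dev = max(np.max(upper - x), np.max(x - lower))  # sup_z |F_n(z) - F(z)|
    exceed += dev > eps
rate = exceed / reps                        # empirical exceedance frequency
```

The empirical exceedance frequency `rate` stays at or below the nominal level α, as the inequality guarantees; the bound is close to tight here, which is why a union bound over K segments would inflate the constant by a factor of K.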
where m is a positive integer to be specified. On the other hand, for j ∈ {1, ..., M}, let z ∈ I_j and, without loss of generality, assume that z_j < z. Without loss of generality, assume further that F_{η_k}(z_0) > F_{η_{k+1}}(z_0). For s < t < e, the two displayed bounds then hold, where c > 0 is a sufficiently small constant.
Lemma B.6. Suppose (s, e) ⊂ (1, T) is such that e − s ≤ C_M δ and that it contains the change points η_k, ..., η_{k+q}. Denote κ_max^{s,e} = max{κ_p : p = k, ..., k + q}. Then, for any p ∈ {k − 1, ..., k + q}, the bound (B.25) holds. The second inequality in (B.25) follows from the observation that the sum of squared errors is minimized by the sample mean.
For any s < t < e, let G_{s,e}^t(z) denote the corresponding population quantity. For any t ≥ η_k and z ∈ R, it holds that

Δ_{s,e}^t(z) − G_{s,e}^t(z) = (n_{(t+1):e} / (n_{s:e} n_{s:t}))^{1/2} n_{s:η_k} (F_{η_{k+1}}(z) − F_{η_k}(z)) ≤ n_{s:η_k}^{1/2} κ_k,

where the inequality uses |F_{η_{k+1}}(z) − F_{η_k}(z)| ≤ κ_k together with n_{(t+1):e} ≤ n_{s:e} and n_{s:η_k} ≤ n_{s:t}, the latter since t ≥ η_k.