Univariate Mean Change Point Detection: Penalization, CUSUM and Optimality

The problem of univariate mean change point detection and localization based on a sequence of $n$ independent observations with piecewise constant means has been intensively studied for more than half a century, and serves as a blueprint for change point problems in more complex settings. We provide a complete characterization of this classical problem in a general framework in which the upper bound $\sigma^2$ on the noise variance, the minimal spacing $\Delta$ between two consecutive change points and the minimal magnitude $\kappa$ of the changes are allowed to vary with $n$. We first show that consistent localization of the change points is impossible when the signal-to-noise ratio satisfies $\kappa \Delta^{1/2}\sigma^{-1}<\log^{1/2}(n)$. In contrast, when $\kappa \Delta^{1/2}\sigma^{-1}$ diverges with $n$ at the rate of at least $\log^{1/2}(n)$, we demonstrate that two computationally-efficient change point estimators, one based on the solution to an $\ell_0$-penalized least squares problem and the other on the popular wild binary segmentation algorithm, are both consistent and achieve a localization rate of the order $\sigma^2\kappa^{-2} \log(n)$. We further show that this rate is minimax optimal, up to a $\log(n)$ term.


Introduction
Research on change point detection in time series data has a relatively long history in modern statistics, covering both online (e.g. Wald, 1945; Page, 1954; James et al., 1987) and offline (e.g. Vostrikova, 1981; Yao and Au, 1989) search problems. It has recently been going through a renaissance, driven by the large and complex data sets routinely collected in the 'Big Data' era. Change point detection problems in high-dimensional means (e.g. Cho and Fryzlewicz, 2015; Cho, 2015; Aston and Kirch, 2014; Jirak, 2015; Wang and Samworth, 2018), in covariance structures (e.g. Aue et al., 2009; Avanesov and Buzun, 2016; Wang et al., 2017), in dynamic networks (e.g. Gibberd and Roy, 2017), and in sequentially-correlated time series (e.g. Lavielle, 1999; Davis et al., 2006; Aue et al., 2009) have been actively studied in recent years.
Arguably, the simplest and best-studied change point detection problem is on univariate mean from independent observations. It is fair to say that this is the most important ingredient in more complex problems. We formalize the model in Assumption 1.
Remark 1. In fact, we do not need the condition that the $Y_i$'s have continuous densities. We include it here for simplicity, so that we do not need to consider the event in which two sets of random variables have the same sample mean. This is the only place this condition is used.
The model is completely characterized by the sample size $n$, the upper bound $\sigma$ on the fluctuations in terms of the Orlicz-$\psi_2$-norm, the minimal spacing $\Delta$ between two consecutive change points and the lower bound $\kappa$ on the jump size, measured as the absolute value of the difference between two consecutive population means. All three parameters $\sigma$, $\Delta$ and $\kappa$ are allowed to change as $n$ grows. Since the number of change points $K$ is upper bounded by $n/\Delta$, we will not keep track of $K$, as its upper bound can be derived from the other parameters.
The goal of a change point detection problem is to obtain consistent change point estimators $\{\hat\eta_k\}_{k=1}^{\hat K}$, with $\hat\eta_1 < \hat\eta_2 < \dots < \hat\eta_{\hat K}$, such that
$$\hat K = K \quad \text{and} \quad \max_{k=1,\dots,\hat K} |\hat\eta_k - \eta_k| \le \epsilon, \tag{1}$$
where $\epsilon/n \to 0$, with probability tending to 1 as $n \to \infty$. In the rest of the paper, we will refer to the sequence $\epsilon/n$ as the localization rate. Notice that the inequality in (1) can be seen as providing an upper bound on the Hausdorff distance between $\{\eta_k\}_{k=1}^{K}$ and $\{\hat\eta_k\}_{k=1}^{\hat K}$, both viewed as subsets of $\{1, \dots, n\}$; see (4) below.
In order to quantify the difficulty of the problem, we rely on the quantity
$$\kappa\sqrt{\Delta}/\sigma, \tag{2}$$
which can be thought of as measuring the signal-to-noise ratio. As we will see, the intrinsic statistical hardness of the change point detection and localization problems is fully captured by this quantity. In particular, the difficulty of the problem increases as $\kappa$ and $\Delta$ decrease, and as $\sigma$ increases. Equation (2) is rooted in two-sample mean testing (with known variance), resembling the t-statistics used therein, and has counterparts in high-dimensional mean, covariance and network change point detection problems (e.g. Wang and Samworth, 2018; Wang et al., 2017). With the previously defined localization rate and signal-to-noise ratio, the optimality of the estimators has two aspects.
(i) Consistency. The first natural question one might ask is under what conditions localization is itself possible. We tackle this problem by identifying combinations of the model parameters, which we express using the signal-to-noise ratio (2), for which no estimator of the change points is guaranteed to be consistent, in a minimax sense.
(ii) Outside the region of impossibility identified in the previous step, the second natural question is to derive a lower bound on the localization rate that holds for any estimator. Once the information-theoretic lower bound is established, one may then proceed to demonstrate a computationally-efficient algorithm whose localization rate matches such lower bound. This algorithm is therefore minimax optimal.
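The two quantities that drive the discussion above, the signal-to-noise ratio $\kappa\sqrt{\Delta}/\sigma$ and the Hausdorff distance between true and estimated change points, are easy to compute. The following is a minimal sketch only: the function names are illustrative, the Hausdorff distance is taken in its usual two-sided form for non-empty sets (an assumption, since the paper's own definition is not reproduced in this section), and the comparison against $\sqrt{\log(n)}$ ignores constants.

```python
import math


def signal_to_noise(kappa, delta, sigma):
    """Signal-to-noise ratio kappa * sqrt(delta) / sigma."""
    return kappa * math.sqrt(delta) / sigma


def hausdorff(a, b):
    """Two-sided Hausdorff distance between two non-empty finite sets of integers."""
    d_ab = max(min(abs(x - y) for y in b) for x in a)
    d_ba = max(min(abs(x - y) for x in a) for y in b)
    return max(d_ab, d_ba)


def above_detection_threshold(kappa, delta, sigma, n):
    """Compare the signal-to-noise ratio with sqrt(log n), ignoring constants."""
    return signal_to_noise(kappa, delta, sigma) >= math.sqrt(math.log(n))


true_cps = [10, 50]
estimated_cps = [12, 47]
localization_error = hausdorff(true_cps, estimated_cps)  # largest mismatch between the two sets
```

Dividing `localization_error` by $n$ gives the quantity whose decay to zero defines consistency in (1).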
In this article we will be focusing on two types of change point estimators, one based on penalized least squares and the other on CUSUM statistics. Both types of estimator have been thoroughly studied.
• There exist several results and algorithms for change point detection using $\ell_0$-penalization, including Liebscher and Winkler (1999), Friedrich et al. (2008), Boysen et al. (2009) and Killick et al. (2012). It is worth comparing three papers providing theoretical results based on $\ell_0$-penalization methods. Lavielle and Moulines (2000) studied the $\ell_0$-penalization approach under general distributions, and showed that if the penalization parameter $\lambda$ is chosen properly, then one obtains asymptotic results similar to those available when the model assumes Gaussian noise. The most closely related result there is Theorem 9, which is asymptotic only. In this paper, we obtain the lower bounds based on Gaussian noise, but the upper bounds hold for sub-Gaussian noise and are non-asymptotic. Boysen et al. (2009) studied consistent estimation of a general class of functions based on the solution of the $\ell_0$ least squares problem given in Equation (7) below, which they referred to as the Potts functional. In particular, under the assumption that the mean function is piecewise-constant with a fixed number of change points, the authors show that a solution to (7) can consistently locate the change points if the minimal spacing satisfies $\Delta = cn$ for some $0 < c < 1$ and the change size $\kappa$ is a constant. We extend such results by allowing all the parameters in the model, namely $\kappa$, $\Delta$ and $\sigma$, to change with $n$ at a nearly minimax rate, and demonstrate the existence of a phase transition in the space of model parameters. Furthermore, our analysis is non-asymptotic. Fan and Guan (2017) studied $\ell_0$-denoising on a general class of graphs including chains, i.e. piecewise-constant time series signals, and provided a number of information-theoretic results. Our paper and theirs have different, complementary targets: we focus on change point localization, whereas they focus on prediction.
There are also a number of papers from the 1980s studying the univariate mean change point detection problem from the least squares perspective, for instance Yao and Davis (1986), Yao (1988) and Yao and Au (1989). The change point estimators are derived from least squares estimators, and the number of change points is chosen via the Schwarz information criterion. It can be shown (e.g. Tickle et al., 2018) that the Schwarz information criterion is asymptotically equivalent to $\ell_0$-penalization. Note that the results obtained there are asymptotic, while ours are non-asymptotic and allow all parameters to vary with the sample size $n$. Another related area is the reduced isotonic regression problem, which assumes that the monotonic signal is piecewise-constant and aims to recover the signal. Gao et al. (2017) have shown an iterated logarithmic lower bound when there are multiple change points. Despite the close connection, the focus and results thereof are different from ours.
It is worth mentioning that the $\ell_0$-penalization method is appealing from the computational point of view, at least in the univariate case. Friedrich et al. (2008) showed that (7) can be computed using dynamic programming at a computational cost of order $O(n^2)$. Killick et al. (2012) introduced the pruned exact linear time (PELT) method, which has a worst-case computational cost of order $O(n^2)$, while in situations where the number of change points increases linearly with $n$, the expected time of PELT is of order $O(n)$. There are also other algorithms, including Rigaill (2010) and Maidstone et al. (2017), which have been shown to have an expected cost smaller than that of PELT, but whose worst-case cost is also of order $O(n^2)$.
• CUSUM (see Definition 1) is short for cumulative sums. It was proposed in Page (1954) for an online change point problem, and has been a cornerstone of numerous change point detection methods. We will show in Section 4 that, in the univariate situation, it is identical to the likelihood ratio test statistic for testing whether or not there exists a change point. Binary segmentation (BS) (e.g. Scott and Knott, 1974; Vostrikova, 1981) based on CUSUM statistics has been shown to be consistent, yet not optimal, in locating the change points. In the last few years, considerable effort has been devoted to developing variants of BS in order to handle multiple change point scenarios; see e.g. Fryzlewicz (2014), Baranowski et al. (2016) and Eichinger and Kirch (2018).
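To illustrate the $O(n^2)$ dynamic program mentioned above, here is a minimal sketch of an exact solver for penalized least squares segmentation. It is a plain optimal-partitioning recursion written for this illustration, not Algorithm 1 of Friedrich et al. (2008) verbatim and not PELT; the function name and the convention that a change point is the last index of a segment are assumptions.

```python
def l0_change_points(y, lam):
    """Exact minimizer of (residual sum of squares) + lam * (number of jumps).

    Returns the estimated change points as 1-based right endpoints of all
    segments except the last one. Runs in O(n^2) time using prefix sums.
    """
    n = len(y)
    s1 = [0.0] * (n + 1)  # prefix sums of y
    s2 = [0.0] * (n + 1)  # prefix sums of y^2
    for i, v in enumerate(y, start=1):
        s1[i] = s1[i - 1] + v
        s2[i] = s2[i - 1] + v * v

    def cost(s, e):
        # Residual sum of squares of observations s+1, ..., e around their mean.
        return s2[e] - s2[s] - (s1[e] - s1[s]) ** 2 / (e - s)

    best = [0.0] * (n + 1)
    best[0] = -lam  # the first segment carries no penalty
    prev = [0] * (n + 1)
    for e in range(1, n + 1):
        best[e], prev[e] = min((best[s] + cost(s, e) + lam, s) for s in range(e))

    # Trace back the left endpoints of the optimal segments.
    cps, e = [], n
    while prev[e] > 0:
        e = prev[e]
        cps.append(e)
    return sorted(cps)


signal = [0.0] * 20 + [5.0] * 20
estimated = l0_change_points(signal, lam=1.0)
```

Larger values of `lam` yield fewer change points; pruning rules such as those used by PELT reduce the inner minimization but leave the recursion itself unchanged.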
The closest-related paper is Fryzlewicz (2014) which proposed wild binary segmentation (WBS), which is a variant of BS and which has optimal localization rate. The results in Section 4 and proofs in Appendix C are based upon Fryzlewicz (2014), with more comprehensive and systematic analysis. As a result, we provide optimal results with all parameters being allowed to change with n and weaker conditions.
The univariate mean change point detection problem has been studied intensively, and we are aware that the results in this paper have appeared in different forms in the existing literature. However, we still see the need for a paper focusing solely on this simple scenario and providing a systematic analysis of its various theoretical aspects, which can serve as a benchmark for more modern challenges.
We summarize our contributions as follows.
(i) We describe a phase transition in the space of model parameters that separates parameter combinations for which consistent change point estimation is impossible (in a minimax sense) from those for which there exist algorithms that are provably consistent. Furthermore, we provide a global information-theoretic lower bound on the localization rate that holds over most of the region of the parameter space for which consistent estimation is possible. It is worth pointing out that, although similar results have been stated elsewhere (e.g. Proposition 3 of Wang and Samworth, 2018), this is the first time in the change point detection literature that the logarithmic term in the detection boundary is formally stated and the gap between the lower and upper bounds is closed.
(ii) We demonstrate that the ℓ 0 -penalization method produces a minimax rate-optimal estimator of the change points. In addition, we demonstrate that the localization error rate of ℓ 0 -penalization method is locally adaptive to the jump size at each change point, a desirable feature both in theory and in practice (see Remark 5).
(iii) Among CUSUM-based methods, we show that the WBS algorithm put forward by Fryzlewicz (2014) is also minimax rate-optimal. While our analysis of WBS is heavily inspired by the proof techniques in Fryzlewicz (2014), we are able to provide more refined results with optimal tracking of the underlying parameters, thus obtaining optimal rates. We also require weaker conditions than in Fryzlewicz (2014).
The paper is organized as follows. The information-theoretic results are exhibited in Section 2. Matching upper bounds provided by ℓ 0 -penalization method and WBS can be found in Sections 3 and 4, respectively. Most of the proofs and technical details are in the Appendices.

Phase Transition and Optimality Minimax Rates
Recall the two aspects of optimality described in Section 1: identifying the parameter combinations for which consistent localization is possible, and determining a minimax lower bound on the localization rate. In Lemma 1 we describe the low signal-to-noise ratio regime in which the locations of the change points cannot be estimated. In detail, we show that if
$$\kappa\sqrt{\Delta}/\sigma < \sqrt{\log(n)},$$
then no consistent estimator of the locations of the change points exists. On the other hand, when $\kappa\sqrt{\Delta}/\sigma \ge \sqrt{\log(n)}$, Lemma 2 demonstrates a minimax lower bound on the localization rate of the form $\sigma^2/(\kappa^2 n)$, for all $n$ large enough. The analysis of the localization procedures described in Sections 3 and 4 will confirm that these results are in fact quite sharp. Specifically, we will verify both the existence of a phase transition for the localization task as the signal-to-noise ratio crosses the threshold $\sqrt{\log(n)}$, as prescribed by Lemma 1, and the near minimax optimality of the lower bound of Lemma 2.
Then, there exists an $n(c)$, which depends on $c$, such that, for all $n$ larger than $n(c)$, where the infimum is over all estimators $\hat\eta = \{\hat\eta_k\}_{k=1}^{\hat K}$ of the change point locations and $\eta(P)$ is the set of locations of the change points of $P \in \mathcal{P}^n_c$.
In the above result, it is possible to let $c \to 0$ as $n \to \infty$ (and in fact, the value of $n(c)$ is increasing in $c$). Thus, we conclude that, if $\kappa\sqrt{\Delta}/\sigma < \lfloor\sqrt{\log(n)}\rfloor < \lfloor n/4 \rfloor$, the localization rate is bounded away from 0, i.e. the estimator is not consistent.
In our next result we complement Lemma 1 by showing that if instead $\kappa\sqrt{\Delta}/\sigma \ge \zeta_n$ for any sequence $\{\zeta_n\}_{n=1,2,\dots}$ of positive numbers diverging to infinity at an arbitrary pace as $n \to \infty$, then the corresponding lower bound is at least of order $\sigma^2/\kappa^2$, for all $n$ large enough. Of course, in light of Lemma 1, this lower bound is interesting only when $\zeta_n$ is larger than $\sqrt{\log(n)}$. In the next sections, we will further show that, provided that $\zeta_n$ is of the order $\log^{(1+\xi)/2}(n)$ or larger, for any $\xi > 0$, then, up to a logarithmic factor in $n$, $\sigma^2/\kappa^2$ yields the asymptotic minimax lower bound on the localization rate.
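Lemma 2. Let $\{Y_i\}_{i=1}^n$ be a time series satisfying Assumption 1 with one and only one change point. Let $P^n_{\kappa,\Delta,\sigma}$ denote the corresponding joint distribution. Consider the class of distributions $\mathcal{Q}^n = \{P^n_{\kappa,\Delta,\sigma} : \Delta < n/2,\ \kappa\sqrt{\Delta}/\sigma \ge \zeta_n\}$, for any sequence $\{\zeta_n\}$ such that $\lim_{n\to\infty} \zeta_n = \infty$. Then, for all $n$ large enough, it holds that where the infimum is over all estimators $\hat\eta$ of the change point location and $\eta(P)$ denotes the change point location of $P \in \mathcal{Q}^n$.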
The bounds in Lemma 1 and Lemma 2 are slightly sharper than the minimax lower bounds obtained by taking $p = 1$ in Proposition 3 in the supplementary material of Wang and Samworth (2018). Indeed, our analysis allows for a more refined characterization of the phase transition for the localization task by exhibiting the threshold value of $\sqrt{\log(n)}$ describing the transition from the low to the high signal-to-noise ratio regime.

ℓ 0 Penalization
In this section we describe an estimator of the change point locations based on the ℓ 0 penalty and demonstrate that it is minimax rate optimal.
We first formalize the ℓ 0 -penalized optimization problem, and define the change point estimators generated therefrom. For the sake of analysis, we will provide an alternative objective function, which, we will show, generates identical change point estimators.
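Alternatively, let $\mathcal{P}$ be any interval partition of $\{1, \dots, n\}$, i.e. a collection of $K_{\mathcal{P}}$ disjoint subsets of $\{1, \dots, n\}$ of the form
$$\mathcal{P} = \big\{\{1, \dots, i_1\}, \{i_1 + 1, \dots, i_2\}, \dots, \{i_{K_{\mathcal{P}}-1} + 1, \dots, i_{K_{\mathcal{P}}}\}\big\},$$
for some integers $0 < i_1 < \dots < i_{K_{\mathcal{P}}} = n$, where $K_{\mathcal{P}} \ge 1$. In particular, if $K_{\mathcal{P}} = 1$, then $\mathcal{P} = \{\{1, \dots, n\}\}$. For a fixed positive tuning parameter $\lambda > 0$ and data $\{Y_i\}_{i=1}^n$, let
$$\hat{\mathcal{P}} \in \arg\min_{\mathcal{P}} G\big(\mathcal{P}, \{Y_i\}_{i=1}^n, \lambda\big), \tag{8}$$
where the minimum ranges over all interval partitions of $\{1, \dots, n\}$ and, for any such partition $\mathcal{P}$,
$$G\big(\mathcal{P}, \{Y_i\}_{i=1}^n, \lambda\big) = \sum_{I \in \mathcal{P}} \sum_{i \in I} (Y_i - \bar{Y}_I)^2 + \lambda (K_{\mathcal{P}} - 1),$$
with $\bar{Y}_I = |I|^{-1} \sum_{i \in I} Y_i$. The optimization problem (8) is known as the minimal partition problem and can be solved using dynamic programming in polynomial time (e.g. Algorithm 1 in Friedrich et al., 2008). The change point estimator resulting from the solution to (8) is simply obtained by taking all the right endpoints of the intervals $I \in \hat{\mathcal{P}}$ except $n$. In general, without assuming any conditions on the inputs, there is no guarantee that the minimizers are unique.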
We now make the simple observation that the optimization problems (7) and (8) with the same inputs yield the same change point estimators. To see this equivalence we introduce some notation that we will use throughout. For any vector $v \in \mathbb{R}^n$ and any $i \in \{1, \dots, n-1\}$, if $v_i \ne v_{i+1}$, one calls $i$ an induced change point of $v$, and the collection of all the induced change points of $v$ is denoted by $J(v)$. The set $J(v)$ yields an interval partition, i.e., if $J(v) = \{i_1, \dots, i_N\}$, then one can define the interval partition induced by $v$ as
$$\mathcal{P}(v) = \big\{\{1, \dots, i_1\}, \{i_1 + 1, \dots, i_2\}, \dots, \{i_N + 1, \dots, n\}\big\}.$$
Conversely, for any interval partition $\mathcal{P}$ and a sequence $\{Y_i\}_{i=1}^n$, define their induced piecewise-constant vector $v$ as $v_i = \bar{Y}_I$, for any $i \in I$ and $I \in \mathcal{P}$. Since, for $I \subset \{1, \dots, n\}$,
$$\bar{Y}_I = \arg\min_{a \in \mathbb{R}} \sum_{i \in I} (Y_i - a)^2,$$
it follows that, with the same inputs $\{Y_i\}_{i=1}^n$ and $\lambda > 0$, the solutions to (7) and (8) induce each other in the sense specified above.
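The correspondence between vectors and interval partitions can be made concrete with a few lines of code. This is a minimal sketch with illustrative function names, using 1-based indices as in the text; it relies on the usual reading that $i$ is an induced change point of $v$ when $v_i$ and $v_{i+1}$ differ.

```python
def induced_change_points(v):
    """J(v): all 1-based indices i < n at which v_i differs from v_{i+1}."""
    return [i for i in range(1, len(v)) if v[i - 1] != v[i]]


def induced_partition(change_points, n):
    """Interval partition of {1, ..., n} whose blocks end at the change points."""
    bounds = [0] + sorted(change_points) + [n]
    return [list(range(a + 1, b + 1)) for a, b in zip(bounds, bounds[1:])]


def induced_vector(partition, y):
    """Piecewise-constant vector equal to the sample mean of y on each block."""
    v = [0.0] * len(y)
    for block in partition:
        mean = sum(y[i - 1] for i in block) / len(block)
        for i in block:
            v[i - 1] = mean
    return v


v = [1, 1, 3, 3, 3, 2]
cps = induced_change_points(v)              # indices where v jumps
partition = induced_partition(cps, len(v))  # blocks of constancy of v
fitted = induced_vector(partition, [1, 3, 2, 4, 6, 5])
```

Applying `induced_change_points` to `fitted` returns the same change points as long as adjacent block means differ, which is the sense in which the two representations induce each other.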
Remark 2 (Tuning parameter). If we view any vector $u \in \mathbb{R}^n$ as a step function with at most $n - 1$ jumps, then the tuning parameter $\lambda$ penalizes the number of jumps in $u$. For an integer interval $I \subset \{1, \dots, n\}$, the tuning parameter $\lambda$ works in the following way. If $I$ is split into two integer sub-intervals $I_1$ and $I_2$, then it follows from Lemma 5 that the sum of squares decreases by
$$\frac{|I_1||I_2|}{|I|}\big(\bar{Y}_{I_1} - \bar{Y}_{I_2}\big)^2, \tag{10}$$
but, at the same time, the penalty term increases by $\lambda$. Therefore the trade-off guiding the choice between refining a candidate interval partition of $\{1, \dots, n\}$ by introducing one additional split and leaving it unchanged (so that this partition must then provide an optimal solution to (8)) is described by comparing (10) to $\lambda$. In Theorem 3 we will provide a theoretically optimal choice for $\lambda$.
Remark 3. In the rest of this paper, when there is no ambiguity, we allow the following abuse of notation. If $s < e$, $s, e \in \mathbb{Z}$, we sometimes refer to $\{s, s + 1, \dots, e\}$ and $\{s + 1, \dots, e\}$ as $[s, e]$ and $(s, e]$, respectively.
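The trade-off in Remark 2 can be checked numerically. The sketch below assumes the standard identity that splitting an interval $I$ into adjacent pieces $I_1$ and $I_2$ lowers the residual sum of squares by $|I_1||I_2|/|I| \cdot (\bar{Y}_{I_1} - \bar{Y}_{I_2})^2$; the function names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Residual sum of squares around the sample mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def split_gain(left, right):
    """Closed-form decrease in the sum of squares when splitting into left and right."""
    n1, n2 = len(left), len(right)
    return n1 * n2 / (n1 + n2) * (mean(left) - mean(right)) ** 2


def worth_splitting(left, right, lam):
    """A split pays off only if the gain exceeds the extra penalty lam."""
    return split_gain(left, right) > lam


left, right = [1.0, 2.0, 3.0], [7.0, 9.0]
direct = sse(left + right) - sse(left) - sse(right)  # decrease computed from scratch
closed_form = split_gain(left, right)                # same quantity via the identity
```

With these numbers both quantities equal 43.2 up to rounding, so the split is kept for any penalty below that value and rejected otherwise.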

Optimal change point localization
Recall that in Lemma 1 we have shown that if $\kappa\sqrt{\Delta}/\sigma < \sqrt{\log(n)}$, then no algorithm is guaranteed to produce consistent change point estimators. To demonstrate the performance of (7), we thus require the signal-to-noise ratio $\kappa\sqrt{\Delta}/\sigma$ to be larger than a diverging function of $n$, which we take to be of the form $\log^{(1+\xi)/2}(n)$. As remarked in the previous section, such a choice is consistent with Lemma 2, which in principle allows for a vanishing localization rate.
Assumption 2. There exists a sufficiently large absolute constant $C_{\mathrm{SNR}} > 0$ such that, for any $\xi > 0$,
$$\kappa\sqrt{\Delta}/\sigma \ge C_{\mathrm{SNR}} \log^{(1+\xi)/2}(n).$$
We remark that the parameter $\xi > 0$ is introduced to guarantee that, even if $\Delta \asymp n$, the resulting estimator is still consistent. We do not know whether the above assumption can be relaxed by allowing for a rate of increase for $\kappa\sqrt{\Delta}/\sigma$ slower than $\log^{(1+\xi)/2}(n)$. In our proofs, this seems to be the slowest rate that we can afford.
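Theorem 3. Let $\{Y_i\}_{i=1}^n$ satisfy Assumption 1 and, for any $\lambda > 0$, let $\hat{u}(\lambda)$ be a solution to (7), with $\{\hat\eta_k\}_{k=1}^{\hat K(\lambda)}$ the collection of change points induced by $\hat{u}(\lambda)$. Under Assumption 2, for any choice of $c > 3$, there exists a constant $C_\lambda > 0$, which depends on $c$, such that, for $\lambda = C_\lambda \sigma^2 \log(n)$, it holds that
$$\mathbb{P}\Big\{\hat K(\lambda) = K \ \text{and} \ |\hat\eta_k - \eta_k| \le \epsilon_k = C_\epsilon \sigma^2 \log(n)/\kappa_k^2, \ \forall k = 1, \dots, K\Big\} \ge 1 - e \cdot n^{3-c},$$
where $C_\epsilon > 0$ is a constant depending on $C_\lambda$ and $C_{\mathrm{SNR}}$.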
Recalling Lemma 2, we see that the error bound derived in Theorem 3 is minimax rate-optimal, aside from possibly a $\log(n)$ factor. Theorem 3 shows that, with a proper choice of the tuning parameter, (7) provides consistent change point estimators in the sense that, with probability tending to 1 as $n \to \infty$, it holds that $\hat K(\lambda) = K$ and for all $k \in \{1, \dots, K\}$,
Remark 4 (Uniqueness). We mentioned earlier that the minimizers of the optimization problems (7) and (8) need not be unique. However, if the independent errors have a continuous distribution, as assumed in Assumption 1, the minimizer is unique almost surely, for each $n$ and each $\lambda$; if not, then any two solutions, say $\mathcal{P}$ and $\mathcal{P}'$, are such that This is a quadratic polynomial in the $\{Y_i\}_{i=1}^n$. The set of its real solutions (if any exists) has $n$-dimensional Lebesgue measure 0. In general, if there are multiple solutions (so that Assumption 1 does not hold), Assumption 2 guarantees that, with large probability (more precisely, in the event $\mathcal{B}$ defined in the proof), the minimizer is unique almost surely.
Remark 5. It is natural to expect that change point localization should be locally adaptive, in the sense that, if the jump size $\kappa_k$ gets larger, then it is easier to estimate the location of the change point $\eta_k$. In fact, the error rate $\epsilon_k$ derived in Theorem 3 has this feature.
Proof. Define the event where $C_\lambda > 0$ is a large enough constant depending only on $c$, and $a$, $b$, $c$ are integers. In the remainder of the proof we work on the event $\mathcal{B}$. By Lemma 6 in Appendix B, this occurs with probability at least $1 - e \cdot n^{3-c}$.
For simplicity we drop the dependence on $\lambda$ in our notation, as it will be implicitly understood that $\lambda = C_\lambda \sigma^2 \log(n)$. Let $\hat{\mathcal{P}}$ be the interval partition induced by $\hat{u}$ (see (11)), and let $\{s + 1, \dots, e\}$ be any member of $\hat{\mathcal{P}}$. The proof is completed by showing the following four steps.
Step 1 The interval (s, e] contains no more than two true change points. This is shown in Lemma 7.
Step 2 If (s, e] contains exactly two true change points, say η k , η k+1 , then . This is shown in Lemma 8.
Step 4 If (s, e] contains no true change point, then there exist two true change points η k and η k+1 satisfying . This is shown in Appendix B.4.

CUSUM
For the univariate mean change point detection problem, the $\ell_0$-penalization estimator is not the only one that achieves minimax optimality. Binary segmentation (BS) (e.g. Scott and Knott, 1974) based on CUSUM statistics is arguably the most popular change point detection method. It has been shown that BS is consistent yet not optimal (e.g. Venkatraman, 1992). Fryzlewicz (2014) proposed a variant of BS, namely wild binary segmentation (WBS), which is shown to lead to a better localization rate than the BS algorithm. In this section, we recall the WBS algorithm and give refined results on its performance, with a proof that tracks all the parameters and constants involved more carefully. As a result, we prove that WBS, just like the method studied in the previous section, also guarantees a localization error rate that is minimax rate-optimal. However, compared to the $\ell_0$-penalization methods, WBS is computationally more expensive and involves more tuning parameters.
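Definition 1 (CUSUM statistics). For a sequence $\{Y_i\}_{i=1}^n$, any pair of time points $(s, e) \subset \{0, \dots, n\}$ with $s < e - 1$, and any time point $t = s + 1, \dots, e - 1$, let the CUSUM statistic be
$$\tilde{Y}^{s,e}_t = \sqrt{\frac{e - t}{(e - s)(t - s)}} \sum_{i=s+1}^{t} Y_i - \sqrt{\frac{t - s}{(e - s)(e - t)}} \sum_{i=t+1}^{e} Y_i.$$
For a collection of independent Gaussian random variables $\{Y_i\}_{i=1}^n$ with $\mathbb{E}(Y_i) = f_i$ and the same variance, one can easily derive that $\max_{t=1,\dots,n-1} |\tilde{Y}^{0,n}_t|$ is the generalized likelihood ratio statistic for testing
$$H_0: f_1 = \dots = f_n \quad \text{versus} \quad H_1: f_1 = \dots = f_t \ne f_{t+1} = \dots = f_n \ \text{for some } t \in \{1, \dots, n-1\}.$$
In particular, the BS algorithm searches for the time point with the largest absolute CUSUM statistic, i.e. $\arg\max_{t=1,\dots,n-1} |\tilde{Y}^{0,n}_t|$.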
However, as noted in Fryzlewicz (2014), when there are potentially multiple change points, their combined effect might cancel out, and BS is guaranteed to be effective only when applied to intervals containing at most one change point. WBS improves on BS by performing multiple CUSUM tests over randomly chosen sub-intervals in such a manner that each change point will, with high probability, be the only change point deep inside some selected interval and can be identified using the BS algorithm within that interval. See Algorithm 1 for a formal description of WBS.
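The recursion described above can be sketched in a few lines. This is a simplified illustration, not a transcription of Algorithm 1: it assumes the standard CUSUM contrast (scaled difference between the sums to the left and right of $t$), uses a single threshold `tau`, omits any trimming of interval ends, and all function names are hypothetical.

```python
import math
import random


def prefix_sums(y):
    csum = [0.0]
    for v in y:
        csum.append(csum[-1] + v)
    return csum


def cusum(csum, s, e, t):
    """CUSUM statistic on observations s+1, ..., e split at t, with s < t < e."""
    left = csum[t] - csum[s]
    right = csum[e] - csum[t]
    return (math.sqrt((e - t) / ((e - s) * (t - s))) * left
            - math.sqrt((t - s) / ((e - s) * (e - t))) * right)


def random_intervals(n, m, seed=0):
    """Draw m random intervals (a, b) with 0 <= a < b <= n."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(n + 1), 2))) for _ in range(m)]


def wbs(csum, s, e, intervals, tau):
    """Wild binary segmentation on (s, e]; returns sorted estimated change points."""
    best_val, best_t = -1.0, None
    for a, b in intervals:
        sm, em = max(a, s), min(b, e)  # restrict the random interval to (s, e)
        if em - sm <= 1:
            continue
        for t in range(sm + 1, em):
            val = abs(cusum(csum, sm, em, t))
            if val > best_val:
                best_val, best_t = val, t
    if best_t is None or best_val <= tau:
        return []  # no CUSUM value exceeds the threshold: stop
    return (wbs(csum, s, best_t, intervals, tau) + [best_t]
            + wbs(csum, best_t, e, intervals, tau))


y = [0.0] * 30 + [4.0] * 30
csum = prefix_sums(y)
intervals = [(0, 60)] + random_intervals(60, 20)
estimated = wbs(csum, 0, 60, intervals, tau=3.0)
```

Including the full interval `(0, 60)` among the candidates is a convenience for this noiseless example; in practice the threshold and the number of random intervals are the tuning parameters that need care.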
Fryzlewicz (2014) originally put forward the WBS algorithm and provided an analysis of its performance under a set of slightly stronger conditions. Below we refine that analysis and formally prove that WBS is minimax rate-optimal in terms of the required signal-to-noise ratio and the localization rate.
Let $\{\hat\eta_k\}_{k=1}^{\hat K}$ be the corresponding output of the WBS algorithm. Then, where $c > 3$ is an absolute constant and $C_\epsilon > 0$ is a sufficiently large constant.
Remark 6. For simplicity, we require Assumption 1 in Theorem 4, but in fact we do not need the continuous density condition. In addition, we can set $\xi = 0$ in Assumption 2 if $\Delta = o(n)$.
Theorem 4 shows that, with a suitable choice of the tuning parameters, WBS is optimal in the sense that:
• under the signal-to-noise ratio regime detailed in Assumption 2, it yields consistent estimators of the change point locations, such that, with probability tending to 1, $\hat K = K$ and, for all $k = 1, \dots, K$, as $n \to \infty$; and
• it possesses a localization rate which is minimax rate-optimal, save for a $\log(n)$ factor, according to Lemma 2.
Remark 7. To guarantee that (14) tends to 1 as $n \to \infty$, the number of random intervals $M$ needs to satisfy
Remark 8 (Tracking constants). For readability, we refrain from expressing all constants explicitly, and only show their hierarchy. One would first choose $c > 3$ in (14) to make sure that the consistency result holds. This will determine $c_{\tau,1}$, which is the same as $C_\gamma$ used in the proof, and consequently $c_{\tau,2}$, which also depends on $C_{\mathrm{SNR}}$ and $C_R$. All these constants finally determine $C_\epsilon$.
Remark 9 (Tuning parameters). The tuning parameter δ is introduced to avoid false positives. Specifically, the range displayed in Equation (13) is used in Step 1 in the proof. Notice that, by Assumption 2 and with properly chosen constants, such range is not an empty set for τ . As shown in the proof, over an event of probability tending to 1, the lower bound of (13) serves as an upper bound of the maximum CUSUM statistics when there are no change points, and the upper bound serves as a lower bound of the maximum CUSUM statistics when there exists a change point.
Remark 10 (Comparisons with Theorem 3). In Theorem 3, the only tuning parameter is the penalization level $\lambda$, while in Theorem 4 we require knowledge of $\tau$, $\delta$ and the number of random intervals $M$. In practice, Fryzlewicz (2014) suggests an AIC-based method for picking these parameters.
Proof. Since $\epsilon$ is the desired order of the localization error rate, by induction, it suffices to consider any generic interval $(s, e) \subset (0, T)$ that satisfies where $q = -1$ indicates that there is no change point contained in $(s, e)$. Under Assumption 2, it holds that with sufficiently large $C_{\mathrm{SNR}}$. It therefore has to be the case that, for any change point $\eta_k \in (0, T)$, either $|\eta_k - s| \le \epsilon$ or $|\eta_k - s| \ge \Delta - \epsilon \ge 3\Delta/4$. This means that $\min\{|\eta_k - e|, |\eta_k - s|\} \le \epsilon$ indicates that $\eta_k$ was detected in a previous induction step, even if $\eta_k \in (s, e)$. We refer to $\eta_k \in (s, e)$ as an undetected change point if $\min\{\eta_k - s, e - \eta_k\} \ge 3\Delta/4$. In order to complete the induction step, it suffices to show that WBS (i) will not detect any new change point in $(s, e)$ if all the change points in that interval have been previously detected, and (ii) will find a point $b \in (s, e)$, in fact in $(s + \delta(e - s), e - \delta(e - s))$, such that $|\eta_k - b| \le \epsilon$ if there exists at least one undetected change point in $(s, e)$.
We will consider the events $\mathcal{A}_1(\gamma)$, $\mathcal{A}_2(\gamma)$ and $\mathcal{M}$ defined in (40), (41) and (42), respectively. Set $\gamma$ to be $C_\gamma \sigma \log(n)$, with a sufficiently large $C_\gamma$. The rest of the proof assumes the event which, in light of Lemma 13 from Appendix C, has probability tending to 1.
Step 1. In this step, we will show that WBS will consistently detect or reject the existence of undetected change points within (s, e).
where $e_m - s_m \le 2\Delta$ is used in the last inequality. Therefore Thus for any undetected change point $\eta_k \in (s, e)$, it holds that where the last inequality follows from the choice of $\gamma$, and $c_{\tau,2} > 0$ is achievable with a sufficiently large $C_{\mathrm{SNR}}$ in Assumption 2. Then, WBS correctly accepts the existence of undetected change points. Suppose there does not exist any undetected change point within $(s, e)$. Then, for any $(s'_m, e'_m) = (\alpha_m, \beta_m) \cap (s, e)$, one of the following situations must hold. Observe that if (a) holds, then we have , it must be the case that $(s_m, e_m)$ does not contain any change points. This reduces to case (a). Therefore, when (13) holds, WBS will always correctly reject the existence of undetected change points.
Step 2. Assume that there exists a change point $\eta_k \in (s, e)$ such that $\min\{\eta_k - s, e - \eta_k\} \ge 3\Delta/4$. Let $s_m$, $e_m$ and $m^*$ be defined as in Algorithm 1. To complete the proof it suffices to show that there exists a change point $\eta_k \in (s_{m^*}, e_{m^*})$ such that $\min\{\eta_k - s_{m^*}, e_{m^*} - \eta_k\} \ge \Delta/4$ and $|b_{m^*} - \eta_k| \le \epsilon$.
To that end, we are to ensure that the assumptions of Lemma 21 are verified. The proof of Lemma 21 relies on a number of results, the relationship of which is shown in Figure 2. Observe that (52) is straightforward from Assumption 2, (50) and (51) follow from the definition of A 1 and A 2 , and that (49) follows from (15).
Thus, all the conditions in Lemma 21 are met, and we therefore conclude that there exists a change point η k , satisfying and where the last inequality holds from the choice of γ and Assumption 2. The proof is complete by noticing the fact that Equation (16) and (s m * , e m * ) ⊂ (s, e) imply that min{e − η k , η k − s} > ∆/4 > ǫ.
As discussed in the argument before Step 1, this implies that η k must be an undetected change point.

Conclusions
In this paper we have provided a complete characterization of the classical problem of univariate mean change point localization for a sequence of independent sub-Gaussian random variables with piecewise-constant means. We have considered the most general setting in which all the parameters of the problems are allowed to change with the length n of the sequence. We have identified a critical function of the model parameters that is able to discriminate the portion of the parameter space in which consistent localization is impossible from the part in which it is feasible. We have further derived the minimax optimal localization rate for this problem and showed that two computationally efficient methods achieve such a rate.
We would like to point out that the ℓ 0 -penalization methods can also be used in handling change point detection for more complex data types, such as high-dimensional mean, covariance and networks. The developments rely on feasible algorithms for their corresponding problems, but we conjecture that ℓ 0 -penalization methods on complex data types would also enjoy the same optimality with fewer tuning parameters than those in CUSUM-based methods.
Finally, we conjecture that the upper bounds on the localization rate exhibited in both Sections 3 and 4 can be sharpened by replacing the log(n) term with a smaller quantity of order log log(n), thus further reducing the gap with the lower bound in Lemma 2.

A Proofs of the Results in Section 2
Proof of Lemma 1. Without loss of generality, suppose that n/4 is an integer. For l ∈ {1, . . . , n/4}, let u l ∈ R n be such that the ith coordinate of u l (i), i = 1, . . . , n, satisfies where 0 < c < 1. Let v l ∈ R n be such that v l (i) = u l (n − i + 1), i = 1, . . . , n. Let P l and Q l be the multivariate Gaussian distributions N ( u l , σ 2 I n ) and N ( v l , σ 2 I n ), respectively and set P = 1 n/4 n/4 l=1 P l and Q = 1 n/4 n/4 l=1 Q l .
Note that for each $l \in \{1, \ldots, n/4\}$, $\tilde{P}_l$ has two change points, at locations $l - 1$ and $l$, and therefore $\Delta = 1$. Furthermore, the jump size is $\kappa = \sqrt{c\sigma^2\log(n)}$ and the fluctuation is $\sigma^2$. As a result, $\kappa\sqrt{\Delta}/\sigma = \sqrt{c\log(n)}$, which implies that all $\tilde{P}_l \in \mathcal{P}^n_c$. The same arguments show that $\tilde{Q}_l \in \mathcal{P}^n_c$ for all $l$. For each $l$ and $l'$ in $\{1, \ldots, n/4\}$, we have, by construction, $H(\eta(\tilde{P}_l), \eta(\tilde{Q}_{l'})) \geq n/2$, where $\eta(\tilde{P}_l)$ and $\eta(\tilde{Q}_{l'})$ denote the sets of change point locations of $\tilde{P}_l$ and $\tilde{Q}_{l'}$, respectively. Then it follows from Le Cam's lemma (e.g. Yu, 1997) that where $d_{TV}(\cdot, \cdot)$ is the total variation distance between two probability measures and the infimum is over all estimators $\hat{\eta} = \{\hat{\eta}_k\}_{k=1}^{\hat{K}}$ of the change point locations. Above, $\eta(P)$ is the set of locations of all the change points of $P \in \mathcal{P}^n_c$. Let $u_l \in \mathbb{R}^{n/2}$ be the sub-vector of $\tilde{u}_l$ consisting of the first $n/2$ entries of $\tilde{u}_l$. Let $P_l$ and $P_0$ be the multivariate Gaussian distributions $N(u_l, \sigma^2 I_{n/2})$ and $N(0, \sigma^2 I_{n/2})$, respectively. Due to the symmetry between $\tilde{u}_l$ and $\tilde{v}_l$, it holds that $d_{TV}(\tilde{P}, \tilde{Q}) \leq 2 d_{TV}(P, P_0)$, where $P = \frac{1}{n/4}\sum_{l=1}^{n/4} P_l$. Since $d_{TV}(P, P_0) \leq \sqrt{\chi^2(P, P_0)}$, where $\chi^2(\cdot, \cdot)$ is the $\chi^2$-divergence between two probability measures (see, e.g., Equation 2.27 in Tsybakov, 2009), it suffices to provide an upper bound for $\chi^2(P, P_0)$. We have $\chi^2(P, P_0) = \frac{1}{(n/4)^2}\sum_{l,m=1}^{n/4} E_{P_0}\left[\frac{dP_l}{dP_0}\frac{dP_m}{dP_0}\right] - 1 = \frac{1}{(n/4)^2}\sum_{l,m=1}^{n/4} \exp\left(u_l^{\top} u_m/\sigma^2\right) - 1 = \frac{1}{n/4}\left(\exp(c\log(n)) - 1\right)$, where the third identity follows from the observation that for $l, m = 1, \ldots, n/4$, $u_l^{\top} u_m = \mathbb{1}\{l = m\}\, c\sigma^2\log(n)$.
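As a numerical companion to the $\chi^2$ argument, the sketch below evaluates the divergence between a uniform mixture of $N(u_l, \sigma^2 I)$, $l = 1, \ldots, n/4$, and $N(0, \sigma^2 I)$. It relies on the standard identity $1 + \chi^2 = N^{-2}\sum_{l,m}\exp(u_l^{\top}u_m/\sigma^2)$ for Gaussian location mixtures with $N$ components, together with the relation $u_l^{\top}u_m = \mathbb{1}\{l = m\}c\sigma^2\log(n)$ from the proof. The function names and the closed form $(n^c - 1)/(n/4)$ are illustrative assumptions, not the paper's notation.

```python
import math


def chi2_closed_form(n: int, c: float) -> float:
    """Closed form (n^c - 1) / (n/4), assuming u_l'u_m = 1{l=m} * c * sigma^2 * log(n)."""
    n_components = n // 4
    return (math.exp(c * math.log(n)) - 1.0) / n_components


def chi2_brute_force(n: int, c: float) -> float:
    """Same quantity via the double sum (1/N^2) * sum_{l,m} exp(u_l'u_m / sigma^2) - 1."""
    n_components = n // 4
    total = 0.0
    for l in range(n_components):
        for m in range(n_components):
            # sigma^2 cancels: the inner product over sigma^2 is c*log(n) on the diagonal, 0 elsewhere.
            inner_over_sigma2 = c * math.log(n) if l == m else 0.0
            total += math.exp(inner_over_sigma2)
    return total / n_components**2 - 1.0


if __name__ == "__main__":
    for n in (400, 40_000):
        print(n, chi2_closed_form(n, 0.5), chi2_closed_form(n, 1.0))
```

For $c < 1$ the value shrinks as $n$ grows, whereas at $c = 1$ it stays close to 4, which matches the role of the constraint $0 < c < 1$ in the proof.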
where $\delta$ is a positive integer no larger than $n - 1 - \Delta$. Observe that $\eta(P_0) = \Delta$ and $\eta(P_1) = \Delta + \delta$. By Le Cam's lemma (e.g. Yu, 1997) and Lemma 2.6 in Tsybakov (2009), it holds that where $\mathrm{KL}(\cdot, \cdot)$ is the Kullback-Leibler divergence between two probability measures. Since both $P_0$ and $P_1$ are product measures, it holds that where $P_{0,i}$ and $P_{1,i}$ are the distributions of $Y_i$ and $Z_i$, respectively, and the last identity follows from the fact that, if $P$ and $Q$ are normal distributions with common variance $\sigma^2$ and means $\mu_1$ and $\mu_2$, respectively, then $\mathrm{KL}(P, Q) = (\mu_1 - \mu_2)^2/(2\sigma^2)$. Next, set $\delta = \min\{\lceil \sigma^2/\kappa^2 \rceil, n - 1 - \Delta\}$. By the assumption on $\zeta_n$, for all $n$ large enough we must have that $\delta = \lceil \sigma^2/\kappa^2 \rceil$. Indeed, if $n - 1 - \Delta \leq \lceil \sigma^2/\kappa^2 \rceil$ then, as $\Delta < n/2$, we must have that $\kappa^2/\sigma^2 \leq 1/(n - 2 - \Delta) < 1/(n/2 - 2)$, and, therefore, that where we may assume that $n > 4$. Since $\kappa^2\Delta/\sigma^2 \geq \zeta_n^2$ by assumption and $\zeta_n$ is diverging as $n \to \infty$, the above bound can only hold for finitely many $n$. The claimed bound now follows from (19), for all $n$ large enough.
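The Gaussian Kullback-Leibler formula used in this proof can be checked numerically. The sketch below compares the closed form $(\mu_1 - \mu_2)^2/(2\sigma^2)$ with a midpoint-rule integral, and then evaluates the divergence between two product measures under the assumption (ours, for illustration) that they differ by a mean shift $\kappa$ on exactly $\delta$ coordinates. All function names are hypothetical.

```python
import math


def kl_gauss(mu1: float, mu2: float, sigma: float) -> float:
    """Closed-form KL(N(mu1, sigma^2) || N(mu2, sigma^2))."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma**2)


def kl_numeric(mu1: float, mu2: float, sigma: float, steps: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of p * log(p / q)."""
    lo = min(mu1, mu2) - 12.0 * sigma
    hi = max(mu1, mu2) + 12.0 * sigma
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        x = lo + (j + 0.5) * h
        log_p = -((x - mu1) ** 2) / (2.0 * sigma**2)
        log_q = -((x - mu2) ** 2) / (2.0 * sigma**2)
        p = math.exp(log_p) / (sigma * math.sqrt(2.0 * math.pi))
        total += p * (log_p - log_q) * h  # normalising constants cancel in the log ratio
    return total


def kl_shifted_block(kappa: float, sigma: float) -> float:
    """KL between product measures that differ by a shift kappa on delta = ceil(sigma^2/kappa^2) coordinates."""
    delta = math.ceil(sigma**2 / kappa**2)
    return delta * kl_gauss(0.0, kappa, sigma)  # KL is additive over independent coordinates


if __name__ == "__main__":
    print(kl_gauss(0.0, 0.7, 1.3), kl_numeric(0.0, 0.7, 1.3))
    print(kl_shifted_block(0.5, 2.0))
```

With $\delta = \lceil \sigma^2/\kappa^2 \rceil$ the total divergence is at most $1/2 + \kappa^2/(2\sigma^2)$, so it stays bounded whenever $\kappa/\sigma$ is bounded; this is the scale at which the two hypotheses are hard to tell apart.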

B Proofs of the Results in Section 3
In this section, we provide technical details of the proof of Theorem 3. Recalling Assumption 1, for any change point $\eta_k$, observe that the interval $I = \{\eta_{k-1} + 1, \ldots, \eta_k\}$ contains one change point, but the signal $\{f_i\}_{i=1}^n$ is unchanged in $I$. For convenience, in this section, any interval $I$ is said to contain a true change point if there exists $k \in \{1, \ldots, K\}$ such that $\{\eta_k, \eta_k + 1\} \subset I$, where $|I| \geq 2$. This convention ensures that if $I$ contains a true change point, then there necessarily exist $i, j \in I$ satisfying $f_i \neq f_j$.
Lemma 5. Let I 1 and I 2 denote any two disjoint intervals of {1, . . . , n} and I = I 1 ∪ I 2 . For any and i∈I Proof. Without loss of generality, let I 1 = {1, . . . , n 1 } and I 2 = {n 1 + 1, . . . , n = n 1 + n 2 }. For simplicity, denote X = X I , X 1 = X I 1 and X 2 = X I 2 . The results (20) and (21) can be proved by similar arguments. We will only show (21) here. Observe that Lemma 6. Assume that the sequence {Y i } n i=1 ⊂ R satisfies Assumption 1. It holds that where c B is an absolute constant chosen to satisfy c B > 3 and C B > 0 only depends on c B .
Proof. It follows from Assumption 1 that for all $i \in \{1, \ldots, n\}$, $Y_i - f_i$ is a centred sub-Gaussian random variable with $\max_i \|Y_i - f_i\|_{\psi_2} \leq \sigma$. By Hoeffding's inequality (see e.g. Vershynin, 2010), it holds that for any non-empty set $I \subset \{1, \ldots, n\}$ and any $\varepsilon > 0$, and for any triple $i_1 < i_2 < i_3$ chosen in $\{1, \ldots, n\}$ where $c > 0$ is an absolute constant only depending on $\sigma$. The result follows from a union bound.
For simplicity, in the rest of the proof, we will let C B = 1 and set c B > 3. This will only affect the constant C λ , and in the statement of Theorem 3, we require C λ > 0 to be large enough.
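Lemma 5 concerns how sample means and sums of squares behave when two disjoint intervals $I_1$ and $I_2$ are merged into $I = I_1 \cup I_2$. The sketch below checks a classical identity of this kind: the residual sum of squares on $I$ equals the two residual sums of squares on $I_1$ and $I_2$ plus $\frac{|I_1||I_2|}{|I|}(\bar{X}_{I_1} - \bar{X}_{I_2})^2$. Whether this coincides exactly with (20) or (21) is an assumption; the helper names are hypothetical.

```python
def rss(x):
    """Residual sum of squares of x around its sample mean."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x)


def merge_gap(x1, x2):
    """Extra cost of fitting one common mean to x1 and x2 instead of two separate means."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = sum(x1) / n1, sum(x2) / n2
    return n1 * n2 / (n1 + n2) * (mean1 - mean2) ** 2


if __name__ == "__main__":
    x1, x2 = [1.0, 2.0, 4.0], [6.0, 9.0]
    print(rss(x1 + x2), rss(x1) + rss(x2) + merge_gap(x1, x2))
```

In the $\ell_0$-penalized problem this gap is what a split must save in order to pay for the penalty $\lambda$, which is why comparisons of this form recur throughout Steps 1 to 4.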
Since the change points of u are our change point estimators, with the error rate we refer to η k as an undetected change point, if η k ∈ (s, e] ∈ P( u) and and similarly e − ǫ k > ∆/3. The first and second inequalities of (22) follow from Assumptions 1 and 2, respectively. In the rest of this section, let P = P u .

B.1 Step 1: no more than two true change points
In order to show that no I ∈ P contains more than two true change points, it suffices to show that no I ∈ P contains undetected change points, due to the minimal spacing ∆ condition in Assumption 1.
satisfy Assumptions 1 and 2, and λ satisfy the condition Then, in the event B, it holds that no I ∈ P contains any undetected change point.
Proof. We first point out that due to Assumption 2, (23) is not an empty set. For the sake of contradiction, suppose that there exists I ∈ P containing an undetected change point η k , i.e., min{e − (η k + 1), η k − s} > ∆/3.
Let $\tilde{\mathcal{P}}$ be such that $\tilde{\mathcal{P}} = \hat{\mathcal{P}} \cup \{I_1, I_2, I_3, I_4\} \setminus \{I\}$, and $\tilde{u}$ be the piecewise-constant vector induced by $\tilde{\mathcal{P}}$. By the definition of $\hat{u}$, it holds that Since $\tilde{\mathcal{P}}$ is a refinement of $\hat{\mathcal{P}}$ and we have assumed in Assumption 1 that the distributions of the $Y_i$'s have continuous density functions, it follows that where the second inequality follows from (20); the fourth inequality follows from the definition of $\mathcal{B}$ and (23); and the last inequality is due to (23). Since (25) is a contradiction, we conclude that there is no interval containing an undetected change point.
It follows from Lemma 5 that Then, where the third inequality uses the same argument in the third inequality of (25), and the last inequality follows from the definition of B and Assumption 1.

B.3 Step 3: one and only one change point
Let I 1 = (s, e 1 ] ∈ P contain exactly one true change point, namely η k . With our convention set at the beginning of Appendix B, it holds that Denote δ = e 1 − η k and ǫ = η k − (s − 1). Without loss of generality, we assume that We are to show that there exists an absolute constant C > 8 such that and Equation (28) will be shown in Lemma 9. To show (29), we rely on the following arguments (see Figure 1 for an illustration): (i) Let I 2 = (e 1 , e 2 ] be the interval to the immediate right of I 1 in P. It must hold that This will be shown in Lemma 10. (ii) It follows from Appendix B.1 that there are at most two true change points in (e 1 , e 2 ]. If there are exactly two true change points, then due to Appendix B.2, (29) holds.
Proof. Observe that neither J 1 nor J 2 is empty by definition, and that {f i } n i=1 is constant within J 1 and J 2 , respectively. Let P be such that and let u be the piecewise-constant vector induced by P.
Recall that $E(Y_{J_1}) = f_{\eta_k}$ and $E(Y_{J_2}) = f_{\eta_k + 1}$. Without loss of generality, assume $f_{\eta_k + 1} = f_{\eta_k} + \kappa_k$. Thus, where the second identity follows from (20), the second inequality from the fact that $(x - y)^2 \geq y^2/2 - x^2$ with $x = \kappa_k$ and y = ( , and the last inequality from the definitions of the event $\mathcal{B}$ and the choice of $\lambda$. Therefore,

Lemma 10. Let $\{Y_i\}_{i=1}^n$ satisfy Assumptions 1 and 2 and set $\lambda = C_\lambda\sigma^2\log(n)$, with $C_\lambda > 85$. Assume that $I_1 = (s, e_1] \in \hat{\mathcal{P}}$ contains exactly one change point, namely $\eta_k$. Denote $\delta = e_1 - \eta_k$ and $\epsilon = \eta_k - (s - 1)$. Assume that $\epsilon \leq \delta$. In the event $\mathcal{B}$, if $I_2 = (e_1, e_2] \in \hat{\mathcal{P}}$, then it must hold that

Proof. Let $I_1' = (s, \eta_k]$ and $I_2' = (\eta_k, e_2]$. Then $I_1 \cup I_2 = I_1' \cup I_2'$. Let $\mathcal{P}_1$ and $\mathcal{P}_2$ be respectively. Let $u_1$ and $u_2$ be the piecewise-constant vectors induced by $\mathcal{P}_1$ and $\mathcal{P}_2$, respectively. We proceed by contradiction. We assume that $e_2 \leq \eta_{k+1}$. Without loss of generality, assume $f_{\eta_k + 1} = f_{\eta_k} + \kappa_k$. Due to Assumption 1, it holds that E( where the second inequality follows from the fact that $(x + y)^2 \leq 5x^2 + (5/4)y^2$ and the last inequality follows from the definition of the event $\mathcal{B}$.
We have and where the last inequality of (33) follows from the observation that Equations (32) and (33) imply that Plugging (34) into (32) with a choice of $\alpha = 1/4$ yields that $\lambda \leq 50\sigma^2\log(n)$, which is a contradiction.
This contradicts the assumed condition on λ.

B.4 Step 4: no changes
Suppose I = (s 1 , e] ∈ P contains no true change point. By symmetry, it suffices to show that there exists a large enough constant C > 0 such that Assume I 0 = (s 0 , s 1 ] ∈ P. We are to show the following. (i) It is impossible that there is no true change point in I 0 ∪ I. This will be shown in Lemma 12.
(ii) If there exist exactly two true change points in I 0 , then (39) follows from Lemma 8.
(iii) If there exists one and only one change point η k ∈ I 0 and s 1 − η k < η k − s 0 , then (39) follows from Lemma 9.
(iv) If there exists one and only one change point η k ∈ I 0 and s 1 − η k ≥ η k − s 0 , it follows from Lemma 10 that this is impossible in the event of B.
Lemma 12. Assume that the inputs $\{Y_i\}_{i=1}^n$ satisfy Assumptions 1 and 2 and that $\lambda = C_\lambda\sigma^2\log(n)$ with a sufficiently large $C_\lambda > 0$. Assume that $I = (s_1, e] \in \hat{\mathcal{P}}$ contains no change point. Assume that $I_0 = (s_0, s_1] \in \hat{\mathcal{P}}$. Then in the event $\mathcal{B}$, there must exist a change point in $I_0$.
Proof. Let J = I 0 ∪ I, P be the interval partition such that and u be the piecewise-constant vector induced by P.
We proceed by contradiction and assume that J contains no change points. Denote µ = E(Y I 0 ) = E(Y I ). Then where the last inequality follows from the definition of the event B, and results in a contradiction with the condition on λ.
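The arguments in this appendix compare the penalized cost of the estimated partition with that of modified partitions. As an illustration of how a minimizer of such a cost can be computed, the sketch below minimizes the residual sum of squares plus $\lambda$ times the number of change points by dynamic programming over interval partitions. This is a generic optimal-partitioning routine written under our own conventions (function name, 0-based lists, a returned change point `a` meaning a jump between observations `a` and `a + 1`); it is not claimed to be the authors' implementation, and the value of `lam` in the example is arbitrary rather than the theoretical choice $C_\lambda\sigma^2\log(n)$.

```python
def l0_change_points(y, lam):
    """Minimise sum of squared residuals + lam * (number of change points) in O(n^2) time."""
    n = len(y)
    # Prefix sums give the residual sum of squares of any segment in O(1).
    s1, s2 = [0.0] * (n + 1), [0.0] * (n + 1)
    for i, v in enumerate(y):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(a, b):
        """Residual sum of squares of y[a:b] around its own mean."""
        return s2[b] - s2[a] - (s1[b] - s1[a]) ** 2 / (b - a)

    best = [0.0] * (n + 1)
    best[0] = -lam  # the first segment does not count as a change point
    last = [0] * (n + 1)
    for b in range(1, n + 1):
        best[b], last[b] = min((best[a] + cost(a, b) + lam, a) for a in range(b))

    # Backtrack through the stored segment starts.
    change_points, b = [], n
    while b > 0:
        a = last[b]
        if a > 0:
            change_points.append(a)
        b = a
    return sorted(change_points)


if __name__ == "__main__":
    y = [0.1 * (-1) ** i for i in range(20)] + [5.0 + 0.1 * (-1) ** i for i in range(20)]
    print(l0_change_points(y, lam=10.0))
```

A larger `lam` makes every split more expensive, so the number of estimated change points is non-increasing in `lam`; with a penalty exceeding the total gain from splitting, the routine returns an empty list.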
C Proofs of the Results in Section 4
Lemma 13. For $\{Y_i\}_{i=1}^n$ satisfying Assumption 1, it holds that

Proof. For any suitable triple $(s, t, e)$, both $\tilde{Y}^{s,e}_t - \tilde{f}^{s,e}_t$ and $(e - s)^{-1/2}\sum_{i=s+1}^{e}(Y_i - f_i)$ can be written in the form $\sum_{i=s+1}^{e} w_i X_i$, where the $X_i$'s are centred sub-Gaussian random variables and the $w_i$'s satisfy $\sum_{i=s+1}^{e} w_i^2 = 1$. It follows from Hoeffding's inequality that there exists an absolute constant $c > 0$ only depending on $\sigma$ such that $P(\mathcal{A}_1^c(\gamma)) \leq e \cdot n^3\exp(-c\gamma^2/\sigma^2)$ and $P(\mathcal{A}_2^c(\gamma)) \leq e \cdot n^2\exp(-c\gamma^2/\sigma^2)$.
Since the number of change points is bounded by n/∆, it holds that

C.2 Technical details for Step 1
Lemma 14. Under Assumption 1, let 0 ≤ s < η k < e ≤ n be any interval satisfying

C.3 Technical details for Step 2
In this section, eight results will be provided. Before going into the details, we present in Figure 2 the road map for completing the proof of Theorem 4.
Lemma 15. Suppose (s, e) ⊂ (0, n) is a generic interval satisfying (ii) If F s,e t > 0 for some t ∈ (s, e), then F s,e t is either monotonic or decreases and then increases within each of the intervals (s, η k ), . . . , (η k+q , e).
The proof of Lemma 15 can be found in Lemmas 2.2 and 2.3 of Venkatraman (1992). We remark that if F s,e t ≤ 0 for all t ∈ (s, e), then it suffices to consider the time series {−f i } n i=1 , and a result similar to the second part of Lemma 15 still holds.
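To make the shape statements about the CUSUM curve concrete, the following sketch computes the usual CUSUM transform $\sqrt{\frac{e-t}{(e-s)(t-s)}}\sum_{i=s+1}^{t}x_i - \sqrt{\frac{t-s}{(e-s)(e-t)}}\sum_{i=t+1}^{e}x_i$, which we assume matches the definition used in Section 4. It checks two facts on a toy noiseless signal: the weights have unit Euclidean norm, as used in the proof of Lemma 13, and with a single change point the absolute CUSUM is maximized at that change point. The function names and the toy signal are hypothetical.

```python
import math


def cusum_weights(s, t, e):
    """Weights w_{s+1}, ..., w_e of the CUSUM transform at split t, with 0 <= s < t < e."""
    a = math.sqrt((e - t) / ((e - s) * (t - s)))
    b = math.sqrt((t - s) / ((e - s) * (e - t)))
    return [a] * (t - s) + [-b] * (e - t)


def cusum(x, s, t, e):
    """CUSUM statistic of x on (s, e] at t; x is 0-based, so x[s:e] holds x_{s+1}, ..., x_e."""
    return sum(w * v for w, v in zip(cusum_weights(s, t, e), x[s:e]))


def argmax_abs_cusum(x, s, e):
    """Split point in (s, e) with the largest absolute CUSUM value."""
    return max(range(s + 1, e), key=lambda t: abs(cusum(x, s, t, e)))


if __name__ == "__main__":
    f = [1.0] * 30 + [3.0] * 50  # noiseless signal with one change point at 30
    print(argmax_abs_cusum(f, 0, 80), cusum(f, 0, 30, 80))
```

In this example the signal jumps upward, so the CUSUM values are negative on the whole interval; this is the situation in which the remark above suggests working with the negated series.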
Our next lemma is an adaptation of a result first obtained by Venkatraman (1992), which quantifies how fast the CUSUM statistic decays around a good change point.
Case 1. Let E l be defined as in the case 1 in Venkatraman (1992) Lemma 2.6. There exists a c ′ > 0 such that, for every d ∈ [η k , η k + c 1 ∆/16], f s,e η k − f s,e d (which in the notation of Venkatraman (1992) is the term E l ) can be written as .
Using the inequality (e − s) ≥ 2c 1 ∆, the previous expression is lower bounded by Case 2. Let h = c 1 ∆/8 and l = d − η k ≤ h/2. Then, following closely the initial calculations for case 2 of Lemma 2.6 of Venkatraman (1992), we obtain that , Since h = c 1 ∆/8 and l ≤ h/2, it holds that Observe that and Thus where (46) and (47) are used in the second inequality and the fact that l ≤ h/2 ≤ c 1 ∆/16 ≤ (η k − s)/16 is used in the last inequality. For E 3l , observe that This combines with (43) and that l/2 ≤ h = c 1 ∆/8, implying that Therefore, with a sufficiently small constant c ′′ > 0, it holds that where the first inequality follows from (46) and (46), and the last inequality follows from (44) and (45). Thus, Lemma 17. Suppose [s, e] ⊂ [1, n] such that e − s ≤ C R ∆, and that Denote κ s,e max = max{η p − η p−1 : k ≤ p ≤ k + q}. Then for any k − 1 ≤ p ≤ k + q, it holds that Proof. Since e − s ≤ C R ∆, the interval [s, e] contains at most C R + 1 change points. Observe that where |p 1 − p 2 | ≤ C R + 1 for any η p 1 , η p 2 ∈ [s, e] is used in the last inequality.
Proof. Let the sequence {g t } e t=s+1 be such that For any t ≥ η r , it holds that f s,e η k − g s,e η k = (e − s) − t (e − s)(t − s) Thus, where the first inequality follows from the observation that the first change point of g t in (s, e) is at η k+1 .
For a pair $(s, e)$ of positive integers with $s < e$, let $\mathcal{W}^{s,e}_d$ be the two-dimensional linear subspace of $\mathbb{R}^{e-s}$ spanned by the vectors For clarity, in the lemma below, we will use $\langle \cdot, \cdot \rangle$ to denote the inner product of two vectors in the Euclidean space.

Lemma 21. Under Assumption 1, let $(s_0, e_0)$ be an interval with $e_0 - s_0 \leq C_R\Delta$ that contains at least one change point $\eta_k$ such that $\eta_{k-1} \leq s_0 \leq \eta_k \leq \ldots \leq \eta_{k+q} \leq e_0 \leq \eta_{k+q+1}$, $q \geq 0$.
For the sake of contradiction, throughout the rest of this argument suppose that, for some sufficiently large constant $C_3 > 0$ to be specified, (This will of course imply that $\eta_k + \max\{C_3\gamma^2(\kappa^{s,e}_{\max})^{-2}, \delta\} < b$). We will show that this leads to the bound $\|Y^{s,e} - \mathcal{P}^{s,e}_b(Y^{s,e})\|^2 > \|Y^{s,e} - \mathcal{P}^{s,e}_{\eta_k}(f^{s,e})\|^2$, which is a contradiction.
To derive (55) from (54), we note that $\min\{e - \eta_k, \eta_k - s\} \geq \min\{1, c_1^2\}\Delta/16$ and that $|b - \eta_k| \leq \gamma\sqrt{\Delta}(\kappa^{s,e}_{\max})^{-1}$ implies that where the last inequality follows from (52) and holds for an appropriately small $c_2 > 0$. Equation (55) is in turn implied by $2\langle \varepsilon^{s,e}, \mathcal{P}_b(Y^{s,e}) - \mathcal{P}_{\eta_k}(f^{s,e}) \rangle < \|f^{s,e} - \mathcal{P}_b(f^{s,e})\|^2 - \|f^{s,e} - \mathcal{P}_{\eta_k}(f^{s,e})\|^2$ (57), where $\varepsilon^{s,e} = Y^{s,e} - f^{s,e}$. By (48), the right hand side of (57) satisfies, with sufficiently small absolute constants $c, c' > 0$, the relationship $\|f^{s,e} - \mathcal{P}_b(f^{s,e})\|^2 - \|f^{s,e} - \mathcal{P}_{\eta_k}(f^{s,e})\|^2 = \langle f^{s,e}, \psi_{\eta_k} \rangle^2 - \langle f^{s,e}, \psi_b \rangle^2 = (\tilde{f}^{s,e}_{\eta_k})^2 - (\tilde{f}^{s,e}_b)^2 \geq (\tilde{f}^{s,e}_{\eta_k} - \tilde{f}^{s,e}_b)|\tilde{f}^{s,e}_{\eta_k}| \geq c|d - \eta_k|(\tilde{f}^{s,e}_{\eta_k})^2\Delta^{-1} \geq c'|d - \eta_k|(\kappa^{s,e}_{\max})^2$, where Lemma 16 and (53) are used in the second and third inequalities. The left hand side of (57) can in turn be rewritten as $2\langle \varepsilon^{s,e}, \mathcal{P}_b(Y^{s,e}) - \mathcal{P}_{\eta_k}(f^{s,e}) \rangle = 2\langle \varepsilon^{s,e}, \mathcal{P}_b(Y^{s,e}) - \mathcal{P}_b(f^{s,e}) \rangle + 2\langle \varepsilon^{s,e}, \mathcal{P}_b(f^{s,e}) - \mathcal{P}_{\eta_k}(f^{s,e}) \rangle$ (58). The second term on the right hand side of the previous display can be decomposed as In order to bound the terms I, II and III, observe that, since $e - s \leq e_0 - s_0 \leq C_R\Delta$, the interval $[s, e]$ must contain at most $C_R + 1$ change points.