Localising change points in piecewise polynomials of general degrees

In this paper we are concerned with a sequence of univariate random variables with piecewise polynomial means and independent sub-Gaussian noise. The underlying polynomials are allowed to be of arbitrary but fixed degrees. All the other model parameters are allowed to vary depending on the sample size. We propose a two-step estimation procedure based on $\ell_0$-penalisation and provide upper bounds on the localisation error. We complement these results by deriving a global information-theoretic lower bound, which shows that our two-step estimators are nearly minimax rate-optimal. We also show that our estimator enjoys nearly optimally adaptive performance by attaining individual localisation errors depending on the level of smoothness at individual change points of the underlying signal. In addition, under a special smoothness constraint, we provide a minimax lower bound on the localisation errors. This lower bound is independent of the polynomial orders and is sharper than the global minimax lower bound.


Introduction
We are concerned with data y = (y_1, . . . , y_n) ∈ R^n generated from the model

y_i = f(i/n) + ε_i, for each i ∈ {1, . . . , n}, (1)

where f : [0, 1] → R is an unknown piecewise-polynomial function and the ε_i's are independent mean-zero sub-Gaussian random variables. To be specific, associated with f(·), there is a sequence of strictly increasing integers {η_k}_{k=0}^{K+1}, with η_0 = 1 and η_{K+1} = n + 1, such that f(·) restricted to each interval [η_k/n, η_{k+1}/n), k = 0, . . . , K, is a polynomial of degree at most r ∈ N. The maximum degree r is assumed to be arbitrary but fixed, and the number of change points K is allowed to diverge as the sample size n grows unbounded. The goal of this paper is to estimate {η_k}_{k=1}^{K}, called the change points of f(·), accurately, and to understand the fundamental limits in detecting and localising these change points. More detailed model descriptions can be found in Section 2.
The work in this paper falls within the general topic of change point analysis, which has a long history and remains under active study. In change point analysis, one assumes that the underlying distributions change at a set of unknown time points, called change points, and stay the same between two consecutive change points. A closely related problem is change point detection in piecewise constant signals, studied thoroughly in [6], [16], [12], [13], [24], [22] and [39], among others. [14], [4], [8], [1], [18], [25] and [9] studied change point analysis in piecewise linear signals. Our work can be seen as a generalisation of the aforementioned results, allowing for polynomials of arbitrary degrees and for the magnitudes of the coefficient changes to vanish as the sample size grows unbounded, although some of the aforementioned work may contain more general assumptions on the noise structure. Detailed comparisons with some existing literature will be provided after we present our main results.
To digress slightly, it is worth mentioning that, instead of focusing on estimating the locations of the change points, a complementary problem is to estimate the whole underlying piecewise polynomial function itself. This is a canonical problem in nonparametric regression and also has a long history. The piecewise polynomial function is typically assumed to satisfy certain regularity conditions at the change points. The classical settings therein assume that the degrees of the underlying polynomials take some particular values and that the change points, referred to as knots, are at fixed locations, see e.g. [20] and [36]. More recent regression methods have focussed on fitting piecewise polynomials where the knots are not fixed beforehand but are instead estimated from the data [e.g. 26, 33, 32, 21].
In this paper, we focus on estimating the locations of the change points accurately, allowing for general and different degrees of polynomials within f (·), diverging number of change points, and different smoothness at different change points. This framework, to the best of our knowledge, is the most flexible one in both change point analysis and spline regression analysis areas. In the rest of this paper, we first formalise the problem and introduce the algorithm in Section 1.1, followed by a list of contributions in Section 1.2. The main results are collected in Section 2, with more discussions in Section 3 and the proofs in the Appendices. Extensive numerical experiments are presented in Section 4.

The problem setup and the description of the estimator
In order to estimate the change points of f(·), we propose a two-step estimator. The estimator is defined in this subsection, following the introduction of the necessary notation used throughout this paper.
For any fixed λ > 0 and given data y ∈ R^n, let the estimated partition be

Π̂ ∈ argmin_{Π ∈ P_n} G(Π, λ), (2)

where the notation therein is introduced below.
• The norm ‖·‖ denotes the ℓ_2-norm of a vector.

We can see that the loss function G(·, ·) is a penalised residual sum of squares. The penalisation is imposed on the cardinality of the partition, which is in fact an ℓ_0 penalisation. The residual sum of squares consists of the residuals after projecting the data onto the discrete polynomial space. The initial estimators {η̂_k}_{k=1}^{K̂} are defined to be η(Π̂), the change points of Π̂.
In summary, this two-step algorithm proceeds with the optimisation problem (2), providing a set of initial estimators {η̂_k}_{k=1}^{K̂}. With the initial estimators, a parallelisable second step works on every triplet (η̂_{k−1}, η̂_k, η̂_{k+1}), k ∈ {1, . . . , K̂}, to refine η̂_k and yield the final estimator η̃_k. This update does not change the number of estimated change points. Note that the choice of 1/2 in the definitions of the s_k's and e_k's in (6) is arbitrary, and any constant c ∈ (0, 1) would work.
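As an illustration, the second step can be sketched in Python as follows. This is only a minimal sketch under our reading of the refinement step: it takes s_k and e_k to be the midpoints towards the neighbouring initial estimates (the constant 1/2 above), and refines by an exhaustive least squares scan; the function name and the use of `np.polyfit` are our own illustrative choices.

```python
import numpy as np

def refine_change_point(y, eta_prev, eta_k, eta_next, r):
    """Refine one initial estimate eta_k by an exhaustive scan over
    [s_k, e_k), where s_k and e_k sit halfway between eta_k and its
    neighbouring initial estimates."""
    s = eta_k - (eta_k - eta_prev) // 2
    e = eta_k + (eta_next - eta_k) // 2

    def rss(a, b):
        # Residual sum of squares after fitting a degree-(<= r) polynomial
        # on y[a:b]; intervals too short to fit are given zero residual.
        if b - a <= r + 1:
            return 0.0
        x = np.arange(a, b) / len(y)
        coef = np.polyfit(x, y[a:b], deg=r)
        return float(np.sum((y[a:b] - np.polyval(coef, x)) ** 2))

    # The refined estimate is the best split of [s, e) into two fits.
    return min(range(s + 1, e), key=lambda t: rss(s, t) + rss(t, e))
```

Because each triplet is treated separately, the K̂ refinements can be run in parallel, matching the parallelisability noted above.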
For ease of reference, we present the full procedure in Algorithm 1.

Algorithm 1 Two-step estimation
[Algorithm 1 computes the initial estimators {η̂_k}_{k=1}^{K̂} via (2) and (7), and then the final estimators {η̃_k}_{k=1}^{K̂} via the local refinement step.]

We conclude this subsection with two remarks, on the optimisation problem (2) and on the computational aspect of the upper bound on the polynomial degree r, respectively.

Remark 1 (The optimisation problem (2)). The uniqueness of the solution of (2) is not guaranteed in general, but the properties we present regarding the change point estimators hold for any solution. In fact, under some mild conditions, for instance the existence of densities of the noise distribution, one can show that the minimiser of (2) is unique almost surely [e.g. Remark 4 in 39].
The optimisation problem (2), with a general loss function, is known as the minimal partitioning problem [e.g. Algorithm 1 in 17], which is related to the Schwarz Information Criterion [e.g. 43] and can be solved by a dynamic programming approach in polynomial time. The computational cost is of order O(n^2 Cost(n)), where Cost(n) is the cost of calculating G(Π, λ) for any given Π and λ. To be specific, for (2), Cost(n) = O(n), where the hidden constants depend on the polynomial degree r; therefore the total computational cost is O(n^3). A reference where the computational cost and the dynamic programming algorithm are explicitly discussed is Lemma 1.1 in [7].
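For concreteness, here is a minimal Python sketch of the minimal partitioning dynamic programme. The interval cost below (a degree-r polynomial residual sum of squares plus a penalty λ per interval) follows our reading of G(Π, λ); the function names are our own, not the paper's.

```python
import numpy as np

def min_partition(y, r, lam):
    """Minimal partitioning dynamic programme: B[j] is the best
    penalised cost of y[0:j], where every interval pays lam plus the
    residual sum of squares of a degree-r polynomial fit."""
    n = len(y)
    x = np.arange(n) / n

    def cost(i, j):
        # RSS of y[i:j] after projection onto degree-<= r polynomials;
        # intervals too short to fit are given zero residual.
        if j - i <= r + 1:
            return 0.0
        coef = np.polyfit(x[i:j], y[i:j], deg=r)
        return float(np.sum((y[i:j] - np.polyval(coef, x[i:j])) ** 2))

    B = np.full(n + 1, np.inf)
    B[0] = 0.0
    arg = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        # O(n) candidate left end points, each with an O(n) cost
        # evaluation, giving the O(n^3) total mentioned above.
        vals = [B[i] + cost(i, j) + lam for i in range(j)]
        arg[j] = int(np.argmin(vals))
        B[j] = vals[arg[j]]

    cps, j = [], n  # backtrack: interior boundaries = change points
    while j > 0:
        if arg[j] > 0:
            cps.append(int(arg[j]))
        j = arg[j]
    return sorted(cps)
```

For example, `min_partition(np.concatenate([np.zeros(20), 5 * np.ones(20)]), 0, 1.0)` recovers the single boundary at index 20.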
We would like to mention that the minimal partitioning problem has previously been used in the change point analysis literature for other models, including [15], [23], [39], [40] and [41], among others. In the spline regression analysis area, the ℓ_0 penalisation is also exploited, for instance in [32] and [7], to name but a few. We reiterate that [32] and [7] studied the estimation risk of the whole underlying function, whereas the results derived in this paper focus on change point localisation.
Remark 2 (The polynomial degree upper bound r). The degree r is in fact an input of the algorithm: one needs to specify it in (2) and (7). Usually, when defining a degree-d polynomial, one requires the degree-d coefficient to be nonzero, and a polynomial of lower degree is regarded as a degenerate degree-d polynomial. In this paper, we do not insist on the highest-degree coefficient being nonzero. With this flexibility, in practice, as long as the input r is not smaller than the largest degree of the underlying polynomials, the performance guarantees of the algorithm still hold. However, the larger the input r is, the more costly the optimisation becomes. We discuss this point further after presenting our main theorem.

Main contributions
To conclude this section, we summarise our contributions in this paper.
Firstly, to the best of our knowledge, this is the first paper studying change point localisation in piecewise polynomials with general degrees. The model we are concerned with in this paper enjoys great flexibility: we allow the number of change points and the variances of the noise sequence to diverge, and the differences between two consecutive distinct polynomials to vanish, as the sample size grows unbounded.
Secondly, we propose a two-step estimation procedure for the change points, detailed in Algorithm 1. The first step is a version of the minimal partitioning problem [e.g. 17], and the second step is a parallelisable update. The first step can be done in O(n 3 ) time and the second step in O(n) time.
Thirdly, we provide theoretical guarantees for the change point estimators returned by Algorithm 1. To the best of our knowledge, this is the first work in the literature to establish change point localisation rates for piecewise polynomials with general degrees; prior to this paper, the state-of-the-art results covered only piecewise linear signals. We allow the underlying contiguous polynomials to possess different smoothness at different change points, which is reflected in our localisation error bound for each individual change point. In short, we show that our change point estimator enjoys nearly optimal adaptive localisation rates. In addition to the global minimax rates, we also derive minimax rates on the localisation errors when restricting to some special classes. These results, again, are the first of their kind for general polynomials.
Lastly, in a fully finite sample framework, we provide information-theoretic lower bounds characterising the fundamental difficulties of the problem, showing that our estimators are nearly minimax rate-optimal. To the best of our knowledge, even for the piecewise linear case, previous minimax lower bounds only focused on the scaling in the sample size n whereas we derive a minimax lower bound involving all the parameters of the problem. More detailed comparisons with existing literature are in Section 3.

Main Result
In this section, we investigate the theoretical properties of the initial and the final estimators returned by Algorithm 1.

Characterising differences between different polynomials
In the change point analysis literature, the difficulty of the change point estimation task can be characterised by two key model parameters: the minimal spacing between two consecutive change points and the minimal difference between two consecutive underlying distributions. In this paper, the underlying distributions are determined by the polynomial coefficients. For two different at-most-degree-r polynomials, the difference boils down to the difference between two (r + 1)-dimensional vectors of polynomial coefficients. To characterise the difference, for any integers r, K ≥ 0, we adopt the following reparameterisation for any piecewise polynomial function f(·) ∈ F^{r,K}_n, where f(·) is right-continuous with left limits and is a polynomial of degree at most r on each segment. (8)

Remark 3 (Uniqueness of the change points). Note that if two adjacent distinct polynomials are continuous at the change point, then the definition of the change point is not necessarily unique and may differ depending on the degree of the polynomials. In this case, either choice satisfying (8) can serve as a change point and will not affect the theoretical results.
where {a_{k,l}, b_{k,l}}_{l=0}^{r} ⊂ R.
Definition 1 (Jumps at change points). Associated with the change point η_k, define the effective sample size as Δ_k = min{η_k − η_{k−1}, η_{k+1} − η_k}. For l ∈ {0, 1, . . . , r}, define κ_{k,l} = |a_{k,l} − b_{k,l}| and ρ_{k,l} = κ_{k,l}^2 Δ_k^{2l+1} n^{−2l}. Finally, define the signal strength associated with the change point η_k as ρ_k = max_{l ∈ {0, . . . , r}} ρ_{k,l}.

We define the jump associated with each change point of f(·) ∈ F^{r,K}_n in Definition 1. The definition is based on a reparameterisation of two consecutive polynomials. Using the notation in Definition 1, due to the definition (8), f(·) is an at-most-degree-r polynomial on each [η_k/n, η_{k+1}/n), k ∈ {0, . . . , K}. This enables the reparameterisation (9).
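To make these quantities concrete, the sketch below computes the effective sample size and signal strength of a single change point. It assumes the conventional choices Δ_k = min(η_k − η_{k−1}, η_{k+1} − η_k), κ_{k,l} = |a_{k,l} − b_{k,l}| and ρ_{k,l} = κ_{k,l}^2 Δ_k^{2l+1} n^{−2l}; these formulas are our reading of Definition 1, and the function name is our own.

```python
def signal_strength(eta_prev, eta_k, eta_next, a, b, n):
    """Compute (Delta_k, rho_k) for one change point, where a and b are
    the coefficient vectors (a_{k,0}, ..., a_{k,r}) and
    (b_{k,0}, ..., b_{k,r}) of the polynomials before and after eta_k."""
    # Effective sample size: distance to the nearer neighbouring change point.
    delta = min(eta_k - eta_prev, eta_next - eta_k)
    rho = 0.0
    for l, (al, bl) in enumerate(zip(a, b)):
        kappa = abs(al - bl)  # jump in the order-l coefficient
        # Signal strength at order l: kappa^2 * Delta^(2l+1) * n^(-2l).
        rho = max(rho, kappa ** 2 * delta ** (2 * l + 1) * n ** (-2 * l))
    return delta, rho
```

For instance, a pure slope change of size 1 at the midpoint of n = 100 points with Δ = 50 gives ρ = 50^3 / 100^2 = 12.5, illustrating how higher-order jumps are discounted by powers of Δ/n.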
With the reparameterisation (9), it is easy to see that, for any change point η_k, there must exist at least one l ∈ {0, 1, . . . , r} such that κ_{k,l} > 0; thus ρ_k > 0 for any change point.
Here we use the convention that if f(·) is (−1)-times differentiable at x, then f(·) is not continuous at x.
Based on the definition of ρ_k, we remark that, for each individual change point, the signal strength ρ_k is associated with a certain polynomial order l_k such that l_k ∈ argmax_{l ∈ {0, . . . , r}} ρ_{k,l}. (10) We will come back to this after Theorem 1.
There are two key advantages of using Definition 1 to characterise the difference. Firstly, we allow for a full range of smoothness at the change points. Detecting change points in piecewise linear models was studied in [14], but continuity at the change points is imposed there. Our formulation covers this continuous case but also allows for discontinuity. Most importantly, we allow each change point to have its individual smoothness level, namely min{l = 0, . . . , r : κ_{k,l} > 0}.
Secondly, in addition to allowing for a full range of smoothness, we also take into consideration the magnitudes of the coefficient changes at different orders. In the piecewise linear change point detection literature, [8] considered both continuous and discontinuous cases, but assumed that all the changes are either zero or of order O(1) and that there is only one true change point. Our formulation allows the changes and the locations of the change points to be functions of the sample size n, and allows the number of change points to diverge as the sample size grows unbounded.

Change point localisation errors
In this section, we present our main theorem providing theoretical guarantees on the output of Algorithm 1, with assumptions collected in Assumption 1.

Assumption 1 (Model assumptions). Assume that the data {y_i}_{i=1}^{n} are generated from (1), where f(·) belongs to F^{r,K}_n defined in (8) and the ε_i's are independent zero-mean sub-Gaussian random variables with max_{i=1,...,n} ‖ε_i‖_{ψ_2}^2 ≤ σ^2. We denote the collection of all change points of f(·) by {η_1, . . . , η_K}, with the corresponding signal strengths {ρ_k} defined in Definition 1.
The problem is now completely characterised by the sample size n, the maximum degree r, the number of change points K, the upper bound on the fluctuations σ, the effective sample sizes {Δ_k} and the signal strengths {ρ_k}. In this paper, we allow the maximum degree r to be arbitrary but fixed, i.e. not a function of the sample size n. We allow the number of change points K and the fluctuation bound σ to diverge, and the ratios of the effective sample sizes to the total sample size {Δ_k/n} and the jump sizes {κ_{k,l}} to vanish, as the sample size grows unbounded.
Theorem 1. Let {η̂_k}_{k=1}^{K̂} and {η̃_k}_{k=1}^{K̂} be the initial and final estimators of Algorithm 1, with inputs y, r and tuning parameter λ. Assume that λ = c_noise K σ^2 log(n) and ρ ≥ c_signal λ. (11)
For each k ∈ {1, . . . , K}, let S_k = {l ∈ {0, . . . , r} : ρ_{k,l} ≥ c_signal λ}. (12) In addition, let r_k be the smallest element of argmin_{l ∈ S_k} {σ^2 log(n) n^{2l} / κ_{k,l}^2}^{1/(2l+1)}. (13) We have that, in an event E with probability at least 1 − n^{−c_prob}, K̂ = K and, for each k ∈ {1, . . . , K}, |η̃_k − η_k| ≤ c_error {σ^2 log(n) n^{2r_k} / κ_{k,r_k}^2}^{1/(2r_k+1)}. (14) The constants c_prob, c_noise, c_signal and c_error > 0 are all absolute constants.
Remark 4 (Tracking constants). All the absolute constants c_prob, c_noise, c_signal and c_error can be tracked in the proofs, although we do not claim their optimality. The hierarchy of the constants is as follows.
We first determine the constant c_prob > 0, which only depends on the maximum degree r. Given c_prob, we can determine c_noise, which only depends on c_prob. With c_prob and c_noise at hand, we can determine c_signal > 0. Lastly, the constant c_error > 0 depends on c_signal, c_noise and c_prob. We note that the larger c_signal is, the smaller c_error is.

Remark 5 (The choice of λ). The theoretical result relies on a choice of λ detailed in (11), which is a function of the unknown parameters K and σ, in addition to the unknown constant c_noise. In practice, we do not recommend estimating K and σ separately, due to the involvement of c_noise. One can instead adopt a data-driven method for tuning parameter selection [e.g. 30].
To understand Theorem 1, we conduct discussions on the following aspects: (1) how to understand the localisation rates; (2) how to understand the definitions of {r_k}; and (3) how to understand the signal strength condition in (11). We conclude this discussion with piecewise linear models as examples.

The localisation rates. From Theorem 1 we can see that the final estimators {η̃_k}_{k=1}^{K̂} improve upon the initial estimators {η̂_k}_{k=1}^{K̂}, by getting rid of K, the dependence on the number of change points, in their localisation error upper bounds. It is possible that this K term is merely an artefact of our current proof, in which case the initial estimators would not need to be updated further; see Section 3 for more on this issue. However, with our current proof technique, we do need the second-step update to obtain the improved localisation error bound.
For each individual change point η_k, k ∈ {1, . . . , K}, the localisation errors are given in (14). Due to the definition of r_k and the condition (11), with properly chosen constants we can ensure that, in the event E, |η̂_k − η_k| < cΔ_k for some 0 < c < 1/2. This guarantees that, in the second step of Algorithm 1, each interval [s_k, e_k) contains one and only one true change point.
The definitions of r_k. For each k ∈ {1, . . . , K}, the final localisation rate is a function of r_k, which is one of the polynomial orders in the set S_k. As defined in (13), the choice of r_k minimises the term (15) over l ∈ S_k. In fact, it can be seen from the proofs that, for any l ∈ S_k as defined in (12), the term (15) can serve as an upper bound on the localisation error rate. Due to the definition of r_k in (13), our choice of r_k therefore ensures that the localisation rate is the sharpest. If the minimiser is not unique, we choose r_k to be the smallest element to guarantee uniqueness of the definition; however, we remark that any choice would yield the same final rate.
Recall that in Section 2.1, after presenting Definition 1, we mentioned that the individual signal strength is associated with a certain polynomial order l_k, defined in (10). The choice of r_k in (13) is not necessarily the same as l_k, but it holds that {r_k, l_k} ⊂ S_k. If there is only one polynomial order whose signal strength is large enough, i.e. |S_k| = 1, then r_k = l_k; otherwise they are not necessarily the same.
The signal strength condition (11). Recalling the definition of ρ, the condition (11) requires the signal strength at every true change point to be of order larger than Kσ^2 log(n). This is to say, at any true change point, there is at least one polynomial order whose associated jump has strength larger than the order of Kσ^2 log(n). The signal strength ρ_{k,l} is a function of the size of the coefficient change as well as of the corresponding order.
Piecewise linear models. Let us consider a concrete case where K = 1, r = 1 and the only change point is η. A question that can be asked now is as follows.
Is it easier to estimate the change point location when the underlying f (·) is continuous at η or discontinuous at η?
This question is partially answered in [8], under the assumption κ ≍ σ ≍ 1, where it is argued that (in our terminology) the localisation errors for the continuous and discontinuous cases are of order O(n^{2/3}) and O(1), respectively. Theorem 1 unfolds a more comprehensive picture. We remark that [8] also proposed a super-efficient estimator attaining the rate O(n^{1/2}) in the continuous case; we provide more discussion of this in Section 2.4.
- If κ_{1,1}^2 n ≳ σ^2 log(n) ≳ κ_{1,0}^2 n, then the localisation error rate is {σ^2 log(n) n^2 / κ_{1,1}^2}^{1/3}.
- If κ_{1,0}^2 n ≳ σ^2 log(n) ≳ κ_{1,1}^2 n, then the localisation error rate is σ^2 log(n) / κ_{1,0}^2.
- If min{κ_{1,1}^2 n, κ_{1,0}^2 n} ≳ σ^2 log(n), then the localisation error rate is min[{σ^2 log(n) n^2 / κ_{1,1}^2}^{1/3}, σ^2 log(n) / κ_{1,0}^2].

Returning to the question asked above, there is no simple answer as to which case is easier, and one needs to carefully weigh the different rates discussed above. However, if one assumes κ^{−2} σ^2 log(n) ≍ 1, the continuous and discontinuous cases yield localisation rates of O(n^{2/3}) and O(1), respectively.
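These regimes can be packaged into a small helper for quick comparisons. This is an illustrative sketch of the rates up to constants, under our reading of the regimes above (K = 1, r = 1, Δ of order n); the function name and interface are our own.

```python
import math

def linear_localisation_rate(kappa0, kappa1, n, sigma=1.0):
    """Order (up to constants) of the localisation error for K = 1,
    r = 1 and Delta of order n; returns None if neither the intercept
    jump kappa0 nor the slope jump kappa1 clears the detection
    threshold sigma^2 log(n)."""
    thresh = sigma ** 2 * math.log(n)
    rates = []
    if kappa0 ** 2 * n >= thresh:  # order-0 (discontinuous) jump detectable
        rates.append(thresh / kappa0 ** 2)
    if kappa1 ** 2 * n >= thresh:  # order-1 (slope) jump detectable
        rates.append((thresh * n ** 2 / kappa1 ** 2) ** (1.0 / 3.0))
    return min(rates) if rates else None
```

With κ_{1,0} ≍ σ ≍ 1 and κ_{1,1} = 0, this returns the order-σ^2 log(n) rate of the discontinuous case, consistent with the discussion above.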

Global lower bounds
In this section, we provide global information-theoretic lower bounds to characterise the fundamental difficulty of localising change points in the model defined in Assumption 1. By "global" we mean that we do not assume knowledge of further continuity conditions, in contrast to Section 2.4 in the sequel.
In the change point analysis literature, in terms of localising the change point locations, there are two aspects of interest: one is the minimax lower bound on the localisation error and the other is the minimax lower bound on the signal strength. For simplicity, in this section, we assume that K = 1 and r_1 = r, using the notation defined in (13).
As for these two aspects, Theorem 1 shows that, provided the signal strength condition (11) holds, the localisation error is at most of order {σ^2 log(n) n^{2r} / κ_{1,r}^2}^{1/(2r+1)}. (17) In this section, we investigate the optimality of these results.
where η(P) is the location of the change point of distribution P, the minimum is taken over all measurable functions of the data, η̂ is the estimated change point and 0 < c < 1 is an absolute constant.
Lemma 2 shows that the final estimators provided by Algorithm 1 are nearly optimal in terms of the localisation error, save for a logarithmic factor. We remark that in Lemma 2, we consider the class of distributions in which the signal strength at order r satisfies the signal-to-noise ratio condition (16), and the order r appears in the localisation error lower bound. We leave the proof of Lemma 2 to the appendix, but provide some explanation of the proof here.
We adopt Le Cam's lemma [e.g. 44] to show the lower bound, applying it to two explicitly constructed distributions. One of these two distributions is (r − 1)-times differentiable but not r-times differentiable; the other is not continuous. This construction provides a global minimax lower bound when r_1 = r. For example, in the piecewise linear models, this includes both continuous and discontinuous cases, and the corresponding lower bound is of order min[{σ^2 n^2 / κ^2}^{1/3}, n/3]. Combining with (17), we conclude that Theorem 1 is optimal save for a logarithmic factor.
where η(P) is the location of the change point of distribution P, the minimum is taken over all measurable functions of the data, η̂ is the estimated change point and 0 < c < 1 is a constant depending only on ξ.
Lemma 3 shows that, if κ^2 Δ^{2r+1} n^{−2r} ≲ σ^2, then no algorithm is guaranteed to be consistent, in the sense that the localisation error can be of the same order as Δ. This means that, besides the logarithmic factor, Lemma 3 and Theorem 1 leave a gap in terms of K: to be specific, it remains unclear what results one would obtain if σ^2 ≲ κ^2 Δ^{2r+1} n^{−2r} ≲ K σ^2 log(n). This gap only exists when we allow K to diverge. We provide some conjectures in line with this discussion in Section 3.1.

A special case
In Section 2.3 we presented the global minimax lower bound on the localisation error. In Lemma 4 below, we provide a minimax lower bound over a smaller class.
where η(P) is the location of the change point of distribution P, the minimum is taken over all measurable functions of the data, η̂ is the estimated change point and 0 < c < 1 is an absolute constant.
Comparing Lemmas 2 and 4, we notice that Q_1, the class of distributions considered in Lemma 4, is strictly smaller than Q, the class considered in Lemma 2: in Q_1, we enforce that the underlying polynomials are (r − 1)-times differentiable. We leave the proof of Lemma 4 to the appendix, but highlight some key ingredients here. We again adopt Le Cam's lemma in deriving the lower bound; however, unlike the construction used in the proof of Lemma 2, the two explicit functions we choose are both (r − 1)-times differentiable.
Clearly, the localisation lower bound provided in Lemma 4 is sharper than the one in Lemma 2. This is not surprising, since Q_1 ⊂ Q. What is seemingly surprising is that the lower bound is not a function of any polynomial order. This is gained by exploiting the fact that a_{1,l} = b_{1,l} for l ∈ {0, . . . , r − 1}.
In [8], similar results were obtained, but only for the piecewise linear case. To match this lower bound, [8] proposed a super-efficient estimator, which assumes it is known that the piecewise linear model is continuous. The super-efficient estimator is essentially a penalised estimator, which forces the intercept estimators to be equal whenever their difference is not too large. One can straightforwardly extend the idea there to the class with a_{k,l} = b_{k,l} for l ∈ {0, . . . , r_k − 1} but a_{k,r_k} ≠ b_{k,r_k}, for every k ∈ {1, . . . , K}. Enforcing the corresponding polynomial coefficient estimators to be equal before and after each estimated change point, knowing the exact smoothness at every individual true change point, would yield a localisation error of the order detailed in Lemma 4. We refrain from pursuing such an extension, since in our paper we allow for multiple change points and individual smoothness levels; this would end up requiring Σ_{k=1}^{K} r_k additional tuning parameters.

Discussions
In this paper, we investigate change point localisation in piecewise polynomial signals. We allow for a general framework and provide individual localisation errors, associated with the individual smoothness at each change point. A two-step algorithm, consisting of solving a minimal partitioning problem and an updating step, is proposed. The outputs are shown to be nearly optimal, supported by information-theoretic lower bounds. To conclude, we discuss some unresolved aspects of our work while comparing our results with some particularly relevant existing literature. Readers who are less familiar with the change point literature may safely skip this section.

Comparisons with [39]
[39] studied change point localisation in piecewise constant signals. They studied the ℓ_0-penalised least squares method and proved that it is nearly minimax optimal in terms of both the signal strength condition and the localisation error. In contrast, with our proof technique, we have been able to generalise this result to higher-degree polynomials up to a factor depending on K, the number of true change points. This can be seen in the localisation error bound for our initial estimators in Theorem 1, and also in our required signal strength condition (16). In our paper, with general degree polynomials, the localisation near-optimality is secured via an extra updating step, and a gap remains between the upper and lower bounds on the required signal strength condition. This gap is not present if K is assumed to be O(1), but is present if K is allowed to diverge.
We explain why the proof in [39] could not be fully generalised to our setting. Recall the definition of H(v, I) in (3), denoting a residual sum of squares term. In our analysis, a crucial role is played by the term Q{E(y); I_1, I_2} = H{E(y), I_1 ∪ I_2} − H{E(y), I_1} − H{E(y), I_2}, where I_1, I_2 are two contiguous intervals of {1, . . . , n}. Ideally, one needs to be able to upper and lower bound Q{E(y); I_1, I_2} when y is defined in (1) and its corresponding f(·) is one degree-r polynomial on I_1 and another degree-r polynomial on I_2. In the case of r = 0, i.e. in the piecewise constant case, one can write the exact expression Q{E(y); I_1, I_2} = |I_1||I_2| κ^2 / (|I_1| + |I_2|). In addition, it holds that min{|I_1|, |I_2|}/2 ≤ |I_1||I_2|/(|I_1| + |I_2|) ≤ min{|I_1|, |I_2|}. Therefore, it follows that (1/2) min{|I_1|, |I_2|} κ^2 ≤ Q{E(y); I_1, I_2} ≤ min{|I_1|, |I_2|} κ^2, (20) where κ represents the absolute difference between the values of E(y_i) for i ∈ I_1 and for i ∈ I_2. For general r, by adopting an elegant result in [32], one can generalise (20) to obtain C_1 min{|I_1|, |I_2|}^{2r+1} n^{−2r} κ^2 ≤ Q{E(y); I_1, I_2} ≤ C_2 min{|I_1|, |I_2|}^{2r+1} n^{−2r} κ^2, where 0 < C_1 < C_2 are two absolute constants and κ is the absolute difference of the rth-degree coefficients of E(y) on I_1 and I_2. However, the problem is that the constants C_1 and C_2 are not explicit: we can only show the existence of such constants. Even if we could track these two constants down, in order to generalise the argument of [39], we would still need to show that C_1 and C_2 are close enough. At this moment, it is not clear to us how to resolve this issue. We can only conjecture that, for all r ∈ N, the ℓ_0-penalised least squares method would itself be nearly optimal in terms of both the signal strength condition and the localisation error, so that our second-step update would not be needed. From a practical point of view, our second step can be done in O(n) time, which is negligible compared to the O(n^3) time required to solve the penalised least squares problem; the computational overhead of the second step is thus minor.
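The two-sided bound (20) for the piecewise constant case can be checked numerically. The sketch below computes Q{E(y); I_1, I_2} directly from its definition; the helper is our own, written for r = 0 but parameterised by r (projection onto polynomials of degree at most r is invariant under shifting the design, so fitting on a local index grid is harmless).

```python
import numpy as np

def Q_term(mu, n1, r=0):
    """Q{E(y); I1, I2} = H(mu, I1 u I2) - H(mu, I1) - H(mu, I2), where
    I1 is the first n1 coordinates of mu, I2 is the rest, and H(v, I)
    is the residual sum of squares after projecting v onto
    degree-<= r polynomials."""
    def H(v):
        x = np.arange(len(v), dtype=float)
        coef = np.polyfit(x, v, deg=r)
        return float(np.sum((v - np.polyval(coef, x)) ** 2))
    return H(mu) - H(mu[:n1]) - H(mu[n1:])

# Piecewise constant (r = 0) check of (20): jump of size kappa = 3
# between I1 of length 10 and I2 of length 30.
kappa, n1, n2 = 3.0, 10, 30
mu = np.concatenate([np.zeros(n1), kappa * np.ones(n2)])
q = Q_term(mu, n1)  # exact value: n1 * n2 / (n1 + n2) * kappa^2
assert 0.5 * min(n1, n2) * kappa ** 2 <= q <= min(n1, n2) * kappa ** 2
```

Here q equals 67.5 = (10·30/40)·9, sitting between the two sides of (20), namely 45 and 90.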
Comparisons with [14]

[14] showed that a penalised least squares method works well for change point localisation in piecewise linear signals; this work inspired us to investigate piecewise polynomial signals of higher degrees. Even in the piecewise linear case, there are some differences between our work and [14]. The algorithm provided in [14] can be seen as solving a variant of the penalised least squares problem mentioned in this paper. In fact, the dynamic programming algorithm in [14] appears to be more sophisticated than what would be required to solve our problem, because the algorithm in [14] is tailored specifically to continuous piecewise linear functions; maintaining continuity makes the dynamic programming more involved. Translated to our notation, [14] assumes r_k = 1 for all k ∈ {1, . . . , K}. Our formulation is more general than [14], as we do not impose continuity or any other kind of smoothness at the change points, and our estimator adapts near-optimally to the level of smoothness at the change points. The theoretical results in [14] are derived under the conditions K, σ ≍ 1. Under these conditions, translated to our notation, their results read: provided that (κ/n)^2 Δ^3 ≳ log(n), the localisation error is of order log^{1/3}(n) (n/κ)^{2/3}. Both are consistent with the results we have obtained in this paper.

Comparisons with [29]
[29] studied the minimax rates of change point localisation in a nonparametric setting. The main focus there is how the minimax rates of the localisation errors change with α, the degree of discontinuity in a Hölder sense. Due to its nonparametric nature, the class of functions considered in [29] is more general than the piecewise polynomial class discussed here. However, the measures of regularity r_k that we define in Definition 1 play a similar role to the parameter α in [29] when restricted to polynomials. Having drawn this connection, translated into our notation, [29] in fact shows that the minimax lower bound on the localisation error is of order {n^{2r} log^{η}(n)}^{1/(2r+1)}, for any η > 1.
This is a lower bound for a larger class of functions than ours, but the dependence on n is the same as ours up to a poly-logarithmic factor; in general, the larger the class is, the larger the minimax lower bound is. Since [29] assumes all the other parameters to be of order O(1), our minimax lower bounds add value as they involve all the relevant problem parameters, not just the sample size n.

Why not just difference the sequences
In this paper, we deal with piecewise polynomials of general order r. We have noticed that, in practice, some practitioners difference the sequence r times, hoping to obtain a piecewise constant signal, and then apply change point detection methods to the resulting differenced sequence. This is in fact not an effective method if the goal is to localise change points. We use the piecewise linear model as a concrete example, assuming the continuous signal f(x) = a_0 + a_1 x for x ∈ [0, η/n) and f(x) = a_0 + a_1 η/n + b_1 (x − η/n) for x ∈ [η/n, 1], where a_0, a_1, b_1 ∈ R and a_1 ≠ b_1. As we have shown, with κ = |a_1 − b_1|, the global and constrained minimax lower bounds for this problem are of order {σ^2 n^2 / κ^2}^{1/3} and σ^2/κ^2, respectively.
If we now take differences, we work with the new model z_i = y_i − y_{i−1}, i ∈ {2, . . . , n}, whose mean sequence is piecewise constant with a jump of size only κ/n at the change point. The localisation error lower bound is therefore of order σ^2 n^2 / κ^2, provided that the signal strength is still strong enough. (The differenced sequence is no longer independent, but weakly dependent; the variance parameter is therefore inflated only by a constant factor.) Comparing these rates, we see that it is not always a good idea to difference polynomial sequences.
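A small numerical illustration of this shrinkage, with hypothetical parameter values of our own choosing: differencing a continuous piecewise linear mean produces a piecewise constant mean sequence whose jump is only κ/n.

```python
import numpy as np

# Hypothetical parameter values; here kappa = |a1 - b1| = 2.
n, eta, a0, a1, b1 = 1000, 500, 0.0, 1.0, 3.0
x = np.arange(1, n + 1) / n
f = np.where(x < eta / n,
             a0 + a1 * x,                             # before the change
             a0 + a1 * eta / n + b1 * (x - eta / n))  # after, continuously

z = np.diff(f)  # means of the differenced sequence f(i/n) - f((i-1)/n)
# The differenced means are piecewise constant: a1/n, then b1/n.
assert np.allclose(z[:eta - 1], a1 / n)
assert np.allclose(z[eta - 1:], b1 / n)
# The jump has shrunk from order kappa to exactly kappa / n.
assert np.isclose(z[eta - 1] - z[eta - 2], (b1 - a1) / n)
```

The jump of size κ/n = 0.002 must now be detected against noise whose variance is of order σ^2, which is precisely what inflates the piecewise constant lower bound to σ^2 n^2 / κ^2.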

Numerical experiments
In this section, we conduct extensive numerical experiments, based on piecewise quadratic functions.
Tuning parameter selection. The only tuning parameter λ is selected via cross-validation [30]. To be specific, we first divide the sample into training and validation sets according to odd and even indices. For each candidate value of λ, the initial estimator $\widehat{\Pi}$ is obtained based on the training set. On the validation set, for each $I \in \widehat{\Pi}$, we obtain $\hat{y}_I = P_I y_I$ and compute the validation loss $\sum_{t \equiv 0 \,(\mathrm{mod}\, 2)} (\hat{y}_t - y_t)^2$. Finally, we select the λ which minimises the validation loss.
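The odd/even cross-validation step can be sketched as follows. This is our own simplification: `validation_loss` takes an arbitrary candidate partition, and `initial_partition` in the comment stands in for the $\ell_0$-penalised initial estimator, which we do not reproduce here.

```python
import numpy as np

def validation_loss(y, partition, r):
    """Odd/even CV loss: fit a degree-r polynomial on the odd-indexed points of
    each interval, evaluate squared error on the even-indexed points."""
    n = len(y)
    odd = np.arange(0, n, 2)    # training indices
    even = np.arange(1, n, 2)   # validation indices
    loss = 0.0
    for (s, e) in partition:    # half-open integer intervals [s, e)
        tr = odd[(odd >= s) & (odd < e)]
        va = even[(even >= s) & (even < e)]
        if len(tr) > r and len(va) > 0:
            coef = np.polyfit(tr, y[tr], deg=r)
            loss += float(((np.polyval(coef, va) - y[va]) ** 2).sum())
    return loss

# Select lambda by minimising the validation loss over a grid, where the
# hypothetical `initial_partition(y, lam)` is the l0-penalised estimator:
# best_lam = min(lams, key=lambda lam: validation_loss(y, initial_partition(y, lam), r))
```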
General settings. With the notation in Definition 1, $f(x)$ on the interval $[\eta_k/n, \eta_{k+1}/n)$ is a polynomial of degree at most r. The piecewise polynomial $f(x)$ can therefore be parameterised by the degree r, the change points $\{\eta_k\}_{k=1}^K$, the sample size n, the coefficients $\{a_{1,l}\}_{l=0}^r$ of the first segment, the jump sizes $\{\kappa_{k,l}\}_{l=0}^r$ for $k = 1, \ldots, K$, and $\sigma^2$, which quantifies the tail behaviour of the error terms.
For each setting below, we simulate 100 repetitions and fix K = 2. Fixing the effects of K and σ 2 , the localisation errors shown in Theorem 1 can be regarded as an interplay among Δ, r k , n and ρ k,r k ; see (14).
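A signal of this form can be simulated as follows. The function below is a minimal sketch under our reading of the parameterisation (first-segment coefficients plus a per-change-point jump in each coefficient), with Gaussian noise as a concrete sub-Gaussian choice; it is not the paper's simulation code.

```python
import random

def piecewise_poly_signal(n, r, etas, a_first, kappas, sigma, seed=0):
    """Simulate y_1, ..., y_n with a piecewise polynomial mean.

    etas:    change points (1-indexed, strictly increasing, in {2, ..., n})
    a_first: coefficients [a_{1,0}, ..., a_{1,r}] of the first segment
    kappas:  per-change-point jump vectors [kappa_{k,0}, ..., kappa_{k,r}]
    sigma:   noise standard deviation (Gaussian as a sub-Gaussian example)
    """
    rng = random.Random(seed)
    coefs = list(a_first)
    bounds = [1] + list(etas) + [n + 1]
    y = []
    for k in range(len(bounds) - 1):
        if k > 0:  # add the k-th jump to every coefficient
            coefs = [c + j for c, j in zip(coefs, kappas[k - 1])]
        for i in range(bounds[k], bounds[k + 1]):
            mean = sum(c * (i / n) ** l for l, c in enumerate(coefs))
            y.append(mean + rng.gauss(0.0, sigma))
    return y
```

For instance, `piecewise_poly_signal(450, 2, [150, 300], a_first, kappas, sigma)` would produce one replication with K = 2 equally-spaced change points.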

Scenario 1: The effects of r k and n
In this scenario, we investigate the roles of $r_k$ and n, with equally-spaced change points. We fix the polynomial coefficients for the first segment; the bound (14) then suggests the following. First, fixing n, the localisation error increases as $r_2$ increases; second, fixing $r_2$, the localisation error decreases as n increases. This is supported by the results collected in Table 1. For fixed n and Δ, our method performs similarly for Cases (a)-(d), which share the same $r_2 = 0$, and the performance deteriorates as $r_2$ increases. In each case, the performance improves as n increases.
We would like to mention that, when $r_2 = 2$, with a much larger signal strength, our method achieves similarly good performance to that in the case $r_2 = 0$.

Scenario 2: The effect of Δ
In this scenario, we vary the minimal spacing Δ and consider unbalanced locations of change points. We let n = 450 and r = 2, and fix the polynomial coefficients of the first segment. The results collected in Table 2 show that, keeping all other factors unchanged, the more balanced the locations of the change points are, the better the performance of our estimator.

Acknowledgments
The research of Yu and Xu is partially supported by EPSRC (EP/V013432/1).

Appendix A: Summary
We include all the proofs in the Appendices. Some preparatory results are provided in Appendix B. Appendix C contains the proof of Theorem 1. The lower bound results, Lemmas 2 and 3, are proved in Appendix D.

Appendix B: Preparatory Results
The following notation will be used throughout the proofs. For any $I = \{s, \ldots, e\} \subset \{1, \ldots, n\}$, recall the projection matrix $P_I$ defined in (4) using the matrix $U_{I,r}$ defined in (5). For any vector $v \in \mathbb{R}^n$, recall the notation $v_I = (v_i, \, i \in I) \in \mathbb{R}^{|I|}$. For any contiguous intervals $I, J \subset \{1, \ldots, n\}$ and for any vector $v \in \mathbb{R}^n$, define
$$Q(v; I, J) = \|v_{I \cup J} - P_{I \cup J} v_{I \cup J}\|_2^2 - \|v_I - P_I v_I\|_2^2 - \|v_J - P_J v_J\|_2^2.$$

Proof. The claim holds since fitting a single polynomial of degree at most r on $I \cup J$ can do no better, in terms of squared error, than fitting separate polynomials on I and on J, so $Q(v; I, J) \geq 0$.

Proof. First observe that $Q(y; I, J)$ is a quadratic form in y. Moreover, it is a positive semidefinite quadratic form, as $Q(y; I, J) \geq 0$ for all $y \in \mathbb{R}^n$ by Lemma 5. Therefore, we can write $Q(y; I, J) = y^\top A y$ for a positive semidefinite matrix $A \in \mathbb{R}^{n \times n}$. Denoting by $A^{1/2}$ the square root matrix of A, satisfying $A^{1/2} A^{1/2} = A$, we can write $Q(y; I, J) = \|A^{1/2} y\|_2^2$. It then holds that
$$Q(y; I, J)^{1/2} = \|A^{1/2}(\theta + \varepsilon)\|_2 \geq \|A^{1/2}\theta\|_2 - \|A^{1/2}\varepsilon\|_2 = Q(\theta; I, J)^{1/2} - Q(\varepsilon; I, J)^{1/2},$$
which leads to the final claim.
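For intuition, the residual sums of squares and the gain $Q(v; I, J)$ can be computed numerically as in the sketch below. The design matrix with columns $(i/n)^l$ is our reading of $U_{I,r}$ in (5); the code is an illustration, not the paper's implementation.

```python
import numpy as np

def rss(v, idx, r, n):
    """||v_idx - P_idx v_idx||^2: squared residual of projecting v restricted to
    idx onto discretised degree-<=r polynomials (our reading of (4)-(5))."""
    idx = np.asarray(idx)
    U = np.vander(idx / n, N=r + 1, increasing=True)  # columns (i/n)^0, ..., (i/n)^r
    vi = v[idx - 1]                                    # indices are 1-based as in the paper
    coef, *_ = np.linalg.lstsq(U, vi, rcond=None)
    return float(((vi - U @ coef) ** 2).sum())

def Q(v, I, J, r, n):
    """Gain from fitting I and J separately rather than jointly."""
    IJ = np.concatenate([np.asarray(I), np.asarray(J)])
    return rss(v, IJ, r, n) - rss(v, I, r, n) - rss(v, J, r, n)
```

A vector that is a single polynomial on $I \cup J$ gives $Q \approx 0$, while a vector that follows different polynomials on I and J gives $Q > 0$, matching Lemma 5.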
Lemma 7 (Lemma E.1 in [32]). There exists a constant $c_{\mathrm{poly}}$ depending only on r such that the stated bound holds for any admissible integers and any real sequence. Lemma 7 is a direct consequence of Lemma E.1 in [32]; we omit its proof here. Let $\theta_{I \cup J}$, i.e. θ restricted to $I \cup J$, be reparametrised as
$$\theta_i = \sum_{l=0}^{r} a_l (i/n)^l, \quad i \in I, \qquad \theta_i = \sum_{l=0}^{r} b_l (i/n)^l, \quad i \in J. \tag{25}$$
Then there exists a constant $c_{\mathrm{poly}}$ depending only on r such that the claimed lower bound on $Q(\theta; I, J)$ holds for any $d \in \{l = 0, \ldots, r : a_l \neq b_l\}$.

Proof. For any fixed $d \in \{0, \ldots, r\}$ and any $\kappa > 0$, let
$$A_d = \Big\{ v \in \mathbb{R}^n : \, v_i = \sum_{l=0}^{r} c_{1,l} (i/n)^l, \; i \in I; \quad v_i = \sum_{l=0}^{r} c_{2,l} (i/n)^l, \; i \in J; \quad |c_{1,d} - c_{2,d}| \geq \kappa \Big\}.$$
In words, $A_d$ is the set of vectors which are discretised polynomials of degree at most r on the interval I/n and different discretised polynomials of degree at most r on the interval J/n, with the dth order coefficients at least κ apart.
For $v \in A_d$, since v is a discretised polynomial on I/n and on J/n separately, we have $v_I = P_I v_I$ and $v_J = P_J v_J$, so that $Q(v; I, J) = \|v_{I \cup J} - P_{I \cup J} v_{I \cup J}\|_2^2$. In addition, we claim that
$$\min_{v \in A_d} \|v_{I \cup J} - P_{I \cup J} v_{I \cup J}\|_2 = \min_{v \in A_d} \|v_{I \cup J}\|_2.$$
This is due to the following. Since orthogonal projections cannot increase the $\ell_2$ norm, the left-hand side is at most the right-hand side. As for the other direction, observe that the vector $v_{I \cup J} - P_{I \cup J} v_{I \cup J}$ also belongs to the set $A_d$, since subtracting a common degree-r polynomial leaves the difference of the dth order coefficients unchanged.

It now suffices to lower bound $\min_{v \in A_d} \|v_{I \cup J}\|_2^2$. This is done by applying Lemma 7 and then the fact that $|c_{1,d} - c_{2,d}| \geq \kappa$, where $c_{1,d}$ and $c_{2,d}$ are the dth order coefficients of v as defined in (25).
Proof. For any interval $I \subset \{1, \ldots, n\}$, there exists an absolute constant $c > 0$ such that for any $t > 0$,
$$\mathbb{P}\big(\big|\varepsilon_I^\top P_I \varepsilon_I - \mathbb{E}(\varepsilon_I^\top P_I \varepsilon_I)\big| > t\big) \leq 2 \exp\Big[ -c \min\Big\{ \frac{t^2}{\sigma^4 \|P_I\|_{\mathrm{F}}^2}, \, \frac{t}{\sigma^2 \|P_I\|_{\mathrm{op}}} \Big\} \Big],$$
which is due to the Hanson–Wright inequality [e.g. Theorem 1.1 in 31]. Since $P_I$ is a rank-$(r+1)$ orthogonal projection matrix, we have $\|P_I\|_{\mathrm{F}}^2 = r + 1$ and $\|P_I\|_{\mathrm{op}} = 1$. In addition, we have that $\mathbb{E}(\varepsilon_I^\top P_I \varepsilon_I) \leq (r+1)\sigma^2$. For an absolute constant $C > 2/c$, letting $t = C\sigma^2 \log(n)$ and applying a union bound argument over all possible intervals I, of which there are at most $n^2$, we obtain that
$$\mathbb{P}\Big( \max_{I} \, \varepsilon_I^\top P_I \varepsilon_I > C\sigma^2 \log(n) + (r+1)\sigma^2 \Big) \leq 2 n^{2 - cC}.$$
Finally, we choose $c_{\mathrm{prob}}$ and $c_{\mathrm{noise}}$ such that $n^{-c_{\mathrm{prob}}} \geq 2n^{2-cC}$ and $c_{\mathrm{noise}} \log(n) \geq C \log(n) + (r + 1)$, which completes the proof.

Appendix C: Proof of Theorem 1
In this section, we provide the proof of Theorem 1. We first show that, under an appropriate deterministic choice of the tuning parameter λ and some deterministic conditions on the other parameters, the desired localisation error is attained. We then conclude the proof using Lemma 9, which guarantees that all these required conditions hold with high probability. For any τ > 0, define the event $\mathcal{M}(\tau)$ as in (26).

Proof of Theorem 1. It follows from Lemma 9 that the event $\mathcal{M}\{c_{\mathrm{noise}} \sigma^2 \log(n)\}$ holds with high probability, where $\mathcal{M}(\cdot)$ is defined in (26). On the event $\mathcal{M}\{c_{\mathrm{noise}} \sigma^2 \log(n)\}$, it follows from Proposition 10 that the claimed localisation error bound holds for all $k \in \{1, \ldots, K\}$. This completes the proof.

C.1. The initial estimators { η k } K k=1
The following proposition is our main intermediate result used to prove Theorem 1.
We have that for any $k \in \{1, \ldots, K\}$, there exists an absolute constant $0 < c < 5^{2r/(2r+1)}/2$ such that the claimed localisation error bound holds. Remark 6. Note that Proposition 10 is a completely deterministic result. In particular, no probabilistic assumption is needed on the noise variables. The proposition is written with explicit constants, but these constants are not optimal in any sense. We have written out explicit constants to emphasise the deterministic nature of the result and to aid understanding of the relative choices of the different problem parameters.
Proof of Proposition 10. We will show the following parts (a)-(e).
Parts (a)-(e) are shown in Lemmas 11, 12, 13, 14 and 15, respectively. Letting $\eta_0 = 1$ and $\eta_{K+1} = n + 1$, it follows from part (b) that among every three consecutive intervals in $\widehat{\Pi}$, there is at least one true change point from $\{\eta_k\}_{k=0}^{K+1}$. This implies that $|\widehat{\Pi}| \leq 3|\Pi|$. In addition, by part (a), an interval $I = [s, e) \in \widehat{\Pi}$ can contain two, one or zero true change points. If I contains exactly two true change points, then by part (c), the smaller true change point is close to the left endpoint s, and the larger true change point is close to the right endpoint e, with closeness quantified in part (c). If I contains exactly one true change point, then by part (d), the true change point is close to one of the endpoints. This shows that every true change point can be mapped to an estimated change point, and the distance between the true and the estimated change points is upper bounded as in parts (c) and (d).
Recall the definition of ρ in Assumption 1 and the condition (28). This ensures that the mapping from true change points to estimated change points is one-to-one, and implies that $|\widehat{\Pi}| \geq |\Pi|$. Finally, part (e) is deployed to complete the proof.

Proof. We prove by contradiction, assuming that there exist at least three true change points in $I = [s, e) \in \widehat{\Pi}$, namely $s \leq \eta_{k-1} < \eta_k < \eta_{k+1} < e$. This implies that $\min\{\eta_k - s, \, e - \eta_k\} > \Delta$.
Here $\mathbb{1}\{\cdot\}$ denotes the indicator function; the first inequality follows from the definition of $\widehat{\Pi}$, the second from Lemma 5 and the third from Lemma 6. As for the final inequality, it follows from Proposition 8 and the fact that $|I_2| = |I_3| = \Delta$ that $Q(\theta; I_2, I_3) \geq \rho$. Since our assumption implies that $2\tau \leq \rho$, it holds that $12\lambda \geq \rho$, which contradicts the second assumption in (27).

Here the first inequality is due to the definition of $\widehat{\Pi}$, the second identity follows from the fact that θ is a polynomial of degree at most r on J, and the last inequality holds on the event $\mathcal{M}$. Therefore we reach a contradiction to (27).

This implies that the change point of $P_0$ satisfies $\eta(P_0) = \Delta + 1$. Recalling (13), we also know that the corresponding order $r_1$ equals r, and the jump size $\kappa_1 = \kappa$. As for $P_1$, it is easy to see that
$$\mathbb{E}(y_i) = \begin{cases} 0, & i \in \{1, \ldots, n - \Delta\}, \\ \kappa \{(i - n + \Delta)/n\}^r, & i \in \{n - \Delta + 1, \ldots, n\}, \end{cases}$$
which implies that the change point of $P_1$ satisfies $\eta(P_1) = n - \Delta + 1$. Recalling Definition 1, the corresponding order again equals r and the jump size again equals κ. Since $\Delta \leq n/3$, it follows from Le Cam's lemma [e.g. 44] and Lemma 2.6 in [34] that
$$\inf_{\hat\eta} \sup_{P \in \mathcal{P}} \mathbb{E}_P(|\hat\eta - \eta|) \geq (n/3)\{1 - d_{\mathrm{TV}}(P_0, P_1)\} \geq \frac{n}{6} \exp\{-\mathrm{KL}(P_0, P_1)\}.$$
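The last display combines Le Cam's two-point method with Lemma 2.6 in [34]. Assuming Gaussian noise with variance $\sigma^2$, the two ingredients can be sketched as follows (our bookkeeping, not the paper's display):
```latex
% KL divergence between products of Gaussians with mean vectors mu_0, mu_1:
\mathrm{KL}(P_0, P_1)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{n}
    \bigl\{ \mathbb{E}_{P_0}(y_i) - \mathbb{E}_{P_1}(y_i) \bigr\}^2,
% and the Bretagnolle--Huber inequality (Lemma 2.6 in [34]):
\qquad
1 - d_{\mathrm{TV}}(P_0, P_1) \;\ge\; \tfrac{1}{2} \exp\{ -\mathrm{KL}(P_0, P_1) \}.
```
Substituting the first identity into the second yields the exponential factor $\exp\{-\mathrm{KL}(P_0, P_1)\}$ in the display above.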