Equivariant Variance Estimation for Multiple Change-point Model

The variance of noise plays an important role in many change-point detection procedures and the associated inferences. Most commonly used variance estimators require strong assumptions on the true mean structure or normality of the error distribution, which may not hold in applications. More importantly, the qualities of these estimators have not been discussed systematically in the literature. In this paper, we introduce a framework of equivariant variance estimation for multiple change-point models. In particular, we characterize the set of all equivariant unbiased quadratic variance estimators for a family of change-point model classes, and develop a minimax theory for such estimators.


Introduction
This paper focuses on variance estimation in the presence of change points. Our goal is to estimate the noise variance without identifying the locations of the changes, so that the variance estimator can be used in subsequent change-point detection procedures. We characterize the finite sample minimax risk of the proposed estimator over a broad model class with few restrictions on the change-point structure. Our estimator is equivariant over the data sequence, which greatly simplifies the calculations and leads to explicit minimax risk bounds.
A premier goal of change-point detection is to estimate and make inferences about the change-point locations. A good variance estimator is vital in many change-point detection procedures. For example, in binary segmentation and related methods [32,13], the variance is required to decide when to stop the recursive procedure. In other methods, for example, the screening and ranking algorithm (SaRa) in [30] and the simultaneous multiscale change-point estimator (SMUCE) in [12], the choice of tuning or thresholding parameters depends on the variance. In general, it is important to gauge the noise level, which determines the optimal detection boundary and the detectability of the change-point problem [2]. Moreover, an accurate and reliable estimate of the variance is necessary for constructing confidence sets of the change points. In practice, the noise variance is usually needed and estimated as the first step of a change-point analysis. However, most commonly used variance estimators, reviewed in Section 2.1, are based on technical assumptions and can be severely biased when these assumptions fail to hold. The quality of these estimators, such as unbiasedness and efficiency, has been less studied. In fact, to the best of our knowledge, the exact unbiased variance estimator under a finite sample setup has not been discussed before this work. There are two main challenges to error variance estimation for change-point models. First, information on the mean structure, such as the number of change points and the jump magnitudes, is unknown, and complex mean structures often make variance estimation more difficult. Second, the noise may not be Gaussian in practice, while many methods work well only under normality. Despite the importance of this problem and these issues, there has been no systematic study of variance estimation for the multiple change-point model (2.1). This work aims to fill this gap.
Our approach is inspired by the classical difference-based variance estimation in nonparametric regression, studied in [35,15,27,16], among many others. In particular, [28] innovatively builds a variance estimator by regressing the lag-k Rice estimators on the lags, in the context of nonparametric regression with discontinuities. Recent developments along this direction include [37,36]; see also a recent review [24]. These works focused on asymptotic analysis of variance estimation for more flexible models, and hence required much stronger conditions on the number of change points or discontinuities. In contrast to the existing literature, we narrow down to change-point models, but the thrust of our study is to obtain exact and non-asymptotic results on the unbiasedness and the minimax risk of variance estimators, under minimal conditions. To the best of our knowledge, similar results have not appeared in the literature, and they are difficult to obtain without the equivariance framework introduced in this paper.
In this paper, we develop a new framework of equivariant variance estimation. Roughly speaking, we embed the data index set [n] = {1, ..., n} on a circle instead of the usual straight line segment, so that the indices n and 1 are neighbors. In other words, there is no 'head' or 'tail' in the index set, and every position plays the same role. As we will illustrate in Section 2.4, there is a natural cyclic group action on the index set, which leads to an equivariant estimation framework. Under this framework, we are able to characterize all the equivariant unbiased quadratic variance estimators for a family of change-point model classes, and establish a minimax theory for variance estimation. This family of change-point model classes, denoted by Θ_L, is indexed by a positive integer L, the minimal distance between change-point locations allowed for any mean structure in the class. In general, a smaller L leads to a broader model class, and hence a higher minimax risk. In this work, we give both lower and upper bounds, in nonasymptotic forms, for the minimax risk of equivariant unbiased quadratic estimators over these model classes. Another advantage of the equivariant framework is that it requires minimal assumptions on the noise distribution. In fact, our theoretical analysis relies on no assumption other than the existence of the fourth moment. In particular, the performance of the proposed framework is guaranteed also for skewed or heavy-tailed distributions. We also note that the notion of equivariance has not been sufficiently explored in the literature except in [32], which focuses on short segment detection rather than a framework of equivariant estimation.
To summarize the main contributions of our work: first, we introduce a new framework of equivariant variance estimation, and characterize the equivariant unbiased quadratic variance estimators for a family of change-point model classes. This framework resembles the classical theory of linear unbiased estimation, but is technically more involved. Second, we derive nonasymptotic lower and upper minimax risk bounds for the proposed estimators. In particular, in Corollary 2.2, we give a surprisingly simple and exact answer to the minimax problem, with an explicit minimax risk for the broad change-point model class Θ_2. Third, our approach requires minimal model assumptions on the noise distribution and mean structure, which can hardly be weakened further. Last but not least, we suggest an equivariant variance estimator that is computationally simple and practically useful in applications. As a by-product, we derive the ℓ2 risk explicitly for the regression based estimator proposed by [28] and theoretically compare its risk with that of our method; in particular, our results imply that the Müller-Stadtmüller estimator is nearly minimax. In the numerical studies, compared to an oracle variance estimator that knows the true mean, the relative efficiency of our methods is often within 1.5 across different scenarios.

Existing variance estimators
In this paper, we focus on the problem of noise variance estimation for a multiple change-point model. In particular, consider a sequence of random variables

X_i = θ_i + ε_i, i = 1, ..., n, (2.1)

where the mean vector θ = (θ_1, ..., θ_n)^⊤ is piecewise constant, and τ = (τ_1, ..., τ_J)^⊤ is the location vector of the change points. We assume that the noises {ε_i}_{i=1}^n are independent and identically distributed (i.i.d.) with E(ε_1) = 0 and Var(ε_1) = σ² > 0.
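To fix ideas, the following sketch simulates data from model (2.1); the segment layout, the Gaussian noise, and all variable names are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def simulate_change_points(n, tau, mu, sigma=1.0, rng=None):
    """Draw X_i = theta_i + eps_i from model (2.1).

    tau: 1-based change-point locations with tau_J = n (assumed here);
    mu:  segment means, one per segment.
    """
    rng = np.random.default_rng(rng)
    theta = np.empty(n)
    start = 0
    for t, m in zip(tau, mu):
        theta[start:t] = m                   # segment [start + 1, t]
        start = t
    eps = sigma * rng.standard_normal(n)     # i.i.d. noise (Gaussian here)
    return theta + eps, theta

# Example: n = 1000 with change points at 300 and 700
X, theta = simulate_change_points(1000, [300, 700, 1000], [0.0, 1.0, 0.0])
```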
Many estimators for the variance or standard deviation of the additive noise have been employed in recent works on change-point detection. One is the median absolute deviation (MAD) estimator [17], defined by

σ̂_MAD = 1.4826 · med(|X − med(X)|), (2.3)

where med(X) is the median of the vector X = (X_1, ..., X_n)^⊤, and the constant 1.4826 is the ratio between the standard deviation and the median absolute deviation of the Gaussian distribution. One advantage of this estimator is that it is robust against outliers. Obviously, the method depends on a Gaussianity assumption and a sparsity assumption that θ is a constant vector except for a small number of entries. [12] suggests an estimator used in [8], which applies the MAD to the vector of lag-1 differences (X_2 − X_1, ..., X_n − X_{n−1})^⊤, rescaled by 1/√2 (equation (2.4)). This estimator is similar to the MAD except that it does not require θ to be an almost constant vector. Nevertheless, it still needs the normality of the noises.
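For reference, here is a minimal sketch of the two robust estimators just described; the exact scaling conventions in (2.3) and (2.4) are our assumptions, based on the standard definitions.

```python
import numpy as np

def mad_estimator(X):
    # (2.3)-style MAD estimator: 1.4826 rescales the Gaussian MAD to the
    # standard deviation. Assumes an almost constant mean vector.
    X = np.asarray(X, dtype=float)
    return 1.4826 * np.median(np.abs(X - np.median(X)))

def dk_estimator(X):
    # (2.4)-style estimator: apply the MAD to the lag-1 differences, which
    # removes a piecewise constant mean except at the change points;
    # the factor 1/sqrt(2) accounts for Var(X_{i+1} - X_i) = 2 * sigma^2.
    d = np.diff(np.asarray(X, dtype=float))
    return 1.4826 * np.median(np.abs(d)) / np.sqrt(2)
```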
The Rice estimator, introduced in [35], is another popular method; for example, [34] uses it for variance estimation. It does not depend on Gaussianity of the noise, but it may be seriously biased: as an immediate consequence of Proposition 2.2, its bias is driven by the total variation of the mean. To eliminate the bias and improve the efficiency, [28] proposed a regression based estimator via the lag-k Rice estimators. As we will see in Section 2.3, it is a special case of the difference-based quadratic variance estimators, which have been a popular approach in nonparametric regression [9]. Nevertheless, this approach does not seem to be widely recognized and employed in change-point analysis. There are a few interesting open problems for the Müller-Stadtmüller estimator. First, can we find its risk with respect to a loss function, e.g., the ℓ2 loss? Second, the quality of any variance estimator for a change-point model depends highly on the mean structure θ. It is desirable to find optimal or nearly optimal variance estimators for certain change-point model classes. In particular, is the Müller-Stadtmüller estimator optimal? Undoubtedly, affirmative answers to these questions will promote the application of difference-based quadratic variance estimators, including the Müller-Stadtmüller estimator, in change-point analysis.
In fact, direct answers to these questions are difficult, as we will explain in the appendix. Instead, we take a detour via an equivariance framework and answer all of the questions above.
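To make the regression idea concrete, the following sketch computes the lag-k Rice estimators S_k and a Müller-Stadtmüller-type intercept estimate; the normalization of S_k and the plain OLS fit are our assumptions.

```python
import numpy as np

def rice_lag_k(X, k):
    # Lag-k Rice estimator S_k: half the average squared lag-k difference,
    # so that E[S_k] = sigma^2 when the mean is constant.
    X = np.asarray(X, dtype=float)
    d = X[k:] - X[:-k]
    return np.mean(d**2) / 2.0

def ms_estimator(X, K=10):
    # Mueller-Stadtmueller-type estimator: E[S_k] is (roughly) linear in k
    # for small k, so regress S_1, ..., S_K on k and keep the intercept.
    ks = np.arange(1, K + 1)
    S = np.array([rice_lag_k(X, k) for k in ks])
    slope, intercept = np.polyfit(ks, S, 1)   # OLS line S_k = a + b*k
    return intercept                          # estimates sigma^2
```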

Model descriptions
In model (2.1), the data vector X = (X_1, X_2, ..., X_n)^⊤ is observed and indexed by the set [n] = {1, ..., n}. We define a segment, denoted by [k, ℓ], as a subset of [n] consisting of consecutive integers {k, k + 1, ..., ℓ}. The working model (2.1) is standard and widely used in the literature. Here we make and emphasize a key extension: the index set is arranged on a circle, and the indices 1 and n do not play special roles as start and end points. Consequently, a segment [k, ℓ] with k ≥ ℓ is also well-defined. For example, [n − 1, 3] = {n − 1, n, 1, 2, 3}. For the mean vector θ with the form (2.2), we assume that it consists of J constant segments [τ_j + 1, τ_{j+1}], j = 1, ..., J. Denote the common value of θ_i on the segment [τ_j + 1, τ_{j+1}] by μ_j. For a mean vector θ, we denote by L(θ) the minimal length of all constant segments in θ. The magnitude of L(θ) is a complexity measure of a change-point model. We will consider a family of nested model classes

Θ_L = {θ ∈ R^n : L(θ) ≥ L}. (2.6)

In general, the larger L is, the easier the change-point analysis. In particular, when L(θ) = 1, each observation can have its own mean different from all others, and there is no sensible change-point problem. Therefore, we only consider the case L(θ) ≥ 2 in this paper. Note that, by definition, L(θ) = n if θ is a constant vector, and otherwise L(θ) ≤ n/2.
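The circular convention makes L(θ) easy to compute; a minimal sketch (the function name and implementation are ours):

```python
import numpy as np

def min_segment_length(theta):
    # L(theta): minimal length of the constant segments of theta, with
    # the indices arranged on a circle (1 and n are neighbors).
    theta = np.asarray(theta)
    n = len(theta)
    cps = np.flatnonzero(theta != np.roll(theta, -1))  # circular change points
    if len(cps) == 0:
        return n                      # constant vector: L(theta) = n
    gaps = np.diff(np.append(cps, cps[0] + n))         # circular segment lengths
    return int(gaps.min())

theta = np.array([0, 0, 1, 1, 1, 1, 0, 0])
print(min_segment_length(theta))      # 4: segments {3,...,6} and {7,8,1,2}
```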
Note that the classical model treats the first segment and the last segment of θ as two separate segments. That is, the index n is treated as a known change point, no matter whether θ_1 = θ_n or not. The classical model classes Θ^c_L are defined analogously under this convention (2.7). In fact, Θ_L ⊃ Θ^c_L by definition. For example, let θ = (0, 0, 1, 1, 1, 1, 0, 0)^⊤. We have θ ∈ Θ_4 but θ ∉ Θ^c_4. The extra generality of Θ_L over Θ^c_L may be negligible in real applications. However, as we will see, it is advantageous to work with the family (2.6) to obtain neat theoretical results.
We use i, k, h, ℓ ∈ [n] to denote indices of the data, and K and L to denote lengths of segments. Occasionally, an index i in X_i or θ_i may go beyond [n] in formulas. In that case, we use the convention X_i = X_{i−nM}, where M is the unique integer such that i − nM ∈ [n]. Similarly, we use j ∈ [J] to denote the index of change points and use the convention τ_{J+1} = τ_1. The length of a segment [k, ℓ] is defined as the cardinality of the set [k, ℓ], which is ℓ − k + 1 when k ≤ ℓ, and n + ℓ − k + 1 otherwise.
We assume the following condition on the error distribution throughout this paper.

Condition 1. The noises {ε_i}_{i=1}^n are i.i.d. with E(ε_1) = 0, Var(ε_1) = σ² > 0, and finite fourth moment; write κ_4 = E(ε_1⁴)/σ⁴.

We view this condition as a "minimal" one for the variance estimation problem, because it imposes no distributional assumption. The existence of the fourth moment is necessary for studying the mean squared error of a variance estimator.
We define two quantities related to the mean structure:

V(θ) = Σ_{i=1}^{n−1} (θ_i − θ_{i+1})², W(θ) = Σ_{i=1}^{n} (θ_i − θ_{i+1})², with θ_{n+1} = θ_1.

In fact, V(θ) and W(θ) measure the total variation of the mean vector in the ℓ2 norm. There is no change point in the sequence if and only if V(θ) = W(θ) = 0. With the convention that X_i = X_{n+i}, we define

T_k = (1/(2n)) Σ_{i=1}^{n} (X_{i+k} − X_i)²,

which plays a central role in our variance estimation framework. In fact, it can be considered as a circular version of the lag-k Rice estimator,

S_k = (1/(2(n−k))) Σ_{i=1}^{n−k} (X_{i+k} − X_i)².

In particular, S_1 is called the Rice estimator, introduced in [35].
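Under the definitions above (the 1/(2n) normalization of T_k is our assumption), both T_k and W(θ) are one-liners with circular shifts:

```python
import numpy as np

def T_k(X, k):
    # Circular lag-k statistic: the wrap-around differences make T_k
    # invariant under cyclic shifts; E[T_k] = sigma^2 for a constant mean.
    X = np.asarray(X, dtype=float)
    return np.sum((X - np.roll(X, -k))**2) / (2 * len(X))

def W_theta(theta):
    # Circular total variation: sum_i (theta_i - theta_{i+1})^2,
    # with the convention theta_{n+1} = theta_1.
    theta = np.asarray(theta, dtype=float)
    return np.sum((theta - np.roll(theta, -1))**2)
```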

An equivariant approach for variance estimation
The means and covariances of the T_k's can be calculated as follows.
Proposition 2.1. Under Condition 1, E(T_k) and Cov(T_k, T_ℓ) admit explicit expressions for 1 ≤ k, ℓ ≤ L(θ); in particular, E(T_k) is linear in k for k ≤ L(θ).

With Proposition 2.1, we rescale T_k and consider a regression model

Y_k = α + βk + e_k, k = 1, ..., K, (2.8)

where α = σ², and e_k is a noise term with mean zero and covariance Cov(e_1, ..., e_K) given in (2.9), where I_K is the K × K identity matrix and 1_K is a vector of length K with all entries equal to 1. As the Y_k's are easily calculated from the data, we can estimate the variance, i.e., the intercept α in the regression model (2.8), by the ordinary least squares (OLS) estimator, denoted by α̂_K and defined in (2.10); its exact risk and risk bound are given in (2.11) and (2.12). Theorem 2.1 gives the exact ℓ2 risk of the variance estimator α̂_K for 2 ≤ K ≤ L(θ)/2. Note that the risk depends on θ only through its total variation W(θ). When K > L(θ)/2, the exact risk also depends on information about the mean beyond the total variation W(θ); see Theorem 2.3 for more details. In the proof of Theorem 2.1 in the appendix, we show that the equality in (2.12) is achieved for a specific θ satisfying: K = L(θ), n/K is an even number, all segments are of the same length, and the segment means μ_j have the same absolute value but alternating signs. Therefore, the upper bound provided in (2.12) is tight.
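A minimal sketch of the OLS estimator α̂_K; we take Y_k = T_k as a working assumption, since the precise rescaling behind (2.8) is not reproduced here.

```python
import numpy as np

def eve(X, K=10):
    # OLS fit of the working regression Y_k = alpha + beta * k, k = 1..K;
    # the intercept estimates sigma^2. Here Y_k is taken to be T_k itself.
    X = np.asarray(X, dtype=float)
    n = len(X)
    ks = np.arange(1, K + 1)
    Y = np.array([np.sum((X - np.roll(X, -k))**2) / (2 * n) for k in ks])
    slope, intercept = np.polyfit(ks, Y, 1)
    return intercept
```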
There are three summands in the ℓ2 risk (2.11) of α̂_K. The first summand is the risk of the oracle estimator that knows the true mean. When K ≤ L(θ)/2, according to Proposition 2.1, the generalized least squares (GLS) estimator based on model (2.8) is obtained using the covariance matrix (2.9); clearly, it depends on θ through W(θ)/σ² in the covariance (2.9). In the special case W(θ) = 0, the covariance is compound symmetric, and the OLS and GLS estimators coincide [26]. Therefore, the first two summands in (2.11) cannot be reduced by any linear unbiased estimator based on {Y_k}_{k=1}^K. We will elaborate on the related minimax theory in subsection 2.5.
We may also calculate the means and covariances of the S_k's.

Proposition 2.2. Under Condition 1 with τ_J = n and E(ε_1³) = 0, the means and covariances of the S_k's admit explicit expressions analogous to those in Proposition 2.1.
To the best of our knowledge, Müller and Stadtmüller [28] first constructed variance estimators via a regression approach based on the S_k's. They studied variance estimation and tests for jump points in nonparametric estimation under an asymptotic setting L(θ)/n → c as n → ∞.

Remark. The condition τ_J = n in Proposition 2.2 means that, when studying the properties of the S_k's, we consider the classical change-point model, where the first segment is [1, τ_1] and the last segment is [τ_{J−1} + 1, n]. Compared with the T_k's, the mean and covariance structure of the S_k's is more complex. Moreover, Proposition 2.2 requires the additional condition E(ε_1³) = 0, i.e., zero skewness. The following proposition gives a precise comparison of the OLS estimators based on the T_k's and the S_k's.
Proposition 2.3. Assume Condition 1, E(ε_1³) = 0, and τ_J = n. Let α̃_K be the OLS estimator obtained by using S_k in place of T_k. Then α̃_K is unbiased when 2 ≤ K ≤ L(θ).
We call α̃_K the Müller-Stadtmüller (MS) estimator. As an immediate consequence of Theorem 2.1 and Proposition 2.3, the difference between the variances of α̂_K and α̃_K can be computed explicitly when 2 ≤ K ≤ L(θ)/2.
It follows that α̂_K has the smaller variance if θ_1 = θ_n and K ≥ 7, and α̃_K has the smaller variance if (θ_1 − θ_n)² is sufficiently large. So the two estimators often perform similarly, which is also verified by our numerical studies. In this paper, we aim to derive nonasymptotic and exact risk bounds for the variance estimators, which seems too complicated using the S_k's. Therefore, we focus on the T_k's subsequently and introduce the equivariant framework in the next subsection.

Equivariant unbiased estimation
Geometrically, we can embed the index set [n] = {1, ..., n} into the unit circle S¹ via the map π_n(i) = (cos(2πi/n), sin(2πi/n)). This embedding carries a natural action of Z_n, the cyclic group of order n, whose unit element 1 ∈ Z_n maps S¹ to itself via a rotation by an angle 2π/n. This group action naturally induces a group action of Z_n on the sample space R^n, where the unit element 1 ∈ Z_n maps an n-vector (X_1, ..., X_n)^⊤ to (X_2, ..., X_n, X_1)^⊤. There is another way to represent this group action, via n × n circulant matrices. Define C_k as the circulant matrix whose (i, j) entry is 1 if j − i ≡ k (mod n) and 0 otherwise. Again, we may treat the subscript k in C_k as a number modulo n. It is easy to verify that C_k C_ℓ = C_{k+ℓ} holds under standard matrix multiplication, so the matrices {C_k : k ∈ Z_n} form a group isomorphic to Z_n. Under this isomorphism, the group action of Z_n on R^n can be represented by the matrix multiplication X ↦ C_k X. Note that both the parameter space of the mean vector, Θ, and the sample space, X, are R^n for the change-point model. An estimator θ̂ of the mean vector θ is called equivariant if and only if C_k θ̂(X) = θ̂(C_k X) for all k, i.e., the estimation procedure commutes with the group action. For the problem of variance estimation, as the group action does not affect the value of the variance parameter σ², a variance estimator σ̂² is equivariant (or simply invariant) if σ̂²(C_k X) = σ̂²(X) for all k. In this sense, T_k is an equivariant version of S_k because the values of the T_k's remain the same under the group action. Consequently, we have

Proposition 2.4. Under Condition 1, α̂_K is an equivariant and unbiased variance estimator for 2 ≤ K ≤ L(θ).
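The cyclic shift X ↦ C_kX is simply np.roll, so the invariance of T_k can be checked numerically; a small sanity-check sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal(50)

def T(X, k):
    # circular lag-k statistic (normalization assumed, as before)
    return np.sum((X - np.roll(X, -k))**2) / (2 * len(X))

# np.roll(X, 3) plays the role of C_3 X: a rotation of the circle by 3.
for k in range(1, 5):
    assert np.isclose(T(X, k), T(np.roll(X, 3), k))  # T_k(C_3 X) = T_k(X)
print("T_k is invariant under the cyclic group action")
```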
We consider the class of quadratic estimators of the form σ̂²_A = X^⊤AX with a symmetric matrix A. Clearly, T_1, ..., T_L and their linear combinations are quadratic estimators. It turns out that any equivariant unbiased quadratic variance estimator for the model class Θ_L must be a linear combination of T_1, ..., T_L, as characterized by the following theorem.

Theorem 2.2. The set of all equivariant unbiased quadratic variance estimators for the model class Θ_L is Q_L = {Σ_{k=1}^L c_k Y_k : Σ_{k=1}^L c_k = 1, Σ_{k=1}^L k c_k = 0}.
Interestingly, Q_2 consists of only one estimator, namely α̂_2 = 2Y_1 − Y_2. As a corollary of Theorems 2.1 and 2.2, we have

Corollary 2.1. The OLS estimator α̂_2 = 2Y_1 − Y_2 is the unique equivariant unbiased quadratic variance estimator for the model class Θ_2. Its variance follows from Theorem 2.1 with K = 2.

Before we conclude this subsection, we point out that it is also possible to characterize the unbiased quadratic estimators over the class of classical change-point models Θ^c_L defined in (2.7). It turns out this characterization is much more complicated than Theorem 2.2. Furthermore, the variance of an unbiased X^⊤AX over Θ^c_L depends on the mean vector θ in a more complicated way. These observations give us another motivation to consider the equivariant estimators over the larger class Θ_L. We discuss the unbiased estimators over Θ^c_L in more detail in Appendix D.
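The Θ_2 estimator of Corollary 2.1 is easy to code; as before, taking Y_k = T_k is a normalization assumption on our part.

```python
import numpy as np

def alpha_2(X):
    # The unique equivariant unbiased quadratic estimator over Theta_2:
    # alpha_hat_2 = 2*Y_1 - Y_2. The weights c = (2, -1) satisfy the
    # constraints of Theorem 2.2: sum c_k = 1 and sum k*c_k = 0.
    X = np.asarray(X, dtype=float)
    n = len(X)
    Y1 = np.sum((X - np.roll(X, -1))**2) / (2 * n)
    Y2 = np.sum((X - np.roll(X, -2))**2) / (2 * n)
    return 2 * Y1 - Y2
```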

Minimax risk
Theorem 2.2 concludes that all equivariant unbiased quadratic estimators for the model class Θ_L are linear combinations of Y_1, ..., Y_L, including the OLS estimator studied in subsection 2.3. A natural question is whether the OLS estimator is optimal, and if not, how far it is from an optimal estimator. In this subsection, we answer this question from the perspective of minimax theory.
Consider the class Q_L of all equivariant unbiased quadratic estimators over the model class Θ_L. For any estimator σ̂², define the ℓ2 risk, normalized by a factor of σ⁴,

r(σ̂²) = E(σ̂² − σ²)²/σ⁴.

This risk is scale invariant by definition. As we will show soon, for a fixed model (θ, σ²), the risk of the optimal estimator depends on the minimal segment length L(θ) and the ratio W(θ)/(nσ²). Therefore, we consider the model class

Θ_{L,w} = {(θ, σ²) : L(θ) ≥ L, W(θ)/(nσ²) ≤ w}

in our minimax analysis, where the two parameters L and w bound these two quantities respectively. Define the minimax risk of all equivariant unbiased estimators in Q_L over the model class Θ_{L,w} as

r_{L,w} = inf_{σ̂² ∈ Q_L} sup_{(θ,σ²) ∈ Θ_{L,w}} r(σ̂²). (2.14)

We can solve the minimax problem for the case L = 2 as a simple corollary of Theorems 2.1 and 2.2.
Corollary 2.2 gives an elegant minimax solution for the broadest model class considered in this paper. At the level L = 2, the OLS estimator is optimal, no matter what value w takes. Intuitively, as L grows and the model class shrinks, we can borrow more information from neighbors because of the piecewise constant mean structure, and obtain a lower minimax risk. Nevertheless, the minimax estimator and the exact risk are difficult to find for L ≥ 3; we will provide instead both lower and upper bounds on the minimax risk. We first calculate the risk of any equivariant unbiased estimator in Q_L as a quadratic form in c, given in (2.15), where the matrix G(θ) is defined entrywise in (2.16). As shown in the proof of Proposition 2.5, the quadratic form in (2.15) is positive definite on the constrained linear space in which c lies. Therefore, we can minimize the risk (2.15) to get the optimal solution in Q_L for any model in Θ_{L,w}, putting aside the fact that the solution may depend on unknown parameters. Because all estimators in Q_L are linear combinations of the Y_k's, they are also linear estimators of the intercept in model (2.8). It is not surprising that the optimization problem (2.15) has the same optimal solution as the least squares problem (2.8). We state the result formally below.

Proposition 2.5. There is a unique solution to the optimization problem: minimize r(σ̂²_c) subject to the constraints Σ_{k=1}^L c_k = 1 and Σ_{k=1}^L k c_k = 0. Let c_{θ,σ²} denote the minimizer for a model (θ, σ²) ∈ Θ_{L,w}.

By minimizing (2.15) under the linear constraints, we can easily find the optimal c and the corresponding risk for an individual model (θ, σ²) ∈ Θ_{L,w}. Nevertheless, we see from (2.16) that the value of G_{kℓ} depends on Σ_{i=1}^n (θ_i − θ_{i+k+ℓ})², which is not a function of W(θ) when k + ℓ > L(θ). Thus, there is no simple way to characterize the behavior of G(θ) for all models in Θ_{L,w}. As a result, identifying the minimax estimator and the minimax risk is a highly nontrivial problem.
In Theorem 2.4, we provide both lower and upper bounds on the minimax risk. We first introduce the main ideas and some necessary notation. We consider the OLS and GLS estimators and their risks over the model class to bound the minimax risk. For OLS, formula (2.12) in Theorem 2.1 implies the upper bound (2.17) on the minimax risk, which depends on w. For GLS, we consider a smaller model class Θ_{2L,w}, over which the GLS estimator in Q_L depends on θ only through W(θ)/(nσ²). Specifically, letting Σ_{L,w} be the covariance matrix (2.9) with K = L and W(θ)/(nσ²) = w, we define α̂_{L,w} as the GLS estimator based on (2.8) and the covariance matrix Σ_{L,w}.
The maximal risk of the GLS estimator α̂_{L,w} over Θ_{2L,w} can be derived, offering a lower bound on the minimax risk. Finally, we study a GLS estimator based on an upper bound of the covariance structure (2.9) and its maximal risk over Θ_{L,w}, which leads to a minimax upper bound different from (2.17).
Let {D_k} be the sequence defined recursively by D_k = (2 + λ)D_{k−1} − D_{k−2}, with initial values D_0 = 1 and D_1 = 1 + λ (see Lemma C.2), and define the function g_L(λ) in (2.18) through V^{−1}_{L,λ}[1,1], the top left entry of the 2 × 2 matrix V^{−1}_{L,λ}.

Theorem 2.4. Let r_{L,w} be the minimax risk defined in (2.14), and let g_L(·) be the function defined in (2.18). For the subclass Θ_{2L,w}, the GLS estimator α̂_{L,w} ∈ Q_L is minimax with risk g_L(2w). The minimax risk on the model class Θ_{L,w} satisfies (2.17) and the upper bound (2.19).

The function g_L(·) in (2.18) is defined through the sequence {D_k}. Although an explicit expression for D_k, and hence for g_L(·), can be derived, it is complicated and provides hardly any additional insight, so we choose not to present it. Instead, we characterize the behavior of g_L(·) around 0 in the following proposition.
This proposition, together with (2.17), shows that the excess minimax risk of the OLS estimator is bounded, with the bound vanishing as w → 0. As an immediate consequence, we have the following corollary.
Corollary 2.3. The OLS estimator α̂_L is asymptotically minimax under the condition w = o(1), i.e., its maximal risk over Θ_{L,w} approaches the minimax risk r_{L,w} as w → 0.

In Figure 1, we illustrate the minimax risk bounds discussed above. In particular, we plot the upper bounds given by OLS in (2.17) (labeled OLS-L) and by GLS in (2.19) (labeled GLS-L); (2.17) is tighter when w is small, and (2.19) gives a sharper bound when w is large. The two other lines in Figure 1, labeled OLS-2L and GLS-2L, show the risks of the OLS and GLS estimators over the smaller model class Θ_{2L,w}, as in (2.11) and (2.19). In particular, as stated in Theorem 2.4, the GLS-2L line, corresponding to g_L(2w), gives a lower bound of the minimax risk over Θ_{L,w}. All the curves are plotted over a wide range 0 ≤ w ≤ 0.8. For example, a model class Θ_{L,w} with w = 0.8 would include a model θ whose mean changes at a level of 2 standard deviations every 5 data points, or at a level of 4 standard deviations every 20 data points. In general, a large ratio W(θ)/(nσ²) indicates that either the magnitude of the mean changes is large or the mean changes frequently. In the former scenario, we may detect the obvious change points first to reduce the total variation W(θ), and then estimate the variance, which facilitates the detection of subtle change points. In the latter scenario, it would be difficult to identify all the change points simultaneously even if we knew the true variance. Therefore, it is reasonable to consider variance estimation for a class Θ_{L,w} with small or moderate w. Finally, we conclude that the OLS estimator α̂_K, defined in (2.10) and considered in Section 2.3, gives a simple and good solution to the variance estimation problem, especially for a model class Θ_{L,w} where w is not too big. We call α̂_K the equivariant variance estimator (EVE); its numerical performance is presented next.

Simulated data examples
We illustrate the performance of our method using simulated data. We consider three error distributions: the standard Gaussian distribution ε_i ∼ N(0, 1), a scaled t-distribution (a t distribution rescaled to unit variance), and a translated exponential distribution ε_i ∼ Exp(1) − 1, all of which have mean zero and variance one, with κ_4 = 3, 6, 9, respectively. Note that the exponential distribution is asymmetric with a nonzero third moment. We fix n = 1,000 and consider three mean structures. Specifically, we consider a null model without any change point in Scenario 1, a sparse mean model with few change points in Scenario 2, and a model with frequent changes in Scenario 3, as detailed below.
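The three error distributions can be generated as follows; matching κ_4 = 6 suggests 6 degrees of freedom for the t distribution, which is our inference rather than a stated choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Mean-zero, unit-variance errors; kappa_4 = E[eps^4].
eps_gauss = rng.standard_normal(n)                      # kappa_4 = 3
eps_t = rng.standard_t(df=6, size=n) * np.sqrt(2 / 3)   # kappa_4 = 6 (t_6 assumed)
eps_exp = rng.exponential(scale=1.0, size=n) - 1.0      # kappa_4 = 9, skewed
```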
To show the sensitivity of our method to the choice of K, we compare the performance of the EVE for K = 5, 10, 15, and 20 in Table 1. For the null model (Scenario 1), a larger K leads to better performance, as affirmed by Theorem 2.1; nevertheless, the improvement from using K larger than 10 is marginal. In contrast, in Scenario 3, where there are many change points, there is an upward bias when K is larger than 10. In Scenario 2, a larger K leads to slightly larger bias but smaller variance; in this case, our method is not sensitive to the choice of K. We observe that the standard errors of all estimators for the exponential and t distributions are larger than for the Gaussian distribution because their fourth moments are larger, which is consistent with Theorem 2.1. We see that the choice of K is crucial when the mean variation is large, as in Scenario 3. We develop a simple data-driven method to tune K over a given range, say 5 ≤ K ≤ 20. This tuning process chooses K = 10 with high probability (96.8%, 96.0%, and 95.2%) in S3-G, S3-T, and S3-E, respectively. In the first two scenarios, the choice of K is not crucial. Overall, the tuning method works well. In practice, we suggest plotting the first few Y_k's, e.g., Y_1, ..., Y_20, and checking whether there is an obvious change in the slope. If not, K = 10 seems a safe choice and can be used as a rule of thumb; otherwise, the tuning method can be used.

We compare the variance estimators introduced in Section 2.1 with the EVE. The simulation results are summarized in Table 2. The regression based estimators MS and EVE with K = 10 are labeled MS(K=10) and EVE(K=10), respectively, and the EVE with tuned K is labeled EVE. The estimators defined in (2.3), (2.4), and (2.5), and the oracle estimator (2.13), are labeled MAD, DK, Rice, and Oracle, respectively. We also report the relative efficiency of each estimator to the oracle (2.13) in Table 3. It is clear from the results that the regression based methods MS and EVE perform best among all estimators except the oracle, in all scenarios. The relative efficiency of the EVE and MS to the oracle is consistently low, and the tuning method works well. The MAD, DK, and Rice estimators are all seriously biased in some scenarios. In general, the MAD and DK estimators tend to be biased upward when the mean structure is complex, e.g., in S2-G and S3-G, and biased downward when the noise distribution is t or exponential, e.g., in S1-T and S1-E. The Rice estimator is immune to the error distribution, but is biased upward when the mean structure is complex, e.g., in Scenario 3. As illustrated by our theoretical results, the EVE and MS estimators perform similarly: the EVE is slightly better when θ_1 = θ_n, and the MS estimator is better in Scenario 3, where |θ_1 − θ_n| is large.
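A minimal Monte Carlo sketch of the comparison against the oracle estimator; the mean structure, scenario, and oracle form (2.13) are assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, reps = 1000, 10, 500
theta = np.zeros(n)
theta[300:700] = 1.0                  # a sparse-change mean (illustrative)

def eve(X, K):
    ks = np.arange(1, K + 1)
    Y = [np.sum((X - np.roll(X, -k))**2) / (2 * n) for k in ks]
    return np.polyfit(ks, Y, 1)[1]    # intercept of the OLS line

est_eve, est_oracle = [], []
for _ in range(reps):
    eps = rng.standard_normal(n)
    est_eve.append(eve(theta + eps, K))
    est_oracle.append(np.mean(eps**2))    # oracle: the true mean is known
print(f"EVE:    {np.mean(est_eve):.3f} ({np.std(est_eve):.4f})")
print(f"Oracle: {np.mean(est_oracle):.3f} ({np.std(est_oracle):.4f})")
```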

Error from real data
In real applications, the noise distributions are unknown and often far from being Gaussian, which makes the variance estimation even more challenging.
To illustrate the performance of different variance estimators, we use a SNP genotyping data set produced by the Illumina 550K platform, which is publicly available online. The log R ratio (LRR) sequence of the data set has mean zero except on a few short segments, called copy number variations (CNVs). We pick the LRR sequence of Chromosome 11 of the subject father, with 27,272 data points. As the CNVs are few and short in this data set, we treat all data points as random noise. We standardize the data to have mean zero and variance one. We use the same mean structures as before and draw the errors randomly from the standardized sequence. The results are shown in Tables 4 and 5. We observe that the performance of the EVE and MS estimators is similar to the oracle estimator and better than the other estimators.

Labor productivity
This example is motivated by [18]. We consider the variance estimation of the U.S. labor productivity of major sectors: manufacturing/durable (DUR), manufacturing/nondurable (NDUR), business (BUS), nonfarm business (NFBUS), and nonfinancial corporations (NFC). All the series range from 1987 Q1 to 2019 Q4 (with length 132). We aim to estimate the variance of the quarterly growth rates in percentages. The data are obtained from the U.S. Bureau of Labor Statistics (https://www.bls.gov/lpc/). The five series are plotted in Figure 2. It turns out there are no obvious change points for the last three sectors; for DUR and NDUR, we identify the change points and show their locations by vertical lines. The sample ACF plots (for DUR and NDUR, the ACF of the segment-wise demeaned series) are also included to show that the serial correlation is negligible for these data. We report the estimated standard deviations of the five series in Table 6. Besides the estimators introduced earlier, the sample standard deviation (SD) is also included for comparison. Furthermore, we report SD_s as a benchmark: SD_s is the sample standard deviation of the segmented series, which differs from SD for DUR and NDUR, and coincides with SD for the other three series. We find that SD may overestimate σ for DUR and NDUR, as it ignores the potential change points. DK often underestimates σ, possibly due to the non-Gaussian noise distribution. The MAD estimator seems to be unstable, with larger biases. The Rice estimator is similar to the proposed EVE (with data-driven choice of K), which provides the most reliable estimates. Overall, the EVE is very close to the benchmark SD_s, but without segmenting the series first. This is exactly what we propose to achieve: a reliable variance estimator before identifying the locations of the change points.

Discussion
The detection or segmentation procedures for change-point models often require prior knowledge of the variance, and it is common practice to estimate the variance as the first step of the analysis. We find that the regression based quadratic variance estimators, such as the MS estimator [28] and the EVE proposed in this work, perform better than other popular approaches. We show the ℓ2 risk explicitly for both the EVE and the MS estimator. There are a few interesting directions for future work. As a next step, it is natural to consider the change-point model where the observations are serially correlated. In this time series context, not only the marginal variance, but also the autocovariances and the long run variance are of critical importance in change-point analysis; it is desirable to construct easy-to-use yet accurate estimators of these quantities as well. The framework and ideas introduced in this paper will be indispensable for this direction of future research. As a referee pointed out, an estimator of W(θ) is automatically obtained from the estimator of the slope β in the regression model (2.8). A reliable estimate of W(θ) might be helpful for testing the existence of mean changes in the sequence, i.e., W = 0 versus W ≠ 0, especially when the changes are frequent and the noises are far from normal. Moreover, good estimates of W(θ) and κ_4 can lead to a decent approximation to the GLS estimator, which is a competitive estimator.

Appendix A: Proof of Theorem 2.1
We start this section with a lemma that facilitates our proofs of Propositions 2.1 and 2.2, and conclude with the proof of Theorem 2.1.
Proof of Lemma A.1. For k ≤ L(θ), there is at most one change point between i and i + k, and the first claim follows. For k ≤ L(θ)/2, there is at most one change point between i and i + 2k; hence at least one of θ_i − θ_{i+k} and θ_{i+k} − θ_{i+2k} is zero, and so is their product.
Under Condition 1, it is straightforward to obtain E(T_k); similarly, we obtain E(S_k). For the covariance part, we start with the following expansion.

Recall our convention that θ_{i+n} = θ_i. The second-to-last equality follows from the fact that Cov(ε_i, ε_{i′}) = 0 for i ≠ i′. As we will see in the next few lines, the second summand above, involving the third moment, cancels out when calculating the covariance structure of the T_k's, because of equivariance; for the S_k's, we need the additional condition E(ε³_{i′}) = 0 in order to obtain a neat formula. It follows from the last equation that, for k ≤ L(θ)/2, the covariance simplifies, where the last equality is implied by two facts, (A.4) and (A.5). In particular, (A.4) holds because of the equivariant formulation of T_k, and (A.5) follows from Lemma A.1. To summarize, for k ≤ L(θ)/2, by (A.3) and (A.6), the summands are nonzero only when the index sets overlap; it is straightforward to verify that the sum is the same when i + k = i′ + h. The computation of the covariances among the S_k's is similar, except that it requires the vanishing third moment condition, as the S_k's are not equivariant.
We need the following lemma to prove Theorem 2.1.
Proof of Lemma A.2. Denote by Z the K × 2 design matrix of the regression model (2.8). The covariance matrix of the OLS estimator is (Z^⊤Z)^{−1}Z^⊤HZ(Z^⊤Z)^{−1}, where H = Cov(e_1, ..., e_K), and its entries are straightforward to calculate.
The above equation follows from the fact that 1 is the first column of the matrix Z. To calculate S_3, we rewrite H in terms of the vectors η_k, where η_K = 1_K and, for k < K, η_k is the vector (0, ..., 0, 1, ..., 1)^⊤ with first K − k entries 0 and last k entries 1.
With the help of the equations above, we finally obtain S_3.
Taking the sum of S_1, S_2, and S_3, we obtain the conclusion of the lemma.

Proof of Theorem 2.1. The first conclusion (2.11) of Theorem 2.1 follows immediately from Lemma A.2. Now we prove (2.12). Denote by B_1 the matrix such that the OLS estimator α̂_K can be equivalently represented as α̂_K = X^⊤BX, where B = B_1 + B_1^⊤. By Lemma C.1, the variance of α̂_K can be expressed in terms of B. Let U be the K-dimensional upper triangular matrix with 1 on and above the diagonal, and 0 below the diagonal. Let l_j = τ_{j+1} − τ_j, and define the l_j-dimensional vector s_j whose last l_j − K entries are zero. The elements of B_1^⊤θ at the locations τ_j + 1, ..., τ_{j+1} are (μ_j − μ_{j−1})s_j/(2n). Define the operation ←· as arranging the rows of a matrix upside down; in particular, ←s_j is the upside-down version of the vector s_j. The elements of B_1θ at the same locations are (μ_j − μ_{j+1})←s_j/(2n). Note that the supports of s_j and ←s_j do not overlap if l_j ≥ 2K, and overlap completely if l_j = K, so the value of the inner product s_j^⊤←s_j varies according to the segment length l_j. It can be shown that the absolute value of the inner product is maximized when l_j = K. Taking the sum over all segments, the variance of the OLS estimator α̂_K is bounded from above. The calculation of the quadratic term is very similar to the proof of Lemma A.2, so we omit the details and directly state the result as the upper bound in (2.12).
Finally, we argue that the upper bound in (2.12) can be achieved. Suppose in model (2.2) that K = L(θ), J = n/K is an even number, all segments are of the same length, and the segment means μ_j have the same absolute value but alternating signs. Then the two inequalities in (A.8) become identities with |μ_j − μ_{j+1}| = √(W(θ)/J), and so does the inequality in (2.12).

Proof of Lemma B.1. σ̂²_A is equivariant if and only if σ̂²_A(X) = σ̂²_A(C_k X) for all C_k ∈ C_n and X ∈ R^n, where C_k is the circulant matrix defined in Section 2.4. A direct calculation shows σ̂²_A(C_k X) = X^⊤C_k^⊤AC_kX. Therefore, σ̂²_A is equivariant if and only if A = C_k^⊤AC_k for all C_k, which implies that A is a circulant matrix by a classic result in linear algebra, e.g., Theorem 5.20 in [14].
Proof of Lemma B.2. It is straightforward to show that E(σ̂²_A) = σ²·tr(A) + θ^⊤Aθ. Therefore, E(σ̂²_A) = σ² for all θ ∈ Θ_L if and only if tr(A) = 1 and θ^⊤Aθ = 0 for all θ ∈ Θ_L. Now we show that θ^⊤Aθ = 0 for all θ ∈ Θ_L if and only if Σ_{i,j∈Λ} a_{ij} = 0 for all Λ ∈ I_L. Let 1_Λ ∈ R^n be the vector with entries equal to 1 on the index set Λ and 0 otherwise. Note that 1_Λ ∈ Θ_L when Λ ∈ I_L, and 1_Λ^⊤A1_Λ = Σ_{i,j∈Λ} a_{ij}; this implies the "only if" part. For the other direction, we first show that Σ_{i,j∈Λ} a_{ij} = 0 for all Λ ∈ I_L implies two facts: (i) a_{ij} = 0 when L < |i − j| < n − L; (ii) Σ_{i∈Λ, j∈Λ′} a_{ij} = 0 for connected Λ and Λ′. Here we say that Λ, Λ′ ∈ I_L are connected if Λ and Λ′ are disjoint and Λ ∪ Λ′ ∈ I_L.
Proof of Theorem 2.2. For any estimator in Q_L, we can write it as σ̂²_A = X^⊤AX, where A = Σ_{k=1}^L c_kA_k with Σ_{k=1}^L c_k = 1 and Σ_{k=1}^L kc_k = 0. Each A_k is circulant, and so is A; therefore, σ̂²_A is equivariant by Lemma B.1. Moreover, tr(A) = Σ_{k=1}^L c_k tr(A_k) = 1, and it is easy to check that the sum of all entries in any principal submatrix of A indexed by some Λ ∈ I_L is zero. We conclude that σ̂²_A is also unbiased by Lemma B.2, and hence all estimators in Q_L are equivariant and unbiased. Conversely, if a quadratic estimator σ̂²_A is equivariant, then A is circulant by Lemma B.1. In the proof of Lemma B.2, we show that a_{ij} = 0 for all L < |i − j| < n − L if σ̂²_A is unbiased for the model class Θ_L. Therefore, A lies in the linear space spanned by the symmetric circulant matrices {I, C_k + C_{−k}, k = 1, ..., L}, and we may write A = b_0I + Σ_{k=1}^L b_k(C_k + C_{−k}). In the proof of Theorem 2.1, we show that the minimax risk is achieved when n is a multiple of 2L = 4.
We need a lemma before proving Theorems 2.3 and 2.4.

Lemma C.1. Assume the same conditions as Theorem 2.3, and write the unbiased and equivariant estimator as σ̂²_c = X^⊤A_cX = Σ_{k=1}^L c_kY_k. Then its risk can be expressed as a quadratic form in c.

Proof of Lemma C.1. By Theorem 2.2 and its proof, we consider an estimator of the form σ̂²_c = X^⊤A_cX. As c is a fixed vector in this proof, we write A for A_c to simplify notation. Note that all diagonal entries of A are 1/n, and A1 = 0.
We calculate the variance of a general estimator in Q_L via its second moment, in which the cross term satisfies E[4ε^⊤Aθ·ε^⊤Aε] = 0. Since ε^⊤Aθ·ε^⊤Aε is a homogeneous cubic polynomial in the ε_i's, all terms in this polynomial have expectation zero except the ones involving ε_i³; these terms can be collected via ε^{∘3}, the entry-wise cube of the vector ε, and their sum vanishes because the diagonal entries of A all equal 1/n and 1^⊤Aθ = (A1)^⊤θ = 0. Now we calculate the two summands in (C.2).
The omitted part has zero expectation. Therefore, the remaining terms can be evaluated via tr(A²), and the proof of the lemma is complete.
Proof of Theorem 2.3. We follow the notation of Lemma C.1 and write the risk as a quadratic function of the c_k's with coefficients depending on the mean θ. The only nontrivial part is ∥Aθ∥² = θ^⊤A²θ, so we now calculate A².
Note that when 0 < k ≤ L, combining (C.5) and (C.6), and using Σ_k c_k = 1 and Σ_k kc_k = 0, the first four terms in the last line cancel. Putting all terms together completes the proof of the first part of Theorem 2.4, in view of (C.12).
We now prove the second part of Theorem 2.4, i.e., the upper bound. As in the proof of Theorem 2.3, we consider an arbitrary σ̂²_c = X^⊤A_cX = Σ_{k=1}^L c_kY_k ∈ Q_L, but this time we write A_c = B_1 + B_1^⊤, where B_1 = (1/(2n)) Σ_{k=1}^L c_k(I − C_k). As shown in the proof of Theorem 2.1, it holds that ∥B_1θ∥² = ∥B_1^⊤θ∥² = W(θ)·c^⊤U_L^⊤U_Lc, and hence ∥Bθ∥² = ∥B_1θ + B_1^⊤θ∥² ≤ 4∥B_1θ∥² = 4W(θ)·c^⊤U_L^⊤U_Lc. Therefore, by Lemma C.1, on the class Θ_{L,w}, the risk of σ̂²_c is bounded by r(σ̂²_c) ≤ κ_4 − 1 + c^⊤(I_L + 4wU_L^⊤U_L)c. According to the proof of the first part of Theorem 2.4, we have min_{σ̂²_c ∈ Q_L} max_{(θ,σ²)∈Θ_{L,w}} r(σ̂²_c) ≤ g_L(4w), which establishes the upper bound.
We first calculate the difference tr(A_2²) − tr(A_1²), where the first term of the difference is due to the upper-left and bottom-right K × K blocks, and the second term comes from the upper-right and bottom-left blocks. The first term can be further calculated directly, where the first and last identities use the block structure. This completes the proof of Proposition 2.3 when K ≤ L(θ)/2.
In the other case, the difference between tr(A_2²) and tr(A_1²) remains the same as in the previous case. Combining these facts completes the proof.
Proof of Proposition D.1. The proof of Proposition D.1 is very similar to that of Lemma B.2, adapted to the model class Θ^c_L; we omit the details. The conditions (C1)-(C9) form a minimal set of conditions that imply the condition in Proposition D.1. The proof of their sufficiency is self-evident and is also skipped.

Appendix F: Circular Equivariance
Equivariance, or invariance, is an important concept in statistics, particularly within the realms of statistical estimation, hypothesis testing, and decision theory [10,22,23,6]. It refers to a property of statistical procedures or estimators that describes how they behave under certain transformations or symmetries of the data or parameters. Equivariant procedures are desirable when there are multiple ways to parameterize the data, or when certain statistical models exhibit symmetries. For instance, we can measure temperature in different units (Celsius or Fahrenheit), but this choice of units should not influence the statistical inference. When modeling a coin tossing process, it should not matter whether we choose π as the probability of heads or the probability of tails. Many summary statistics naturally exhibit invariance (e.g., sample correlation) or equivariance (e.g., sample proportion). We refer to the aforementioned textbooks for more examples.
In the literature, circular equivariance has received less attention due to the scarcity of circular data. Even in the classical book on circular data [11], equivariance is not emphasized. Nonetheless, recent research has delved into equivariant estimation for directional data, as seen in [25]. In our work, the natural space of the location parameter [n] is by default a subset of the real numbers rather than the unit circle. Embedding the parameter space [n] into the unit circle by the map π_n, as defined in Section 2.4, offers two distinct advantages. First, because of the different topological structures of the unit circle S¹ and the real line R¹, it requires two points instead of one to segment the circle into two parts; consequently, circular-based segmentation methods are more powerful in discovering short segments [32]. To the best of our knowledge, [32] is the pioneering attempt to explore a circular parameter space for change-point problems. The second advantage of this embedding, namely that it facilitates an elegant equivariant theory, remains relatively untouched in the literature. We demonstrate this benefit through the lens of variance estimation and anticipate further research exploring this direction in greater depth.

Fig 1. Lower (GLS-2L) and upper (OLS-L, GLS-L) bounds of the minimax risk r_{L,w} with respect to w. The left and right panels correspond to L = 10 and L = 15, respectively.

Fig 2. Time series plots and the ACF plots after segmentation.

Lemma C.2. For any integer k ≥ 1, let U_k be the upper triangular matrix with 1 on and above the diagonal, and assume λ ≥ 0. (i) Let D_k be the determinant of the matrix I_k + λU_k^⊤U_k; then D_k satisfies the recursion D_k = (2 + λ)D_{k−1} − D_{k−2}, with initial values D_0 = 1 and D_1 = 1 + λ. (ii) The cofactor of the (1, k)-th element of I_k + λU_k^⊤U_k is always −λ.

Proof of Lemma C.2. Perform two operations on I_k + λU_k^⊤U_k: subtract the (k − 1)-th row from the last row, and subtract the (k − 1)-th column from the last column. Expanding the resulting determinant yields the recursion.
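The recursion in Lemma C.2(i) can be verified numerically against the determinant definition; a small check, for illustration only:

```python
import numpy as np

def D_recursive(K, lam):
    # D_k = (2 + lam) * D_{k-1} - D_{k-2}, with D_0 = 1, D_1 = 1 + lam.
    D = [1.0, 1.0 + lam]
    for _ in range(2, K + 1):
        D.append((2 + lam) * D[-1] - D[-2])
    return D[K]

def D_determinant(K, lam):
    # D_k = det(I_k + lam * U_k^T U_k), with U_k upper triangular and
    # equal to 1 on and above the diagonal.
    U = np.triu(np.ones((K, K)))
    return np.linalg.det(np.eye(K) + lam * U.T @ U)

for K in range(1, 8):
    assert np.isclose(D_recursive(K, 0.3), D_determinant(K, 0.3))
print("recursion matches the determinant")
```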

Table 1
Average values of estimators with standard errors in parenthesis over 500 replicates.

Table 2
Average values of estimators with standard errors in parenthesis over 500 replicates.

Table 3
Estimated relative efficiency of each method to the oracle estimator based on 500 replicates.

Table 4
Average values of estimators with standard errors in parenthesis over 500 replicates.

Table 5
Estimated relative efficiency of each method to the oracle estimator based on 500 replicates.

Table 6
Variance estimation for the US labor productivity indices.
These two estimators are based on the lag-k Rice estimators S_k and their circular version T_k, respectively. Practically, the EVE is slightly preferred when the noises are skewed, as it does not require a vanishing third moment. Theoretically, it is easier to work with the T_k's because of the symmetric setup, and all equivariant unbiased quadratic variance estimators are linear combinations of the T_k's, as shown in Theorem 2.2. It is more difficult to characterize all unbiased quadratic variance estimators (without equivariance), which are not necessarily linear combinations of the S_k's. In conclusion, we recommend both the EVE and the MS estimator for variance estimation in change-point analysis.