Multiple Changepoint Detection with Partial Information on Changepoint Times

This paper proposes a new minimum description length procedure to detect multiple changepoints in time series data when some times are a priori thought more likely to be changepoints. This scenario arises with temperature time series homogenization pursuits, our focus here. Our Bayesian procedure constructs a natural prior distribution for the situation, and is shown to estimate the changepoint locations consistently, with an optimal convergence rate. Our methods substantially improve changepoint detection power when prior information is available. The methods are also tailored to bivariate data, allowing changes to occur in one or both component series.


Introduction
Changepoints, also called structural breaks or breakpoints, are times in a sequential record where the data abruptly shift in some manner (mean, variance, autocovariance, quantile, etc.). The primary goal of a retrospective multiple changepoint analysis, the case considered here, is to estimate the number of changepoints and their location times. Various approaches have been developed for independent data; good recent references include Fryzlewicz (2014), Pein, Sieling and Munk (2017), and the review paper Niu, Hao and Zhang (2016) (and the references therein). When the data are correlated, such as the monthly temperature records studied here, this feature can greatly impede changepoint detection; in fact, mean shifts can often be misattributed to positive correlation (Lund et al., 2007).
One simple way to detect multiple changepoints is to combine an at most one changepoint (AMOC) technique (say a CUSUM or likelihood ratio test) with a binary segmentation procedure, e.g., Shao and Zhang (2010); Aue and Horváth (2013); Fryzlewicz and Subba Rao (2014). Wild binary segmentation techniques usually improve upon ordinary binary segmentation methods (Fryzlewicz, 2014). Since estimating the optimal multiple changepoint configuration can be formulated as a model selection problem, penalized likelihood methods such as BIC (Yao, 1988) and its modifications (Zhang and Siegmund, 2007, 2012), and minimum description lengths (MDL), are also popular. In this paper, an MDL technique is developed that takes into account prior information on the changepoint numbers and locations. This scenario is shown to arise in the homogenization of temperature time series to account for gauge changes and station location moves.
The MDL principle (Rissanen, 1989) from information theory has been successfully applied in statistical model selection problems (Hansen and Yu, 2001). MDL penalties are the sum of penalties (i.e., description lengths, or code lengths) of all unknown model parameters. In the multiple changepoint literature, the seminal work of Davis, Lee and Rodriguez-Yam (2006) develops an MDL penalty for piecewise autoregressive (AR) processes. Here, the penalty is constructed by following certain automatic rules that assign different penalties to different parameter types: bounded integer parameters, unbounded integer parameters, and real-valued parameters. Since MDL penalties are not just simple multiples of the number of model parameters, they are believed superior to AIC and BIC penalties (a belief supported by simulations), and are shown consistent for changepoint estimation under infill asymptotics (Davis, Lee and Rodriguez-Yam, 2006; Davis and Yau, 2013). Following the automatic penalty rules, MDL methods have been extended to various time series structures, including GARCH processes (Davis, Lee and Rodriguez-Yam, 2008), periodic ARs (Lu, Lund and Lee, 2010), autoregressive moving-averages (Davis and Yau, 2013), and threshold ARs (Yau, Tang and Lee, 2015).
The main goal of this paper is to incorporate partial information on changepoint numbers and times into the MDL penalty, an aspect not readily handled by existing MDL methods. Indeed, this will require us to revisit information theory. The motivating example involves the climate homogenization (Caussinus and Mestre, 2004; Menne and Williams Jr, 2005) of monthly temperature records. Here, the aim is to detect abrupt mean shifts, which are often induced by artificial causes such as station relocations or gauge changes. Two types of a priori changepoint knowledge arise. First, metadata station history logs, which document the times of physical changes in the station, are sometimes available. Although metadata climate records are notoriously incomplete, and not all documented metadata times induce actual mean shifts in the series, climatologists believe that metadata times are more likely than non-metadata times to be changepoints. Second, when multivariate records exist for the same station, changepoints may affect component records simultaneously. For example, with monthly maximum and minimum temperature averages (called Tmax and Tmin, respectively), moving a station to a drier location can both increase daytime highs and reduce nighttime lows. While changepoints in either Tmax or Tmin can occur by themselves, climatologists believe that it is more likely for changepoints to occur in both component series at the same time (these are called concurrent shifts).
While metadata is typically only used to verify climate changepoint conclusions in hindsight, Sections 5 and 6 will show that use of metadata can improve detection power and the accuracy of estimated changepoint times. This benefit is not limited to climatological pursuits; in other areas such as biology, economics, and engineering, domain expert knowledge is often available; e.g., knowledge from previous experiments on possible copy number variation locations, or the impact of certain political policy or regime changes on financial series.
Of course, Bayesian methods account for a priori knowledge via the construction of prior distributions. From a Bayesian model selection perspective, the optimal model (i.e., multiple changepoint configuration) is the one with the highest posterior probability (Clyde and George, 2004). This maximum a posteriori (MAP) rule can be loosely viewed as a penalization method, where the posterior density is a penalized likelihood and the prior density is the penalty. Compared to frequentist approaches, one advantage of Bayesian posterior analysis is that it can also provide a measure of uncertainty for model parameters and changepoint locations. Bayesian approaches have been proposed for retrospective multiple changepoint detection; see Barry and Hartigan (1993); Chib (1998); Fearnhead (2006); Girón, Moreno and Casella (2007); Zhang and Siegmund (2007); Giordani and Kohn (2008); Fearnhead and Vasileiou (2009); Hannart and Naveau (2012). However, theoretical studies of the large sample performance of Bayesian methods are in general lacking; while Du, Kao and Kou (2016) study asymptotic consistency of changepoint locations, they only consider independent data.
More importantly, existing Bayesian changepoint approaches are typically derived under non-informative prior distributions; they rarely explicate how to incorporate real subjective prior knowledge. BIC-based changepoint detection methods cannot readily handle subjective prior information: from a Bayesian model selection perspective, BIC is a large sample approximation of the marginal likelihood. Thus, comparing models directly based on their BICs imposes an implicit assumption that the prior probabilities of the models are the same, which is not appropriate when one wants to incorporate metadata information.
The only exception to the above is Li and Lund (2015), which accounts for metadata in a univariate precipitation time series. That work was written for a climate audience and was largely void of statistical and technical detail. This paper complements that work by dealing with the statistical and technical issues. It has a different focus and content, aiming to develop a general MDL framework that can handle prior information on changepoint times in a wide range of changepoint problems. For example, multivariate series, which involve the more challenging problem of borrowing information across component series, are pursued. In this sense, Li and Lund (2015) is a special case of the current paper. This paper also includes a thorough investigation of the asymptotic consistency of the proposed methods.
Changepoint detection for multivariate data has received significant attention in recent years, e.g., Cho and Fryzlewicz (2015); Kirch, Muhsal and Ombao (2015); Preuss, Puchstein and Dette (2015); Ma and Yau (2016). In Davis, Lee and Rodriguez-Yam (2006), the automatic MDL is applied to multivariate AR series, where changepoints affect all component series. However, for many applications, a changepoint may not affect all component series. The automatic MDL does not directly accommodate this case, probably because it is unclear whether a change affecting all components should receive the same penalty as one that affects a subset of components. On the other hand, Bayesian approaches such as Zhang and Siegmund (2012) and Bardwell and Fearnhead (2017) can handle this problem, but only for independent data over time and components. Since these works are developed under non-informative prior distributions, they are not readily applicable to multivariate temperature homogenization, where concurrent changes in Tmax and Tmin should be encouraged.
In this paper, a new class of flexible MDL methods is proposed that incorporates domain experts' a priori knowledge for multiple changepoint detection, in both univariate and multivariate time series. Multiple changepoint configurations are reformulated as vectors of zero/one indicators, thus permitting natural construction of subjective prior distributions, with straightforward hyperparameter elicitation. To account for correlation in time and across components, AR processes for univariate data, and vector autoregressive (VAR) processes for multivariate data are employed. Our MDL method is termed a Bayesian MDL (BMDL) because it can be viewed as an empirical Bayes model selection approach. While our main focus is to improve and generalize conventional MDL changepoint detection approaches, to the best of our knowledge, this paper is the first Bayesian multiple changepoint work to establish asymptotic consistency with correlated observations. Under infill asymptotics, the estimated changepoint locations are shown to converge in probability to their true values; moreover, estimators of the number of changepoints and model parameters such as regime means and AR coefficients are also consistent.
We choose to work within the MDL framework rather than extending BIC-based approaches due to the following considerations. First, the BIC approximation to the marginal likelihood is usually precise only up to an O(1) error. Although it is asymptotically consistent for model selection, it often does not work well when the sample size is small or moderate (Grünwald, 2007). Second and perhaps more importantly, in the changepoint detection literature, MDL penalties have been demonstrated to be more flexible and to have better empirical performance than BIC penalties (Davis, Lee and Rodriguez-Yam, 2006). Therefore, MDL methods are exclusively pursued here.
The rest of this paper is organized as follows. Section 2 briefly reviews MDL principles. Section 3 develops a BMDL penalty to detect mean shifts in univariate series. This work incorporates metadata, while allowing for a confounding seasonal mean cycle and AR errors. Section 4 extends the BMDL to the multivariate setting, where Tmax and Tmin series are modeled jointly. Section 5 presents simulation examples. Section 6 moves to an application to 114 years of monthly temperatures from Tuscaloosa, Alabama. Section 7 studies the frequentist large sample performance of the univariate BMDL. Comments close the paper in Section 8. Technical results and proofs are delegated to an appendix.

A Brief Review of MDL
In information theory, a code length is the number of binary storage units required to transmit a random number or code. To reduce storage costs, one wants to assign shorter (longer) code lengths to common (rare) outcomes. Competing probability models can be compared by their code lengths; the true data generating distribution (i.e., the true model) should have the shortest expected code length. The MDL principle (Rissanen, 1989) states that given the observed data, the model with the shortest code length is optimal.
For a discrete random variable X with probability mass function f(·), Shannon (1948) shows that the encoding with code length

L(X) = − log2 f(X)    (1)

has the shortest expected code length. The existing MDL approach for multiple changepoint detection (Davis, Lee and Rodriguez-Yam, 2006) is developed under the automatic rules that the code length of a positive random integer X bounded above by N is log2(N), and that of an unbounded positive random integer X is log2(X). The former rule implies a uniform distribution over the set {1, 2, ..., N}, which leads to the code length L(X) = − log2(1/N) = log2(N), while the latter implies an improper power law distribution with probability mass function f(X) ∝ 1/X. For a continuous random variable, say X ∈ R^k with density function f(·), after discretizing each dimension into equal cells of size δ (often viewed as the machine precision), one can mimic the discrete case to obtain L(X) = − log2{f(X)δ^k} = − log2 f(X) − k log2(δ). Because k and δ do not vary with X, the term −k log2(δ) does not affect comparisons between different outcomes of X and is hence often omitted. Thus, the MDL for a continuous variable can also be expressed as in (1). In the rest of this paper, the natural logarithm is substituted for the base two logarithm; this does not affect model comparisons since the two differ only by the multiplicative constant log(2).

Now suppose that a dataset X = (X1, ..., XN)', believed to be generated from a certain parametric model M with density f(X | θ, M), is to be transmitted along with a possibly unknown parameter θ ∈ Θ. As reviewed in Hansen and Yu (2001), two types of MDL approaches, the two-part MDL and the mixture MDL, are commonly used.
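These encoding rules can be checked numerically. The following is a minimal sketch (the function names are ours) showing that the uniform-encoder rule for a bounded integer agrees with the Shannon code length of a uniform outcome:

```python
import math

def code_length_bounded(N):
    # Automatic MDL rule: an integer known to lie in {1, ..., N} costs log2(N) bits.
    return math.log2(N)

def code_length_unbounded(x):
    # Automatic MDL rule: an unbounded positive integer x costs log2(x) bits.
    return math.log2(x)

def shannon_code_length(p):
    # Shannon encoding: an outcome with probability p gets code length -log2(p) bits.
    return -math.log2(p)

# A uniform distribution over {1, ..., N} gives every outcome the same cost,
# matching the bounded-integer rule:
N = 1024
assert shannon_code_length(1.0 / N) == code_length_bounded(N)  # both 10 bits
# Under the power-law encoder f(x) ~ 1/x, larger integers cost more:
assert code_length_unbounded(8) < code_length_unbounded(1024)
```

The power-law rule mirrors the intuition stated above: rarer (larger) integers receive longer codes.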

Two-part MDLs
The two-part MDL, also called the two-stage MDL, considers the transmission of X and θ in two steps. If both the sender and receiver know θ, the MDL is simply the code length of the data, L(X | θ, M) = − log f(X | θ, M). Here, notations such as L(· | ·) are analogous to the usual conditional distribution notations that emphasize dependence. Should θ also be unknown to the receiver, an additional cost of L(θ | M) is incurred in transmitting it. Hence, the two-part MDL is L(X, θ | M) = L(X | θ, M) + L(θ | M). Suppose that L(X, θ | M) is minimized at θ̂, an estimator of θ based on the data X. If θ is a k-dimensional continuous parameter and θ̂ is a √N-consistent estimator, then one can set the machine precision to be δ = c/√N for some constant c > 0, yielding L(θ̂ | M) = −k log(δ) = k log(N)/2 − k log(c), which does not depend on θ. Hence, the maximum likelihood estimator (MLE) minimizes L(X, θ | M), and the two-part MDL coincides with the BIC (Schwarz, 1978). In fact, θ̂ need not be the MLE; any √N-consistent estimator is justifiable. Again the constant term k log(c) can be dropped and the remaining code length L(θ̂ | M) = k log(N)/2 is adopted by Davis, Lee and Rodriguez-Yam (2006) as the automatic MDL rule for a k-dimensional continuous parameter.
If there exists a discrete set of candidate models, to account for model uncertainty, the two-part MDL can be modified to include an additional code length for the model M, i.e.,

L(X, θ̂, M) = L(X, θ̂ | M) + L(M),    (2)

where θ̂ is model dependent, L(M) = − log{π(M)}, and π(M) is the prior distribution over the model space. The model with the smallest MDL in (2) is deemed optimal.
All existing automatic MDL methods for multiple changepoint detection are based on two-part MDLs. However, for a finite sample size N, the two-part MDL is problematic when the dimension of θ changes across models, as in the multiple changepoint case. Consider a setting of two competing models M1 and M2, whose parameters θj are kj-dimensional continuous parameters, for j = 1, 2. Note that the code length difference for the parameters, L(θ̂1 | M1) − L(θ̂2 | M2), contains the term (k1 − k2){log(N) − 2 log(c)}/2. This term, and hence also L(X, θ̂1, M1) − L(X, θ̂2, M2), could be either positive or negative depending on N and the arbitrary constant c. One cannot judge either model superior without knowledge of c. Of course, this issue does not conflict with the asymptotic consistency of BIC or automatic MDLs: as N increases, log(N) dominates the constant log(c). Mixture MDLs, reviewed next, do not suffer from such a problem for a finite N.
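The dependence on the arbitrary constant c can be illustrated numerically; the sample size and dimensions below are ours:

```python
import math

def parameter_length_gap(k1, k2, N, c):
    # Contribution of the machine-precision constant c to the code length
    # difference L(theta1 | M1) - L(theta2 | M2), when the precision is
    # delta = c / sqrt(N): (k1 - k2) * {log(N) - 2*log(c)} / 2.
    return (k1 - k2) * (math.log(N) - 2.0 * math.log(c)) / 2.0

N, k1, k2 = 100, 5, 3
gap_small_c = parameter_length_gap(k1, k2, N, c=1.0)    # positive: M2 looks cheaper
gap_large_c = parameter_length_gap(k1, k2, N, c=100.0)  # negative: M1 looks cheaper
assert gap_small_c > 0 > gap_large_c
```

With a fixed N, the sign of the gap flips as c varies, so the comparison of the two models is indeed undetermined without knowledge of c.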

Mixture MDLs
By Hansen and Yu (2001), the mixture MDL is defined to be the negative logarithm of the marginal likelihood f(X | M), i.e., L(X | M) = − log f(X | M), where f(X | M) = ∫ f(X | θ, M) π(θ | M) dθ averages the likelihood f(X | θ, M) over θ under its prior density π(θ | M). If this prior distribution depends on an unknown hyper-parameter ψ, then a two-part MDL can be used to account for the additional cost needed to transmit ψ. In this case, the overall mixture MDL, for any √N-consistent estimator ψ̂ of ψ, is L(X, ψ̂ | M) = − log f(X | ψ̂, M) + L(ψ̂ | M). The mixture MDL for the model M is thus L(X, ψ̂, M) = L(X, ψ̂ | M) + L(M), which is related to empirical Bayes (EB) approaches (Carlin and Louis, 2000). If the prior probabilities of two models are the same, i.e., π(M1) = π(M2), and the hyper-parameter ψ is transmitted under the uniform encoder π(ψ | Mj) ∝ 1 for j = 1, 2, then the difference of the two mixture MDLs, L(X, ψ̂1, M1) − L(X, ψ̂2, M2), equals the logarithm of their Bayes factor BF(M2 : M1) (Kass and Raftery, 1995). Similarly, in EB settings, while the estimator ψ̂ is often chosen to maximize the marginal likelihood f(X | ψ, M), other consistent estimators (moments, for example) can be used.
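To make the marginal-likelihood computation concrete, here is a minimal sketch for a toy case: iid normal data with known variance and a conjugate normal prior on a single mean (our simplified stand-in for the regime-mean integration used later; the function name and parameter values are ours):

```python
import numpy as np

def mixture_mdl_normal_mean(x, sigma2=1.0, nu=5.0):
    # Mixture MDL L(x | M) = -log f(x | M) for iid N(mu, sigma2) data with
    # the conjugate prior mu ~ N(0, nu * sigma2). The marginal f(x | M) is
    # multivariate normal with covariance sigma2 * I + nu * sigma2 * J,
    # where J is the all-ones matrix.
    n = len(x)
    cov = sigma2 * np.eye(n) + nu * sigma2 * np.ones((n, n))
    sign, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    log_marginal = -0.5 * (n * np.log(2 * np.pi) + logdet + quad)
    return -log_marginal

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)
mdl_shift = mixture_mdl_normal_mean(x)
# -log f of the same data under a fixed N(0, I) model, for comparison:
mdl_null = 0.5 * (len(x) * np.log(2 * np.pi) + x @ x)
# Data with a clearly nonzero mean are encoded more cheaply by the model
# that integrates over the unknown mean:
assert mdl_shift < mdl_null
```

Because the prior and likelihood are conjugate, the integral over the mean is available in closed form, exactly the mechanism exploited by the BMDL construction in Section 3.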

Bayesian Minimum Description Lengths for a Univariate Time Series
Consider a univariate time series X1:N = (X1, ..., XN)' with a seasonal mean cycle with fundamental period T. For monthly data, T = 12. A model with autoregressive errors describing this situation writes the observation at time t as the sum of a seasonal mean, a regime mean that shifts at the changepoint times, and an AR error: Xt = s_v(t) + μ_r(t) + εt. Here, v(t) = t − T⌊(t − 1)/T⌋ is the season corresponding to time t, where ⌊x⌋ is the largest integer less than or equal to x, and r(t) indexes the regime in force at time t. The seasonal means s = (s1, ..., sT)' are unknown. The errors {εt}, t = 1, ..., N, are a causal zero mean AR process. Here, we assume that the AR order p is known; if unsure, picking a slightly larger value for p is advised. The AR coefficients φ = (φ1, ..., φp)' and the white noise variance Var(Zt) = σ2 are assumed unknown. For likelihood computations, following Davis, Lee and Rodriguez-Yam (2006), white noises are assumed iid normal. This can be justified as a quasi-likelihood approach; furthermore, in climate applications, monthly averaged temperatures are approximately normally distributed (Wilks, 2011).
Following Li and Lund (2015), the multiple changepoint configuration (m; τ) is reformulated as an (N − p)-dimensional vector of zero/one indicators: η = (ηp+1, ..., ηN)'. Here, ηt = 1 indicates that time t is a changepoint in this model; ηt = 0 means that time t is not a changepoint. The total number of changepoints in model η is thus m = Σ_{t=p+1}^{N} ηt. Our idea is to apply the mixture MDL to the continuous parameter μ, whose dimension varies across models, and to use the two-part MDL for the parameters s, σ2, φ, and the model η. In the rest of this section, Subsection 3.1 introduces our priors on η and μ, Subsection 3.2 derives the BMDL formula (17), and Subsection 3.3 concludes with computational strategies. Asymptotic studies are included in Section 7.

Prior specifications
Our prior distribution for the changepoint model η assumes that, in the absence of metadata, each time t has an equal probability ρ of being a changepoint, independently of all other times, i.e., ηt | ρ ~ Bernoulli(ρ) independently for t = p + 1, ..., N. This independent Bernoulli prior has been used in previous Bayesian multiple changepoint detection works (Chernoff and Zacks, 1964; Yao, 1984; Barry and Hartigan, 1993). From a hidden Markov perspective, this prior is equivalent to τr | τr−1 ~ Geometric(ρ) for r = 1, ..., m (Fearnhead and Vasileiou, 2009), and thus is a special case of the negative Binomial prior (Hannart and Naveau, 2012). The uniform prior π(η) ∝ 1 adopted in Du, Kao and Kou (2016) is a special case of the Bernoulli prior with ρ = 0.5. For applications where knowledge beyond metadata is unavailable, an iid prior on {ηt} seems reasonable. In other applications, π(η) is allowed to have different success probabilities in different regimes (Chib, 1998); correlation across different changepoint times can also be achieved using Ising priors (Li and Zhang, 2010).
To account for uncertainty in the success probability ρ, a hyper-prior is placed on it. Barry and Hartigan (1993) let ρ have a uniform prior on the interval (0, ρ0), where ρ0 < 1. For additional flexibility, we use the Beta distribution ρ ~ Beta(a, b), where a, b > 0 are fixed hyper-parameters. The Beta-Binomial hierarchical priors in (4) and (5) are widely used in Bayesian model selection (Scott and Berger, 2010), and have been adopted to detect changepoints (Giordani and Kohn, 2008; Li and Lund, 2015). Due to conjugacy, the marginal prior density of η has the closed form π(η) = β(a + m, b + N − p − m)/β(a, b), with β(·, ·) denoting the Beta function. Note that the Beta-Binomial density in (6) depends on η only through m, the total number of changepoints in the multiple changepoint model η. In common changepoint detection problems, changepoints are usually relatively sparse (m ≪ N). Suppose our prior belief on ρ reflects this sparsity assumption, say, E(ρ) = a/(a + b) ≤ 1/2, i.e., a ≤ b. Then (6) decreases as m increases until m reaches a relatively large value (at least (N − p)/2). Thus, the Beta-Binomial prior can be viewed as a prior preference for smaller models, or equivalently, a penalty on the number of changepoints.
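This penalty effect can be verified numerically. The sketch below evaluates the Beta-Binomial marginal prior (the number of candidate times and hyper-parameter values are illustrative):

```python
from math import lgamma

def log_beta(a, b):
    # log of the Beta function via log-Gamma, for numerical stability.
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_prior_eta(m, n_times, a=1.0, b=199.0):
    # Log marginal Beta-Binomial prior of a configuration with m changepoints
    # among n_times = N - p candidate times:
    # pi(eta) = Beta(a + m, b + n_times - m) / Beta(a, b).
    return log_beta(a + m, b + n_times - m) - log_beta(a, b)

n_times = 1200  # e.g., a century of monthly data
# With a <= b, the prior decreases in m over the sparse range, acting as a
# penalty on the number of changepoints:
assert log_prior_eta(0, n_times) > log_prior_eta(3, n_times) > log_prior_eta(10, n_times)
```

Successive ratios of the prior are (a + m)/(b + n_times − m − 1), which stay below one while m is small, matching the monotonicity claim above.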
For hyper-parameter choices, an objective Bayesian option (Girón, Moreno and Casella, 2007) is a = b = 1, which implies that marginally, the number of changepoints m has a uniform prior on the set {0, 1, ..., N − p}, and all models containing the same number of changepoints have the same prior probabilities. The Beta-Binomial prior can also be tuned to accommodate subjective knowledge from domain experts. For temperature homogenization, Mitchell (1953) estimates an average of six station relocations and gauge changes per century in United States temperature series; this long-term rate of 0.005 changepoints per month can be produced with a = 1 and b = 199; with these parameters, E(ρ) = a/(a + b) = 0.005.
This prior is now modified to accommodate metadata. Suppose that during the times {p + 1, ..., N}, there are N^(2) documented times (times listed in the metadata) and N^(1) = N − p − N^(2) undocumented times. For notation, all quantities superscripted with (1) refer to undocumented times; quantities superscripted with (2) refer to documented times. Following Li and Lund (2015), we posit that the undocumented times have a Beta-Binomial(a, b^(1)) prior, and independently, the documented times have a Beta-Binomial(a, b^(2)) prior. To make the metadata times more likely to induce true mean shifts, we impose b^(2) < b^(1), so that E(ρ^(2)) > E(ρ^(1)). For monthly data, default values are a = 1, b^(1) = 239, and b^(2) = 47, making E(ρ^(1)) ≈ 0.0042, i.e., an average of one changepoint about every 20 years for non-metadata times, and E(ρ^(2)) ≈ 0.0208, i.e., on average, one changepoint every 4 years for metadata times. In other words, a priori, a documented time is roughly five times more likely to be a changepoint than an undocumented time. For different problems, one may need to modify b^(1) and b^(2) to reflect specific domain knowledge. Our previous paper (Li and Lund, 2015) gives a detailed sensitivity analysis of the choice of Beta-Binomial hyper-parameters. It suggests that changepoint detection results are relatively stable under a range of E(ρ^(2))/E(ρ^(1)) values. For applications that lack any subjective information, the non-informative Beta-Binomial(1, 1) prior can serve as a default choice. In this paper, this prior is referred to as "oBMDL", with "o" standing for objective. Empirical comparisons will be provided in the univariate simulation examples in Section 5.1.
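The elicitation arithmetic behind these defaults can be checked directly:

```python
def prior_mean_rho(a, b):
    # Mean of a Beta(a, b) distribution.
    return a / (a + b)

# Default hyper-parameters for monthly data, as stated above:
a, b_undoc, b_doc = 1.0, 239.0, 47.0
rho_undoc = prior_mean_rho(a, b_undoc)  # expected changepoint rate, undocumented times
rho_doc = prior_mean_rho(a, b_doc)      # expected changepoint rate, documented times

assert abs(rho_undoc - 1 / 240) < 1e-12  # one changepoint per 240 months (~20 years)
assert abs(rho_doc - 1 / 48) < 1e-12     # one changepoint per 48 months (4 years)
assert abs(rho_doc / rho_undoc - 5.0) < 1e-9  # documented times ~5x more likely
```

Any other target rate can be elicited the same way by fixing a and solving E(ρ) = a/(a + b) for b.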
Following (6) and writing Beta integrals via their Gamma function representations, a changepoint configuration η with m^(2) documented changepoints and m^(1) undocumented changepoints (m = m^(1) + m^(2)) has the marginal prior density (up to a normalizing constant)

π(η) ∝ Γ(a + m^(1)) Γ(b^(1) + N^(1) − m^(1)) × Γ(a + m^(2)) Γ(b^(2) + N^(2) − m^(2)).

For a changepoint model with m > 0 changepoints, the m-dimensional regime means μ are posited to have independent normal prior distributions: μr | σ2 ~ N(0, ν σ2), independently for r = 2, ..., m + 1. Here, ν is a pre-specified non-negative parameter that is relatively large (making the variances of the regime means large multiples of the white noise variance). Similar to the sensitivity analysis in Du, Kao and Kou (2016), our experience suggests that model selection results are stable under a wide range of ν values. Our default takes ν = 5. In fact, π(μ) can be any zero mean continuous distribution. For example, if mean shifts are expected to be large, heavy-tailed distributions such as the Student-t may be preferable. When μ cannot be tractably integrated out, inferences can be based on Laplace approximations or posterior sampling with reversible-jump MCMC (Green, 1995). Due to conjugacy under Gaussian likelihoods, the normal prior leads to closed form marginal likelihoods. Hence, for computational ease in the rest of this paper, the normal regime mean priors in (7) are used.

The BMDL expression
To derive the BMDL expression in (17), the data likelihood is first obtained. This is then integrated over μ to obtain the mixture MDL; finally, two-part MDLs are obtained for the rest of the parameters. Given a changepoint model η, the sampling distribution (3) has the regression representation X1:N = A1:N s + D1:N μ + ε1:N, with A1:N ∈ R^{N×T} and D1:N ∈ R^{N×m} as seasonal and regime indicator matrices, respectively: the (t, k) entry of A1:N is 1(v(t) = k) and the (t, r) entry of D1:N is 1(time t lies in regime r + 1), where 1(A) denotes the indicator of the event A. In (8), the subscript 1:N, or in general t1:t2, signifies that only rows t1 through t2 are used in the quantities. The normal white noises {Zt} in the AR process imply a distributional result for the AR-filtered data: applying the AR filter to X, A, and D yields transformed quantities X̃, Ã, and D̃ for which X̃(p+1):N − Ã(p+1):N s − D̃(p+1):N μ consists of iid N(0, σ2) innovations. Note that all terms superscripted with ∼ depend on the unknown AR parameter φ. To avoid AR edge effects, a likelihood conditional on the initial observations X1:p is used. In the change of variable computations, the Jacobian |∂(X̃ − Ãs − D̃μ)/∂X(p+1):N| = 1 and the likelihood has the multivariate normal form. Innovation forms of the likelihood (Brockwell and Davis, 1991) can be used if one wants a moving-average or long-memory component in {εt}.
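As a concrete sketch of indicator matrices of this kind (the exact coding conventions here, season of each time for A and membership in regimes 2 through m + 1 for D, are our assumptions for illustration):

```python
import numpy as np

def design_matrices(N, T, taus):
    # A (N x T): row t has a single 1 marking the season of time t + 1.
    # D (N x m): D[t, r] = 1 when time t + 1 lies in regime r + 2, with regimes
    # delimited by the sorted 1-indexed changepoint times in taus; the first
    # (baseline) regime carries no column, matching mu = (mu_2, ..., mu_{m+1}).
    A = np.zeros((N, T))
    for t in range(N):
        A[t, t % T] = 1.0
    bounds = list(taus) + [N + 1]
    D = np.zeros((N, len(taus)))
    for r in range(len(taus)):
        for t in range(N):
            if bounds[r] <= t + 1 < bounds[r + 1]:
                D[t, r] = 1.0
    return A, D

A, D = design_matrices(N=24, T=12, taus=[10, 18])
assert A.sum() == 24                # each time belongs to exactly one season
assert (D.sum(axis=1) <= 1).all()   # at most one post-baseline regime per time
assert D[:, 0].sum() == 8           # regime 2 spans times 10 through 17
```

Each row of [A | D] then reproduces the mean s_v(t) + μ_r(t) of the corresponding observation in the regression representation.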
We now obtain a BMDL for the changepoint model η. If m > 0, we first use the mixture MDL on μ. The marginal likelihood, after integrating μ out, has a closed form. If the parameters s, σ2, and φ were known, the mixture MDL would simply be the negative logarithm of this marginal likelihood. Under a given changepoint model η, the two-part MDL is used to quantify the cost of transmitting the parameters s, σ2, and φ. The optimal s and σ2 that minimize the mixture MDL have closed forms. These estimators depend on φ; however, the φ that minimizes L(X(p+1):N | ŝ, σ̂2, φ, η) is intractable. In general, likelihood estimators for autoregressive models do not have closed forms. Hence, simple Yule-Walker moment estimators, which are asymptotically efficient and √N-consistent under the true changepoint model, are used. There is actually little difference between moment and likelihood estimators for autoregressions (Brockwell and Davis, 1991).
In the linear model (8), the ordinary least squares residuals are obtained by projecting the data onto the orthogonal complement of the column space of [A1:N | D1:N], where [A1:N | D1:N] denotes the block matrix formed by A1:N and D1:N, and P[A1:N | D1:N] is the orthogonal projection matrix onto its column space. The sample autocovariances γ̂(0), ..., γ̂(p) of these residuals yield the Yule-Walker estimator φ̂ through the usual Yule-Walker equations; the autocovariance matrix in these equations is invertible whenever the data are non-constant (Brockwell and Davis, 1991). Next, the Yule-Walker estimator φ̂ is substituted for φ in X̃, Ã, D̃, B̃, and σ̂2. The resulting quantities are denoted by X̂, Â, D̂, B̂, and σ̂2, respectively. In particular, X̂ contains estimated one-step-ahead prediction residuals (innovations).
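A minimal sketch of Yule-Walker estimation from residuals (here simplified to a mean-adjusted series rather than the full regression residuals; the function is our illustration, not the paper's implementation):

```python
import numpy as np

def yule_walker(resid, p):
    # Yule-Walker estimates from residuals: solve Gamma_hat * phi = gamma_hat,
    # where Gamma_hat[i, j] = gamma_hat(|i - j|) is built from the sample
    # autocovariances gamma_hat(h) of the (mean-adjusted) residuals.
    n = len(resid)
    e = resid - resid.mean()
    gamma = np.array([e[: n - h] @ e[h:] / n for h in range(p + 1)])
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(Gamma, gamma[1:])
    sigma2 = gamma[0] - phi @ gamma[1:]  # innovation variance estimate
    return phi, sigma2

# Simulate a long AR(1) with phi = 0.6 and check that the estimate is close:
rng = np.random.default_rng(1)
n, phi_true = 5000, 0.6
e = np.zeros(n)
z = rng.normal(size=n)
for t in range(1, n):
    e[t] = phi_true * e[t - 1] + z[t]
phi_hat, sigma2_hat = yule_walker(e, p=1)
assert abs(phi_hat[0] - phi_true) < 0.05
```

As noted above, these moment estimators are nearly indistinguishable from likelihood estimators for autoregressions while avoiding numerical optimization.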
By (2), the BMDL for transmitting the data X(p+1):N, the model η, and its parameters is (up to a constant) the sum of the mixture MDL for the data and μ and the code length L(η) = − log{π(η)} of the model: under a uniform encoder π(s, σ2, φ) ∝ 1, the two-part MDL L(ŝ, σ̂2, φ̂ | η) = (T + 1 + p) log(N − p)/2 is constant across models and hence can be omitted. Therefore, for a model with m > 0 changepoints, its BMDL is given by (17), up to a constant. For a model with no changepoints (m = 0), denoted by ηø, the above procedure needs modification. Since ηø does not involve μ, the mixture MDL step can be skipped. As D has no columns, B in (12) reduces to I(N−p), and hence (14) still holds. With the convention that the determinant of a 0 × 0 matrix is unity, log |D̂'D̂ + Im/ν| = 0. Therefore, (17) also holds for ηø. This resolves the issue of evaluating log(m) at m = 0 that arises with some existing MDL methods.

BMDL optimization
The optimal changepoint model η̂ is selected as the one with the smallest BMDL score. However, exhaustively searching the changepoint configuration space is infeasible since the total number of admissible models, 2^(N−p), is extremely large. To overcome this, genetic algorithms are used as optimization tools in Davis, Lee and Rodriguez-Yam (2006) and Lu, Lund and Lee (2010). Genetic algorithms efficiently explore the model space, only evaluating the penalized likelihood at a relatively small number of promising models.
The following connection to empirical Bayes (EB) methods allows us to borrow MCMC model search algorithms that are commonly used in Bayesian model selection: the BMDL under model η represented in (16) is equivalent to the negative logarithm of an EB estimator of the posterior probability of η. As our BMDL formula (17) is tractable, Bayesian stochastic model search algorithms can be used; see García-Donato and Martínez-Beneito (2013) and the references therein. Here, we modify the Metropolis-Hastings algorithm in George and McCulloch (1997) by intertwining two types of proposals: a componentwise flip at a random location and a simple random swap between a changepoint and a non-changepoint. This algorithm is described in detail in Li and Lund (2015) and is implemented in the R package BayesMDL (https://github.com/yingboli/BayesMDL).
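The flip/swap proposal scheme can be sketched on a toy objective. The score function below is a hypothetical stand-in for −BMDL (it simply rewards one target configuration), and the acceptance rule is a simplified Metropolis-type step rather than the exact algorithm of George and McCulloch (1997):

```python
import numpy as np

def propose(eta, rng):
    # Mix the two proposal types described above:
    # (i) flip the indicator at one random time;
    # (ii) swap a randomly chosen changepoint with a non-changepoint.
    eta_new = eta.copy()
    ones = np.flatnonzero(eta == 1)
    zeros = np.flatnonzero(eta == 0)
    if rng.random() < 0.5 or len(ones) == 0 or len(zeros) == 0:
        t = rng.integers(len(eta))
        eta_new[t] = 1 - eta_new[t]
    else:
        eta_new[rng.choice(ones)] = 0
        eta_new[rng.choice(zeros)] = 1
    return eta_new

def search(score, n, iters, rng):
    # Metropolis-type stochastic search: accept a candidate with probability
    # exp(score_new - score_old), where score(eta) plays the role of -BMDL(eta);
    # the best configuration visited is tracked and returned.
    eta = np.zeros(n, dtype=int)
    best = eta.copy()
    for _ in range(iters):
        cand = propose(eta, rng)
        if np.log(rng.random()) < score(cand) - score(eta):
            eta = cand
        if score(eta) > score(best):
            best = eta.copy()
    return best

# Toy score rewarding a single changepoint at time index 10:
target = np.zeros(30, dtype=int)
target[10] = 1
score = lambda eta: -5.0 * np.abs(eta - target).sum()
rng = np.random.default_rng(2)
best = search(score, n=30, iters=2000, rng=rng)
assert (best == target).all()
```

The swap move is what makes the search efficient for sparse configurations: it relocates an existing changepoint in one step instead of requiring a delete followed by an add.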

Extensions to Multivariate Time Series
Mimicking the univariate setup, this section develops a BMDL for multivariate time series. While the details are illustrated for bivariate series, similar extensions apply to multivariate series with more than two components. The BMDL penalty constructed here allows changepoints to occur in one or both component series. Furthermore, it can accommodate domain experts' knowledge encouraging concurrent changes, i.e., changes affecting both series at the same time.
In temperature homogenization, to model Tmax and Tmin series jointly, both series are concatenated via X1:N = (X1:N,1', X1:N,2')' ∈ R^{2N}, where X1:N,i = (X1,i, ..., XN,i)' is the record for Tmax (i = 1) or Tmin (i = 2). Again, each time in {p + 1, ..., N} is allowed to be a changepoint in either the Tmax or Tmin series, or both. A multiple changepoint configuration is denoted by the indicator collection η = {ηt,i}, where ηt,i = 1 indicates that time t is a changepoint in series i. Given a bivariate changepoint model η, series i has mi = Σ_{t=p+1}^{N} ηt,i changepoints. As in the univariate case, the seasonal means are denoted by si = (s1,i, ..., sT,i)' ∈ R^T; regime means are denoted by μi = (μ2,i, ..., μmi+1,i)' ∈ R^{mi}. The seasonal and regime indicator matrices A1:N,i ∈ R^{N×T} and D1:N,i ∈ R^{N×mi} are constructed analogously to their univariate counterparts.
As Tmax and Tmin temperature series tend to fluctuate about the seasonal mean in tandem (positive correlation), the errors {εt = (εt,1, εt,2)'} need to be correlated across components. For this, a vector autoregressive (VAR) model of order p is employed: εt = Φ1 εt−1 + · · · + Φp εt−p + Zt, where Φ1, ..., Φp are 2 × 2 VAR coefficient matrices and {Zt} is zero mean bivariate white noise with covariance matrix Σ. The VAR model allows for correlation in time and between components.
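A minimal simulation sketch of such an error process (the coefficient matrix below is a hypothetical stationary VAR(1) choice of ours, not a value from the paper):

```python
import numpy as np

def simulate_var(Phis, Sigma, n, rng, burn=200):
    # Simulate a bivariate VAR(p): eps_t = Phi_1 eps_{t-1} + ... + Phi_p eps_{t-p} + Z_t,
    # with Z_t ~ N(0, Sigma); a burn-in removes dependence on zero initial values.
    p = len(Phis)
    d = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)
    eps = np.zeros((n + burn, d))
    for t in range(p, n + burn):
        eps[t] = sum(Phis[j] @ eps[t - 1 - j] for j in range(p)) + L @ rng.normal(size=d)
    return eps[burn:]

# Hypothetical stationary VAR(1) with cross-correlated noise:
Phis = [np.array([[0.5, 0.1], [0.1, 0.4]])]
Sigma = np.array([[9.0, 2.0], [2.0, 9.0]])
rng = np.random.default_rng(3)
eps = simulate_var(Phis, Sigma, n=600, rng=rng)
# The two components are positively correlated, as desired for Tmax/Tmin errors:
assert np.corrcoef(eps[:, 0], eps[:, 1])[0, 1] > 0
```

Cross-component dependence enters through both the off-diagonal entries of the Φj matrices and the off-diagonal entry of Σ.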
As (11) holds after replacing σ2 I(N−p) with Σ ⊗ I(N−p), the likelihood of X(p+1):N, conditional on the initial observations X1:p, is (up to a multiplicative constant) of multivariate normal form. Here, ⊗ denotes a Kronecker product, and the terms X̃, Ã, D̃ are modified by replacing φj with Φj ⊗ I(N−p) in (9) and (10), for j = 1, ..., p.
To obtain the mixture MDL in a closed form, for a bivariate model with m = m_1 + m_2 > 0 changepoints, the regime means μ are again taken to have independent normal priors, where σ²_1 and σ²_2, the diagonal entries of the white noise covariance Σ, play the role of σ² in the univariate prior.

The bivariate BMDL
For a model η with m > 0, the marginal likelihood, after integrating μ out, has a closed form. The maximum marginal likelihood estimator ŝ is unaltered from (13). However, after plugging ŝ back into the likelihood, the maximum likelihood estimators of Σ and Φ_1, . . . , Φ_p do not have closed forms. Again, Yule-Walker estimators are used.

Simulation Studies
This section studies changepoint detection performance under finite samples via simulation. Our simulation parameters are selected to roughly resemble the bivariate Tuscaloosa data, which will be studied in Section 6. Specifically, the bivariate error series {ε_t} is chosen to follow a zero-mean Gaussian VAR model with p = 3. The VAR coefficient matrices Φ_1, Φ_2, Φ_3 are held fixed across runs, and the white noise covariance Σ has diagonal entries 9 and off-diagonal entries 2.
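A sketch of the error simulation follows, with hypothetical VAR(1) coefficients standing in for the paper's VAR(3) matrices, which are not reproduced here; only Σ matches the stated design.

```python
import numpy as np

def simulate_var(Phis, Sigma, N, burn=200, seed=0):
    """Simulate a zero-mean bivariate Gaussian VAR(p) error series
    eps_t = Phi_1 eps_{t-1} + ... + Phi_p eps_{t-p} + Z_t, Z_t ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    p, L = len(Phis), np.linalg.cholesky(Sigma)
    eps = np.zeros((N + burn, 2))
    for t in range(p, N + burn):
        eps[t] = sum(Phis[j] @ eps[t - 1 - j] for j in range(p)) + L @ rng.standard_normal(2)
    return eps[burn:]  # discard burn-in so the series is near-stationary

# Hypothetical (stable) VAR(1) coefficients stand in for the paper's VAR(3) values.
Phis = [np.array([[0.3, 0.1], [0.1, 0.3]])]
Sigma = np.array([[9.0, 2.0], [2.0, 9.0]])
eps = simulate_var(Phis, Sigma, N=600)
print(eps.shape)  # (600, 2)
```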
In each of 1000 independent runs, 50 year monthly Tmax and Tmin series (N = 600) are simulated with m = 3 changepoints in each series. For the Tmax series, mean shifts are placed at the times 150, 300, and 450. The regime means have the form μ_1 = (0, Δ, 2Δ, 3Δ)', where Δ > 0 will be varied. For the Tmin series, mean shifts are placed at times 150, 300, and 375; the regime means are μ_2 = (0, −Δ, Δ, 0)'. Here, Tmax has monotonic "up, up, up" shifts of equal magnitudes; Tmin shifts in a "down, up, down" fashion, and its second shift is twice as large as the other two. The shifts at times 150 and 300 are concurrent in both series. Seasonal means are set to s = (0, 3, 10, 18, 26, 33, 36, 36, 31, 20, 8, 2)' in both series. Seasonal mean parameters are not critical, but the Δ parameter controlling the mean shift size is. Our detection powers will be reported under different signal to noise ratios, measured by κ = Δ/σ. Our study examines κ ∈ {1, 1.5, 2}, with σ = 3 agreeing with the diagonal elements of Σ. For metadata, a record containing four documented changes at the times 75, 150, 250, and 550 is posited. Among the documented times, only time 150 is a true changepoint.
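The Tmax mean structure just described can be sketched as follows (assuming the record starts in season one; Δ = κσ with κ = 1.5 and σ = 3):

```python
import numpy as np

# Seasonal means s recycled monthly, plus "up, up, up" regime shifts of size
# Delta at times 150, 300, 450 for the Tmax component.
s = np.array([0, 3, 10, 18, 26, 33, 36, 36, 31, 20, 8, 2], dtype=float)
N, sigma, kappa = 600, 3.0, 1.5
Delta = kappa * sigma
mean = np.tile(s, N // 12)
for tau in (150, 300, 450):
    mean[tau:] += Delta        # each shift persists for the rest of the record
print(mean[0], mean[149] - s[149 % 12], mean[599] - s[599 % 12])  # 0.0 0.0 13.5
```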
A simulated series with κ = 1.5 is shown in Figure 1. Figure 7 in the Appendix shows the same series after subtraction of sample monthly means.

Univariate simulations
First, the Tmax and Tmin series are analyzed separately, each fitted by univariate BMDL methods with default parameters, once with the fictitious metadata and once without metadata. We also compare various methods without metadata, including a BMDL under the objective Bayes parameters a = b = 1 (denoted by oBMDL), the automatic MDL (denoted by MDL), and the BIC. The MDL in (18) is obtained when the automatic code length rules in Davis, Lee and Rodriguez-Yam (2006) are applied to our multiple mean shift problem. The first term in (18) approximates the negative logarithm of the maximum likelihood, where the Yule-Walker estimator of σ² is used, which equals (14) with ν = ∞ after φ is replaced by φ̂. This estimator is denoted by σ̂²_{ν=∞} here. The other terms in (18) are the two-part MDLs for the regime means μ_2, . . . , μ_{m+1}, the number of changepoints m (the original penalty of log(m) is undefined for the null model with m = 0; the ad-hoc fix to this simply uses m + 1 in the logarithm), and the regime lengths N_1, . . . , N_{m+1}, respectively. The two-part MDLs of the global parameters s, σ², and φ are constants and hence omitted. An MDL for the AR order p is not needed as p is tacitly assumed known. The BIC objective is defined analogously, up to a constant. In each fit, an MCMC chain of 100,000 iterations is generated. The optimal multiple changepoint model is taken as the one that minimizes the objective function.
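For intuition on the code length bookkeeping described above, a hedged sketch of two of its pieces follows: the log(m + 1) changepoint-count cost (the ad-hoc fix for the null model) and a regime-length cost. This is illustrative only and is not the paper's exact equation (18).

```python
import math

# Illustrative code length for a changepoint configuration: a cost for the
# number of changepoints -- log(m + 1), so the null model m = 0 is defined --
# plus a cost for the regime lengths N_1, ..., N_{m+1}. Forms are hedged.
def config_code_length(regime_lengths):
    m = len(regime_lengths) - 1
    return math.log(m + 1) + sum(math.log(n) for n in regime_lengths)

print(config_code_length([600]))            # null model: log(1) + log(600)
print(config_code_length([150, 150, 300]))  # two changepoints cost more
```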
For Tmax series, Table 1 reports empirical detection percentages, including true positive rates at the exact times of changepoints and average false positive rates at non-changepoint times, along with the estimated number of changepoints m̂ and its standard error. When metadata is ignored, since the three shifts are of equal size Δ, their detection rates are similar. False detection rates are very low; even when κ = 1, on average, a non-changepoint is flagged 0.43% of the time or less. Among the different methods without metadata, detection rates of true changepoints are similar, while BIC flags slightly more false positives than the MDL-based methods (BMDL, oBMDL, and MDL). When κ = 1, the number of changepoints m = 3 is underestimated by the MDL-based methods and better estimated by BIC penalties; when κ = 1.5 and 2, m is better estimated by the MDL-based methods and overestimated by BIC. Overall, BIC tends to favor models with more changepoints than the MDL-based methods. As suggested by Theorem 3 below, the BMDL performs similarly to the automatic MDL.
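The empirical rates reported in Table 1 can be computed with bookkeeping of the following form. The detected sets below are toy placeholders, not actual simulation output, and the per-time false positive averaging is one plausible convention:

```python
# True positive rate at each true changepoint time = fraction of runs flagging
# that exact time; average false positive rate = flags at non-changepoint
# times, averaged over runs and over the (N - p) - 3 eligible times.
true_cps = {150, 300, 450}
runs = [{150, 300}, {150, 300, 450}, {150, 200}]   # hypothetical flags per run
N, p = 600, 3
tp = {t: sum(t in r for r in runs) / len(runs) for t in true_cps}
n_noncp = (N - p) - len(true_cps)
fp = sum(len(r - true_cps) for r in runs) / (len(runs) * n_noncp)
print(tp[150], round(fp, 5))  # 1.0 0.00056
```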
Interestingly, without metadata, the BMDL under the default parameters a = 1 and b = 239 and the objective choices a = b = 1 perform similarly. Figure 8 in the Appendix reveals that as functions of m, the code lengths L(η) = − log{π(η)} under BMDL and oBMDL have similar shapes, with a nearly constant difference over the region where m is small. In this case, if knowledge of changepoint frequency is not available, a BMDL can still be used with objective parameters.
Metadata use substantially increases detection power for the BMDL. In Figure 2, the true documented change at time 150 is detected 75.7% of the time when metadata is used, more than twice the 36.3% rate obtained when metadata is eschewed. Moreover, times near the changepoint at time 150 are less likely to be erroneously flagged as changepoints. Our prior belief that metadata times are more likely to be changepoints is especially important when the mean shift is small: when κ = 1, using metadata increases the detection rate of the time 150 changepoint from 15.4% to 58.8%. For false positives, Figure 2 shows that using metadata does not increase false detection rates at the documented times 75, 250, and 550 (where no shifts occur). This suggests that the prior distribution does not "overwhelm" the data. Table 1 shows that average false positive rates even drop after using metadata. For Tmin series, the non-monotonic shift aspect (down, up, down) that troubles some at most one change (AMOC) binary segmentation approaches (Li and Lund, 2012) is well handled by all multiple changepoint detection methods examined. Table 2 shows that when metadata is ignored, the larger shift at time 300 is more easily detected than the two smaller shifts at times 150 and 375. When metadata is used, the detection rate of the time 150 shift becomes comparable to that of the time 300 shift, which is twice as large in size but is not a metadata time. False positive rates are uniformly low, and m is well estimated by the MDL-based methods when κ is not too small. Again, without metadata, the MDL-based methods are similar, while BIC tends to favor models with larger m.

Bivariate simulations
Since the BMDL is flexible enough to handle non-concurrent shifts for bivariate series, we now apply it to Tmax and Tmin series jointly. Each bivariate series is fitted by an MCMC chain of 50,000 iterations, once without metadata, and once with metadata. Metadata impacts are similar to the univariate case, increasing detection of true mean shifts at metadata times and also slightly decreasing average false positive rates (see Tables 3 and 4). Figure 3 shows bivariate detection rates with metadata when κ = 1.5. For the non-concurrent shift times at 375 and 450, detection rates for the component series are very different; in most runs, concurrent shifts are not erroneously signaled.
While concurrent shifts are not always present, they are believed to be more likely in our parameter elicitation settings. Compared to the univariate BMDL, the bivariate method enhances detection power for concurrent changepoints; for example, when κ = 1.5, detection improves at time 150, where Tmax (Tmin) shifts by Δ (−Δ). Tables 3 and 4 show that detection power gains under the bivariate approach are greater for small signals (κ = 1) without metadata. An interesting phenomenon is observed: the bivariate BMDL improves upon univariate methods more when the concurrent shifts move the series in opposite directions than in the same direction (detection rates at time 300 do not increase for Tmin). Because Tmax and Tmin are positively correlated series, concurrent shifts in the same direction may be misattributed to positively correlated errors; this cannot happen for shifts in opposite directions.
Overall, while bivariate detection does not induce more false positives in aggregate, it tends to flag false positives at times where the mean of the other component series shifts. Figure 3 shows that at time 375, a changepoint time in Tmin but not in Tmax, a false detection rate of 7.3% for Tmax is obtained. At time 450, a changepoint in Tmax but not Tmin, a false detection rate of 15.2% is obtained for Tmin. These false positives slightly degrade inferences at nearby changepoints; for example, at time 450 for Tmax and time 375 for Tmin, detection rates are 34.2% and 33.0%, respectively, slightly lower than the 37.9% and 38.2% reported in the univariate case. Finally, Tables 3 and 4 show that the bivariate approach tends to overestimate m, which differs from the univariate case.

Analysis of the Tuscaloosa Data
In this section, the Tmax and Tmin series will be analyzed from both univariate and bivariate perspectives via the penalization methods of Section 5. All parameters are set to default values; the AR order p = 2 is judged appropriate: by Figure 9 in the Appendix, almost all sample autocorrelations of residuals fitted with p = 2 lie inside pointwise 95% confidence bands.
To ensure convergence in the MCMC search algorithm, for each fit, 50 Markov chains are generated from different starting points, each containing 1,000,000 (univariate) or 100,000 (bivariate) iterations. Among all changepoint models visited by the 50 Markov chains, the one with the smallest BMDL is reported as the optimal model.
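A toy stand-in for the stochastic model search: a greedy single-flip search over changepoint indicator vectors that tracks the best model visited, echoing how the smallest-objective model is reported across chains. This is not the paper's MCMC sampler, and the objective below is purely illustrative.

```python
import numpy as np

def stochastic_search(objective, N, iters=2000, seed=0):
    """Greedy single-flip search over indicator vectors eta: propose flipping
    one time's indicator, keep the proposal when it lowers the objective, and
    track the best (lowest-objective) model visited."""
    rng = np.random.default_rng(seed)
    eta = np.zeros(N, dtype=bool)
    best, best_eta = objective(eta), eta.copy()
    for _ in range(iters):
        prop = eta.copy()
        prop[rng.integers(N)] ^= True
        if objective(prop) <= objective(eta):
            eta = prop
        if objective(eta) < best:
            best, best_eta = objective(eta), eta.copy()
    return best_eta

# Toy objective: mismatch with a truth having changepoints at 30 and 70.
truth = np.zeros(100, dtype=bool); truth[[30, 70]] = True
obj = lambda eta: int(np.sum(eta ^ truth))
print(np.flatnonzero(stochastic_search(obj, 100)))  # [30 70]
```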

Univariate fits
The top half of Table 5 displays estimated changepoints for the univariate fits; the estimated changepoint times are as follows.
Univariate, with metadata: Tmax: 1956 Nov, 1987 May; Tmin: 1921 Nov, 1956 Jun, 1987 May.
Univariate, without metadata: Tmax: 1957 Mar, 1990 Jan; Tmin: 1918 Feb, 1957 Jul, 1990 Jan.
Bivariate, with metadata: Tmax: 1921 Nov, 1956 Jun, 1987 May; Tmin: 1921 Nov, 1956 Jun, 1987 May.
Bivariate, without metadata: Tmax: 1918 Feb, 1957 Jul, 1988 Jul; Tmin: 1918 Feb, 1957 Jul, 1988 Jul.
When metadata is ignored, all methods (BMDL, oBMDL, MDL, and BIC) estimate the same optimal changepoint configuration: Tmax has two estimated changepoints and Tmin has three; of these, only January 1990 is a concurrent change. Another changepoint is approximately concurrent: March 1957 for Tmax and July 1957 for Tmin. The 1918 changepoint flagged for Tmin is close to the station relocation in November 1921; the station relocation in June 1956 and the equipment change in November 1956 are near the two estimated changepoints in 1957. The metadata time in May 1987 is about three years from the concurrent changepoints flagged in January 1990. Of course, when metadata is ignored, estimated changepoint times need not coincide exactly with metadata times.

Bivariate fits
Both Tmax and Tmin series are now analyzed in tandem with our methods. Three changepoints are detected in both series, with or without metadata, and all are concurrent (see the bottom half of Table 5). Figure 4 illustrates the fitted changepoint configurations. In temperature homogenization problems, the goal is often to detect (and then adjust for) "artificial" changes; naturally occurring climate shifts should be left in the record if possible. Because of this, analyses often consider target minus reference series, where a reference series is a record from a nearby station that shares similar weather with the target station. A changepoint detection analysis using the bivariate BMDL is performed on target minus reference data and is included in Appendix Section B.2.
Large Sample Theory
Suppose that the true relative changepoint configuration is λ⁰ = (λ⁰_1, . . . , λ⁰_{m⁰})', where true parameter values are superscripted with zero. Our goal is to identify λ⁰ among many candidate models. In fact, for a (fixed) large integer M, all relative changepoint configurations in a candidate set Λ are considered, where the minimal regime spacing d is a small positive constant, smaller than λ⁰_r − λ⁰_{r−1} for all r = 1, . . . , m⁰ + 1. We assume that m⁰ ≤ M, so that λ⁰ ∈ Λ, and that M ≤ 1/d.
Under the same assumptions, the automatic MDL for piece-wise AR processes (Davis, Lee and Rodriguez-Yam, 2006) has been shown to consistently estimate relative changepoint locations and model parameters (Davis and Yau, 2013). The following two theorems show that the BMDL (17) achieves the same large sample consistency.

Theorem 1 (Consistency of changepoint configuration). Given the observed time series of length N, denote the estimated relative changepoint model as λ̂_N = argmin_{λ∈Λ} BMDL(λ). (19) Then λ̂_N converges to λ⁰ in probability as N → ∞.
Furthermore, the convergence rate for each r = 1, . . . , m⁰ is λ̂_{r,N} − λ⁰_r = O_P(1/N). (21)
Theorem 2 (Consistency of parameter estimation). Suppose that under the true model λ⁰, the true model parameters are μ⁰, s⁰, (σ²)⁰, and φ⁰. Under the estimated relative changepoint model λ̂_N in (19), the BMDL estimator for φ, denoted by φ̂_N, is given by the Yule-Walker estimator described in Section 3.2; the BMDL estimators for s and σ², denoted by ŝ_N and σ̂²_N, are given by (13) and (14), respectively, after replacing all terms containing φ by φ̂_N; and the BMDL estimator for μ is taken as its conditional posterior mean (22). Then as N → ∞, all estimators converge to their true values in probability. Proofs of Theorems 1 and 2 are given in Appendix Sections A.4 and A.5, respectively. The convergence rate O_P(1/N) in (21) is viewed as the optimal rate in the multiple changepoint detection literature (Niu, Hao and Zhang, 2016). From a Bayesian model selection perspective, a model selection criterion is consistent if the ratio of posterior probabilities between the true model λ⁰ and any other model λ ∈ Λ tends to infinity (Clyde and George, 2004). This is equivalent to the BMDL difference BMDL(λ) − BMDL(λ⁰) → ∞, which is shown to hold in Propositions 3 and 4 in the Appendix.
To better understand our BMDL penalty, we compare it to the MDL (18). Under a given relative changepoint model λ, (18) increases linearly with N. The following theorem states that the difference between the BMDL in (17) and the automatic MDL in (18) is asymptotically bounded. A proof of Theorem 3 is obtained by comparing the large sample behavior of the corresponding terms in (17) and (18) via order estimates derived in the Appendix. In the BMDL expression (17), all but the last term arise from the mixture MDL. The term (N − p) log σ̂²/2 measures the model's goodness-of-fit. By Lemma 3 in the Appendix, σ̂² = σ̂²_{ν=∞} + O_P(1/N); hence, the difference between the first terms in (17) and (18) is O_P(1). In (17), the second term is O_P(1), while the third term, by Lemma 4 in the Appendix, grows like the two-part MDL penalty on μ in (18); this interestingly suggests that the mixture MDL in (17) contains a built-in penalty on μ that performs similarly to the two-part MDL penalty on μ in (18). The last term in (17) is the penalty on the changepoint configuration λ.
With or without metadata, Lemma 5 in the Appendix shows that this term is asymptotically m log(N) + O_P(1), which differs from the last term in (18) only by O_P(1) plus a constant. An implication of Theorem 3 is that the model selection consistency results in Theorem 1 also hold for the automatic MDL (18), giving alternate confirmation of the asymptotic results in Davis, Lee and Rodriguez-Yam (2006) and Davis and Yau (2013). In addition, without metadata, the BMDL (17) and the automatic MDL (18) perform similarly in large samples. Section 5 confirms this via simulation examples, also demonstrating that when metadata is available and incorporated, the BMDL significantly increases changepoint detection power and precision under finite samples.

Discussion
This paper developed a flexible MDL-based multiple changepoint detection approach that accommodates a priori information on changepoint times via prior distributional specifications. Motivated by climate homogenization problems, our Bayesian MDL (BMDL) method incorporates subjective knowledge such as metadata into mean shift detection for univariate autoregressive processes with seasonal means, and then extends these ideas to bivariate VAR settings while encouraging concurrent changes in the component series. Both theory and simulation show that without metadata, our BMDL performs similarly to the state-of-the-art automatic MDL method; with metadata, the BMDL's detection power significantly improves under finite samples. Our BMDL has several practical advantages, including simple parameter elicitation, asymptotic consistency, and efficient MCMC computation.
The approach can be extended to accommodate more flexible time series structures, including periodic autoregressions (Hewaarachchi et al., 2017), moving averages, and multivariate data with more than two series. The methods could also be tailored to categorical data. For count data, the likelihood could be Poisson-based; with a conjugate Gamma prior on the means, the resulting marginal likelihoods again have closed forms. There is no technical difficulty in allowing a background linear trend, or even piecewise linear trends. This said, linear trends can be mistaken for multiple mean shifts should trends be present and ignored in the analysis (Li and Lund, 2015). In addition, with straightforward modification, the BMDL can handle changes in variances or autocovariances.
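To illustrate the count-data remark, the sketch below integrates a Poisson likelihood against a conjugate Gamma(a, b) prior on a regime mean, giving a closed-form log marginal likelihood; the hyperparameters a, b are hypothetical.

```python
import math

# With X_t ~ Poisson(mu) within a regime and mu ~ Gamma(a, b) (rate b),
# integrating mu out gives the closed form
#   b^a / Gamma(a) * Gamma(a + s) / (b + n)^(a + s) / prod_t x_t!,
# where n is the regime length and s the sum of counts.
def log_marginal_poisson_gamma(counts, a=1.0, b=1.0):
    n, s = len(counts), sum(counts)
    out = a * math.log(b) - math.lgamma(a)
    out += math.lgamma(a + s) - (a + s) * math.log(b + n)
    out -= sum(math.lgamma(c + 1) for c in counts)
    return out

print(round(log_marginal_poisson_gamma([2, 3, 1]), 4))  # -5.6097
```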
Non-MCMC stochastic search methods could also be used. Genetic algorithms, popular in multiple changepoint MDL analyses, are also capable of minimizing the BMDL. Pre-screening methods such as those of Chan, Yau and Zhang (2014) and Yau and Zhao (2016) can speed up model search algorithms. In simple settings where no global parameters exist (i.e., independent observations, no seasonal cycle, error variance known), dynamic programming techniques such as PELT (Killick, Fearnhead and Eckley, 2012) can further accelerate computations.

A. Theoretical Results and Proofs
In this Appendix, the asymptotic limits of the Yule-Walker estimator φ̂ and the white noise variance estimator σ̂² under a given changepoint model λ are investigated in Sections A.1 and A.2, respectively. In Section A.3, the BMDL difference between the true model λ⁰ and other models is studied, showing that λ⁰ achieves the smallest BMDL in the limit. Lastly, the proofs of Theorem 1 and Theorem 2 are given in Sections A.4 and A.5, respectively.

A.1. Asymptotic behavior of the Yule-Walker estimator φ̂ of the autoregression coefficients
For a sample size N, the observations obey the true changepoint model λ⁰ in (8), where the error process {ε_t} is a zero-mean causal AR(p) series. When there is no ambiguity, we simplify the notations μ⁰, s⁰, (σ²)⁰, φ⁰ to μ, s, σ², φ, respectively, and omit subscripts such as 1:N on the data vector and other quantities.
For any relative changepoint model λ, suppose that η is the corresponding changepoint configuration under the sample size N. From (15), the ordinary least squares residual vector is ε̂_ols = (I_N − P_[A|D]) X. (24) Here, [A|D] is the block matrix formed by A and D, and P_A is the orthogonal projection onto the column space of the matrix A. The regime indicator matrix D depends on λ and may not equal D⁰.
Lemma 1. For each relative changepoint configuration λ ∈ Λ and t ∈ {1, . . . , N}, when N is large, each entry of ε̂_ols can be expressed as a sum W_t + δ_t of a zero-mean stochastic term W_t and a deterministic term δ_t. (25) Here, the functions r⁰(t) and r(t) give the regimes that time t is in under the models λ⁰ and λ, respectively. In regime ℓ of the changepoint configuration λ,

μ̄_ℓ denotes the average of the true mean parameters, N_ℓ is the number of time points in this regime, and R_ℓ is the set of all time points in this regime. Likewise, ε̄_ℓ is the average of the errors in regime ℓ, ε̄_v is the average of the errors during season v, and ε̄ is the average of all errors.
Proof. Because of (24), our main objective is to study the projection I_N − P_[A|D] for large N. Since the two column spaces spanned by (I_N − P_D)A and D are perpendicular, Theorem B.45 in Christensen (2002, pp. 411) gives P_[A|D] = P_D + P_{(I_N − P_D)A}. The term P_{(I_N − P_D)A} can be expanded explicitly. For any n ∈ ℕ, let 0_n be the n-dimensional vector containing all zero entries, 1_n be the n-dimensional vector whose entries are all unity, and J_n be the n × n matrix whose entries are all unity, i.e., J_n = 1_n 1_n'.
For v ∈ {1, . . . , T}, suppose there are k(v, ℓ) time points in regime ℓ that also lie in season v. Since N_ℓ increases linearly with N, so does k(v, ℓ). Moreover, when N is large, inside each regime, the seasonal counts k(v, ℓ) are equal except for edge effects, i.e., k(v, ℓ)/N_ℓ ≈ 1/T for all seasons v. To avoid trite work, we ignore these edge effects in the ensuing calculations. Proceeding under this simplification, the vth column of A, denoted by A_v, can be computed under the projection P_D. We can then obtain an expression for A'(I_N − P_D)A by evaluating its (u, w)th entries for u, w ∈ {1, 2, . . . , T}; the inverse of this matrix can be verified directly. Plugging this inverse into (27) and denoting Q_D = I_N − P_D produce an explicit form for the projection. For simplicity, we assume that regime ℓ starts with season one, ends with season T, and contains n_ℓ full cycles. Using n = N/T = Σ_{r=1}^{m+1} n_r and (28), and then substituting into (29), yields the required expansions. Since P_D is block-diagonal, the tth entries of the vectors in (24) follow for t ∈ {1, 2, . . . , N}. For any changepoint configuration λ ∈ Λ, as N → ∞, N^{-1} Σ_{t=h+1}^{N} δ_t δ_{t−h} converges to a constant that does not depend on the lag h ∈ {0, 1, . . . , p}. This is because for any lag h, δ_t = δ_{t−h} for all t ∈ {1, . . . , N}, except for at most (m + m⁰)h ≤ (m + m⁰)p times near the changepoints in λ and λ⁰. Hence, as N → ∞, N^{-1} Σ_{t=h+1}^{N} δ_t δ_{t−h} converges to its limit at rate O(1/N). We denote this limit by δ² = δ²(λ), which is non-negative and depends on λ, but not on N. It is not hard to see that δ_t = 0 for all t ∈ {1, . . . , N} if and only if λ contains all relative changepoints in λ⁰ (denoted by λ ⊃ λ⁰). Therefore, δ² = 0 only for models λ such that λ ⊃ λ⁰, including λ⁰ itself.
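The projection decomposition P_[A|D] = P_D + P_{(I_N − P_D)A} invoked in the proof can be verified numerically on a small example (T = 4 seasons, one changepoint; the pinv-based projector is a generic stand-in):

```python
import numpy as np

# Since C((I - P_D)A) is perpendicular to C(D), the projection onto the
# combined column space decomposes as P_[A|D] = P_D + P_{(I - P_D)A}.
def proj(M):
    """Orthogonal projector onto the column space of M."""
    return M @ np.linalg.pinv(M.T @ M) @ M.T

N, T, tau = 16, 4, 8
A = np.zeros((N, T))
A[np.arange(N), np.arange(N) % T] = 1.0                 # seasonal indicators
D = (np.arange(N) >= tau).astype(float).reshape(-1, 1)  # one regime indicator
lhs = proj(np.hstack([A, D]))
rhs = proj(D) + proj((np.eye(N) - proj(D)) @ A)
print(np.allclose(lhs, rhs))  # True
```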

Lemma 2.
Under any relative changepoint configuration λ ∈ Λ (which may or may not be the true changepoint configuration), for h ∈ {0, 1, . . . , p}, as N → ∞, the lag h sample autocovariance γ̂(h) converges in probability to γ(h) + δ², where γ(h) is the true lag h autocovariance of the AR(p) series {ε_t}.
Proof. Since the AR(p) errors are assumed causal, we may write ε_t = Σ_{j=0}^{∞} ψ_j Z_{t−j}. Subtracting the appropriate block averages, one can write W_t as a linear combination Σ_j ψ_j^{(t)} Z_{t−j} of all Z_t's up to and before time N. Since ψ_j = 0 when j < 0, ψ_j^{(t)} = 0 if j < t − N. The asymptotic limit of the sample autocovariances can now be derived. Arguing as in Proposition 7.3.5 of Brockwell and Davis (1991, pp. 232) shows that, for each t and h, ψ_j^{(t)} is asymptotically close to ψ_j. Since {Z_t} is iid with variance σ², the weak law of large numbers (WLLN) for linear processes (Brockwell and Davis, 1991, pp. 208, Proposition 6.3.10) identifies the limit of the first term in the bottom line of (34). Since the Yule-Walker estimator φ̂ is built from the γ̂(h)'s, the following asymptotic result follows from Lemma 2.
Proposition 1. Under any relative changepoint configuration λ ∈ Λ, the Yule-Walker estimator φ̂ converges in probability to the limit in (35), expressed through γ_p = (γ(1), . . . , γ(p))' and the p × p matrix Γ_p with (i, j)th entry γ(|i − j|).
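A minimal univariate sketch of the Yule-Walker computation referenced throughout: φ̂ solves Γ̂_p φ = γ̂_p built from sample autocovariances, with σ̂²_YW = γ̂(0) − φ̂'γ̂_p. These are the standard formulas; the paper's seasonal-mean adjustments are omitted.

```python
import numpy as np

def yule_walker(x, p):
    """Yule-Walker estimates for an AR(p): solve Gamma_p phi = gamma_p and
    set sigma2 = gamma(0) - phi' gamma_p, using biased sample autocovariances."""
    x = np.asarray(x, float) - np.mean(x)
    N = len(x)
    gamma = np.array([x[:N - h] @ x[h:] / N for h in range(p + 1)])
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(Gamma, gamma[1:])
    sigma2 = gamma[0] - phi @ gamma[1:]
    return phi, sigma2

rng = np.random.default_rng(1)
x = np.zeros(20000)
for t in range(1, 20000):          # AR(1) with phi = 0.6 and unit-variance noise
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()
phi, sigma2 = yule_walker(x, 1)
print(round(phi[0], 2), round(sigma2, 2))  # roughly 0.6 and 1.0
```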

A.2. Asymptotic behavior of estimators of σ 2
In the BMDL and (automatic) MDL formulas, the estimators for σ² are σ̂² and σ̂²_{ν=∞}, respectively. The following lemma shows that under any model λ, these two estimators are asymptotically the same as the Yule-Walker estimator σ̂²_YW of σ².
Lemma 3. Under any changepoint configuration λ ∈ Λ, as N → ∞, σ̂² and σ̂²_{ν=∞} differ from σ̂²_YW by O_P(1/N).
Proof. Under the null model λ_ø (m = 0), the column space of D is the null space and both σ̂² and σ̂²_{ν=∞} equal (N − p)^{-1} X'(I_N − P_A)X. Since σ̂²_YW = N^{-1} X'(I_N − P_A)X, the conclusion holds. The rest of the proof is for any model λ that contains m ≥ 1 relative changepoints. We first establish (39). Since φ̂ has the limit in (35), it is not hard to show that as N tends to infinity, D'D/N and D'X/N converge in probability to an m × m positive definite matrix and an m-dimensional vector, respectively. In the prior of μ, the parameter ν is a constant. Hence, the left hand side of (39) has the claimed limit, where the second to last equality in its derivation follows from (26). We now show that (40) holds for any λ with m ≥ 1. For notational simplicity, for any j ∈ {0, 1, . . . , p}, matrices formed from the rows of A and D are denoted by A_j and D_j. Since both A and A_j are (N − p) × T matrices and each column of A can be written as a linear combination of the columns of A_j, the corresponding column spaces agree: C(A) = C(A_j). Therefore, P_A = P_{A_j} for all j. Now define Δ_j as the difference between D and D_j. The denominator in (41) cannot be zero since 1 − Σ_{k=1}^{p} φ̂_k ≠ 0 for Yule-Walker estimates when N is large (Brockwell and Davis, 1991).
Since there are at most 2m(p + h) non-zero entries in Δ_j, and none of these entries depend on N, Δ_j'Δ_j = O_P(1). In addition, for any N-dimensional vector α whose entries do not depend on N, α'Δ_j = O_P(1). Using (41), it follows that for any α, β ∈ R^N whose entries do not depend on N, the relevant quadratic forms differ by O_P(1). Hence, (26) shows that the limit equals the right hand side of (40).
Under any model λ, Lemma 2 shows that the Yule-Walker estimator σ̂²_YW converges to the limit in (43) at rate O_P(1/√N). We denote this limit by f(δ²), emphasizing its dependence on δ². By Lemma 3, the asymptotic behavior of the BMDL estimator σ̂² can be summarized in the following proposition.
Proposition 2. As N → ∞, σ̂² converges in probability to f(δ²); the function f(δ²) is continuous and strictly increasing in δ², with f(0) = σ².
Proof. We show that f(δ²) strictly increases in δ². According to (2.22) in Harville (2008, pp. 428), for any matrices R ∈ R^{r×r}, S ∈ R^{r×l}, T ∈ R^{l×l}, U ∈ R^{l×r} with R, U non-singular, the inverse of the perturbed matrix can be expanded in the usual way. Hence, for δ² > 0, f(δ²) can be expanded after denoting the needed scalars compactly, and differentiation of f(δ²) with respect to δ² gives a strictly positive derivative. The strict positivity follows from causality of the AR(p) errors, which ensures that 1 − Σ_{k=1}^{p} φ_k > 0. Therefore, f(δ²) is strictly increasing in δ² and f(0) = σ².

A.3. Asymptotic behavior of the BMDL in (17)
Recall the BMDL of the relative changepoint model λ given in (17).
The next two lemmas quantify the asymptotic behavior of the third and fourth terms in the above BMDL formula, respectively. Proof. By (41) and the corresponding results in the proof of Lemma 3, as N → ∞, the m × m matrix appearing in the third term converges entrywise. The determinant of this m × m matrix (of finite dimension) then converges as well, and (47) follows immediately.
Since N_r = O(N) for all r ∈ {2, . . . , m + 1}, Lemma 4 implies (48) for any changepoint model λ.
Lemma 5. Suppose that the numbers of documented and undocumented times both increase linearly with N, i.e., N^{(k)} = O(N) for k = 1, 2. Then for any two changepoint models λ_1, λ_2 ∈ Λ, whose total numbers of changepoints are m_1 and m_2, respectively, the pairwise difference of the last terms in the BMDL formula (17) is (m_1 − m_2) log N + O_P(1). (49)
Proof. The left hand side of (49) can be simplified to a ratio of Gamma functions (50). Stirling's formula quantifies the asymptotic limit of such Gamma function ratios: log Γ(x + a) − log Γ(x) = a log x + o(1) as x → ∞ for fixed a. Therefore, (50) equals (m_1 − m_2) log N + O_P(1).
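The Stirling-type ratio used above can be checked numerically: log Γ(x + a) − log Γ(x) = a log x + o(1) for fixed a as x → ∞, which is how the Gamma function ratios contribute (m_1 − m_2) log N terms.

```python
import math

# Check that Gamma(x + a) / Gamma(x) behaves like x^a for large x.
def log_gamma_ratio(x, a):
    return math.lgamma(x + a) - math.lgamma(x)

x, a = 10000.0, 3.0
print(abs(log_gamma_ratio(x, a) - a * math.log(x)) < 1e-2)  # True
```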
The asymptotic behavior of the BMDL is now established in the following two propositions. They consider the pairwise difference of BMDLs between the true model λ⁰ and another changepoint model λ. Proposition 3 considers the case where the model λ does not contain all relative changepoints in λ⁰, i.e., λ ⊅ λ⁰, whereas Proposition 4 considers the case where λ ⊃ λ⁰ and λ ≠ λ⁰, i.e., λ contains all relative changepoints in λ⁰ and also has some redundant changepoints.
Proposition 3. For any relative changepoint configuration λ ∈ Λ, if λ ⊅ λ⁰, then BMDL(λ) − BMDL(λ⁰) → ∞ in probability as N → ∞.
Proof. In this proof, when necessary, subscripts λ and λ⁰ are used to distinguish the same terms under different models. By (48) and (49), the difference between the BMDLs of the (non-true) model λ and the true model λ⁰ has the asymptotic expansion (52), where the last equality is justified via Proposition 2. For a model λ ⊅ λ⁰, its corresponding δ²_λ > 0. By Proposition 2, f(δ²) strictly increases in δ², so the leftmost logarithm term in (52) has a strictly positive limit. Therefore, when N is large, the first term in (52) is positive, of order O_P(N), and dominates the other terms in (52).

Proposition 4. For any relative changepoint configuration λ ∈ Λ with λ ⊃ λ⁰ and λ ≠ λ⁰, BMDL(λ) − BMDL(λ⁰) → ∞ in probability as N → ∞.
Proof. In the case where λ ⊃ λ⁰, (51) still holds. Moreover, since λ also contains redundant changepoints, m > m⁰. Hence, for large N, the second term in (51) is positive and of order O_P(log N). To prove Proposition 4, we need to show that the first term in (51) is bounded in probability. A sufficient condition is (53), which states that σ̂² under the two models λ and λ⁰ differs only by O_P(1/N). To establish (53), we first focus on the model λ; for notational simplicity, the subscript λ is omitted when there is no ambiguity. Under any model λ ⊃ λ⁰, its corresponding δ_t in (25) is zero for all t ∈ {1, . . . , N}; hence, by Lemma 1, the lag h sample autocovariance γ̂(h) in (34), for all h ∈ {0, 1, . . . , p}, can be written as in (54). Recall that ε̄_{r(·)}, ε̄_{v(·)}, and ε̄ are averages of zero-mean AR(p) errors. These averages are taken over error blocks whose sizes are proportional to N. By the central limit theorem for linear processes, these averages all converge to zero in probability at rate O_P(1/√N). Since the fourth term in (54) is a sum of their two-way interactions and quadratic forms, it is O_P(1/N). The second term in (54) can be expanded in terms of ε_{r,t}, the error at time t in the rth regime, ε_{v,t}, the error at time t in the vth month, and the corresponding averages ε̄_r and ε̄_v; this term is O_P(1/N). Similarly, the third term in (54) is also O_P(1/N). Therefore, under any model λ ⊃ λ⁰, including λ⁰ itself, (54) reduces to a data-driven term plus an O_P(1/N) remainder, which shows that γ̂(h) under the two models λ and λ⁰ differs only by O_P(1/N). By (38), σ̂²_YW under the two models λ and λ⁰ can then also differ only by O_P(1/N). By Lemma 3, the BMDL estimator σ̂² = σ̂²_YW + O_P(1/N), which establishes (53).

A.4. A proof of Theorem 1
To prove Theorem 1, we first establish the asymptotic consistency of λ̂_N in the case where m⁰ is known. Here, Λ_m denotes the subset of Λ formed by models that have m relative changepoints.
Proposition 5. If m⁰ is known, then λ̂_N → λ⁰ in probability as N → ∞.
Proof. We will show that every subsequence of {λ̂_N} admits a further subsequence converging in distribution to λ⁰. By the results in Section 25 of Billingsley (1995), this implies that λ̂_N converges weakly to λ⁰. However, since λ⁰ is a constant configuration, one can upgrade the mode of convergence to infer that λ̂_N → λ⁰ in probability (see again Section 25 of Billingsley (1995)).
Hence, let N_k be an infinite sequence with N_k → ∞ as k → ∞. By Helly's selection theorem (Theorem 25.9 in Billingsley (1995)) and the compactness of Λ_{m⁰}, there exists a further infinite subsequence N_k' and a possibly random configuration λ* such that λ̂_{N_k'} converges weakly to λ*. Here, a random configuration λ* means a random variable a = (a_1, . . . , a_{m⁰})' such that 0 ≤ a_1 < a_2 < . . . < a_{m⁰} ≤ 1. To finish the argument, it is sufficient to show that λ* = λ⁰.
To show that λ* = λ⁰, we argue by contradiction and suppose that λ* ≠ λ⁰ in that P(λ* ≠ λ⁰) > 0. For notational simplicity, we replace N_k' by N below. Let F_{λ̂_N}(·) and F_{λ*}(·) denote the cumulative distribution functions of λ̂_N and λ*, respectively, and consider (55), where the function δ²(·) is defined by (31). It is easy to verify that δ²(a) is a continuous function of a: for a fixed configuration a and the truth a⁰ = (a⁰_1, . . . , a⁰_{m⁰})', we can rewrite their regime means and then form a vector b of dimension at most 2m⁰ by ordering all components of both a and a⁰. Then δ²(a) is a piecewise-smooth function of the components b_i, where each b_i is a component of a or a⁰ and the associated weights w_i are formed from the regime mean parameters. Therefore, (55) is continuous in a. We also tacitly assume that all regime mean parameters Δ_k are bounded. By Part (ii) of Theorem 25.8 in Billingsley (1995), if X_N converges weakly to X and a function g(·) is continuous and bounded, then E[g(X_N)] → E[g(X)]. Our work reduces to showing that BMDL(λ̂_N) − BMDL(λ⁰) is larger than a positive constant for all large N; for if this holds, then the fact that λ̂_N minimizes the BMDL would be contradicted. Hence, it suffices to establish a positive lim sup bound. To do this, since m⁰ is known, m̂ = m⁰ and (52) now give (57). Obviously, the term N^{-1}(N − p) in (57) converges to unity as N → ∞. The leftmost term in brackets in the bottom equation of (57) converges to zero; this follows from (56), the continuity of f and the natural logarithm, and the fact that log(1) = 0. When λ* ≠ λ⁰, since the numbers of changepoints in these two models are the same, λ* ⊅ λ⁰. Therefore, by (31), we have δ²_{λ⁰} = 0 and δ²_{λ*} > 0. The limit of the rightmost bracketed term in (57) must then be positive. Positivity follows from f(δ²_{λ*}) > f(δ²_{λ⁰}) = σ², which can be verified by an argument akin to that proving Proposition 2, the nondecreasing and continuous nature of f, the fact that f(0) = σ² > 0, and that P(λ* ≠ λ⁰) > 0.
The details are omitted; this said, one can get a flavor for the argument from the proof of the next result, which quantifies how much δ²_λ varies when its elements are changed. This finishes our work.
Next, under the assumption that m_0 is unknown, we first establish the following convergence rate lemma for the estimated changepoint locations λ̂_j.
Lemma 6. Suppose that m_0 is unknown. Then for each λ^0_r, r ∈ {1, . . . , m_0}, there exists a λ̂_j in λ̂_N such that the convergence rate in (21) holds. Proof. By the spacing assumptions made on the changepoint configuration, there can be at most a finite number of changepoints. Using this and repeating the argument in the proof of Proposition 5, one can argue that the estimated changepoint model λ̂_N in (19) converges to a limit λ* that contains all changepoints in λ^0; that is, P(λ* ⊇ λ^0) = 1. This means that for each λ^0_r, r = 1, . . . , m_0, there exists a λ̂_{j(r),N} in λ̂_N such that λ̂_{j(r),N} converges in probability to λ^0_r; that is, |λ̂_{j(r),N} − λ^0_r| = o_P(1). For notational simplicity, we write λ̂_{j(r),N} as λ̂_j when there is no ambiguity.
The above shows that for all r ∈ {1, . . . , m_0}, |λ̂_j − λ^0_r| = O_P(N^{α_r−1}) for some finite α_r; in fact, we know that α_r ≤ 1. Now let ω_r denote the smallest such exponent. To prove the lemma, we need to show that ω_r ≤ 0 for all r, or equivalently that ω ≤ 0, where ω = max_{1 ≤ r ≤ m_0} ω_r. This will be done by contradiction. Hence, suppose that ω > 0; then there exists an r such that ω_r = ω > 0. This will now be used to draw a contradiction. For sufficiently large N, a new model λ̃_N is created from λ̂_N by replacing the changepoint λ̂_j in λ̂_N with λ^0_r. A contradiction occurs if BMDL(λ̃_N) < BMDL(λ̂_N) for all large N, since λ̂_N minimizes the BMDL. We first investigate the difference in γ̂(h) in (34) under the models λ̂_N and λ̃_N, for each h ∈ {0, 1, . . . , p}. Following the argument in Proposition 4, the first term in (34) depends on the observed data only up to an O_P(1/N) error; hence, its difference under the models λ̂_N and λ̃_N is O_P(1/N). For the other terms in (34), we need only focus on the summation over t satisfying λ̂_{j−1}N ≤ t ≤ λ̂_{j+1}N − 1, depicted in Figure 5. This is because (W_t, δ_t) is identical under the models λ̂_N and λ̃_N for all other t. For notational simplicity, lengths of time intervals on the rescaled timeline are denoted as in (59)–(63). We first consider the case where λ̂_{j−1} lies to the left of λ^0_{r−1} and λ̂_{j+1} lies to the right of λ^0_{r+1}. Without loss of generality, we assume that λ̂_j lies to the left of λ^0_r. The distances between these estimated changepoints and their limits all converge to zero at rates no slower than O_P(N^{ω−1}). Under the model λ̃_N, δ_t in (25) can be written as in (64). When N is large, δ_t = δ_{t−h} for all but a finite number of times t; hence, (65) holds for the second term in (34) (a similar argument applies to the third term). By (64) and (65), under the two models λ̂_N and λ̃_N, the difference of δ_t is piecewise constant, as given in (66). To study the sum of W_t in (25) over the above intervals, apply the central limit theorem for linear processes to see that the interval sample means W̄_{r(t)}, W̄_{v(t)}, … all converge to zero at the rate O_P(1/√N) for any t.
Hence, for an interval [a, b] whose length b − a depends on N and is O_P(N^ξ) with ξ ∈ (0, 1], the sums of δ_t and W_t over this interval can be bounded, with the last equality following from ξ ≤ 1. For the three interval sums in (67), the corresponding rates ξ of their lengths are 1, ω, and 1, respectively. Hence, when (66) is decomposed into three sums over these intervals, its differences under the models λ̂_N and λ̃_N follow, with the last equality using ω ≤ 1. Therefore, the differences of the second and third terms in (34) under the two models λ̂_N and λ̃_N are O_P(N^{ω/2−1}). The last term in (34) is handled similarly: evaluating it under the model λ̂_N and under the model λ̃_N, the difference of this last term under the two models, up to an O_P(1/N) error, is given by (70). Therefore, the difference of γ̂(h) in (34) under the models λ̂_N and λ̃_N is given by (71). Here, the convergence rates of the three terms in the summation are given by the results shown in (62), (69), and (70), respectively. Since ω > 0, the third term in (71) dominates the overall convergence rate. Note that by (70), this term has the same limit as (μ_{r+1} − μ_r)²Δl_r. Therefore, the limit of (71) remains the same across different values of h ∈ {0, 1, . . . , p}. By (59), (61), and (63), Δl_r is positive and converges to zero in probability on the order of O_P(N^{ω−1}), but not at any faster polynomial rate. Since μ_{r+1} ≠ μ_r, by (71), for large N, γ̂_{λ̂_N}(h) − γ̂_{λ̃_N}(h) is also positive, converging to zero in probability on the order of O_P(N^{ω−1}), but not any faster.
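The interval-sum rate invoked above is a standard short-memory computation. As a hedged sketch (the linear-process representation W_t = Σ_j ψ_j Z_{t−j} with absolutely summable coefficients, i.i.d. mean-zero innovations Z_t, and autocovariance function γ_W are assumptions made here for illustration, not quoted from the paper):

```latex
\operatorname{Var}\Bigl(\sum_{t=a}^{b} W_t\Bigr)
  = \sum_{s=a}^{b}\sum_{t=a}^{b} \gamma_W(s-t)
  \le (b-a+1)\sum_{h=-\infty}^{\infty} |\gamma_W(h)|
  = O(b-a),
\qquad\text{so}\qquad
\sum_{t=a}^{b} W_t
  = O_P\bigl((b-a)^{1/2}\bigr)
  = O_P\bigl(N^{\xi/2}\bigr)
```

by Chebyshev's inequality, when b − a = O_P(N^ξ). This is the source of the N^{ω/2−1} orders appearing after centering and dividing by N.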
Following similar reasoning, if λ̂_j lies to the right of λ^0_r, the result in (71) still holds. This conclusion does not change if λ̂_{j−1} lies to the right of λ^0_{r−1} (or λ̂_{j+1} to the left of λ^0_{r+1}): we can simply take Δl_{r−1} = 0 (or Δl_{r+1} = 0) and all the above derivations hold unaltered.
Next, we show that for sufficiently large N, the model λ̃_N has a smaller BMDL than the model λ̂_N. Proposition 2 shows that f(δ²) in (43) is strictly increasing in δ². A similar argument applies here after replacing δ² by the limit of (71), which is (μ_{r+1} − μ_r)²Δl_r. This yields the corresponding difference of the Yule-Walker estimators σ̂²_YW in (38) under the models λ̂_N and λ̃_N. Furthermore, this difference is positive and converges to zero in probability on the order of O_P(N^{ω−1}), but not at any faster polynomial rate. By Lemma 3, the BMDL estimator σ̂² = σ̂²_YW + O_P(1/N); thus, the difference of the BMDL estimators σ̂² under the two models satisfies (72), the last equality there stemming from ω > 0. This shows that (72) is dominated by σ̂²_{λ̂_N,YW} − σ̂²_{λ̃_N,YW}, and is thus positive and converges to zero in probability on the order of O_P(N^{ω−1}) (but not at any faster polynomial rate). Since ω > 0, (σ̂²_{λ̂_N} − σ̂²_{λ̃_N})/N^{ω/2−1} diverges in probability; i.e., for a strictly positive constant C, when N is large enough, σ̂²_{λ̂_N} − σ̂²_{λ̃_N} ≥ CN^{ω/2−1} with probability tending to one. Recall that the model λ̃_N contains the same number of changepoints as the model λ̂_N; the BMDL difference therefore follows, where the last equality uses lim_{N→∞}(1 + x/N)^N = e^x and ω ≤ 1. Hence, BMDL(λ̂_N) − BMDL(λ̃_N) diverges to infinity at rate N^{ω/2} or faster whenever ω > 0. Here, a contradiction arises, since λ̂_N minimizes the BMDL.
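The order of the BMDL gap can be seen from the variance term alone. As a hedged sketch, assuming (these are illustrative assumptions, not quoted from the paper) that the BMDL objective contains the usual Gaussian code length (N/2) log σ̂² and that the penalty terms cancel for the two models with equally many changepoints:

```latex
\frac{N}{2}\log\frac{\hat\sigma^2_{\hat\lambda_N}}{\hat\sigma^2_{\tilde\lambda_N}}
  = \frac{N}{2}\log\Bigl(1 +
      \frac{\hat\sigma^2_{\hat\lambda_N}-\hat\sigma^2_{\tilde\lambda_N}}
           {\hat\sigma^2_{\tilde\lambda_N}}\Bigr)
  \ge \frac{N}{2}\log\Bigl(1 + \frac{C N^{\omega/2-1}}
           {\hat\sigma^2_{\tilde\lambda_N}}\Bigr)
  \asymp \frac{C}{2\sigma^2}\,N^{\omega/2},
```

since log(1 + x_N) ∼ x_N as x_N → 0 (equivalently, (1 + x/N)^N → e^x). The gap thus diverges at rate N^{ω/2} whenever ω > 0, which is the contradiction drawn above.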
In Theorem 1, the convergence rate in (21) comes from Lemma 6. Now the proof of (20) is given.
A proof of (20) in Theorem 1. In the proof of Lemma 6, λ * ⊃ λ 0 . To verify (20), we need only show that λ * = λ 0 ; in other words, there are no changepoints in λ * that are not in λ 0 .
Proof by contradiction will again be used. Suppose that for large N, the BMDL estimator λ̂_N contains more than m_0 changepoints. More specifically, suppose that within the (r + 1)th regime of the true model λ^0, redundant changepoints are estimated in λ̂_N; i.e., (73) holds for some integer d > 1. On the other hand, under the model λ̃_N, (74) holds. When N is large, δ_t = δ_{t−h} for all but a finite number of times t; hence, (75) holds for the second term (and similarly for the third term) in (34). By (73) and (74), under the models λ̂_N and λ̃_N, the difference of δ_t is piecewise constant, as given in (76). For the three time intervals in (76), their lengths are Δl_r + a_1 = O_P(N^{ξ_1}), l_{r+1} − a_1 − a_d = O_P(1), and a_d + Δl_{r+1} = O_P(N^{ξ_d}), respectively, with ξ_1, ξ_d ∈ [ω, 1]. When (75) is decomposed into three sums over these intervals, its difference under the models λ̂_N and λ̃_N follows by (68). By Lemma 6, ω ≤ 0; hence, for the second term (and similarly for the third term) in (34), the difference under the two models converges to zero at rate O_P(1/N).
The difference between BMDL(λ̂_N) and BMDL(λ̃_N) will now be studied. Recall that λ̂_N has d − 1 more changepoints than λ̃_N. By (48) and (49), the BMDL difference is positive, the second equality in the computation following from (79). This contradicts the fact that λ̂_N minimizes the BMDL.
Following the proof of Lemma 3, the BMDL estimators for s and μ have the following limits, obtained after rewriting (80).

B.1. Simulation Examples
Additional figures related to our simulation examples in Section 5 are included here.

B.2. Tuscaloosa Data Analysis: Target Minus Reference
A reference series is a record from a station near the target station that is subtracted from the target series. The idea is that two nearby stations should experience similar weather; hence, any trends or seasonal cycles should be lessened (if not altogether removed) in the target minus reference subtraction. Changepoints with artificial causes, rather than real climate changes, are easier to detect (visually) in target minus reference comparisons. Following Lu, Lund and Lee (2010), our reference series is obtained by averaging three nearby stations: Aberdeen, MS; Greensboro, AL; and Selma, AL. By averaging multiple reference series (this is called a composite reference), impacts of mean shifts in any of the individual stations in the composite reference are lessened. Figure 10 shows the optimal changepoint configuration for the target minus reference series; it contains 12 concurrent changes: June 1914, January 1919, 1933, July 1937, August 1937, October 1938, December 1938, June 1946, July 1946, November 1956, May 1987, and October 1996. Among them, the 1956 and 1987 changepoints are in the metadata; the two changepoints in 1938 are close to the 1939 station relocation. The changepoints in 1919, 1933, and 1990 are also flagged by Lu, Lund and Lee (2010). One of the shifts, November 1956, moves the Tmax series warmer and the Tmin series colder.
The October and December 1938 changepoints are likely due to typos in the data record. Specifically, the October and November 1938 Tmin values in the  target minus reference series appear to be abnormally high. While the data have been quality checked, some errors persist. This conjecture is made because the three reference stations lie in various directions from Tuscaloosa; climatologically, series to the north and west of Tuscaloosa should be cooler and those to the south and east should be warmer. In this case, Tuscaloosa was significantly warmer than all three references. Similar statements apply to the two "outlier" changepoints in 1937, and the two changepoints in 1946, where the Tmin records for Tuscaloosa are lower than those for all three reference stations. It is interesting that our method picked up outliers.
It is natural to flag more changepoints in the target minus reference series than in the target series alone. An ideal reference series should have the same trend and seasonal cycles as the target series and be free of artificial mean shifts. This said, we do not assume that the target minus reference comparison completely removes the monthly mean cycle; indeed, Liu et al. (2016) show that this is seldom the case. Reference series selection is a problem currently studied by climatologists. As our reference series averages three neighbor stations, mean shifts in any of the reference records may induce shifts in the target minus reference series. For example, the estimated changepoint in 1914 is close to the 1915 metadata time listed in the Aberdeen reference. This said, averaging three neighbors should help mitigate the effects of changepoints in any individual reference series.
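The composite-reference construction described above is simple to state computationally. The following is a minimal illustrative sketch with synthetic toy data (not the Tuscaloosa or reference station records); the function name and series are hypothetical:

```python
def target_minus_reference(target, references):
    """Subtract a composite (averaged) reference from a target series.

    target: list of monthly values for the target station.
    references: list of equal-length lists from nearby stations.
    """
    n = len(references)
    # Composite reference: pointwise mean of the neighbor series.
    composite = [sum(vals) / n for vals in zip(*references)]
    return [t - c for t, c in zip(target, composite)]

# Toy illustration: a mean shift of 3 degrees in one of three neighbors
# is diluted to a shift of 1 degree in the target-minus-reference series.
target = [0.0] * 6
r1 = [0.0, 0.0, 0.0, 3.0, 3.0, 3.0]  # neighbor with its own mean shift
r2 = [0.0] * 6
r3 = [0.0] * 6
diff = target_minus_reference(target, [r1, r2, r3])
```

Averaging several neighbors thus attenuates, but does not remove, shifts originating in any single reference record, consistent with the Aberdeen example above.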