Inference on the change point under a high dimensional sparse mean shift

We study a plug-in least squares estimator for the change point parameter, where the change is in the mean of a high dimensional random vector under subgaussian or subexponential distributions. We obtain sufficient conditions under which this estimator possesses sufficient adaptivity against plug-in estimates of the mean parameters in order to yield an optimal rate of convergence O_p(ξ⁻²) on the integer scale. This rate is preserved while allowing high dimensionality as well as a potentially diminishing jump size ξ, provided s log(p ∨ T) = o(√(T l_T)) or s log^{3/2}(p ∨ T) = o(√(T l_T)) in the subgaussian and subexponential cases, respectively. Here s, p, T and l_T represent a sparsity parameter, the model dimension, the sampling period and the separation of the change point from its parametric boundary, respectively. Moreover, since the rate of convergence is free of s, p and logarithmic terms of T, it allows the existence of limiting distributions under high dimensional asymptotics. These distributions are then derived as the argmax of a two-sided negative drift Brownian motion or a two-sided negative drift random walk, under vanishing and non-vanishing jump size regimes, respectively, thereby allowing inference on the change point parameter. Feasible algorithms for implementation of the proposed methodology are provided. Theoretical results are supported with Monte Carlo simulations. MSC2020 subject classifications: Primary 62F10, 62F12, 62F03.


Introduction
In many applications of current scientific interest, assuming stationarity of the mean of a time series over an extended sampling period may be unrealistic and may lead to flawed inference. Dynamic time series characterized via mean changes across unknown time points form a simplistic yet useful tool to model non-stationarity of large streams of data. In this article we consider the model

x_t = θ₁⁰ + ε_t, t = 1, ..., τ⁰,
x_t = θ₂⁰ + ε_t, t = τ⁰ + 1, ..., T. (1.1)

The observed variables here are x_t = (x_{t1}, x_{t2}, ..., x_{tp})ᵀ ∈ ℝᵖ, t = 1, ..., T. The variables ε_t = (ε_{t1}, ..., ε_{tp})ᵀ ∈ ℝᵖ are unobserved zero mean random variables, which are allowed to be subgaussian or subexponential. The unknown parameters are the mean vectors θ₁⁰, θ₂⁰ ∈ ℝᵖ and the change point parameter τ⁰ ∈ {0, 1, 2, ..., T}, with the latter being of main interest. The model dimension p is allowed to be fixed or diverging, potentially much faster than the sampling period T. The boundary points τ⁰ = 0, T characterize the 'no change' case, i.e., a static model where no realizations from the corresponding second distribution are observed. These boundary points are considered in order to present additional theoretical insights into the estimation of τ⁰ later in the manuscript; however, they shall not be pursued from an inference perspective. Our objective throughout the article is that of inference on τ⁰ when it exists, i.e., the construction of asymptotically valid confidence intervals for the change point parameter when it is not at the boundary of its parametric space. We mention here that several solutions for the boundary problem (detection) of testing the null hypothesis H₀: τ⁰ = T in the high dimensional setting are available in the literature; see, e.g., [24], [38], [17], [13] and [32], amongst others.
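For concreteness, the following minimal sketch (Python, with purely illustrative values for T, p, τ⁰, the sparsity and the mean vectors; none of these are fixed by the text) generates data from model (1.1) under a Gaussian, and hence subgaussian, error distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

T, p, tau0, s = 200, 500, 120, 5        # illustrative values, not prescribed by the text
theta1 = np.zeros(p)                     # pre-change mean theta_1^0
theta2 = np.zeros(p)
theta2[:s] = 1.0                         # jump eta^0 = theta_1^0 - theta_2^0 is s-sparse
xi = np.linalg.norm(theta1 - theta2)     # jump size xi = ||eta^0||_2

eps = rng.normal(size=(T, p))            # zero mean subgaussian errors (standard normal here)
X = np.where(np.arange(T)[:, None] < tau0, theta1, theta2) + eps   # rows 1..tau0 use theta_1^0
```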
To proceed further, define the jump vector and the jump size, which are fundamentally related to the properties of any change point estimator. Let

η⁰ = (θ₁⁰ − θ₂⁰), and ξ = ‖η⁰‖₂. (1.2)

The problem of change point estimation in the high dimensional setting has received significant attention in the recent past, and several different estimators have been proposed. A large proportion of this literature provides near optimal localization error bounds of the form |τ̂ − τ⁰| ≤ O(ξ⁻²a_T), where a_T → ∞, with probability (w.p.) → 1. For example, in the case of a single change point, the results of [21] yield a_T = log T with a least squares estimator together with a total variation penalty, [40] provide a_T = log log(T) with a projected cusum estimator, and those of [12] yield a_T ≥ log²(T) in the case of a single change point. While near optimal rates of approximation are informative from an estimation perspective, from an inference perspective one requires a change point estimator to obey an optimal rate of convergence of O_p(ξ⁻²) in order to allow the existence of limiting distributions and in turn allow inference on τ⁰. The literature on this inference perspective is very sparse. In a setting where p is increasing with T, [7] and [8] develop limiting distribution results while assuming ξ⁻¹√(p/T) → 0. These results yield non-degenerate limiting distributions provided p ≪ T. However, due to the assumption made on the rate of the jump size, they do not extend to the high dimensional case where p may be diverging faster than T; in that case the assumption on the jump size made in these articles necessitates ξ → ∞ and consequently allows only a degenerate limiting distribution to remain valid. The article [2] provides a limiting distribution result under the even stronger assumption ξ⁻¹√p → 0. Two other recent articles in this direction are [10] and [37]; both works are under dense alternatives similar to those of [7] and [8]. In particular, [37] require the dimension p to be necessarily diverging faster than T, but slower than T²/log T, in the case ξ = O(1). The article [10] provides results while also allowing temporal dependence.
More generally, in the high dimensional setting, the question of an optimal rate of estimation, and that of inference on the change point parameter in the non-degenerate case where the jump size is not assumed to be diverging, remains unaddressed. The viability of the question itself comes from the recent work of [27], who show that, assuming sparsity of the jump vector, much weaker signals in the jump size are detectable. Specifically, they show that the region of detectability of the change point satisfies a rate of ξ⁻¹√(s log(p ∨ T)/T) ≤ c, in a minimax sense, up to other logarithmic terms in s and T, and under restrictions on the sparsity parameter s. We refer to their article for the precise rate, which involves a triple iterated log expression. A corresponding result in the univariate setting has been provided in [36].
A more precise description of the purpose of this article requires additional notation. For any T ≥ 2, p ≥ 3 (we assume p ≥ 3 throughout the article so that log p ≥ 1; this is not a necessary condition and is only assumed to ease the notational complexity of the results and proofs), τ ∈ {0, ..., T}, and θ₁, θ₂ ∈ ℝᵖ, consider the squared loss

Q(τ, θ₁, θ₂) = Σ_{t=1}^{τ} ‖x_t − θ₁‖₂² + Σ_{t=τ+1}^{T} ‖x_t − θ₂‖₂², (1.3)

where the first sum is defined to be zero when τ = 0, and similarly for the second sum on the boundary τ = T. Assume for the time being the availability of some estimates θ̂₁ and θ̂₂ of the mean parameters of the model (1.1), and consider the following plug-in estimator utilizing these nuisance estimates:

τ̃ := τ̃(θ̂₁, θ̂₂) = arg min_{0≤τ≤T} Q(τ, θ̂₁, θ̂₂). (1.4)

The overarching objective of this article is to study the inference properties of the estimator τ̃ in the assumed setting, allowing high dimensionality and weak requirements on the jump size that permit non-degenerate limiting distributions, e.g., a potentially diminishing jump size. Establishing the existence of limiting distributions first requires suitable estimation properties to hold, which forms the first main contribution of this article.
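As a reading aid, the criterion (1.3) and the plug-in estimator (1.4) may be transcribed as follows; this is a direct, unoptimized sketch under the stated convention that sums over empty segments equal zero.

```python
import numpy as np

def Q(tau, X, theta1_hat, theta2_hat):
    """Squared-loss criterion (1.3); sums over empty segments equal zero."""
    left = ((X[:tau] - theta1_hat) ** 2).sum()
    right = ((X[tau:] - theta2_hat) ** 2).sum()
    return left + right

def plugin_lse(X, theta1_hat, theta2_hat):
    """Plug-in estimator (1.4): argmin of Q over tau in {0, ..., T}."""
    T = X.shape[0]
    losses = [Q(tau, X, theta1_hat, theta2_hat) for tau in range(T + 1)]
    return int(np.argmin(losses))
```

With the simulated data of the previous sketch and the true means, `plugin_lse(X, theta1, theta2)` illustrates the known-nuisance case discussed below.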
In particular, we shall show that τ̃ yields an optimal rate of convergence O_p(ξ⁻²) under a subgaussian or subexponential setting and a nearly arbitrary spatial dependence structure. New arguments are developed to obtain this result, including a novel application of Kolmogorov's inequality for partial sums (see Theorem B.1). Moreover, we obtain sufficient conditions on the nuisance estimates θ̂₁ and θ̂₂ required to achieve the optimal rate O_p(ξ⁻²), or a near optimal rate O_p(ξ⁻²s log(p ∨ T)). These sufficient conditions on nuisance estimation are stated as an inter-relationship between the ℓ₂ error of the nuisance estimates and the jump size (Condition C.1 and Condition C.2). Formulating sufficient conditions on the nuisance parameters with respect to the jump size provides some surprising insights. For example, it allows us to show that the estimation of a change point parameter in itself does not require many assumptions that are typically thought of as necessary conditions in the literature, including a rate condition on the separation of τ⁰ from the parametric boundary, a rate condition on the jump size, and a rate of divergence of the model dimension. Instead, these assumptions arise from the nuisance estimation aspect of the overall process of change point estimation. This is best observed in the case where the nuisance parameters are known: here these sufficient conditions are trivially satisfied with θ̂₁ = θ₁⁰ and θ̂₂ = θ₂⁰, and τ̃ yields an optimal rate O_p(ξ⁻²), where ξ > 0 may be converging arbitrarily fast towards zero, the model dimension may be diverging arbitrarily fast with respect to T, and even the existence of the change point (τ⁰ = 0 or T) is not required. This case of known nuisance parameters is clearly infeasible in practice and is only meant to illustrate the above subtlety. The main requirement to obtain an O_p(ξ⁻²) rate for τ̃ in the usual case of unknown nuisance parameters shall effectively take the form of the ℓ₂ error bound (1.5) on the nuisance estimates, holding with probability 1 − o(1), under the weak condition (1.6) on the rates of the model parameters for the subgaussian and subexponential cases, respectively, and for a suitably chosen small enough constant c_{u1} > 0. Here s is a sparsity parameter defined later in (2.1), σ is a variance proxy parameter (Condition A) and l_T is a sequence separating τ⁰ from the parametric boundary (Condition D). Despite irregular p-dimensional nuisance estimates in the construction of τ̃, it shall yield an optimal O_p(ξ⁻²) rate of convergence. This indicates that, under the assumed mild conditions largely described above, the estimator τ̃ statistically behaves as if the nuisance parameters were known. This property of an estimator is typically referred to as adaptation, as described in [9], but is observed here in a high dimensional sense. An indirect but informative comparison is with recent results on inference on regression coefficients in high dimensional regression models, where, for the estimation of a component of the regression vector, it is known that the least squares estimator itself is not sufficiently adaptive against nuisance parameter estimates (estimates of the remaining regression vector components) to allow for an optimal rate of convergence.
Instead, certain corrections to the least squares loss or its first order moment equations, such as debiasing ([34]) or orthogonalization ([3], [11], [5] and [28]), induce sufficient adaptivity against nuisance estimates and thereby allow optimal estimation of the target regression parameter. The results of this article show that, in the context of change point estimation, the plug-in least squares estimator (1.4) itself possesses the required adaptivity against potentially high dimensional nuisance estimates, in order to allow O_p(ξ⁻²) estimation of the change point τ⁰, provided the nuisance parameters are estimated with sufficient precision.
It may be observed that taking advantage of sparsity yields conditions (1.6) that are weaker than those assumed in [2], [7] and [8]. This is best seen by noting that conditions (1.6) allow a diminishing jump despite high dimensionality, under the restrictions s log(p ∨ T) = o(√(T l_T)) and s log^{3/2}(p ∨ T) = o(√(T l_T)), in the subgaussian and subexponential cases, respectively. We also mention here that while the estimators studied in [2] and [8] are also based on a squared loss, they consider a grid search least squares, where the estimation of the nuisance parameters θ₁⁰ and θ₂⁰ is carried out internally in the change point estimation mechanism. This is in contrast to the plug-in least squares estimator (1.4), where the nuisance estimation has been separated from the change point estimation. This separation is crucial, since it allows the nuisance estimates to be computed separately and made suitable for high dimensional approximation of the mean vectors via regularization, whereas a grid search least squares by construction disallows this capability.
Another observation here is that these conditions impose a stronger requirement on the jump size in the subexponential case in comparison to the subgaussian case. While we do not prove that these are necessary assumptions for an O_p(ξ⁻²) rate of τ̃, the tail probability bounds (e.g., Bernstein's inequality) that force the conditions (1.6) are known to be sharp. It is thus reasonable to speculate that this relationship between the jump size, dimensionality and the underlying distribution is inherent to achieving an optimal rate of convergence O_p(ξ⁻²) of the change point estimator, and not an artifact of our argument. Further circumstantial evidence towards this comes from the following additional result. We show that a near optimal rate O_p(ξ⁻²s log(p ∨ T)) of τ̃ can be obtained under a weaker condition than (1.5) on the nuisance estimates, which in turn requires a weaker restriction on the model parameters, common to both the subgaussian and subexponential settings. Notably, the distinction in the required conditions (1.6) for the two classes of distributions is no longer present when only a near optimal rate is of interest. An intuitive explanation for this behavior is as follows. One among a few quantities that controls the rate of τ̃ is the tail behavior of a stochastic term of the form ‖Σ_{t=τ⁰+1}^{τ⁰+k} ε_t‖_∞, uniformly over k ≤ k*. Note that when k* ≥ log(p ∨ T) is diverging with T, then for sufficiently large T this tail behavior is of the same order under both the subgaussian and subexponential cases. When only a near optimal rate is of interest, it is sufficient to examine this case with a diverging k*. However, this is no longer true in the case where k* is finite. In this case, the heavier tail of the subexponential distribution is realized in the corresponding tail bound of the underlying stochastic term, and in turn in the assumption required to retain an optimal O_p(ξ⁻²) rate of convergence.
The second main contribution of this article is about inference on the change point parameter. Note that in the case where ξ → ∞, the rate O_p(ξ⁻²) directly yields a degenerate limiting distribution. It is thus sufficient to restrict this analysis to ξ = O(1). We show that the optimal rate of τ̃, together with peripheral results, allows for the existence of limiting distributions of τ̃ in both vanishing and non-vanishing jump size regimes, the forms of which are then derived. More precisely, under the vanishing jump regime ξ → 0, we obtain

ξ²σ⁻²_∞(τ̃ − τ⁰) ⇒ arg max_{ζ∈ℝ} (2W(ζ) − |ζ|),

where σ²_∞ := lim ξ⁻²η⁰ᵀΣη⁰ and W(·) is a two-sided Brownian motion on ℝ. It may be observed that the form of the limiting distribution obtained here is the same as that obtained in the one dimensional setting ([1]). The distribution of arg max_{ζ∈ℝ} (2W(ζ) − |ζ|) is well studied in the literature, and its cdf, and thus its quantiles, are readily available ([41]).
The limiting distribution in the non-vanishing case of ξ → ξ_∞ > 0 necessitates a further parametric assumption (Condition A′) on the form of the underlying distribution, the reason for which is discussed in Section 3 later in the manuscript. The literature on this case, even in the classical fixed p setting, is quite sparse. Some relevant articles in this direction are [23] and [18]. However, the results of these articles do not allow an extension to the case where the dimension p is a function of T. When p is allowed to grow with T, but p ≪ T, the only articles we are aware of that consider the non-vanishing case are [7] and [8]. However, our result to follow is quite different in comparison to theirs and is additionally valid in the high dimensional setting. To describe this distribution, let L represent the parametric form of the distribution of the random variable 2η⁰ᵀε_t − ξ², and define the following negative drift two-sided random walk initialized at the origin: C_∞(ζ) = Σ_{t=1}^{ζ} z_t for ζ ∈ {1, 2, ...}, C_∞(0) = 0, and C_∞(ζ) = Σ_{t=1}^{|ζ|} z*_t for ζ ∈ {−1, −2, ...}, where z_t, z*_t are independent copies of an L(−ξ²_∞, 4ξ²_∞σ²_∞) distribution, which are also independent over all t. The notation in the arguments of L(·, ·) is representative of the mean and variance of this distribution, where the limits ξ_∞ and σ²_∞ are as described earlier. Then we obtain (τ̃ − τ⁰) ⇒ arg max_{ζ∈Z} C_∞(ζ), where Z represents the collection of integers. Quantiles of this distribution can be approximated numerically, thereby enabling the construction of asymptotically valid confidence intervals. We emphasize that asymptotics here are in a high dimensional sense, where the sampling period T → ∞ and the dimension p may be fixed or be allowed to diverge, potentially at an exponential rate in T. Clearly, all of the above discussion on estimation and inference on τ⁰ relies critically on nuisance estimates θ̂₁ and θ̂₂ satisfying suitable conditions, which have not yet been explicitly defined. We postpone this discussion to Section 4, where the construction of these nuisance estimates is discussed, along with the validity of the assumed sufficient conditions. Section 2 and Section 3 study the proposed plug-in least squares estimator τ̃ and provide a rigorous description of the estimation and inference results discussed above. Section 5 provides Monte Carlo simulations numerically supporting the theoretical results developed in this article. We conclude this section with a short note on the notation used throughout the article.
Notation: Throughout the paper, ℝ represents the real line. For any vector δ ∈ ℝᵖ, ‖δ‖₁, ‖δ‖₂ and ‖δ‖_∞ represent the usual ℓ₁ norm, Euclidean norm and sup-norm, respectively. For any set of indices U ⊆ {1, 2, ..., p}, let δ_U = (δ_j)_{j∈U} represent the subvector of δ containing the components corresponding to the indices in U. Let |U| and Uᶜ represent the cardinality and complement of U. We denote a ∧ b = min{a, b} and a ∨ b = max{a, b}, for any a, b ∈ ℝ. We use the generic notation c_u > 0 to represent universal constants that do not depend on T or any other model parameter. All limits are with respect to the sample size T → ∞. The notation ⇒ represents convergence in distribution.

Assumptions and estimation properties
In this section we state sufficient conditions and theoretical results regarding the estimation properties of the plug-in least squares estimator τ̃ of (1.4).
Condition A (On underlying distributions). We assume that the underlying distribution in model (1.1) obeys one of the following two conditions. (I) (Subgaussian case) The vectors ε_t, t = 1, ..., T, are independent and zero mean, with δᵀε_t ∼ subG(σ²) for every δ ∈ ℝᵖ with ‖δ‖₂ = 1. (II) (Subexponential case) The vectors ε_t, t = 1, ..., T, are independent and zero mean, with δᵀε_t ∼ subE(σ²) for every δ ∈ ℝᵖ with ‖δ‖₂ = 1. Subgaussian and subexponential are well known classes of distributions, with the latter being heavier tailed than the former. Distributions included in class (I) are: the Gaussian distribution, any bounded distribution, asymmetric mixtures of Gaussian distributions, etc. Examples of distributions included in class (II) are: the Laplace distribution, the mean centered exponential distribution, the mean centered chi-square, and centered mixtures of these distributions, amongst several others. The monograph [35] provides a detailed study of the behavior of these classes of distributions. This assumption is significantly weaker than assuming a Gaussian distribution, as in [40], but it requires lighter tail behavior in comparison to [8], who assume a finite fourth moment of the underlying distribution. However, as discussed in Section 1, the inference results of [7] and [8] do not extend to the p ≫ T setting considered here. Moreover, our results indicate that achieving an optimal rate O_p(ξ⁻²) in the high dimensional setting leads to the required rate of the jump size indeed being influenced by the tail behavior of the underlying distribution (see (1.6)). It is thus expected that an assumption of heavier tails will lead to further stringent requirements on this rate, although we do not pursue this further in this article.
Condition B assumes a positive definite dependence structure, i.e., the covariance matrix Σ = Eε_tε_tᵀ satisfies 0 < κ² ≤ mineigen(Σ) ≤ maxeigen(Σ) ≤ φ² < ∞. Note here that Condition A is inherently related to Condition B, since the former by construction imposes that maxeigen(Σ) = O(σ²), where σ² is the variance proxy parameter. This can be observed as follows: for all δ ∈ ℝᵖ with ‖δ‖₂ = 1, from Definition B.3 we have δᵀε_t ∼ subG(σ²) (or subE(σ²)), and thus δᵀΣδ = var(δᵀε_t) ≤ c_uσ². A relaxation of this assumption to allow φ² → ∞ and κ² → 0 is feasible. In the context of estimation results, the bounds for the localization error of τ̃, and thereby its rate of convergence, provided later in this section are obtained up to universal constants. Consequently, the effect of this relaxation would be directly observable in these bounds. Specifically, the case where φ² is allowed to diverge will lead to a deceleration of the rate of convergence of τ̃. Note from the discussion in the previous paragraph that if the maximum eigenvalue of Σ is allowed to diverge with T, then the variance proxy parameter σ² must necessarily be allowed to diverge at the same rate. Consequently, without loss of generality, one may assume that φ² and σ² are of the same order, i.e., φ² ≍ σ². The reason we mention this is that the effect of a diverging φ² would be observed via σ² in the bounds to be presented later in this section.
From an inference perspective, the extreme eigenvalues of Σ shall impact the limiting distribution of τ̃ through a variance parameter of the form ξ⁻²η⁰ᵀΣη⁰. Thus a relaxation to the case of φ² → ∞ and κ² → 0 would be feasible upon a rescaling with this underlying variance. However, we do not explicitly illustrate this in our results, in the interest of simplicity of exposition.
Next consider the following sets of non-zero indices of θ₁⁰ and θ₂⁰,

S₁ = {j ∈ {1, ..., p}: θ₁j⁰ ≠ 0} and S₂ = {j ∈ {1, ..., p}: θ₂j⁰ ≠ 0}, (2.1)

and let S₁ᶜ and S₂ᶜ be the complement sets. Define the maximum cardinality |S₁| ∨ |S₂| = s ≥ 1. The parameter s measures sparsity in the model (1.1). To allow the viability of this assumption one may center the observed data with column-wise means, i.e., consider x_t of model (1.1) where, instead of the means θ₁⁰, θ₂⁰, the jump η⁰ is s-sparse, i.e., there is a mean change in at most s components. Upon centering x_t with column-wise empirical means, x*_t = x_t − x̄, t = 1, ..., T, with x̄ = T⁻¹Σ_{t=1}^T x_t, the sparsity of η⁰ is transferred onto the new mean vectors θ*₁ = Ex*_t, t ≤ τ⁰, and θ*₂ = Ex*_t, t > τ⁰. Heuristically, this centering operation is the same as that carried out in linear regression models to remove the intercept parameter, which is implicitly assumed in the high dimensional linear regression literature. The sparsity assumption is typically made on the jump vector η⁰, as done in [40] and [17]. In contrast, we make this assumption directly on the mean vectors θ₁⁰ and θ₂⁰, and in Appendix C we show that this assumption holds without loss of generality with respect to a sparsity assumption on the jump vector η⁰, in the context of the problem under consideration.
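The centering operation can be sketched in a few lines. Note that, up to the sampling error in x̄, the centered population means become θ*₁ = ((T − τ⁰)/T)η⁰ and θ*₂ = −(τ⁰/T)η⁰, both supported on the support of η⁰; this is the sense in which the sparsity of the jump is transferred onto the new mean vectors.

```python
import numpy as np

def center_columns(X):
    """x_t^* = x_t - x_bar, with x_bar the column-wise empirical mean."""
    return X - X.mean(axis=0, keepdims=True)
```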
The remainder of this section is divided into two subsections. These subsections present near optimal and optimal rates of convergence of τ̃, and the sufficient conditions on the nuisance estimates θ̂₁ and θ̂₂ required for the same. As noted in Section 1, near optimal rates in themselves do not allow inference.
However, these results are still relevant, since they provide new insight into the distinctions between the sufficient conditions required to achieve optimality over near optimality. Moreover, they shall also serve as a stepping stone in the construction of a feasible methodology for obtaining an optimal estimator, considered in Section 4.

Near optimal O_p(ξ⁻² s log(p ∨ T)) estimation of τ⁰
We begin with the following condition on the nuisance estimates θ̂₁ and θ̂₂.
Condition C.1 (On nuisance estimates θ̂₁, θ̂₂ for near optimality of τ̃). Let π_T → 0 be a positive sequence and assume that the following two properties hold with probability at least 1 − π_T.
(I) The estimates preserve the sparsity structure, i.e., θ̂₁ and θ̂₂ are supported on S₁ and S₂, respectively, where S₁ and S₂ are as defined in (2.1). (II) The nuisance estimates satisfy the following bound on the ℓ₂ error: ‖θ̂₁ − θ₁⁰‖₂ ∨ ‖θ̂₂ − θ₂⁰‖₂ ≤ c_uξ. Condition C.1 is an exceptionally weak condition on the quality of nuisance estimates. All it requires is that the ℓ₂ error in the estimation of the mean parameters be of the order of the jump size, which may potentially be weaker than assuming ordinary consistency, i.e., an o_p(1) approximation. To see this, consider the case where the jump size ξ is bounded below by a constant; then these nuisance estimates are allowed to be inconsistent. Perhaps surprisingly, such nuisance estimates shall still be sufficient for near optimal estimation O_p(ξ⁻²s log(p ∨ T)) of the change point parameter. We can now state our first result (Theorem 2.1), which bounds the localization error of τ̃, thereby also yielding a near optimal rate of convergence in both the subgaussian and subexponential settings. Under Conditions A(I) (subgaussian setting), B and C.1, for any T ≥ 2 and c_u > 2, it yields a localization bound of order ξ⁻²s log(p ∨ T), with probability 1 − o(1). Alternatively, under Conditions A(II) (subexponential setting), B and C.1, assuming T ≥ log(p ∨ T) and c_u > 8, it yields a corresponding bound for the subexponential case. Although Theorem 2.1 only provides a near optimal rate of convergence and not the optimal rate, it does so under a very mild condition on the relationship between the nuisance estimates and the jump size (Condition C.1).
Remark 2.1. It may be observed from Theorem 2.1 that when ξ = O(√s) and T ≥ log(p ∨ T), then under both the subgaussian and subexponential cases we have the same rate of convergence of τ̃, i.e., (τ̃ − τ⁰) = O_p(ξ⁻²s log(p ∨ T)), under the same assumption (Condition C.1) on the nuisance estimates. This illustrates that when only a near optimal rate of convergence is of interest, the heavier tail of a subexponential distribution influences neither the rate of convergence of τ̃ nor the quality of nuisance estimates required to achieve it. This shall no longer be true when instead an optimal rate is of interest. Remark 4.1 in Section 4 provides further insight in this direction.
Another observable consequence of Theorem 2.1 is that when {s log(p ∨ T)}^{1/2}ξ⁻¹ → 0, then under the subgaussian case we have perfect identification of the change point parameter in probability, i.e., pr(τ̃ = τ⁰) → 1. However, the same cannot be obtained from Theorem 2.1 in the subexponential case. This is because, under more general conditions than those of Remark 2.1, the localization bound under a subexponential distribution is either less precise than its subgaussian counterpart or, alternatively, requires a more rigid condition to match the estimation precision obtained under subgaussianity. This is illustrated in the following result (Theorem 2.2), where a slightly weaker bound allows perfect identifiability for the subexponential case.
Theorem 2.2 is valid for T ≥ 2, whereas the subexponential case of Theorem 2.1 requires T ≥ log(p ∨ T). The reason this distinction arises shall have significant consequences for optimal estimation and inference, in the context of the distinctions between the assumptions for subgaussian and subexponential distributions. This is pursued in the following subsection. Simply stated, when the underlying stochastic term comprises only a finite number of random variables, the heavier tail of the subexponential distribution is realized in the tail bound of this underlying stochastic term; otherwise the behavior is similar to that in the subgaussian case. For intuition, note that when an optimal rate of convergence |τ̃ − τ⁰| ≤ c_uξ⁻² is of interest and ξ ≥ c_u, then there are only a finite number of indices between τ̃ and τ⁰.

Optimal O_p(ξ⁻²) estimation of τ⁰
This subsection illustrates that τ̃ can achieve an optimal rate of convergence O_p(ξ⁻²) while allowing potentially diminishing jump sizes. The only price one needs to pay for this advantage is to ensure that the nuisance estimates θ̂₁ and θ̂₂ are of a higher quality than that in the previous subsection. To describe this behavior, we begin with a stronger version of Condition C.1 on the nuisance estimates.
Condition C.2 (Assumption on nuisance estimates for optimality of τ̃). Let π_T → 0 be a positive sequence and assume that either one of the two pairs of properties (I, II) or (I, III) holds with probability at least 1 − π_T.
(I) The sparsity property of Condition C.1(I) holds. (II) For the subgaussian case: assume that the nuisance estimates satisfy a suitably tightened version of the ℓ₂ error bound of Condition C.1(II), for an appropriately chosen small enough constant c_{u1} > 0. (III) For the subexponential case: assume that there exists a sequence r_T > 0 such that these nuisance estimates satisfy the corresponding ℓ₂ error bound at the rate r_T, for an appropriately chosen small enough constant c_{u1} > 0.
The only distinction between Condition C.2 and Condition C.1 of the previous subsection is that we have assumed a tighter bound on the nuisance estimates. This tightening has consequences for both the rate of convergence of τ̃ and the assumptions on s, p required for the feasibility of this condition. These aspects shall be discussed in detail after the following first main result, providing an optimal rate of convergence of τ̃.

Theorem 2.3. Suppose the model (1.1) holds and assume ξ > 0.
Additionally assume either one of the following two sets of conditions.

(a) Suppose Conditions A(I) (subgaussian), B and C.2 (I, II) hold. (b) Suppose Conditions A(II) (subexponential), B and C.2 (I, III) hold.
Then, for any 0 < a < 1, choosing c_a ≥ √(1/a) yields the optimal bound |τ̃ − τ⁰| = O_p(ξ⁻²), with the corresponding probability statement holding at level at least 1 − a − o(1). Theorem 2.3 provides the optimal rate of convergence of τ̃. A first look at the sufficient conditions required for this result may lead one to suspect that Theorem 2.3 provides an optimal bound without any rate conditions on the model parameters s, p, ξ, l_T. This is indeed true in a very special case, but false in general, as discussed in the following.
Consider the case where the mean parameters θ₁⁰ and θ₂⁰ are known. Here, setting θ̂₁ = θ₁⁰ and θ̂₂ = θ₂⁰ allows Condition C.2 to be trivially satisfied, irrespective of the rate of divergence of s, p. Consequently, even if τ⁰ is at a boundary (τ⁰ = 0 or T) and the dimensions s, p are diverging arbitrarily fast, τ̃ will still estimate τ⁰ at an optimal rate. The only assumption required for this case is ξ > 0, i.e., θ₁⁰ ≠ θ₂⁰. This case is clearly infeasible in practice and is only discussed to provide the following, perhaps surprising, insight. The estimation of a change point in itself does not require many assumptions that are usually thought of as necessary in the literature, including separation from the boundary, a minimum jump size, and restrictions on dimensionality. These assumptions instead arise solely from the nuisance estimation aspect of the overall process.
In the more realistic setup where θ₁⁰ and θ₂⁰ are unknown, the key to Theorem 2.3 is Condition C.2. Effectively, the use of Condition C.2 has passed the burden of assumptions on the model parameters to the nuisance estimates θ̂₁ and θ̂₂. To discuss this further we require the following boundary condition on τ⁰.
Condition D (On separation of τ⁰ from its parametric boundary). Assume the existence of a change point τ⁰ for the model (1.1), separated from its parametric boundary as τ⁰ ∧ (T − τ⁰) ≥ T l_T, for a positive sequence l_T with T l_T ≥ 1.

Clearly, all this condition requires is at least one realization from both of the two distributions characterizing model (1.1), and it is usually implicit in the literature. Under Condition D, one can obtain regularized mean estimates θ̂₁ and θ̂₂ that satisfy at best the bound

‖θ̂₁ − θ₁⁰‖₂ ∨ ‖θ̂₂ − θ₂⁰‖₂ ≤ c_uσ√(s log(p ∨ T)/(T l_T)), (2.3)

with probability 1 − o(1) (see Section 4). Now comparing (2.3) with Conditions C.2(II) and C.2(III) for the subgaussian and subexponential cases, respectively, yields the following requirements that must be satisfied for Condition C.2 to be feasible, and in turn for Theorem 2.3 to remain valid:

σξ⁻¹s log(p ∨ T) ≤ c_{u1}√(T l_T), (2.4)
σξ⁻¹s log^{3/2}(p ∨ T) ≤ c_{u1}√(T l_T), (2.5)

for a suitably chosen small enough constant c_{u1} > 0. Relations (2.4) and (2.5) describe the interplay between the model parameters ξ, l_T, s, p, T, which are all sequences in T, and the underlying class of distributions, which is then sufficient for the estimator τ̃ to achieve the optimal rate of convergence O_p(ξ⁻²).
As a direct consequence of Theorem 2.3, one may observe that when ξ → ∞, the estimator τ̃ perfectly identifies the change point parameter τ⁰ in probability, i.e., the limiting distribution of τ̃ in this case is degenerate. While this perfect identifiability under a diverging jump size is also provided by the grid search least squares estimators studied in [2] and [8], the assumption made here on the diverging jump size is weaker given high dimensionality. This can be observed in the rate at which ξ is required to diverge; e.g., for the same result to hold true in [8], one requires ξ → ∞ at a fast enough rate relative to the dimension p and the sampling period T. The assumption required in [2] to achieve the same perfect identifiability in the high dimensional setting is more stringent than that of [8].
From an estimation perspective, the optimal rate O_p(ξ⁻²) of τ̃ may not seem a significant improvement in comparison to the near optimal rates available in the literature for estimators in the high dimensional setting, e.g., the projected cusum estimator of [40] with a presented rate of O_p(ξ⁻² log log T); the improvement offered by τ̃ is thus only of order log log T. However, this slight improvement is critical from an inference perspective: it is only the availability of an O_p(ξ⁻²) rate that allows the existence of a limiting distribution.
The above discussion also highlights that in order for τ̃ to have a non-degenerate limiting distribution in the high dimensional setting, conditions (2.4) and (2.5) must allow ξ ≤ c_u, despite high dimensionality and while preserving the optimal rate of convergence O_p(ξ⁻²) presented in Theorem 2.3. This feasibility is summarized in the following corollary.
We conclude this section with another perspective on the discussion in this subsection. Recall that the construction of τ̃ utilizes p-dimensional nuisance estimates whose rates of convergence involve the dimensional parameters s, p (see (2.3)). However, the rate of convergence of the change point estimator itself is O_p(ξ⁻²), which is free of the dimensional parameters s, p and the sampling period T, and is valid despite high dimensionality and a potentially diminishing jump size. This alludes to the estimator τ̃ behaving as if the nuisance parameter vectors utilized in its construction were known. This property of an estimator is typically referred to as adaptation, as described in [9], but is observed here in a high dimensional sense and in the context of change point estimation. In the fixed p setting, this property of a change point estimator behaving as if the nuisance parameters are known has also been studied in [22]. There are also more recent precedents of similar behavior in the context of inference on regression coefficients in high dimensional linear regression models, where, for estimation of a component of the regression vector, certain corrections to the least squares loss or its first order moment equations, such as debiasing ([34]) or orthogonalization ([3], [11], [5] and [28]), induce sufficient adaptivity against nuisance estimates and thereby allow optimal estimation of the target regression parameter. The results of this subsection show that, in the context of change point estimation, the plug-in least squares estimator (1.4) itself possesses adaptivity against potentially high dimensional nuisance estimates, allowing O_p(ξ⁻²) estimation of the change point τ⁰, provided the nuisance parameters are estimated with sufficient precision. This adaptation shall become further visible in the following section, where limiting distributions of τ̃ are established.

Limiting distributions of τ̃ in vanishing and non-vanishing jump size regimes
This section investigates the asymptotic distributional properties of τ̃. Critically, asymptotics here are in a high dimensional sense, where s, p are allowed to be fixed or to diverge with T, with p diverging potentially exponentially in T. As noted before, the case ξ → ∞ yields a degenerate limiting distribution of τ̃. Thus, in what follows we restrict our analysis to ξ ≤ c_u, where the limiting distribution of τ̃ is non-trivial. This case is further subdivided into two distinct regimes, described in the following condition.
Condition E (On the jump size for stability of limiting distributions). Assume that the jump size is bounded above, i.e., 0 < ξ ≤ c_u. Let Σ and η⁰ be as defined in Condition B and (1.2), respectively, and additionally assume that either one of the following two conditions holds. (i) (Vanishing regime) ξ → 0 and ξ⁻²η⁰ᵀΣη⁰ → σ²_∞, for some σ²_∞ > 0. (ii) (Non-vanishing regime) ξ → ξ_∞ and ξ⁻²η⁰ᵀΣη⁰ → σ²_∞, for some ξ_∞ > 0 and σ²_∞ > 0.
The existence of the deterministic limit assumed in Conditions E(i) and E(ii) is a mild assumption, since Condition B already guarantees that the sequence under consideration is bounded above and below, i.e., 0 < κ²ξ² ≤ η⁰ᵀΣη⁰ ≤ φ²ξ² < ∞. This limit measures the variance of the underlying limiting process, which then characterizes the distribution of τ̃; thus the need for an assumption of its existence.
The vanishing and non-vanishing jump size regimes described in Condition E play a fundamental role in the distributional behavior of a change point estimator. The reason for this inherent characteristic can be directly observed by noting that the stochastic term controlling the change point estimator τ̃ behaves as a partial sum of the order of ζξ⁻² random variables of the form (2η⁰ᵀε_t − ξ²), for a constant ζ > 0. The regime ξ → 0 enables ζξ⁻² → ∞, and thus upon suitable normalization allows the functional central limit theorem to kick in and yield a Brownian motion as the resulting process over ζ. This neat property has been exploited in the classical literature under fixed dimension p to obtain distributional results under this vanishing jump size regime; see, e.g., [1]. Unfortunately, when ξ → ξ_∞ > 0, the stochastic term described earlier is no longer an infinite sum, and it is clear that the Brownian motion approximation is no longer feasible. In fact, it is also observable that any distributional result under this non-vanishing case will necessitate a further parametric assumption on the underlying distribution, since in this case the stochastic term under consideration is a finite sum.
The first result below considers the vanishing case ξ → 0. It obtains the limiting distribution of τ̃ as the distribution of the argmax of a symmetric two-sided Brownian motion with a negative drift, under suitable conditions on the quality of the nuisance estimates used in the construction of τ̃.
The following are observations regarding the sufficient conditions required for this result, and comparisons with those of Theorem 2.3, which provides the optimal rate of convergence. As before, the burden of rate assumptions on s, p and ξ has been passed onto Condition C.2, and additionally here onto the requirement (3.2), which in turn requires an inter-relationship between s, p, T, ξ, l_T similar to that discussed before in (2.4) and (2.5). In this case, however, condition (3.2) forces a marginally stronger requirement. Specifically, comparing the desired rate of r_T in condition (3.2) to the best attainable rate (2.3) of mean estimation under high dimensionality yields

σξ⁻¹s log(p ∨ T) = o(√(T l_T)) and σξ⁻¹s log^{3/2}(p ∨ T) = o(√(T l_T)), (3.3)

for the subgaussian and subexponential cases, respectively. Comparing (3.3) to the requirements (2.4) and (2.5), we observe that the additional assumption made here only tightens the rate restriction to o(1) from O(1). This illustrates the price paid to obtain the limiting distribution, in comparison to only an optimal rate of estimation. This tighter restriction is also in coherence with classical results in the fixed p setting, where the condition reduces to only a relationship between ξ, T and l_T. Additionally, since here we are restricted to the regime ξ → 0 under consideration, these sufficient conditions must be further restricted to

s log(p ∨ T) = o(√(T l_T)) and s log^{3/2}(p ∨ T) = o(√(T l_T)), (3.4)

for the subgaussian and subexponential cases, respectively. Another slightly stronger assumption made here, in comparison to Theorem 2.3, is on the sequence l_T. While the result of Theorem 2.3 is valid without the actual existence of the change point, i.e., with τ⁰ ∧ (T − τ⁰) ≥ 0, the limiting distribution of Theorem 3.1 assumes that the change point exists and is separated from the boundaries of its parametric space, i.e., Condition D holds. This additional assumption is required in order to allow both ends of the two-sided random walk to stabilize to the given Brownian motion process.
It can be observed that a change of variables in the argmax reduces the limiting process of Theorem 3.1 to the form arg max_{ζ∈ℝ}(2W(ζ) − |ζ|), which in turn yields the relation (1.8) provided in Section 1. This distribution is well studied in the literature, and its cdf was first provided by [41], which enables the computation of quantiles and in turn an asymptotically valid confidence interval for τ⁰.
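As an alternative to the closed-form cdf of [41], quantiles of arg max_{ζ∈ℝ}(2W(ζ) − |ζ|) can also be approximated by simulation. The sketch below discretizes the two-sided Brownian motion on a truncated grid; the truncation level M and step size h are tuning choices of this illustration, informally justified by the negative drift, and are not prescribed by the text.

```python
import numpy as np

def argmax_two_sided_bm(rng, M=50.0, h=0.01):
    """One draw of argmax of 2W(zeta) - |zeta| on a grid over [-M, M]."""
    n = int(M / h)
    grid = np.arange(1, n + 1) * h
    # independent Brownian motions on each side of the origin, via Gaussian increments
    w_pos = np.cumsum(rng.normal(scale=np.sqrt(h), size=n))
    w_neg = np.cumsum(rng.normal(scale=np.sqrt(h), size=n))
    vals = np.concatenate(([0.0], 2 * w_pos - grid, 2 * w_neg - grid))
    locs = np.concatenate(([0.0], grid, -grid))
    return locs[np.argmax(vals)]

rng = np.random.default_rng(1)
draws = np.array([argmax_two_sided_bm(rng) for _ in range(2000)])
q_v = np.quantile(draws, 0.975)   # (1 - alpha/2)th quantile, alpha = 0.05
```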
We now proceed to the non-vanishing regime of Condition E(ii). The literature on inference in this case is quite sparse. In the fixed p setting, the articles [23] and [18] provide generalized results on the distribution of maximum likelihood estimators. These results provide key connections of the desired distribution to a two-sided random walk. However, the results require the mean parameters to be known and the sequence ξ to be constant over the sampling period T. To the best of our knowledge, the only results in the literature that discuss this non-vanishing regime in a diverging p setting are those of [7], [8] and [37], where the first two require p diverging slower than T, and the last requires p diverging slower than T²/log T. The second main result of this section provides this limiting distribution, valid under both fixed p and high dimensional asymptotics.
Recall from the earlier discussion of Condition E that under this non-vanishing regime, the stochastic term controlling the distributional properties of the change point estimator is a finite sum (in t), with finite variance for each random variable in this sum. This disallows the use of central limit theorems and thereby makes Gaussian approximations of the limiting process infeasible (without exact normality assumptions on the data generating process). It is for this reason that the analysis of this regime requires further parametric assumptions on the underlying distribution, which are stated in the following.

Condition A′ (Additional distributional assumptions). Suppose Conditions A and B hold, and additionally assume that for any constants c₁ ∈ ℝ and c₂ ≠ 0, we have c₁ + c₂η⁰ᵀε_t ∼ L(c₁, c₂²η⁰ᵀΣη⁰), t = 1, ..., T, for some distribution L which is continuous and supported on ℝ.

The arguments in the notation L(μ, σ²) are used to represent the mean and variance of the distribution L, i.e., EL(μ, σ²) = μ and var L(μ, σ²) = σ². Note that these relations follow trivially, since Eε_t = 0 and Eε_tε_tᵀ = Σ from Conditions A and B; these parameters are notated only to complete the description of the assumed distribution and not to place additional assumptions.
The additional assumptions made in Condition A′ over those in Condition A are that of assuming an explicit form L of the distribution and assuming that it is continuous. Moreover, this condition also assumes that the form of the distribution of the linear combination of the p components of ε_t remains the same irrespective of p. An example is when the ε_t are assumed to be multivariate normal; here this linear combination will still be normal irrespective of the size of the multivariate normal vector being projected. This condition can alternatively be restated as assuming that the r.v. c₁ + c₂εᵀ_tη⁰ follows the distribution L in the limit (in T via p), without any alterations to the results to follow. The distribution being supported on ℝ is also implicitly assumed in Condition A.
To proceed further we require the following stochastic process, which shall characterize the limiting distribution of the change point estimator in the current non-vanishing regime. Let N⁺ = {1, 2, ...} and N⁻ = {−1, −2, ...}, and define the following negative drift two-sided random walk initialized at the origin:

C_∞(ζ) = Σ_{t=1}^{ζ} z_t for ζ ∈ N⁺, C_∞(0) = 0, and C_∞(ζ) = Σ_{t=1}^{|ζ|} z*_t for ζ ∈ N⁻.

Here z_t, z*_t are independent copies of an L(−ξ²_∞, 4ξ²_∞σ²_∞) distribution, which are also independent over all t. The parameters ξ_∞ and σ²_∞ are defined in Condition E(ii). In the case of unit variances and spatial uncorrelatedness of the data generating process, where Σ = Eε_tε_tᵀ = I_{p×p}, we have σ²_∞ = 1, and consequently z_t, z*_t ∼ L(−ξ²_∞, 4ξ²_∞). Under these notations we can now state the second main result of this section.
The map arg max_{ζ∈Z} C_∞(ζ) is a.s. unique and possesses a distribution supported on Z. This is shown in the proof of Theorem 3.2, although it is also quite intuitive upon observing that the two-sided random walk C_∞(ζ) has negative drift with continuously distributed increments, which in turn implies that its maximum possesses a distribution supported on [0, ∞), where it is continuously distributed on (0, ∞) and has an additional probability mass at the singleton zero.
The sole distinction between the assumptions of Theorem 3.1 and Theorem 3.2 is the change of regime from a vanishing jump size (Condition E(i)) to the non-vanishing regime (Condition E(ii)). Consequently, the observations after Theorem 3.1 on the inter-relationship between the quality of nuisance estimates, the dimensional parameters s, p and the jump size ξ retain their validity under this non-vanishing regime. In particular, these inter-related requirements can be replaced with the rate restrictions (3.3), and in turn (3.4), while maintaining the validity of Theorem 3.2. Since the analytical form of the distribution arg max_{ζ∈Z} C_∞(ζ) is unavailable, one may obtain quantiles of this distribution via a Monte Carlo simulation, i.e., by simulating the two-sided random walk process and in turn obtaining realizations from the distribution under consideration.
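A sketch of this Monte Carlo approximation is given below, assuming for illustration that L is Gaussian (as would hold under a multivariate normal data generating process); the truncation of the walk at n_steps terms is an informal choice, justified by the negative drift.

```python
import numpy as np

def argmax_two_sided_walk(rng, xi2, sig2, n_steps=500):
    """One draw of argmax of C_inf; xi2, sig2 stand for xi_inf^2 and sigma_inf^2."""
    sd = np.sqrt(4 * xi2 * sig2)
    pos = np.cumsum(rng.normal(-xi2, sd, size=n_steps))   # C(1), C(2), ...
    neg = np.cumsum(rng.normal(-xi2, sd, size=n_steps))   # C(-1), C(-2), ...
    vals = np.concatenate(([0.0], pos, neg))
    locs = np.concatenate(([0], np.arange(1, n_steps + 1), -np.arange(1, n_steps + 1)))
    return locs[np.argmax(vals)]

rng = np.random.default_rng(2)
draws = [argmax_two_sided_walk(rng, xi2=1.0, sig2=1.0) for _ in range(3000)]
q_nv = np.quantile(draws, 0.975)   # (1 - alpha/2)th quantile, alpha = 0.05
```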

Construction of a feasible O_p(ξ⁻²) estimator of τ⁰
The results of Section 2 and Section 3 allow τ̃ to provide an O_p(ξ⁻²) approximation of τ⁰, and provide limiting distributions to perform inference on the unknown change point. However, these results rely on the a priori availability of nuisance estimates θ̂₁ and θ̂₂ satisfying Condition C.2. In this section we develop an algorithmic estimator yielding nuisance estimates that are theoretically guaranteed to satisfy Condition C.2, which in turn shall yield a feasible O_p(ξ⁻²) estimate of the change point parameter.
In order to develop a feasible estimator for τ⁰, recall the following two aspects from Section 2. (a) The missing links required to implement the estimator of Section 2 are the nuisance (mean) estimates. (b) These mean estimates require either Condition C.1 (milder), to obtain a near optimal estimate, or Condition C.2 (stronger), to obtain an optimal estimate of τ⁰. We shall fulfill requirement (a) using soft-thresholded means (4.2), and furthermore utilize the distinctions between Condition C.1 and Condition C.2 to build an algorithmic estimator that improves a nearly arbitrarily chosen initializer τ̌, first to a near optimal estimate τ̃ in a first iteration, and then to an optimal estimate τ̂ in a second iteration. We remind the reader here that the specific choice of soft-thresholding as the regularization mechanism on the empirical means is incidental; the eventual objective is only to obtain mean estimates that are well behaved, in the ℓ₂ norm, in the high dimensional setting (see (2.3)). Alternatively, one may consider any suitable, possibly problem specific, regularization mechanism, e.g., group ℓ₁ regularization, which assumes a partially known sparsity structure. A sketch of the soft-thresholded means is given below.
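In the following minimal sketch of soft-thresholded segment means, playing the role of the estimates (4.2), the segment boundary `tau` stands for τ̌ (or τ̃), and the tuning parameters `lam1`, `lam2` stand for λ₁, λ₂, whose data-based choice is left unspecified here.

```python
import numpy as np

def soft_threshold(v, lam):
    """Component-wise soft thresholding: sign(v_j) * (|v_j| - lam)_+ ."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def thresholded_means(X, tau, lam1, lam2):
    """Soft-thresholded means of the two segments split at tau."""
    theta1_hat = soft_threshold(X[:tau].mean(axis=0), lam1)
    theta2_hat = soft_threshold(X[tau:].mean(axis=0), lam2)
    return theta1_hat, theta2_hat
```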
The stepwise approach of the estimator to be considered is as follows. Condition C.1 is weak enough to be satisfied by the estimates θ̂₁ = θ̂₁(τ̌) and θ̂₂ = θ̂₂(τ̌) of (4.2), computed with any nearly arbitrarily chosen τ̌ ∈ {1, ..., (T − 1)} that is marginally away from its boundaries (see Condition F below). Thus, Theorem 2.1 and Theorem 2.2 now guarantee that the update τ̃ = τ̃(θ̂₁, θ̂₂) of (1.4), computed using these mean estimates θ̂₁, θ̂₂, shall be a near optimal estimate of τ⁰. With the availability of this near optimal estimate τ̃, it can be shown that the updates θ̂₁ = θ̂₁(τ̃) and θ̂₂ = θ̂₂(τ̃) satisfy Condition C.2. This allows us to perform another update τ̂ = τ̃(θ̂₁, θ̂₂), and Theorem 2.3 now guarantees the optimality of τ̂. Thus, in performing these updates (two each of the change point and the means), we have taken a τ̌ from a nearly arbitrary neighborhood of τ⁰ and deposited it in an optimal neighborhood of τ⁰, with an intermediate τ̃ that lies in a near optimal neighborhood. This process is stated as Algorithm 1 below and is presented visually in Figure 1. To complete the description of Algorithm 1, we provide the mild sufficient condition required of the initializing choice τ̌.
Condition F (Initializing assumption). Let ψ = ‖η⁰‖_∞ and assume that the initializer τ̌ of Algorithm 1 satisfies the following relations (i) and (ii).
Here l_T is as defined in Condition D, c_u > 0 is any constant, and c_{u1} > 0 is an appropriately chosen small enough constant. The second requirement is discussed in the following, first from a theoretical and then from a practical perspective. For simplicity, consider the case when l_T ≥ c_u for some constant 0 < c_u < 1, i.e., the true change point τ⁰/T on the fractional scale is in some bounded subset of (0, 1), and that √(2s)ψ/ξ = O(1), i.e., the entries of the change vector η⁰ are roughly evenly spread across its non-zero components, without uneven diverging spikes; this is also satisfied if one assumes ψ ≤ c_u. Then requirement (ii) of Condition F is satisfied for all τ̌ in an o(T) neighborhood of τ⁰, i.e., any τ̌ satisfying |τ̌ − τ⁰| = o(T).
We shall show that, despite choosing any starting value in this o(T) neighborhood, Step 1 of Algorithm 1 shall move it into a near optimal neighborhood, following which the next iteration of Step 2 will move it to an optimal neighborhood of τ⁰, i.e., o(T)-nbd. → (Step 1) near optimal nbd., O_p(ξ⁻²s log p) → (Step 2) optimal nbd., O_p(ξ⁻²). Note here the sequential improvement in the rate of convergence from initialization to Step 2, and moreover the improvement to optimality in exactly two iterations. Another important consequence of these results is that they show the redundancy of any further iterations, in the sense that, since an optimal rate has been obtained at Step 2, performing further iterations will not yield any statistical improvement in the estimation of τ⁰. Viewing the above discussion from a rate perspective provides the theoretical argument in support of the mildness of Condition F. From a practical perspective, a valid initializer τ̌ in an o(T) neighborhood of τ⁰ can be obtained by means of a preliminary coarse grid search; e.g., one may choose a slowly diverging sequence (say log T) and choose log T equally separated values in {1, ..., T}, forming a coarse grid of possible initializers. Upon choosing the best fitting value τ̌ for Algorithm 1 from this coarse initializer grid (minimizing the squared loss), and assuming that the best fitting value is the closest to τ⁰ amongst the chosen grid points (while this is fairly intuitive, it remains to be verified analytically), the pigeonhole principle implies that this choice of τ̌ must be in a T/log T = o(T) neighborhood of τ⁰. Thereby this τ̌ shall form a theoretically valid initializer. A similar preliminary coarse grid search has also been heuristically utilized in [31] in a different model setting.
However, based on extensive numerical experiments, we observe that this preliminary coarse grid search is numerically redundant. It is observed that any arbitrarily chosen τ̌ separated from the boundaries of the parametric space yields statistically indistinguishable updates of Algorithm 1 when T is large. The reader may numerically confirm these observations using the software associated with this article. An illustration of this behavior is provided in Figure 2 with a single data set realization. In Section 5 we present results with the initializer fixed at τ̌ = ⌊T/2⌋, irrespective of the location of the true change point τ⁰. Note that in the absence of any information on τ⁰, this choice of τ̌ = ⌊T/2⌋ forms the worst, or farthest, initializer in a mean distance sense. All other values of τ̌ shall only serve to make estimation easier. Despite this worst possible choice, numerical results remain indistinguishable compared to those obtained when τ̌ is chosen with a preliminary coarse grid search. A version of this condition has also been provided in [26], in the context of near optimal estimation of a change point in linear models, together with evidence in its support.
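Putting the pieces together, Algorithm 1 can be sketched as follows, reusing the `thresholded_means` and `plugin_lse` sketches given earlier; the initializer defaults to the worst-case choice τ̌ = ⌊T/2⌋ used in Section 5, and the clipping of the intermediate update to {1, ..., T − 1} simply keeps both segments non-empty.

```python
def algorithm_1(X, lam1, lam2, tau_check=None):
    """Two-step sketch: tau_check -> tau_tilde (Step 1) -> tau_hat (Step 2)."""
    T = X.shape[0]
    if tau_check is None:
        tau_check = T // 2                                # initializer (Condition F)
    # Step 1: soft-thresholded means at tau_check, then the plug-in update (1.4)
    th1, th2 = thresholded_means(X, tau_check, lam1, lam2)
    tau_tilde = min(max(plugin_lse(X, th1, th2), 1), T - 1)
    # Step 2: recompute the means at tau_tilde, then the final plug-in update
    th1, th2 = thresholded_means(X, tau_tilde, lam1, lam2)
    tau_hat = plugin_lse(X, th1, th2)
    return tau_hat
```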
In the following we provide a precise description of the statistical performance of Algorithm 1, starting with a result (Theorem 4.1) that obtains the near optimal rate of convergence of τ̃ of Step 1 of Algorithm 1: under its stated conditions, τ̃ = τ̃(θ̂₁, θ̂₂) of Step 1 of Algorithm 1 satisfies the near optimal localization bounds (4.6) and (4.7), for the subgaussian and subexponential cases, respectively.
The result of Theorem 4.1 shows that τ̃ of Step 1 of Algorithm 1 satisfies near optimal bounds despite the algorithm initializing at a nearly arbitrary τ̌. The conclusion of this theorem is effectively the same as that of Theorem 2.1 and Theorem 2.2, with the distinction that here τ̃ is an implementable estimate, in comparison to Theorem 2.1, where the availability of nuisance estimates satisfying Condition C.1 was assumed. The following are two important remarks regarding Algorithm 1 and Theorem 4.1.
Remark 4.1 (On the subexponential bound). The bounds of Theorem 4.1 hold with probability 1 − o(1). The bound presented in (4.6) is chosen since it is required for the results to follow. The reason we bring this up is that, in the case where ξ = O(√s), it may be observed that (4.7) reduces to |τ̃ − τ⁰| ≤ c_uξ⁻²s log(p ∨ T), which is the same bound as that in the subgaussian case; i.e., in this case, where only a near optimal rate of convergence is of interest, the heavier tail of the subexponential distribution impacts the change point estimation neither through the rate assumptions (4.5) nor through the rate of the change point estimator itself.

Remark 4.2 (Boundary case of τ⁰ = 0 or T). It may be observed that Theorem 4.1 assumes the existence of a change point (Condition D), which is not required in Theorem 2.1. As discussed in Section 2, this distinction is not due to change point estimation itself, but because one requires at least one realization from both underlying distributions, before and after τ⁰, to obtain any estimate of the two nuisance parameters θ₁⁰ and θ₂⁰. If these mean parameters are known a priori, one may also estimate the boundary points with the squared loss itself. Nevertheless, an ℓ₀-norm regularization approach can be utilized to relax this assumption and include one boundary point, τ⁰ = T. This can be achieved by replacing Step 1 of Algorithm 1 with a regularized version, in which the squared loss is penalized for selecting an interior change point. Here γ is a tuning parameter. It can be observed that the resulting τ* can be equivalently represented as the estimator that selects the boundary T whenever the boundary squared loss is within γ of Q(τ̃, θ̂₁, θ̂₂), and equals τ̃ otherwise, where τ̃ is as in (4.4). This representation is more common in the change point literature; see, e.g., [19] and [40], where it is utilized to extend a single change point methodology to a multiple change point setting via variants of binary segmentation. The selection consistency pr(τ* = T) → 1 when τ⁰ = T yielded by this regularization can additionally be verified via conventional arguments. This is quite intuitive, since when τ⁰ = T, the mean parameter on both sides of any arbitrary cutoff τ is θ₁⁰. Thus, if γ is chosen as an upper bound on the residual noise, the boundary squared loss will be at most γ larger than that obtained with any value in {1, ..., (T − 1)}, in probability. A rigorous proof is omitted, since it is largely a reproduction of existing arguments from the literature.
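Under the equivalent representation described above (the exact penalized form being an assumption of this illustration), the boundary-inclusive variant can be sketched as follows, reusing `Q` and `plugin_lse` from the earlier sketch.

```python
def regularized_step1(X, theta1_hat, theta2_hat, gamma):
    """Sketch of the l0-regularized variant of Remark 4.2: keep the boundary
    tau* = T unless some interior split beats the boundary loss by more than gamma."""
    T = X.shape[0]
    tau_tilde = plugin_lse(X, theta1_hat, theta2_hat)     # interior minimizer of Q
    if Q(T, X, theta1_hat, theta2_hat) <= Q(tau_tilde, X, theta1_hat, theta2_hat) + gamma:
        return T
    return tau_tilde
```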
The construction of Algorithm 1 is modular, in the sense that, for it to yield an estimate τ̂ that is optimal in its rate of convergence, it does not require the estimator of Step 1 to be specifically the one currently chosen. Instead, all that is required from Step 1 is that it provide some estimate τ̃ satisfying the bound (4.6) of Theorem 4.1 with probability 1 − o(1). Consequently, one may instead modify Algorithm 1 to use any other near optimal estimator in Step 1. This is described below as Algorithm 2.
Step 1: Implement any estimator τ̃ from the literature that satisfies the near optimal bound (4.6) with probability 1 − o(1).
Step 2: Compute mean estimates θ̂₁ = θ̂₁(τ̃) and θ̂₂ = θ̂₂(τ̃), and perform the update τ̂ = τ̃(θ̂₁, θ̂₂). An example of an estimator that can be used in Step 1 of Algorithm 2 is that of [40], which obeys a tighter bound than that of Theorem 4.1 under similar rate conditions on the model parameters, and consequently also satisfies (4.6). However, this estimator is limited to the Gaussian setting. To the best of our knowledge, there is no estimator currently available in the literature that would serve as a replacement for Step 1 of Algorithm 1 while allowing high dimensionality under the assumed conditions of Theorem 4.1. The following result provides the optimal rate of convergence for the estimate τ̂ obtained from either Algorithm 1 or Algorithm 2, and shows that the limiting distributions of Section 3 remain valid for these feasible estimators.

Theorem 4.2. Suppose model (1.1) and let either one of the following two conditions hold.
(a) Conditions A(I) (subgaussian), B, D, and the relation (2.4) hold, and c_u T l_T ≥ s log(p ∨ T).
(b) Conditions A(II) (subexponential), B, D, and the relation (2.5) hold, and c_u T l_T ≥ s log²(p ∨ T).
Then the estimate τ̂ of Algorithm 1 or Algorithm 2 satisfies |τ̂ − τ_0| ≤ c_u ξ^{-2}, with probability 1 − o(1). Additionally suppose Conditions A′, E and (3.3) hold; then τ̂ of Algorithm 1 or Algorithm 2 obeys the limiting distributions of Theorem 3.1 and Theorem 3.2, in the vanishing and non-vanishing jump size regimes, respectively.
The result of Theorem 4.2 concludes the task put forth in the problem setup of Section 1, i.e., to obtain feasible estimators that achieve an optimal rate of convergence and possess well defined limiting distributions, which in turn allow inference on the change point parameter τ_0 despite high dimensionality of the underlying mean structure. Finally, we mention the computational simplicity of Algorithm 1. Computationally, all that is required is two computations of sample means and two one dimensional discrete minimizations, as illustrated in the sketch below. The only notable computational cost arises from the data based tuning process of choosing the regularizers λ_1, λ_2 for soft-thresholding. Effectively, this makes Algorithm 1 highly scalable and implementable on large scale data.
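To fix ideas, the following is a minimal Python sketch of the plug-in update common to Algorithms 1 and 2, assuming x is a T × p data array and tau_init is any near optimal preliminary estimate; the function names and the candidate grid are illustrative and not the exact implementation used in the simulations.

    import numpy as np

    def soft_threshold(v, lam):
        # coordinate-wise soft-thresholding operator
        return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

    def squared_loss(x, tau, theta1, theta2):
        # least squares loss Q(tau, theta1, theta2): squared error of the
        # piecewise constant mean fit with a break at tau
        return ((x[:tau] - theta1) ** 2).sum() + ((x[tau:] - theta2) ** 2).sum()

    def plug_in_update(x, tau_init, lam1, lam2):
        # two sample means, soft-thresholded, followed by a one dimensional
        # discrete minimization of Q over the interior candidate change points
        T = x.shape[0]
        theta1 = soft_threshold(x[:tau_init].mean(axis=0), lam1)
        theta2 = soft_threshold(x[tau_init:].mean(axis=0), lam2)
        losses = [squared_loss(x, tau, theta1, theta2) for tau in range(1, T)]
        return 1 + int(np.argmin(losses))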

Numerical results
This section evaluates the numerical performance of the estimation and inference results developed in the preceding sections. The first objective is to evaluate the estimation performance of the proposed Algorithm 1 (AL1) and to benchmark it against the estimator (WS) of [40]. In illustrating this objective we also compute the first step estimator (Step 1) of Algorithm 1, which, although not rate optimal, still yields a near optimal rate of convergence. The second objective is to evaluate the empirical inference performance of Algorithm 1 when utilized in conjunction with the result of Theorem 4.2. An auxiliary simulation examining uniformity of the proposed methodology over the mean parametric space is provided in Appendix D of the supplementary materials. In all simulations to follow, no underlying parameter is assumed to be known.
For the inference related designs of Simulations A(ii) and B(ii), we construct confidence intervals using both the limiting distributions of Theorem 3.1 and Theorem 3.2. Note that by design ξ is fixed throughout; hence the former limiting distribution is mis-specified for the considered cases. The significance level is set to α = 0.05. Confidence intervals are constructed as (τ̂ − ME, τ̂ + ME), where τ̂ is the output of Algorithm 1 and the margin of error (ME) is computed as ME = q^v_α σ̂²_∞/ξ̂² or ME = q^{nv}_α, based on the results of Theorem 3.1 and Theorem 3.2, respectively. Here q^v_α represents the (1 − α/2)th quantile of the argmax of the two sided negative drift Brownian motion of Theorem 3.1. This critical value is evaluated as q^v_α = 11.03 by using its distribution function provided in [41]. The (1 − α/2)th quantile q^{nv}_α of the argmax of the two sided negative drift random walk is computed as a Monte Carlo approximation by simulating 3000 realizations of this distribution. Recall that Theorem 3.2 necessitates a parametric assumption on the distribution of the projection of ε_t (Condition A′). As per the assumed data generating process of Simulation A, the distribution L is taken to be Gaussian for this design. For Simulation B(ii) we assume L to be Laplace distributed; this is clearly a mis-specification, since the Laplace distribution is not invariant under linear combinations, however it was empirically observed to be the closest parametric form amongst common subexponential distributions. For implementation of the confidence interval, we utilize plug-in estimates of σ²_∞ and ξ², pertinent computational details of which are provided in Appendix D of the supplementary materials.
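The following minimal Python sketch illustrates both interval constructions, assuming plug-in estimates xi2_hat and sigma2_inf_hat are available; the drift and increment scaling used to simulate the limiting random walk are an assumed illustrative parametrization and should be matched to the exact statement of Theorem 3.2.

    import numpy as np

    def ci_vanishing(tau_hat, xi2_hat, sigma2_inf_hat, q_v=11.03):
        # Theorem 3.1 based interval: q_v is the (1 - alpha/2)th quantile
        # of the argmax of the two sided negative drift Brownian motion
        # (11.03 at alpha = 0.05, from the distribution function in [41])
        me = q_v * sigma2_inf_hat / xi2_hat
        return tau_hat - me, tau_hat + me

    def ci_nonvanishing(tau_hat, xi2_hat, sigma2_inf_hat, alpha=0.05,
                        n_sim=3000, horizon=500, rng=None):
        # Theorem 3.2 based interval: the quantile of the argmax of a two
        # sided negative drift random walk is approximated by Monte Carlo;
        # the drift -xi^2/2 and Gaussian increments are an assumed
        # parametrization for illustration
        rng = np.random.default_rng() if rng is None else rng
        sd = np.sqrt(xi2_hat * sigma2_inf_hat)
        argmaxes = np.empty(n_sim)
        for b in range(n_sim):
            right = np.cumsum(-xi2_hat / 2 + sd * rng.standard_normal(horizon))
            left = np.cumsum(-xi2_hat / 2 + sd * rng.standard_normal(horizon))
            walk = np.concatenate([left[::-1], [0.0], right])  # index horizon is z = 0
            argmaxes[b] = np.argmax(walk) - horizon
        me = np.quantile(argmaxes, 1 - alpha / 2)  # (1 - alpha/2)th quantile
        return tau_hat - me, tau_hat + me

Here horizon should be taken large enough that the simulated walks attain their maxima well within the simulated range.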
Choice of tuning parameters: The regularizers λ_1, λ_2 used to obtain the soft-thresholded mean estimates in Step 1 and Step 2 are tuned via a BIC-type criterion. Specifically, we set λ_1 = λ_2 = λ and evaluate θ̂_1(λ) and θ̂_2(λ) over an equally spaced grid of twenty five values in the interval (0, 0.5). Upon letting Ŝ = {j; θ̂_{1j} ≠ 0} ∪ {j; θ̂_{2j} ≠ 0}, we evaluate the criterion BIC(λ, τ). For Step 1 of Algorithm 1 we set λ as the minimizer of BIC(λ, τ̃), and for Step 2 of Algorithm 1 we choose λ as the minimizer of BIC(λ, τ̂). In the context of the benchmark estimator of [40], in the absence of an author recommended tuning mechanism, we follow a similar approach to tune their estimator. Their estimator is implemented using the author provided R package InspectChangepoint [39] on a grid of twenty five values in order to obtain a sequence of estimated change points. Each estimated change point is then used to construct corresponding soft-thresholded mean estimates, which are tuned via the BIC criterion as above. Finally, the squared loss criterion is applied to choose the tuned estimate from amongst the pairs of estimated change points and corresponding estimated mean parameters. To report our results we present the following metrics. For the estimation results of Simulations A(i) and B(i) we report bias (|E(τ̂ − τ_0)|), root mean squared error (RMSE, E^{1/2}(τ̂ − τ_0)²), and time (average over replications of running time in seconds), computed based on 100 Monte Carlo replications. The reported computation time for Algorithm 1 includes all tuning undertaken for its computation, i.e., as it would be implemented in practice. For the benchmark estimator of [40], the reported computation time is that of repeating their estimation process over the chosen tuning grid of twenty five values and does not include the time taken thereafter to complete the tuning process described above. For the inference results of Simulations A(ii) and B(ii), we report coverage (relative frequency of the number of times τ_0 lies in the confidence interval) and the average margin of error (average over replications of the margin of error of each confidence interval), computed based on 500 Monte Carlo replications.
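Since the exact BIC display is not reproduced here, the sketch below substitutes a generic squared-loss-plus-support-penalty criterion as an assumed stand-in; the grid and the soft-thresholding step follow the description above.

    import numpy as np

    def soft_threshold(v, lam):
        return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

    def tune_lambda(x, tau, grid=None):
        # minimize a BIC-type criterion over twenty five grid values in (0, 0.5)
        T, p = x.shape
        grid = np.linspace(0.02, 0.5, 25) if grid is None else grid
        best_crit, best_lam = np.inf, None
        for lam in grid:
            th1 = soft_threshold(x[:tau].mean(axis=0), lam)
            th2 = soft_threshold(x[tau:].mean(axis=0), lam)
            s_hat = int(((th1 != 0) | (th2 != 0)).sum())  # |S_hat|, union of supports
            loss = ((x[:tau] - th1) ** 2).sum() + ((x[tau:] - th2) ** 2).sum()
            crit = loss + s_hat * np.log(max(p, T))  # assumed penalty of order |S_hat| log(p v T)
            if crit < best_crit:
                best_crit, best_lam = crit, lam
        return best_lam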
Partial results of the estimation simulations A(i) and B(i) and the inference simulations A(ii) and B(ii) are provided in Tables 1, 2, 3 and 4, respectively. Results of all remaining cases of these simulations are provided in Tables 5-16 in Appendix D of the supplementary materials. These results provide strong numerical support to our theoretical results regarding the estimation and limiting distribution behavior of the proposed Algorithm 1.
We begin with a discussion of the estimation results of Simulations A(i) and B(i) from Table 1 and Table 2, in the Gaussian and Laplace settings, respectively. The proposed Algorithm 1 is observed to perform uniformly better than the benchmark method WS over all considered model dimensions when the sampling period is large, T ∈ {350, 425}. When T ∈ {200, 275}, neither method is observed to be uniformly superior. In all results, the Step 1 estimator is observed to be the worst performer; this is not particularly surprising, since it is only near optimal in its rate of convergence.

Table 3. Simulation A(ii): inference using AL1 (τ̂) with τ_0 = 0.2·T, at significance level α = 0.05.

We next discuss the inference results of Table 3 and Table 4. The proposed Algorithm 1 and the accompanying inference methodology provide good control of the nominal significance level, with an expected deterioration observed for larger values of p and for values of τ_0 closer to the boundary of the parametric space (see also the results of Tables 11-16); importantly, the coverage is observed to catch up to the nominal level as T increases. Some observations from these results are as follows. The confidence intervals constructed using the non-vanishing regime result appear to provide nearly uniformly more precise coverage than those constructed using the vanishing regime result. The reader may recall that the vanishing jump size regime is mis-specified under the considered designs; however, this should not be the cause of the above observation, since under this mis-specification one would expect conservative coverage as opposed to the observed deficient coverage. Two speculative reasons could be at the root of this observation. First, there may be finite sample biases in the estimated jump size ξ̂ and the estimated asymptotic variance σ̂²_∞ that are inherent to regularized estimators. This reason is unlikely, however, since it would have impacted the non-vanishing regime confidence intervals equally, which is not observed to be the case. The more probable reason lies in the non-vanishing result itself and the manner in which its quantiles are computed. Specifically, since this result is based on a parametric distributional assumption, and moreover its quantiles are evaluated as a Monte Carlo approximation from realizations of the limiting distribution generated via the estimated jump size and asymptotic variance, it is consequently more adaptive, in a finite sample sense, to the specific data set realization under consideration. A final unusual observation is that despite the non-vanishing regime providing higher coverage, its average margin of error is smaller than that of the vanishing regime. A lower margin of error together with higher coverage is clearly not possible uniformly over all replications, since both intervals utilize the same estimate τ̂ of Algorithm 1. Instead, upon a careful examination of individual intervals, it was observed that the reason is again that the quantiles of the non-vanishing regime are more adaptive to the specific data set realization under consideration.

Acknowledgment
We thank three anonymous reviewers whose comments led to significant improvements to this manuscript.

A.1. Proofs of Section 2
To present the arguments of this section we require some additional notation. In all that follows, define η̂ = θ̂_1 − θ̂_2. Also, for any non-negative sequences 0 ≤ v_T ≤ u_T ≤ 1, define the collection

G(u_T, v_T) = {τ ∈ {0, 1, ..., T}; T v_T ≤ |τ − τ_0| ≤ T u_T}.

Finally, for any vectors θ_1, θ_2 ∈ R^p and any τ ∈ {0, ..., T}, define

U(τ, θ_1, θ_2) = Q(τ, θ_1, θ_2) − Q(τ_0, θ_1, θ_2),    (A.2)

where Q(·, ·, ·) is the least squares loss in (1.3), defined for any T ≥ 2. The proof of Theorem 2.1 shall rely on the following preliminary lemma, which provides a uniform lower bound on the expression U(τ, θ̂_1, θ̂_2) over the collection G(u_T, v_T).

Alternatively, suppose Condition A(II) (subexponential setting), B and C.1 hold.
Additionally assume that T ≥ 2 ∨ log(p ∨ T), and that the sequence v_T satisfies T v_T ≥ log(p ∨ T). Then, for any c_u > 8, the same bound (A.3) holds with probability at least 1 − 2 exp{−(√(c_u/2) − 2) log(p ∨ T)} − π_T.

Proof of Lemma A.1. We begin this proof with a few observations that shall be required to obtain the desired bound (A.3). Using Condition C.1 we have the relations (A.4), which hold with probability at least 1 − π_T; here the third inequality follows since the assumption ‖(θ̂_1)_{S_1^c}‖_1 ≤ 3‖(θ̂_1 − θ^0_1)_{S_1}‖_1 of Condition C.1 in turn implies a corresponding bound in terms of ξ. A similar argument yields (A.5), which holds with the same probability. Next consider the expression (A.6), which again holds with probability at least 1 − π_T. Here the first inequality is simply an algebraic manipulation, the second follows from the Cauchy-Schwarz inequality, and the third follows from Condition C.1, (A.4) and (A.5). The final inequality again follows since the constant c_{u1} > 0 in Condition C.1 is chosen to be small enough.
We now proceed to the main proof of the bound (A.3). Consider any τ ∈ G(u_T, v_T) and, without loss of generality, assume τ ≥ τ_0; the mirroring case of τ ≤ τ_0 can be proved using symmetrical arguments. Consider the expansion (A.7), which holds with probability at least 1 − π_T, where the final inequality follows by using (A.5). The uniform bound (A.3) now follows by substituting the uniform bound of Lemma A.6 for the term ‖Σ_{t=τ_0+1}^{τ} ε_t‖_∞ in (A.7), for the subgaussian and subexponential cases, under their respective assumptions.
Proof of Theorem 2.1. The proof of this result relies on a recursive argument on Lemma A.1, where the desired rate of convergence is obtained by a series of recursions, with the rate being sharpened at each step. We begin by considering any v_T > 0 and applying Lemma A.1 on the set G(1, v_T), with probability at least 1 − 2 exp{−c_{u1} log(p ∨ T)} − π_T, where c_{u1} = (c_u − 2) in the subgaussian case and c_{u1} = (√(c_u/2) − 2) in the subexponential setting. Note that the rate of convergence of τ̂ is sharpened at the second recursion in comparison to the first. Continuing these recursions by resetting u_T to the bound of the previous recursion and applying Lemma A.1, we obtain the bound for the m-th recursion, with probability at least 1 − 2 exp{−c_{u1} log(p ∨ T)} − π_T. Repeating these recursions an infinite number of times, and noting that a_∞ = Σ_{j=0}^{∞}(1/2^j) = 2 and b_∞ = Σ_{j=1}^{∞}(1/2^j) = 1, we obtain the desired bound, with probability at least 1 − 2 exp{−c_{u1} log(p ∨ T)} − π_T. Finally, note that despite the recursions in the above argument, the probability of the bound after every recursion is maintained at 1 − 2 exp{−c_{u1} log(p ∨ T)} − π_T. This follows since the probability statement of Lemma A.1 arises from the stochastic upper bounds of Lemma A.6, applied recursively with a tighter bound at each recursion; this yields a sequence of events such that the event at each recursion is a proper subset of the event at the previous recursion. This completes the proof of Part (i) of this theorem. The distinction between the bounds of Part (ii) (subexponential case) and Part (i) (subgaussian case) arises for the following reason. Recall that for the bound of Lemma A.1 to be valid in the subexponential case we require T v_T ≥ log(p ∨ T), which is in turn due to the same requirement in the corresponding subexponential part of Lemma A.6. Thus the recursions in the above argument can only be performed so long as the rate is slower than log(p ∨ T) in the subexponential case, thereby yielding the statement of Part (ii) of this theorem.
Proof of Theorem 2.2. We begin with a version of Lemma A.1 that is valid in this subexponential setting for any non-negative sequences 0 ≤ v_T ≤ u_T ≤ 1 (as opposed to the requirement T v_T ≥ log(p ∨ T) assumed in Lemma A.1). Proceeding identically as in Lemma A.1 and using the deviation bound of Lemma A.7 (instead of Lemma A.6) in (A.7), we obtain the corresponding uniform lower bound. Without loss of generality assume τ ≥ τ_0; the mirroring case of τ ≤ τ_0 can be proved using symmetrical arguments. Now following the recursive tightening argument of the proof of Theorem 2.1, we have for the m-th recursion that inf_{τ ∈ G(u_T, v*_T)} U(τ, θ̂_1, θ̂_2) > 0, and continuing these recursions an infinite number of times yields the stated result.

Lemma A.2. Suppose the setting of Lemma A.1, with Condition C.2 holding in place of Condition C.1. Then, for any 0 < a < 1, choosing c_a ≥ √(1/a), we have the following uniform lower bound.
Proof of Lemma A.2. The structure of this proof is largely similar to that of Lemma A.1, except that it requires a more careful analysis of a residual stochastic term in order to obtain the comparatively sharper bound, at the cost of a slightly weaker probability statement. First, using Condition C.2, we have the bound (A.10), which holds with probability at least 1 − π_T. Also using Condition C.2, we have the inequality (A.11), which holds with the same probability; this inequality follows by arguments identical to those in (A.6), which continue to hold here since Condition C.1 assumed in Lemma A.1 is weaker than Condition C.2 assumed here. Now consider any τ ∈ G(u_T, v_T) and, without loss of generality, assume τ ≥ τ_0. Following the arguments used to obtain (A.7), we arrive at (A.12), with probability at least 1 − π_T. The inequality follows from (A.11), and the final equality is obtained by a simple algebraic manipulation; however, it is the key step that yields the desired bound of this lemma. We now consider each of the residual stochastic terms in the expression (A.12). First, applying Lemma A.8 for any 0 < a < 1, with c_a ≥ √(1/a), we obtain (A.13), with probability at least 1 − a. The second stochastic term in (A.12) can be bounded as in (A.14), with probability at least 1 − 2 exp{−log(p ∨ T)}. The second inequality follows using the deviation bounds of Lemma A.6 and Lemma A.7 for the subgaussian and subexponential cases, respectively, together with the corresponding ℓ_1 error bound of (A.10). The third inequality and the probability statement follow by first choosing c_u = 3 (see Lemma A.6 and Lemma A.7), and then recalling that by assumption the constant c_{u1} of Condition C.2 is chosen small enough (c_{u1} ≤ 1/(32·3)). Substituting (A.13) and (A.14) into (A.12), we obtain the stated lower bound, with probability at least 1 − a − 2 exp{−log(p ∨ T)} − π_T. The mirroring case of τ ≤ τ_0 can be proved using symmetrical arguments. This completes the proof of this lemma.
Proof of Theorem 2.3. The proof of this result follows a recursive argument similar to that of Theorem 2.1, the distinction being that the recursions are made on Lemma A.2 instead of Lemma A.1. We begin by considering any v_T > 0 and applying Lemma A.2 on the set G(1, v_T), for any 0 < a < 1 and choosing c_a ≥ √(1/a). Continuing these recursions by resetting u_T to the bound of the previous recursion and applying Lemma A.2, we obtain the bound for the m-th recursion, with probability at least 1 − a − 2 exp{−log(p ∨ T)} − π_T. Repeating these recursions an infinite number of times, and noting that b_∞ = Σ_{j=0}^{∞}(1/2^j) = 2 and c_∞ = Σ_{j=1}^{∞}(1/2^j) = 1, we obtain the bound (A.17), with probability at least 1 − a − 2 exp{−log(p ∨ T)} − π_T. The statement of Theorem 2.3 follows directly from the bound (A.17) upon recalling that φ² ≤ c_u σ², from the discussion after Condition B. As seen earlier, despite the recursions in the above argument, the probability of the bound after every recursion is maintained at 1 − a − 2 exp{−log(p ∨ T)} − π_T, by the same reasoning as in Theorem 2.1. This completes the proof of Part (i) of this theorem.

A.2. Proofs of Section 3
As the reader may have observed, a change of notation has been carried out for the results of this section: they are presented in the more conventional argmax notation instead of the argmin notation of the problem setup in Section 1. This is purely a notational change and all results can equivalently be stated in the argmin language. Accordingly, we define the following versions. Let U(τ, θ_1, θ_2) be as in (A.2) and consider

C(τ, θ_1, θ_2) = −T · U(τ, θ_1, θ_2).

The multiplication of U by the sampling period T is only meant for notational convenience later on. Then we can re-express the change point estimator τ̂(θ_1, θ_2) defined in (1.4) as

τ̂(θ_1, θ_2) = arg max_{τ ∈ {0,...,T}} C(τ, θ_1, θ_2).

The proofs of Theorem 3.1 and Theorem 3.2 below are applications of the Argmax Theorem (reproduced as Theorem B.2 in Appendix B); the arguments are largely an exercise in verifying the requirements of this theorem.
Proof of Theorem 3.1. Let the underlying indexing metric space be R, and consider the two cases of known and unknown mean parameters. Case I (θ^0_1 and θ^0_2 known): The following is the list of requirements of the Argmax theorem that need verification in this case (see page 288 of [33]).
where the weak convergence follows from the functional central limit theorem.
Case II (θ^0_1 and θ^0_2 unknown): In this case the applicability of the argmax theorem requires verification of the following conditions.
The third equality follows from Condition A and the convergence in distribution follows from Condition E(ii). Repeating the same argument for ζ < 0 yields the corresponding convergence. An application of the Argmax theorem now yields (τ* − τ_0) ⇒ arg max_{ζ ∈ Z} C_∞(ζ), which completes the proof of this case.
Case II (θ^0_1 and θ^0_2 unknown): In this case the applicability of the argmax theorem requires verification of the following (see, e.g., [16]). Additionally, ω_∞ ≥ 0 from the construction of C_∞(ζ). Thus we have 0 ≤ ω_∞ < ∞, a.s., which directly implies that when ω_∞ is well defined (unique) it must be tight. To show that ω_∞ is unique, note that since by assumption (Condition A′) the increments are continuously distributed and supported on R, the maximum max C_∞(ζ) is continuously distributed on (0, ∞), with some additional probability mass at the singleton zero. Hence the probability of max C_∞(ζ) attaining any two identical values is zero, and consequently ω_∞ is unique a.s.

Lemma A.4. Let C(τ, θ_1, θ_2) be as defined in (A.17) and suppose Conditions A, B and C.2 hold. Additionally assume that the sequence r_T of Condition C.2 satisfies (3.2). Then, for any c_u > 0, we have the following uniform bound.

Proof of Lemma A.4. Proceeding similarly to (A.10), we have under Condition C.2 and (3.2) the bound (A.20), which holds with probability at least 1 − π_T. Consider any τ ≥ τ_0 and define the terms R_1 and R_2 appearing in the algebraic expansion (A.21). Next we provide uniform bounds on the terms R_1 and R_2 of (A.21). Consider first the bound on R_1, which holds with probability 1 − o(1); here the second inequality follows from Lemma A.6 and Lemma A.7, and the final equality follows from an application of (A.20). Next consider the term R_2 of (A.21).
The corresponding bound for the mirroring case of {τ ≤ τ_0} can be obtained via symmetrical arguments. This completes the proof of this lemma.

A.3. Proofs of Section 4
The proof of Theorem 4.2 first requires some preliminary work; in particular, we first need to examine the behavior of the estimates θ̂_1(τ) and θ̂_2(τ) uniformly over a collection of values of τ. This is provided in the following theorem.
Theorem A.1. Suppose model (1.1). Let 0 ≤ u_T ≤ 1 be any non-negative sequence, let ψ = ‖η^0‖_∞, and fix any constants c_u, c_{u1} > 0. Additionally, let either one of the following two conditions hold.
The following bound is satisfied. Here π_T = 2 exp{−(c_{u1} − 2) log(p ∨ T)} under condition (a), and π_T = exp{−(c_{u2} − 2) log(p ∨ T)} under condition (b).

Proof of Theorem A.1. Consider any τ ∈ G(u_T, 0) satisfying τ ≥ c_u T l_T and, without loss of generality, assume τ ≥ τ_0; the mirroring case of τ ≤ τ_0 can be proved using symmetrical arguments. An algebraic rearrangement of the elementary inequality satisfied by θ̂_1(τ) as a penalized minimizer of the squared loss around the sample mean x̄_{(0:τ]} of x_t, t = 1, ..., τ, yields the basic inequality (A.25). Now using the bound of Lemma A.9, we have the corresponding deviation bound with probability at least 1 − 2 exp{−(c_{u1} − 2) log(p ∨ T)}, or 1 − exp{−(c_{u2} − 2) log(p ∨ T)}, under the subgaussian or subexponential setting, respectively. Consequently, upon choosing λ accordingly and substituting these bounds in (A.25), we obtain (A.26), under the subgaussian or subexponential setting, respectively. Now choosing λ_1 = 2λ leads to ‖(θ̂_1(τ))_{S_1^c}‖_1 ≤ 3‖(θ̂_1(τ) − θ^0_1)_{S_1}‖_1, which proves part (i) of this theorem. From the inequality (A.26) we also obtain the corresponding ℓ_1 bound, which directly implies the stated bound on θ̂_1(τ) − θ^0_1. To finish the proof of this part, recall that the only stochastic bound used here is the uniform bound over G(u_T, 0) of Lemma A.9; consequently, the final bound also holds uniformly over the same collection. Similar results for θ̂_2(τ) follow analogously. This result can alternatively be proved using the properties of the soft-thresholding operator k_λ(·), by building uniform versions of arguments such as those in [30] or [25].
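For reference, assuming the operator k_λ(·) above is the standard coordinate-wise soft-thresholding map, it admits the closed form

(k_λ(v))_j = sign(v_j)(|v_j| − λ)_+,   j = 1, ..., p,

for any v ∈ R^p and λ ≥ 0; equivalently, k_λ(v) is the unique minimizer over x ∈ R^p of (1/2)‖x − v‖_2² + λ‖x‖_1, which is the sense in which the soft-thresholded mean estimates arise from a penalized squared loss.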
Lemma A.5. Suppose model (1.1) and assume the following, for an appropriately chosen small enough constant c_{u1} > 0. Additionally, let either one of the following two conditions hold.
The following bound is satisfied, and applying Theorem A.1 we obtain the following two results, which hold with probability 1 − o(1) for both the subgaussian and subexponential cases.

As c decreases, the confidence intervals adapt accordingly (through the margin of error), which is then evidenced by the proper control on coverage in the sub-interval c ∈ (0.5, 2). The methodology appears to break down in the sub-interval (0.25, 0.5). This is plausibly due to the jump size becoming smaller and moving beyond the detection limit, i.e., at these smaller values of c, the jump size is ξ = c‖θ^0_1 − θ^0_2‖_2 and may no longer preserve the relation (2.4), thus leading to a breakdown of the theoretical results supporting the methodology and thereby to the observations of Figure 3 and Figure 4. In practice, one may perform a detection test for the existence of a change point prior to implementing the proposed inference methodology; a positive detection will point toward the jump size being above the detection limit. However, there is still a gap here: while the detection limit on the jump size scales as approximately (s log p/T)^{1/2}, our inference results begin their validity at the scaling s log p/T^{1/2}.