Optimal multiple change-point detection for high-dimensional data

This manuscript makes two contributions to the field of change-point detection. In a general change-point setting, we provide a generic algorithm for aggregating local homogeneity tests into an estimator of change-points in a time series. Interestingly, we establish that the error rates of the collection of tests directly translate into detection properties of the change-point estimator. This generic scheme is then applied to various problems including covariance change-point detection, nonparametric change-point detection and sparse multivariate mean change-point detection. For the latter, we derive minimax optimal rates that are adaptive to the unknown sparsity and to the distance between change-points when the noise is Gaussian. For sub-Gaussian noise, we introduce a variant that is optimal in almost all sparsity regimes.


Introduction
Change-point detection has a long history since the seminal work of Wald [39], which led to flourishing lines of research (see [31,36] for recent surveys). Earlier contributions focused on the problems of detecting and localizing change-points in a univariate time series. Spurred by applications in genomics [32] and finance, there has been a recent trend in the literature towards the analysis of more complex time series, for instance valued in a high-dimensional linear space [21] or even in a non-Euclidean space [8].
In this work, we study high-dimensional time series whose mean may change on a possibly small number of coordinates. See the introduction of [46] for an account of possible applications and practical motivations. In particular, we build a procedure which is able to detect and localize change-points under minimal assumptions on the height of these change-points. Along the way towards this optimal procedure, we define and analyze a scheme for general change-point problems that aggregates a collection of local tests into an estimator of change-points. This generic scheme is of independent interest and readily yields optimal change-point procedures in other complex settings such as covariance change-point problems or nonparametric change-point problems. In this introduction, we first describe this generic scheme before turning to our results in high-dimensional sparse change-point detection and finally discussing other applications.

General change-point setting
In the most general form of a change-point problem, we consider a random sequence Y = (y_1, y_2, ..., y_n) in some measurable space Y^n and, for t = 1, ..., n, we write P_t for the marginal distribution of y_t. We are also given a functional Γ mapping the probability distribution P_t to some space V. Then, the purpose of change-point detection is to detect changes in the sequence (Γ(P_1), Γ(P_2), ..., Γ(P_n)) in V^n and to estimate the positions of these changes. This setting is very general and does not even require that the random variables (y_t) be independent.
Let us briefly explain how this general framework encompasses most offline change-point detection problems. In the Gaussian mean univariate change-point setting, we have Y = R, the distribution P_t corresponds to the normal distribution with mean θ_t ∈ R and variance σ², and Γ(P_t) = θ_t. In the (heteroscedastic) mean univariate change-point problem, the distribution P_t is not necessarily Gaussian and, in particular, the variance of y_t is allowed to vary with t. Still, one is only interested in detecting variations of Γ(P_t) = ∫ x dP_t(x) = E[y_t]. By contrast, in the variance univariate change-point problem, one focuses on changes in the variance of y_t. This can be done by taking Γ(P_t) = ∫ x² dP_t(x) − [∫ x dP_t(x)]² = Var(y_t). If one is interested in possibly nonparametric changes in the distributions, then the functional Γ is simply taken to be the identity map. In semiparametric quantile change-point detection [22], the univariate distributions P_t can be arbitrary whereas Γ(P_t) is a quantile of P_t.
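To make the role of the functional Γ concrete, here is a small numerical illustration (ours, not part of the manuscript): the same series carries a mean change and a variance change, and each choice of Γ "sees" only one of them. The segment boundaries and parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# A univariate series with a mean shift at t = 50 and a variance shift at t = 100.
y = np.concatenate([
    rng.normal(0.0, 1.0, 50),   # mean 0, variance 1
    rng.normal(3.0, 1.0, 50),   # mean shifts to 3
    rng.normal(3.0, 4.0, 50),   # standard deviation shifts to 4
])

def gamma_mean(segment):
    """Gamma(P_t) = E[y_t], estimated on a homogeneous segment."""
    return segment.mean()

def gamma_var(segment):
    """Gamma(P_t) = Var(y_t), estimated on a homogeneous segment."""
    return segment.var()

# The mean functional reacts only to the first change, the variance
# functional only to the second.
segments = [y[:50], y[50:100], y[100:]]
means = [gamma_mean(s) for s in segments]
variances = [gamma_var(s) for s in segments]
```

With Γ the identity map, both changes would register, since the full distribution changes at each break.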

Desirable Guarantees of an estimator.
Before describing the generic scheme for estimating τ , let us first formalize the desired properties of a good change-point procedure. Informally, the primary objectives are to detect most if not all change-points while estimating no (or at least very few) spurious change-points.
Regarding the latter objective, it is usually required that the number of change-points K is not overestimated by τ̂. Here, we require a slightly stronger local property introduced in [38]. An estimator τ̂ of size K̂ is said to detect no spurious change-points (NoSp) if (i) for each change-point τ_k, there is at most one estimated change-point τ̂_l in the interval [τ_k − (τ_k − τ_{k−1})/2, τ_k + (τ_{k+1} − τ_k)/2], and (ii) no change-point is estimated near the boundaries of the time series. In other words, (NoSp) requires that, on each such sub-interval, the number of change-points is not overestimated. Let us now formalize the objective of detecting the change-points. In this work, we consider, as in [38], realistic settings where some change-points are so close, or their heights so small, that they are impossible to detect. As a consequence, we can only hope to detect the subset of significant change-points. In what follows, we define a subset K* ⊂ [K] of change-point indices that correspond to significant change-points. Obviously, the significance of a particular change-point is relative to the problem under consideration (data distribution, nature of change-points) and its definition is problem dependent. As an example, we define in the next subsection the suitable notions of energy and significance of a change-point in the mean multivariate change-point setting. In Section 6, we formalize this notion for covariance and univariate nonparametric change-point problems. In light of this discussion, the second guarantee we aim for is to detect all significant change-points. A change-point τ_k is said to be detected if there is at least one estimated change-point τ̂_l in the interval [τ_k − (τ_k − τ_{k−1})/2, τ_k + (τ_{k+1} − τ_k)/2]. Equivalently, this means that at least one of the estimated change-points is closer to τ_k than to any other true change-point.
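The detection window around a true change-point is easy to code. The helpers below are a hypothetical sketch of the two definitions above, with boundary change-points passed explicitly as the neighbors `tau_prev` and `tau_next`:

```python
def is_detected(tau_k, tau_prev, tau_next, estimates):
    """tau_k is detected if some estimate falls in
    [tau_k - (tau_k - tau_prev)/2, tau_k + (tau_next - tau_k)/2],
    i.e. is closer to tau_k than to any other true change-point."""
    lo = tau_k - (tau_k - tau_prev) / 2
    hi = tau_k + (tau_next - tau_k) / 2
    return any(lo <= t <= hi for t in estimates)

def count_in_window(tau_k, tau_prev, tau_next, estimates):
    """Number of estimates attributed to tau_k; the (NoSp) property
    requires this count to be at most one for every true change-point."""
    lo = tau_k - (tau_k - tau_prev) / 2
    hi = tau_k + (tau_next - tau_k) / 2
    return sum(lo <= t <= hi for t in estimates)

# True change-points at 30 and 60 in a series of length 100 (boundaries 1 and 100):
estimates = [29, 95]
# 29 lies in the window of tau = 30, so that change-point is detected;
# 95 lies outside the window [45, 80] of tau = 60, so that one is missed.
```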
Aside from the (NoSp) and detection properties, one may additionally aim at localizing the change-points as well as possible; see the discussions in [41]. Given a specific change-point τ_k detected by an estimator τ̂, its localization error is defined by d_{H,1}(τ̂, τ_k) = min_{1≤l≤K̂} |τ̂_l − τ_k|, which is the smallest distance between τ_k and one of the estimated change-points. While this work mainly focuses on the detection problem, we shall also provide localization bounds along the way.

A generic roadmap for change-point detection.
In this manuscript, our first contribution is a generic procedure for aggregating a collection of tests into an estimator τ̂ of τ. For two positive integers (l, r), we consider the time interval [l − r, l + r). Suppose we are given a collection G of such couples (l, r). For each (l, r) ∈ G, we are also given a homogeneity test T_{l,r} of the null hypothesis H_0: {(Γ(P_t)) is constant over the segment [l − r, l + r)}. This hypothesis is equivalent to the absence of any change-point in (l − r, l + r). Given such a collection of homogeneity tests (T_{l,r}), (l, r) ∈ G, we build in this manuscript an estimator τ̂ that satisfies the following properties. If the multiple testing procedure does not reject any true null hypothesis (no false positives), then τ̂ does not estimate any spurious change-point, that is, it satisfies (NoSp). Furthermore, any change-point τ_k that is detected by some test T_{τ̃_k, r̃_k}, where τ̃_k is close enough to τ_k and r̃_k is small enough, is detected by the estimator τ̂. In other words, we establish a completely generic result that translates properties of the multiple testing procedure into detection properties. Thus, the construction of a change-point procedure boils down to building a suitable multiple testing procedure (T_{l,r}), (l, r) ∈ G, whose family-wise error rate (FWER) is controlled, while being able to detect all the significant change-points. In turn, this allows us to reduce the problem of change-point detection under a minimal distance between change-points to the well-established field of minimax testing.

Related Work and possible applications.
In recent years, there has been growing interest in extending univariate mean change-point procedures such as wild binary segmentation (WBS) [14] to other problems such as covariance change-points [40], network change-points [41], or nonparametric change-points [33]. For each of these problems (and for others), it turns out that the general ideas of WBS can be instantiated. However, for each setting, the proofs need to be fully adapted in a case-by-case manner. Besides, the resulting procedures are only optimal up to logarithmic terms.
In contrast, it is quite straightforward to adapt our generic procedure to any new setting once suitable homogeneity multiple tests have been crafted. As the most prominent example, we consider sparse high-dimensional mean change-point detection and establish the optimality of our procedure; see the next subsection for details. In Section 6, we also handle the covariance change-point detection and the univariate nonparametric change-point detection problems. In each case, we pinpoint the first tight minimal conditions for detection.
Besides, we could apply our strategy to other problems, such as changes in auto-regressive models [43], changes in the inverse covariance matrix of y_t [17,24], or changes in a high-dimensional regression model [34]. All such change-point problems can be addressed through the construction and careful analysis of two-sample tests for auto-regressive models, inverse covariance matrices, and linear regression models respectively. Similarly, we can build kernel change-point procedures [1,16] from kernel two-sample tests [18].

Sparse Multivariate Change-point Setting
As explained above, the primary application of our generic scheme is the multivariate mean change-point detection problem with sparse variations, where one observes a time series Y = (y_1, ..., y_n) ∈ R^{p×n} with unknown means Θ = (θ_1, ..., θ_n) ∈ R^{p×n}, so that we have the decomposition Y = Θ + ε, where the noise matrix ε = (ε_1, ..., ε_n) is made of independent and mean-zero random vectors of size p. In this manuscript, we make two distributional assumptions on the noise. Either we suppose that all random vectors ε_i follow independent normal distributions with covariance σ² I_p (see Section 3), or that the components of ε_i follow independent sub-Gaussian distributions with variance σ² (see Section 4). In either case, we assume that σ² is known.
Here, we are interested in the variations of the mean vector θ_t so that, relying on the formalism of the previous subsection, we have Γ(P_t) = θ_t. Considering the vector of change-points τ = (τ_1, ..., τ_K), we can define K + 1 vectors µ_0, ..., µ_K in R^p satisfying µ_k ≠ µ_{k+1} for all k = 0, ..., K − 1, such that θ_t = µ_k for all t ∈ [τ_k, τ_{k+1}), with the conventions τ_0 = 1 and τ_{K+1} = n + 1. Equivalently, µ_k is the constant mean of y over the interval [τ_k, τ_{k+1} − 1]. The difference µ_k − µ_{k−1} in R^p measures the variation of Θ at the change-point τ_k and can possibly have many null coordinates. In this possibly sparse multi-dimensional setting, the significance of a change-point is measured through three quantities ∆_k, r_k, and s_k. First, the height ∆_k of the change-point τ_k is defined as the Euclidean norm of the signal difference, ∆_k = ‖µ_k − µ_{k−1}‖_2. The length r_k of the change-point τ_k is the minimal distance from τ_k to another change-point, τ_{k−1} or τ_{k+1}. More precisely, r_k = min(τ_{k+1} − τ_k, τ_k − τ_{k−1}). As a simple example, Figure 1 depicts a one-dimensional piece-wise constant sequence Θ with 3 change-points, illustrating the setting presented above. In the univariate change-point literature (e.g. [7,14,15]), the height and the length of a change-point characterize its significance. In the multivariate setting, where the change-points can be sparse, meaning that the number of non-null coordinates of the vector µ_k − µ_{k−1} is possibly small, one also considers the sparsity s_k of the change-point τ_k, defined by s_k = ‖µ_k − µ_{k−1}‖_0, where, for any v ∈ R^p, ‖v‖_0 = Σ_{1≤i≤p} 1{v_i ≠ 0}.

Figure 1: An example of a piece-wise constant sequence Θ with 3 change-points and p = 1.
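The three quantities ∆_k, r_k, and s_k are straightforward to compute from a piecewise-constant mean matrix. The helper below is our own illustrative sketch (function and variable names are not from the manuscript):

```python
import numpy as np

def changepoint_summary(theta, taus, n):
    """For each change-point tau_k (1-indexed positions, theta of shape (p, n)),
    return the triple (height Delta_k, length r_k, sparsity s_k).
    Boundaries use the conventions tau_0 = 1 and tau_{K+1} = n + 1."""
    bounds = [1] + list(taus) + [n + 1]
    out = []
    for k in range(1, len(bounds) - 1):
        # mu_k - mu_{k-1}: mean just after the break minus mean just before
        diff = theta[:, bounds[k] - 1] - theta[:, bounds[k] - 2]
        delta = float(np.linalg.norm(diff))                            # height
        r = min(bounds[k] - bounds[k - 1], bounds[k + 1] - bounds[k])  # length
        s = int(np.count_nonzero(diff))                                # sparsity
        out.append((delta, r, s))
    return out

# p = 4, n = 12, one change-point at t = 7 touching only 2 coordinates.
theta = np.zeros((4, 12))
theta[0, 6:] = 3.0
theta[1, 6:] = 4.0
summary = changepoint_summary(theta, [7], 12)  # height 5, length 6, sparsity 2
```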

Two-sample tests and CUSUM statistics
Our objective is to detect and recover the positions (τ_k)_{k≤K} under minimal conditions on the change-point height ∆_k, change-point length r_k and sparsity s_k. In view of the generic change-point procedure discussed in the previous subsection, this mainly boils down to building suitable tests of the hypothesis {Θ is constant over [l − r, l + r)} versus {Θ is not constant on this segment}. Following the literature on binary and wild binary segmentation, we consider the CUSUM statistic C_{l,r}(Y) = (1/σ) √(r/2) ( ȳ_{[l,l+r)} − ȳ_{[l−r,l)} ), where ȳ_I denotes the empirical mean of the y_t over the interval I. This statistic computes the normalized difference of the empirical means of y_t on [l − r, l) and [l, l + r).
If the noise is Gaussian and if Θ is constant on [l − r, l + r), then C_{l,r}(Y) simply follows a standard p-dimensional normal distribution. To simplify, consider a specific instance of our testing problem where we want to test {Θ is constant over [l − r, l + r)} versus {Θ contains exactly one change-point, at l, on the segment [l − r, l + r)}. This corresponds to a two-sample mean testing problem, for which the CUSUM statistic C_{l,r}(Y) is a sufficient statistic if the noise is Gaussian. Then, given C_{l,r}(Y), one wants to test whether its expectation is 0 (no change-point in [l − r, l + r)) or non-zero and s-sparse for some unknown s. This classical detection problem is well understood [11], and it is well known that a combination of a χ²-type test with a higher-criticism-type test is optimal. Here, the challenge stems from the fact that we do not want to perform a single such test, but a large collection of tests over the couples (l, r) ∈ G.
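A minimal CUSUM sketch, with a noiseless sanity check. The normalization below, √(r/2) times the difference of empirical means divided by σ, is our assumption; it is the one consistent with the standard p-dimensional normal null distribution stated above.

```python
import numpy as np

def cusum(Y, l, r, sigma=1.0):
    """CUSUM statistic C_{l,r}(Y): difference of the empirical means of
    y_t over [l, l + r) and [l - r, l), scaled by sqrt(r / 2) / sigma
    (our normalization). Under a constant Gaussian mean with covariance
    sigma^2 I_p, the result is a standard p-dimensional normal vector.
    Y has shape (p, n); l is 1-indexed as in the text."""
    left = Y[:, l - r - 1 : l - 1].mean(axis=1)
    right = Y[:, l - 1 : l + r - 1].mean(axis=1)
    return np.sqrt(r / 2.0) * (right - left) / sigma

# Noiseless sanity check: a jump of height h in one coordinate, centered
# in the window, yields a CUSUM entry equal to sqrt(r / 2) * h.
p, n = 3, 20
Theta = np.zeros((p, n))
Theta[0, 10:] = 2.0          # jump of height 2 at t = 11, first coordinate
C = cusum(Theta, l=11, r=8)  # C[0] = sqrt(8 / 2) * 2 = 4, other coordinates 0
```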

Our contribution
As usual in the mean change-point literature, we consider the energy r_k ∆_k² of the change-point τ_k. Up to a possible factor in [1/2, 1], r_k ∆_k² is the squared distance between Θ and its projection onto the space of signals with change-points at (τ_1, ..., τ_{k−1}, τ_{k+1}, ..., τ_K); see e.g. [38] for a discussion in the univariate setting. In other words, the energy r_k ∆_k² characterizes the significance of the change-point τ_k. In Section 3, we introduce a multi-scale change-point detection procedure detecting any change-point τ_k whose energy is higher, up to a numerical constant, than σ² s_k log(1 + √(p log(n/r_k))/s_k) + σ² log(n/r_k). This result is valid for arbitrary length r_k and sparsity s_k, and does not require the knowledge of these two quantities. In summary, our procedure does not estimate any spurious change-point (NoSp) and detects all the change-points whose energy is higher than the latter threshold. In Section 5, we establish that, as soon as the unknown number K of change-points is larger than 1, the condition σ² s_k log(1 + √(p log(n/r_k))/s_k) + σ² log(n/r_k) on the energy is tight with respect to n, p, r_k and s_k, in the sense that no procedure achieving (NoSp) is able to detect with high probability a change-point whose energy is smaller (up to some constant) than the latter threshold. In Section 4, we consider the more general setting where the noise is L-sub-Gaussian with known variance, and we establish a result similar to the Gaussian case up to a logarithmic loss in some regimes. Finally, we illustrate in Section 8 the behavior of our procedure on numerical experiments.
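To see how the energy threshold interpolates between regimes, the following small computation is helpful. It is our own illustration; in particular, the exact placement of the logarithms is our reading of the displayed formula, namely a sparse term s·log(1 + √(p·log(n/r))/s) plus a scale term log(n/r).

```python
import math

def energy_threshold(n, p, r, s, sigma2=1.0):
    """Energy level r * Delta^2 above which a change-point of sparsity s
    and length r is detectable, up to a universal constant (our reading
    of the threshold stated in the text)."""
    gamma = math.log(n / r)
    return sigma2 * (s * math.log1p(math.sqrt(p * gamma) / s) + gamma)

# The sparse term saturates near sqrt(p * log(n/r)) once s is large (the
# dense regime), while for s = 1 the threshold is only logarithmic in p.
# Smaller scales r pay a larger log(n/r) factor.
n, p, r = 10_000, 1_000, 16
dense = energy_threshold(n, p, r, s=p)
sparse = energy_threshold(n, p, r, s=1)
```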

Related work
For dense change-points (s_k = p) but with unknown noise covariance, Wang et al. [45] (see also [44]) study the behavior of a procedure based on U-statistics of the CUSUM. Jirak [21] and Yu and Chen [48] introduce binary segmentation procedures based on the ℓ_∞ norm of the CUSUMs. Although these works explicitly characterize the asymptotic distribution of the test statistics and, for some of them, allow temporal dependencies in the data, the corresponding energy requirements for change-point detection are either not studied or turn out to be suboptimal.
Closest to our work, Chan and Chen [5] study a bottom-up approach to detect change-points of a Gaussian multivariate time series in an asymptotic setting. More specifically, the authors consider an asymptotic regime where the size of the time series is exponential in the dimension: n = e^{p^ζ} with ζ ∈ (0, 1). The authors also assume that the number K of change-points remains finite when n, p → ∞ and that the minimal sparsity s of these change-points is polynomial in p. In this specific regime, their procedures provably recover change-points under a near-minimal (up to logarithmic factors with respect to n) condition on the energy. In contrast, our results are non-asymptotic and tight for all scalings with respect to n and p, allow for an arbitrarily large number K of change-points, and allow for the presence of non-significant change-points. In the same specific asymptotic setting, [20] introduce a so-called score test statistic used in a change-point detection procedure which is shown to achieve the same performance as [5] in the Gaussian model but also handles Poisson observations.
Recently, Liu et al. [28] have characterized the optimal detection rate of a possibly sparse change-point in the specific case where there is at most one change-point, but their rates become significantly slower in the multiple change-point setting. See also [12] and [9] for earlier results. Wang and Samworth [46] have proposed the INSPECT method, based on sparse projections, to handle sparse change-points, but INSPECT provably detects the change-points only under a strong assumption on the energy; see Section 3 for a precise comparison.
In the univariate setting (p = 1), minimal energy requirements for change-point detection are well understood [13,15,38,42] and are nearly achieved by a wide range of procedures, including penalized least-squares and multi-scale test methods.

A Generic algorithm for multiscale change-point detection on a grid
In this section, we study the problem of change-point detection in the general setting defined in Section 1.1. We introduce a bottom-up algorithm that aggregates a collection of homogeneity tests performed at many positions and scales of the data. Then, we establish that, under some conditions on these tests, the procedure detects significant change-points.

Grid and multiscale statistics
Since our purpose is to translate a collection of local tests T = (T_{l,r})_{(l,r)∈G} indexed by a grid G into a change-point detection procedure, we first need to formalize what we mean by a grid. Henceforth, we call a grid G of [n] a collection of locations and scales, where a scale r is a positive integer smaller than or equal to n/2 and a location l is an integer between r + 1 and n − r + 1. The couple (l, r) refers to the segment [l − r, l + r) centered at l and with radius r. Formally, G is therefore a subset of J_n = {(l, r) : r = 1, ..., ⌊n/2⌋ and l = r + 1, ..., n − r + 1}. Given a grid G, we call R its collection of scales, that is R = {r : ∃ l s.t. (l, r) ∈ G}. Finally, for a scale r ∈ R, D_r stands for the corresponding collection of locations, that is D_r = {l : (l, r) ∈ G}. Although we do not make any assumption on the grid G for the time being, we will mainly consider two specific grids in this section: the complete grid G_F = J_n and the dyadic grid G_D defined by R = {1, 2, 4, ..., 2^{⌊log₂(n)⌋−1}}, D_1 = [2, n], and D_r given by (5) for the larger scales. See Figure 2 for a visual representation of the dyadic grid. At some points, we shall also mention a-adic grids G_a. For any a ∈ (0, 1), G_a is defined by R = {1, a^{−1}, a^{−2}, ..., a^{1−⌈log(n)/log(a^{−1})⌉}} and D_r as in (5). Interestingly, the cardinality of the dyadic grid, or more generally of an a-adic grid, is of order O(n), whereas that of the complete grid G_F is quadratic in n. Grids are reminiscent of the c-normal systems of intervals introduced by Nemirovsky [30] (see also [27] for a definition), although our definition allows for non-necessarily normal intervals.
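The cardinality gap between the complete and dyadic grids is easy to check numerically. The sketch below is ours; in particular, since Equation (5) for the dyadic locations D_r is not reproduced here, we assume a stride-r lattice of locations, which is one natural choice satisfying the approximation property (App) introduced below.

```python
def full_grid(n):
    """Complete grid G_F = J_n: all couples (l, r) with 1 <= r <= n/2
    and r + 1 <= l <= n - r + 1."""
    return [(l, r) for r in range(1, n // 2 + 1)
                   for l in range(r + 1, n - r + 2)]

def dyadic_grid(n):
    """Dyadic grid: scales r = 1, 2, 4, ..., 2^(floor(log2 n) - 1).
    Locations on a stride-r lattice (an assumption standing in for (5))."""
    grid, r = [], 1
    while 2 * r <= n:
        grid += [(l, r) for l in range(r + 1, n - r + 2, r)]
        r *= 2
    return grid

def satisfies_app(grid, n):
    """(App): every admissible l is within r - 1 of some location of D_r."""
    scales = {r for _, r in grid}
    for r in scales:
        D_r = [l for l, rr in grid if rr == r]
        if any(min(abs(l - lp) for lp in D_r) > r - 1
               for l in range(r + 1, n - r + 2)):
            return False
    return True

n = 64
GF, GD = full_grid(n), dyadic_grid(n)
# len(GF) is quadratic in n while len(GD) is linear, and both satisfy (App).
```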
Given a fixed grid G, a multiscale test is simply a collection of tests T = (T_{l,r})_{(l,r)∈G} indexed by the elements of G, which amounts to testing, at all scales r ∈ R and all locations l ∈ D_r, whether the functional Γ(P_t) is constant over the segment [l − r, l + r). Equivalently, T_{l,r} tests whether there exists a change-point in [l − r + 1, l + r − 1].

From a multiscale test to a change-point detection procedure
Our purpose is to introduce a generic procedure that translates a multiscale test into a vector of change-points. Intuitively, if, for some (l, r) ∈ G, we have T_{l,r} = 1, then the functional Γ(P_t) is presumably not constant over [l − r, l + r), which suggests that there is at least one change-point in [l − r + 1, l + r − 1]. As a consequence, the multiscale test gives a collection I(T) = {[l − r + 1, l + r − 1] s.t. T_{l,r} = 1} of intervals that tentatively contain at least one change-point.
If all these intervals were disjoint, then one would simply take τ̂ as the sequence of centers of these intervals. Unfortunately, when two intervals [l_1 − r_1 + 1, l_1 + r_1 − 1] and [l_2 − r_2 + 1, l_2 + r_2 − 1] in I(T) have a non-empty intersection, one cannot necessarily decipher whether there is only one change-point in the intersection of both intervals or if each interval contains a specific change-point. Hence, our general objective is to transform the collection I(T) into a collection of non-intersecting intervals by either discarding or merging some of them.

Figure 2: The dyadic grid: for each r = 2^i and l ∈ D_r, the interval [l − r + 1, l + r − 1] is drawn at position (l, log₂(r)).
We propose the following bottom-up iterative procedure for building a collection of non-intersecting intervals. Start with T_0 = S_0 = ∅. For any scale r ∈ R, we compute the collection T_r of locations and the collection S_r of intervals of scale r through T_r = { l ∈ D_r : T_{l,r} = 1 and [l − r + 1, l + r − 1] ∩ S_{r'} = ∅ for all r' ∈ R with r' < r } and S_r = ∪_{l ∈ T_r} [l − r + 1, l + r − 1]. The sets T_1 and S_1 are made of all positions l such that T_{l,1} = 1. More generally, T_r contains all locations l such that T_{l,r} = 1 and the corresponding interval [l − r + 1, l + r − 1] does not intersect any of the intervals detected at a smaller scale r' < r. The set S_r contains all intervals associated to T_r.
One can easily check that S = ∪_r S_r is a union of closed non-intersecting intervals. Denote by C = {C_1, ..., C_K̂} the partition of S into connected components such that, for all 1 ≤ i < j ≤ K̂, max C_i < min C_j. Finally, we estimate the vector of change-points τ by taking the center of each segment C_k. In other words, we take τ̂_k := (min C_k + max C_k)/2 for any 1 ≤ k ≤ K̂. This bottom-up aggregation procedure is summarized in Algorithm 1 and illustrated in Figure 3 below.
Remark: If, for some r ∈ R and some l_1 < l_2 ∈ D_r, we have T_{l_1,r} = 1, T_{l_2,r} = 1, and l_1 + r − 1 ≥ l_2 − r + 1, then S_r contains the segment [l_1 − r + 1, l_2 + r − 1]. In other words, our aggregation procedure merges two intersecting intervals if and only if they correspond to the same scale. In Section A, we also introduce a variant of the algorithm where, instead of merging two intersecting intervals with identical scale, we discard one of them.

Data: y_t, t = 1, ..., n and local test statistics (T_{l,r})_{(l,r)∈G}
Result: (τ̂_k)_{k≤K̂}
T_r, S_r = ∅ for all r ∈ R and S = ∅;
for increasing r ∈ R do
    for l ∈ D_r s.t. T_{l,r} = 1 and [l − r + 1, l + r − 1] ∩ S = ∅ do
        add l to T_r and [l − r + 1, l + r − 1] to S_r;
    end
    S = S ∪ S_r;
end
Let (C_k)_{k=1,...,K̂} be the connected components of S sorted in increasing order;
return (τ̂_k = (min C_k + max C_k)/2)_{k=1,...,K̂}
Algorithm 1: Bottom-up aggregation procedure of multiscale tests

Computational Cost. A naive implementation of Algorithm 1 (and also of Algorithm 2 defined in the Appendix) computes all tests T_{l,r} on the grid, whereas the aggregation procedure only needs to compute a number of tests T_{l,r} proportional to the size of the grid. More precisely, if the computational cost of T_{l,r} is Λ_{l,r} for each (l, r) in the grid G, then the aggregation procedure requires O(Σ_{(l,r)∈G} Λ_{l,r}) computations. If, for all (l, r), the cost Λ_{l,r} is proportional to r, that is Λ_{l,r} = O(rΛ), then the overall computational cost is O(Λ Σ_{(l,r)∈G} r), which is O(Λn³) for the complete grid and O(Λ n log(n)) for the dyadic grid. One can speed up the full procedure by computing the statistics T_{l,r} and aggregating on the fly, checking whether [l − r + 1, l + r − 1] intersects S before evaluating T_{l,r}. Indeed, the connected components C_k can be computed at each increasing scale r. Hence, at scale r, one only needs to compute the tests T_{l,r} at locations l such that [l − r + 1, l + r − 1] does not intersect the connected components detected at scales r' < r.
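A compact sketch of the bottom-up aggregation of Algorithm 1, with a hypothetical oracle test standing in for the homogeneity tests (the oracle, its power threshold, and the grid sizes below are our illustrative assumptions):

```python
def aggregate(grid, test):
    """Bottom-up aggregation of multiscale tests, a sketch of Algorithm 1.
    `grid` maps each scale r to its locations D_r; `test(l, r)` returns
    True when the homogeneity test T_{l,r} rejects on [l - r, l + r)."""
    S = []  # detected intervals (a, b) from scales processed so far
    for r in sorted(grid):
        S_r = []
        for l in sorted(grid[r]):
            a, b = l - r + 1, l + r - 1
            if not test(l, r):
                continue
            # discard (l, r) if [a, b] meets an interval from a smaller scale
            if any(a <= b2 and a2 <= b for (a2, b2) in S):
                continue
            # merge intersecting intervals of the *same* scale
            if S_r and a <= S_r[-1][1]:
                S_r[-1] = (S_r[-1][0], b)
            else:
                S_r.append((a, b))
        S += S_r
    # connected components of the union, sorted; estimate = center of each
    comps = []
    for a, b in sorted(S):
        if comps and a <= comps[-1][1]:
            comps[-1] = (comps[-1][0], max(comps[-1][1], b))
        else:
            comps.append((a, b))
    return [(a + b) / 2 for a, b in comps]


# Hypothetical oracle on a series of length n = 60 with one true change-point
# at t = 20: a test rejects iff its interval covers t = 20 and its scale is
# at least 4 (small scales are assumed powerless here).
n = 60
grid = {r: list(range(r + 1, n - r + 2, r)) for r in (1, 2, 4, 8)}

def oracle_test(l, r):
    return r >= 4 and l - r + 1 <= 20 <= l + r - 1

estimates = aggregate(grid, oracle_test)  # a single estimate near t = 20
```

At scale 4, the two rejecting intervals [14, 20] and [18, 24] are merged into [14, 24]; the scale-8 rejections intersect this interval and are discarded, so a single change-point is returned.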

General analysis
In this subsection, we provide an abstract theorem translating error controls of the multiple testing procedure T into properties of τ̂. As explained in the introduction, the time series (y_t) may contain change-points that are too small to be detected. With this in mind, we define a subset K* ⊂ [K] of indices corresponding to so-called significant change-points. As our purpose is to provide deterministic conditions under which the change-points in K* are detected, we need to introduce, for each k ∈ K*, an element of the grid (τ̃_k, r̃_k) ∈ G at which the statistic T is expected to detect τ_k. One can think of τ̃_k as a position close to τ_k and of r̃_k as a radius large enough to convey information on the change-point. Recall that the length r_k of the change-point τ_k is defined by r_k = min(τ_{k+1} − τ_k, τ_k − τ_{k−1}). We assume that the scale r̃_k and the location τ̃_k of detection satisfy the two following conditions: 4(r̃_k − 1) < r_k and |τ̃_k − τ_k| ≤ r̃_k − 1.
The first condition ensures that the scale r̃_k < r_k/4 + 1 is small enough compared to the length r_k. The second condition is always satisfied if τ̃_k is the best approximation of τ_k in D_{r̃_k} and if the grid G satisfies the following approximation property (App): for all r ∈ R and all l ∈ [r + 1, n − r + 1], there exists l' ∈ D_r such that |l' − l| ≤ r − 1.
This property entails that any point l can be approximated at distance r − 1 by some location in D_r. It also implies that each point l ∈ [r + 1, n − r] belongs to at least one segment (l' − r, l' + r), where l' lies in D_r. In practice, the a-adic grids G_a and the complete grid satisfy (App). Next, we introduce an event on the tests (T_{l,r}) under which the change-point estimator τ̂ of Algorithm 1 performs well. In the following, we write H_0 for the collection of all (l, r) ∈ J_n such that there is no change in [l − r + 1, l + r − 1], i.e. Γ(P_t) is constant on [l − r, l + r). Equivalently, we have (l, r) ∈ H_0 iff (l − r, l + r) ∩ {τ_k, k = 1, ..., K} = ∅.
For a collection K* and elements of the grid (τ̃_k, r̃_k) satisfying (6), the event A(T, K*, (τ̃_k, r̃_k)_{k∈K*}) is defined as the conjunction of the two following properties: (i) (No false positive) T_{l,r} = 0 for all (l, r) ∈ H_0 ∩ G; (ii) (Detection of significant change-points) for every k ∈ K*, we have T_{τ̃_k, r̃_k} = 1.
The first property states that T makes no type I error on the event A(T, K*, (τ̃_k, r̃_k)_{k∈K*}), whereas the second property enforces that all the significant change-points are detected by the specific tests T_{τ̃_k, r̃_k}.

Theorem 1. The following holds for any grid G, any local test statistic T, any non-negative integer K, any distribution with K change-points, any K* ⊂ [K] and any scales and locations (τ̃_k, r̃_k)_{k∈K*} in G satisfying Assumption (6). Under the event A(T, K*, (τ̃_k, r̃_k)_{k∈K*}), the estimated change-point vector τ̂ returned by Algorithm 1 satisfies the two following properties:
• Significant change-points are detected: for all k ∈ K*, there exists l ≤ K̂ such that |τ̂_l − τ_k| ≤ r̃_k − 1 < r_k/4.
• No spurious change-point is estimated: τ̂ satisfies (NoSp).
The first property states that the so-called significant change-points (τ_k)_{k∈K*} are detected by the generic algorithm at the right scale. The no-spurious property (1) guarantees that, around any true change-point τ_k, the procedure estimates at most one change-point τ̂_l. Importantly, the theorem does not make any assumption on the non-significant change-points. In fact, change-points τ_k with k ∈ [K] \ K* may or may not be detected. In general, we can only conclude from Theorem 1 that |K*| ≤ K̂ ≤ K on the event A(T, K*, (τ̃_k, r̃_k)_{k∈K*}).
Theorem 1 is abstract, but its main virtue is to translate multiple testing properties into change-point detection properties. For a specific problem such as the multivariate mean change-point detection considered in the next section, the construction of a near-optimal procedure boils down to introducing a collection of local test statistics such that (a) change-points τ_k belong to K* under minimal conditions, (b) the scale r̃_k is as small as possible, and (c) the event A(T, K*, (τ̃_k, r̃_k)_{k∈K*}) holds with high probability.
In the case where all the change-points are significant, the result of Theorem 1 can be reformulated as follows.

Corollary 1. The following holds for any grid G, any local test statistic T, any non-negative integer K, any distribution with K change-points, and any (τ̃_k, r̃_k)_{k=1,...,K} in G satisfying Assumption (6). Under the event A(T, [K], (τ̃_k, r̃_k)_{k=1,...,K}), the estimated change-point vector τ̂ returned by Algorithm 1 satisfies K̂ = K and |τ̂_k − τ_k| ≤ r̃_k − 1 for all k = 1, ..., K.

Consider the Hausdorff distance and the Wasserstein distance between two vectors (u_1, ..., u_K) and (v_1, ..., v_{K'}), the former being the largest distance from an entry of one vector to the nearest entry of the other. Then, Corollary 1 straightforwardly implies that, if K* = [K], both losses between τ̂ and τ are at most max_{k∈[K]} (r̃_k − 1). As an alternative to Algorithm 1, one could use other bottom-up aggregation procedures. For instance, Algorithm 2 defined in Appendix A also satisfies Theorem 1. Although these two algorithms are closely related, Algorithm 1 is slightly more conservative than Algorithm 2, since it merges all detection intervals at a given resolution, while Algorithm 2 only keeps one interval at a given resolution when multiple intervals intersect (the one with smallest index). While the minimax properties of both methods are comparable, at least up to a multiplicative constant, the choice of aggregation method influences the outcome in practice: Algorithm 1 will be slightly more stable, detect fewer change-points, and provide wider confidence intervals around them, while Algorithm 2 will be slightly more sensitive to smaller changes, i.e. detect smaller change-points, will be more precise, and somewhat less stable.
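The Hausdorff loss between an estimate and the truth is straightforward to compute; the minimal implementation below (ours, using the standard definition of the Hausdorff distance between finite point sets) illustrates it on a toy example.

```python
def hausdorff(u, v):
    """Hausdorff distance between two non-empty vectors of change-point
    locations: the largest distance from an entry of either vector to
    the nearest entry of the other."""
    d_uv = max(min(abs(a - b) for b in v) for a in u)
    d_vu = max(min(abs(a - b) for a in u) for b in v)
    return max(d_uv, d_vu)

tau = [25, 50, 75]        # true change-points
tau_hat = [24, 52, 75]    # estimates, each within distance 2 of the truth
```

Note that the distance is symmetric and penalizes spurious estimates: a single extra estimate far from every true change-point makes the loss large.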
Theorem 1 ensures that, if T_{τ̃_k, r̃_k} = 1 with (τ̃_k, r̃_k) satisfying Assumption (6), then the change-point τ_k is detected. Inspecting the proof of Theorem 1, one easily checks that Assumption (6) is minimal for Algorithm 1 (and also for Algorithm 2). Still, one may wonder whether any generic algorithm has to require that 4(r̃_k − 1) < r_k to detect the change-points, or whether there exists a generic algorithm for which the constant 4 in this condition can be improved.
Comparison with narrowest-over-threshold methods. As mentioned in the introduction, other aggregation procedures have been proposed in the literature. In particular, the narrowest-over-threshold scheme proposed by [2] and later used in [24] is also closely related to the local segmentation algorithm of Chan and Chen [5]. A simple extension of these procedures to generic change-point problems and to a general collection of tests (T_{l,r}) would amount to modifying Algorithm 1 by selecting locations l in D_r such that T_{l,r} = 1 and [l − r + 1, l + r − 1] does not intersect previously detected change-points, whereas we require in Algorithms 1 and 2 that [l − r + 1, l + r − 1] does not intersect previously detected confidence intervals. In some way, the narrowest-over-threshold scheme is therefore less conservative. Unfortunately, there is no generic result in the form of Theorem 1 for such procedures and, from informal arguments, we doubt that the corresponding procedure provably achieves (NoSp) under a control of the FWER of the tests. Inspecting the proof of Theorem 1 in [2] and of Theorem 3 in [24] for univariate mean change-point problems, one observes that the chosen threshold is much larger than what is needed to control the FWER, so that the theoretical threshold is certainly over-conservative; see step 5 of the proof of Theorem 1 in [2]. In contrast, Theorem 1 in [5] for univariate change-point problems is based on the minimal threshold, but the proof relies on the important assumption that the number K of change-points remains bounded while n goes to infinity. Besides, it is not clear how one could extend the arguments to more general settings.

Multivariate Gaussian change-point detection
We now turn to the multivariate change-point model introduced in Section 1.2. Throughout this section, we assume that the random vectors ε t are independently and identically distributed with ε t ∼ N (0, σ 2 I p ). Since we shall apply the general aggregation procedures introduced in the previous section, our main job here is to introduce a near-optimal testing procedure.
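All the tests below are built from the rescaled CUSUM statistic C_{l,r}, which compares the empirical means of the r observations on each side of a candidate location l. As a point of reference, here is a minimal sketch (the function name and array layout are ours; the normalization √(r/2) is chosen so that, under the null, each of the p coordinates is standard normal):

```python
import numpy as np

def cusum(Y, l, r, sigma=1.0):
    """Rescaled CUSUM statistic C_{l,r} over the window [l-r, l+r).

    Y has shape (p, n). Under the no-change null with noise level sigma,
    each of the p coordinates of the returned vector is N(0, 1).
    """
    left = Y[:, l - r:l].mean(axis=1)    # empirical mean over [l-r, l)
    right = Y[:, l:l + r].mean(axis=1)   # empirical mean over [l, l+r)
    return np.sqrt(r / 2.0) * (right - left) / sigma
```

Each of the two window means has variance σ²/r per coordinate, so their difference has variance 2σ²/r, which the factor √(r/2)/σ standardizes to 1.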
Fix some quantity δ ∈ (0, 1). At the end of the section, 1 − δ will correspond to the probability of the event A(T, K*, (τ̄_k, r̄_k)_{k∈K*}) introduced in the previous section. Alternatively, one may interpret δ as an upper bound on the desired probability that the change-point detection procedure detects a spurious change-point. Recall that, for a change-point τ_k, s_k stands for the sparsity of the difference µ_{k+1} − µ_k. The energy of a given change-point τ_k is c_0-high if

r_k ∆_k² ≥ c_0 σ² [ ( s_k log(1 + √(p log(n/(r_k δ)))/s_k) ) ∧ √(p log(n/(r_k δ))) + log(n/(r_k δ)) ] ,    (8)

for some universal constant c_0 to be defined later. We show in this section that when c_0 is large enough, all high-energy change-points can be detected. Conversely, it is established in Section 5 that Condition (8) is (up to a multiplicative constant) optimal for detecting change-points and cannot be weakened. Let us now discuss the different regimes contained in Equation (8). In what follows, in order to alleviate notation, define γ_r = log(n/(rδ)) and ψ^(g)_{n,r,s} = [s log(1 + √(p γ_r)/s)] ∧ √(p γ_r) + γ_r, so that (8) reads r_k ∆_k² ≥ c_0 σ² ψ^(g)_{n,r_k,s_k}. If γ_r ≥ p/2, then ψ^(g)_{n,r,s} ≍ γ_r, where u ≍ v means that, for two positive numerical constants c_1 and c_2, one has c_1 v ≤ u ≤ c_2 v. This corresponds to the minimal energy condition for detection in the univariate case, i.e. when p = 1; see [38]. The condition γ_r ≥ p/2 occurs when p is rather small and the scale r is much smaller than n. If γ_r ≤ p/2, then ψ^(g)_{n,r,s} ≍ s log(1 + √(p γ_r)/s) + γ_r for s ≤ √(p γ_r), and ψ^(g)_{n,r,s} ≍ √(p γ_r) for s ≥ √(p γ_r). We define K* ⊂ [K] as the subset of indices k such that τ_k satisfies (8). For any k ∈ K*, we define r*_k as the minimum radius r such that an inequality similar to (8) is satisfied for r ∆_k², namely

r ∆_k² ≥ c_0 σ² ψ^(g)_{n,r,s_k} .    (9)

In the following, we introduce multi-scale tests for respectively dense and sparse change-points. For simplicity, we restrict our attention to the dyadic grid G_D = (R, D) introduced in the previous section (see Equation (5)), the complete grid being used in the next section.
To apply Theorem 1, we will consider, in the proof of Corollary 2, an event A(T, K*, (τ̄_k, r̄_k)_{k∈K*}) where the scale r̄_k ∈ R is of the same order as r*_k.

Dense change-points
We focus here on dense change-points, for which s_k is possibly as large as p. Given κ > 0, τ_k is a κ-dense high-energy change-point if

r_k ∆_k² ≥ κ σ² [ √(p log(n/(r_k δ))) + log(n/(r_k δ)) ] .    (10)

The requirement (10) is analogous to (8) when s_k ≥ [p log(n/(r_k δ))]^{1/2}. For any κ-dense high-energy change-point, we define r̄_k^(d) ∈ R as the minimum radius r ∈ R such that an inequality of the same type as (10) is satisfied for r ∆_k². Intuitively, r̄_k^(d) corresponds to the smallest scale such that τ_k is guaranteed to be detected. By definition, we have 4(r̄_k^(d) − 1) < r_k. Let τ̄_k^(d) be the best approximation of τ_k in the grid with scale r̄_k^(d). By definition of the dyadic grid, we have |τ̄_k^(d) − τ_k| ≤ r̄_k^(d)/4. Recall that the rescaled CUSUM statistic C_{l,r} depends on the noise level σ; the dense statistic Ψ^(d)_{l,r} is built from it, and the test T^(d)_{l,r} rejects the null when Ψ^(d)_{l,r} exceeds a suitable threshold x^(d)_r.

Proposition 1. There exists a universal constant κ_d > 0 and an event ξ^(d) of probability larger than 1 − δ on which the tests T^(d)_{l,r} make no false rejection and detect all κ_d-dense high-energy change-points.

The above proposition ensures that, on the event ξ^(d), the collection of tests T^(d) behaves as required by Theorem 1. If we plugged this collection of tests into the general multiple change-point procedure, then Theorem 1 would entail that all κ_d-dense high-energy change-points are discovered and localized, and that τ̂ does not detect any spurious change-point. In the next subsection, we introduce alternative tests that are tailored to sparse change-points and thereby allow us to detect change-points that are not κ_d-dense high-energy but still satisfy the energy condition (8).
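As an illustration, a dense test of this kind can be sketched as a chi-square-type statistic on ‖C_{l,r}‖², thresholded at order √(p γ_r) + γ_r with γ_r = log(n/(rδ)). The constant kappa below is purely illustrative and does not match the calibration used in the proofs:

```python
import numpy as np

def dense_test(Y, l, r, n, delta, sigma=1.0, kappa=3.0):
    """Dense test sketch: reject when ||C_{l,r}||^2 - p exceeds a threshold
    of order sqrt(p * gamma_r) + gamma_r, where gamma_r = log(n/(r*delta)).
    kappa is an illustrative tuning constant, not the paper's."""
    p = Y.shape[0]
    # rescaled CUSUM over the window [l-r, l+r)
    diff = Y[:, l:l + r].mean(axis=1) - Y[:, l - r:l].mean(axis=1)
    C = np.sqrt(r / 2.0) * diff / sigma
    gamma = np.log(n / (r * delta))
    threshold = kappa * (np.sqrt(p * gamma) + gamma)
    # under the null, C @ C is chi-square with p degrees of freedom
    return float(C @ C) - p > threshold
```

The centering by p reflects that ‖C_{l,r}‖² is a chi-square statistic with p degrees of freedom under the null, and the threshold matches its deviation bounds.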

Energy condition
For a given 1 ≤ k ≤ K, the change-point τ_k is a κ-sparse high-energy change-point if s_k ≤ [p log(n/(r_k δ))]^{1/2} and

r_k ∆_k² ≥ κ σ² [ s_k log(1 + √(p log(n/(r_k δ)))/s_k) + log(n/(r_k δ)) ] .    (11)

If τ_k is a κ-sparse high-energy change-point, we define r̄_k^(s) as the minimum scale such that an inequality similar to (11) is satisfied for r ∆_k². As in the dense case, we have 4(r̄_k^(s) − 1) < r_k. We then define τ̄_k^(s) as the best approximation of τ_k in the grid D_{r̄_k^(s)} at scale r̄_k^(s). By definition of the dyadic grid, we have |τ̄_k^(s) − τ_k| ≤ r̄_k^(s)/4. We introduce below two statistics for handling this problem.

Berk-Jones Test
The Berk-Jones test [29] is a variation of the Higher-Criticism test originally introduced in [11] for signal detection. It has been previously studied in [6] for sparse segment detection. We decided to use the Berk-Jones test in this paper because of its intrinsic formulation in terms of the quantiles of a Binomial distribution, but the Higher-Criticism test would reach the same rates of detection within a constant factor. We use the notation N* to denote the set of positive integers. Given (l, r) in the grid G_D, we first introduce N_{x,l,r} as the number of coordinates of C_{l,r} that are larger than x in absolute value.
If (l, r) ∈ H_0, then each coordinate of the rescaled CUSUM statistic follows a standard normal distribution, and N_{x,l,r} therefore follows a Binomial distribution with parameters p and 2Φ̄(x), where Φ̄ stands for the standard normal survival function. The Berk-Jones test amounts to rejecting the null when at least one of the statistics N_{x,l,r}, for x ∈ N*, is significantly large. Next, we formalize what we mean by 'large'. For any u > 0, any q_0 ∈ [0, 1], and any positive integer p_0, denote Q(u, p_0, q_0) = P[B(p_0, q_0) > u] the tail distribution function of a Binomial distribution with parameters p_0 and q_0. Given δ ∈ [0, 1], we then write Q^{-1}(δ, p_0, q_0) for the corresponding quantile function. Given a scale r ∈ R and a positive integer x, we define weights δ^(BJ)_{x,r} that sum to at most δ over all x ∈ N* and all (l, r) ∈ G_D. This allows us to define the Berk-Jones test over [l − r, l + r) as the test T^(BJ)_{l,r} rejecting the null when at least one statistic N_{x,l,r} exceeds the quantile Q^{-1}(δ^(BJ)_{x,r}, p, 2Φ̄(x)) (Equation (14)).
Equivalently, T^(BJ)_{l,r} is an aggregated test based on the statistics N_{x,l,r} with weights δ^(BJ)_{x,r}. From the above remark and a union bound, we deduce that the probability that the collection of tests {T^(BJ)_{l,r}, (l, r) ∈ G_D} makes at least one false rejection is at most δ, where we recall that (l, r) ∈ H_0 if and only if Θ is constant on [l − r, l + r). Although one may think from the definition (14) that T^(BJ)_{l,r} involves an infinite number of statistics N_{x,l,r}, this is not the case.
Indeed, N_{x,l,r} is a non-increasing function of x and, for all x such that 2pΦ̄(x) ≤ δ^(BJ)_{x,r}, the corresponding quantile Q^{-1}(δ^(BJ)_{x,r}, p, 2Φ̄(x)) is equal to zero. Since N_{x,l,r} is non-increasing, among all such x it suffices to consider the smallest one, so that the test only involves a finite number of statistics N_{x,l,r}.
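A minimal sketch of the Berk-Jones exceedance-count test follows. The weights δ/(x(x+1)), which sum to δ over x ≥ 1, are an illustrative summable choice and not the paper's exact weights δ^(BJ)_{x,r}; the binomial quantile is computed naively:

```python
import math

def norm_sf(x):
    """Survival function of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def binom_quantile(delta, p0, q0):
    """Smallest u such that P[Bin(p0, q0) > u] <= delta (naive O(p0) scan)."""
    tail = 0.0
    u = p0
    while u >= 0:
        if tail > delta:          # at this point, tail equals P[Bin > u]
            return u + 1
        tail += math.comb(p0, u) * q0 ** u * (1.0 - q0) ** (p0 - u)
        u -= 1
    return 0

def berk_jones(C, delta, x_max=None):
    """Reject when, for some integer x, the number of coordinates of C larger
    than x in absolute value exceeds the binomial quantile."""
    p = len(C)
    if x_max is None:
        # beyond this level the quantiles vanish, so larger x can be skipped
        x_max = int(math.sqrt(2.0 * math.log(2.0 * p / delta))) + 1
    for x in range(1, x_max + 1):
        n_x = sum(1 for c in C if abs(c) > x)  # exceedance count N_{x,l,r}
        if n_x > binom_quantile(delta / (x * (x + 1)), p, 2.0 * norm_sf(x)):
            return True
    return False
```

By the union bound, the false-rejection probability of this sketch is at most Σ_x δ/(x(x+1)) = δ, mirroring the FWER control described above.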

Partial norm statistics
The Berk-Jones test is able to detect change-points τ_k for which there exists s such that the s largest squared coordinates of µ_k − µ_{k−1} are larger than C(log(ep/s²) + log(n/r_k)/s) for a large enough constant C. However, it may happen that τ_k satisfies the energy condition (8) while the s largest squared coordinates of µ_k − µ_{k−1} are negligible compared to log(n/r_k)/s, mainly because s ↦ 1/s is not summable. To solve this issue, we introduce a second sparse statistic based on partial sums. Let Z = {1, 2, 2², . . . , 2^⌊log₂(p)⌋} denote the dyadic set. Only the sparsities s ∈ Z will be analysed by the partial norm statistic. For any (l, r) in the grid G_D, we write C_{l,r,(1)}, C_{l,r,(2)}, . . . for the entries of C_{l,r} reordered by decreasing absolute value, that is, |C_{l,r,(1)}| ≥ · · · ≥ |C_{l,r,(p)}|. Then, for s ∈ Z, we define the partial CUSUM norm by Ψ^(p)_{l,r,s} = Σ_{i=1}^{s} C_{l,r,(i)}², and the test T^(p)_{l,r} rejecting the null when at least one of the partial norms is large (Equation (15)). Finally, we define the sparse test by aggregating both the Berk-Jones test and the partial norm test: for any (l, r) ∈ G_D, let T^(s)_{l,r} = T^(BJ)_{l,r} ∨ T^(p)_{l,r}. Here we introduced two different statistics for the same sparse regime s_k ≤ [p log(n/(r_k δ))]^{1/2} (the Berk-Jones statistic and the partial sums statistic) mainly to solve a problem of integrability. We made this choice for the sake of simplicity, but we could have used a single thresholded-sum statistic, as presented in [28], whose centering involves a standard normal variable Z ∼ N(0, 1). This statistic leads to the same type of result as the Berk-Jones statistic when enough coordinates of µ_k − µ_{k−1} are large in absolute value, and it is comparable to the partial sums statistic when its threshold x becomes low enough.
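The partial norm test can be sketched as follows. The threshold shape and the constant c are illustrative stand-ins for the calibrated thresholds of Equation (15):

```python
import numpy as np

def partial_norm_test(C, delta, c=3.0):
    """Partial-norm test sketch: for each dyadic sparsity s, compare the sum
    of the s largest squared coordinates of C with a threshold of order
    s + s*log(e*p/s) + log(1/delta). The threshold shape and the constant c
    are illustrative, not the paper's calibration."""
    C = np.asarray(C, dtype=float)
    p = len(C)
    sq = np.sort(np.abs(C))[::-1] ** 2        # squared coordinates, decreasing
    s = 1
    while s <= p:                              # dyadic sparsities 1, 2, 4, ...
        psi = sq[:s].sum()                     # partial CUSUM norm Psi_{l,r,s}
        if psi > s + c * (s * np.log(np.e * p / s) + np.log(1.0 / delta)):
            return True
        s *= 2
    return False
```

Restricting attention to dyadic sparsities keeps the number of aggregated statistics logarithmic in p, as in the definition of Z above.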

Consequences
To conclude this section, it suffices to observe that, for c_0 as in (8), any c_0-high-energy change-point τ_k in the sense of (8) is either a (c_0/2)-dense or a (c_0/2)-sparse high-energy change-point. Hence, upon defining the test T_{l,r} = T^(d)_{l,r} ∨ T^(BJ)_{l,r} ∨ T^(p)_{l,r} for (l, r) ∈ G_D, we consider the change-point procedure τ̂ defined in Algorithm 1. Gathering Theorem 1 with Proposition 1 and Proposition 2, we obtain the following.

Corollary 2.
There exists a universal constant c_0 > 0 such that, with probability higher than 1 − 6δ, the estimator τ̂ satisfies (NoSp) and detects all c_0-high-energy change-points τ_k (as defined in (8)), in the sense that each such τ_k lies within distance r*_k/2 of an estimated change-point, where r*_k is defined in (9).
If the change-points are all of high energy, that is K* = [K], then Corollary 2 can be reformulated as follows: Corollary 3. Assume that, for all k = 1, . . . , K, τ_k is a c_0-high-energy change-point (see (8)), where c_0 is the same as in Corollary 2. Then, with probability higher than 1 − 6δ, the estimator τ̂ satisfies K̂ = K and |τ̂_k − τ_k| ≤ r*_k/2 for all k = 1, . . . , K.
In particular, with probability higher than 1 − 6δ, one can respectively bound the Hausdorff and the Wasserstein losses accordingly. In Section 5, we establish that Condition (8) is (up to a multiplicative constant) unimprovable and corresponds to the detection threshold for multivariate change-points.
Corollary 3 can be compared to the result of [46] on multivariate change-point detection in the multiple change-point setting. Using a method based on the CUSUM statistic and assuming that there are only high-energy change-points, the authors also obtain an upper bound on the energy necessary to detect the change-points. However, this result does not adapt to r_k, ∆_k, s_k, and the detection rate is suboptimal in many regimes. Writing r = min_{k=1,...,K} r_k, ∆ = min_{k=1,...,K} ∆_k and s = max_{k=1,...,K} s_k, Theorem 5 of [46] requires two conditions of the type r∆² ≥ c(n/r)⁴ log(np) and r∆² ≥ c s (n/r) log(np). This detection rate is therefore suboptimal by a polynomial factor in n/r when r is of smaller order than n, and by a logarithmic factor log(np) instead of log(1 + (√p/s) log(n/r)) + (1/s) log(n/r) when r is of order n. Closer to our results, [5] have introduced another bottom-up procedure in the very specific asymptotic setting n = e^{p^ζ} for ζ ∈ (0, 1), with a fixed number K of change-points. Assuming that, for each change-point, at least s coordinates of µ_{k+1} − µ_k are large enough in absolute value, [5] establish that their procedure provably detects the change-points. In their specific asymptotic regime, when all non-zero coordinates are of the same order and all the change-points have a similar length r_k, their result is similar to ours up to logarithmic terms. Indeed, for equispaced change-points, our logarithmic term log(n/r_k) = log(K) is much smaller than log(n). Besides, their result does not handle the presence of low-energy change-points and does not hold beyond the asymptotic regime n = e^{p^ζ}. In contrast, our condition (8) for high-energy change-points entails that the detection conditions are qualitatively different for other scalings in n and p. On the technical side, our condition (8) is of l_2 type whereas that in [5] is of minimal non-zero type.
Recovering the tight l 2 conditions turns out to be much more challenging as we need to handle situations where some coordinates have different orders of magnitude. This is the main reason why we need to resort to a combination of the Berk-Jones and the partial-norm statistics.
Comparison to the one change-point problem. When one knows that K ≤ 1 (at most one change-point), [28] proved that it is possible to detect τ_1 if and only if r_1 ∆_1² ≥ cσ²[s_1 log(1 + (√p/s_1) log log(8n)) + log log(8n)]. As in the univariate setting, the problem with only one change-point is simpler than for general K ≥ 2. As for our procedure, Liu et al. [28] rely on statistics based on the CUSUM (a chi-square statistic in the dense case and a thresholded sum of squared coordinates in the sparse case) to detect and localize τ_1. It turns out that the detection procedure of [28] adapts to the distance r_1 = min(τ_1 − 1, n + 1 − τ_1) to the boundary, and one could refine their result by stating that τ_1 is detectable if and only if r_1 ∆_1² ≥ cσ²[s_1 log(1 + (√p/s_1) log log(2n/r_1)) + log log(2n/r_1)], which is smaller when r_1 is of the order of n. This refined result is in the same spirit as our bounds for multiple change-points, but the rate is faster because one obtains log log(2n/r_1) instead of log(n/r_k) in our case. The reason for this faster rate is the relative simplicity of the problem with only one change-point. Indeed, in single change-point detection, there is no need to look for change-points at all positions and scales at the same time, since scale and position are related. This implies that it is possible to attain faster rates than in multiple change-point detection. The comparison between single and multiple change-point detection is carried out thoroughly in [38] for univariate models.
Computational Cost. The cost of the tests T^(d)_{l,r} in the dense regime is O(rp). The computation of the partial norm statistic requires sorting the coordinates C_{l,r,i} of the CUSUM statistic, which takes O(p(r + log(p))) operations. Since only the thresholds x ≤ c log^{1/2}(np/(rδ)) are needed to compute the Berk-Jones statistic, it holds that, for δ ≥ (np)^{−c} with a numerical constant c > 0, the computational cost of the Berk-Jones statistic is O(p(r + log(np))). Thus, for each (l, r), the overall computational cost of the test T_{l,r} = T^(d)_{l,r} ∨ T^(BJ)_{l,r} ∨ T^(p)_{l,r} is Λ = O(p(r + log(np))), and the computational cost of the whole change-point detection procedure on the dyadic grid is O(np log(np)).
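The per-scale costs above rely on the fact that, at a fixed scale r, all CUSUM statistics can be obtained from one pass of cumulative sums, in O(p) per location rather than O(rp). A sketch (function name ours):

```python
import numpy as np

def all_cusums_at_scale(Y, r, sigma=1.0):
    """All rescaled CUSUM statistics C_{l,r} for l = r, ..., n-r, computed
    from cumulative sums: O(np) for the whole scale instead of O(nrp)
    when each window mean is recomputed from scratch."""
    p, n = Y.shape
    # S[:, t] holds the sum of the first t columns of Y
    S = np.concatenate([np.zeros((p, 1)), np.cumsum(Y, axis=1)], axis=1)
    out = {}
    for l in range(r, n - r + 1):
        left = (S[:, l] - S[:, l - r]) / r       # mean over [l-r, l)
        right = (S[:, l + r] - S[:, l]) / r      # mean over [l, l+r)
        out[l] = np.sqrt(r / 2.0) * (right - left) / sigma
    return out
```

Running this routine for each dyadic scale r ∈ R gives the O(np log(np)) overall budget quoted above, up to the per-location sorting and thresholding costs.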

Multi-scale change-point detection with sub-Gaussian noise
We now turn to the more general case of sub-Gaussian distributions [37]. Given a random variable Z, define its ψ₂-norm by ‖Z‖_{ψ₂} = inf{x > 0 : E[exp(Z²/x²)] ≤ 2}. Given L > 0, a mean-zero real random variable Z is said to be L-sub-Gaussian if ‖Z‖_{ψ₂} ≤ L. This implies in particular that, for all x ≥ 0, one has P(|Z| ≥ x) ≤ 2 exp(−x²/L²). Throughout this section, we assume that, for t = 1, . . . , n, the random vectors ε_t are independent and have independent L-sub-Gaussian components ε_{t,i}, for i = 1, . . . , p, with variance σ². As in the previous section, we apply the general aggregation procedures introduced in Section 2. As a consequence, our main task boils down to introducing a near-optimal multiple testing procedure indexed by a grid for detecting the existence of a change-point. Here, we shall rely on the complete grid G_F = J_n = {(l, r) : r = 1, . . . , ⌊n/2⌋ and l = r + 1, . . . , n − r}, whose size is quadratic with respect to n. All the results presented in this section are still valid (but with different numerical constants) if we keep the dyadic grid G_D as in the previous section. Here, we use the complete grid as a proof of concept that one can rely on the full collection of possible segments without deteriorating the rates. Still, controlling the behavior of the procedure on the complete grid is technically more involved and requires chaining arguments. A detailed comparison between the complete and dyadic grids is made in Section 7.
In order to emphasize the common points with the previous section, we use the same notation K* for the collection of high-energy change-points, r̄_k for the scales associated with the k-th change-point, Ψ for the statistics, T for the tests and x for the thresholds, although these quantities are slightly changed to cope with the sub-Gaussian tail distribution. We follow the same scheme as in the Gaussian case and first introduce multi-scale tests for dense change-points before turning to sparse change-points. As in the previous section, we consider some δ ∈ (0, 1) corresponding to the type I error probability.

Dense change-points with sub-Gaussian noise
Recall that, for a change-point τ_k, s_k stands for the sparsity of the difference µ_{k+1} − µ_k. We focus here on dense change-points, for which s_k is possibly as large as p. Given κ > 0, τ_k is a κ-dense high-energy change-point if

r_k ∆_k² ≥ κ L² [ (p log(n/(r_k δ)))^{1/2} + log(n/(r_k δ)) ] .    (17)

This condition is very similar to its counterpart (10) for Gaussian noise. Still, we introduce it here for the sake of completeness. For k ∈ [K] such that τ_k is a κ-dense high-energy change-point, we define r̄_k as the minimum length r such that an inequality similar to (17) is satisfied for r ∆_k². As in the Gaussian case in Section 3, r̄_k corresponds to the smallest scale such that τ_k is guaranteed to be detected. For any κ-dense high-energy change-point, it holds that 4(r̄_k − 1) < r_k. Let c^(d)_thresh > 0 be a tuning parameter, to be discussed later, which calibrates the corresponding multiple test procedure (T^(d)_{l,r}).

Proposition 3. There exists a numerical constant c̄^(d)_thresh > 0 such that the following holds for any κ_d large enough: with probability higher than 1 − δ, the tests (T^(d)_{l,r}) make no false rejection and detect all κ_d-dense high-energy change-points.

In comparison to Proposition 1 in the previous section, there are two differences. First, we need to cope with sub-Gaussian distributions by applying the Hanson-Wright inequality. Most importantly, the grid G_F is much larger than G_D, so that we cannot simply consider each test T^(d)_{l,r} separately and apply a union bound as in the previous section. To handle the dependencies between the statistics Ψ^(d)_{l,r}, we have to apply a chaining argument. In fact, the thresholds x^(d)_r are similar to their counterparts in the previous section, whereas the number |G_F| of tests is now proportional to n². In principle, the benefit of using the full grid G_F is that (τ_k, r̄_k) necessarily belongs to it, whereas on the dyadic grid G_D we needed to consider its best approximation (τ̄_k, r̄_k); the corresponding segment is then not centered on τ_k, and the associated statistic Ψ^(d) loses a constant factor in energy. In summary, both collections of dense tests Ψ^(d)_{l,r}, on G_D and on G_F, are able to detect change-points whose energy is, up to some multiplicative constants, higher than L²[(p log(n/(r_k δ)))^{1/2} + log(n/(r_k δ))].

Sparse change-points with sub-Gaussian noise
Unlike in the Gaussian case, we do not know the exact distribution of the noise. As a consequence, the Berk-Jones test and, more generally, higher-criticism-type tests cannot be applied in this setting. This is why we only rely on the partial norm statistic. Recall that Z = {1, 2, 2², . . . , 2^⌊log₂(p)⌋} stands for a dyadic set of sparsities. For (l, r) ∈ G_F and s ∈ Z, we also recall that the partial CUSUM norm is defined as Ψ^(p)_{l,r,s} = Σ_{i=1}^{s} C_{l,r,(i)}². Then, for any (l, r) ∈ G_F, the test T^(p)_{l,r} rejects the null when at least one of the partial norms is large, where c^(p)_thresh is a tuning parameter appearing in Proposition 4 below. The partial norm test alone is not able to detect sparse high-energy change-points in the sense of (11), and we need to introduce a stronger condition on the energy. Given κ > 0, a change-point τ_k is a κ-sparse high-energy change-point in the sub-Gaussian setting if s_k ≤ [p log(n/(r_k δ))]^{1/2} and

r_k ∆_k² ≥ κ L² [ s_k log(ep/s_k) + log(n/(r_k δ)) ] .    (18)

Conditions (11) and (18) only differ by logarithmic factors. For any κ-sparse high-energy change-point, it holds that 4(r̄_k^(s) − 1) < r_k.

Proposition 4. There exists a numerical constant c̄^(p)_thresh > 0 such that the following holds for any κ_s > 32 c̄^(p)_thresh: with probability higher than 1 − δ, the tests (T^(p)_{l,r}) make no false rejection and detect all κ_s-sparse high-energy change-points.

Consequences
Let c_0 > 0 be some constant that we will discuss later. A change-point τ_k is then said to be a c_0-high-energy change-point (in the sub-Gaussian setting) if

r_k ∆_k² ≥ c_0 L² ψ^(sg)_{n,r_k,s_k} , where ψ^(sg)_{n,r,s} = [s log(ep/s)] ∧ √(p γ_r) + γ_r .    (20)

We re-introduce K* ⊂ [K] as the subset of indices k such that τ_k satisfies (20). We gather both tests by considering, for any (l, r) ∈ G_F, the test T_{l,r} = T^(d)_{l,r} ∨ T^(p)_{l,r}. We then straightforwardly derive from Proposition 3 and Proposition 4 the following result: there exist numerical constants c̄^(d)_thresh and c̄^(p)_thresh > 0 such that, with probability higher than 1 − δ, it holds that (i) T_{l,r} = 0 for all (l, r) ∈ G_F ∩ H_0 and (ii) T_{τ̄_k, r̄_k} = 1 for any c_0-high-energy change-point τ_k in the sense of (20).
Then, it suffices to combine this multiple testing procedure with Algorithm 1 to get the change-point procedure τ̂. Since, for a high-energy change-point in the sense of (20), we have 4(r̄_k − 1) < r_k, we are in a position to apply Theorem 1.
In the case where all change-points are c_0-high-energy change-points in the sense of (20), all of them are detected, and a result similar to Corollary 3 holds here, replacing r*_k/2 by r̄_k − 1. Also, both the Hausdorff distance and the Wasserstein distance can be bounded as in Equation (16) if we replace r*_k/2 by r̄_k − 1. As already stated, we could have obtained a similar result (but with different constants) using the dyadic grid G_D instead of G_F. To conclude this section, let us compare the conditions (20) and (8), where we recall that γ_r = log(n/(rδ)). If γ_r ≥ p/2, then ψ^(sg)_{n,r,s} ≍ γ_r. In low dimension, the energy threshold for multivariate change-point detection is therefore the same as in the univariate setting, see [38]. If γ_r ≤ p/2, then ψ^(sg)_{n,r,s} ≍ [s log(ep/s)] ∧ √(p γ_r) + γ_r. As a consequence, ψ^(sg)_{n,r,s} and ψ^(g)_{n,r,s} are of the same order of magnitude for all s when γ_r ≥ p/2. When log(n/(rδ)) < p, they are also of the same order of magnitude except when s is close to but smaller than √(p γ_r), for which the ratio ψ^(sg)_{n,r,s}/ψ^(g)_{n,r,s} between these two quantities can be as large as log(p) − log(γ_r). This gap corresponds to the regime where the test based on the Berk-Jones statistic defined in Equation (14), used in the Gaussian case, outperforms the test based on the partial CUSUM norm statistic defined in Equation (15).
In the definitions of the tests, the tuning constants c̄^(d)_thresh and c̄^(p)_thresh are numerical. For each (l, r), the cost of the test T_{l,r} is O(p(r + log(p))). Thus, a naive computation of all the tests T_{l,r} for (l, r) in the complete grid G_F requires O(p Σ_{(l,r)∈G_F}(r + log(p))) = O(pn(n² + log(p))) operations. Nevertheless, using the fact that the CUSUM statistics at a fixed scale r can be updated recursively in the location l, it is possible to compute all the tests at scale r with cost O(np log(p)). Since there are n possible scales r on the complete grid, the whole procedure costs O(n²p log(p)). Using a grid G = {(l, r) ∈ G_F : r ∈ R} that contains dyadic scales and all possible locations l for each scale, the whole change-point detection procedure would then require only O(np log(n) log(p)) computations, since there are only log(n) possible scales r for such grids.

Minimax lower bound
In this section, we write P_Θ, for any Θ ∈ R^{p×n}, for the distribution of the time series Y = (y_1, . . . , y_n) in the model (2) with Gaussian noise ε_t ∼ N(0, σ²I_p). In Section 3, we have established that any change-point satisfying Condition (8) is detected by our change-point procedure. We now show that this energy condition is unimprovable from a minimax point of view. More precisely, let us define, for any u > 0, the class P(u) of mean parameters Θ with an arbitrary number K ≥ 0 of change-points and such that any change-point τ_k for 1 ≤ k ≤ K satisfies Condition (8) with c_0 replaced by u. For u small enough, it turns out that no change-point estimator is able to detect all change-points without estimating any spurious change-point with high probability over the full class P(u). Still, using this large class provides somewhat pessimistic bounds. For instance, the most challenging distributions in P(u) for the purpose of change-point detection satisfy s_k = p and r_k = 1 (very close change-points). As a consequence, relying on the full collection P(u) turns out to be too pessimistic.
To establish that our bounds are adaptive with respect to the sparsity s_k and the length r_k, we define, for any positive integers 1 ≤ r ≤ n/2 and 1 ≤ s ≤ p, the collection P(u, r, s) = {Θ ∈ P(u) : min_k r_k ≥ r and max_k s_k ≤ s} .
By convention, constant mean parameters Θ with no change-points (K = 0) also belong to P(u, r, s). In the class P(u, r, s), all change-points have a sparsity at most s and a length at least r. Hence, P(u, r, s) becomes larger when s increases or when r decreases.
Thus, in the Gaussian setting, if all the change-points have a high energy in the sense of (8) but with a smaller multiplicative constant, no change-point estimator can consistently estimate the true number of change-points. The next corollary restates this negative result along the same lines as Corollary 3.
Corollary 6. Fix any u ∈ (0, 1/8). For any σ > 0, n ≥ 2, p > 1, any length 1 ≤ r ≤ n/4, any sparsity 1 ≤ s ≤ p, and any estimator τ̂, there exists some Θ ∈ P(u, r, s) such that, with P_Θ-probability larger than 1/4, at least one of the two following properties is satisfied. This corollary is to be compared to Corollary 3: indeed, the energy condition in Equation (22) differs from Equation (8) only by a numerical multiplicative constant. As a consequence, the energy condition (22) is minimal for detection by a change-point estimator that achieves (NoSp).

Application to other change-point problems
In this section, we apply the general methodology of Section 2 to two other problems, namely detection of covariance and nonparametric change-points. This allows us to obtain the first tight minimax detection conditions for these problems.

Covariance change-point detection
Following Wang et al. [40], we consider the covariance change-point model, where the covariance matrices Σ_t of the centered random vectors y_t ∈ R^p are piecewise constant. The goal is then to estimate the times 0 < τ_1 < . . . < τ_K < τ_{K+1} = n + 1 at which Σ_t changes. See [40] for motivations. As in that work, we assume that the random vectors y_t are independent and sub-Gaussian with a uniformly bounded Orlicz norm, that is, max_{t=1,...,n} ‖y_t‖_{ψ₂} ≤ B for some known fixed B. The Orlicz norm of a random vector y is the supremum of the Orlicz norms of its one-dimensional projections (see e.g. [37]). If the y_t's follow a normal distribution, this amounts to assuming that max_{t=1,...,n} ‖Σ_t‖_op ≤ 2B², where ‖·‖_op stands for the operator norm. The purpose of Wang et al. was to detect small changes in operator norm, that is, to detect instants τ_k such that Σ_{τ_k} ≠ Σ_{τ_k−1} with ‖Σ_{τ_k} − Σ_{τ_k−1}‖_op possibly small. Apart from the operator norm, other norms have also been considered, e.g. in [10]. Here, we focus on the operator norm as in [40].
Recalling the generic procedure introduced in Section 2, we consider the dyadic grid G_D and some δ ∈ (0, 1). For any (l, r) ∈ G_D, we respectively write Σ̂_{l,−r} and Σ̂_{l,r} for the empirical covariance matrices computed from the observations (y_{l−r}, . . . , y_{l−1}) and (y_l, . . . , y_{l+r−1}). Then, we consider the test T_{l,r} rejecting for large values of ‖Σ̂_{l,r} − Σ̂_{l,−r}‖_op.
The corresponding threshold involves a numerical tuning constant c_0, which is set in the proof of the following proposition. Relying on concentration bounds [23] for the empirical covariance matrix of sub-Gaussian random vectors, we easily prove that the FWER of the multiple testing procedure (T_{l,r}) with (l, r) ∈ G_D is small. Then, we can analyze the type II error probability and plug it into the generic result (Theorem 1) to control the behavior of the change-point estimator τ̂. This leads us to the following result. In the sequel, a change-point τ_k is said to have high energy if

r_k ‖Σ_{τ_k} − Σ_{τ_k−1}‖²_op ≥ c_1 B⁴ [p + log(n/(r_k δ))] ,    (24)

where the numerical constant c_1 is introduced in the proof of the following proposition. We recall that, by definition of the model, we have ‖Σ_{τ_k} − Σ_{τ_k−1}‖_op ≤ 4B².
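As an illustration, the covariance test can be sketched as follows, with a threshold of order B²√((p + log(n/(rδ)))/r); the constant c0 is illustrative and does not match the proof's calibration:

```python
import numpy as np

def cov_change_test(Y, l, r, n, delta, B, c0=4.0):
    """Reject when the operator-norm gap between the empirical covariances of
    (y_{l-r},...,y_{l-1}) and (y_l,...,y_{l+r-1}) exceeds a threshold of order
    B^2 * sqrt((p + log(n/(r*delta)))/r). The constant c0 is illustrative."""
    p = Y.shape[0]
    left, right = Y[:, l - r:l], Y[:, l:l + r]
    S_left = left @ left.T / r                  # empirical covariance (centered model)
    S_right = right @ right.T / r
    gap = np.linalg.norm(S_right - S_left, ord=2)   # operator norm
    return gap > c0 * B ** 2 * np.sqrt((p + np.log(n / (r * delta))) / r)
```

The √((p + log(n/(rδ)))/r) scaling of the threshold mirrors the concentration of the empirical covariance of r sub-Gaussian vectors in operator norm.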
Proposition 5. There exist positive numerical constants c_0, c_1, and c_2 such that the following holds for any B > 0 and any sequence of independent centered random vectors (y_t) satisfying max_t ‖y_t‖_{ψ₂} ≤ B. With probability higher than 1 − δ, the change-point estimator τ̂ satisfies (NoSp) and detects all high-energy change-points in the sense of (24). Besides, any such high-energy change-point τ_k satisfies the localization bound (25) on the same event of probability 1 − δ.
Let us compare our condition (24) for detection with Theorem 2 in Wang et al. [40]. The authors assume that all the change-points satisfy a stronger separation condition. In addition to the fact that we allow some change-points to have an arbitrarily low energy, our requirement for detection scales like p + log(n/r_k) instead of p log(n).
The next proposition establishes that the latter condition is minimal. By homogeneity, we may restrict attention to the case B = 3/2. We focus on Gaussian distributions, so that the distribution of the sequence (y_1, . . . , y_n) is uniquely defined by the sequence (Σ_1, . . . , Σ_n) of covariance matrices. Given an integer 1 ≤ r ≤ n/4 and ζ ∈ (0, 1/√2), we define P(r, ζ) as the collection of sequences η = (Σ_1, . . . , Σ_n) of covariance matrices that satisfy either Σ_t = I_p or ‖Σ_t‖_op = 1 + ζ. Besides, the corresponding change-points (τ_1, . . . , τ_K) of η must satisfy min_k r_k ≥ r and min_k ‖Σ_{τ_k} − Σ_{τ_k−1}‖_op ≥ ζ. For η ∈ P(r, ζ), we write P_η for the corresponding distribution of (y_1, . . . , y_n). Proposition 6. There exists a positive numerical constant c such that, for any n, p and any length 1 ≤ r ≤ n/4, the following holds. Provided that rζ² ≤ c(p + log(n/r)) ∧ r², no estimator τ̂ can, uniformly over P(r, ζ), detect all change-points while avoiding spurious ones with high probability. As a consequence, our procedure τ̂ achieves the minimal separation condition (24) for change-point detection. In their work, [40] obtain faster localization errors than (25) at the price of stronger separation conditions. Our focus in this work is to provide optimal detection conditions, and we did not try to optimize the localization bound (25).

Univariate nonparametric change-point detection
We now turn to the univariate nonparametric change-point model considered in [33]. Let m ≥ 1 be any positive integer. At each time t = 1, . . . , n, the random vector y_t is an m-sample of a univariate distribution with cumulative distribution function F_t. Then, we aim at detecting a vector τ = (τ_1, . . . , τ_K) of change-points such that F_{τ_k} ≠ F_{τ_k−1}. As in [33], we quantify the distance between two distributions by the Kolmogorov distance ‖F − G‖_∞ = sup_{x∈R} |F(x) − G(x)|. As in the previous subsection, we build a procedure τ̂ with our generic algorithm on the dyadic grid. Regarding the collection of tests (T_{l,r}), we consider two-sample Kolmogorov-Smirnov tests. More precisely, we denote by F̂_t the empirical distribution function associated with the sample y_t, and we define the test T_{l,r} rejecting for large values of the corresponding two-sample Kolmogorov-Smirnov statistic. In the following, a change-point τ_k is said to have high energy if

m r_k ‖F_{τ_k} − F_{τ_k−1}‖²_∞ ≥ c_1 log(n/(r_k δ)) ,    (26)

where the numerical constant c_1 is introduced in the proof of the next proposition. As in Subsection 6.1, it is straightforward to prove, based on the Dvoretzky-Kiefer-Wolfowitz inequality, that the FWER of the multiple testing procedure (T_{l,r}) with (l, r) ∈ G_D is small. Then, we analyze the type II error probability of this test and plug it into the generic result (Theorem 1) to control the behavior of the change-point estimator τ̂.
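A sketch of the two-sample Kolmogorov-Smirnov test on a candidate segment, thresholded in the spirit of the Dvoretzky-Kiefer-Wolfowitz inequality (the constant c0 is illustrative):

```python
import numpy as np

def ks_change_test(samples, l, r, n, delta, c0=2.0):
    """Two-sample Kolmogorov-Smirnov test between the m*r observations on
    [l-r, l) and on [l, l+r). `samples` is a list of n arrays of length m.
    The threshold of order sqrt(log(n/(r*delta))/(m*r)) and the constant c0
    are an illustrative calibration."""
    left = np.sort(np.concatenate(samples[l - r:l]))
    right = np.sort(np.concatenate(samples[l:l + r]))
    grid = np.concatenate([left, right])
    # empirical CDFs of both halves, evaluated on the pooled sample
    F_left = np.searchsorted(left, grid, side="right") / len(left)
    F_right = np.searchsorted(right, grid, side="right") / len(right)
    gap = np.max(np.abs(F_left - F_right))
    m = len(samples[0])
    return gap > c0 * np.sqrt(np.log(n / (r * delta)) / (m * r))
```

Evaluating both empirical CDFs on the pooled sample is enough, since the supremum of their difference is attained at an observed data point.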

Proposition 7.
There exist positive numerical constants c_1 and c_2 such that the following holds. With probability higher than 1 − δ, the change-point estimator τ̂ satisfies (NoSp) and detects all high-energy change-points τ_k in the sense of (26). Besides, any such high-energy change-point τ_k satisfies the corresponding localization bound on the same event of probability 1 − δ.
In [33], the authors introduce a procedure detecting all the change-points under a stronger energy condition. Comparing that condition with (26), we observe that our logarithmic term is tighter and that we allow arbitrarily low-energy change-points. The next proposition establishes that Condition (26) is unimprovable. Given an integer 1 ≤ r ≤ n/4 and ζ ∈ (0, 1/4), we focus our attention on the collection P(r, ζ) of sequences (F_1, . . . , F_n) of distribution functions such that the corresponding change-points (τ_1, . . . , τ_K) satisfy min_k r_k ≥ r and min_k ‖F_{τ_k} − F_{τ_k−1}‖_∞ ≥ ζ. For η ∈ P(r, ζ), we write P_η for the corresponding distribution of the sequence (y_1, . . . , y_n).

Proposition 8.
There exists a positive numerical constant c such that, for any n, m and any length 1 ≤ r ≤ n/4, the following holds. Provided that rζ² ≤ c log(n/r)/m, no estimator τ̂ can, uniformly over P(r, ζ), detect all change-points while avoiding spurious ones with high probability.

Noise distribution for multivariate change-point detection
Comparison between Gaussian and sub-Gaussian rates. In this work, we have studied two types of noise distribution: Gaussian (Section 3) and general sub-Gaussian distributions (Section 4), without further knowledge of the distribution functions. Since the Gaussian setting is a specific instance of the sub-Gaussian setting, it is clear that the minimax lower bounds from Section 5 apply in both settings. As described in the previous subsection, the performances in the sub-Gaussian case almost match those in the Gaussian setting, except for s_k slightly lower than but close to [p log(en/r_k)]^{1/2}. Indeed, in that regime, Berk-Jones or Higher-Criticism-type statistics heavily rely on the probability distribution function of the noise, which is not available in the general sub-Gaussian case. Still, we could slightly improve the sub-Gaussian rates if we further assumed that the noise components are identically distributed with common CDF F.
• If F is known (known noise distribution), then one may adapt the Berk-Jones test by replacing Φ̄(x) in Equation (14) by F(−x) + (1 − F(x)). This would allow us to recover the exact same detection condition as in the Gaussian setting.
• If F is unknown and if there are not too many change-points, one could hope to estimate the quantiles of the CUSUM statistic at each scale r and plug them into a Berk-Jones statistic. This goes however beyond the scope of this paper.
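To make the known-F adaptation concrete, here is a minimal sketch. The exceedance-count form, the function names, and the binomial calibration are our assumptions for illustration, not the paper's exact Berk-Jones statistic: the Gaussian tail Φ̄(x) is simply replaced by q(x) = F(−x) + (1 − F(x)) when counting coordinates of the CUSUM vector that exceed x.

```python
import math
import numpy as np

def binom_sf(k, n, q):
    """P(Bin(n, q) >= k), computed exactly; fine for moderate n."""
    return sum(math.comb(n, j) * q**j * (1 - q)**(n - j) for j in range(k, n + 1))

def tail_count_test(cusum_coords, x, cdf, level=0.05):
    """Hypothetical sketch of the known-F adaptation: count coordinates of the
    CUSUM vector exceeding x in absolute value, then compare the count to its
    null distribution Binomial(p, q) with q = F(-x) + (1 - F(x))."""
    p = len(cusum_coords)
    q = cdf(-x) + (1.0 - cdf(x))
    n_exceed = int(np.sum(np.abs(np.asarray(cusum_coords)) > x))
    return binom_sf(n_exceed, p, q) < level

# Gaussian CDF: recovers the Phibar-based version as a special case
gauss_cdf = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
```

With a non-Gaussian known F (e.g. Laplace), only `cdf` changes; the calibration is unchanged.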
Unknown variance or more general covariance matrix. We assumed in the sparse multivariate sections that the variance σ² is known. Whereas the partial norm test only requires the knowledge of an upper bound on σ, the dense statistic Ψ^{(d)}_{l,r} requires the exact knowledge of the variance. As soon as there are not too many change-points, it is possible to roughly estimate σ and therefore accommodate the partial norm test with an unknown variance. In contrast, the dense statistic needs to be replaced by a U-statistic. Consider any even positive integer r and define two independent statistics C_{l,r}(Y) and C̄_{l,r}(Y). If there is one change-point at position l and no other change-point in (l−r, l+r), then these statistics are identically distributed, and we consider Ψ^{(d)}_{l,r} = ⟨C_{l,r}(Y), C̄_{l,r}(Y)⟩, whose expectation is null when there are no change-points in the segment. As a consequence, Ψ^{(d)}_{l,r} does not require the knowledge of σ; only an upper bound on σ is required to calibrate the corresponding test. Such a U-statistic has already been introduced in [45] and analyzed in an asymptotic setting. Unfortunately, since we can only consider even r, this precludes us from detecting change-points that are very close together, with r_k = 1.
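The elided display can be instantiated as follows; this is a sketch under the assumption (ours, not necessarily the paper's exact construction) that the two independent CUSUMs C_{l,r}(Y) and C̄_{l,r}(Y) are computed from the observations at even and odd time positions:

```python
import numpy as np

def cusum(Y, l, r):
    """CUSUM direction at location l and scale r: sqrt(r/2) * (right mean - left mean).
    Y has shape (n, p); the segment [l - r, l + r) is assumed to be in range."""
    left = Y[l - r:l].mean(axis=0)
    right = Y[l:l + r].mean(axis=0)
    return np.sqrt(r / 2.0) * (right - left)

def dense_u_statistic(Y, l, r):
    """Variance-free dense statistic: inner product of two independent CUSUMs,
    here built from the even/odd split of the time series (r must be even).
    Its expectation is zero when there is no change-point in (l - r, l + r)."""
    assert r % 2 == 0
    c1 = cusum(Y[0::2], l // 2, r // 2)   # observations at times 0, 2, 4, ...
    c2 = cusum(Y[1::2], l // 2, r // 2)   # observations at times 1, 3, 5, ...
    return float(c1 @ c2)
```

Note that no variance estimate enters the statistic itself; σ is only needed (as an upper bound) to set the rejection threshold.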
In the general case where there is spatial covariance in the noise, that is var(ε_t) = Σ for an unknown but general Σ, we can still use the same U-statistic described in the previous paragraph for the dense case. For the sparse case, one could use the supremum norm of the CUSUM statistics as in Jirak [21] and Yu and Chen [48]. To calibrate those tests, we need to estimate both the Frobenius norm and the operator norm of Σ, which seems to be doable as soon as there are not too many change-points. If the spatial covariance matrix var(ε_t) is unknown and even allowed to change with time, we suspect that the problem becomes intrinsically more involved.

Optimal Localization rates
In this work, we mainly considered the problem of detecting change-points in the mean of a random vector. We provided tight conditions on the energy so that a change-point is detectable. When such a change-point τ_k is detected, Corollary 2 states that its position is estimated up to an error of r*_k, which is also of the order of σ²Ψ^{(g)}_{n,r_k,s_k}∆_k^{−2} - see the definition (9). It is not clear whether this error is optimal or not.
In the univariate setting (p = 1), [38] has established that, above the detection threshold, a specific change-point position τ_k can be localized at the rate σ²∆_k^{−2}. In the multivariate setting, the situation is more delicate and there are certainly several localization regimes beyond the detection threshold. It is an interesting direction of research to pinpoint the exact localization rate between σ²∆_k^{−2} and σ²Ψ^{(g)}_{n,r_k,s_k}∆_k^{−2}. We leave this for future work.

On the choice of the grid in the generic algorithm
Our general procedure is defined for almost any arbitrary grid. Optimal procedures with the dyadic grid are introduced in Sections 3 and 6, whereas we use a near-optimal procedure on the complete grid in Section 4. From a computational perspective, the procedure's worst-case complexity is proportional to the size |G| of the grid G. In that respect, the dyadic grid and more generally the a-adic grids benefit from a linear size whereas the size of the complete grid is quadratic.
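The size comparison can be illustrated with a small sketch; the exact location sets and boundary conventions of the paper's grids are assumptions here:

```python
def dyadic_grid(n):
    """Dyadic grid sketch: scales r = 1, 2, 4, ... with locations on a mesh of
    step r, giving roughly n/r locations per scale and a total linear in n."""
    grid = []
    r = 1
    while 2 * r <= n:
        grid += [(l, r) for l in range(r + 1, n - r + 2, r)]
        r *= 2
    return grid

def complete_grid(n):
    """Complete grid: every admissible pair (l, r), quadratic in n."""
    return [(l, r) for r in range(1, n // 2 + 1)
                   for l in range(r + 1, n - r + 2)]
```

For n = 256 these conventions give 502 dyadic pairs versus 16384 complete pairs, matching the linear-versus-quadratic discussion above.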
From a mathematical perspective, it is much easier to control the behaviour of the procedure for an a-adic grid by a simple Bonferroni correction on all the statistics, as it turns out that this correction is sufficient for our purpose - see the proofs of Section 3. In contrast, controlling larger collections of tests turns out to be much more challenging, as one needs to carefully take into account the dependences between the test statistics, which becomes all the more difficult for complex models. As an example, we introduced in Section 3 Berk-Jones statistics to achieve the tight minimax condition for change-point detection. Unfortunately, we did not manage to apply a suitable chaining argument to these statistics and were therefore unable to control the behavior of the corresponding change-point detection procedure on the complete grid.
From a purely statistical perspective, it is difficult to appreciate the respective benefits of denser or sparser grids. On the one hand, for denser grids, the approximation τ̄_k of τ_k at scale r will be closer to τ_k, so that the corresponding test T_{τ̄_k,r} may be more powerful. On the other hand, for a denser grid, the tests possibly suffer from a higher price for multiplicity. This price can be mild if one takes into account the dependences between the tests. Still, except perhaps in the univariate Gaussian change-point model, for which delicate controls of the CUSUM process exist, it is challenging to provide theoretical guidance towards the best choice of the grid.
Optimality of the generic algorithm in a broader context
Algorithm 1 aggregates homogeneity tests and provides theoretical guarantees on the event A(T, K*, (τ̄_k, r̄_k)_{k∈K*}) - i.e. the event where the outcomes of the tests are consistent - as stated in Theorem 1. In the possibly sparse high-dimensional mean change-point model, we introduced a suitable multiple testing procedure which, when combined with Algorithm 1, leads to a minimax optimal change-point detection procedure.
We described in Section 2 how to adapt this approach to other change-point problems, and this was already illustrated in Section 6 with covariance and nonparametric problems. One may then wonder whether this roadmap still leads to minimax optimal procedures for general problems. Consider the general setting from Section 1 where we are interested in detecting change-points in (Γ(P_t))_{t∈[n]}. Upon endowing the space V with some distance d, we define, for any k, a quantity ∆̄_k which corresponds to the change-point height. Then, one may wonder how large ∆̄_k has to be - as a function of r_k - so that a change-point detection procedure achieving the no-spurious property (NoSp) with high probability is able to detect τ_k. In this discussion, we restrict our attention to independent observations, that is, the random variables y_t are assumed to be independent, and we consider the dyadic grid G_D. Fix δ ∈ (0, 1). At each scale r ∈ {1, 2, 4, …, 2^{⌊log₂(n)⌋−1}} and for each l ∈ D_r, with D_r defined as above, we consider the homogeneity testing problem H_{0,l,r} versus H_{ρ,l,r}. This amounts to testing whether there is a single change-point near l of height at least ρ in the segment (l − r, l + r). Given δ ∈ (0, 1) and a test T, we define the δ-separation distance ρ_{l,r}(T, δ) of T. This corresponds to the minimal change-point height that is detected by the test T. Then, the minimax separation distance ρ*_{l,r}(δ) is simply inf_T ρ_{l,r}(T, δ), i.e. the infimum over all tests T of the separation distance. By translation invariance of the testing problem, note that ρ*_{l,r}(δ) does not depend on l and is henceforth denoted ρ*_r(δ). For any (l, r), take any test T_{l,r} (nearly) achieving the minimax separation distance ρ*_r(δ|D_r|^{−1}β_r) with β_r = 6 log₂^{−2}(n/r) π^{−2}.
Then, it follows from a simple union bound on the dyadic grid that, with probability higher than 1 − δ, the collection of tests T_{l,r}, where (l, r) belongs to the dyadic grid, does not detect any false positive and detects any change-point τ_k such that ∆̄_k is higher than ρ*_{r̄_k}(δ|D_{r̄_k}|^{−1}β_{r̄_k}), where r̄_k is the largest scale in R such that 4(r̄_k − 1) ≤ r_k. As a consequence of Theorem 1, the corresponding detection procedure achieves, with probability higher than 1 − δ, the property (NoSp) and detects any change-point satisfying the energy condition ∆̄_k ≥ ρ*_{r̄_k}(r̄_k δ β_{r̄_k}/(2n)).
Conversely, we believe that this energy condition is almost tight. Indeed, fix any even range r ≥ 2. To simplify the discussion, suppose that n/(2r) is an integer. We consider a specific instance of the problem where the statistician knows that there are n/(2r) − 1 evenly-spaced change-points, respectively at 2r + 1, 4r + 1, …, n − 2r + 1, which allow to reduce the change-point detection problem to n/(2r) change-point detection problems in intervals (l − r, l + r] for l = r + 1, 3r + 1, 5r + 1, …. Furthermore, it is known that, in each such segment, there exists at most one change-point, that it is situated in [l − 0.5r, l + 0.5r], and that, if the change-point is present, its height is at least ρ = ρ*_r(δ) − ζ for ζ arbitrarily small. Since all n/(2r) − 1 evenly-spaced change-points 2r + 1, 4r + 1, …, n − 2r + 1 are known to the statistician, detecting all remaining change-points is equivalent to building a multiple test of the n/(2r) hypotheses H_{0,l,r} versus H_{ρ,l,r} for l = r + 1, 3r + 1, 5r + 1, …. If a change-point procedure achieves (NoSp) and detects all change-points with radius at least r/2 and height at least ρ with probability at least 1 − δ, then one is able, with probability uniformly higher than 1 − δ, to simultaneously perform without error n/(2r) independent tests of H_{0,l,r} versus H_{ρ,l,r}. Since any single test must endure an error with positive probability in the worst case, a collection of n/(2r) independent tests can succeed simultaneously with probability at least 1 − δ only if each single test has worst-case error probability at most 1 − (1 − δ)^{2r/n}. When n/r is large, the latter is of the order of 2rδ/n. Based on this, we conjecture that no change-point procedure is able to achieve, with probability higher than 1 − δ, the property (NoSp), and also to detect all change-points with radius at least r/2 and height at least ρ*_r(2rδ/n) − ζ for ζ > 0 arbitrarily small.
Comparing the performances of our procedure with the negative arguments that we just outlined, we see that aggregating optimal tests on a dyadic grid allows to detect change-points with (almost) uniform height higher than ρ*_{r̄_k}(r̄_k δ β_{r̄_k}/(2n)), whereas, as explained above, we conjecture that a change-point τ_k can be detected only if ∆̄_k ≥ ρ*_{r_k}(2r_k δ/n). Since r̄_k ≥ (r_k/8) ∨ 1 - as we considered the dyadic grid when constructing r̄_k - the difference between these two bounds is mostly due to the term β_{r̄_k}, which is of the order of log₂^{−2}(n/r̄_k). Whereas it is possible to detect change-points at a given scale with a test of type I error probability 2rδ/n, our multi-scale procedure relies on a collection of single tests with type I error probability of the order of rδ/(n log₂²(n/r)). This mild mismatch - which we introduce to deal with the multiplicity of scales - of order log₂²(n/r) is harmless for the Gaussian mean-detection problem. Indeed, one may deduce from our analysis in Section 3 that ρ*_{r̄_k}(2r_k δ/n) is of the same order as ρ*_{r̄_k}(δ|D_{r̄_k}|^{−1}β_{r̄_k}).
In conclusion, one can build through Algorithm 1 an almost optimal change-point procedure in any model provided that we are given optimal homogeneity tests of the form H 0,l,r versus H ρ,l,r . This provides a universal reduction of the problem of change-point detection to the problem of homogeneity testing.

Numerical Experiments
In this section, we illustrate the behavior of our procedure to detect change-points in a sparse high-dimensional setting (2).
To assess the quality of a change-point estimator τ̂, we first measure whether the estimated number of change-points K̂ = |τ̂| is equal to the true number K of change-points. We also define the SAND loss as the proportion of Spurious estimated change-points And true change-points that are Not Detected.
Change-point detection methods. In the experiments, we implemented the bottom-up aggregation procedure of Algorithm 1 with the partial norm tests T^{(p)} and the dense test T^{(d)} corresponding to Section 4, on a semi-complete grid G_F = {(l, r) : l ∈ {r + 1, …, n − r + 1}, r ∈ R} - we take the scales r in the dyadic set for computational purposes. At a location l and a scale r, each test relies on the partial norm statistic Ψ^{(p)}_{l,r,s}. We do not use the theoretical thresholds Thresh(r, s), since they rely on constants that are not necessarily tight; we rather calibrate them by a Monte-Carlo method using 10,000 independent samples. For each sample, consisting of a time series made of n standard centered Gaussian vectors in R^p, and for each r ∈ R, s ∈ Z_r ∪ {p}, we compute the maximum over all l of the statistics Ψ^{(p)}_{l,r,s}. Considering the list of all the 10,000 maxima and taking δ = 5%, Thresh(r, s) is then defined as the (1 − δ/(2|R||Z_r|))-quantile if s ∈ Z_r, and as the (1 − δ/(2|R|))-quantile if s = p, so that, by a union bound, the total probability of finding a false positive is less than δ. Note that this calibration step only depends on n, p, and σ, and only needs to be performed once and for all.
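The calibration step above can be sketched as follows; `stat_fn` is a placeholder for the partial norm statistic Ψ^{(p)}_{l,r,s} (not the paper's implementation), and a small number of Monte-Carlo samples is used here for brevity where the paper uses 10,000:

```python
import numpy as np

def calibrate_thresholds(stat_fn, n, p, scales, sparsities, delta=0.05,
                         n_mc=1000, rng=None):
    """Monte-Carlo calibration sketch. For each (r, s), the threshold is the
    (1 - delta / (2 |R| |Z_r|))-quantile of max_l stat_fn(noise, l, r, s)
    over pure-noise samples, so that a union bound keeps the overall
    false-positive probability below delta."""
    rng = np.random.default_rng(rng)
    thresholds = {}
    for r in scales:
        for s in sparsities:
            maxima = []
            for _ in range(n_mc):
                noise = rng.standard_normal((n, p))   # pure-noise sample
                maxima.append(max(stat_fn(noise, l, r, s)
                                  for l in range(r, n - r + 1)))
            level = 1 - delta / (2 * len(scales) * len(sparsities))
            thresholds[(r, s)] = float(np.quantile(maxima, level))
    return thresholds
```

As noted in the text, the output depends only on n, p (and σ, through `stat_fn`), so the dictionary can be computed once and reused across experiments.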
We compare our procedure with the inspect method of [46] which is available as an R package. The tuning parameters of inspect are computed with the automatic method defined in the same R package.
In all the following experiments, we fix the dimension p = 100 and the sample size n = 200. We generate a piecewise constant signal (η_t)_{t=1}^n in R^p with possible change-points (τ_1, …, τ_K) using one of the three following settings. We then add a scaling factor α > 0 and apply our procedure to the data y_t = αη_t + ε_t, which amounts to setting θ_t = αη_t in model (2). We fix the variance of all the coordinates of ε_t to be equal to one. Increasing α on a grid with step 0.1 allows us to experimentally identify a transition between the regime where we do not detect precisely the change-points - in which case the two losses tend to be close to one - and the regime where we do detect the change-points - in which case the losses are smaller. We consider three simulation settings:
1. Segment. We generate a signal η which is zero everywhere, except on [80, 100] where we set it equal to a random vector ∆ with ‖∆‖ = 1 and ‖∆‖_0 = s, for s = 1, 20, 100. In each one of these cases, we choose the locations of the s nonzero coordinates of ∆ uniformly at random and their values uniformly at random in the set {−1/√s, 1/√s}. Each time, η has 2 true change-points, and we generate the noise (ε_t) as independent centered and standard Gaussian vectors.
2. Multiple change-points. We generate 10 uniform random locations τ_1 < τ_2 < … < τ_10 on [1, 200]. For each location τ_i, we generate a uniform random integer s_i ∈ [1, 100] and a vector ∆_i as in the segment setting, with ‖∆_i‖ = 1 and ‖∆_i‖_0 = s_i. We generate a uniform random real number N_i ∈ [1, 5] and define the time series η_i by (η_i)_t = N_i ∆_i 1{t ≥ τ_i}. Finally, the signal η = Σ_{i=1}^{10} η_i has exactly 10 change-points with random locations. As previously, the noise components (ε_t) are independent centered and standard Gaussian vectors.
3. Time-dependencies. We use the same signal as in the segment setting with s = 20, but we move away from our assumptions by considering time dependencies. More precisely, the (ε_t)'s are now defined according to an AR process such that ε_{t+1} = ρε_t + √(1 − ρ²) ε'_{t+1} for t ≥ 0, where the (ε'_t) are independent centered and standard Gaussian vectors, ρ = 0.05 for the simulation and, by convention, ε_0 ∼ N(0, I_p).
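The data-generating mechanisms above can be sketched as follows; the half-open interval convention [80, 100) and the function names are our assumptions:

```python
import numpy as np

def sparse_direction(p, s, rng):
    """Unit-norm s-sparse vector: s coordinates chosen uniformly at random,
    each set to +/- 1/sqrt(s) uniformly at random."""
    delta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    delta[support] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return delta

def segment_signal(n=200, p=100, s=20, lo=80, hi=100, rng=None):
    """Segment setting: zero everywhere except on [lo, hi), where the signal
    equals a random s-sparse unit vector (two change-points, at lo and hi)."""
    rng = np.random.default_rng(rng)
    eta = np.zeros((n, p))
    eta[lo:hi] = sparse_direction(p, s, rng)
    return eta

def ar_noise(n, p, rho=0.05, rng=None):
    """Time-dependent setting: eps_{t+1} = rho * eps_t + sqrt(1 - rho^2) * xi_{t+1}
    with (xi_t) i.i.d. N(0, I_p), so each eps_t stays marginally standard Gaussian."""
    rng = np.random.default_rng(rng)
    eps = np.empty((n, p))
    eps[0] = rng.standard_normal(p)
    for t in range(1, n):
        eps[t] = rho * eps[t - 1] + np.sqrt(1.0 - rho**2) * rng.standard_normal(p)
    return eps
```

The √(1 − ρ²) factor keeps each marginal variance equal to one, so the AR setting changes only the temporal dependence, not the per-coordinate noise level.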
Risk estimation with Monte-Carlo. In each setting, we generate 500 independent samples and compute the two losses SAND((τ̂_k), (τ_k)) and 1{K̂ ≠ K}. We estimate the risks E[SAND((τ̂_k), (τ_k))] and P(K̂ ≠ K) by averaging the losses over the 500 trials. We also compute 95% confidence intervals.
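The averaging step can be sketched with a normal-approximation interval; the paper's exact confidence-interval construction is not specified here, so this is an assumption:

```python
import numpy as np

def mc_risk(losses, z=1.96):
    """Average a loss over independent trials with a normal-approximation
    95% confidence interval (z = 1.96)."""
    losses = np.asarray(losses, dtype=float)
    m = float(losses.mean())
    half = z * float(losses.std(ddof=1)) / np.sqrt(len(losses))
    return m, (m - half, m + half)
```

With 500 trials of a bounded loss, the normal approximation is accurate enough for the plots described below.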
Results. In the segment setting - see Figures 4, 5 and 6 - the risks tend to decrease as α increases since, the higher α is, the higher the energy of the generated change-points is. As s increases, we can see that both methods need a higher scaling factor to achieve the same risk, which reflects the fact that the higher s is, the more energy is needed to detect a change-point with a vector ∆ of sparsity s. In the segment settings, our bottom-up procedure tends to achieve a significantly smaller loss than the inspect method on average. This is not the case in the multiple change-points setting - see Figure 7 - where the inspect method tends to perform slightly better. In the setting with time-dependencies - see Figure 8 - the risks are worse than in the corresponding setting without time-dependencies - see Figure 5 - mainly because adding time-dependencies tends to create more spurious change-points (i.e. false positives).
Computation time. Our code is implemented in Python 3.9 and mainly uses the convolution function conv1d from PyTorch 1.12.1 to compute the CUSUM statistics. Simulations are run on a CPU (Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz) with 32 GB of memory. Running our method on pure noise - i.e. θ_t = 0 for all t - takes 101 ± 2 ms, while the inspect method takes only 18 ± 2 ms to run on average; optimizing our code is however beyond the scope of this paper. All the experiments are described in the repository https://github.com/epilliat/multicpdetec.
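The CUSUM computation mentioned above can be done without PyTorch via prefix sums; this is an equivalent numpy sketch (the paper's code uses conv1d for the same purpose), computing all statistics at a scale r in O(np):

```python
import numpy as np

def all_cusums(Y, r):
    """All CUSUM statistics at scale r via cumulative sums.

    Returns an array of shape (n - 2r + 1, p): row j is
    sqrt(r/2) * (mean(Y[l:l+r]) - mean(Y[l-r:l])) for l = r + j."""
    n, p = Y.shape
    S = np.vstack([np.zeros(p), np.cumsum(Y, axis=0)])   # prefix sums: S[t] = sum of Y[:t]
    left = (S[r:n - r + 1] - S[:n - 2 * r + 1]) / r      # means over [l - r, l)
    right = (S[2 * r:] - S[r:n - r + 1]) / r             # means over [l, l + r)
    return np.sqrt(r / 2.0) * (right - left)
```

Iterating over the dyadic scales r gives the full set of statistics in O(np log n), consistent with the linear grid size discussed earlier.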

A An alternative Algorithm
In Algorithm 2 below, we also introduce a variant of the procedure where, instead of merging intersecting intervals at the same scale, we only keep one of them. More precisely, we choose the convention of discarding the interval [l − r + 1, l + r − 1] if there exists l' < l such that T_{l',r} = 1 and [l' − r + 1, l' + r − 1] ∩ [l − r + 1, l + r − 1] ≠ ∅. Alternatively, we could have chosen to discard one of the intervals at random. Let Θ ∈ R^{n×p}, T be a local test statistic, K* be a set of indices of significant change-points and (τ̄_k, r̄_k)_{k∈K*} be elements of the grid G that satisfy (6). We assume that A(Θ, T, K*, (τ̄_k, r̄_k)_{k∈K*}) holds, that is: 1. (No false positive) T_{l,r} = 0 for all (l, r) ∈ H_0 ∩ G, where H_0 is defined by (7); 2. (Significant change-point detection) for every k ∈ K*, we have T_{τ̄_k,r̄_k} = 1.
For every r ∈ R, define T*_r as the subset of T_r for which each interval of detection [l − r + 1, l + r − 1] contains a significant change-point. The next proposition recursively analyzes the detection sets corresponding to significant change-points (S*_r)_{r≥1}. The first inclusion means that significant change-points which can be detected with a local statistic of radius smaller than r are detected before step r, while the second inclusion means that each connected component of ∪_{r∈R} S*_r is included in a close neighborhood of some significant change-point τ_k, k ∈ K*.
Proposition 9. For all r ∈ R ∪ {0}, we have the following double inclusion.
The next proposition shows that, for each step r ∈ R, the subset of detections corresponding to non-significant change-points is disjoint from ∪_{r'∈R} S*_{r'}.
Proposition 10. For all r ∈ R, we have the following.
Recall that (C_k)_{k=1,…,K̂} are defined as the connected components of ∪_{r∈R} S_r. To ease the notation, re-index (C_k) so that τ_k is the closest true change-point to τ̂_k = (min C_k + max C_k)/2. Since there is no false positive, τ_k ∈ C_k. By Proposition 10, the two closed subsets ∪_{r∈R} ∪_{l∈T_r\T*_r} [l − r + 1, l + r − 1] and ∪_{r∈R} S*_r are disjoint. For all k ∈ K*, it holds by Proposition 9 that τ_k ∈ ∪_{r∈R} S*_r, so that C_k is a connected component of ∪_{r∈R} S*_r containing the significant change-point τ_k. In particular, K̂ ≥ |K*|. We have: • By Proposition 9, C_k ⊂ [τ_k − 2(r̄_k − 1), τ_k + 2(r̄_k − 1)] for every k ∈ K*. • For all k ∈ [K] \ K*, either τ_k does not belong to ∪_{r∈R} S_r and it is simply not detected, or it is the closest true change-point to τ̂_k = (min C_k + max C_k)/2. • Finally, if there exist two estimated change-points τ̂_{k_1}, τ̂_{k_2} in [τ_k − (τ_k − τ_{k−1})/2, τ_k + (τ_{k+1} − τ_k)/2], then either C_{k_1} or C_{k_2} does not contain τ_k. Then Θ is constant on C_{k_1} or on C_{k_2} and we obtain a contradiction since there is no false positive.
This concludes the proof of Theorem 1.
Proof of Proposition 9. To prove the proposition, we proceed by induction on r ∈ R ∪ {0}. The case r = 0 is trivial since, by definition, S_0 = ∅. Let r ∈ R and assume that the double inclusion of Proposition 9 holds for all r' < r, r' ∈ R ∪ {0}.
First inclusion: Let k ∈ K* be such that r̄_k = r and assume that the corresponding significant change-point τ_k has not been detected before step r, that is, τ_k ∉ ∪_{r'<r} S*_{r'}. Since k ∈ K*, this implies in particular that τ_k ∉ ∪_{r'<r} S_{r'}. Let us show that τ_k ∈ S_r. To this end, we prove (29) and that T_{τ̄_k,r} = 1 (30), which will be enough since |τ̄_k − τ_k| ≤ r̄_k − 1 = r − 1.
• Proof of (29): Assume for the sake of contradiction that there exists an integer z which belongs to [τ̄_k − r + 1, τ̄_k + r − 1] ∩ ∪_{r'<r, r'∈R} S_{r'}. There exist r' < r such that z ∈ S_{r'} and l(z) ∈ T_{r'} such that z ∈ [l(z) − r' + 1, l(z) + r' − 1]. Since τ_k ∉ ∪_{r'<r} S_{r'}, we have τ_k ∉ [l(z) − r' + 1, l(z) + r' − 1]. Moreover, by the hypothesis 3(r̄_k − 1) + |τ̄_k − τ_k| ≤ r_k, θ is constant on [l(z) − r', l(z) + r') ∩ N. Thus, (l(z), r') ∈ H_0 and l(z) ∉ T_{r'} since there is no false positive. This gives a contradiction and concludes the proof of (29).
• Proof of (30): This is simply a consequence of the fact that significant change-points are detected on the grid (see Item 2 in the definition of A).
We have just shown that τ k ∈ S r and hence τ k ∈ S * r so that the first inclusion holds at step r.
Second inclusion: Let x be an element of S*_r. There exists l(x) ∈ T*_r such that x ∈ [l(x) − r + 1, l(x) + r − 1]. By definition of T*_r, there exists a significant change-point τ_k (i.e. such that k ∈ K*) belonging to [l(x) − r + 1, l(x) + r − 1].
We necessarily have r̄_k ≥ r. Indeed, if r̄_k < r, then by the induction hypothesis, τ_k ∈ S*_{r'} for some r' < r, which contradicts the fact that S*_{r'} is disjoint from [l(x) − r + 1, l(x) + r − 1] ⊂ S*_r. Consequently, we have just shown the second inclusion. Therefore, the proposition is verified at step r and the induction is complete.
Proof of Proposition 10. Let k ∈ K* and let C_k be the detected connected component containing the significant change-point τ_k. We know from Proposition 9 that C_k is a connected component of ∪_{r'∈R} S*_{r'}, and we want to prove that C_k does not overlap with ∪_{l∈T_r\T*_r} [l − r + 1, l + r − 1] for any r ∈ R. Let r_0 be such that C_k is a connected component of S*_{r_0}. Such an r_0 exists and is unique since the sets S*_r are disjoint. We have from Proposition 9 that τ_k ∈ ∪_{r'∈R, r'≤r̄_k} S*_{r'}, so that r_0 ≤ r̄_k.
Let r ∈ R and l ∈ T_r \ T*_r, and assume without loss of generality that l + r − 1 < τ_k. Since there is no false positive, (l, r) ∉ H_0 and there exists at least one true change-point in the interval of detection [l − r + 1, l + r − 1]. Denote by τ_a, …, τ_b, with a ≤ b, the true change-points belonging to [l − r + 1, l + r − 1]. By definition of T_r \ T*_r, τ_a, …, τ_b are not significant change-points, i.e. a, a + 1, …, b ∉ K*. We consider the two cases r > r̄_k and r ≤ r̄_k.
• r > r̄_k: In that case, since the sets (S_r) are disjoint and C_k ⊂ S*_{r_0}, we have C_k ∩ [l − r + 1, l + r − 1] = ∅.
• r ≤ r̄_k: In that case, we have l + r − 1 < τ_k − 2(r̄_k − 1), where we used the fact that 4(r̄_k − 1) < r_k ≤ τ_k − τ_b. Since, by Proposition 9, we have C_k ⊂ [τ_k − 2(r̄_k − 1), τ_k + 2(r̄_k − 1)], we also have in that case C_k ∩ [l − r + 1, l + r − 1] = ∅.
This concludes the proof of the proposition.

B.2 Proofs for Gaussian multivariate change-point detection
From now on, we use the following notation for all (l, r) ∈ J_n. • The population term of the CUSUM statistic C_{l,r} is written U_{l,r} = √(r/2) (θ̄_{l,+r} − θ̄_{l,−r}).
• With this notation, we write v_{l,+r,i}, v_{l,−r,i}, U_{l,r,i} for the i-th coordinates of the vectors v_{l,+r}, v_{l,−r}, U_{l,r}.
The first term, written as r⟨ε̄_{l,+r} − ε̄_{l,−r}, θ̄_{l,+r} − θ̄_{l,−r}⟩, is a cross term between the noise and the mean vector θ. Lemma 1 states that, near the change-points and on the grid defined by the sets R, D_r, it is jointly controlled with high probability.
The second term, written as (r/2)‖ε̄_{l,+r} − ε̄_{l,−r}‖², is a pure noise term. Lemma 2 states that it is controlled jointly with high probability on the grid defined by the sets R, D_r.
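The two-term decomposition recalled above can be checked numerically; this is a sketch under the convention (an assumption consistent with the notation here) that C_{l,r}(Y) = √(r/2)(Ȳ_{l,+r} − Ȳ_{l,−r}):

```python
import numpy as np

def decompose_cusum_sq(theta, eps, l, r):
    """Numerical check of the decomposition of ||C_{l,r}(Y)||^2 into the
    population term ||U_{l,r}||^2, the cross term r<eps_diff, theta_diff>,
    and the pure-noise term (r/2)||eps_diff||^2, with Y = theta + eps."""
    def diff(Z):
        return Z[l:l + r].mean(axis=0) - Z[l - r:l].mean(axis=0)
    dtheta, deps = diff(theta), diff(eps)
    U_sq = (r / 2.0) * float(dtheta @ dtheta)   # ||U_{l,r}||^2
    cross = r * float(deps @ dtheta)            # noise x signal cross term
    noise = (r / 2.0) * float(deps @ deps)      # pure noise term
    C = np.sqrt(r / 2.0) * diff(theta + eps)
    assert np.isclose(float(C @ C), U_sq + cross + noise)
    return U_sq, cross, noise
```

Expanding the square of C = √(r/2)(dθ + dε) gives exactly these three terms, which is what the internal assertion verifies.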

B.2.2 Proof of Proposition 2
Step 1: Analysis of the Berk-Jones statistics. We first define a threshold x^{(BJ)}_{r,s} for the Berk-Jones statistics for all r, s ≥ 1, where we recall that the δ_{x,r} are the weights defined by (13). Remark that (x^{(BJ)}_{r,s}) is nonincreasing with s, and define s̄_r for all r ≥ 1. The second point of the following proposition ensures that, if there exists s ∈ Z with s ≥ s̄_r such that |U_{l,r,(s)}| ≥ t_s for (l, r) = (τ̄_k, r̄_k), then T^{(BJ)}_{τ̄_k,r̄_k} = 1 with high probability. We recall that |U_{l,r,(1)}| ≥ · · · ≥ |U_{l,r,(p)}| are the sorted absolute values of the coordinates of U_{l,r} and that H_0 is defined by (7). Proposition 11. There exists an event ξ^{(BJ)} of probability larger than 1 − 2δ such that the following holds. Step 2: Analysis of the partial norm statistics. Since it may happen that τ_k is a sparse high-energy change-point while there is no s ≥ s̄_{r̄_k} as above, we turn to the partial norm statistics. Step 3: Combination of the two statistics. Let us return to the proof of Proposition 2. To conclude the proof, it suffices to show that, if τ_k is a κ_s-sparse high-energy change-point - see (11) - for some large enough constant κ_s, then the conclusion of one of the two preceding propositions holds. This is precisely what the following lemma shows.
Lemma 3. There exists a constant κ_s such that, if τ_k is a κ_s-sparse high-energy change-point, then one of the following propositions is true: • there exists s ∈ Z such that s > s̄_{r̄_k} and the Berk-Jones condition holds; • there exists s ∈ Z such that s ≤ s̄_{r̄_k} and the partial norm condition holds. Proof of Proposition 11. The first part of the proposition is a simple consequence of the definition together with a union bound.
We focus on the second part of the proposition. To ease the reading, we introduce some notation for x ≥ 0. In fact, γ_{x,r} is the threshold of the statistic N_{x,l,r}. As for η_{x,r,s}, it stands for the contribution to N_{x,l,r} of the (p − s) coordinates i such that θ_{·,i} is constant over [l − r, l + r). Finally, ψ_{x,r,s}(u) stands for the contribution to N_{x,l,r} of the s coordinates i whose population CUSUM statistic U_{l,r,i} is equal to u.
Lemma 4. Consider any r ∈ R and l ∈ D_r. If, for some positive integers s and x, we have ψ_{x,r,s}(|U_{l,r,(s)}|) > γ_{x,r} − η_{x,r,s}, then T^{(BJ)}_{l,r} = 1. Denote by H[θ] the collection of (l, r) with r ∈ R and l ∈ D_r that satisfy Condition (35) for some s and some x. We easily deduce from the above lemma together with a union bound that, with probability higher than 1 − δ, T^{(BJ)}_{l,r} = 1 for all (l, r) ∈ H[θ].
Combining Lemma 5 and Lemma 4, we conclude the proof of the proposition.
Proof of Proposition 12. The following lemma ensures that the partial norm test returns 0 with high probability jointly at all positions where there is no change-point. We write C̄^s_p for the set of all combinations of s indices taken from [p].
Lemma 6 (concentration of the pure noise for the second sparse statistic). If 0 < δ ≤ 1, then the above event holds with probability higher than 1 − δ.
We now state the following lemma, which ensures that the partial norm test returns 1 with high probability jointly at relevant positions which are close to a change-point.
Lemma 7 (concentration on the change-points for the second sparse statistic). We writeK * for the set of k ∈ [K] such that Proof of Lemma 6. Let r ∈ R, l ∈ D r , s ≤s r and S ∈C s p . Let δ > 0, δ r,s = r n 2 s 2ep s δ. Since r 2σ 2 (ε l,+r,i −ε l,−r,i ) follows a N (0, 1) distribution for all l, r, i, we have by Bernstein's inequality that with probability larger than 1 − δ r,s , i∈S (ε l,+r,i −ε l,−r,i ) 2 ≤ s + 2 s log 1 δ r,s + log 1 δ r,s Proof of Lemma 7. Let k ∈K * , and s ∈ Z such that To ease the reading, we write (τ, r) = (τ where in the second inequality, we used the fact that (a + b) 2 ≥ 1 2 a 2 − b 2 for all a, b ∈ R.
Proof of Lemma 3. First, remark that there exists a large enough constant C such that the following holds for all r, s ≥ 1. We have, for some universal constant C_1, the first bound. To handle the second term, remark that, since x ↦ log(p/x²) is decreasing, the second bound follows. In the first inequality, we used the definition of τ_{(s)}. In the second inequality, we used the fact that, for a large enough constant κ_s (see (11)), x² is increasing for x ≤ p, so that s_k can be replaced by s̄. • s̄ ≤ C log log(ep). This concludes the proof of the lemma.
For all k such that τ_k is a c_0-high-energy change-point, define (r̄_k, τ̄_k); the pair (r̄_k, τ̄_k) is well defined. Indeed, this can be checked separately in the two cases s_k ≤ √(p log(n/(r_k δ))) and s_k ≥ √(p log(n/(r_k δ))), using log(1 + x) ≥ x/2 for x ∈ [0, 1] in the latter case. According to Theorem 1, it is sufficient to prove that the event A(Θ, T, K*, (τ̄_k, r̄_k)_{k∈K*}) defined in Section 2.3 holds on ξ. 2. (High-energy change-point detection): for every k such that τ_k has c_0-high energy, the detection holds by definition of r̄_k. It remains to show that 2(r̄_k − 1) ≤ r*_k, where r*_k is defined by (9). Using log(1 + x) ≥ x/2 for x ∈ [0, 1] and log(1 + x) ≥ log(x) for x ≥ 1, we obtain the claimed bound when r̄_k ≥ 2. Thus 2(r̄_k − 1) ≤ r*_k for c_0 ≥ 2(κ_d ∨ κ_s). This concludes the proof of Corollary 2.

B.3 Proofs for sub-Gaussian multivariate change-point detection
We recall that in this section, we work on the complete grid G F = J n defined in Section 2.
The first term, written as r⟨ε̄_{l,+r} − ε̄_{l,−r}, θ̄_{l,+r} − θ̄_{l,−r}⟩, is a cross term between the noise and the mean vector θ. Lemma 8 states that, for l equal to a true change-point τ_k and r of order r*_k, it is controlled on an event ξ^{(d)}_1 that holds with probability higher than 1 − δ.
The second term, written as (r/2)‖ε̄_{l,+r} − ε̄_{l,−r}‖², is a pure noise term. Lemma 9 states that it is controlled on the event ξ^{(d)}_2 with high probability.
Proof of Lemma 8. Let k ∈ [K] be such that Equation (17) is satisfied. Remark that θ is constant on [τ_k − r̄^{(d)}_k, τ_k), where it is equal to µ_{k−1}, and is also constant on [τ_k, τ_k + r̄^{(d)}_k), where it is equal to µ_k. First, from the definition of the ψ₂-norm of a vector, there exists a universal constant C > 0 such that the corresponding bound holds for all k = 1, …, K. Thus, by definition of sub-Gaussianity, for all t > 0 and for some constant c > 0, the deviation bound holds. Finally, we apply the concentration inequality to t = r̄^{(d)}_k ∆²_k/4 - remembering that τ_k is a κ-dense high-energy change-point in the sense of Equation (17) - and sum over k to obtain a union bound over ξ^c_2, where the last inequality comes from the fact that Σ_{k=1}^K r̄^{(d)}_k ≤ n and the fact that κ is chosen large enough so that cκ ≥ 1.
Proof of Lemma 9. Remark first that by homogeneity, we can assume without loss of generality that L = 1. To provide a proof, we will use the Hanson-Wright inequality in high dimension, which is a way to control quadratic forms of the noise.
Lemma 10 (Hanson-Wright inequality in high dimension). Let A = (a_{ij}) be an m × m matrix and ε_1, …, ε_m be sub-Gaussian vectors of dimension p with ψ₂-norm smaller than 1. Then the corresponding deviation bound holds, where c is an absolute constant, ‖A‖²_F = Σ_{i,j} a²_{i,j} is the squared Frobenius norm of A and ‖A‖_op is the operator norm of A.
The proof of this lemma relies on the classical Hanson-Wright inequality that is proved, for example, in [35]. To prove the proposition, we will use a chaining argument. To this end, we let (N_u)_{u≥0} be the following covering sets of J_n, where we define κ_1 = ⌊log₂(n)⌋ and, more generally, κ_r = ⌊log₂(n/r)⌋ for r = 1, …, n. Remark that the higher u is, the finer the covering set N_u is, and N_{κ_1} = J_n. For all u ≥ 0, we define the projection map π_u from J_n to N_u by π_u(l, r) = argmin_{(l̃,r̃)∈N_u} (|l − l̃| + |r − r̃|).
In the sequel, we will use the slight abuse of notation for (l, r) in J n : (l u , r u ) = π u (l, r) .
A useful lemma to control the distance between (l, r) and its projection (l_u, r_u) can be stated as follows.
Lemma 11. For all (l, r) ∈ J_n and 0 ≤ u ≤ κ_1 such that N_u ≠ ∅, we have |l − l_u| ≤ n/2^u and |r − r_u| ≤ n/2^u. Let (l, r) ∈ J_n. From now on, we write ε_{l,+r} = r ε̄_{l,+r} = Σ_{t=l}^{l+r−1} ε_t and ε_{l,−r} = r ε̄_{l,−r}. The chaining relation decomposes (r/2)‖ε̄_{l,+r} − ε̄_{l,−r}‖² − σ²p into (1/(2r_{κ_r}))(‖ε_{l_{κ_r},+r_{κ_r}} − ε_{l_{κ_r},−r_{κ_r}}‖² − 2r_{κ_r}σ²p) plus a telescoping error term. Remark that the chaining summation starts at scale u = κ_r, so that n/2^{κ_r} is of order r. The first term of the chaining is an approximation on the grid at level κ_r of the term (r/2)‖ε̄_{l,+r} − ε̄_{l,−r}‖² − σ²p. The second term can be viewed as an error term, and we will show that it is of the same order as the first term.
Since both terms are quadratic forms of the noise, we need an upper bound on the norms of their corresponding matrices in order to apply the Hanson-Wright inequality (see Lemma 10).
Lemma 12 (Control of the Frobenius norm). Let $(l, r)$ and $(l', r')$ be fixed elements of $J_n$. Let $A$ and $B$ be the matrices of the two following quadratic forms: $\varepsilon^T A \varepsilon = \|\varepsilon_{l,+r} - \varepsilon_{l,-r}\|^2$ and $\varepsilon^T B \varepsilon = \|\varepsilon_{l,+r} - \varepsilon_{l,-r}\|^2 - \|\varepsilon_{l',+r'} - \varepsilon_{l',-r'}\|^2$. Then
$$\|A\|_F^2 \le 16 r^2 \qquad \text{and} \qquad \|B\|_F^2 \le C \, (r \vee r') \big( |l - l'| + |r - r'| \big)$$
for some universal constant $C$. The following lemma aims at upper bounding the first term of the chaining relation with high probability.
Proof of Lemma 11. Since the mesh of the grid $N_u$ is equal to $2^{\kappa_1 - u} \le n/2^u$, there exists $(\hat l, \hat r) \in N_u$ such that $|l - \hat l| \le n/2^u$ and $|r - \hat r| \le n/2^u$. The claim follows since $(l_u, r_u)$ minimizes $|l - l'| + |r - r'|$ over $(l', r') \in N_u$.
Proof of Lemma 12. Let us write $m_1 = \min(l - r, l' - r')$ and $m_2 = \max(l + r, l' + r')$. Remark that for all $i, j$ in $[l - r, l + r)$, $|a_{ij}| \le 2$. This gives the first inequality. For the second inequality, assume without loss of generality that $l \le l'$. As for the first inequality, $|b_{ij}| \le 2$ for all $i, j \in [m_1, m_2)$. Remark that $b_{ij}$ can be nonzero only if $(i, j)$ is in one of the following cases:
1. $i$ or $j$ is in $[\min(l + r, l' + r'), \max(l + r, l' + r'))$;
2. $i$ or $j$ is in $[\min(l - r, l' - r'), \max(l - r, l' - r'))$;
3. $i$ or $j$ is in $[l, l')$.
Counting the entries in these three bands, each of width at most $|l - l'| + |r - r'|$, gives the second inequality.
First, fix $u \ge 0$ and $(l, r) \in N_u$ such that $r \le 3n/2^u$. Applying the first inequality of Lemma 12 and the Hanson-Wright inequality (see Lemma 10), we obtain for all $t \ge 0$
$$\mathbb{P}\Big( \big| \|\varepsilon_{l,+r} - \varepsilon_{l,-r}\|^2 - 2r\sigma^2 p \big| \ge t \Big) \le 2 \exp\Big( -c \min\Big( \frac{t^2}{p r^2}, \frac{t}{r} \Big) \Big),$$
where $c$ is an absolute constant. Choosing $t = C_N \, r \big( \sqrt{p \log(2^u \delta^{-1})} + \log(2^u \delta^{-1}) \big)$, we obtain
$$\mathbb{P}\Big( \big| \|\varepsilon_{l,+r} - \varepsilon_{l,-r}\|^2 - 2r\sigma^2 p \big| \ge C_N \, r \big( \sqrt{p \log(2^u \delta^{-1})} + \log(2^u \delta^{-1}) \big) \Big) \le 2 \big( \delta 2^{-u} \big)^{C},$$
where $c, C$ are absolute constants. Since the cardinality of $N_u$ is upper bounded by $2^{2u+2}$, a union bound on each $N_u$ for each $u \ge 0$ gives a geometric series, which is convergent. For $C_N$ large enough, we obtain $\mathbb{P}(\xi_N^c) \le \delta$.
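For completeness, the convergence claimed above follows from a geometric series: with the cardinality bound $\operatorname{card}(N_u) \le 2^{2u+2}$ and a per-point tail decaying like $2^{-Cu}$ (the exponent $C$ stands for whatever the choice of $t$ yields), a sketch of the summation reads:

```latex
\sum_{u \ge 0} \operatorname{card}(N_u)\, 2^{-Cu}
\;\le\; \sum_{u \ge 0} 2^{2u+2}\, 2^{-Cu}
\;=\; 4 \sum_{u \ge 0} 2^{-(C-2)u}
\;=\; \frac{4}{1 - 2^{-(C-2)}} \;<\; \infty
\qquad \text{as soon as } C > 2 .
```

Taking $C_N$, hence $C$, large enough then makes the total probability smaller than $\delta$.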
Proof of Lemma 14.
Fix $v \ge 0$ and a pair $\big( (l, r), (l', r') \big) \in N_v \times N_{v+1}$. Then by Lemma 12, letting $B$ be the matrix such that $\varepsilon^T B \varepsilon = \|\varepsilon_{l',+r'} - \varepsilon_{l',-r'}\|^2 - \|\varepsilon_{l,+r} - \varepsilon_{l,-r}\|^2$, we obtain a control of the Frobenius and operator norms of $B$. Thus, by the Hanson-Wright inequality (see Lemma 10), the corresponding increment concentrates, and we choose the deviation level accordingly. There are at most $2^{4v+6}$ elements in $N_v \times N_{v+1}$. Therefore, a union bound on $v \ge 0$ and on $N_v \times N_{v+1}$ gives the result, where the last inequality holds if $C_\Delta$ is large enough, for $c, C$ universal constants.
This concludes the proof since 4(r k − 1) ≤ r k for k ∈ K * .

B.4 Proof of Theorem 2
Let us fix $(r, s) \in [1, n/4] \times [1, p]$. Let $\Delta$ be such that
$$r\Delta^2 = \frac{1}{2} \sigma^2 \Big[ s \log\Big( 1 + \frac{u \sqrt{p \log(n/r)}}{s} \Big) + u \log\frac{n}{r} \Big],$$
for some $u \le \frac{1}{8}$. In what follows, we consider any change-point detection method that outputs an estimator $\hat\tau$ of the change-points, associated with a number $\hat K$ of detected change-points, i.e. the length of $\hat\tau$. We also write $\mathbb{P}_\Theta$ for the distribution of the data when the mean parameter of the time series is fixed to an $n \times p$ matrix $\Theta$, i.e. the distribution of $\Theta + \varepsilon$ where the noise entries $(\varepsilon_t)_j$ are i.i.d. and follow $\mathcal{N}(0, \sigma^2)$ as in Section 3. Slightly abusing notation, we write $\mathbb{P}_0$ for the distribution of the data when the parameter is constant and equal to 0. Consider also any prior $\pi$ over the set of $n \times p$ matrices $\Theta$ such that the number of true change-points over the support of the prior is larger than 1; that is, the prior puts mass only on problems where more than one change-point occurs. Let $\bar{\mathbb{P}}_\pi$ be the corresponding distribution of the data, namely the distribution of the data matrix when the mean parameter of the time series is the random matrix $\tilde\Theta \sim \pi$. In other words, $\bar{\mathbb{P}}_\pi$ is the distribution of $\tilde\Theta + \varepsilon$ where $\tilde\Theta \sim \pi$.
We remind the reader that, in our setting, $K$ is the number of true change-points in a given problem, which is either 0 under $\mathbb{P}_0$ or larger than 1 under $\bar{\mathbb{P}}_\pi$. If the support of $\pi$ is included in $\mathcal{P}(r, s)$, then
$$\sup_{\Theta \in \mathcal{P}(r,s)} \mathbb{P}_\Theta\big( \hat K \ne K \big) \ge \frac{1}{2} \Big( \bar{\mathbb{P}}_\pi\big( \hat K = 0 \big) + \mathbb{P}_0\big( \hat K \ne 0 \big) \Big) \ge \frac{1}{2} \Big( 1 - d_{TV}\big( \bar{\mathbb{P}}_\pi, \mathbb{P}_0 \big) \Big),$$
where $d_{TV}$ is the total variation distance. From the Cauchy-Schwarz inequality, we have $d_{TV}(\bar{\mathbb{P}}_\pi, \mathbb{P}_0) \le \frac{1}{2} \sqrt{\chi^2(\bar{\mathbb{P}}_\pi, \mathbb{P}_0)}$, where $\chi^2$ is the chi-square divergence between probability distributions. By a simple computation that can be found for example in [47],
$$\chi^2\big( \bar{\mathbb{P}}_\pi, \mathbb{P}_0 \big) = \mathbb{E}_{\tilde\Theta, \tilde\Theta'}\Big[ e^{\frac{1}{\sigma^2} \langle \tilde\Theta, \tilde\Theta' \rangle} \Big] - 1,$$
where $\tilde\Theta$ and $\tilde\Theta'$ are i.i.d. and distributed according to $\pi$, $\langle \Theta, \Theta' \rangle = \mathrm{Tr}(\Theta \Theta'^T)$ is the standard scalar product, and $\mathbb{E}_{\tilde\Theta, \tilde\Theta'}$ is the expectation with respect to $\tilde\Theta$ and $\tilde\Theta'$.
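The Cauchy-Schwarz step $d_{TV} \le \frac12\sqrt{\chi^2}$ can be checked numerically on a toy example. The sketch below is illustrative only: two univariate Gaussians $\mathcal N(0,1)$ and $\mathcal N(\mu,1)$, for which both quantities have closed forms:

```python
import math

def tv_gauss(mu):
    # d_TV(N(0,1), N(mu,1)) = 2*Phi(|mu|/2) - 1 = erf(|mu| / (2*sqrt(2)))
    return math.erf(abs(mu) / (2.0 * math.sqrt(2.0)))

def chi2_gauss(mu):
    # chi2(N(mu,1) || N(0,1)) = exp(mu^2) - 1
    return math.exp(mu ** 2) - 1.0

for mu in (0.1, 0.5, 1.0, 2.0):
    lhs, rhs = tv_gauss(mu), 0.5 * math.sqrt(chi2_gauss(mu))
    print(f"mu={mu}: d_TV={lhs:.4f}, sqrt(chi2)/2={rhs:.4f}, bound holds: {lhs <= rhs}")
```

The bound is seen to be fairly tight for small $\mu$ and very loose for large $\mu$, which is why it is only useful in the indistinguishability regime.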
We now distinguish three cases for the couple $(r, s)$. Combining Equations (40) and (41) with Equations (45), (46) and (47), we obtain in all three cases the desired lower bound on $\sup_{\Theta \in \mathcal{P}(r,s)} \mathbb{P}_\Theta(\hat K \ne K)$. Taking a union bound over all high-energy change-points, we deduce from Theorem 1 that, with probability higher than $1 - \delta$, $\hat\tau$ achieves (NoSp) and detects all high-energy change-points. Besides, the localization error (25) is a consequence of the definition (48) together with Theorem 1.
Proof of Proposition 6. As in the proof of Theorem 2, we only consider a specific setting where one aims at testing $K = 0$ with $\Sigma_1 = I_p$ against $K = 2$ with $\tau_1 \in (n/4, 3n/4)$, $\tau_2 = \tau_1 + r$, $\Sigma_1 = \Sigma_{\tau_2} = I_p$ and $\Sigma_{\tau_1} = I_p + \zeta u u^T$ for some unknown unit vector $u$ in $\mathbb{R}^p$. Obviously, we have $r_1 = r_2 = r$ and $\|\Sigma_{\tau_1} - \Sigma_{\tau_0}\|_{op} = \|\Sigma_{\tau_2} - \Sigma_{\tau_1}\|_{op} = \zeta$, so that it suffices to prove that the sum of the type I and type II error probabilities of any test of these hypotheses is bounded away from zero. We consider two subcases. Case 1: $\zeta \le c \big( \sqrt{p/r} \wedge \frac{1}{\sqrt 2} \big)$. Then, we focus on the specific alternative hypothesis where $\tau_1 = n/2$ and $\tau_2 = \tau_1 + r$, so that the problem reduces exactly to testing whether the covariance matrix $\Sigma$ of an $r$-sample satisfies $\Sigma = I_p$ or whether $\Sigma = I_p + \zeta u u^T$. This hypothesis testing problem for covariance matrices is well understood. In particular, one can deduce from Theorem 5.1 in [3] that, as soon as $\zeta \le c \big( \sqrt{p/r} \wedge 1 \big)$ for some $c$ sufficiently small, the minimax testing risk is bounded away from zero. Case 2: $\zeta \le c \big( \sqrt{\log(n/r)/r} \wedge \frac{1}{\sqrt 2} \big)$. Here, we consider another specific class of alternative hypotheses where we fix $u = (1, 0, \ldots, 0)$ but $\tau_1$ can take different values, i.e. $\tau_1 \in \{ \lceil n/4 \rceil, \lceil n/4 \rceil + r, \ldots, \lceil n/4 \rceil + r \lfloor n/(2r) \rfloor \}$. It turns out that this is equivalent to a univariate variance testing problem where one observes $q = \lfloor n/(2r) \rfloor$ samples of size $r$ with distributions $\mathcal{N}(0, \sigma_1^2), \ldots, \mathcal{N}(0, \sigma_q^2)$. Under the null, we have $\sigma_1 = \sigma_2 = \ldots = \sigma_q = 1$. Under the alternative, for some $j \in [q]$, we have $\sigma_j = \sqrt{1 + \zeta}$ and $\sigma_l = 1$ for $l \ne j$. For $j = 1, \ldots, q$, write $\mathbb{P}_j$ for the distribution of the samples when $\sigma_j^2 = 1 + \zeta$ and $\sigma_l = 1$ for $l \ne j$. Besides, we write $L_j$ for the corresponding likelihood ratio with respect to the null distribution $\mathbb{P}_0$. Then, the mixture distribution is defined as $\bar{\mathbb{P}} = \frac{1}{q} \sum_{j=1}^q \mathbb{P}_j$, whereas $\bar L$ stands for the mean likelihood ratio. Following the classical path of Le Cam's method, we obtain that, for any test $T$, the sum of its error probabilities is at least $1 - \|\mathbb{P}_0 - \bar{\mathbb{P}}\|_{TV}$, where $\|\cdot\|_{TV}$ is the total variation norm.
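The "classical path of Le Cam's method" invoked here takes the following standard form (a sketch in the present notation):

```latex
\inf_{T}\Big[ \mathbb{P}_0(T = 1) + \max_{1 \le j \le q} \mathbb{P}_j(T = 0) \Big]
\;\ge\; \inf_{T}\Big[ \mathbb{P}_0(T = 1) + \bar{\mathbb{P}}(T = 0) \Big]
\;\ge\; 1 - \|\mathbb{P}_0 - \bar{\mathbb{P}}\|_{TV},
\qquad \bar{\mathbb{P}} = \frac{1}{q}\sum_{j=1}^{q} \mathbb{P}_j ,
```

since the maximum of the type II error probabilities dominates their average, and $\mathbb{P}_0(T = 1) + \bar{\mathbb{P}}(T = 0) \ge 1 - (\bar{\mathbb{P}}(T = 1) - \mathbb{P}_0(T = 1))$ is bounded below using the definition of the total variation norm.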
Using the Cauchy-Schwarz inequality, we bound the squared total variation distance $\|\mathbb{P}_0 - \bar{\mathbb{P}}\|_{TV}^2$ by the chi-square divergence, which can be computed explicitly since $\zeta \in (0, 1/2)$. As a consequence, we derive that $\|\mathbb{P}_0 - \bar{\mathbb{P}}\|_{TV} \le 1/4$ as long as $r\zeta^2 \le c \log(q) \wedge 1$. The result follows.
Proof of Proposition 7. The proof is based on an application of the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [4] together with a union bound. For a $q$-sample of a univariate distribution with empirical distribution function $\hat F$ and true distribution function $F$, the DKW inequality ensures that, for all $t > 0$,
$$\mathbb{P}\Big( \sup_{x} |\hat F(x) - F(x)| \ge t \Big) \le 2 e^{-2qt^2}.$$
Applying the DKW inequality twice to each statistic $T_{l,r}$ such that no change-point occurs on $(l - r, l + r)$, we deduce that, for $c_1$ sufficiently large, the FWER of $(T_{l,r})$ is at most $\delta/2$, by summing the probabilities over all scales $r \in \mathcal{R}$ and by a union bound on all $l \in D_r$. Turning to the high-energy change-points, we consider $\tau_k$ satisfying (26). Let $\tilde r_k$ be the smallest radius $r \in \mathcal{R}$ satisfying the required energy condition, and consider the closest location $l \in D_{\tilde r_k}$ to $\tau_k$, so that $|l - \tau_k| \le \tilde r_k/2$ and $2\tilde r_k \le r_k$. To ease the notation, we still write $r$ for $\tilde r_k$. As in the proof of Proposition 5, we decompose the statistic and apply the DKW inequality to each of the three sums. Taking a union bound over all possible $T_{l,r}$, we deduce that, with probability higher than $1 - \delta/2$, the deviations are controlled, so that Condition (49) implies that $T_{l,r} = 1$. Applying Theorem 1 allows us to conclude.
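The DKW inequality can be illustrated by a quick simulation (a sketch with hypothetical sample size and level; note that with Massart's tight constant the bound is nearly sharp at this threshold, so the empirical rate sits close to, and below, $\delta$):

```python
import numpy as np

rng = np.random.default_rng(1)
q, delta, trials = 200, 0.05, 1000       # hypothetical sample size and level
thresh = np.sqrt(np.log(2.0 / delta) / (2.0 * q))  # DKW threshold at level delta

exceed = 0
for _ in range(trials):
    x = np.sort(rng.uniform(size=q))     # a q-sample from U[0,1], so F(x) = x
    up = np.arange(1, q + 1) / q         # F_hat right after each order statistic
    lo = np.arange(0, q) / q             # F_hat right before each order statistic
    sup_dev = max(np.max(np.abs(up - x)), np.max(np.abs(lo - x)))
    exceed += sup_dev >= thresh

rate = exceed / trials
print(f"empirical P(sup|F_hat - F| >= thresh) = {rate:.3f}, DKW bound = {delta}")
```

The supremum deviation of the empirical CDF is computed exactly from the order statistics, since the extremes are attained just before or just after the jumps.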
Proof of Proposition 8. As in the proof of Proposition 6, we focus on a simpler testing problem. Write $U$ for the cumulative distribution function of the uniform distribution on $[0, 1]$, i.e. $U(x) = x$ for any $x \in [0, 1]$. Given $\zeta \in (0, 1/4)$, we define the cumulative distribution function $U_\zeta$ by $U_\zeta(x) = (1 + 2\zeta)x$ for $x \in [0, 1/2]$ and $U_\zeta(x) = (1/2 + \zeta) + (1 - 2\zeta)(x - 1/2)$ for $x \in [1/2, 1]$. Note that $\|U_\zeta - U\|_\infty = \zeta$. We focus on a testing problem where, under the null, $F_t = U$ for all $t = 1, \ldots, n$, whereas under the alternative there exists $\tau_1 \in \{ \lceil n/4 \rceil + jr : 0 \le j \le \lfloor n/(2r) \rfloor - 1 \}$ such that $F_t = U_\zeta$ for $t = \tau_1, \ldots, \tau_1 + r - 1$ and $F_t = U$ otherwise. Defining $q = \lfloor n/(2r) \rfloor$, we observe that this amounts to testing whether $q$ samples of size $rm$ are distributed according to the null distribution or whether exactly one of them is distributed according to $U_\zeta$. Arguing as in the proof of Proposition 6, we only need to bound the total variation distance between the distribution $\mathbb{P}_0$ under the null and the mixture distribution $q^{-1} \sum_{j=1}^q \mathbb{P}_j$ of the $q$ possible alternatives. Here, $\mathbb{P}_0 = \bigotimes_{k=1}^q U^{\otimes (rm)}$ is the distribution of the samples when $F_t = U$, and, for $j \ge 1$, $\mathbb{P}_j = \big( \bigotimes_{k=1}^{j-1} U^{\otimes (rm)} \big) \otimes U_\zeta^{\otimes (rm)} \otimes \big( \bigotimes_{k=j+1}^{q} U^{\otimes (rm)} \big)$ is the distribution of the samples when $F_t = U$, except for $t$ in the $j$-th block, in which case $F_t = U_\zeta$.
Let $z$ be a uniform random variable over $[0, 1]$ and $w$ be an independent Bernoulli random variable with parameter $1/2$. Then, one easily checks that $z/2 + w/2$ is uniformly distributed on $[0, 1]$. If $w$ is instead a Bernoulli random variable with parameter $1/2 - \zeta$, then one easily checks that the cumulative distribution function of $z/2 + w/2$ is $U_\zeta$. As a consequence, by a standard data-processing inequality [47], one derives that the total variation distance of interest is bounded by that of the corresponding binomial problem: under $\mathbb{P}_0$ one observes $q$ independent Binomial random variables with parameters $(mr, 1/2)$, whereas under $\mathbb{P}_j$ the $j$-th observation follows a Binomial distribution with parameters $(mr, 1/2 - \zeta)$. Using the Cauchy-Schwarz inequality, we upper bound the square of the total variation distance by the $\chi^2$-divergence and then compute it explicitly. This leads to a bound which is smaller than $1/4$ provided that $16 r m \zeta^2 \le \log(q/4 + 1)$. If we choose $c$ small enough in the statement of the proposition, this last condition holds and the result follows.
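As an illustration of the scale involved in this final binomial reduction (with hypothetical parameters, not the proof's constants), the total variation distance between a single pair of such Binomials is small exactly when $N\zeta^2$ is small:

```python
from math import comb

def binom_pmf(N, p):
    """Probability mass function of Binomial(N, p) over {0, ..., N}."""
    return [comb(N, k) * p ** k * (1.0 - p) ** (N - k) for k in range(N + 1)]

def tv_binom(N, zeta):
    """TV distance between Binomial(N, 1/2) and Binomial(N, 1/2 - zeta)."""
    p0 = binom_pmf(N, 0.5)
    p1 = binom_pmf(N, 0.5 - zeta)
    return 0.5 * sum(abs(a - b) for a, b in zip(p0, p1))

N = 400  # plays the role of the per-block sample size rm (hypothetical value)
for zeta in (0.0, 0.005, 0.02, 0.05):
    print(f"zeta={zeta}: N*zeta^2={N * zeta ** 2:.2f}, TV={tv_binom(N, zeta):.3f}")
```

The distance grows with the standardized mean shift $N\zeta / \sqrt{N}/2 \propto \sqrt{N}\,\zeta$, matching the requirement that $rm\zeta^2$ stay below a logarithmic budget for indistinguishability.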