Multiple Testing of Local Extrema for Detection of Change Points

A new approach to detecting change points, based on differential smoothing and multiple testing, is presented for long data sequences modeled as a piecewise constant function plus stationary ergodic Gaussian noise. As an application of the STEM algorithm for peak detection developed in \citet{schwartzman2011multiple} and \citet{cheng2017multiple}, the method detects change points as significant local maxima and minima after smoothing and differentiating the observed sequence. The algorithm, combined with the Benjamini-Hochberg procedure for thresholding p-values, provides asymptotic strong control of the False Discovery Rate (FDR) and power consistency as the length of the sequence and the size of the jumps get large. Simulations show that FDR levels are maintained in non-asymptotic conditions and guide the choice of smoothing bandwidth. The methods are illustrated on magnetometer sensor data and genomic array-CGH data. An R package named "dSTEM" is available on CRAN.

This paper seeks a change point detection method that can operate over long sequences where the number and location of change points are unknown, and in such a way that the overall detection error rate is controlled.
Many different approaches have been proposed to find and estimate change points, such as kernel-based methods [1], Bayesian methods [2,7], segmentation techniques [21,27,19], nonparametric tests [16] and L1-penalty methods [6,10,26], including the PELT method [12,14]. Though there is abundant literature on change point segmentation and detection, only a few papers address the FDR by treating the detection of change points as a multiple hypothesis testing problem. Tibshirani and Wang [26] applied the fused lasso method to the hot-spot detection problem and provided empirical evidence of FDR control. Efron and Zhang [5] introduced an iterative local-FDR based algorithm to explore copy number changes. Recently, Frick et al. [8] introduced a simultaneous multiscale change point estimator (SMUCE) for the change point problem in exponential family regression, and proved control of the probability of overestimating the true number of change points. Li et al. [17] improved the SMUCE method and proposed a multiscale segmentation method, FDRSeg, which gives a non-asymptotic upper bound for its FDR in a Gaussian setting and is robust to the choice of the parameter α. However, our proposed approach is unique in the following two ways.
First, the noise is assumed to be a stationary Gaussian process, allowing the error terms to be correlated. This is an important departure from the standard assumption of white noise in most of the change-point literature. In fact, applied statisticians desiring to use change-point methods have sometimes abandoned this option in favor of other techniques simply because the white noise assumption does not hold [11]. This paper shows that change-point methods can be devised for correlated noise, expanding the domain of their applicability.
Second, we use the theory of Gaussian processes to compute p-values for all candidate change points, so that significant change points can be selected at a desired significance level. For concreteness, we adopt the Benjamini-Hochberg multiple testing procedure, enabling control of the false discovery rate (FDR) of detected change points when the data sequence is long and the number and location of change points are unknown. To our knowledge, this is the first article proposing a multiple testing method for controlling the FDR of detected change points. Moreover, the asymptotic properties of FDR and power are provided.
In this paper, we consider a signal-plus-noise model where the true signal is a piecewise constant function and the change points are defined as the points of discontinuity. Inspired by the method for detecting peaks in Schwartzman et al. [23] and Cheng and Schwartzman [3], we modify the STEM algorithm therein to detect change points. The central idea is the observation that the true signal has zero derivative everywhere except at the change points, where the derivative is infinite.
Thus, in the presence of noise and under temporal or spatial sampling, change points can be seen as positive or negative peaks in the derivative of the smoothed signal. Note that because of the time sampling, derivatives cannot be observed directly and can only be estimated. The focus on the derivative of the smoothed signal effectively transforms the change point detection problem into a peak detection problem. As in the STEM algorithm, the resulting peak detection problem is then solved by identifying local maxima and minima of the derivative as candidate peaks and applying a multiple testing procedure to the list of candidates.

(Figure 1 caption: The cyan and pink bars indicate the location tolerance intervals for increasing and decreasing change points, respectively. At this tolerance, there are nine true discoveries and one false discovery.)
The "differential Smoothing and TEsting of Maxima/Minima" (dSTEM) algorithm for change point detection consists of the following steps:

1. Differential kernel smoothing: transform change points into local maxima or minima and increase the SNR. The principle of this step is illustrated in Figure 2.

2. Candidate peaks: find the local maxima and minima of the differentiated smoothed process.

3. P-values: compute a p-value at each local maximum and minimum under the null hypothesis of no signal in a local neighborhood.

4. Multiple testing: apply a multiple testing procedure to the set of local maxima and minima; declare as detected change points those local maxima and minima whose p-values are significant.
The algorithm is illustrated by a toy example in Figure 1.
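To make the four steps concrete, the pipeline can be sketched in a few lines of Python. The reference implementation is the R package dSTEM; the sketch below substitutes a crude Gaussian null on standardized extremum heights for the exact peak-height distribution used in the paper, and all function names are ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.stats import norm

def dstem_sketch(y, gamma, alpha=0.05):
    """Simplified dSTEM pipeline: differential smoothing, candidate
    extrema, p-values, Benjamini-Hochberg selection."""
    # Step 1: convolve with the derivative of a Gaussian kernel of
    # bandwidth gamma (smoothing and differentiation in one pass).
    dy = gaussian_filter1d(y, sigma=gamma, order=1)
    # Step 2: candidate peaks = local maxima and minima of the derivative.
    ext = [t for t in range(1, len(dy) - 1)
           if (dy[t] > dy[t - 1] and dy[t] >= dy[t + 1]) or
              (dy[t] < dy[t - 1] and dy[t] <= dy[t + 1])]
    # Step 3: p-values; crude two-sided Gaussian tail with the empirical
    # sd of dy as scale (the paper uses the exact peak-height law instead).
    scale = np.std(dy)
    pvals = np.array([2 * norm.sf(abs(dy[t]) / scale) for t in ext])
    # Step 4: BH procedure at level alpha.
    order = np.argsort(pvals)
    m = len(pvals)
    k = 0
    for i, idx in enumerate(order, start=1):
        if pvals[idx] <= i * alpha / m:
            k = i
    return sorted(ext[i] for i in order[:k])
```

On a sequence with a single large jump, the sketch flags one significant extremum near the jump location and none elsewhere.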
The dSTEM algorithm above differs from the ones in Schwartzman et al. [23] and Cheng and Schwartzman [3] in that peaks are sought in the derivative of the smoothed signal rather than the smoothed signal itself, and that both positive and negative peaks are considered. In addition, an important consideration for the proper definition of error in change point detection is that, as opposed to the peak detection problems considered in Schwartzman et al. [23] and Cheng and Schwartzman [3] where signal peaks had compact support, a true single change point over a continuous domain at t = v has Lebesgue measure zero. Thus in the presence of noise, it can hardly be detected exactly at t = v. Therefore we introduce a location tolerance b that defines the precision within which a change point should be detected. Specifically, given b, a detected change point is regarded as a true discovery if it falls in the interval (v − b, v + b). Conversely, if a significant change point is found more than a distance b from any true change point, it is considered a false discovery. The quantity b is not used in the dSTEM algorithm itself but is needed for proper error definition.
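This matching convention is easy to operationalize. The helper below (a plain Python sketch of our own, not part of the dSTEM package) credits at most one detection per true change point and labels the rest false:

```python
def classify_detections(detected, true_cps, b):
    """Match detected change points to true ones within tolerance b.
    A detection is a true discovery if it lies within distance b of a
    true change point not already matched; everything else is false."""
    true_disc, false_disc, used = [], [], set()
    for t in sorted(detected):
        hit = next((v for v in true_cps
                    if abs(t - v) < b and v not in used), None)
        if hit is None:
            false_disc.append(t)
        else:
            used.add(hit)
            true_disc.append(t)
    return true_disc, false_disc
```

For example, with true change points at 100 and 200 and tolerance b = 5, the detections {98, 204, 400} yield two true discoveries and one false discovery.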
Under this convention, it is shown here that the dSTEM algorithm exhibits asymptotic FDR control and power consistency as the length of the sequence and the size of the jumps at the change points increase. These asymptotic conditions are similar to those considered in Schwartzman et al. [23] and Cheng and Schwartzman [3] and, in fact, the proofs are extended from those in Cheng and Schwartzman [3].
Simulations for varying levels of smoothing bandwidth γ, smoothing degree of noise ν and jump size a are used to study the behavior of the algorithm under non-asymptotic conditions. The simulation results help guide the choice of smoothing bandwidth γ with respect to ν and the desired location tolerance. In general, power increases with bandwidth to a limit dictated by the distance between the change points, so admitting a higher tolerance generally allows a higher bandwidth and higher power.
The methods are illustrated in a genomic sequence of array-CGH data in a breast-cancer tissue sample [18,9]. The goal of the analysis is to find genomic segments with copy-number alterations.
These are found by detecting change points in the copy number genomic sequence. Another application is to magnetometer sensor readings, where the aim is to find the start and end points of hand gesture motions, a critical step in establishing a secret key based on hand gestures.
2 The multiple testing scheme

The model
We consider a continuous time model, although the algorithm is designed for data discretely sampled in time. Consider the signal-plus-noise model

y(t) = µ(t) + z(t),   t ∈ U(L),   (2.1)

where the signal µ(t) is a piecewise constant function of the form

µ(t) = Σ_{j=1}^{J} a_j 1(t ≥ v_j),

with jump sizes a_j ≠ 0 at the change points v_1 < · · · < v_J. We are interested in finding the change points v_j. For the asymptotic analysis, we assume that inf_j |a_j| and inf_j (v_{j+1} − v_j) are bounded below, so that the jumps do not become arbitrarily small in size nor the change points arbitrarily close to each other.
Let w_γ(t) = w(t/γ)/γ, where γ > 0 is the bandwidth parameter and w(t) ≥ 0 is a unimodal symmetric kernel with compact connected support [−c, c] and unit integral. Convolving the process (2.1) with the kernel w_γ(t) results in the smoothed random process

y_γ(t) = (w_γ * y)(t) = µ_γ(t) + z_γ(t),   (2.3)

where the smoothed signal and smoothed noise are defined respectively as

µ_γ(t) = (w_γ * µ)(t),   z_γ(t) = (w_γ * z)(t),   (2.4)

so that, near a single change point v_j, the smoothed signal takes the form of a smoothed step of height a_j. The smoothed noise z_γ(t) defined by (2.4) is assumed to be a zero-mean four-times differentiable stationary ergodic Gaussian process.

Change point detection as peak detection of the derivative
Consider now the derivative of the smoothed observed process (2.3),

y′_γ(t) = µ′_γ(t) + z′_γ(t),

where the derivatives of the smoothed signal and smoothed noise are, respectively,

µ′_γ(t) = (w′_γ * µ)(t),   z′_γ(t) = (w′_γ * z)(t).   (2.5)

A key observation from (2.5) is that

µ′_γ(t) = Σ_{j=1}^{J} a_j w_γ(t − v_j),

as illustrated in Figure 2. Thus (2.6) represents a signal-plus-noise model where the smoothed signal µ′_γ(t) is a sequence of unimodal peaks with the same shape as that of w_γ, located at the points v_j. The problem of finding change points in y_γ(t) is thus reduced to finding (positive or negative) peaks in y′_γ(t).
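This reduction is easy to verify numerically: smoothing a noiseless step of size a with a Gaussian kernel of bandwidth γ and differentiating yields a unimodal peak of height about a·w_γ(0) = a/(γ√(2π)) at the change point. A small Python check (scipy's `gaussian_filter1d` with `order=1` plays the role of convolution with w′_γ; the numbers are ours):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Noiseless step of size a at t = v.
a, v, gamma = 2.0, 250, 10
mu = np.where(np.arange(500) >= v, a, 0.0)

# Derivative of the Gaussian-smoothed step: a unimodal peak centered at v
# whose height is approximately a * w_gamma(0) = a / (gamma * sqrt(2*pi)).
dmu = gaussian_filter1d(mu, sigma=gamma, order=1)
peak = int(np.argmax(dmu))
```

The peak location matches the change point up to discretization, and the peak height matches a/(γ√(2π)) ≈ 0.0798 for these values.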
For simplicity, we assume that the compact supports S_{j,γ} of the smoothed peak shapes h_{j,γ}(t) = w_γ(t − v_j) do not overlap, although this is not crucial in practice.

Algorithm 2.1 (dSTEM algorithm)

1. Differential kernel smoothing: obtain the process (2.6) by convolution of y(t) with the kernel derivative w′_γ(t).

2. Candidate peaks: find the set of local maxima and minima of y′_γ(t) in U(L), denoted by T̃_γ = T̃⁺_γ ∪ T̃⁻_γ, where T̃⁺_γ is the set of local maxima and T̃⁻_γ the set of local minima.

3. P-values: for each t ∈ T̃⁺_γ, compute the p-value p_γ(t) for testing the (conditional) hypotheses of no increasing change point versus an increasing change point in (t − b, t + b), and for each t ∈ T̃⁻_γ, compute the p-value p_γ(t) for the analogous hypotheses about a decreasing change point, where b > 0 is an appropriate location tolerance.

4. Multiple testing: let m̃_γ = #{t ∈ T̃_γ} be the number of tested hypotheses. Apply a multiple testing procedure on the set of m̃_γ p-values {p_γ(t), t ∈ T̃_γ}, and declare significant all local extrema whose p-values are smaller than the significance threshold.

P-values
Given the observed heights y′_γ(t) at the local maxima or minima, the p-values in step 3 of Algorithm 2.1 are computed as

p_γ(t) = F̄_γ(y′_γ(t)) for t ∈ T̃⁺_γ,   p_γ(t) = F̄_γ(−y′_γ(t)) for t ∈ T̃⁻_γ,   (2.9)

where F̄_γ(u) denotes the right tail probability of z′_γ(t) at the local maximum t ∈ T̃⁺_γ, evaluated under the null model µ(s) = 0, ∀s ∈ (t − b, t + b), that is,

F̄_γ(u) = P(z′_γ(t) > u | t ∈ T̃⁺_γ).   (2.10)

The second line in (2.9) is obtained by noting that, by (2.10) and the symmetry of the Gaussian process z′_γ(t), the heights of its local minima are distributed as the negatives of the heights of its local maxima. The distribution (2.10) has a closed-form expression, which can be obtained as in Schwartzman et al. [23] or Cramér and Leadbetter [4]. More specifically, the distribution (2.10) is given by

F̄_γ(u) = 1 − Φ(u √(λ_{6,γ}/∆)) + √(2π λ²_{4,γ}/(σ²_γ λ_{6,γ})) φ(u/σ_γ) Φ(u λ_{4,γ} √(λ_{6,γ}/(σ²_γ ∆))),   (2.11)

where ∆ = σ²_γ λ_{6,γ} − λ²_{4,γ}, and φ(x), Φ(x) are the standard normal density and cumulative distribution function, respectively. The quantities σ²_γ, λ_{4,γ} and λ_{6,γ} depend on the kernel w_γ(t) and the autocorrelation function of the original noise process z(t). Explicit expressions may be obtained, for instance, for the Gaussian autocorrelation model in Example 3.4 below, which we use later in the simulations.
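A direct implementation of the peak-height tail probability in the classical Cramér-Leadbetter form, with σ²_γ = Var(z′_γ), λ_{4,γ} = Var(z″_γ), λ_{6,γ} = Var(z‴_γ) and ∆ = σ²_γλ_{6,γ} − λ²_{4,γ} as in the text, can be sketched as follows; the moment-estimation helper is our own illustrative addition, not taken from the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.stats import norm

def peak_height_sf(u, sigma2, lam4, lam6):
    """Right-tail probability of the height of a local maximum of a smooth
    stationary Gaussian process (Cramer-Leadbetter form), with
    sigma2 = Var(z'_gamma), lam4 = Var(z''_gamma), lam6 = Var(z'''_gamma),
    and Delta = sigma2*lam6 - lam4**2. Treat this as a sketch of (2.11)."""
    delta = sigma2 * lam6 - lam4 ** 2
    sigma = np.sqrt(sigma2)
    return (norm.sf(u * np.sqrt(lam6 / delta))
            + np.sqrt(2 * np.pi * lam4 ** 2 / (sigma2 * lam6))
            * norm.pdf(u / sigma)
            * norm.cdf(u * lam4 * np.sqrt(lam6 / (sigma2 * delta))))

def empirical_moments(resid, gamma):
    """Rough empirical estimates of (sigma2, lam4, lam6) from a stretch of
    signal-free residuals, via Gaussian-derivative filtering; an
    illustrative alternative to closed forms such as Example 3.4."""
    return tuple(np.var(gaussian_filter1d(resid, sigma=gamma, order=k))
                 for k in (1, 2, 3))
```

As a survival function, `peak_height_sf` decreases monotonically from 1 to 0 in u; for Gaussian-smoothed white noise the empirical moments agree with the closed-form spectral moments 1/(4√π γ³), 3/(8√π γ⁵) and 15/(16√π γ⁷).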

Error definitions
Assuming the model of §2.1, define the signal region as the union of the tolerance intervals (v_j − b, v_j + b) around the true change points, and let T̃⁺_γ(u) and T̃⁻_γ(u) be respectively the set of local maxima of y′_γ(t) above u and the set of local minima of y′_γ(t) below −u, with T̃_γ(u) = T̃⁺_γ(u) ∪ T̃⁻_γ(u). The number of totally and falsely detected change points at threshold u are defined respectively as

R_γ(u) = #{t ∈ T̃_γ(u)},   V_γ(u; b) = #{t ∈ T̃_γ(u) : |t − v_j| ≥ b for all j},   (2.12)

both defined as zero if T̃_γ(u) is empty. The FDR at threshold u is defined as the expected proportion of falsely detected jumps,

FDR_γ(u; b) = E[ V_γ(u; b) / max(R_γ(u), 1) ].

Note that when γ and u are fixed, V_γ(u; b) and hence FDR_γ(u; b) are decreasing in b.
Following the notation in Cheng and Schwartzman [3], define the smoothed signal region S_{1,γ} to be the support of µ′_γ(t) and the smoothed null region S_{0,γ} = U(L) \ S_{1,γ}. We call the difference between the expanded signal support due to smoothing and the true signal support the transition region.

Power
Denote by I⁺ and I⁻ the collections of indices j corresponding to increasing and decreasing change points v_j, respectively. We define the power of Algorithm 2.1 as the expected fraction of truly discovered change points,

Power_γ(u; b) = (1/J) Σ_{j=1}^{J} Power_{j,γ}(u; b),   (2.14)

where Power_{j,γ}(u; b) is the probability of detecting the jump at v_j within a distance b,

Power_{j,γ}(u; b) = P( T̃⁺_γ(u) ∩ (v_j − b, v_j + b) ≠ ∅ ) for j ∈ I⁺,   (2.15)

and analogously with T̃⁻_γ(u) for j ∈ I⁻. The indicator function in (2.14) ensures that only one significant local extremum is counted within a distance b of a change point, so power is not inflated. Note that when γ and u are fixed, Power_γ(u; b) and Power_{j,γ}(u; b) are increasing in b.
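Empirical counterparts of these error and power definitions are straightforward; the following is a small Python helper of our own, not from the package:

```python
def fdp_and_power(detected, true_cps, b):
    """Empirical counterparts of the definitions above: a detection is
    false when it is at distance >= b from every true change point
    (cf. V in (2.12)); a true change point counts as detected when some
    significant extremum falls within distance b of it, counted at most
    once (cf. the indicator in (2.14))."""
    V = sum(1 for t in detected if all(abs(t - v) >= b for v in true_cps))
    fdp = V / max(len(detected), 1)
    power = sum(1 for v in true_cps
                if any(abs(t - v) < b for t in detected)) / len(true_cps)
    return fdp, power
```

With true change points {100, 200, 300}, detections {98, 205, 400} and b = 10, this gives a false discovery proportion of 1/3 and a power of 2/3.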

Asymptotic FDR control and power consistency
Suppose the BH procedure is applied in step 4 of Algorithm 2.1 as follows. For a fixed α ∈ (0, 1), let k be the largest index i for which the ith smallest p-value is less than iα/m̃_γ. Then the null hypotheses corresponding to the k smallest p-values are rejected; equivalently, the significant local extrema are those exceeding (in absolute value) the random height threshold

ũ_BH = F̄_γ^{-1}(kα/m̃_γ),   (3.1)

where kα/m̃_γ is defined as 1 if m̃_γ = 0. Since ũ_BH is random, we define the FDR of the BH procedure as

FDR_{BH,γ}(b) = E[ V_γ(ũ_BH; b) / max(R_γ(ũ_BH), 1) ],

where R_γ(·) and V_γ(·; b) are defined in (2.12) and the expectation is taken over all possible realizations of the random threshold ũ_BH. We will make use of the following conditions:

(C1) The assumptions of §2.1 hold.
(C2) L → ∞ and a = inf_j |a_j| → ∞, such that (log L)/a² → 0.

In condition (C2), we assume that the length of the search space L increases; for the change points to remain detectable over the growing search space, the minimal jump size a must also grow fast enough, as quantified by the requirement (log L)/a² → 0.

Theorem 3.1 asserts asymptotic control of the FDR of Algorithm 2.1 under (C1) and (C2): (i) when applied with a fixed threshold u > 0, and (ii) when applied with the random threshold ũ_BH (3.1). The precise statements parallel those in Cheng and Schwartzman [3].

Proof. Since w′_γ(t) has compact support [−cγ, cγ], by (2.7), the support S_{1,γ} of µ′_γ(t) in (2.8) is ∪_{j=1}^{J} [v_j − cγ, v_j + cγ]. Notice that, on the null region S_{0,γ}, the expected number of local extrema, including both local maxima and minima, equals 2|S_{0,γ}| E[m_{0,γ}(U(1))]. On the other hand, following the proof of Theorem 3 in Cheng and Schwartzman [3], the expected number of local extrema on the signal region S_{1,γ} is asymptotically equivalent to J. This is because, for each j ∈ I⁺ and b > 0, as a → ∞, asymptotically there is no local maximum of y′_γ(t) on S_{1,γ} other than the one within distance b of v_j. The reasoning for the case of minima is similar.
Lemma 3.2 Let conditions (C1) and (C2) hold. As |a_j| → ∞, the power for peak j and fixed u, (2.15), admits an explicit approximation. The result then follows from Lemma 4 in Cheng and Schwartzman [3] with z_γ(t) replaced by z′_γ(t).
By similar arguments as in Cheng and Schwartzman [3] (see equation (20) therein), one can show that the random threshold ũ_BH converges asymptotically to a deterministic threshold, where E[m_{0,γ}(U(1))] is given by (3.2). Since ũ_BH is random, similarly to the definition of FDR_{BH,γ}(b), we define the power of the BH procedure by replacing the fixed threshold u with ũ_BH in (2.14). Theorem 3.3 then states the power consistency of Algorithm 2.1 under (C1) and (C2): (i) when applied with a fixed threshold u > 0, and (ii) when applied with the random threshold ũ_BH (3.1).

Proof. The desired results follow from similar arguments for showing Theorem 5 in Cheng and Schwartzman [3].

Example 3.4 [Gaussian autocorrelation model] Let the noise z(t) in (2.1) be constructed as
where φ is the standard Gaussian density, dB(s) is Gaussian white noise and ν > 0 (z(t) is regarded by convention as Gaussian white noise when ν = 0). Convolving with a Gaussian kernel w_γ(t) = (1/γ)φ(t/γ) with γ > 0 as in (2.4) produces a zero-mean infinitely differentiable stationary ergodic Gaussian process z_γ(t). As a function of γ, the SNR has a local minimum at γ* = √(2ν) and is strictly increasing for large γ. In particular, when ν = 0, it is strictly increasing in γ. Thus we generally expect the detection power to increase with γ for γ > √(2ν). This will be confirmed in the simulations below. Note that for ν > 0, the SNR is unbounded as γ → 0; however, in practice γ cannot be too small: if the support of w_γ becomes smaller than the sampling interval, then the derivative µ′_γ cannot be estimated.

The results of FDR and power are shown in Figure 3. We see that for fixed γ and ν, as the strength of the signal a increases, the FDR decreases while the power increases; moreover, the FDR is eventually controlled below the nominal level and the power tends to 1, consistent with the asymptotic theory. In the simulations in Figure 3, the distance between neighboring change points, d = 100, is large enough that the kernel smoothing effectively affects only one change point at a time. However, if d is small, then kernel smoothing with large γ may produce interference between neighboring change points, causing the power to decrease. To illustrate this, we take the case where the signal strength is a = 1.5 and perform simulations for FDR and power with d = 40, 30 and 20. As shown in Figure 4, too large a γ makes the FDR increase and the power decrease, due to the overlap of the kernel smoothing at neighboring change points. This phenomenon becomes more evident as d decreases (d = 20). Theoretically, neighboring interference occurs when d is less than the support of the smoothing kernel, which is about 8γ in our Gaussian case here.
In particular, we see that the turning point of the power, where it attains almost its maximum, appears at around γ = 8 for the cases d = 40 and d = 30, and at around γ = 5 for the case d = 20. This suggests that γ = d/4 is a good choice of bandwidth when d is not large. Figure 3 shows that the bandwidth γ greatly affects the performance of dSTEM: larger γ tends to attain a smaller FDR and larger power. However, as shown in Figure 4, if γ is too large, it produces interference between neighboring change points and contamination error, thereby decreasing the power. Thus, the choice of γ is critical for the performance of dSTEM.
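A simulation in the spirit of Example 3.4 can be set up as follows; the discrete convolution of white noise with a Gaussian density stands in for z(t) = ∫ φ_ν(t − s) dB(s), ν = 0 reproducing white noise, and all names are ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def simulate_sequence(n, cps, jumps, nu, rng):
    """Piecewise constant signal plus Gaussian-autocorrelation noise:
    unit-variance white noise convolved with a Gaussian density of scale
    nu (nu = 0 gives white noise, per the convention of Example 3.4)."""
    mu = np.zeros(n)
    for v, a in zip(cps, jumps):
        mu[v:] += a                      # jump of size a at change point v
    z = rng.standard_normal(n)
    if nu > 0:
        z = gaussian_filter1d(z, sigma=nu)
    return mu + z
```

The resulting noise has variance about 1/(2√π ν) and autocorrelation exp(−h²/(4ν²)) at lag h, which is what makes the white noise assumption of standard change point methods fail here.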

Choice of the bandwidth γ
The optimal bandwidth γ is the value that maximizes the power while keeping the FDR below the significance level. It is difficult to obtain the optimal γ theoretically. However, in our model the signal is a piecewise constant function, and the optimal γ can be approximated in practice by making further assumptions on the noise, as in Example 3.4 with the Gaussian autocorrelation model.

Comparison with algorithm FDRSeg
As mentioned in Section 1, FDRSeg is the most recent method that can control FDR for change point detection. In this subsection, we compare the performance of our method, dSTEM, with FDRSeg. First, it is worth mentioning that our method is designed mainly for autocorrelated random noise, while FDRSeg requires independent and identically distributed errors, which is a special case of our model (ν = 0). Note that FDRSeg has only one parameter, α_F, which controls the theoretical upper bound of the FDR at 2α_F/(1 − α_F); however, its authors suggest that in practice the method should give FDR ≤ α_F. Thus, we let α_F be 0.05 and 0.1. Table 1 shows the realized FDR and detection power under independent noise. We see that for a small signal, a = 1, dSTEM almost controls the FDR and its power is 84.8% when γ = 12, while FDRSeg attains somewhat larger power but fails to control the FDR when α_F = 0.1. For larger signals a, both methods control the FDR and attain similarly high power. Table 2 shows the results under autocorrelated noise (ν = 1). In this case, the performance of dSTEM is nearly the same as in the independent scenario, while FDRSeg tends to estimate a large number of change points, leading to a large FDR; that is, it can hardly control the FDR.

Magnetometer data

In the field of mobile security, two-factor authentication/verification is of great importance: it is an extra layer of security for mobile devices, such as smartphones, wearables and smart home devices, designed to ensure that only the owner can unlock the device, even if someone else knows the password. In recent years, gesture-based key establishment has been a popular topic in communication security and computer science [25,24,13].
Modern mobile devices embed various motion sensors, including an accelerometer, gyroscope and magnetometer, which are used to measure and record the gesture-performing process. Obtaining accurate magnetometer readings is the foundation of magnetometer-based research. However, during data collection there is always noise, which may be caused by hardware imperfections, manipulation errors or the sensitivity of the sensor. In particular, finding the start point and associated end point of each gesture is a big challenge. The goal of this analysis is to find such change points. In this paper, the data were collected from an experiment in which several simple gestures were designed, for example, shaking the smartphone in different directions and at different speeds. The smartphone defines a coordinate system for the embedded magnetometer, shown in Figure 5. The magnetometer records the speed of the smartphone's movement as readings along the X, Y and Z axes. In our case, we only show results for the readings on the X-axis, since the readings along the other two axes can be processed similarly.
In this data analysis, the sample size is n = 6510 and we use the same procedure as in Example 5.2, except that γ = 18, because the intervals between neighboring change points are narrow, so the bandwidth cannot be too large if interference is to be avoided. Figure 6 shows the detected change points. Due to measurement error and magnetic-field interference, the underlying data are perturbed by slight fluctuations, leading to a large number (349) of local maxima and minima, as shown in Figure 6.

Array-CGH data
Array-based comparative genomic hybridization (array-CGH) is a high-throughput, high-resolution technique used to evaluate changes in the number of copies of alleles at thousands of genomic loci simultaneously. The output is often called Copy Number Variation (CNV) data. Changes in copy number are represented by segments whose mean is displaced with respect to the background. To detect these changes, it is customary to search for change points along the genome.
In this paper, we apply our method to chromosome 1 of tumor sample #18 from the dataset of Hsu et al. [9] and Loo et al. [18]. This sample is one of 37 formalin-fixed breast cancer tumors in that dataset and it was chosen for its visual appeal in the illustration of our method. The data in chromosome 1 of tumor sample #18 consists of 968 average copy number reads mapped onto 968 unequally spaced locations along the chromosome. For simplicity, the data was analyzed ignoring the gaps in the genomic locations. Figure 8 (top left) shows the data with spacings between reads artificially set to 1. Note that ignoring the spacings does not affect the presence or absence of change points.
To analyze the data, the dSTEM algorithm was applied with a truncated Gaussian smoothing kernel w_γ(t) = (1/γ)φ(t/γ)1(t ∈ [−4γ, 4γ]) with γ = 10. The bandwidth was chosen not too large in order to avoid interference between neighboring change points (Figure 8). P-values corresponding to local maxima and minima were computed according to (2.9) using the distribution (2.11), with the required parameters σ²_γ = Var(z′_γ(t)), λ_{4,γ} = Var(z″_γ(t)) and λ_{6,γ} = Var(z‴_γ(t)).

Discussion

In this paper, we combined both local maxima and minima of the derivative as candidate peaks, and then applied a multiple testing procedure to find a uniform threshold (in absolute value) for detecting all change points. This approach is sensible when the distributions (number and height) of true increasing and decreasing change points are about the same. Alternatively, different thresholds for detecting increasing and decreasing change points could be found by applying separate multiple testing procedures to the sets of candidate local maxima and local minima. While we applied the BH algorithm to control the FDR, in principle other multiple testing procedures may be used to control other error rates.

The smoothing bandwidth
A natural and important question is how to choose the smoothing bandwidth γ. We can see that either a small γ (if the noise is highly autocorrelated) or a relatively large γ (if the noise is less autocorrelated) is preferred in order to increase power, but only to the extent that the smoothed signal supports h j,γ (t) have little overlap and that detected change points are not displaced by more than the desired tolerance b (recall that the value of b is not used in the dSTEM algorithm itself, but it may be determined by the needs of the specific scientific application). Considering the Gaussian kernel to have an effective support of ±cγ, a good value of γ may be about min(b, d/c), where d is the separation between change points. Since the location of the change points is unknown, a more precise optimization of γ may require an iterative procedure. Moreover, if some change points are close together and others are far apart, an adaptive bandwidth may be preferable. We leave these as problems for future research.
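The rule of thumb above can be written down directly. The helper below is a hypothetical convenience of ours, not part of the dSTEM package; the default c = 4 matches the effective ±4γ support of a truncated Gaussian kernel and recovers the γ ≈ d/4 guideline from the simulations:

```python
def suggest_bandwidth(b, d, c=4.0):
    """Heuristic bandwidth gamma ~ min(b, d/c): keep the smoothed peak
    supports (+/- c*gamma wide) from overlapping across the change point
    separation d, and keep detected locations within the tolerance b.
    A rule of thumb only, not part of the formal algorithm."""
    return min(b, d / c)
```

For example, with tolerance b = 10 and separation d = 100 this suggests γ = 10, while b = 30 and d = 40 suggest γ = 10 as well, now limited by the separation rather than the tolerance.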

Supplementary Material
An R package named "dSTEM", implementing Algorithm 2.1 for change point detection, is available on CRAN.