## The Annals of Statistics

### Graph-based change-point detection

#### Abstract

We consider the testing and estimation of change-points—locations where the distribution abruptly changes—in a data sequence. A new approach, based on scan statistics utilizing graphs representing the similarity between observations, is proposed. The graph-based approach is nonparametric, and can be applied to any data set as long as an informative similarity measure on the sample space can be defined. Accurate analytic approximations to the significance of graph-based scan statistics for both the single change-point and the changed interval alternatives are provided. Simulations reveal that the new approach has better power than existing approaches when the dimension of the data is moderate to high. The new approach is illustrated on two applications: The determination of authorship of a classic novel, and the detection of change in a network over time.

#### Article information

Source
Ann. Statist. Volume 43, Number 1 (2015), 139-176.

Dates
First available in Project Euclid: 18 November 2014

https://projecteuclid.org/euclid.aos/1416322039

Digital Object Identifier
doi:10.1214/14-AOS1269

Mathematical Reviews number (MathSciNet)
MR3285603

Zentralblatt MATH identifier
1308.62090

Subjects
Primary: 62G32: Statistics of extreme values; tail inference

#### Citation

Chen, Hao; Zhang, Nancy. Graph-based change-point detection. Ann. Statist. 43 (2015), no. 1, 139--176. doi:10.1214/14-AOS1269. https://projecteuclid.org/euclid.aos/1416322039.

#### References

• Carlstein, E. G., Müller, H. G. and Siegmund, D. (1994). Change-Point Problems. Institute of Mathematical Statistics Lecture Notes—Monograph Series 23. IMS, Hayward, CA.
• Chen, L. H. Y. and Shao, Q. M. (2005). Stein’s method for normal approximation. An Introduction to Stein’s Method 4 1–59.
• Chen, H. and Zhang, N. R. (2013). Graph-based tests for two-sample comparisons of categorical data. Statist. Sinica 23 1479–1503.
• Chen, H. and Zhang, N. (2014). Supplement to “Graph-based change-point detection.” DOI:10.1214/14-AOS1269SUPP.
• Cobb, G. W. (1978). The problem of the Nile: Conditional solution to a changepoint problem. Biometrika 65 243–251.
• Cox, D. R. and Spjøtvoll, E. (1982). On partitioning means into groups. Scand. J. Stat. 9 147–152.
• Desobry, F., Davy, M. and Doncarli, C. (2005). An online kernel change detection algorithm. IEEE Trans. Signal Process. 53 2961–2974.
• Eagle, N., Pentland, A. S. and Lazer, D. (2009). Inferring friendship network structure by using mobile phone data. Proc. Natl. Acad. Sci. USA 106 15274–15278.
• Friedman, J. H. and Rafsky, L. C. (1979). Multivariate generalizations of the Wald–Wolfowitz and Smirnov two-sample tests. Ann. Statist. 7 697–717.
• Girón, J., Ginebra, J. and Riba, A. (2005). Bayesian analysis of a multinomial sequence and homogeneity of literary style. Amer. Statist. 59 19–30.
• Harchaoui, Z., Moulines, E. and Bach, F. R. (2009). Kernel change-point analysis. In Advances in Neural Information Processing Systems.
• James, B., James, K. L. and Siegmund, D. (1987). Tests for a change-point. Biometrika 74 71–83.
• James, B., James, K. L. and Siegmund, D. (1992). Asymptotic approximations for likelihood ratio tests and confidence regions for a change-point in the mean of a multivariate normal distribution. Statist. Sinica 2 69–90.
• Kossinets, G. and Watts, D. J. (2006). Empirical analysis of an evolving social network. Science 311 88–90.
• Lung-Yut-Fong, A., Lévy-Leduc, C. and Cappé, O. (2011). Homogeneity and change-point detection tests for multivariate data using rank statistics. Preprint. Available at arXiv:1107.1971.
• Olshen, A. B., Venkatraman, E. S., Lucito, R. and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5 557–572.
• Radovanović, M., Nanopoulos, A. and Ivanović, M. (2010). Hubs in space: Popular nearest neighbors in high-dimensional data. J. Mach. Learn. Res. 11 2487–2531.
• Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 515–530.
• Siegmund, D. (1988). Approximate tail probabilities for the maxima of some random fields. Ann. Probab. 16 487–501.
• Siegmund, D. O. (1992). Tail approximations for maxima of random fields. In Probability Theory (Singapore, 1989) 147–158. de Gruyter, Berlin.
• Siegmund, D. and Yakir, B. (2007). The Statistics of Gene Mapping. Springer, New York.
• Siegmund, D., Yakir, B. and Zhang, N. R. (2011). Detecting simultaneous variant intervals in aligned sequences. Ann. Appl. Stat. 5 645–668.
• Srivastava, M. S. and Worsley, K. J. (1986). Likelihood ratio tests for a change in the multivariate normal mean. J. Amer. Statist. Assoc. 81 199–204.
• Tang, H. K. and Siegmund, D. (2001). Mapping quantitative trait loci in oligogenic models. Biostatistics 2 147–162.
• Tsirigos, A. and Rigoutsos, I. (2005). A new computational method for the detection of horizontal gene transfer events. Nucleic Acids Res. 33 922–933.
• Tu, I.-P. and Siegmund, D. (1999). The maximum of a function of a Markov chain and application to linkage analysis. Adv. in Appl. Probab. 31 510–531.
• Vostrikova, L. J. (1981). Detection of the disorder in multidimensional random-processes. Doklady Akademii Nauk SSSR 259 270–274.
• Woodroofe, M. (1976). Frequentist properties of Bayesian sequential tests. Biometrika 63 101–110.
• Woodroofe, M. (1978). Large deviations of likelihood ratio statistics with applications to sequential testing. Ann. Statist. 6 72–84.
• Worsley, K. J. (1986). Confidence regions and test for a change-point in a sequence of exponential family random variables. Biometrika 73 91–104.
• Zhang, N. R., Siegmund, D. O., Ji, H. and Li, J. Z. (2010). Detecting simultaneous changepoints in multiple sequences. Biometrika 97 631–645.

#### Supplemental materials

• Supplementary material: Supplement to “Graph-based change-point detection”. Supplement A: Proofs for lemmas, propositions and theorems. We provide the proofs to the lemmas, propositions and theorems. Supplement B: Skewness correction. We provide the details to the skewness correction we used. Supplement C: Checking analytic approximations to $p$-values. We provide more results in checking analytic approximations to $p$-values. Supplement D: Block permutation results. We provide more results on block permutation in analyzing the two real data examples.