## Electronic Journal of Statistics

### Model selection for the segmentation of multiparameter exponential family distributions

#### Abstract

We consider the problem of segmenting univariate distributions from the exponential family with multiple parameters. In segmentation, choosing the number of segments remains a difficult issue because of the discrete nature of the change-points. In this general exponential family framework, we propose a penalized log-likelihood estimator whose penalty is inspired by papers of L. Birgé and P. Massart. The resulting estimator is proved to satisfy some oracle inequalities. We then study the particular case of categorical variables, comparing the values of the key constants obtained by specializing our general approach with those obtained by working directly with the characteristics of this distribution. Finally, simulation studies are conducted to assess the performance of our criterion and to compare our approach to other existing methods, and an application to real data modeled using the categorical distribution is provided.
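As a rough illustration of the kind of procedure the abstract describes, the following Python sketch segments a categorical sequence by minimizing a penalized negative log-likelihood. It is only a sketch under stated assumptions: the names `segment_cost`, `illustrative_penalty`, and `select_segmentation` are hypothetical, the best segmentation for each number of segments is found by a plain quadratic dynamic program, and the penalty shape and the constant `beta` are placeholders rather than the calibrated penalty derived in the paper.

```python
import math


def segment_cost(counts, length):
    """Negative maximized log-likelihood of one segment under a categorical model."""
    return -sum(c * math.log(c / length) for c in counts if c > 0)


def illustrative_penalty(k, n, beta=1.0):
    """Hypothetical penalty increasing in k; NOT the paper's calibrated penalty."""
    return beta * k * (1.0 + math.log(n / k))


def select_segmentation(seq, categories, k_max, beta=1.0):
    """Return (k_hat, change_points) minimizing cost + penalty over 1..k_max segments."""
    n = len(seq)
    index = {c: j for j, c in enumerate(categories)}
    # prefix[t][j] = number of occurrences of category j in seq[:t]
    prefix = [[0] * len(categories)]
    for x in seq:
        row = prefix[-1][:]
        row[index[x]] += 1
        prefix.append(row)

    def cost(s, t):
        counts = [b - a for a, b in zip(prefix[s], prefix[t])]
        return segment_cost(counts, t - s)

    inf = float("inf")
    # dp[k][t] = minimal cost of splitting seq[:t] into exactly k segments
    dp = [[inf] * (n + 1) for _ in range(k_max + 1)]
    back = [[0] * (n + 1) for _ in range(k_max + 1)]
    dp[0][0] = 0.0
    for k in range(1, k_max + 1):
        for t in range(k, n + 1):
            for s in range(k - 1, t):
                c = dp[k - 1][s] + cost(s, t)
                if c < dp[k][t]:
                    dp[k][t], back[k][t] = c, s

    # Choose the number of segments by the penalized criterion
    k_hat = min(range(1, k_max + 1),
                key=lambda k: dp[k][n] + illustrative_penalty(k, n, beta))
    # Backtrack to recover the segment boundaries
    change_points, t = [], n
    for k in range(k_hat, 1, -1):
        t = back[k][t]
        change_points.append(t)
    return k_hat, sorted(change_points)
```

For example, `select_segmentation(list("A" * 30 + "B" * 30), ["A", "B"], k_max=4)` selects two segments with a single change-point at position 30. The quadratic dynamic program is chosen here for readability; faster exact algorithms exist but are not shown.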

#### Article information

Source
Electron. J. Statist., Volume 11, Number 1 (2017), 800-842.

Dates
First available in Project Euclid: 28 March 2017

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1490666425

Digital Object Identifier
doi:10.1214/17-EJS1246

Mathematical Reviews number (MathSciNet)
MR3629016

Zentralblatt MATH identifier
1362.62068

Subjects
Primary: 62G05: Estimation; 62G07: Density estimation
Secondary: 62P10: Applications to biology and medical sciences

#### Citation

Cleynen, Alice; Lebarbier, Emilie. Model selection for the segmentation of multiparameter exponential family distributions. Electron. J. Statist. 11 (2017), no. 1, 800–842. doi:10.1214/17-EJS1246. https://projecteuclid.org/euclid.ejs/1490666425

#### References

• [1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second international symposium on information theory, 267–281.
• [2] Arlot, S., Celisse, A., and Harchaoui, Z. (2012). A kernel multiple change-point algorithm via model selection. arXiv preprint arXiv:1202.3878.
• [3] Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. The Journal of Machine Learning Research 10, 245–279.
• [4] Barron, A., Birgé, L., and Massart, P. (1999). Risk bounds for model selection via penalization. Probability Theory and Related Fields 113, 3, 301–413.
• [5] Bellman, R. (1961). On the approximation of curves by line segments using dynamic programming. Commun. ACM 4, 6, 284. http://portal.acm.org/citation.cfm?id=366611.
• [6] Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam. Springer, New York, 55–87.
• [7] Birgé, L. and Massart, P. (2001). Gaussian model selection. Journal of the European Mathematical Society 3, 3, 203–268.
• [8] Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probability Theory and Related Fields 138, 1–2, 33–73.
• [9] Boys, R. J. and Henderson, D. A. (2004). A Bayesian approach to DNA sequence segmentation. Biometrics 60, 2, 573–588.
• [10] Braun, J. V., Braun, R., and Müller, H.-G. (2000). Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika 87, 2, 301–314.
• [11] Braun, J. V. and Müller, H.-G. (1998). Statistical methods for DNA sequence segmentation. Statistical Science 13, 2, 142–162.
• [12] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Wadsworth and Brooks.
• [13] Brown, L. D. (1986). Fundamentals of statistical exponential families with applications in statistical decision theory. Lecture Notes-monograph series, i–279.
• [14] Castellan, G. (2000). Modified Akaike’s criterion for histogram density estimation. C. R. Acad. Sci., Paris, Sér. I, Math. 330, 8, 729–732.
• [15] Cleynen, A., Dudoit, S., and Robin, S. (2014). Comparing segmentation methods for genome annotation based on RNA-seq data. Journal of Agricultural, Biological, and Environmental Statistics 19, 1, 101–118.
• [16] Cleynen, A., Koskas, M., Lebarbier, E., Rigaill, G., and Robin, S. (2014). Segmentor3IsBack: an R package for the fast and exact segmentation of seq-data. Algorithms for Molecular Biology 9, 6.
• [17] Cleynen, A. and Lebarbier, E. (2014). Segmentation of the Poisson and negative binomial rate models: a penalized estimator. ESAIM: Probability and Statistics.
• [18] Cleynen, A., Luong, T. M., Rigaill, G., and Nuel, G. (2014). Fast estimation of the integrated completed likelihood criterion for change-point detection problems with applications to next-generation sequencing data. Signal Processing 98, 233–242.
• [19] Cleynen, A. and Robin, S. (2016). Comparing change-point location in independent series. Statistics and Computing 26, 1–2, 263–276.
• [20] Durot, C., Lebarbier, E., and Tocquet, A. (2009). Estimating the joint distribution of independent categorical variables via model selection. Bernoulli 15, 2, 475–507.
• [21] Frick, K., Munk, A., and Sieling, H. (2014). Multiscale change point inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 3, 495–580.
• [22] Gassiat, E., Cleynen, A., and Robin, S. (2016). Inference in finite state space non parametric hidden Markov models and applications. Statistics and Computing 26, 1–2, 61–71.
• [23] Harchaoui, Z. and Lévy-Leduc, C. (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association 105, 492.
• [24] Hughes, N. P., Tarassenko, L., and Roberts, S. J. (2003). Markov models for automated ECG interval analysis. Advances in Neural Information Processing Systems 16.
• [25] Johnson, N., Kemp, A., and Kotz, S. (2005). Univariate discrete distributions. John Wiley & Sons, Inc.
• [26] Kakade, S. M., Shamir, O., Sridharan, K., and Tewari, A. (2009). Learning exponential families in high-dimensions: Strong convexity and sparsity. arXiv preprint arXiv:0911.0054.
• [27] Killick, R., Fearnhead, P., and Eckley, I. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association 107, 500, 1590–1598.
• [28] Lai, W. R., Johnson, M. D., Kucherlapati, R., and Park, P. J. (2005). Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21, 19, 3763–3770.
• [29] Lebarbier, E. (2005). Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Processing 85, 4 (Apr.), 717–736.
• [30] Lee, J. D., Sun, Y., and Taylor, J. E. (2013). On model selection consistency of M-estimators with geometrically decomposable penalties. Advances in Neural Information Processing Systems.
• [31] Maidstone, R., Hocking, T., Rigaill, G., and Fearnhead, P. (2016). On optimal multiple changepoint algorithms for large data. Statistics and Computing, 1–15. http://dx.doi.org/10.1007/s11222-016-9636-3.
• [32] Massart, P. (2007). Concentration inequalities and model selection. Springer Verlag.
• [33] Matteson, D. S. and James, N. A. (2014). A nonparametric approach for multiple change point analysis of multivariate data. Journal of the American Statistical Association 109, 505, 334–345.
• [34] Muri, F. (1998). Modelling bacterial genomes using hidden Markov models. Compstat98. Proceedings in Computational Statistics, Eds R. Payne and P. Green, 89–100.
• [35] Rigaill, G. (2010). Pruned dynamic programming for optimal multiple change-point detection. arXiv:1004.0887. http://arxiv.org/abs/1004.0887.
• [36] Rigaill, G., Lebarbier, E., and Robin, S. (2012). Exact posterior distributions and model selection criteria for multiple change-point detection problems. Statistics and Computing 22, 4, 917–929.
• [37] Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1, 1–2, 1–305.
• [38] Yao, Y.-C. (1988). Estimating the number of change-points via Schwarz’ criterion. Statistics & Probability Letters 6, 3 (February), 181–189.
• [39] Zhang, N. R. and Siegmund, D. O. (2007). A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics 63, 1, 22–32.