Assume one observes independent categorical variables or, equivalently, the corresponding multinomial variables. Estimating the distribution of the observed sequence amounts to estimating the expectation of the multinomial sequence. A new estimator for this mean is proposed that is nonparametric, non-asymptotic and implementable even for large sequences. It is a penalized least-squares estimator based on wavelets, with a penalization term inspired by papers of Birgé and Massart. The estimator is proved to satisfy an oracle inequality and to be adaptive in the minimax sense over a class of Besov bodies. The method is embedded in a general framework which also allows us to recover an existing method for segmentation. Beyond theoretical results, a simulation study is reported and an application to real data is provided.
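To make the idea of a wavelet-based penalized least-squares estimator concrete, here is a minimal sketch under stated assumptions. It is not the estimator of the paper: it uses the Haar basis, a sequence length that is a power of two, and a hypothetical constant penalty `pen` per retained coefficient, whereas the paper's penalty is more refined. The function names (`haar_forward`, `haar_inverse`, `penalized_haar_estimate`) are illustrative. The categorical observations are one-hot encoded, and a detail coefficient is kept jointly for all categories when its energy exceeds `pen`, which is the minimizer of the criterion "residual sum of squares plus `pen` times the number of kept coefficients" in an orthonormal basis.

```python
import numpy as np

def haar_forward(x):
    """Orthonormal Haar transform along axis 0 (length must be a power of two)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    if n < 1 or (n & (n - 1)) != 0:
        raise ValueError("sequence length must be a power of two")
    details = []
    while n > 1:
        even, odd = x[0:n:2], x[1:n:2]
        details.append((even - odd) / np.sqrt(2))  # detail coefficients, finest first
        x = (even + odd) / np.sqrt(2)              # coarser approximation
        n //= 2
    return x, details

def haar_inverse(approx, details):
    """Inverse of haar_forward."""
    x = approx
    for d in reversed(details):  # rebuild from coarsest to finest level
        out = np.empty((2 * x.shape[0],) + x.shape[1:])
        out[0::2] = (x + d) / np.sqrt(2)
        out[1::2] = (x - d) / np.sqrt(2)
        x = out
    return x

def penalized_haar_estimate(labels, n_categories, pen):
    """Estimate the mean of the one-hot (multinomial) sequence.

    Minimizes ||Y - t||^2 + pen * (number of kept detail indices) over
    Haar subspaces; `pen` is a hypothetical constant, not the paper's penalty.
    """
    y = np.eye(n_categories)[np.asarray(labels)]  # one-hot rows, shape (n, r)
    approx, details = haar_forward(y)
    kept = []
    for d in details:
        energy = (d ** 2).sum(axis=1, keepdims=True)  # joint energy over categories
        kept.append(np.where(energy > pen, d, 0.0))   # keep index iff energy > pen
    return haar_inverse(approx, kept)

# Usage: a short sequence over 3 categories with a change in the middle.
labels = np.array([0, 0, 0, 0, 1, 1, 2, 1])
estimate = penalized_haar_estimate(labels, n_categories=3, pen=0.5)
```

Because the same coefficient indices are kept for every category, each row of the estimate sums to one. Individual entries are not constrained to lie in [0, 1] in this sketch.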
References
[1] Aerts, M. and Veraverbeke, N. (1995). Bootstrapping a nonparametric polytomous regression model. Math. Methods Statist. 4 189–200.
[2] Akakpo, N. (2008). Estimating a discrete distribution via histogram selection. Technical report, Univ. Paris Sud, Orsay.
[3] Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli 4 329–375.
[4] Birgé, L. and Massart, P. (2000). An adaptive compression algorithm in Besov spaces. Constr. Approx. 16 1–36.
[5] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. 3 203–268.
[6] Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Related Fields 138 33–73.
[7] Boucheron, S., Lugosi, G. and Massart, P. (2003). Concentration inequalities using the entropy method. Ann. Probab. 31 1583–1614.
[8] Braun, J.-V., Braun, R.-K. and Müller, H.-G. (2000). Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika 87 301–314.
[9] Donoho, D.L. and Johnstone, I.M. (1998). Minimax estimation via wavelet shrinkage. Ann. Statist. 26 879–921.
[10] Fu, Y.-X. and Curnow, R.-N. (1990). Maximum likelihood estimation of multiple change points. Biometrika 77 563–573.
[11] Gey, S. and Lebarbier, E. (2008). Using CART to detect multiple change-points in the mean for large samples. Technical report, Preprint SSB n12.
[12] Hoebeke, M., Nicolas, P. and Bessières, P. (2003). MuGeN: simultaneous exploration of multiple genomes and computer analysis results. Bioinformatics 19 859–864.
[13] Lebarbier, E. (2002). Quelques approches pour la détection de ruptures à horizon fini. Ph.D. thesis, Univ. Paris Sud, Orsay.
[14] Lebarbier, E. (2005). Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Processing 85 717–736.
[15] Lebarbier, E. and Nédélec, E. (2007). Change-points detection for discrete sequences via model selection. Technical report, Preprint SSB n9.
[16] Massart, P. (2007). Concentration Inequalities and Model Selection. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23. Lecture Notes in Math. 1896. Berlin: Springer.