The Annals of Statistics

Adaptive estimation for Hawkes processes; application to genome analysis

Patricia Reynaud-Bouret and Sophie Schbath

Full-text: Open access

Abstract

The aim of this paper is to provide a new method for detecting either favored or avoided distances between genomic events along DNA sequences. These events are modeled by a Hawkes process. The biological problem is complex enough to require a nonasymptotic penalized model selection approach. We provide a theoretical penalty that satisfies an oracle inequality even for quite complex families of models. The resulting theoretical estimator is shown to be adaptive minimax for Hölderian functions with regularity in (1/2, 1]; these aspects have not previously been studied for the Hawkes process. Moreover, we introduce an efficient strategy, named Islands, which is not classically used in model selection but turns out to be particularly relevant to the biological question we want to answer. Since a multiplicative constant in the theoretical penalty is not computable in practice, we provide extensive simulations to find a data-driven calibration of this constant. The results obtained on real genomic data are consistent with biological knowledge and possibly refine it.
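To make the modeling assumption concrete, the following is a minimal sketch of how occurrences from a self-exciting Hawkes process could be simulated. It assumes the standard linear intensity λ(t) = ν + Σ h(t − T_i), summed over past events T_i, with a nonnegative, piecewise-constant function h on a bounded support, and it uses the classical thinning technique. The function name `simulate_hawkes` and all parameter names are hypothetical. The sketch does not implement the penalized estimator or the Islands strategy of the paper, and it covers only favored distances (h ≥ 0), not avoided ones.

```python
import random


def simulate_hawkes(nu, heights, support, horizon, seed=0):
    """Simulate event times on [0, horizon) by thinning.

    Assumed model (illustrative): intensity nu + sum of h(t - T_i) over
    past events T_i, where h is piecewise constant on [0, support) with
    the given nonnegative bin heights.
    """
    if nu <= 0 or support <= 0 or not heights:
        raise ValueError("nu and support must be positive, heights non-empty")
    if min(heights) < 0:
        raise ValueError("this sketch only handles nonnegative h")
    bin_width = support / len(heights)
    # The integral of h must be below 1 for the process to be stable.
    if sum(heights) * bin_width >= 1:
        raise ValueError("integral of h must be < 1")

    rng = random.Random(seed)
    h_max = max(heights)
    events = []
    t = 0.0
    while True:
        # Only events closer than `support` can still influence the intensity.
        active = [s for s in events if t - s < support]
        # Upper bound on the intensity until the next accepted event.
        bound = nu + h_max * len(active)
        t += rng.expovariate(bound)
        if t >= horizon:
            break
        # Exact intensity at the candidate time t.
        lam = nu
        for s in active:
            lag = t - s
            if lag < support:
                k = min(int(lag / bin_width), len(heights) - 1)
                lam += heights[k]
        # Accept the candidate with probability lam / bound.
        if rng.random() * bound <= lam:
            events.append(t)
    return events


if __name__ == "__main__":
    # Hypothetical h with a favored distance between 0.2 and 0.3.
    times = simulate_hawkes(nu=1.0, heights=[0.5, 0.0, 1.0],
                            support=0.3, horizon=200.0, seed=1)
    print(len(times), "events simulated")
```

In this toy setting, a large value of h on a given bin means that an event tends to be followed by another one at that distance, which is the kind of favored distance the method aims to detect.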

Article information

Source
Ann. Statist., Volume 38, Number 5 (2010), 2781-2822.

Dates
First available in Project Euclid: 20 July 2010

Permanent link to this document
https://projecteuclid.org/euclid.aos/1279638540

Digital Object Identifier
doi:10.1214/10-AOS806

Mathematical Reviews number (MathSciNet)
MR2722456

Zentralblatt MATH identifier
1200.62135

Subjects
Primary: 62G05: Estimation; 62G20: Asymptotic properties
Secondary: 46N60: Applications in biology and other sciences; 65C60: Computational problems in statistics

Keywords
Hawkes process; model selection; oracle inequalities; data-driven penalty; minimax risk; adaptive estimation; unknown support; genome analysis

Citation

Reynaud-Bouret, Patricia; Schbath, Sophie. Adaptive estimation for Hawkes processes; application to genome analysis. Ann. Statist. 38 (2010), no. 5, 2781–2822. doi:10.1214/10-AOS806. https://projecteuclid.org/euclid.aos/1279638540


