The Annals of Applied Statistics

Poisson point process models solve the “pseudo-absence problem” for presence-only data in ecology

David I. Warton and Leah C. Shepherd


Abstract

Presence-only data, point locations where a species has been recorded as being present, are often used in modeling the distribution of a species as a function of a set of explanatory variables—whether to map species occurrence, to understand its association with the environment, or to predict its response to environmental change. Currently, ecologists most commonly analyze presence-only data by adding randomly chosen “pseudo-absences” to the data such that it can be analyzed using logistic regression, an approach which has weaknesses in model specification, in interpretation, and in implementation. To address these issues, we propose Poisson point process modeling of the intensity of presences. We also derive a link between the proposed approach and logistic regression—specifically, we show that as the number of pseudo-absences increases (in a regular or uniform random arrangement), logistic regression slope parameters and their standard errors converge to those of the corresponding Poisson point process model. We discuss the practical implications of these results. In particular, point process modeling offers a framework for choice of the number and location of pseudo-absences, both of which are currently chosen by ad hoc and sometimes ineffective methods in ecology, a point which we illustrate by example.
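The convergence result stated in the abstract can be illustrated numerically. The following is a minimal sketch under simplifying assumptions that are not taken from the paper: a one-dimensional study region [0, 1], a single covariate equal to location, simulated presences, and pseudo-absences placed on a regular grid. All function names are hypothetical, and the fitting routines are plain Newton iterations written for illustration, not the authors' implementation.

```python
import math
import random


def simulate_presences(n, slope, seed):
    """Draw n locations on [0, 1] with density proportional to exp(slope * x)."""
    rng = random.Random(seed)
    scale = math.exp(slope) - 1.0
    # Inverse-CDF sampling for the exponentially tilted density.
    return [math.log(1.0 + rng.random() * scale) / slope for _ in range(n)]


def regular_grid(m):
    """Midpoints of m equal-width cells on [0, 1]."""
    return [(j + 0.5) / m for j in range(m)]


def fit_ppm_slope(presences, n_quad=5000, tol=1e-10):
    """Slope of a log-linear Poisson point process, integral done by quadrature.

    The intercept profiles out of the Poisson likelihood, leaving one
    equation: mean(presences) = mean of x under the fitted intensity.
    """
    quad = regular_grid(n_quad)
    xbar = sum(presences) / len(presences)
    b = 0.0
    for _ in range(100):
        w = [math.exp(b * u) for u in quad]
        total = sum(w)
        mean = sum(wi * u for wi, u in zip(w, quad)) / total
        var = sum(wi * u * u for wi, u in zip(w, quad)) / total - mean ** 2
        step = (xbar - mean) / var  # Newton step on the profile score
        b += step
        if abs(step) < tol:
            break
    return b


def fit_logistic_slope(presences, pseudo_absences, tol=1e-10):
    """Slope of a logistic regression of presence (1) vs pseudo-absence (0)."""
    xs = presences + pseudo_absences
    ys = [1.0] * len(presences) + [0.0] * len(pseudo_absences)
    a = math.log(len(presences) / len(pseudo_absences))  # null-model start
    b = 0.0
    for _ in range(100):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            r, w = y - p, p * (1.0 - p)
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det  # solve the 2x2 Newton system
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return b


presences = simulate_presences(n=200, slope=1.5, seed=1)
ppm_slope = fit_ppm_slope(presences)
gaps = {}
for m in (20, 200, 2000, 20000):
    logit_slope = fit_logistic_slope(presences, regular_grid(m))
    gaps[m] = abs(logit_slope - ppm_slope)
    print(f"pseudo-absences: {m:6d}   |logistic - PPM slope| = {gaps[m]:.5f}")
```

In this toy setting, the gap between the two slope estimates should shrink toward zero as the number of pseudo-absences grows. The sketch compares slopes only; the abstract's statement about standard errors is not checked here.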

Article information

Source
Ann. Appl. Stat. Volume 4, Number 3 (2010), 1383-1402.

Dates
First available in Project Euclid: 18 October 2010

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1287409378

Digital Object Identifier
doi:10.1214/10-AOAS331

Mathematical Reviews number (MathSciNet)
MR2758333

Zentralblatt MATH identifier
1202.62171

Keywords
Habitat modeling; quadrature points; occurrence data; pseudo-absences; species distribution modeling

Citation

Warton, David I.; Shepherd, Leah C. Poisson point process models solve the “pseudo-absence problem” for presence-only data in ecology. Ann. Appl. Stat. 4 (2010), no. 3, 1383–1402. doi:10.1214/10-AOAS331. https://projecteuclid.org/euclid.aoas/1287409378


