The Annals of Applied Statistics

Finite-sample equivalence in statistical models for presence-only data

William Fithian and Trevor Hastie


Abstract

Statistical modeling of presence-only data has attracted much recent attention in the ecological literature, leading to a proliferation of methods, including the inhomogeneous Poisson process (IPP) model, maximum entropy (Maxent) modeling of species distributions, and logistic regression models. Several recent articles have shown the close relationships between these methods. We explain why the IPP intensity function is a more natural object of inference in presence-only studies than occurrence probability (which is defined only with reference to quadrat size), and why presence-only data allow estimation only of relative, and not absolute, intensity of species occurrence.
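
As a brief illustration of the quadrat-size dependence, assume a Poisson process with a continuous intensity $\lambda$. The symbols $A$ (a quadrat), $|A|$ (its area), $N(A)$ (the number of occurrences in $A$) and $z_0$ are notation introduced here for illustration and do not appear in the abstract. The probability of at least one occurrence in $A$ is then

```latex
\Pr\{N(A) \ge 1\}
  = 1 - \exp\!\Big(-\int_A \lambda(z)\,dz\Big)
  \approx \lambda(z_0)\,|A|
  \quad \text{for small } |A|,\ z_0 \in A .
```

Under this assumption the occurrence probability changes with the size of the quadrat, whereas $\lambda$ itself does not.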

All three of these techniques amount to parametric density estimation under the same exponential family model (in the case of the IPP, the fitted density is multiplied by the number of presence records to obtain a fitted intensity). We show that IPP and Maxent give exactly the same estimate for this density, but logistic regression in general yields a different estimate in finite samples. When the model is misspecified, as it practically always is, logistic regression and the IPP may have substantially different asymptotic limits. We propose “infinitely weighted logistic regression,” which is exactly equivalent to the IPP in finite samples. Consequently, many already-implemented methods extending logistic regression can also extend the Maxent and IPP models in directly analogous ways using this technique.
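
The equivalence can be checked numerically with a minimal sketch, which is not the authors' implementation. It makes several assumptions: the data are simulated, the function names `fit_maxent` and `fit_weighted_logistic` are hypothetical, the Maxent-style density is taken over the background points only, both fits use a plain Newton iteration, and a large finite weight `W` on the background points stands in for the infinite-weight limit.

```python
import numpy as np

def fit_maxent(X_pres, X_bg, n_iter=50):
    """Slope of the density p_j proportional to exp(x_j @ beta) over background points,
    fitted by Newton's method on the mean presence log-density."""
    beta = np.zeros(X_bg.shape[1])
    target = X_pres.mean(axis=0)              # presence feature means
    for _ in range(n_iter):
        s = X_bg @ beta
        p = np.exp(s - s.max())
        p /= p.sum()                          # fitted density on background points
        mean = p @ X_bg                       # model feature means
        centered = X_bg - mean
        cov = (centered * p[:, None]).T @ centered
        beta = beta + np.linalg.solve(cov, target - mean)
    return beta

def fit_weighted_logistic(X_pres, X_bg, W, n_iter=50):
    """Logistic regression (presence = 1, background = 0) with weight W on
    every background point, fitted by Newton's method; returns the slopes."""
    n1, n0 = len(X_pres), len(X_bg)
    Z = np.column_stack([np.ones(n1 + n0), np.vstack([X_pres, X_bg])])
    y = np.concatenate([np.ones(n1), np.zeros(n0)])
    w = np.concatenate([np.ones(n1), np.full(n0, W)])
    theta = np.zeros(Z.shape[1])
    theta[0] = np.log(n1 / (W * n0))          # start at the intercept-only fit
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ theta)))
        grad = Z.T @ (w * (y - p))
        hess = (Z * (w * p * (1.0 - p))[:, None]).T @ Z
        theta = theta + np.linalg.solve(hess, grad)
    return theta[1:]                          # drop the intercept

# Simulated (assumed) data: two covariates, 500 background and 100 presence points.
rng = np.random.default_rng(0)
X_bg = rng.uniform(-1.0, 1.0, size=(500, 2))
X_pres = rng.normal([0.3, -0.2], 0.4, size=(100, 2))

beta_maxent = fit_maxent(X_pres, X_bg)
beta_lr = fit_weighted_logistic(X_pres, X_bg, W=1.0)     # ordinary logistic regression
beta_iwlr = fit_weighted_logistic(X_pres, X_bg, W=1e6)   # heavily weighted background

gap_lr = np.max(np.abs(beta_lr - beta_maxent))
gap_iwlr = np.max(np.abs(beta_iwlr - beta_maxent))
print("unweighted gap:", gap_lr, "large-weight gap:", gap_iwlr)
```

In this sketch the slopes from the heavily weighted fit should agree with the Maxent-style slopes to within a small numerical tolerance, while the unweighted logistic slopes generally differ. Only the slopes are compared, because the intercept of the weighted fit depends on `W`.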

Article information

Source
Ann. Appl. Stat. Volume 7, Number 4 (2013), 1917-1939.

Dates
First available in Project Euclid: 23 December 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1387823304

Digital Object Identifier
doi:10.1214/13-AOAS667

Mathematical Reviews number (MathSciNet)
MR3161707

Zentralblatt MATH identifier
1283.62225

Keywords
Presence-only data; logistic regression; maximum entropy; Poisson process models; species modeling; case-control sampling

Citation

Fithian, William; Hastie, Trevor. Finite-sample equivalence in statistical models for presence-only data. Ann. Appl. Stat. 7 (2013), no. 4, 1917–1939. doi:10.1214/13-AOAS667. https://projecteuclid.org/euclid.aoas/1387823304.


