The Annals of Statistics

Perturbation and scaled Cook’s distance

Hongtu Zhu, Joseph G. Ibrahim, and Hyunsoon Cho

Full-text: Open access

Abstract

Cook’s distance [Technometrics 19 (1977) 15–18] is one of the most important diagnostic tools for detecting influential individual observations or subsets of observations in linear regression for cross-sectional data. However, for many complex data structures (e.g., longitudinal data), no rigorous approach has been developed to address a fundamental issue: deleting subsets with different numbers of observations introduces different degrees of perturbation to the current model fitted to the data, and the magnitude of Cook’s distance is associated with the degree of the perturbation. The aim of this paper is to address this issue in general parametric models with complex data structures. We propose a new quantity for measuring the degree of the perturbation introduced by deleting a subset. We use stochastic ordering to quantify the stochastic relationship between the degree of the perturbation and the magnitude of Cook’s distance. We develop several scaled Cook’s distances to resolve the comparison of Cook’s distance for different subset deletions. Theoretical and numerical examples are examined to highlight the broad spectrum of applications of these scaled Cook’s distances in a formal influence analysis.
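
The size issue described in the abstract can be illustrated with a minimal sketch, assuming ordinary least squares and the classical subset-deletion form of Cook’s distance. This is not the scaled Cook’s distance proposed in the paper; the function name `cooks_distance_subset`, the simulated data, and the subset sizes are hypothetical choices made only for illustration. In a correctly specified model with no outliers, deleting larger subsets still tends to produce larger raw Cook’s distances.

```python
import numpy as np

def cooks_distance_subset(X, y, subset):
    """Classical Cook's distance for deleting the rows in `subset` from an OLS fit.

    D_I = (b - b_(I))' X'X (b - b_(I)) / (p * s^2), where b is the full-data
    estimate, b_(I) the estimate without the subset, and s^2 the full-data
    residual variance estimate.
    """
    n, p = X.shape
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_full
    sigma2 = resid @ resid / (n - p)

    keep = np.ones(n, dtype=bool)
    keep[np.asarray(subset)] = False
    beta_del, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)

    diff = beta_full - beta_del
    return float(diff @ (X.T @ X) @ diff / (p * sigma2))

# Simulated data (an assumption for illustration): correct model, no outliers.
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

def mean_distance(size, reps=200):
    """Average Cook's distance over random subsets of a given size."""
    values = [
        cooks_distance_subset(X, y, rng.choice(n, size=size, replace=False))
        for _ in range(reps)
    ]
    return float(np.mean(values))

mean_d1 = mean_distance(1)    # single-observation deletions
mean_d10 = mean_distance(10)  # deletions of ten observations at a time
print("mean Cook's distance, subset size 1: ", mean_d1)
print("mean Cook's distance, subset size 10:", mean_d10)
```

Because the raw distances grow with subset size even when nothing is unusual in the data, comparing them directly across subsets of different sizes can be misleading, which is the motivation the abstract gives for scaling.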

Article information

Source
Ann. Statist. Volume 40, Number 2 (2012), 785-811.

Dates
First available in Project Euclid: 17 May 2012

Permanent link to this document
https://projecteuclid.org/euclid.aos/1337268212

Digital Object Identifier
doi:10.1214/12-AOS978

Mathematical Reviews number (MathSciNet)
MR2933666

Zentralblatt MATH identifier
1273.62180

Subjects
Primary: 62J20: Diagnostics

Keywords
Cook’s distance; perturbation; relative influential; conditionally scaled Cook’s distance; scaled Cook’s distance; size issue

Citation

Zhu, Hongtu; Ibrahim, Joseph G.; Cho, Hyunsoon. Perturbation and scaled Cook’s distance. Ann. Statist. 40 (2012), no. 2, 785–811. doi:10.1214/12-AOS978. https://projecteuclid.org/euclid.aos/1337268212


References

  • [1] Andersen, E. B. (1992). Diagnostics in categorical data analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 54 781–791.
  • [2] Andrews, D. W. K. (1999). Estimation when a parameter is on a boundary. Econometrica 67 1341–1383.
  • [3] Banerjee, M. (1998). Cook’s distance in linear longitudinal models. Comm. Statist. Theory Methods 27 2973–2983.
  • [4] Banerjee, M. and Frees, E. W. (1997). Influence diagnostics for linear longitudinal models. J. Amer. Statist. Assoc. 92 999–1005.
  • [5] Beckman, R. J. and Cook, R. D. (1983). Outlier…s. Technometrics 25 119–163.
  • [6] Chatterjee, S. and Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. Wiley, New York.
  • [7] Christensen, R., Pearson, L. M. and Johnson, W. (1992). Case-deletion diagnostics for mixed models. Technometrics 34 38–45.
  • [8] Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics 19 15–18.
  • [9] Cook, R. D. (1986). Assessment of local influence. J. Roy. Statist. Soc. Ser. B 48 133–169.
  • [10] Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman & Hall, London.
  • [11] Critchley, F., Atkinson, R. A., Lu, G. and Biazi, E. (2001). Influence analysis based on the case sensitivity function. J. R. Stat. Soc. Ser. B Stat. Methodol. 63 307–323.
  • [12] Davison, A. C. and Tsai, C. L. (1992). Regression model diagnostics. International Statistical Review 60 337–353.
  • [13] Eaton, M. L. and Tyler, D. E. (1991). On Wielandt’s inequality and its application to the asymptotic distribution of the eigenvalues of a random symmetric matrix. Ann. Statist. 19 260–271.
  • [14] Fung, W.-K., Zhu, Z.-Y., Wei, B.-C. and He, X. (2002). Influence diagnostics and outlier tests for semiparametric mixed models. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 565–579.
  • [15] Haslett, J. (1999). A simple derivation of deletion diagnostic results for the general linear model with correlated errors. J. R. Stat. Soc. Ser. B Stat. Methodol. 61 603–609.
  • [16] Huber, P. J. (1981). Robust Statistics. Wiley, New York.
  • [17] Lin, D. Y., Wei, L. J. and Ying, Z. (2002). Model-checking techniques based on cumulative residuals. Biometrics 58 1–12.
  • [18] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman & Hall/CRC, Boca Raton.
  • [19] Preisser, J. S. and Qaqish, B. F. (1996). Deletion diagnostics for generalised estimating equations. Biometrika 83 551–562.
  • [20] Shaked, M. and Shanthikumar, G. J. (2006). Stochastic Orders. Springer, New York.
  • [21] Stier, D. M., Leventhal, J. M., Berg, A. T., Johnson, L. and Mezger, J. (1993). Are children born to young mothers at increased risk of maltreatment? Pediatrics 91 642–648.
  • [22] Wasserman, D. R. and Leventhal, J. M. (1993). Maltreatment of children born to cocaine-dependent mothers. Am. J. Dis. Child. 147 1324–1328.
  • [23] Wei, B.-C. (1998). Exponential Family Nonlinear Models. Lecture Notes in Statist. 130. Springer, Singapore.
  • [24] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50 1–25.
  • [25] White, H. (1994). Estimation, Inference and Specification Analysis. Econometric Society Monographs 22. Cambridge Univ. Press, Cambridge.
  • [26] Zhang, H. (1999). Analysis of infant growth curves using multivariate adaptive splines. Biometrics 55 452–459.
  • [27] Zhu, H. and Ibrahim, J. G. (2011). Supplement to “Perturbation and scaled Cook’s distance.” DOI:10.1214/12-AOS978SUPP.
  • [28] Zhu, H., Ibrahim, J. G., Lee, S. and Zhang, H. (2007). Perturbation selection and influence measures in local influence analysis. Ann. Statist. 35 2565–2588.
  • [29] Zhu, H., Lee, S.-Y., Wei, B.-C. and Zhou, J. (2001). Case-deletion measures for models with incomplete data. Biometrika 88 727–737.
  • [30] Zhu, H. and Zhang, H. (2006). Asymptotics for estimation and testing procedures under loss of identifiability. J. Multivariate Anal. 97 19–45.