## The Annals of Statistics

### Bridging centrality and extremity: Refining empirical data depth using extreme value statistics

#### Abstract

Statistical depth measures the centrality of a point with respect to a given distribution or data cloud. It provides a natural center-outward ordering of multivariate data points and yields a systematic nonparametric multivariate analysis scheme. In particular, the half-space depth has been shown to have many desirable properties and broad applicability. However, the empirical half-space depth is zero outside the convex hull of the data. This property renders the empirical half-space depth useless outside the data cloud and limits its utility in applications where the extreme outlying probability mass is the focal point, such as classification problems and control charts with very small false alarm rates. To address this issue, we apply extreme value statistics to refine the empirical half-space depth in “the tail.” This provides an important link between data depth, which is useful for inference on centrality, and extreme value statistics, which is useful for inference on extremity. The refined empirical half-space depth thus extends its utility beyond the data cloud and greatly broadens its applicability. The refined estimator is shown, in theory and in simulations, to improve substantially on the empirical estimator. The benefit of this improvement is also demonstrated through applications in classification and statistical process control.
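The limitation described in the abstract can be seen in a small numerical sketch. The code below is an illustration only, not the refined estimator of the paper: it approximates the empirical half-space depth of a point in the plane by minimizing, over a finite grid of directions, the fraction of sample points in a closed half-plane through that point. The function name `halfspace_depth`, the direction grid, and the simulated Gaussian sample are all assumptions made for this example.

```python
import math
import random


def halfspace_depth(x, data, n_directions=360):
    """Approximate empirical half-space depth of a 2-D point x.

    For each unit direction u on an equally spaced grid, compute the fraction
    of sample points in the closed half-plane {y : u.y >= u.x}, and return the
    smallest fraction found. Because only finitely many directions are tried,
    the result can overestimate the exact depth but never underestimate it.
    """
    n = len(data)
    depth = 1.0
    for k in range(n_directions):
        theta = 2.0 * math.pi * k / n_directions
        ux, uy = math.cos(theta), math.sin(theta)
        threshold = ux * x[0] + uy * x[1]
        count = sum(1 for (a, b) in data if ux * a + uy * b >= threshold)
        depth = min(depth, count / n)
    return depth


# Hypothetical sample: 500 draws from a standard bivariate normal distribution.
random.seed(0)
sample = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]

central = halfspace_depth((0.0, 0.0), sample)    # point deep inside the cloud
outside = halfspace_depth((5.0, 5.0), sample)    # point beyond the sample
far_out = halfspace_depth((50.0, 50.0), sample)  # point much farther out

print(central, outside, far_out)
```

In this sketch the two outlying points both receive depth zero, so the empirical depth cannot rank them against each other even though one is far more extreme. This is the gap in the tail that the abstract proposes to close with extreme value statistics.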

#### Article information

Source
Ann. Statist., Volume 43, Number 6 (2015), 2738–2765.

Dates
Revised: June 2015
First available in Project Euclid: 7 October 2015

Permanent link to this document
https://projecteuclid.org/euclid.aos/1444222091

Digital Object Identifier
doi:10.1214/15-AOS1359

Mathematical Reviews number (MathSciNet)
MR3405610

Zentralblatt MATH identifier
1327.62205

#### Citation

Einmahl, John H. J.; Li, Jun; Liu, Regina Y. Bridging centrality and extremity: Refining empirical data depth using extreme value statistics. Ann. Statist. 43 (2015), no. 6, 2738–2765. doi:10.1214/15-AOS1359. https://projecteuclid.org/euclid.aos/1444222091
