The Annals of Statistics

Multivariate analysis by data depth: descriptive statistics, graphics and inference, (with discussion and a rejoinder by Liu and Singh)

Regina Y. Liu, Jesse M. Parelius, and Kesar Singh

Full-text: Open access

Abstract

A data depth can be used to measure the “depth” or “outlyingness” of a given multivariate sample with respect to its underlying distribution. This leads to a natural center-outward ordering of the sample points. Based on this ordering, quantitative and graphical methods are introduced for analyzing multivariate distributional characteristics such as location, scale, bias, skewness and kurtosis, as well as for comparing inference methods. All graphs are one-dimensional curves in the plane and can be easily visualized and interpreted. A “sunburst plot” is presented as a bivariate generalization of the box-plot. DD-(depth versus depth) plots are proposed and examined as graphical inference tools. Some new diagnostic tools for checking multivariate normality are introduced. One of them monitors the exact rate of growth of the maximum deviation from the mean, while the others examine the ratio of the overall dispersion to the dispersion of a certain central region. The affine invariance property of a data depth also leads to appropriate invariance properties for the proposed statistics and methods.
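As a rough illustration of the center-outward ordering and the DD-plot described above, the following minimal sketch uses the Mahalanobis depth, $1/(1 + (x-\mu)^T \Sigma^{-1} (x-\mu))$, with the sample mean and sample covariance plugged in. The choice of depth, the function names and the simulated data are assumptions made for this example only; the article also covers other depths (for instance simplicial and halfspace depth), which would need different code.

```python
import numpy as np

def mahalanobis_depth(points, sample):
    """Mahalanobis depth of each row of `points` relative to `sample`.

    Assumption: the sample mean and sample covariance stand in for the
    location and scatter of the underlying distribution.
    """
    mean = sample.mean(axis=0)
    cov = np.cov(sample, rowvar=False)
    diff = points - mean
    # Squared Mahalanobis distance of each point from the sample mean.
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return 1.0 / (1.0 + d2)

def center_outward_order(sample):
    """Indices of the sample points, from deepest to most outlying."""
    depths = mahalanobis_depth(sample, sample)
    return np.argsort(-depths, kind="stable")

def dd_plot_coordinates(sample_x, sample_y):
    """One (depth w.r.t. X, depth w.r.t. Y) pair per point of the pooled sample."""
    pooled = np.vstack([sample_x, sample_y])
    return np.column_stack([
        mahalanobis_depth(pooled, sample_x),
        mahalanobis_depth(pooled, sample_y),
    ])

# Hypothetical data: two bivariate normal samples differing by a location shift.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = rng.normal(size=(200, 2)) + np.array([1.0, 0.0])

order = center_outward_order(x)   # order[0] indexes the deepest point of x
dd = dd_plot_coordinates(x, y)    # scatter dd[:, 0] against dd[:, 1] for a DD-plot
```

When the two samples come from the same distribution, the DD-plot points should cluster around the diagonal; a location shift such as the one simulated here tends to pull them away from it. Because the Mahalanobis depth is affine invariant, the ordering does not change if the data are linearly rescaled or rotated.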

Article information

Source
Ann. Statist. Volume 27, Number 3 (1999), 783-858.

Dates
First available: 5 April 2002

Permanent link to this document
http://projecteuclid.org/euclid.aos/1018031260

Mathematical Reviews number (MathSciNet)
MR1724033

Digital Object Identifier
doi:10.1214/aos/1018031260

Subjects
Primary: 62H05: Characterization and structure theory; 62J20: Diagnostics; 62-09: Graphical methods

Keywords
Multivariate descriptive statistics; multivariate ordering; multivariate normality; data depth; depth ordering; depth-$L$-statistics; location; scale; bias; skewness; kurtosis; sunburst plots; $DD$-plots

Citation

Liu, Regina Y.; Parelius, Jesse M.; Singh, Kesar. Multivariate analysis by data depth: descriptive statistics, graphics and inference, (with discussion and a rejoinder by Liu and Singh). The Annals of Statistics 27 (1999), no. 3, 783--858. doi:10.1214/aos/1018031260. http://projecteuclid.org/euclid.aos/1018031260.


References

  • ANDERSON, T. (1984). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
  • ANDREWS, D. (1972). Plots of high-dimensional data. Biometrics 28 125-136.
  • ARCONES, M., CHEN, Z. and GINÉ, E. (1994). Estimators related to U-processes with applications to multivariate medians: asymptotic normality. Ann. Statist. 22 1460-1477.
  • AVÉROUS, J. and MESTE, M. (1997). Skewness for multivariate distributions: two approaches. Ann. Statist. 25 1984-1997.
  • BARNETT, V. (1976). The ordering of multivariate data. J. Roy. Statist. Soc. Ser. A 139 319-354.
  • BERAN, R. (1979). Testing for ellipsoidal symmetry of a multivariate density. Ann. Statist. 7 150-162.
  • BERAN, R. and MILLAR, P. (1997). Multivariate symmetry models. In Festschrift for Lucien Le Cam (L. Le Cam, E. Torgersen and G. Yang, eds.) 13-42. Springer, New York.
  • BICKEL, P. and LEHMANN, E. (1975a). Descriptive statistics for nonparametric models I. Introduction. Ann. Statist. 3 1038-1044.
  • BICKEL, P. and LEHMANN, E. (1975b). Descriptive statistics for nonparametric models II. Location. Ann. Statist. 3 1045-1069.
  • BICKEL, P. and LEHMANN, E. (1976). Descriptive statistics for nonparametric models III. Dispersion. Ann. Statist. 4 1139-1158.
  • BICKEL, P. and LEHMANN, E. (1979). Descriptive statistics for nonparametric models IV. Spread. In Contributions to Statistics, Hájek Memorial Volume (J. Jurečková, ed.) 33-40. Reidel, London.
  • BROWN, B. and HETTMANSPERGER, T. (1989). The affine invariant bivariate version of the sign test. J. Roy. Statist. Soc. Ser. B 51 117-125.
  • CHAUDHURI, P. (1996). On a geometric notion of quantiles for multivariate data. J. Amer. Statist. Assoc. 91 862-872.
  • CHENG, A., LIU, R. and LUXHOJ, J. (1999). Monitoring multivariate aviation safety data: control charts and threshold systems. IIE Transactions. To appear.
  • CHERNOFF, H. (1973). The use of faces to represent points in k-dimensional space graphically. J. Amer. Statist. Assoc. 68 361-368.
  • DONOHO, D. and GASKO, M. (1992). Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Ann. Statist. 20 1803-1827.
  • DÜMBGEN, L. (1992). Limit theorems for simplicial depth. Statist. Probab. Lett. 14 119-128.
  • EASTON, G. and MCCULLOCH, R. (1990). A multivariate generalization of quantile-quantile plots. J. Amer. Statist. Assoc. 85 376-386.
  • EDDY, W. (1982). Convex hull peeling. In COMPSTAT (H. Caussinus et al., eds.) 42-47. Physica, Vienna.
  • EINMAHL, J. and MASON, D. (1992). Generalized quantile process. Ann. Statist. 20 1062-1078.
  • FRAIMAN, R., LIU, R. and MELOCHE, J. (1997). Multivariate density estimation by probing depth. In $L_1$-Statistical Procedures and Related Topics 415-430. IMS, Hayward, CA.
  • FRAIMAN, R. and MELOCHE, J. (1996). Multivariate L-estimation. Preprint.
  • FRIEDMAN, J. and RAFSKY, L. (1979). Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann. Statist. 7 697-717.
  • FRIEDMAN, J. and RAFSKY, L. (1981). Graphics for the multivariate two-sample problem (with comments). J. Amer. Statist. Assoc. 76 277-295.
  • GASTWIRTH, J. (1971). A general definition of the Lorenz curve. Econometrica 39 1037-1039.
  • GNANADESIKAN, R. (1997). Methods for Statistical Data Analysis of Multivariate Observations, 2nd ed. Wiley, New York.
  • HE, X. and WANG, G. (1997). Convergence of depth contours for multivariate datasets. Ann. Statist. 25 495-504.
  • HETTMANSPERGER, T. (1984). Statistical Inference Based on Ranks. Wiley, New York.
  • HETTMANSPERGER, T., NYBLOM, J. and OJA, H. (1992). On multivariate notions of sign and rank. In L-1 Statistical and Related Methods (Y. Dodge, ed.) 267-278. North-Holland, Amsterdam.
  • HETTMANSPERGER, T. and OJA, H. (1994). Affine invariant multivariate multisample sign tests. J. Roy. Statist. Soc. Ser. B 56 235-249.
  • HODGES, J. (1955). A bivariate sign test. Ann. Math. Statist. 26 523-527.
  • HUBER, P. (1972). Robust statistics: a review. Ann. Math. Statist. 43 1041-1067.
  • HÜSLER, J., LIU, R. and SINGH, K. (1999). A formula for the tail probability of a multivariate normal distribution and its applications. Preprint.
  • KENDALL, M., STUART, A. and ORD, J. K. (1987). Kendall's Advanced Theory of Statistics 1. Oxford Univ. Press.
  • KLEINER, B. and HARTIGAN, J. (1981). Representing points in many dimensions by trees and castles (with comments). J. Amer. Statist. Assoc. 76 260-276.
  • KOLTCHINSKII, V. (1997). M-estimation, convexity and quantiles. Ann. Statist. 25 435-477.
  • LEHMANN, E. (1991). Theory of Point Estimation. Wadsworth and Brooks/Cole, Belmont, CA.
  • LIU, R. (1990). On a notion of data depth based on random simplices. Ann. Statist. 18 405-414.
  • LIU, R. (1992). Data depth and multivariate rank tests. In L-1 Statistics and Related Methods (Y. Dodge, ed.) 279-294. North-Holland, Amsterdam.
  • LIU, R. (1995). Control charts for multivariate processes. J. Amer. Statist. Assoc. 90 1380-1388.
  • LIU, R. and SINGH, K. (1993). A quality index based on data depth and multivariate rank tests. J. Amer. Statist. Assoc. 88 257-260.
  • LIU, R. and SINGH, K. (1997). Notions of limiting P-values based on data depth and bootstrap. J. Amer. Statist. Assoc. 92 266-277.
  • LORENZ, M. (1905). Methods of measuring the concentration of wealth. J. Amer. Statist. Assoc. 9 209-219.
  • MAHALANOBIS, P. C. (1936). On the generalized distance in statistics. Proc. Nat. Acad. Sci. India 12 49-55.
  • MARDEN, J. (1998). Bivariate qq-plot. Statist. Sinica 8 813-826.
  • MARDIA, K., KENT, J. and BIBBY, J. (1979). Multivariate Analysis. Academic Press, New York.
  • MUIRHEAD, R. (1982). Aspects of Multivariate Statistical Theory. Wiley, New York.
  • NOLAN, D. (1992). Asymptotics for multivariate trimming. Stochastic Process. Appl. 42 157-169.
  • OJA, H. (1983). Descriptive statistics for multivariate distributions. Statist. Probab. Lett. 1 327-332.
  • PARELIUS, J. (1997). Multivariate analysis based on data depth. Ph.D. dissertation, Dept. Statistics, Rutgers Univ., New Jersey.
  • ROUSSEEUW, P. and HUBERT, M. (1999). Regression depth (with discussion). J. Amer. Statist. Assoc. 94 388-433.
  • ROUSSEEUW, P. J. and LEROY, A. M. (1987). Robust Regression and Outlier Detection. Wiley, New York.
  • ROUSSEEUW, P. and RUTS, I. (1996). AS 307: bivariate location depth. Appl. Statist. 45 516-526.
  • ROUSSEEUW, P. and RUTS, I. (1997). The bagplot: a bivariate box-and-whiskers plot. Preprint.
  • ROUSSEEUW, P. and STRUYF, A. (1998). Computing location depth and regression depth in higher dimensions. Statist. Comput. 8 193-203.
  • RUTS, I. and ROUSSEEUW, P. (1996). Computing depth contours of bivariate point clouds. Computational Statistics and Data Analysis 23 153-168.
  • SINGH, K. (1991). Majority depth. Unpublished manuscript.
  • SINGH, K. (1998). Breakdown theory for bootstrap quantiles. Ann. Statist. 26 1719-1732.
  • TUKEY, J. (1975). Mathematics and picturing data. In Proceedings of the 1975 International Congress of Mathematics 2 523-531.
  • WEGMAN, E. (1990). Hyperdimensional data analysis using parallel coordinates. J. Amer. Statist. Assoc. 85 664-675.
  • YEH, A. and SINGH, K. (1997). Balanced confidence sets based on the Tukey depth. J. Roy. Statist. Soc. Ser. B 59 639-652.
  • BECKER, R. A., CLEVELAND, W. S. and WILKS, A. R. (1987). Dynamic graphics for data analysis (with discussion). Statist. Sci. 2 353-395.
  • MOSTELLER, F. and TUKEY, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA.
  • SCHERVISH, M. J. (1987). Multivariate analysis (with discussion). Statist. Sci. 2 396-433.
  • TUKEY, J. W. (1962). The future of data analysis. Ann. Math. Statist. 33 1-67. Correction: 33 812.
  • TUKEY, J. W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.
  • $\Sigma = \lambda U C U^T$, where $\lambda^p$ is the generalized variance, the orthogonal matrix $U$ contains the eigenvectors and $C$ is the diagonal matrix of standardized eigenvalues ($\det(C) = 1$). As in Bensmail and Celeux (1996), we use the terms scale, shape and orientation for items $\lambda$, $C$ and $U$. If $z$ comes from a spherical distribution with the location vector $0$ and covariance matrix $I$, then $y = \lambda^{1/2} U C^{1/2} z + \mu$ is elliptically symmetric with the location vector $\mu$, scale $\lambda$, shape $C$ and orientation $U$. Our plan is to first define a multivariate centered rank vector. This vector, in many ways, represents an extension of the idea of a univariate rank. In addition, it has certain nice affine equivariance properties. We only provide a sketch here; see Hettmansperger, Möttönen and Oja (1998) or Oja (1999) for details. We then consider the rank covariance matrix (RCM). Visuri, Koivunen and Oja (1999) show that if the standardized eigenvalues and the eigenvectors of the covariance matrix are $c_1, \ldots, c_p$ and $u_1, \ldots, u_p$, respectively, then $c_1^{-1}, \ldots, c_p^{-1}$ and $u_1, \ldots, u_p$ are the standardized eigenvalues and the eigenvectors for the theoretical RCM. The sample RCM is more robust than the sample covariance matrix and, hence, provides a robust estimate of the underlying shape and orientation for the elliptical distribution. This, along with a robust estimate of Wilks' generalized variance, can be used to robustly estimate $\Sigma$. However, here we use only the standardized eigenvalues and the eigenvectors to define a robust version of depth. We next sketch the construction of the rank vector and corresponding sample RCM. We begin with $p$-dimensional data $x_1, \ldots, x_n$. The volume of the $p$-variate simplex determined by $x$ and $p$ observation vectors with indices $i_1, \ldots, i_p$ is
  • , shape $C$ or orientation $U$. The log scale facilitates comparison of scale near the centers. Compare these plots to Figure 7(a), (b) in the paper. The other nice application discussed by the authors is for the comparison of scatter of the multivariate estimates of location; see Figure 8(a), (b), (c) in the paper. The comparison based on ellipses would be quite natural here since, typically, the estimators will have multivariate normal limiting distributions. Another way to compare scales for two distributions is to look at a PP-plot of the elliptical areas for the two samples. Essentially, it is a plot of the empirical cdf's of the elliptical areas determined by the data in each sample. Figure 3 shows a PP-scale plot of A versus D. Note that beyond 0.5 the empirical cdf's of the elliptical areas satisfy $F_A(u) \ge F_D(u)$, indicating that D has more scatter or larger scale than A. The area under the curve could provide a measure and, hence, in the elliptical case, an asymptotically distribution-free test for scale differences. The test statistic then is the Mann-Whitney-Wilcoxon U-statistic calculated from the depths. In the univariate case, this corresponds to a rank test based on magnitudes of the centered observations. In the comparison in Figure 4, the observed p-value (one-sided test) is 0.22.
  • BENSMAIL, H. and CELEUX, G. (1996). Regularized Gaussian discriminant analysis through eigenvalue decomposition. J. Amer. Statist. Assoc. 91 1743-1749.
  • HETTMANSPERGER, T. P., MÖTTÖNEN, J. and OJA, H. (1998). Affine invariant multivariate rank tests for several samples. Statist. Sinica 8 785-800.
  • OJA, H. (1999). Affine invariant multivariate sign and rank tests and corresponding estimates: a review. Scand. J. Statist. (invited paper).
  • VISURI, S., KOIVUNEN, V. and OJA, H. (1999). Sign and rank covariance matrices. Conditionally accepted to the J. Statist. Plann. Inference.
  • CHENG, A., LIU, R. and LUXHOJ, J. (1999). Monitoring multivariate processes: control charts, culpability indices, consistency curves and threshold systems. Preprint.
  • CHENG, A. and OUYANG, M. (1998). On algorithms for computing simplicial depth. Preprint.
  • GIL, J., STEIGER, W. and WIGDERSON, A. (1992). Geometric medians. Discrete Math. 108 37-51.
  • JOHNSON, T., KWOK, I. and NG, R. (1998). Fast computation of 2-dimensional depth contours. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining.
  • SCOTT, D. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
  • TENG, J. (1999). New methodology in regression and multivariate quality control via data depth. Ph.D. thesis, Dept. Statistics, Rutgers Univ. To appear.