Statistical Science

The Impact of Levene’s Test of Equality of Variances on Statistical Theory and Practice

Joseph L. Gastwirth, Yulia R. Gel, and Weiwen Miao

Abstract

In many applications, the underlying scientific question concerns whether the variances of k samples are equal. There are a substantial number of tests for this problem. Many of them rely on the assumption of normality and are not robust to its violation. In 1960 Professor Howard Levene proposed a new approach to this problem by applying the F-test to the absolute deviations of the observations from their group means. Levene’s approach is powerful and robust to nonnormality and became a very popular tool for checking the homogeneity of variances.
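The procedure described above can be sketched in a few lines. This is a minimal illustration, assuming mean centering as in Levene's original proposal. The function name `levene_statistic` and the sample data are hypothetical and do not come from the paper. Only the test statistic and its degrees of freedom are computed; a p-value would come from referring the statistic to an F distribution with those degrees of freedom.

```python
from statistics import mean


def levene_statistic(groups):
    """Return (W, df_between, df_within) for a Levene-type test.

    Hypothetical helper for illustration: mean centering, k >= 2 groups.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Step 1: absolute deviations of each observation from its own group mean.
    deviations = [[abs(x - mean(g)) for x in g] for g in groups]

    # Step 2: ordinary one-way ANOVA F statistic computed on those deviations.
    dev_means = [mean(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n_total
    ss_between = sum(
        len(z) * (m - grand_mean) ** 2 for z, m in zip(deviations, dev_means)
    )
    ss_within = sum(
        (v - m) ** 2 for z, m in zip(deviations, dev_means) for v in z
    )

    df_between, df_within = k - 1, n_total - k
    w = (ss_between / df_between) / (ss_within / df_within)
    return w, df_between, df_within


# Illustrative data only: the second group is twice as spread out as the first.
w, df1, df2 = levene_statistic([[1, 2, 3], [2, 4, 6]])
print(w, df1, df2)
```

Replacing `mean(g)` in step 1 with the group median would give the median-centered variant associated with Brown and Forsythe (1974), one of the robust modifications the paper reviews.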

This paper reviews the original method proposed by Levene and subsequent robust modifications. A modification of Levene-type tests to increase their power to detect monotonic trends in variances is discussed. This procedure is useful when one is concerned with an alternative of increasing or decreasing variability, for example, increasing volatility of stock prices or “open or closed gramophones” in regression residual analysis. A major section of the paper is devoted to discussion of various scientific problems where Levene-type tests have been used, for example, economic anthropology, accuracy of medical measurements, volatility of the price of oil, studies of the consistency of jury awards in legal cases and the effect of hurricanes on ecological systems.

Article information

Source
Statist. Sci., Volume 24, Number 3 (2009), 343-360.

Dates
First available in Project Euclid: 31 March 2010

Permanent link to this document
https://projecteuclid.org/euclid.ss/1270041260

Digital Object Identifier
doi:10.1214/09-STS301

Mathematical Reviews number (MathSciNet)
MR2757435

Zentralblatt MATH identifier
1329.62332

Keywords
ANOVA; equality of variances; Levene’s test; trend tests; effect of dependence; applied statistics

Citation

Gastwirth, Joseph L.; Gel, Yulia R.; Miao, Weiwen. The Impact of Levene’s Test of Equality of Variances on Statistical Theory and Practice. Statist. Sci. 24 (2009), no. 3, 343–360. doi:10.1214/09-STS301. https://projecteuclid.org/euclid.ss/1270041260


References

  • Abelson, R. P. and Tukey, J. W. (1963). Efficient utilization of non-numerical information in quantitative analysis: General theory and the case of simple order. Ann. Math. Statist. 34 1347–1369.
  • Agresti, A. (2002). Categorical Data Analysis. Wiley, New York.
  • Algina, J., Olejnik, S. and Ocanto, R. (1989). Type I error rates and power estimates for selected two-sample tests of scale. Journal of Educational Statistics 14 373–383.
  • Andrews, D. F., Bickel, P. J., Hampel, F. R., Huber, P. J., Rodgers, W. H. and Tukey, J. W. (1972). Robust Estimates of Location: Survey and Advances. Princeton Univ. Press, Princeton, NJ.
  • Arnold, S. F. (1980). Asymptotic validity of F tests for the ordinary linear model and the multiple correlation model. J. Amer. Statist. Assoc. 75 890–894.
  • Auger, J. and Jouannet, P. (1997). Evidence for regional differences of semen quality among fertile French men. Human Reproduction 12 740–745.
  • Balakrishnan, N. and Ma, C. W. (1990). A comparative study of various tests for the equality of two population variances. J. Stat. Comput. Simul. 35 41–89.
  • Bancroft, T. A. (1964). Analysis and inference for incompletely specified models involving the use of preliminary test(s) of significance. Biometrics 20 427–442.
  • Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proc. Roy. Soc. Ser. A 160 268–282.
  • Bathke, A. (2002). ANOVA for a large number of treatments. Math. Methods Statist. 11 118–132.
  • Bathke, A. (2004). The ANOVA F-test can still be used in some balanced designs with unequal variances and non-normal data. J. Statist. Plann. Inference 2 413–422.
  • Berger, A. K., Gottdiener, J. S., Yohe, M. A. and Guerro, J. L. (1999). Epidemiologic approach to quality assessment in echocardiographic diagnosis. Journal of the American College of Cardiology 34 1831–1836.
  • Bickel, P. J. (1975). One-step Huber estimates in the linear model. J. Amer. Statist. Assoc. 70 428–434.
  • Boos, D. D. and Brownie, C. (1989). Bootstrap methods for testing homogeneity of variances. Technometrics 31 69–82.
  • Boos, D. D. and Brownie, C. (1995). ANOVA and ranks test when the number of treatments is large. Statist. Probab. Lett. 23 183–191.
  • Boos, D. D. and Brownie, C. (2004). Comparing variances and other measures of dispersion. Statist. Sci. 19 571–578.
  • Box, G. E. P. (1953). Non-normality and tests on variances. Biometrika 40 318–335.
  • Box, G. E. P. and Andersen, S. L. (1955). Permutation theory in the derivation of robust criteria and the study of departures from assumption. J. Roy. Statist. Soc. Ser. B 17 1–26.
  • Brown, M. B. and Forsythe, A. B. (1974). Robust tests for equality of variances. J. Amer. Statist. Assoc. 69 364–367.
  • Carlton, G. C. and Bazzaz, F. A. (1998). Resource congruence and forest regeneration following an experimental hurricane blowdown. Ecology 79 1305–1319.
  • Carroll, R. J. and Ruppert, D. (1982). Robust estimation in heteroscedastic linear models. Ann. Statist. 10 429–441.
  • Carroll, R. J. and Schneider, H. (1985). A note on Levene’s tests for equality of variances. Statist. Probab. Lett. 3 191–194.
  • Cattaneo, Z., Postma, A. and Vecchi, T. (2006). Gender differences in memory for object and word. Quarterly Journal of Experimental Psychology 59 904–919.
  • Chacko, V. J. (1963). Testing homogeneity against ordered alternatives. Ann. Math. Statist. 34 945–956.
  • Chang, E. C., Jain, P. C. and Locke, P. R. (1995). Standard and Poors 500 index futures volatility and price changes around the New York stock exchange close. Journal of Business 68 61–84.
  • Chang, E. C., Pinegar, J. M. and Schacter, B. (1997). Interday variations in volume, variance and participation of large speculators. Journal of Banking and Finance 21 797–810.
  • Conover, W. J., Johnson, M. E. and Johnson, M. M. (1981). A comparative study of tests for homogeneity of variances, with applications to the outer continental shelf bidding data. Technometrics 23 351–361.
  • Crow, E. L. and Siddiqui, M. M. (1967). Robust estimation of location. J. Amer. Statist. Assoc. 62 353–389.
  • Coulson, D. and Joyce, L. (2006). Indexing variability: A case study with climate change impacts on ecosystems. Ecological Indicators 6 749–769.
  • Christie, D. R. and Koch, T. W. (1997). The impact of market-specific public information on return variance in an illiquid market. Journal of Futures Markets 17 887–908.
  • Cumming, J. and Hall, C. (2002). Athlete’s use of imagery in the off-season. Sport Psychologist 16 160–172.
  • Davis, J. T. (1996). Experience and auditors’ selection of relevant information for preliminary control risk assessments. Auditing 15 16–37.
  • Dempster, A. P. (1988). Employment discrimination and statistical science. Statist. Sci. 3 149–161.
  • Dhillon, U. S., Lasser, D. J. and Watanbe, T. (1997). Volatility, information and double versus walrasian auction pricing in US and Japanese futures markets. Journal of Banking and Finance 21 1045–1061.
  • Dorfman, D. D. and Berbaum, K. S. (2000). A contaminated binormal model for ROC data-part III: Initial evaluation with detection ROC data. Academic Radiology 7 438–447.
  • English, D. R., Armstrong, B. K. and Kricker, A. (1998). Reproducibility of reported measurements of sun exposure in a case-control study. Cancer, Epidemiology, Biomarkers and Prevention 7 857–863.
  • Esserman, L., Cowley, H., Eberle, C., Kirkpatrick, A., Chang, S., Berbaum, K. and Gale, A. (2002). Improving the accuracy of mammography: Volume and outcome relationships. Journal of the National Cancer Institute 94 369–375.
  • Evett, I. W. and Weir, B. S. (1998). Interpreting DNA Evidence. Sinauer, Sunderland, MA.
  • Fisher, N. I. (1986). Robust-tests for comparing the dispersions of several Fisher or Watson distributions on the sphere. Geophysical Journal of the Royal Astronomical Society 85 563–572.
  • Fligner, M. A. and Killeen, T. J. (1976). Distribution-free two-sample tests for scale. J. Amer. Statist. Assoc. 71 210–213.
  • Flynn, F. J. and Brockner, J. (2003). It is different to give than to receive: Predictors of givers’ and receivers’ reactions to favor exchange. Journal of Applied Psychology 88 1034–1045.
  • Francois, N., Guydot-Declerck, C., Hug, B., Callemien, D., Govaerts, B. and Collin, S. (2006). Beer astringency assessed by time-intensity and quantitative descriptive analysis: Influence of pH and accelerated aging. Food Quality and Preference 17 445–452.
  • Freidlin, B. and Gastwirth, J. L. (2004). A note on the use of tests of mutation rates on ordered groups. Genetic Testing 8 437–440.
  • Freidlin, B., Miao, M. and Gastwirth, J. L. (2003). On the use of the Shapiro–Wilk test in two-stage adaptive inference for paired data from moderate to very heavy tailed distributions. Biometrical Journal 45 887–900.
  • Fujino, Y. (1979). Tests for the homogeneity of variances for ordered alternatives. Biometrika 66 133–139.
  • Gastwirth, J. L. and Rubin, H. (1969). On robust linear estimators. Ann. Math. Statist. 40 24–39.
  • Gastwirth, J. L. (1972). Robust estimation of the Lorenz curve and Gini index. Rev. Econom. Statist. 54 306–316.
  • Gastwirth, J. L. (1987). The statistical precision of medical screening procedures: Application to polygraph and AIDS antibodies test data. Statist. Sci. 2 213–238.
  • Gastwirth, J. L. (2001). Screening and selection. In International Encyclopedia of Social Sciences (N. J. Smelser and P. B. Bates, eds.). Elsevier, Oxford, U.K. 13755–13767.
  • Gillespie, J. H. (1998). Population Genetics: A Concise Guide. Johns Hopkins Univ. Press, Baltimore, MD.
  • Giraud, T. and Capy, P. (1996). Somatic activity of the mariner transposable element in natural populations of Drosophila simulans. Proceedings: Biological Sciences 263 1481–1486.
  • Grambsch, P. M. (1994). Simple robust tests for scale differences in paired data. Biometrika 81 359–372.
  • Goodman, J., Green, E. and Loftus, E. F. (1989). Runaway verdicts or reasonable determination: Mock juror strategies in awarding damages. Jurimetrics Journal 29 285–309.
  • Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and Clinical Psychology 68 155–165.
  • Graubard, B. I. and Korn, E. L. (1987). Choice of column scores for testing independence in ordered 2×k contingency tables. Biometrics 43 471–476.
  • Greene, E., Coon, D. and Boornstein, B. (2001). The effects of limiting punitive damage awards. Law and Human Behavior 25 217–234.
  • Hall, P. and Padmanabhan, A. R. (1997). Adaptive inference for the two-sample scale problem. Technometrics 39 412–422.
  • Hardy, G. H. (1908). Mendelian proportions in a mixed population. Science 28 40–50.
  • Hedrick, P. W. (2000). Genetics of Populations, 2nd ed. Jones and Bartlett, Sudbury, MA.
  • Hedrick, P. W. (2006). Genetic polymorphism in heterogeneous environments: The age of genomics. Ann. Rev. Ecol. Systems 37 67–93.
  • Henriksen, H. (2003). The role of some regional factors in the assessment of well yields from hard-rock aquifers of Fennoscandia. Hydrogeology Journal 11 628–645.
  • Hays, M. A., Irsula, B., McMullen, S. L. and Feldblum, P. J. (2001). A comparison of three daily coital diary designs and a phone-in regimen. Contraception 63 159–166.
  • Hicks, T. V. and Leitenberg, H. (2001). Sexual fantasies about one’s partner versus someone else: Gender differences in incidence and frequency. The Journal of Sex Research 38 43–50.
  • Hines, W. G. S. and Hines, R. J. O. (2000). Increased power with modified forms of the Levene (med) test for heterogeneity of variance. Biometrics 56 451–454.
  • Hogg, R. V. (1974). Adaptive robust procedures: A partial review and some suggestions for future applications and theory. J. Amer. Statist. Assoc. 69 909–923.
  • Hogg, R. V., Fisher, V. M. and Randles, R. H. (1975). A two-sample adaptive distribution-free test. J. Amer. Statist. Assoc. 70 656–661.
  • Huber, P. J. (1972). Robust statistics: A review. Ann. Math. Statist. 43 1041–1067.
  • Huber, P. J. (1973). Robust regression: Asymptotic, conjectures and Monte Carlo. Ann. Statist. 1 799–821.
  • Huber, M, Chen, Y. G., Dinwoodie, I., Dobra, A. and Nicholas, M. (2006). Monte Carlo algorithms for Hardy–Weinberg proportions. Biometrics 62 49–53.
  • Johnson, S. W., Rice, S. D. and Moles, D. A. (1998). Effects of submarine mine tailings disposal on juvenile yellowfin sole (Pleuronectes asper): A laboratory study. Marine Pollution Bulletin 36 278–287.
  • Johnson, N. L. and Leone, F. C. (1964). Statistics and Experimental Design in Engineering and Physical Sciences, 2nd ed. Wiley, New York.
  • Kahn, M. S., Coulibaly, P. and Dibike, Y. (2006). Uncertainty analysis of statistical downscaling methods using Canadian global climate predictors. Hydrological Processes 20 3085–3104.
  • Keyes, T. K. and Levy, M. S. (1997). Analysis of Levene’s test under design imbalance. Journal of Educational and Behavioral Statistics 22 845–858.
  • Korn, E. L. and Graubard, B. I. (1999). The Analysis of Health Surveys. Wiley, New York.
  • Koissi, M. C., Shapiro, A. R. and Hognas, G. (2006). Evaluating and extending the Lee–Carter model for mortality forecasting: Bootstrap confidence interval. Insurance Math. Econom. 38 1–20.
  • Krutchkoff, R. G. (1988). One-way fixed effects analysis of variance when the variances may be unequal. J. Stat. Comput. Simul. 30 259–183.
  • Kutner, M. H., Nachtsheim, C. J. and Neter, J. (2004). Applied Regression Analysis. McGraw-Hill/Irwin, Boston.
  • Kvamme, K. L., Stark, M. T. and Longacre, M. A. (1996). Alternative procedures for assessing standardization in ceramic assemblages. American Antiquity 61 116–126.
  • Levene, H. (1949). On a matching problem arising in genetics. Ann. Math. Statist. 20 91–94.
  • Levene, H. (1953). Genetic equilibrium when more than one ecological niche is available. American Naturalist 87 331–333.
  • Levene, H. (1960). Robust tests for equality of variances. In Contributions to Probability and Statistics (I. Olkin, ed.) 278–292. Stanford Univ. Press, Palo Alto, CA.
  • Lim, T. S. and Loh, W. Y. (1996). A comparison of tests of equality of variances. Comput. Statist. Data Anal. 22 287–301.
  • Little, R. J. A. and Rubin, D. A. (2002). Statistical Analysis with Missing Data. Wiley, New York.
  • Manly, B. F. J. and Francis, R. I. C. C. (2002). Testing for mean and variance differences with samples from distributions that may be non-normal with unequal variances. J. Stat. Comput. Simul. 72 633–646.
  • Marti, M. W. and Wissler, R. L. (2000). Be careful what you ask for: The effect of anchors on personal injury damages awards. Journal of Experimental Psychology-Applied 6 91–103.
  • Martin, C. G. and Games, P. A. (1977). Tests for homogeneity of variance: Non-normality and unequal samples. Journal of Educational Statistics 2 187–206.
  • Maurer, H. P., Melchinger, A. E. and Frisch, M. (2007). An incomplete enumeration algorithm for an exact test of Hardy–Weinberg proportions with multiple alleles. Theoretical and Applied Genetics 115 393–398.
  • Mayhew, D. A., Comer, C. P. and Stargel, W. W. (2003). Food consumption and body weight changes with neotame, a new sweetener with intense taste: Differentiating effects of palatability from toxicity in dietary safety studies. Regulatory Toxicology and Pharmacology 38 124–143.
  • Miao, W. and Gastwirth, J. L. (2009). A new two stage adaptive nonparametric test for paired difference. Statistics and Its Interface 2 213–221.
  • Miller, R. G., Jr. (1968). Jackknifing variances. Ann. Math. Statist. 39 567–582.
  • Miller, R. G., Jr. (1986). Beyond ANOVA: Basics of Applied Statistics. Wiley, New York.
  • Milliken, G. A. and Johnson, D. E. (1984). Analysis of Messy Data, Vol. 1. Van Nostrand Reinhold, New York.
  • Mitchell-Olds, T. and Rutledge, J. J. (1986). Quantitative genetics in natural populations: A review of the theory. The American Naturalist 127 379–402.
  • Molenberghs, G. and Kenward, M. G. (2007). Missing Data in Clinical Studies. Wiley, Chichester, UK.
  • Moser, B. K., Stevens, G. R. and Watts, C. L. (1989). The two-sample T-test versus Satterthwaite’s approximation F-test. Communication in Statistics—Theory and Methods 18 3963–3975.
  • Moser, B. K., Stevens, G. R. and Watts, C. L. (1992). Homogeneity of variances in the two-sample means test. Amer. Statist. 46 19–21.
  • Neave, F. B., Mandrak, N. E., Docker, M. F. and Noakes, D. L. (2006). Effects of preservation on pigmentation and length measurements in larval lampreys. Journal of Fish Biology 68 991–1001.
  • Neuhauser, M. and Hothorn, L. A. (2000). Parametric location-scale and scale trend tests based on Levene’s transformation. Comput. Statist. Data Anal. 33 189–200.
  • Nygard, F. and Sandstrom, A. (1989). Income inequality measures based on sample surveys. J. Econometrics 42 81–95.
  • O’Brien, R. G. (1979). A general ANOVA method for robust tests of additive models for variances. J. Amer. Statist. Assoc. 74 877–880.
  • O’Gorman, T. (1997). A comparison of an adaptive two-sample test to the t-test and the rank sum test. Commun. Statist. Simulation and Comput. 26 1393–1411.
  • O’Neil, K. M., Penrod, S. D. and Bornstein, B. H. (2003). Web-based research: Methodological variables’ effects on dropout and sample characteristics. Behavior Research Methods Instruments and Computers 35 217–226.
  • O’Neil, M. E. and Mathews, K. L. (2000). A weighted least squares approach to Levene’s test of homogeneity of variance. Aust. N. Z. J. Stat. 42 81–100.
  • O’Neil, M. E. and Mathews, K. L. (2002). Levene tests of homogeneity of variance for general block and treatment designs. Biometrics 58 216–224.
  • Pan, G. (2002). Confidence intervals for comparing two scale parameters based on Levene statistics. J. Nonparametr. Stat. 14 459–476.
  • Piegorsch, W. W. and Bailer, A. J. (2005). Analyzing Environmental Data. Wiley, Chichester, UK.
  • Pepe, M. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Wiley, Chichester, UK.
  • Plourdes, A. and Watkins, G. C. (1998). Crude oil prices between 1985 and 1994: How volatile in relation to other commodities? Resource and Energy Economics 20 245–262.
  • Pollak, E. (2006). The influence of Levene’s paper on polymorphism in subdivided populations. In Proceedings of the Joint Statistical Meetings, August, 2006. Amer. Statist. Assoc., Alexandria, VA.
  • Robbennolt, J. K. and Studebaker, C. A. (1999). Anchoring in the courtroom: The effect of caps on punitive damages. Law and Human Behavior 23 353–373.
  • Rosenbaum, P. R. (2002). Observational Studies. Springer, New York.
  • Rosser, D. A., Murdoch, I. E. and Cousens, S. N. (2004). The effect of optical defocus on the test-retest variability of visual acuity measurements. Investigative Ophthalmology and Visual Science 45 1076–1079.
  • Roth, A. J. (1983). Robust trend tests derived and simulated: Analogs of the Welch and Brown–Forsythe tests. J. Amer. Statist. Assoc. 78 1972–1980.
  • Saks, M. J., Hollinger, L. A., Wissler, R. L., Evans, D. L. and Hart, A. (1997). Reducing variability in civil jury awards. Law and Human Behavior 21 243–256.
  • Sant, R. and Cowan, A. R. (1994). Do dividends signal earnings—the case of omitted dividends. Journal of Banking and Finance 18 1113–1133.
  • Schaale, G. B. and Despain, D. J. (1996). Robustness of variance tests for randomized complete block data. Commun. Statist. Simulation 25 961–977.
  • Scheffe, H. (1959). The Analysis of Variance. Wiley, New York.
  • Schom, C. B. and Kit, J. M. (1980). Genetic and environmental-control of avian embryos response to a teratogen. Poultry Science 59 473–478.
  • Schucany, W. R. and Ng, H. K. T. (2006). Preliminary goodness of fit tests for normality do not validate the one-sample student t. Comm. Statist. 5 2275–2286.
  • Shorack, G. R. (1969). Testing and estimating ratios of scale parameters. J. Amer. Statist. Assoc. 64 999–1013.
  • Sprott, D. A. and Farewell, V. T. (1993). The difference between two normal means. Amer. Statist. 47 126–128.
  • Star, B., Stoffels, R. J. and Spencer, H. G. (2007). Evolution of fitness and allele frequencies in a population with spatially heterogeneous selection pressures. Genetics 177 1743–1751.
  • Tabain, M. (2001). Variability in fricative production and spectra: Implications for the hyper- and hypo- and quantal theories of speech production. Language and Speech 44 57–94.
  • van Belle, G. (2002). Statistical Rules of Thumb. Wiley, New York.
  • Vangel, M. G. (2005). A numerical approach to the Behrens–Fisher problem. J. Statist. Plann. Inference 130 341–350.
  • Vincent, S. E. (1961). A test of homogeneity for ordered variances. J. Roy. Statist. Soc. Ser. B 23 195–206.
  • Waldo, D. R. and Goering, H. K. (1979). Insolubility of proteins in ruminant feeds by 4 methods. Journal of Animal Science 49 1560–1568.
  • Weerahandi, S. (1995). ANOVA under unequal error variances. Biometrics 51 589–599.
  • Weinberg, W. (1908). Über den Nachweis der Vererbung beim Menschen. Jahresh. Verein f. Vaterl. Naturk. in Württemberg 64 364–382.
  • Weir, B. (1996). Genetic Data Analysis II. Sinauer, Sunderland, MA.
  • Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika 29 350–362.
  • Welch, B. L. (1951). On the comparison of several mean values: An alternative approach. Biometrika 38 330–336.
  • Wilcox, R. R. (1989). Comparing the variances of dependent groups. Psychometrika 54 305–315.
  • Yitnosumarto, S. and O’Neill, M. E. (1986). On Levene’s tests of variance homogeneity. Aust. J. Statist. 28 230–241.
  • Zheng, G., Freidlin, B., Li, Z. and Gastwirth, J. L. (2003). Choice of scores in trend tests for case-control studies of candidate-gene associations. Biometrical Journal 45 335–348.
  • Zimmerman, D. W. (2004). A note on preliminary tests of variances. British J. Math. Statist. Psych. 57 173–181.