In randomized experiments, treatment and control groups should be roughly the same—balanced—in their distributions of pretreatment variables. But how nearly so? Can descriptive comparisons meaningfully be paired with significance tests? If so, should there be several such tests, one for each pretreatment variable, or should there be a single, omnibus test? Could such a test be engineered to give easily computed p-values that are reliable in samples of moderate size, or would simulation be needed for reliable calibration? What new concerns are introduced by random assignment of clusters? Which tests of balance would be optimal?
To address these questions, Fisher’s randomization inference is applied to the question of balance. Its application suggests the reversal of published conclusions about two studies, one clinical and the other a field experiment in political participation.