Statistical Science

Matching Methods for Causal Inference: A Review and a Look Forward

Elizabeth A. Stuart


Abstract

When estimating causal effects using observational data, it is desirable to replicate a randomized experiment as closely as possible by obtaining treated and control groups with similar covariate distributions. This goal can often be achieved by choosing well-matched samples of the original treated and control groups, thereby reducing bias due to the covariates. Since the 1970s, work on matching methods has examined how to best choose treated and control subjects for comparison. Matching methods are gaining popularity in fields such as economics, epidemiology, medicine and political science. However, until now the literature and related advice have been scattered across disciplines. Researchers who are interested in using matching methods, or in developing methods related to matching, do not have a single place to turn to learn about past and current research. This paper provides a structure for thinking about matching methods and guidance on their use, coalescing the existing research (both old and new) and providing a summary of where the literature on matching methods is now and where it should be headed.

Article information

Source
Statist. Sci. Volume 25, Number 1 (2010), 1-21.

Dates
First available: 3 August 2010

Permanent link to this document
http://projecteuclid.org/euclid.ss/1280841730

Digital Object Identifier
doi:10.1214/09-STS313

Mathematical Reviews number (MathSciNet)
MR2741812

Citation

Stuart, Elizabeth A. Matching Methods for Causal Inference: A Review and a Look Forward. Statistical Science 25 (2010), no. 1, 1–21. doi:10.1214/09-STS313. http://projecteuclid.org/euclid.ss/1280841730.


