Statistical Science

Two-Sample Instrumental Variable Analyses Using Heterogeneous Samples

Qingyuan Zhao, Jingshu Wang, Wes Spiller, Jack Bowden, and Dylan S. Small


Instrumental variable analysis is a widely used method for estimating causal effects in the presence of unmeasured confounding. When the instruments, exposure and outcome are not measured in the same sample, Angrist and Krueger (J. Amer. Statist. Assoc. 87 (1992) 328–336) suggested using two-sample instrumental variable (TSIV) estimators, which combine sample moments from an instrument-exposure sample and an instrument-outcome sample. However, this method is biased if the two samples come from heterogeneous populations in which the distributions of the instruments differ. In linear structural equation models, we derive a new class of TSIV estimators that are robust to heterogeneous samples under the key assumption that the structural relations in the two samples are the same. The widely used two-sample two-stage least squares estimator belongs to this class. It is generally not asymptotically efficient, although we find that it performs similarly to the optimal TSIV estimator in most practical situations. We then attempt to relax the linearity assumption. We find that, unlike in one-sample analyses, the TSIV estimator is not robust to a misspecified exposure model. Additionally, to nonparametrically identify the magnitude of the causal effect, the noise in the exposure must have the same distribution in the two samples. However, this assumption is in general untestable because the exposure is not observed in one sample. Nonetheless, we may still identify the sign of the causal effect in the absence of homogeneity of the noise.
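To make the contrast in the abstract concrete, the following is a minimal Python sketch, not the authors' code. It assumes a single instrument and a linear model with the same structural coefficients in both samples. It compares a moment-based two-sample estimator (a ratio of sample covariances) with two-sample two-stage least squares when the instrument has a different variance in each sample. All function names, the simulated data-generating process and the parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def add_intercept(v):
    """Return a design matrix with a leading column of ones."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    return np.column_stack([np.ones(len(v)), v])


def ts2sls(z_a, x_a, z_b, y_b):
    """Two-sample 2SLS: fit exposure ~ instrument in sample a,
    predict the exposure in sample b, then regress outcome on the prediction."""
    stage1, *_ = np.linalg.lstsq(add_intercept(z_a), x_a, rcond=None)
    x_hat_b = add_intercept(z_b) @ stage1
    stage2, *_ = np.linalg.lstsq(add_intercept(x_hat_b), y_b, rcond=None)
    return stage2[1]  # slope on the predicted exposure


def moment_tsiv(z_a, x_a, z_b, y_b):
    """Moment-based estimator for one instrument:
    cov(Z, Y) from sample b divided by cov(Z, X) from sample a."""
    return np.cov(z_b, y_b)[0, 1] / np.cov(z_a, x_a)[0, 1]


def draw(rng, n, z_sd, beta):
    """Hypothetical linear model with an unmeasured confounder u."""
    z = rng.normal(0.0, z_sd, n)
    u = rng.normal(size=n)
    x = z + u + rng.normal(size=n)          # exposure
    y = beta * x + u + rng.normal(size=n)   # outcome
    return z, x, y


def demo(seed=0, n=200_000, beta=0.5):
    rng = np.random.default_rng(seed)
    # Same structural equations, but the instrument's standard deviation
    # is 1 in sample a and 2 in sample b (heterogeneous samples).
    z_a, x_a, _ = draw(rng, n, 1.0, beta)   # outcome unused in sample a
    z_b, _, y_b = draw(rng, n, 2.0, beta)   # exposure unused in sample b
    return ts2sls(z_a, x_a, z_b, y_b), moment_tsiv(z_a, x_a, z_b, y_b)


if __name__ == "__main__":
    print(demo())
```

Under these assumptions the two-sample 2SLS estimate should land close to the true effect (0.5 here). The covariance-ratio estimate is inflated by roughly the ratio of the instrument variances in the two samples (a factor of about 4), because the covariances in its numerator and denominator are scaled by different instrument variances.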

Article information

Statist. Sci., Volume 34, Number 2 (2019), 317-333.

First available in Project Euclid: 19 July 2019

Keywords: Generalized method of moments; linkage disequilibrium; local average treatment effect; Mendelian randomization; two stage least squares


Zhao, Qingyuan; Wang, Jingshu; Spiller, Wes; Bowden, Jack; Small, Dylan S. Two-Sample Instrumental Variable Analyses Using Heterogeneous Samples. Statist. Sci. 34 (2019), no. 2, 317--333. doi:10.1214/18-STS692.


  • 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526 68–74.
  • Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. J. Econometrics 113 231–263.
  • Anderson, T. W. and Rubin, H. (1949). Estimation of the parameters of a single equation in a complete system of stochastic equations. Ann. Math. Stat. 20 46–63.
  • Angrist, J. D., Graddy, K. and Imbens, G. W. (2000). The interpretation of instrumental variables estimators in simultaneous equations models with an application to the demand for fish. Rev. Econ. Stud. 67 499–527.
  • Angrist, J. D., Imbens, G. W. and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. J. Amer. Statist. Assoc. 91 444–455.
  • Angrist, J. D. and Krueger, A. B. (1992). The effect of age at school entry on educational attainment: An application of instrumental variables with moments from two samples. J. Amer. Statist. Assoc. 87 328–336.
  • Angrist, J. D. and Krueger, A. B. (1995). Split-sample instrumental variables estimates of the return to schooling. J. Bus. Econom. Statist. 13 225–235.
  • Baiocchi, M., Cheng, J. and Small, D. S. (2014). Instrumental variable methods for causal inference. Stat. Med. 33 2297–2340.
  • Baker, S. G. and Lindeman, K. S. (1994). The paired availability design: A proposal for evaluating epidural analgesia during labor. Stat. Med. 13 2269–2278.
  • Balke, A. and Pearl, J. (1997). Bounds on treatment effects from studies with imperfect compliance. J. Amer. Statist. Assoc. 92 1171–1176.
  • Barbeira, A., Dickinson, S. P., Bonazzola, R., Zheng, J., Wheeler, H. E., Torres, J. M., Torstenson, E. S., Shah, K. P., Garcia, T. et al. (2018). Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat. Commun. 9 1825.
  • Bowden, J., Davey Smith, G. and Burgess, S. (2015). Mendelian randomization with invalid instruments: Effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 44 512–525.
  • Bowden, J., Davey Smith, G., Haycock, P. C. and Burgess, S. (2016). Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genet. Epidemiol. 40 304–314.
  • Buja, A., Berk, R., Brown, L., George, E., Pitkin, E., Traskin, M., Zhao, L. and Zhang, K. (2014). Models as approximations, part I: A conspiracy of nonlinearity and random regressors in linear regression. Statist. Sci. Available at arXiv:1404.1578.
  • Burgess, S., Small, D. S. and Thompson, S. G. (2017). A review of instrumental variable estimators for Mendelian randomization. Stat. Methods Med. Res. 26 2333–2355.
  • Burgess, S., Scott, R. A., Timpson, N. J., Smith, G. D., Thompson, S. G. and EPIC-InterAct Consortium (2015). Using published data in Mendelian randomization: A blueprint for efficient identification of causal risk factors. Eur. J. Epidemiol. 30 543–552.
  • Choi, J., Gu, J. and Shen, S. (2018). Weak-instrument robust inference for two-sample instrumental variables regression. J. Appl. Econometrics 33 109–125.
  • Currie, J. and Yelowitz, A. (2000). Are public housing projects good for kids? J. Public Econ. 75 99–124.
  • Davey Smith, G. and Ebrahim, S. (2003). “Mendelian randomization”: Can genetic epidemiology contribute to understanding environmental determinants of disease? Int. J. Epidemiol. 32 1–22.
  • Davey Smith, G. and Hemani, G. (2014). Mendelian randomization: Genetic anchors for causal inference in epidemiological studies. Hum. Mol. Genet. 23 R89–98.
  • Davidson, R. and MacKinnon, J. G. (1993). Estimation and Inference in Econometrics. Oxford University Press, New York.
  • Fuller, W. A. (1977). Some properties of a modification of the limited information estimator. Econometrica 45 939–953.
  • Gamazon, E. R., Wheeler, H. E., Shah, K. P., Mozaffari, S. V., Aquino-Michaels, K., Carroll, R. J., Eyler, A. E., Denny, J. C., Nicolae, D. L. et al. (2015). A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47 1091–1098.
  • Graham, B. S., Pinto, C. C. X. and Egel, D. (2016). Efficient estimation of data combination models by the method of auxiliary-to-study tilting (AST). J. Bus. Econom. Statist. 34 288–301.
  • Haavelmo, T. (1944). The probability approach in econometrics. Econometrica 12 S1–S115.
  • Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50 1029–1054.
  • Hansen, C., Hausman, J. and Newey, W. (2008). Estimation with many instrumental variables. J. Bus. Econom. Statist. 26 398–422.
  • Hemani, G., Zheng, J., Elsworth, B., Wade, K. H., Haberland, V., Baird, D., Laurin, C., Burgess, S., Bowden, J. et al. (2018). The MR-Base platform supports systematic causal inference across the human phenome. eLife 7 e34408.
  • Hernán, M. A. and Robins, J. M. (2006). Instruments for causal inference: An epidemiologist’s dream? Epidemiology 360–372.
  • Imbens, G. W. (2007). Nonadditive models with endogenous regressors. In Advances in Economics and Econometrics (R. Blundell, W. Newey and T. Persson, eds.) 3 17–46. Cambridge Univ. Press, Cambridge.
  • Imbens, G. and Angrist, J. (1994). Identification and estimation of local average treatment effects. Econometrica 62 467–475.
  • Inoue, A. and Solon, G. (2010). Two-sample instrumental variables estimators. Rev. Econ. Stat. 92 557–561.
  • Jappelli, T., Pischke, J.-S. and Souleles, N. S. (1998). Testing for liquidity constraints in Euler equations with complementary data sources. Rev. Econ. Stat. 80 251–262.
  • Kang, H., Zhang, A., Cai, T. T. and Small, D. S. (2016). Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. J. Amer. Statist. Assoc. 111 132–144.
  • Klevmarken, A. (1982). Missing variables and two-stage least-squares estimation from more than one data set. Working Paper Series 62, Research Institute of Industrial Economics, Stockholm.
  • Lawlor, D. A. (2016). Commentary: Two-sample Mendelian randomization: Opportunities and challenges. Int. J. Epidemiol. 45 908–915.
  • Lawlor, D. A., Harbord, R. M., Sterne, J. A. C., Timpson, N. and Smith, G. D. (2008). Mendelian randomization: Using genes as instruments for making causal inferences in epidemiology. Stat. Med. 27 1133–1163.
  • Locke, A. E., Kahali, B., Berndt, S. I., Justice, A. E., Pers, T. H., Day, F. R., Powell, C., Vedantam, S., Buchkovich, M. L. et al. (2015). Genetic studies of body mass index yield new insights for obesity biology. Nature 518 197–206.
  • Machiela, M. J. and Chanock, S. J. (2015). LDlink: A web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics 31 3555–3557.
  • Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., McCarthy, M. I., Ramos, E. M., Cardon, L. R. et al. (2009). Finding the missing heritability of complex diseases. Nature 461 747–753.
  • Ogburn, E. L., Rotnitzky, A. and Robins, J. M. (2015). Doubly robust estimation of the local average treatment effect curve. J. R. Stat. Soc. Ser. B. Stat. Methodol. 77 373–396.
  • Pacini, D. (2018). The two-sample linear regression model with interval-censored covariates. J. Appl. Econometrics 34 66–81.
  • Pacini, D. and Windmeijer, F. (2016). Robust inference for the two-sample 2SLS estimator. Econom. Lett. 146 50–54.
  • Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge Univ. Press, Cambridge.
  • Peters, J., Bühlmann, P. and Meinshausen, N. (2016). Causal inference by using invariant prediction: Identification and confidence intervals. J. R. Stat. Soc. Ser. B. Stat. Methodol. 78 947–1012.
  • Pierce, B. L. and Burgess, S. (2013). Efficient design for Mendelian randomization studies: Subsample and 2-sample instrumental variable estimators. Am. J. Epidemiol. 178 1177–1184.
  • Ridder, G. and Moffitt, R. (2007). The econometrics of data combination. Handb. Econom. 6 5469–5547.
  • Sherry, S. T., Ward, M.-H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M. and Sirotkin, K. (2001). dbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 29 308–311.
  • Stock, J. H., Wright, J. H. and Yogo, M. (2002). A survey of weak instruments and weak identification in generalized method of moments. J. Bus. Econom. Statist. 20 518–529.
  • Theil, H. (1958). Economic Forecasts and Policy. North-Holland, Amsterdam.
  • Vansteelandt, S. and Didelez, V. (2015). Robustness and efficiency of covariate adjusted linear instrumental variable estimators. Preprint. Available at arXiv:1510.01770.
  • Wald, A. (1940). The fitting of straight lines if both variables are subject to error. Ann. Math. Stat. 11 285–300.
  • Wang, L. and Tchetgen Tchetgen, E. (2018). Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. J. R. Stat. Soc. Ser. B. Stat. Methodol. 80 531–550.
  • White, H. (1980). Using least squares to approximate unknown regression functions. Internat. Econom. Rev. 21 149–170.
  • Wright, P. G. (1928). Tariff on Animal and Vegetable Oils. Macmillan, New York.
  • Yang, J., Ferreira, T., Morris, A. P., Medland, S. E., Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., Weedon, M. N. et al. (2012). Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet. 44 369–375.
  • Zhao, Q., Wang, J., Bowden, J. and Small, D. S. (2019). Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. Ann. Statist. To appear. Available at arXiv:1801.09652.