Statistical Science

Combining Survey Data with Other Data Sources

Sharon L. Lohr and Trivellore E. Raghunathan

Abstract

Collecting data using probability samples can be expensive, and response rates for many household surveys are decreasing. The increasing availability of large data sources opens new opportunities for statisticians to use the information in survey data more efficiently by combining survey data with information from these other sources. We review some of the work done to date on statistical methods for combining information from multiple data sources, discuss the limitations and challenges for different methods that have been proposed, and describe research that is needed for combining survey estimates.

Article information

Source
Statist. Sci. Volume 32, Number 2 (2017), 293-312.

Dates
First available in Project Euclid: 11 May 2017

Permanent link to this document
https://projecteuclid.org/euclid.ss/1494489817

Digital Object Identifier
doi:10.1214/16-STS584

Keywords
Hierarchical models; imputation; multiple frame survey; probability sample; record linkage; small area estimation

Citation

Lohr, Sharon L.; Raghunathan, Trivellore E. Combining Survey Data with Other Data Sources. Statist. Sci. 32 (2017), no. 2, 293–312. doi:10.1214/16-STS584. https://projecteuclid.org/euclid.ss/1494489817.

