Statistical Science

Fitting Regression Models to Survey Data

Thomas Lumley and Alastair Scott



Data from complex surveys are being used increasingly to build the same sort of explanatory and predictive models used in the rest of statistics. Although the assumptions underlying standard statistical methods are not even approximately valid for most survey data, analogues of most of the features of standard regression packages are now available for use with survey data. We review recent developments in the field and illustrate their use on data from NHANES.

Article information

Statist. Sci., Volume 32, Number 2 (2017), 265-278.

First available in Project Euclid: 11 May 2017


Keywords: complex sampling, statistical graphics


Lumley, Thomas; Scott, Alastair. Fitting Regression Models to Survey Data. Statist. Sci. 32 (2017), no. 2, 265–278. doi:10.1214/16-STS605.


