Statistical Science (Statist. Sci.)
Volume 32, Number 2 (2017), 265–278.
Fitting Regression Models to Survey Data
Thomas Lumley and Alastair Scott
Abstract
Data from complex surveys are being used increasingly to build the same sort of explanatory and predictive models used in the rest of statistics. Although the assumptions underlying standard statistical methods are not even approximately valid for most survey data, analogues of most of the features of standard regression packages are now available for use with survey data. We review recent developments in the field and illustrate their use on data from NHANES.
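As a rough illustration of what a survey analogue of a standard regression fit can look like, the sketch below fits a straight line by solving sampling-weighted estimating equations and computes a linearization (sandwich) variance from totals over primary sampling units (PSUs). It is a minimal sketch and not the method of the article or of any particular package: the function names, the toy data, and the simplifying assumption of a single stratum with PSUs treated as sampled with replacement are all assumptions made here for illustration.

```python
from collections import defaultdict


def weighted_linear_fit(x, y, w):
    """Fit y = a + b*x by solving the sampling-weighted normal equations."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b


def linearization_variance(x, y, w, psu):
    """Sandwich variance A^{-1} B A^{-1} for (a, b).

    Assumption (illustrative only): one stratum, PSUs sampled with replacement.
    """
    a, b = weighted_linear_fit(x, y, w)

    # A = sum_i w_i * (1, x_i)(1, x_i)^T, inverted in closed form (2 x 2).
    a00 = sum(w)
    a01 = sum(wi * xi for wi, xi in zip(w, x))
    a11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = a00 * a11 - a01 * a01
    inv = [[a11 / det, -a01 / det], [-a01 / det, a00 / det]]

    # Weighted score contributions w_i * r_i * (1, x_i), totalled within PSUs.
    totals = defaultdict(lambda: [0.0, 0.0])
    for xi, yi, wi, c in zip(x, y, w, psu):
        r = yi - a - b * xi
        totals[c][0] += wi * r
        totals[c][1] += wi * r * xi

    n_psu = len(totals)
    if n_psu < 2:
        raise ValueError("at least two PSUs are needed to estimate the variance")

    # B = n/(n-1) * sum_c z_c z_c^T; the PSU totals already sum to zero
    # at the fitted values, so no explicit centring is required.
    scale = n_psu / (n_psu - 1)
    bmat = [[0.0, 0.0], [0.0, 0.0]]
    for z in totals.values():
        for j in range(2):
            for k in range(2):
                bmat[j][k] += scale * z[j] * z[k]

    tmp = [[sum(inv[j][m] * bmat[m][k] for m in range(2)) for k in range(2)]
           for j in range(2)]
    var = [[sum(tmp[j][m] * inv[m][k] for m in range(2)) for k in range(2)]
           for j in range(2)]
    return (a, b), var


# Toy data (made up): six respondents in three PSUs with unequal weights.
x_demo = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [0.5, 1.9, 4.2, 5.8, 8.1, 9.7]
w_demo = [10.0, 10.0, 40.0, 40.0, 5.0, 5.0]
psu_demo = [1, 1, 2, 2, 3, 3]

if __name__ == "__main__":
    coef, var = linearization_variance(x_demo, y_demo, w_demo, psu_demo)
    print("coefficients:", coef)
    print("standard errors:", var[0][0] ** 0.5, var[1][1] ** 0.5)
```

The point estimates coincide with ordinary weighted least squares; what changes under complex sampling is the variance, which here is built from PSU-level totals rather than from an assumption of independent observations.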
Article information
Source
Statist. Sci. Volume 32, Number 2 (2017), 265-278.
Dates
First available in Project Euclid: 11 May 2017
Permanent link to this document
http://projecteuclid.org/euclid.ss/1494489815
Digital Object Identifier
doi:10.1214/16-STS605
Keywords
Complex sampling; statistical graphics
Citation
Lumley, Thomas; Scott, Alastair. Fitting Regression Models to Survey Data. Statist. Sci. 32 (2017), no. 2, 265–278. doi:10.1214/16-STS605. http://projecteuclid.org/euclid.ss/1494489815.

