Statistical Science

Inference for Superpopulation Parameters Using Sample Surveys

Barry I. Graubard and Edward L. Korn

Abstract

Sample survey inference is historically concerned with finite-population parameters, that is, functions (like means and totals) of the observations for the individuals in the population. In scientific applications, however, interest usually focuses on the “superpopulation” parameters associated with a stochastic mechanism hypothesized to generate the observations in the population rather than the finite-population parameters. Two relevant findings discussed in this paper are that (1) with stratified sampling, it is not sufficient to drop finite-population correction factors from standard design-based variance formulas to obtain appropriate variance formulas for superpopulation inference, and (2) with cluster sampling, standard design-based variance formulas can dramatically underestimate superpopulation variability, even with a small sampling fraction of the final units. A literature review of inference for superpopulation parameters is given, with emphasis on why these findings have not been previously appreciated. Examples are provided for estimating superpopulation means, linear regression coefficients and logistic regression coefficients using U.S. data from the 1987 National Health Interview Survey, the third National Health and Nutrition Examination Survey and the 1986 National Hospital Discharge Survey.
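The second finding can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and is not taken from the paper: a one-way random-effects model y_ij = mu + a_i + e_ij with equal-sized clusters, simple random sampling at both stages, the textbook two-stage design-based variance estimator with finite-population corrections, and the between-cluster estimator s_b^2 / k as a stand-in for a superpopulation variance estimator. The function names are hypothetical. When every cluster is sampled (k = K) but only 1% of the units within each cluster, the design-based formula keeps only the within-cluster term, which is much smaller than the between-cluster variability.

```python
import random
import statistics


def simulate_cluster_sample(k, m, mu, sigma_b, sigma_w, rng):
    """Draw m units from each of k clusters under y_ij = mu + a_i + e_ij.

    Assumption: units within a cluster are i.i.d. given the cluster effect,
    so sampling m of M units is simulated as m independent draws.
    """
    sample = []
    for _ in range(k):
        a_i = rng.gauss(0.0, sigma_b)  # cluster random effect
        sample.append([mu + a_i + rng.gauss(0.0, sigma_w) for _ in range(m)])
    return sample


def design_based_variance(sample, K, M):
    """Two-stage design-based variance estimate of the sample mean.

    Textbook form for equal-sized clusters with simple random sampling at
    both stages: (1 - f1) * s_b^2 / k + f1 * (1 - f2) * s_w^2 / (k * m).
    """
    k = len(sample)
    m = len(sample[0])
    cluster_means = [statistics.fmean(c) for c in sample]
    s_b2 = statistics.variance(cluster_means)  # between-cluster variance
    s_w2 = statistics.fmean([statistics.variance(c) for c in sample])
    f1 = k / K  # first-stage (cluster) sampling fraction
    f2 = m / M  # second-stage (final unit) sampling fraction
    return (1 - f1) * s_b2 / k + f1 * (1 - f2) * s_w2 / (k * m)


def superpopulation_variance(sample):
    """Between-cluster estimate s_b^2 / k of the variance about mu.

    Under the assumed model E[s_b^2] = sigma_b^2 + sigma_w^2 / m, so this
    targets sigma_b^2 / k + sigma_w^2 / (k * m).
    """
    k = len(sample)
    cluster_means = [statistics.fmean(c) for c in sample]
    return statistics.variance(cluster_means) / k


if __name__ == "__main__":
    rng = random.Random(0)
    K, M = 20, 1000  # clusters in the population, units per cluster
    sample = simulate_cluster_sample(k=K, m=10, mu=0.0,
                                     sigma_b=1.0, sigma_w=1.0, rng=rng)
    print("design-based:   ", design_based_variance(sample, K=K, M=M))
    print("superpopulation:", superpopulation_variance(sample))
```

Generating the data from a random-effects model rather than from survey data keeps the comparison self-contained; with real surveys the two estimators would be computed from the sampled clusters in the same way, subject to the design actually used.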

Article information

Source
Statist. Sci., Volume 17, Number 1 (2002), 73-96.

Dates
First available in Project Euclid: 11 June 2002

Permanent link to this document
https://projecteuclid.org/euclid.ss/1023798999

Digital Object Identifier
doi:10.1214/ss/1023798999

Mathematical Reviews number (MathSciNet)
MR1910075

Zentralblatt MATH identifier
1013.62005

Keywords
Cluster sampling; complex survey data; design-based inference; model-based inference; random effects; stratified sampling

Citation

Graubard, Barry I.; Korn, Edward L. Inference for Superpopulation Parameters Using Sample Surveys. Statist. Sci. 17 (2002), no. 1, 73-96. doi:10.1214/ss/1023798999. https://projecteuclid.org/euclid.ss/1023798999


References

  • ARNAB, R. (1992). Estimation of a finite population mean under superpopulation models. Comm. Statist. Theory Methods 21 1717-1724.
  • BELLHOUSE, D. R., THOMPSON, M. E. and GODAMBE, V. P. (1977). Two-stage sampling with exchangeable prior distributions. Biometrika 64 97-103.
  • BOUZA, C. N. (1995). Linear rank tests derived from a superpopulation model. Biometrical J. 37 497-506.
  • BRECKLING, J. U., CHAMBERS, R. L., DORFMAN, A. H., TAM, S. M. and WELSH, A. H. (1994). Maximum likelihood inference from sample survey data. Internat. Statist. Rev. 62 349-363.
  • BREEN, N. and KESSLER, L. (1994). Changes in the use of screening mammography: evidence from the 1987 and 1990 National Health Interview Surveys. Amer. J. Pub. Health 84 62-7.
  • CAMPBELL, C. (1977). Properties of ordinary and weighted least square estimators of regression coefficients for two-stage samples. In Proceedings of the Section on Social Statistics 800-805. Amer. Statist. Assoc., Alexandria, VA.
  • CASSEL, C., SÄRNDAL, C. and WRETMAN, H. H. (1977). Foundations of Inference in Survey Sampling. Wiley, New York.
  • CHAMBERS, R. L. (1986). Design-adjusted parameter estimation. J. Roy. Statist. Soc. Ser. A 149 161-173.
  • CHAMBERS, R. L., DORFMAN, A. H. and WANG, S. (1998). Limited information likelihood analysis of survey data. J. Roy. Statist. Soc. Ser. B 60 397-411.
  • CHRISTENSEN, R. (1984). A note on ordinary least squares methods for two-stage sampling. J. Amer. Statist. Assoc. 79 720-721.
  • CHRISTENSEN, R. (1987). The analysis of two-stage sampling data by ordinary least squares. J. Amer. Statist. Assoc. 82 492-498.
  • COCHRAN, W. G. (1939). The use of the analysis of variance in enumeration by sampling. J. Amer. Statist. Assoc. 34 492-510.
  • COCHRAN, W. G. (1946). Relative accuracy of systematic and stratified random samples for a certain class of populations. Ann. Math. Statist. 17 164-177.
  • COCHRAN, W. G. (1977). Sampling Techniques, 3rd ed. Wiley, New York.
  • COSSLETT, S. R. (1981). Maximum likelihood estimator for choice-based samples. Econometrica 49 1289-1316.
  • DEMETS, D. and HALPERIN, M. (1977). Estimation of a simple regression coefficient in samples arising from a sub-sampling procedure. Biometrics 33 47-56.
  • DEMING, W. E. (1953). On the distinction between enumerative and analytic surveys. J. Amer. Statist. Assoc. 48 244-255.
  • DEMING, W. E. and STEPHAN, F. F. (1941). On the interpretation of censuses as samples. J. Amer. Statist. Assoc. 36 45-49.
  • DUMOUCHEL, W. H. and DUNCAN, G. J. (1983). Using sample survey weights in multiple regression analyses of stratified samples. J. Amer. Statist. Assoc. 78 535-543.
  • DURBIN, J. (1953). Some results in sampling theory when the units are selected with unequal probabilities. J. Roy. Statist. Soc. Ser. B 15 262-269.
  • ELTINGE, J. L. and JANG, D. S. (1996). Stability measures of variance component estimators under a stratified multistage design. Survey Methodology 22 157-165.
  • ERICSON, W. A. (1969). Subjective Bayesian models in sampling finite populations (with discussion). J. Roy. Statist. Soc. Ser. B 31 195-233.
  • EZZATI, T. M., MASSEY, J. T., WAKSBERG, J., CHU, A. and MAURER, K. R. (1992). Sample design: third National Health and Nutrition Examination Survey. Vital Health Statist. 2.
  • FULLER, W. A. (1975). Regression analysis for sample survey. Sankhyā Ser. C 37 117-132.
  • GODAMBE, V. P. and THOMPSON, M. E. (1986). Parameters of superpopulation and survey population: their relationships and estimation. Internat. Statist. Rev. 54 127-138.
  • GOLDSTEIN, H. (1986). Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika 73 43-56.
  • GRAUBARD, B. I. and KORN, E. L. (1996a). Modelling the sampling design in the analysis of health surveys. Statist. Methods Medical Res. 5 263-281.
  • GRAUBARD, B. I. and KORN, E. L. (1996b). Survey inference for subpopulations. Amer. J. Epidemiol. 144 102-106.
  • GRAVES, E. J. (1988). Utilization of short-stay hospitals, United States, 1986, annual summary. Vital Health Statist. 13.
  • HANSEN, M. H., HURWITZ, W. N. and MADOW, W. G. (1953). Sample Survey Methods and Theory 1. Wiley, New York.
  • HANSEN, M. H., MADOW, W. G. and TEPPING, B. J. (1983). An evaluation of model-dependent and probability-sampling inferences in sample surveys (with discussion). J. Amer. Statist. Assoc. 78 776-807.
  • HARTLEY, H. O. and SIELKEN, R. L., Jr. (1975). A "superpopulation viewpoint" for finite population sampling. Biometrics 31 411-422.
  • HASLETT, S. (1985). The linear non-homogeneous estimator in sample surveys. Sankhyā Ser. B 47 101-117.
  • HAUSMAN, J. A. and WISE, D. A. (1981). Stratification of endogenous variables and estimation: the Gary income maintenance experiment. In Structural Analysis of Discrete Data with Econometric Applications (C. F. Manski and S. McFadden, eds.) 365-391. MIT Press, Cambridge, MA.
  • HOLT, D. and SCOTT, A. J. (1981). Regression analysis using survey data. The Statistician 30 169-178.
  • HOLT, D. and SMITH, T. M. F. (1979). Post stratification. J. Roy. Statist. Soc. Ser. A 142 33-46.
  • HOLT, D., SMITH, T. M. F. and WINTER, P. D. (1980). Regression analysis of data from complex surveys. J. Roy. Statist. Soc. Ser. A 143 474-487.
  • ISAKI, C. T. and FULLER, W. A. (1982). Survey design under the regression superpopulation model. J. Amer. Statist. Assoc. 77 89-96.
  • JEWELL, N. P. (1985). Least squares regression with data arising from stratified samples of the dependent variable. Biometrika 72 11-21.
  • KLEIN, L. R. and MORGAN, J. N. (1951). Results of alternative statistical treatments of sample survey data. J. Amer. Statist. Assoc. 46 442-460.
  • KONIJN, H. S. (1962). Regression analysis in sample surveys. J. Amer. Statist. Assoc. 57 590-606.
  • KOOP, J. C. (1986). Some problems of statistical inference from sample survey data for analytic studies. Statistics 17 237-247. [Correction (1992) Statistics 23 187.]
  • KORN, E. L. and GRAUBARD, B. I. (1990). Simultaneous testing of regression coefficients with complex survey data: Use of Bonferroni t statistics. Amer. Statist. 44 270-276.
  • KORN, E. L. and GRAUBARD, B. I. (1995). Analysis of large health surveys: accounting for the sampling design. J. Roy. Statist. Soc. Ser. A 158 263-295.
  • KORN, E. L. and GRAUBARD, B. I. (1998). Variance estimation for superpopulation parameters. Statist. Sinica 8 1131-1151.
  • KORN, E. L. and GRAUBARD, B. I. (1999). Analysis of Health Surveys. Wiley, New York.
  • KOTT, P. S. (1991). A model-based look at linear regression with survey data. Amer. Statist. 45 107-112.
  • KOTT, P. S. (1993). Comment on Potthoff, Woodbury, and Manton. Letter to the Editor. J. Amer. Statist. Assoc. 88 716.
  • KRIEGER, A. M. and PFEFFERMANN, D. (1992). Maximum likelihood estimation from complex sample surveys. Survey Methodology 18 225-239.
  • LEHMANN, E. L. (1975). Nonparametrics. Holden-Day, San Francisco.
  • LONGFORD, N. T. (1996). Model-based variance estimation in surveys with stratified clustered design. Austral. J. Statist. 38 333-352.
  • MAGEE, L. (1998). Improving survey-weighted least squares regression. J. Roy. Statist. Soc. Ser. B 60 115-126.
  • MASSEY, J. T., MOORE, T. F., PARSONS, V. L. and TADROS, W. (1989). Design and estimation for the National Health Interview Survey, 1985-1994. Vital Health Statist. 2.
  • NATHAN, G. and HOLT, D. (1980). The effect of survey design on regression analysis. J. Roy. Statist. Soc. Ser. B 42 377-386.
  • NATIONAL CENTER FOR HEALTH STATISTICS. (1994). Plan and operation of the Third National Health and Nutrition Examination Survey, 1988-1994. Vital Health Statist. 1.
  • NORDBERG, L. (1989). Generalized linear modeling of sample survey data. J. Official Statist. 5 223-239.
  • PATIL, G. P. and RAO, C. R. (1978). Weighted distributions and size-biased sampling with applications to wildlife populations and human families. Biometrics 34 179-189.
  • PFEFFERMANN, D. and HOLMES, D. J. (1985). Robustness considerations in the choice of a method of inference for regression analysis of survey data. J. Roy. Statist. Soc. Ser. A 148 268-278. [Correction (1985) J. Roy. Statist. Soc. Ser. A 148 357.]
  • PFEFFERMANN, D. and LAVANGE, L. (1989). Regression models for stratified multi-stage cluster samples. In Analysis of Complex Surveys (C. J. Skinner, D. Holt and T. M. F. Smith, eds.) 237-260. Wiley, New York.
  • PFEFFERMANN, D. and NATHAN, G. (1981). Regression analysis of data from a cluster sample. J. Amer. Statist. Assoc. 76 681-689.
  • PFEFFERMANN, D., SKINNER, C. J., HOLMES, D. J., GOLDSTEIN, H. and RASBASH, J. (1998). Weighting for unequal selection probabilities in multilevel models (with discussion). J. Roy. Statist. Soc. Ser. B 60 23-56.
  • PORTER, R. D. (1973). On the use of survey sample weights in the linear model. Ann. Econom. Social Measurement 2 141-158.
  • POTTHOFF, R. F., WOODBURY, M. A. and MANTON, K. G. (1992). "Equivalent sample size" and "equivalent degrees of freedom" refinements for inference using survey weights under superpopulation models. J. Amer. Statist. Assoc. 87 383-396.
  • QUESENBERRY, C. P., Jr. and JEWELL, N. P. (1986). Regression analysis based on stratified samples. Biometrika 73 605-614.
  • RAO, C. R. (1965). On discrete distributions arising out of methods of ascertainment. In Classical and Contagious Discrete Distributions (G. P. Patil, ed.) 320-332. Statistical Publishing Society, Calcutta.
  • RAO, J. N. K. (1985). Conditional inference in survey sampling. Survey Methodology 11 15-31.
  • ROYALL, R. M. and CUMBERLAND, W. G. (1981). An empirical study of the ratio estimator and its variance (with discussion). J. Amer. Statist. Assoc. 76 66-88.
  • SÄRNDAL, C. E. (1980). Two model-based inference arguments in survey sampling. Austral. J. Statist. 22 341-348.
  • SÄRNDAL, C. E., SWENSSON, B. and WRETMAN, J. (1992). Model Assisted Survey Sampling. Springer, New York.
  • SCHOENBORN, C. A. and MARANO, M. (1988). Current estimates from the National Health Interview Survey, United States, 1987. Vital Health Statist. 10.
  • SCOTT, A. J. and HOLT, D. (1982). The effect of two-stage sampling on ordinary least squares methods. J. Amer. Statist. Assoc. 77 848-854.
  • SCOTT, A. J. and SMITH, T. M. F. (1969). Estimation in multi-stage surveys. J. Amer. Statist. Assoc. 64 830-840.
  • SCOTT, A. J. and WILD, C. J. (1986). Fitting logistic models under case-control or choice based sampling. J. Roy. Statist. Soc. Ser. B 48 170-182.
  • SCOTT, A. J. and WILD, C. J. (1989). Selection based on the response variable in logistic regression. In Analysis of Complex Surveys (C. J. Skinner, D. Holt and T. M. F. Smith, eds.) 191-205. Wiley, New York.
  • SCOTT, A. J. and WILD, C. J. (1991). Fitting logistic regression models in stratified case-control studies. Biometrics 47 497-510.
  • SEDRANSK, J. (1965). Analytical surveys with cluster sampling. J. Roy. Statist. Soc. Ser. B 27 264-278.
  • SHAH, B. V., BARNWELL, B. G. and BIELER, G. S. (1997). SUDAAN User's Manual, Release 7.5. Research Triangle Institute, Research Triangle Park, NC.
  • SIMMONS, W. R. and SCHNACK, G. A. (1970). Development of the design of the NCHS Hospital Discharge Survey. Vital Health Statist. 2.
  • SKINNER, C. J. (1994). Sample models and weights. In Proceedings of the Section on Survey Research Methods 133-142. Amer. Statist. Assoc., Alexandria, VA.
  • SKINNER, C. J., HOLT, D. and SMITH, T. M. F., eds. (1989). Analysis of Complex Surveys. Wiley, New York.
  • TEN CATE, A. (1986). Regression analysis using survey data with endogenous design. Survey Methodology 12 121-138.
  • THOMSEN, I. (1978). Design and estimation problems when estimating a regression coefficient from survey data. Metrika 25 27-35.
  • YATES, F. (1981). Sampling Methods for Censuses and Surveys, 4th ed. Oxford Univ. Press.