Electronic Journal of Statistics

Estimating hidden population size using Respondent-Driven Sampling data

Mark S. Handcock, Krista J. Gile, and Corinne M. Mar

Full-text: Open access


Respondent-Driven Sampling (RDS) is an approach to sampling design and inference in hard-to-reach human populations. It is often used in situations where the target population is rare and/or stigmatized in the larger population, so that it is prohibitively expensive to contact them through the available frames. Common examples include injecting drug users, men who have sex with men, and female sex workers. Most analysis of RDS data has focused on estimating aggregate characteristics, such as disease prevalence. However, RDS is often conducted in settings where the population size is unknown and of great independent interest. This paper presents an approach to estimating the size of a target population based on data collected through RDS.

The proposed approach uses a successive sampling approximation to RDS to leverage information in the ordered sequence of observed personal network sizes. The inference uses the Bayesian framework, allowing for the incorporation of prior knowledge. A flexible class of priors for the population size is used that aids elicitation. An extensive simulation study provides insight into the performance of the method for estimating population size under a broad range of conditions. A further study shows the approach also improves estimation of aggregate characteristics. Finally, the method demonstrates sensible results when used to estimate the size of known networked populations from the National Longitudinal Study of Adolescent Health, and when used to estimate the size of a hard-to-reach population at high risk for HIV.
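The intuition behind the successive sampling approximation can be illustrated with a small simulation. The sketch below is not the authors' estimator and does not implement their Bayesian model; it only shows why the ordered sequence of network sizes carries information about population size. Units are drawn without replacement with probability proportional to their network size, so high-degree units tend to be drawn early, and in a small population they are depleted quickly. The function names, the uniform degree distribution on 1 to 20, and the population sizes are all assumptions chosen for illustration.

```python
import random


def successive_sample(sizes, n, rng):
    """Draw n distinct unit indices, each draw proportional to unit size."""
    remaining = list(range(len(sizes)))
    weights = [float(sizes[i]) for i in remaining]
    order = []
    for _ in range(n):
        # Choose a position among the units not yet sampled,
        # with probability proportional to its size.
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        order.append(remaining.pop(pos))
        weights.pop(pos)
    return order


def mean(values):
    return sum(values) / len(values)


def early_late_means(population_size, n, reps=50, seed=0):
    """Average size of the first and second half of the sample, over reps runs."""
    rng = random.Random(seed)
    early, late = [], []
    for _ in range(reps):
        # Hypothetical degree distribution: uniform on 1..20 (an assumption).
        sizes = [rng.randint(1, 20) for _ in range(population_size)]
        drawn = [sizes[i] for i in successive_sample(sizes, n, rng)]
        half = n // 2
        early.append(mean(drawn[:half]))
        late.append(mean(drawn[half:]))
    return mean(early), mean(late)


if __name__ == "__main__":
    for population_size in (300, 1000, 5000):
        e, l = early_late_means(population_size, n=200)
        print(population_size, round(e, 2), round(l, 2), round(e - l, 2))
```

With a fixed sample size, the drop in mean network size from the first half of the sample to the second half should be pronounced when the population is small and nearly absent when it is large. A model-based approach can exploit this decline to learn about the population size.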

Article information

Electron. J. Statist., Volume 8, Number 1 (2014), 1491-1521.

First available in Project Euclid: 2 September 2014

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1409619420

Digital Object Identifier
doi:10.1214/14-EJS923

Primary: 91D30: Social networks; 62D05: Sampling theory, sample surveys
Secondary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]

Hard-to-reach population sampling; network sampling; social networks; successive sampling; model-based survey sampling


Handcock, Mark S.; Gile, Krista J.; Mar, Corinne M. Estimating hidden population size using Respondent-Driven Sampling data. Electron. J. Statist. 8 (2014), no. 1, 1491-1521. doi:10.1214/14-EJS923. https://projecteuclid.org/euclid.ejs/1409619420

