## The Annals of Statistics

### A critical threshold for design effects in network sampling

Karl Rohe

#### Abstract

Web crawling, snowball sampling, and respondent-driven sampling (RDS) are three types of network sampling techniques used to contact individuals in hard-to-reach populations. This paper studies these procedures as a Markov process on the social network that is indexed by a tree. Each node in this tree corresponds to an observation and each edge in the tree corresponds to a referral. Indexing with a tree (instead of a chain) allows for the sampled units to refer multiple future units into the sample.
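
The tree-indexed process described above can be illustrated with a minimal simulation. The sketch below is not taken from the paper: it assumes a hypothetical two-state symmetric Markov chain (a person's group is 0 or 1), a complete tree in which every participant refers exactly `m` others, and the illustrative names `sample_referral_tree`, `p_stay`, and `depth`.

```python
import random


def sample_referral_tree(p_stay, m, depth, seed=0):
    """Simulate a two-state Markov process indexed by a complete m-ary tree.

    Assumptions (illustrative, not from the paper): each node's state is 0 or 1,
    a referred node keeps its referrer's state with probability p_stay, and
    every node refers exactly m new nodes for `depth` waves.
    Returns the sampled states in breadth-first (wave-by-wave) order.
    """
    rng = random.Random(seed)
    root = rng.randrange(2)  # uniform start, the stationary law of a symmetric chain
    states = [root]
    current_wave = [root]
    for _ in range(depth):
        next_wave = []
        for parent_state in current_wave:
            # Each edge of the tree is one referral: the child's state depends
            # only on the state of the node that referred it.
            for _ in range(m):
                stays = rng.random() < p_stay
                next_wave.append(parent_state if stays else 1 - parent_state)
        states.extend(next_wave)
        current_wave = next_wave
    return states


sample = sample_referral_tree(p_stay=0.9, m=2, depth=5)
estimate = sum(sample) / len(sample)  # sample proportion of state 1
print(len(sample), estimate)
```

With `m = 1` the tree reduces to an ordinary Markov chain; larger `m` makes the sample grow geometrically by wave, so many observations share recent common ancestors.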

In survey sampling, the design effect characterizes the additional variance induced by a novel sampling strategy. If the design effect is some value $\operatorname{DE}$, then constructing an estimator from the novel design makes the variance of the estimator $\operatorname{DE}$ times greater than it would be under a simple random sample with the same sample size $n$. Under certain assumptions on the referral tree, the design effect of network sampling has a critical threshold that is a function of the referral rate $m$ and the clustering structure in the social network, represented by the second eigenvalue of the Markov transition matrix, $\lambda_{2}$. If $m<1/\lambda_{2}^{2}$, then the design effect is finite (i.e., the standard estimator is $\sqrt{n}$-consistent). However, if $m>1/\lambda_{2}^{2}$, then the design effect grows with $n$ (i.e., the standard estimator is no longer $\sqrt{n}$-consistent). Past this critical threshold, the standard error of the estimator converges at the slower rate of $n^{\log_{m}\lambda_{2}}$. The Markov model allows for nodes to be resampled; computational results show that the findings hold in without-replacement sampling. To estimate confidence intervals that adapt to the correct level of uncertainty, a novel resampling procedure is proposed. Computational experiments compare this procedure to previous techniques.
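
The threshold stated above can be checked numerically. The following sketch only encodes the two regimes from the abstract; the function names are hypothetical, and it assumes $0<\lambda_{2}<1$ and $m>1$. The boundary case $m=1/\lambda_{2}^{2}$ is not covered by the abstract and is grouped with the upper branch here purely for convenience.

```python
import math


def critical_referral_rate(lambda2):
    """Referral rate at which the regime changes: 1 / lambda2**2."""
    return 1.0 / lambda2 ** 2


def rate_exponent(m, lambda2):
    """Exponent a such that the standard error scales like n**a.

    Below the threshold the usual root-n rate applies (a = -1/2).
    Above it the abstract gives a = log_m(lambda2), which is closer to zero.
    Assumes 0 < lambda2 < 1 and m > 1 (assumptions of this sketch).
    """
    if m < critical_referral_rate(lambda2):
        return -0.5
    return math.log(lambda2, m)  # log base m of lambda2


lambda2 = 0.8
print(critical_referral_rate(lambda2))  # threshold on m
for m in (1.5, 3.0):
    print(m, rate_exponent(m, lambda2))
```

At the threshold itself, $\lambda_{2}=m^{-1/2}$, so $\log_{m}\lambda_{2}=-1/2$ and the two rates agree; for larger $m$ the exponent moves toward zero, which is the slower convergence described above.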

#### Article information

Source
Ann. Statist., Volume 47, Number 1 (2019), 556-582.

Dates
Revised: February 2018
First available in Project Euclid: 30 November 2018

Permanent link to this document
https://projecteuclid.org/euclid.aos/1543568598

Digital Object Identifier
doi:10.1214/18-AOS1700

Mathematical Reviews number (MathSciNet)
MR3909942

Zentralblatt MATH identifier
07036211

#### Citation

Rohe, Karl. A critical threshold for design effects in network sampling. Ann. Statist. 47 (2019), no. 1, 556–582. doi:10.1214/18-AOS1700. https://projecteuclid.org/euclid.aos/1543568598

#### Supplemental materials

• Supplement: Proofs for Sections 3 and 4. Due to space constraints, the proofs of the results in Sections 3 and 4 are given in this supplement. It also contains an additional computational experiment that studies the widths of the bootstrap confidence intervals.