## The Annals of Statistics

### Large sample theory for merged data from multiple sources

Takumi Saegusa

#### Abstract

We develop large sample theory for merged data from multiple sources. The main statistical issues treated in this paper are that (1) the same unit may appear in multiple datasets from overlapping data sources, (2) duplicated items are not identified and (3) a sample from the same data source is dependent due to sampling without replacement. We propose and study a new weighted empirical process and extend empirical process theory to a dependent and biased sample with duplication. Specifically, we establish the uniform law of large numbers and uniform central limit theorem over a class of functions along with several empirical process results under conditions identical to those in the i.i.d. setting. As applications, we study infinite-dimensional $M$-estimation and develop its consistency, rates of convergence and asymptotic normality. Our theoretical results are illustrated with simulation studies and a real data example.

#### Article information

Source
Ann. Statist., Volume 47, Number 3 (2019), 1585-1615.

Dates
Revised: May 2018
First available in Project Euclid: 13 February 2019

Permanent link to this document
https://projecteuclid.org/euclid.aos/1550026850

Digital Object Identifier
doi:10.1214/18-AOS1727

Mathematical Reviews number (MathSciNet)
MR3911123

Zentralblatt MATH identifier
07053519

#### Citation

Saegusa, Takumi. Large sample theory for merged data from multiple sources. Ann. Statist. 47 (2019), no. 3, 1585–1615. doi:10.1214/18-AOS1727. https://projecteuclid.org/euclid.aos/1550026850

#### References

• [1] Alexander, K. S. (1985). Rates of growth for weighted empirical processes. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Vol. II (Berkeley, Calif., 1983). Wadsworth Statist./Probab. Ser. 475–493. Wadsworth, Belmont, CA.
• [2] Bae, J. and Levental, S. (1995). Uniform CLT for Markov chains and its invariance principle: A martingale approach. J. Theoret. Probab. 8 549–570.
• [3] Berkes, I. and Philipp, W. (1977/78). An almost sure invariance principle for the empirical distribution function of mixing random variables. Z. Wahrsch. Verw. Gebiete 41 115–137.
• [4] Bertail, P., Chautru, E. and Clémençon, S. (2017). Empirical processes in survey sampling with (conditional) Poisson designs. Scand. J. Stat. 44 97–111.
• [5] Boistard, H., Lopuhaä, H. P. and Ruiz-Gazen, A. (2017). Functional central limit theorems for single-stage sampling designs. Ann. Statist. 45 1728–1758.
• [6] Breslow, N. E. and Chatterjee, N. (1999). Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. J. R. Stat. Soc. Ser. C. Appl. Stat. 48 457–468.
• [7] Breslow, N. E. and Wellner, J. A. (2007). Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scand. J. Stat. 34 86–102.
• [8] Breslow, N. E. and Wellner, J. A. (2008). A $Z$-theorem with estimated nuisance parameters and correction note for: “Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression” [Scand. J. Statist. 34 (2007) 86–102; MR2325244]. Scand. J. Stat. 35 186–192.
• [9] Brick, J. M., Dipko, S., Presser, S., Tucker, C. and Yuan, Y. (2006). Nonresponse bias in a dual frame sample of cell and landline numbers. Public Opin. Q. 70 780–793.
• [10] Cantelli, F. P. (1933). Sulla determinazione empirica delle leggi di probabilità. G. Ist. Ital. Attuari 4 421–424.
• [11] Cervantes, I. F., Jones, M. E., Rojas, L. A., Brick, J. M., Kurata, J. and Grant, D. (2006). A review of the sample design for the California health interview survey. In Proceedings of the Social Statistics Section 3023–3030. Amer. Statist. Assoc., Alexandria, VA.
• [12] Chatterjee, N., Chen, Y.-H., Maas, P. and Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. J. Amer. Statist. Assoc. 111 107–117.
• [13] Cox, D. R. (1972). Regression models and life-tables. J. Roy. Statist. Soc. Ser. B 34 187–220.
• [14] D’Angio, G. J., Breslow, N., Beckwith, J. B., Evans, A., Baum, H., deLorimier, A., Fernbach, D., Hrabovsky, E., Jones, B. and Kelalis, P. (1989). Treatment of Wilms’ tumor. Results of the Third National Wilms’ Tumor Study. Cancer 64 349–360.
• [15] de Leeuw, E. D. (2005). To mix or not to mix data collection modes in surveys. J. Off. Stat. 21 233–255.
• [16] Deville, J.-C. and Särndal, C.-E. (1992). Calibration estimators in survey sampling. J. Amer. Statist. Assoc. 87 376–382.
• [17] Dillman, D. A., Smyth, J. D. and Christian, L. M. (2014). Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method, 4th ed. Wiley, New York.
• [18] Ding, Y. and Nan, B. (2011). A sieve $M$-theorem for bundled parameters in semiparametric models, with application to the efficient estimation in a linear model for censored data. Ann. Statist. 39 3032–3061.
• [19] Donsker, M. D. (1952). Justification and extension of Doob’s heuristic approach to the Kolmogorov–Smirnov theorems. Ann. Math. Stat. 23 277–281.
• [20] Dudley, R. M. (1981). Donsker classes of functions. In Statistics and Related Topics (Ottawa, Ont., 1980) 341–352. North-Holland, Amsterdam.
• [21] Fellegi, I. P. and Sunter, A. B. (1969). A theory for record linkage. J. Amer. Statist. Assoc. 64 1183–1210.
• [22] Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab. 12 929–998.
• [23] Glivenko, V. (1933). Sulla determinazione empirica della legge di probabilità. G. Ist. Ital. Attuari 4 92–99.
• [24] Hájek, J. (1960). Limiting distributions in simple random sampling from a finite population. Magy. Tud. Akad. Mat. Kut. Intéz. Közl. 5 361–374.
• [25] Hartley, H. O. (1962). Multiple frame surveys. In Proceedings of the Social Statistics Section 203–206. Amer. Statist. Assoc., Alexandria, VA.
• [26] Hartley, H. O. (1974). Multiple frame methodology and selected applications. Sankhyā, Ser. C 36 99–118.
• [27] Hartley, H. O. and Sielken, R. L. Jr. (1975). A “super-population viewpoint” for finite population sampling. Biometrics 31 411–422.
• [28] Herzog, T. N., Scheuren, F. J. and Winkler, W. E. (2007). Data Quality and Record Linkage Techniques. 1st ed. Springer, Berlin.
• [29] Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47 663–685.
• [30] Hu, S. S., Balluz, L., Battaglia, M. P. and Frankel, M. R. (2011). Improving public health surveillance using a dual-frame survey of landline and cell phone numbers. Am. J. Epidemiol. 173 703–711.
• [31] Huang, J. (1996). Efficient estimation for the proportional hazards model with interval censoring. Ann. Statist. 24 540–568.
• [32] Huang, J. and Wellner, J. A. (1997). Interval censored survival data: A review of recent progress. In Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis 123–169. Springer, Berlin.
• [33] Iachan, R. and Dennis, M. L. (1993). A multiple frame approach to sampling the homeless and transient population. J. Off. Stat. 9 747–764.
• [34] Isaki, C. T. and Fuller, W. A. (1982). Survey design under the regression superpopulation model. J. Amer. Statist. Assoc. 77 89–96.
• [35] Kalton, G. and Anderson, D. W. (1986). Sampling rare populations. J. Roy. Statist. Soc. Ser. A 149 65–82.
• [36] Keiding, N. and Louis, T. A. (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys. J. Roy. Statist. Soc. Ser. A 179 319–376.
• [37] Kim, G. and Chambers, R. (2012). Regression analysis under incomplete linkage. Comput. Statist. Data Anal. 56 2756–2770.
• [38] Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer Series in Statistics. Springer, New York.
• [39] Lahiri, P. and Larsen, M. D. (2005). Regression analysis with linked data. J. Amer. Statist. Assoc. 100 222–230.
• [40] Levental, S. (1989). A uniform CLT for uniformly bounded families of martingale differences. J. Theoret. Probab. 2 271–287.
• [41] Lohr, S. and Rao, J. N. K. (2006). Estimation in multiple-frame surveys. J. Amer. Statist. Assoc. 101 1019–1030.
• [42] Lu, Y. (2012). Regression coefficient estimation in dual frame surveys. In Proceedings of the Section on Survey Research Methods 4687–4695. Amer. Statist. Assoc., Alexandria, VA.
• [43] Lu, Y. and Lohr, S. L. (2010). Gross flow estimation in dual frame surveys. Surv. Methodol. 36 13–22.
• [44] Ma, S. and Kosorok, M. R. (2005). Robust semiparametric M-estimation and the weighted bootstrap. J. Multivariate Anal. 96 190–217.
• [45] Metcalf, P. and Scott, A. (2009). Using multiple frames in health surveys. Stat. Med. 28 1512–1523.
• [46] Præstgaard, J. and Wellner, J. A. (1993). Exchangeably weighted bootstraps of the general empirical process. Ann. Probab. 21 2053–2086.
• [47] Ranalli, M. G., Arcos, A., del Mar Rueda, M. and Teodoro, A. (2016). Calibration estimation in dual-frame surveys. Stat. Methods Appl. 25 321–349.
• [48] Rao, J. N. K. (1994). Estimating totals and distribution functions using auxiliary information at the estimation stage. J. Off. Stat. 10 153–165.
• [49] Rao, J. N. K. and Wu, C. (2010). Pseudo-empirical likelihood inference for multiple frame surveys. J. Amer. Statist. Assoc. 105 1494–1503.
• [50] Rubin-Bleuer, S. and Schiopu Kratina, I. (2005). On the two-phase framework for joint model and design-based inference. Ann. Statist. 33 2789–2810.
• [51] Saegusa, T. (2019). Supplement to “Large sample theory for merged data from multiple sources.” DOI:10.1214/18-AOS1727SUPP.
• [52] Saegusa, T. and Wellner, J. A. (2013). Weighted likelihood estimation under two-phase sampling. Ann. Statist. 41 269–295.
• [53] Skinner, C. J. (1991). On the efficiency of raking ratio estimation for multiple frame surveys. J. Amer. Statist. Assoc. 86 779–784.
• [54] Skinner, C. J. and Rao, J. N. K. (1996). Estimation in dual frame surveys with complex designs. J. Amer. Statist. Assoc. 91 349–356.
• [55] van der Vaart, A. (2002). Semiparametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1999). Lecture Notes in Math. 1781 331–457. Springer, Berlin.
• [56] van der Vaart, A. W. (1995). Efficiency of infinite-dimensional $M$-estimators. Stat. Neerl. 49 9–30.
• [57] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics 3. Cambridge Univ. Press, Cambridge.
• [58] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, New York.
• [59] Winkler, W. E. (1995). Matching and record linkage. In Business Survey Methods 353–384. Wiley, New York.
• [60] Ziegler, K. (1997). Functional central limit theorems for triangular arrays of function-indexed processes under uniformly integrable entropy conditions. J. Multivariate Anal. 62 233–272.
• [61] Ziegler, K. (2001). Uniform laws of large numbers for triangular arrays of function-indexed processes under random entropy conditions. Results Math. 39 374–389.

#### Supplemental materials

• Supplement to “Large sample theory for merged data from multiple sources.” The proofs and additional simulations are given in the Supplement [51].