## The Annals of Statistics

### Semi-supervised inference: General theory and estimation of means

#### Abstract

We propose a general semi-supervised inference framework focused on the estimation of the population mean. As usual in semi-supervised settings, there exists an unlabeled sample of covariate vectors and a labeled sample consisting of covariate vectors along with real-valued responses (“labels”). Otherwise, the formulation is “assumption-lean” in that no major conditions are imposed on the statistical or functional form of the data. We consider both the ideal semi-supervised setting where infinitely many unlabeled samples are available, as well as the ordinary semi-supervised setting in which only a finite number of unlabeled samples is available.

Estimators are proposed along with corresponding confidence intervals for the population mean. Theoretical analyses of both the asymptotic distribution and the $\ell_{2}$-risk of the proposed procedures are given. Surprisingly, the proposed estimators, based on a simple form of the least squares method, outperform the ordinary sample mean. The simple, transparent form of the estimator supports the expectation that its asymptotic improvement over the ordinary sample mean nearly holds even for moderate sample sizes. The method is further extended to a nonparametric setting, in which the oracle rate can be achieved asymptotically. The proposed estimators are further illustrated by simulation studies and a real data example involving estimation of the homeless population.
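To make the least-squares idea concrete, the following is an illustrative sketch (not the paper's exact procedure) of a regression-adjusted mean estimator in this spirit: fit OLS on the labeled sample, then shift the labeled-sample mean of $y$ by the fitted slope applied to the difference between the pooled (labeled plus unlabeled) covariate mean and the labeled covariate mean. The simulated data, sample sizes, and coefficient values below are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled sample: covariates X (n x p) with responses y;
# unlabeled sample: covariates X_u only (m x p).
n, m, p = 200, 2000, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)  # population mean of y is 0 here
X_u = rng.normal(size=(m, p))


def ls_semisupervised_mean(X, y, X_u):
    """Regression-adjusted semi-supervised mean estimator (sketch).

    theta_hat = ybar + beta_hat' (xbar_pooled - xbar_labeled),
    where beta_hat is the OLS slope fitted on the labeled sample.
    """
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])        # add an intercept column
    coef, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    beta_hat = coef[1:]                          # slope part only
    xbar_labeled = X.mean(axis=0)
    xbar_pooled = np.vstack([X, X_u]).mean(axis=0)
    return y.mean() + beta_hat @ (xbar_pooled - xbar_labeled)


theta_hat = ls_semisupervised_mean(X, y, X_u)
```

When the covariates are informative about the response, the adjustment term cancels part of the sampling noise in the labeled-sample mean, which is the intuition behind the improvement over the ordinary sample mean discussed above; with infinitely many unlabeled samples, the pooled covariate mean is replaced by the known population mean.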

#### Article information

Source
Ann. Statist., Volume 47, Number 5 (2019), 2538-2566.

Dates
Received: August 2017
Revised: August 2018
First available in Project Euclid: 3 August 2019

Permanent link to this document
https://projecteuclid.org/euclid.aos/1564797856

Digital Object Identifier
doi:10.1214/18-AOS1756

Mathematical Reviews number (MathSciNet)
MR3988765

#### Citation

Zhang, Anru; Brown, Lawrence D.; Cai, T. Tony. Semi-supervised inference: General theory and estimation of means. Ann. Statist. 47 (2019), no. 5, 2538--2566. doi:10.1214/18-AOS1756. https://projecteuclid.org/euclid.aos/1564797856

#### References

• Ando, R. K. and Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6 1817–1853.
• Ando, R. K. and Zhang, T. (2007). Two-view feature generation model for semi-supervised learning. In Proceedings of the 24th International Conference on Machine Learning 25–32. ACM, New York.
• Azriel, D., Brown, L. D., Sklar, M., Berk, R., Buja, A. and Zhao, L. (2016). Semi-supervised linear regression. Preprint. Available at arXiv:1612.02391.
• Bickel, P. J., Ritov, Y. and Wellner, J. A. (1991). Efficient estimation of linear functionals of a probability measure $P$ with known marginal distributions. Ann. Statist. 19 1316–1346.
• Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer, New York. Reprint of the 1993 original.
• Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (Madison, WI, 1998) 92–100. ACM, New York.
• Bratley, P., Fox, B. L. and Schrage, L. E. (1987). A Guide to Simulation. Springer, New York.
• Buja, A., Berk, R., Brown, L., George, E., Pitkin, E., Traskin, M., Zhan, K. and Zhao, L. (2014). Models as approximations, Part I: A conspiracy of nonlinearity and random regressors in linear regression. Preprint. Available at arXiv:1404.1578.
• Buja, A., Berk, R., Brown, L., George, E., Kuchibhotla, A. K. and Zhao, L. (2016). Models as approximations—Part II: A general theory of model-robust regression. Preprint. Available at arXiv:1612.03257.
• Chakrabortty, A. and Cai, T. (2018). Efficient and adaptive linear regression in semi-supervised settings. Ann. Statist. 46 1541–1572.
• Cochran, W. G. (1953). Sampling Techniques. Wiley, New York.
• Deng, L.-Y. and Wu, C.-F. J. (1987). Estimation of variance of the regression estimator. J. Amer. Statist. Assoc. 82 568–576.
• Fishman, G. S. (1996). Monte Carlo: Concepts, Algorithms, and Applications. Springer, New York.
• Graham, B. S. (2011). Efficiency bounds for missing data models with semiparametric restrictions. Econometrica 79 437–452.
• Hansen, B. E. (2017). Econometrics. Book draft. Available at http://www.ssc.wisc.edu/~bhansen/econometrics/.
• Hasminskii, R. Z. and Ibragimov, I. A. (1983). On asymptotic efficiency in the presence of an infinite-dimensional nuisance parameter. In Probability Theory and Mathematical Statistics (Tbilisi, 1982). Lecture Notes in Math. 1021 195–229. Springer, Berlin.
• Hickernell, F. J., Lemieux, C. and Owen, A. B. (2005). Control variates for quasi-Monte Carlo. Statist. Sci. 20 1–31.
• Johnson, R. and Zhang, T. (2008). Graph-based semi-supervised learning and spectral kernel design. IEEE Trans. Inform. Theory 54 275–288.
• Kriegler, B. and Berk, R. (2010). Small area estimation of the homeless in Los Angeles: An application of cost-sensitive stochastic gradient boosting. Ann. Appl. Stat. 4 1234–1255.
• Kuchibhotla, A. (2017). Research notes on efficiency in semi-supervised problems. Available from the author at arunku@wharton.upenn.edu.
• Lafferty, J. D. and Wasserman, L. (2008). Statistical analysis of semi-supervised regression. In Advances in Neural Information Processing Systems 801–808.
• Lohr, S. (2009). Sampling: Design and Analysis. Nelson Education.
• Peng, H. and Schick, A. (2002). On efficient estimation of linear functionals of a bivariate distribution with known marginals. Statist. Probab. Lett. 59 83–91.
• Pitkin, E., Berk, R., Brown, L., Buja, A., George, E., Zhang, K. and Zhao, L. (2013). Improved precision in estimating average treatment effects. Preprint. Available at arXiv:1311.0291.
• Rossi, P. H. (1991). Strategies for homeless research in the 1990s. Hous. Policy Debate 2 1027–1055.
• Rubin, D. B. (1990). Comment on J. Neyman and causal inference in experiments and observational studies: “On the application of probability theory to agricultural experiments. Essay on principles. Section 9” [Ann. Agric. Sci. 10 (1923), 1–51]. Statist. Sci. 5 472–480.
• Spława-Neyman, J. (1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statist. Sci. 5 465–472. Translated from the Polish and edited by D. M. Dabrowska and T. P. Speed.
• van der Vaart, A. (2002). Semiparametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1999). Lecture Notes in Math. 1781 331–457. Springer, Berlin.
• Vapnik, V. N. (2013). The Nature of Statistical Learning Theory. Springer, Berlin.
• Wang, J. and Shen, X. (2007). Large margin semi-supervised learning. J. Mach. Learn. Res. 8 1867–1891.
• Wang, J., Shen, X. and Liu, Y. (2008). Probability estimation for large-margin classifiers. Biometrika 95 149–167.
• Wang, J., Shen, X. and Pan, W. (2009). On efficient large margin semisupervised learning: Method and theory. J. Mach. Learn. Res. 10 719–742.
• Yaskov, P. (2014). Lower bounds on the smallest eigenvalue of a sample covariance matrix. Electron. Commun. Probab. 19 no. 83, 10.
• Zhang, A., Brown, L. D. and Cai, T. T. (2019). Supplement to “Semi-supervised inference: General theory and estimation of means.” DOI:10.1214/18-AOS1756SUPP.
• Zhu, X. (2008). Semi-supervised learning literature survey. Technical report.
• Zhu, X. and Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synth. Lect. Artif. Intell. Mach. Learn. 3 1–130. DOI:10.2200/S00196ED1V01Y200906AIM006.

#### Supplemental materials

• Supplement to “Semi-supervised inference: General theory and estimation of means”. The supplement contains additional proofs for the main results of the paper.