The Annals of Statistics

Efficient and adaptive linear regression in semi-supervised settings

Abhishek Chakrabortty and Tianxi Cai

Abstract

We consider the linear regression problem in semi-supervised settings, wherein the available data typically consist of (i) a small or moderate-sized “labeled” dataset and (ii) a much larger “unlabeled” dataset. Such data arise naturally in settings where the outcome, unlike the covariates, is expensive to obtain, a frequent scenario in modern studies involving large databases such as electronic medical records (EMR). Supervised estimators, such as the ordinary least squares (OLS) estimator, utilize only the labeled data. It is therefore of interest to investigate whether, and when, the unlabeled data can be exploited to improve estimation of the regression parameter in the adopted linear model.
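To fix ideas, here is a minimal sketch of the data structure just described, together with the supervised OLS baseline. All names (n, N, X_lab, y_lab, X_unl) and the data-generating model are illustrative assumptions for this sketch, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, p = 200, 10000, 5              # labeled size n, much larger unlabeled size N

X_lab = rng.normal(size=(n, p))      # labeled covariates (outcome observed)
X_unl = rng.normal(size=(N, p))      # unlabeled covariates (outcome missing)
beta = np.ones(p)
# outcome with a nonlinear term, so the working linear model is mis-specified
y_lab = X_lab @ beta + 0.5 * X_lab[:, 0] ** 2 + rng.normal(size=n)

# supervised baseline: OLS using only the n labeled observations
Z_lab = np.column_stack([np.ones(n), X_lab])
beta_ols, *_ = np.linalg.lstsq(Z_lab, y_lab, rcond=None)
```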

In this paper, we propose a class of “Efficient and Adaptive Semi-Supervised Estimators” (EASE) to improve estimation efficiency. The EASE are two-step estimators adaptive to model mis-specification, yielding improved (in some cases, optimal) efficiency under model mis-specification, and equal (optimal) efficiency when the linear model holds. This adaptive property, often unaddressed in the existing literature, is crucial for advocating “safe” use of the unlabeled data. The construction of EASE primarily involves a flexible “semi-nonparametric” imputation, including a smoothing step that works well even when the number of covariates is not small, and a follow-up “refitting” step together with a cross-validation (CV) strategy, both of which have useful practical as well as theoretical implications for addressing two important issues: under-smoothing and over-fitting. We establish asymptotic results, including consistency, asymptotic normality and the adaptive properties of EASE. We also provide influence function expansions and a “double” CV strategy for inference. The results are further validated through extensive simulations, followed by an application to an EMR study on autoimmunity.
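Continuing the sketch above (and reusing numpy and the arrays defined there), the following illustrates a generic impute-then-correct construction in the spirit of the two-step estimators described here. It is a simplified stand-in, not the paper’s EASE procedure: a k-nearest-neighbor smoother replaces the semi-nonparametric (dimension-reduced kernel) smoothing, and the paper’s refitting and CV details are reduced to a single residual-regression correction.

```python
from sklearn.neighbors import KNeighborsRegressor

# Step 1 (imputation): fit a flexible regression of y on X from the labeled
# data. A kNN smoother stands in for the paper's semi-nonparametric
# smoothing step; the paper's CV scheme is omitted in this sketch.
smoother = KNeighborsRegressor(n_neighbors=20).fit(X_lab, y_lab)

# Step 2 (imputed OLS): regress the imputed outcomes on the covariates over
# the large unlabeled sample.
Z_unl = np.column_stack([np.ones(N), X_unl])
beta_imp, *_ = np.linalg.lstsq(Z_unl, smoother.predict(X_unl), rcond=None)

# Refitting-style bias correction: regress the labeled residuals on the
# covariates and add the resulting correction, so that the component of the
# smoothing bias captured by the linear model is removed.
resid = y_lab - smoother.predict(X_lab)
beta_corr, *_ = np.linalg.lstsq(Z_lab, resid, rcond=None)
beta_ss = beta_imp + beta_corr       # two-step semi-supervised estimate
```

When the labeled and unlabeled covariates share the same distribution, the correction term removes the first-order bias left by the smoother; this is the sense in which a refitting step supports “safe” use of the unlabeled data.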

Article information

Source
Ann. Statist., Volume 46, Number 4 (2018), 1541–1572.

Dates
Received: January 2017
First available in Project Euclid: 27 June 2018

Permanent link to this document
https://projecteuclid.org/euclid.aos/1530086425

Digital Object Identifier
doi:10.1214/17-AOS1594

Mathematical Reviews number (MathSciNet)
MR3819109

Zentralblatt MATH identifier
06936470

Subjects
Primary: 62F35: Robustness and adaptive procedures; 62J05: Linear regression; 62F12: Asymptotic properties of estimators; 62G08: Nonparametric regression

Keywords
Semi-supervised linear regression; semiparametric inference; model mis-specification; adaptive estimation; semi-nonparametric imputation

Citation

Chakrabortty, Abhishek; Cai, Tianxi. Efficient and adaptive linear regression in semi-supervised settings. Ann. Statist. 46 (2018), no. 4, 1541–1572. doi:10.1214/17-AOS1594. https://projecteuclid.org/euclid.aos/1530086425


References

  • Andrews, D. W. K. (1995). Nonparametric kernel estimation for semiparametric models. Econometric Theory 11 560–596.
  • Belkin, M., Niyogi, P. and Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 7 2399–2434.
  • Castelli, V. and Cover, T. M. (1995). The exponential value of labeled samples. Pattern Recogn. Lett. 16 105–111.
  • Castelli, V. and Cover, T. M. (1996). The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. Inform. Theory 42 2102–2117.
  • Chakrabortty, A. and Cai, T. (2018). Supplement to “Efficient and adaptive linear regression in semi-supervised settings.” DOI:10.1214/17-AOS1594SUPP.
  • Chapelle, O., Schölkopf, B. and Zien, A. (2006). Semi-Supervised Learning. MIT Press, Cambridge, MA.
  • Cook, R. D. (1998). Principal Hessian directions revisited. J. Amer. Statist. Assoc. 93 84–100.
  • Cook, R. D. and Lee, H. (1999). Dimension reduction in binary response regression. J. Amer. Statist. Assoc. 94 1187–1200.
  • Cook, R. D. and Weisberg, S. (1991). Discussion of “Sliced inverse regression” by K.-C. Li. J. Amer. Statist. Assoc. 86 328–332.
  • Cozman, F. G. and Cohen, I. (2001). Unlabeled data can degrade classification performance of generative classifiers. Technical Report No. HPL-2001-234, HP Laboratories, Palo Alto, CA, USA.
  • Cozman, F. G., Cohen, I. and Cirelo, M. C. (2003). Semi-supervised learning of mixture models. In Proceedings of the Twentieth ICML 99–106.
  • Duan, N. and Li, K.-C. (1991). Slicing regression: A link-free regression method. Ann. Statist. 19 505–530.
  • Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent data. Econometric Theory 24 726–748.
  • Kawakita, M. and Kanamori, T. (2013). Semi-supervised learning with density-ratio estimation. Mach. Learn. 91 189–209.
  • Kohane, I. S. (2011). Using electronic health records to drive discovery in disease genomics. Nat. Rev. Genet. 12 417–428.
  • Lafferty, J. D. and Wasserman, L. (2007). Statistical analysis of semi-supervised regression. Adv. Neural Inf. Process. Syst. 20 801–808.
  • Li, K.-C. (1991). Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 86 316–342.
  • Li, K.-C. (1992). On principal Hessian directions for data visualization and dimension reduction: Another application of Stein’s lemma. J. Amer. Statist. Assoc. 87 1025–1039.
  • Liao, K. P., Cai, T., Gainer, V. et al. (2010). Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care and Research 62 1120–1127.
  • Masry, E. (1996). Multivariate local polynomial regression for time series: Uniform strong consistency and rates. J. Time Series Anal. 17 571–599.
  • Newey, W. K. (1994). Kernel estimation of partial means and a general variance estimator. Econometric Theory 10 233–253.
  • Newey, W. K., Hsieh, F. and Robins, J. (1998). Undersmoothing and bias corrected functional estimation. Technical Report No. 98-17, Dept. of Economics, MIT, USA.
  • Nigam, K. P. (2001). Using unlabeled data to improve text classification. Ph.D. thesis, Carnegie Mellon Univ., USA. CMU-CS-01-126.
  • Nigam, K., McCallum, A. K., Thrun, S. and Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Mach. Learn. 39 103–134.
  • Seeger, M. (2002). Learning with labeled and unlabeled data. Technical Report No. EPFL-REPORT-161327, Univ. Edinburgh, UK.
  • Sokolovska, N., Cappé, O. and Yvon, F. (2008). The asymptotics of semi-supervised learning in discriminative probabilistic models. In Proceedings of the Twenty Fifth ICML 984–991.
  • Zhang, T. and Oles, F. J. (2000). The value of unlabeled data for classification problems. In Proceedings of the Seventeenth ICML 1191–1198.
  • Zhu, X. (2005). Semi-supervised learning through graphs. Ph.D. thesis, Carnegie Mellon Univ., USA. CMU-LTI-05-192.
  • Zhu, X. (2008). Semi-supervised learning literature survey. Technical Report No. 1530, Computer Sciences, Univ. Wisconsin-Madison, USA.
  • Zhu, L. X. and Ng, K. W. (1995). Asymptotics of sliced inverse regression. Statist. Sinica 5 727–736.

Supplemental materials

  • Supplement to “Efficient and adaptive linear regression in semi-supervised settings”. The supplement includes: (i) supplementary results for the simulation studies and the real data analysis; (ii) a discussion of the generalization of the proposed semi-supervised (SS) estimators to missing-at-random (MAR) settings; (iii) proof of Lemma A.1; (iv) proof of Theorem 3.1; (v) proof of Theorem 4.1; and (vi) proofs of Lemmas A.2–A.3 and Theorem 4.2.