## Statistical Science

### The Use of Unlabeled Data in Predictive Modeling

#### Abstract

The incorporation of unlabeled data in regression and classification analysis is an increasing focus of the applied statistics and machine learning literatures, with a number of recent examples demonstrating the potential for unlabeled data to contribute to improved predictive accuracy. The statistical basis for this semisupervised analysis does not appear to have been well delineated; as a result, the underlying theory and rationale may be underappreciated, especially by nonstatisticians. There is also room for statisticians to become more fully engaged in the vigorous research in this important area of intersection of the statistical and computer sciences. Much of the theoretical work in the literature has focused, for example, on geometric and structural properties of the unlabeled data in the context of particular algorithms, rather than probabilistic and statistical questions. This paper overviews the fundamental statistical foundations for predictive modeling and the general questions associated with unlabeled data, highlighting the relevance of venerable concepts of sampling design and prior specification. This theory, illustrated with a series of central illustrative examples and two substantial real data analyses, shows precisely when, why and how unlabeled data matter.
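To make the setting concrete, here is a minimal illustrative sketch (not the paper's method, and with purely hypothetical toy data and parameters) of one classical way unlabeled data can enter a generative classifier: EM for a two-component Gaussian mixture in one dimension, where a few labeled points fix hard class assignments and many unlabeled points contribute soft responsibilities. The common variance is held fixed at 1 to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): two well-separated 1-D Gaussian classes.
n_lab, n_unl = 5, 200
x_lab = np.concatenate([rng.normal(-2, 1, n_lab), rng.normal(2, 1, n_lab)])
y_lab = np.concatenate([np.zeros(n_lab), np.ones(n_lab)])
x_unl = np.concatenate([rng.normal(-2, 1, n_unl), rng.normal(2, 1, n_unl)])

def fit_supervised(x, y):
    # MLE of the class means from labeled data only (variance fixed at 1).
    return np.array([x[y == 0].mean(), x[y == 1].mean()])

def fit_semisupervised(x_lab, y_lab, x_unl, iters=50):
    # EM for a two-component Gaussian mixture: labeled points keep hard
    # responsibilities, unlabeled points receive soft posterior weights.
    mu = fit_supervised(x_lab, y_lab)
    pi = np.array([0.5, 0.5])
    x_all = np.concatenate([x_lab, x_unl])
    for _ in range(iters):
        # E-step: posterior class probabilities for the unlabeled points
        # (the Gaussian normalizing constant cancels in the ratio).
        dens = np.exp(-0.5 * (x_unl[:, None] - mu[None, :]) ** 2) * pi
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update means and mixing weights using all the data.
        for k in (0, 1):
            w = np.concatenate([(y_lab == k).astype(float), resp[:, k]])
            mu[k] = (w * x_all).sum() / w.sum()
            pi[k] = w.sum() / len(w)
    return mu

mu_sup = fit_supervised(x_lab, y_lab)
mu_semi = fit_semisupervised(x_lab, y_lab, x_unl)
print("labeled-only means:", mu_sup)
print("semisupervised means:", mu_semi)
```

With only five labeled points per class, the labeled-only means are noisy; the 400 unlabeled points let EM sharpen the estimates because the mixture (marginal) density of the features carries information about the class means — precisely the mechanism whose statistical basis the paper examines.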

#### Article information

**Source:** Statist. Sci., Volume 22, Number 2 (2007), 189--205.

**Dates:** First available in Project Euclid: 27 September 2007

**Permanent link to this document:** https://projecteuclid.org/euclid.ss/1190905518

**Digital Object Identifier:** doi:10.1214/088342307000000032

**Mathematical Reviews number (MathSciNet):** MR2408958

**Zentralblatt MATH identifier:** 1246.62157

#### Citation

Liang, Feng; Mukherjee, Sayan; West, Mike. The Use of Unlabeled Data in Predictive Modeling. Statist. Sci. 22 (2007), no. 2, 189--205. doi:10.1214/088342307000000032. https://projecteuclid.org/euclid.ss/1190905518

#### References

• Ando, R. and Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. J. Machine Learning Research 6 1817--1853.
• Belkin, M. and Niyogi, P. (2005). Towards a theoretical foundation for Laplacian-based manifold methods. Learning Theory. Lecture Notes in Comput. Sci. 3559 486--500. Springer, Berlin.
• Belkin, M., Niyogi, P. and Sindhwani, V. (2004). Manifold regularization: A geometric framework for learning from examples. Technical Report 04-06, Dept. Computer Science, Univ. Chicago. Available at www.cs.uchicago.edu/research/publications/techreports/TR-2004-06.
• Bennett, K. and Demiriz, A. (1999). Semi-supervised support vector machines. In Advances in Neural Information Processing Systems (NIPS) 11 368--374. MIT Press, Cambridge, MA.
• Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proc. Eleventh Annual Conference on Computational Learning Theory 92--100. ACM, New York.
• Castelli, V. and Cover, T. (1995). On the exponential value of labeled samples. Pattern Recognition Letters 16 105--111.
• Coifman, R., Lafon, S., Lee, A., Maggioni, M., Nadler, B., Warner, F. and Zucker, S. (2005a). Geometric diffusions as a tool for harmonic analysis and structure definition of data. I. Diffusion maps. Proc. Natl. Acad. Sci. U.S.A. 102 7426--7431.
• Coifman, R., Lafon, S., Lee, A., Maggioni, M., Nadler, B., Warner, F. and Zucker, S. (2005b). Geometric diffusions as a tool for harmonic analysis and structure definition of data. II. Multiscale methods. Proc. Natl. Acad. Sci. U.S.A. 102 7432--7437.
• Cozman, F. and Cohen, I. (2002). Unlabeled data can degrade classification performance of generative classifiers. In Proc. Fifteenth International Florida Artificial Intelligence Research Society Conference 327--331. AAAI Press, Menlo Park, CA.
• Dobra, A., Hans, C., Jones, B., Nevins, J., Yao, G. and West, M. (2004). Sparse graphical models for exploring gene expression data. J. Multivariate Anal. 90 196--212.
• Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90 577--588.
• Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209--230.
• Ganesalingam, S. and McLachlan, G. J. (1978). The efficiency of a linear discriminant function based on unclassified initial samples. Biometrika 65 658--662.
• Ganesalingam, S. and McLachlan, G. J. (1979). Small sample results for a linear discriminant function estimated from a mixture of normal populations. J. Stat. Comput. Simul. 9 151--158.
• Geiger, D. and Heckerman, D. (2002). Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. Ann. Statist. 30 1412--1440.
• Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proc. Sixteenth International Conference on Machine Learning (I. Bratko and S. Dzeroski, eds.) 200--209. Morgan Kaufmann, San Francisco.
• Lavine, M. and West, M. (1992). A Bayesian method for classification and discrimination. Canad. J. Statist. 20 451--461.
• Liang, F., Mao, K., Liao, M., Mukherjee, S. and West, M. (2007). Nonparametric Bayesian kernel models. Technical report, Dept. Statistical Science, Duke Univ. Available at www.stat.duke.edu/research/papers/.
• Mukherjee, S., Tamayo, P., Rogers, S., Rifkin, R., Engle, A., Campbell, C., Golub, T. and Mesirov, J. (2003). Estimating dataset size requirements for classifying DNA microarray data. J. Comput. Biol. 10 119--142.
• Müller, P., Erkanli, A. and West, M. (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika 83 67--79.
• Nigam, K., McCallum, A., Thrun, S. and Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning 39 103--134.
• O'Neill, T. J. (1978). Normal discrimination with unclassified observations. J. Amer. Statist. Assoc. 73 821--826.
• Poggio, T. and Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science 247 978--982.
• Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E. and Golub, T. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. U.S.A. 98 15149--15154.
• Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press, Cambridge, MA.
• Seeger, M. (2000). Learning with labeled and unlabeled data. Technical report, Univ. Edinburgh. Available at www.kyb.tuebingen.mpg.de/bs/people/seeger/papers/review.pdf.
• Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge Univ. Press.
• Szummer, M. and Jaakkola, T. (2002). Partially labeled classification with Markov random walks. In Advances in Neural Information Processing Systems (NIPS) 14 945--952. MIT Press, Cambridge, MA.
• Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
• Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
• West, M. (1992). Modelling with mixtures (with discussion). In Bayesian Statistics 4 (J. Bernardo, J. Berger, A. Dawid and A. Smith, eds.) 503--524. Oxford Univ. Press.
• West, M. (2003). Bayesian factor regression models in the "large $p$, small $n$" paradigm. In Bayesian Statistics 7 (J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith and M. West, eds.) 733--742. Oxford Univ. Press.
• Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with $g$-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (P. Goel and A. Zellner, eds.) 233--243. North-Holland, Amsterdam.
• Zhu, X., Ghahramani, Z. and Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. Twentieth International Conference on Machine Learning (T. Fawcett and N. Mishra, eds.) 912--919. AAAI Press, Menlo Park, CA.