The Annals of Applied Statistics

Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications

Thomas Brendan Murphy, Nema Dean, and Adrian E. Raftery
Source: Ann. Appl. Stat. Volume 4, Number 1 (2010), 396-421.

Abstract

Food authenticity studies are concerned with determining whether food samples have been correctly labeled. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labeled and unlabeled data. The method is shown to give excellent classification performance on several high-dimensional multiclass food authenticity data sets with more variables than observations. The variables selected by the proposed method provide information about which variables are meaningful for classification purposes. A headlong search strategy for variable selection is shown to be computationally efficient and to achieve excellent classification performance. In applications to several food authenticity data sets, our proposed method outperformed default implementations of Random Forests, AdaBoost, transductive SVMs and Bayesian Multinomial Regression by substantial margins.

Full-text: Open access

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1273584460
Digital Object Identifier: doi:10.1214/09-AOAS279
Zentralblatt MATH identifier: 1189.62105
Mathematical Reviews number (MathSciNet): MR2758177

2013 © Institute of Mathematical Statistics