The Annals of Applied Statistics

Variable selection for latent class analysis with application to low back pain diagnosis

Michael Fop, Keith M. Smart, and Thomas Brendan Murphy


Identifying the most relevant clinical criteria related to low back pain disorders may aid the evaluation of the nature of the pain suffered in a way that usefully informs patient assessment and treatment. Data concerning low back pain can be categorical in nature, in the form of a checklist in which each item denotes the presence or absence of a clinical condition. Latent class analysis is a model-based clustering method for multivariate categorical responses, which can be applied to such data for a preliminary diagnosis of the type of pain. In this work, we propose a variable selection method for latent class analysis that selects the variables most useful for detecting the group structure in the data. The method is based on the comparison of two different models and allows two kinds of variables to be discarded: those carrying no group information and those carrying the same information as the variables already selected. We consider a swap-stepwise algorithm in which, at each step, the models are compared through an approximation to their Bayes factor. The method is applied to the selection of the clinical criteria most useful for clustering patients into different classes. It is shown to perform a parsimonious variable selection and to give a clustering performance comparable to the expert-based classification of patients into three classes of pain.
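
The abstract does not state which approximation to the Bayes factor is used. As an illustration only, the sketch below uses the common BIC approximation (Schwarz, 1978; Kass and Raftery, 1995), under which twice the log Bayes factor of model A against model B is approximated by BIC_A − BIC_B. The function names and the log-likelihood values are hypothetical and are not taken from the paper or its supplementary R code.

```python
import math


def lca_num_params(n_classes, levels_per_item):
    """Free parameters of a latent class model with local independence:
    (G - 1) mixing proportions plus G * sum_j (C_j - 1) item probabilities."""
    return (n_classes - 1) + n_classes * sum(c - 1 for c in levels_per_item)


def bic(loglik, n_params, n_obs):
    """BIC on the 'larger is better' scale: 2 log L - k log n."""
    return 2.0 * loglik - n_params * math.log(n_obs)


def approx_two_log_bayes_factor(bic_a, bic_b):
    """Approximate 2 log B_ab by the BIC difference; positive values favour model A."""
    return bic_a - bic_b


# Hypothetical example: 10 binary checklist items, 400 patients.
items = [2] * 10
n_obs = 400

# Assumed (made-up) maximised log-likelihoods for two candidate models.
loglik_a, loglik_b = -2050.0, -2100.0

bic_a = bic(loglik_a, lca_num_params(3, items), n_obs)  # 3-class model
bic_b = bic(loglik_b, lca_num_params(2, items), n_obs)  # 2-class model

evidence = approx_two_log_bayes_factor(bic_a, bic_b)
print("Approximate 2 log Bayes factor (A vs B):", round(evidence, 2))
```

In a stepwise search of the kind the abstract describes, a comparison like this would be repeated at each step to decide whether a variable is added, removed, or swapped; the exact pair of models compared at each step is defined in the paper itself.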

Article information

Ann. Appl. Stat., Volume 11, Number 4 (2017), 2080-2110.

Received: February 2017
Revised: May 2017
First available in Project Euclid: 28 December 2017

Keywords: clinical criteria selection; clustering; latent class analysis; low back pain; mixture models; model-based clustering; variable selection


Fop, Michael; Smart, Keith M.; Murphy, Thomas Brendan. Variable selection for latent class analysis with application to low back pain diagnosis. Ann. Appl. Stat. 11 (2017), no. 4, 2080–2110. doi:10.1214/17-AOAS1061.


  • Agresti, A. (2002). Categorical Data Analysis, 2nd ed. Wiley, New York.
  • Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley, Hoboken, NJ.
  • Albert, A. and Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71 1–10.
  • Arima, S. (2015). Item selection via Bayesian IRT models. Stat. Med. 34 487–503.
  • Badsberg, J. H. (1992). Model search in contingency tables by CoCo. In Computational Statistics, Vol. 1 (Y. Dodge and J. Whittaker, eds.) 251–256. Physica-Verlag, Heidelberg.
  • Bartholomew, D., Knott, M. and Moustaki, I. (2011). Latent Variable Models and Factor Analysis: A Unified Approach, 3rd ed. Wiley, Chichester.
  • Bartolucci, F., Montanari, G. E. and Pandolfi, S. (2016). Item selection by latent class-based methods: An application to nursing home evaluation. Adv. Data Anal. Classif. 10 245–262.
  • Bontemps, D. and Toussile, W. (2013). Clustering and variable selection for categorical multivariate data. Electron. J. Stat. 7 2344–2371.
  • Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. and Raftery, A. E. (2014). Comparing model selection and regularization approaches to variable selection in model-based clustering. J. SFdS 155 57–71.
  • Clogg, C. C. (1988). Latent class models for measuring. In Latent Trait and Latent Class Models (R. Langeheine and J. Rost, eds.) 173–205. Plenum, New York.
  • Clogg, C. C. (1995). Latent class models. In Handbook of Statistical Modeling for the Social and Behavioral Sciences (G. Arminger, C. C. Clogg and M. E. Sobel, eds.) 311–360. Plenum, New York.
  • Dean, N. and Raftery, A. E. (2010). Latent class analysis variable selection. Ann. Inst. Statist. Math. 62 11–35.
  • Dy, J. G. and Brodley, C. E. (2003/04). Feature selection for unsupervised learning. J. Mach. Learn. Res. 5 845–889.
  • Fop, M., Smart, K. M. and Murphy, T. B. (2017). Supplement to “Variable selection for latent class analysis with application to low back pain diagnosis.” DOI:10.1214/17-AOAS1061SUPP.
  • Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97 611–631.
  • Froud, R., Patterson, S., Eldridge, S., Seale, C., Pincus, T., Rajendran, D., Fossum, C. and Underwood, M. (2014). A systematic review and meta-synthesis of the impact of low back pain on people’s lives. BMC Musculoskeletal Disorders 15 1–14.
  • Gelman, A., Jakulin, A., Pittau, M. G. and Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. Ann. Appl. Stat. 2 1360–1383.
  • Goldberg, D. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading.
  • Gollini, I. and Murphy, T. B. (2014). Mixture of latent trait analyzers for model-based clustering of categorical data. Stat. Comput. 24 569–588.
  • Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61 215–231.
  • Graven-Nielsen, T. and Arendt-Nielsen, L. (2010). Assessment of mechanisms in localized and widespread musculoskeletal pain. Nat. Rev. Rheumatol. 6 599–606.
  • Haberman, S. J. (1979). Analysis of Qualitative Data, Vol. 2: New Developments. Academic Press, New York.
  • Heinze, G. and Schemper, M. (2002). A solution to the problem of separation in logistic regression. Stat. Med. 21 2409–2419.
  • Hoy, D., Bain, C., Williams, G., March, L., Brooks, P., Blyth, F., Woolf, A., Vos, T. and Buchbinder, R. (2012). A systematic review of the global prevalence of low back pain. Arthritis & Rheumatism 64 2028–2037.
  • Hubert, L. and Arabie, P. (1985). Comparing partitions. J. Classification 2 193–218.
  • Kass, R. E. and Raftery, A. E. (1995). Bayes factors. J. Amer. Statist. Assoc. 90 773–795.
  • Katz, J. N., Stock, S. R., Evanoff, B. A., Rempel, D., Moore, J. S., Franzblau, A. and Gray, R. H. (2000). Classification criteria and severity assessment in work-associated upper extremity disorders: Methods matter. Am. J. Ind. Med. 38 369–372.
  • Kim, S., Tadesse, M. G. and Vannucci, M. (2006). Variable selection in clustering via Dirichlet process mixture models. Biometrika 93 877–893.
  • Law, M. H. C., Figueiredo, M. A. T. and Jain, A. K. (2004). Simultaneous feature selection and clustering using mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 26 1154–1166.
  • Lazarsfeld, P. F. and Henry, N. W. (1968). Latent Structure Analysis. Houghton Mifflin, Boston, MA.
  • Lesaffre, E. and Albert, A. (1989). Partial separation in logistic discrimination. J. R. Stat. Soc. Ser. B. Stat. Methodol. 51 109–116.
  • Linzer, D. A. and Lewis, J. B. (2011). poLCA: An R package for polytomous variable latent class analysis. J. Stat. Softw. 42 1–29.
  • Liu, H. H. and Ong, C. S. (2008). Variable selection in clustering for marketing segmentation using genetic algorithms. Expert Systems with Applications 34 502–510.
  • Malsiner-Walli, G., Frühwirth-Schnatter, S. and Grün, B. (2016). Model-based clustering based on sparse finite Gaussian mixtures. Stat. Comput. 26 303–324.
  • Marbac, M., Biernacki, C. and Vandewalle, V. (2015). Model-based clustering for conditionally correlated categorical data. J. Classification 32 145–175.
  • Marbac, M. and Sedki, M. (2017). Variable selection for model-based clustering using the integrated complete-data likelihood. Stat. Comput. 27 1049–1063.
  • Maugis, C., Celeux, G. and Martin-Magniette, M.-L. (2009a). Variable selection for clustering with Gaussian mixture models. Biometrics 65 701–709.
  • Maugis, C., Celeux, G. and Martin-Magniette, M.-L. (2009b). Variable selection in model-based clustering: A general variable role modeling. Comput. Statist. Data Anal. 53 3872–3882.
  • McLachlan, G. J. and Krishnan, T. (2008). The EM Algorithm and Extensions, 2nd ed. Wiley, Hoboken, NJ.
  • McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
  • McLachlan, G. J. and Rathnayake, S. (2014). On the number of components in a Gaussian mixture model. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4 341–355.
  • McNicholas, P. D. (2016). Model-based clustering. J. Classification 33 331–373.
  • Merskey, H. and Bogduk, N., eds. (2002). Classification of Chronic Pain. IASP Press, Washington, DC.
  • Meynet, C. and Maugis-Rabusseau, C. (2012). A sparse variable selection procedure in model-based clustering. INRIA Report.
  • Miller, A. (2002). Subset Selection in Regression, 2nd ed. Monographs on Statistics and Applied Probability 95. Chapman & Hall/CRC, Boca Raton, FL.
  • Milligan, G. W. and Cooper, M. C. (1986). A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research 21 441–458.
  • Murphy, T. B., Dean, N. and Raftery, A. E. (2010). Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications. Ann. Appl. Stat. 4 396–421.
  • Nijs, J., Apeldoorn, A., Hallegraeff, H., Clark, J., Smeets, R., Malfliet, A., Girbes, E. L., Kooning, M. D. and Ickmans, K. (2015). Low back pain: Guidelines for the clinical classification of predominant neuropathic, nociceptive, or central sensitization pain. Pain Physician 18 E333–E346.
  • Pan, W. and Shen, X. (2007). Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 8 1145–1164.
  • R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Raftery, A. E. and Dean, N. (2006). Variable selection for model-based clustering. J. Amer. Statist. Assoc. 101 168–178.
  • Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Univ. Press, Cambridge.
  • Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
  • Scrucca, L. (2016). Genetic algorithms for subset selection in model-based clustering. In Unsupervised Learning Algorithms (M. E. Celebi and K. Aydin, eds.) 55–70. Springer, Berlin.
  • Scrucca, L. and Raftery, A. E. (2017). Clustvarsel: A package implementing variable selection for model-based clustering in R. J. Stat. Softw. To appear.
  • Silvestre, C., Cardoso, M. G. M. S. and Figueiredo, M. (2015). Feature selection for clustering categorical data with an embedded modelling approach. Expert Systems 32 444–453.
  • Smart, K. M., O’Connell, N. E. and Doody, C. (2008). Towards a mechanisms-based classification of pain in musculoskeletal physiotherapy? Physical Therapy Reviews 13 1–10.
  • Smart, K. M., Blake, C., Staines, A. and Doody, C. (2010). Clinical indicators of “nociceptive”, “peripheral neuropathic” and “central” mechanisms of musculoskeletal pain. A Delphi survey of expert clinicians. Manual Therapy 15 80–87.
  • Smart, K. M., Blake, C., Staines, A. and Doody, C. (2011). The discriminative validity of “nociceptive”, “peripheral neuropathic”, and “central sensitization” as mechanisms-based classifications of musculoskeletal pain. The Clinical Journal of Pain 27 655–663.
  • Stynes, S., Konstantinou, K. and Dunn, K. M. (2016). Classification of patients with low back-related leg pain: A systematic review. BMC Musculoskeletal Disorders 17 226.
  • Tadesse, M. G., Sha, N. and Vannucci, M. (2005). Bayesian variable selection in clustering high-dimensional data. J. Amer. Statist. Assoc. 100 602–617.
  • Walker, B. F. (2000). The prevalence of low back pain: A systematic review of the literature from 1966 to 1998. Journal of Spinal Disorders & Techniques 13 205–217.
  • Wang, S. and Zhu, J. (2008). Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 64 440–448, 666.
  • White, A., Wyse, J. and Murphy, T. B. (2016). Bayesian variable selection for latent class analysis using a collapsed Gibbs sampler. Stat. Comput. 26 511–527.
  • Witten, D. M. and Tibshirani, R. (2010). A framework for feature selection in clustering. J. Amer. Statist. Assoc. 105 713–726.
  • Woolf, C. J. (2004). Pain: Moving from symptom control toward mechanism-specific pharmacologic management. Ann. Intern. Med. 140 441–451.
  • Woolf, C. J., Bennett, G. J., Doherty, M., Dubner, R., Kidd, B., Koltzenburg, M., Lipton, R., Loeser, J. D., Payne, R. and Torebjork, E. (1998). Towards a mechanism-based classification of pain? Pain 77 227–229.
  • Xie, B., Pan, W. and Shen, X. (2008). Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electron. J. Stat. 2 168–212.
  • Zhang, Q. and Ip, E. H. (2014). Variable assessment in latent class models. Comput. Statist. Data Anal. 77 146–156.
  • Zorn, C. (2005). A solution to separation in binary response models. Polit. Anal. 13 157–170.

Supplemental materials

  • Supplementary information, data and R code [Fop, Smart and Murphy (2017)]. The .zip folder contains a document with: further considerations regarding the “don’t know” entries, a description of the backward-stepwise selection algorithm for the multinomial logistic regression, a detailed description of the simulated data experiments, a complete list of clinical criteria and a notation page for reference. The folder also contains the data used in this paper and R code implementing the variable selection method.