Bayesian Analysis

Bayesian variable selection for probit mixed models applied to gene selection

Meli Baragatti


In computational biology, gene expression datasets typically contain very few individual samples relative to the large number of measurements per sample. It is therefore appealing to merge such datasets in order to increase the number of observations and diversify the data, allowing a more reliable selection of genes relevant to the biological problem. In addition, the increased size of a merged dataset makes it easier to re-split into training and validation sets. Merging, however, requires that the dataset of origin be introduced as a random effect. In this context, extending the work of Lee et al. (2003), a method is proposed to select relevant variables among tens of thousands in a probit mixed regression model, considered as part of a larger hierarchical Bayesian model. Latent variables are used to identify subsets of selected variables, and the grouping (or blocking) technique of Liu (1994) is combined with a Metropolis-within-Gibbs algorithm (Robert and Casella 2004). The method is applied to a merged dataset made of three individual gene expression datasets, in which tens of thousands of measurements are available for each of several hundred human breast cancer samples. Even for this large dataset, comprising around 20,000 predictors, the method is shown to be efficient and feasible. As an illustration, it is used to select the most important genes that characterize the estrogen receptor status of patients with breast cancer.
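To make the model structure described in the abstract concrete, here is a minimal sketch, not the author's implementation. It simulates binary outcomes from a probit model with one random intercept per dataset and a 0/1 indicator vector marking the selected predictors, and then draws a truncated-normal latent variable of the kind used in data-augmentation samplers for probit models (in the spirit of Albert and Chib 1993, listed in the references). All function names, parameter names, and default values (`simulate_probit_mixed`, `draw_latent`, `beta_value`, `sigma_u`) are assumptions chosen for illustration; the abstract does not specify them.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf / inverse cdf


def simulate_probit_mixed(n_per_dataset, n_datasets, p, active,
                          beta_value=1.5, sigma_u=0.5, seed=0):
    """Simulate y from a probit model with a random intercept per dataset.

    gamma[j] = 1 marks predictor j as selected; only selected predictors
    get a non-zero coefficient (hypothetical value beta_value).
    """
    rng = random.Random(seed)
    gamma = [1 if j in active else 0 for j in range(p)]
    beta = [beta_value * g for g in gamma]
    # One random effect per merged dataset (assumed Gaussian here).
    u = [rng.gauss(0.0, sigma_u) for _ in range(n_datasets)]
    X, dataset, y = [], [], []
    for d in range(n_datasets):
        for _ in range(n_per_dataset):
            x = [rng.gauss(0.0, 1.0) for _ in range(p)]
            mean = sum(b * xj for b, xj in zip(beta, x)) + u[d]
            latent = mean + rng.gauss(0.0, 1.0)  # probit: unit-variance noise
            X.append(x)
            dataset.append(d)
            y.append(1 if latent > 0 else 0)
    return X, dataset, y, gamma, u


def draw_latent(mean, y_i, rng):
    """Draw L ~ N(mean, 1) truncated to L > 0 if y_i == 1, else to L <= 0.

    Uses the inverse-CDF method. The clipping is a numerical guard only;
    for extreme means a dedicated truncated-normal sampler is preferable.
    """
    p0 = STD_NORMAL.cdf(-mean)  # P(L <= 0)
    lo, hi = (p0, 1.0) if y_i == 1 else (0.0, p0)
    q = lo + (hi - lo) * rng.random()
    q = min(max(q, 1e-12), 1.0 - 1e-12)  # inv_cdf requires 0 < q < 1
    return mean + STD_NORMAL.inv_cdf(q)


if __name__ == "__main__":
    X, dataset, y, gamma, u = simulate_probit_mixed(5, 3, 4, active={0, 2})
    print("selected indicators:", gamma)
    print("dataset labels:", dataset)
    print("outcomes:", y)
```

A full sampler would alternate draws like `draw_latent` with updates of the coefficients, the random effects, and the indicator vector; the grouping technique and the Metropolis-within-Gibbs step mentioned in the abstract concern how those updates are arranged and are not reproduced in this sketch.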

Article information

Bayesian Anal., Volume 6, Number 2 (2011), 209-229.

First available in Project Euclid: 13 June 2012

Primary: 62J12: Generalized linear models
Secondary: 62-04: Explicit machine computation and programs (not the theory of computation or programming); 62F15: Bayesian inference; 62J07: Ridge regression; shrinkage estimators; 62P10: Applications to biology and medical sciences; 92D10: Genetics {For genetic algebras, see 17D92}

Keywords: Bayesian variable selection; random effects; probit mixed regression model; grouping technique (or blocking technique); Metropolis-within-Gibbs algorithm


Baragatti, Meli. Bayesian variable selection for probit mixed models applied to gene selection. Bayesian Anal. 6 (2011), no. 2, 209–229. doi:10.1214/11-BA607.

References


  • Albert, J. and Chib, S. (1993). "Bayesian analysis of binary and polychotomous response data." Journal of the American Statistical Association, 88(422): 669–679.
  • Baragatti, M. and Pommeret, D. (2011). "Ridge parameter for g-prior distribution in probit mixed models with collinearity." arXiv:1102.0470.
  • Bottolo, L. and Richardson, S. (2010). "Evolutionary Stochastic Search for Bayesian Model Exploration." Bayesian Analysis, 5(3): 583–618.
  • Chen, M. and Dey, D. (2003). "Variable selection for multivariate logistic regression models." Journal of Statistical Planning and Inference, 111: 37–55.
  • Cheng, W., Tsai, M., Chang, C., Huang, C., Chen, C., Shu, W., Lee, Y., Wang, T., Hong, J., Li, C., and Hsu, I. (2010). "Microarray meta-analysis database (M2DB): a uniformly pre-processed, quality controlled, and manually curated human clinical microarray database." BMC Bioinformatics, 11.
  • Chipman, H., George, E., and McCulloch, R. (2001). "The practical implementation of Bayesian model selection." In Lahiri, P. (ed.), Model Selection, IMS Lecture Notes. Institute of Mathematical Statistics.
  • Cimino, D., Fuso, L., Sfiligoi, C., Biglia, N., Ponzone, R., Maggiorotto, F., Russo, G., Cicatiello, L., Weisz, A., Taverna, D., Sismondi, P., and De-Bortoli, M. (2008). "Identification of new genes associated with breast cancer progression by gene expression analysis of predefined sets of neoplastic tissues." International Journal of Cancer, 123(6): 1327–1338.
  • Clyde, M. and George, E. (2000). "Flexible empirical Bayes estimation for wavelets." Journal of the Royal Statistical Society B, 62(4): 681–698.
  • Edgar, R., Domrachev, M., and Lash, A. (2002). "Gene Expression Omnibus: NCBI gene expression and hybridization array data repository." Nucleic Acids Research, 30(1): 207–210.
  • Frühwirth-Schnatter, S. and Wagner, H. (2010). "Bayesian Variable Selection for Random Intercept Modeling of Gaussian and non-Gaussian Data." In Bernardo, J. et al. (eds.), Bayesian Statistics 9, Proc. of the 9th Valencia Internat. Meeting, 165–200. Oxford University Press.
  • Gelman, A. (2006). "Prior distributions for variance parameters in hierarchical models (Comment on Article by Browne and Draper)." Bayesian Analysis, 1(3): 515–534.
  • George, E. and Foster, D. (2000). "Calibration and empirical Bayes variable selection." Biometrika, 87(4): 731–747.
  • George, E. and McCulloch, R. (1993). "Variable selection via Gibbs sampling." Journal of the American Statistical Association, 88(423): 881–889.
  • George, E. and McCulloch, R. (1997). "Approaches for Bayesian variable selection." Statistica Sinica, 7: 339–373.
  • Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002). "Gene selection for cancer classification using support vector machines." Machine Learning, 46: 389–422.
  • Irizarry, R., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf, U., and Speed, T. (2003). "Exploration, normalization, and summaries of high density oligonucleotide array probe level data." Biostatistics, 4(2): 249–264.
  • Lee, K., Sha, N., Dougherty, E., Vannucci, M., and Mallick, B. (2003). "Gene selection: a Bayesian variable selection approach." Bioinformatics, 19(1): 90–97.
  • Liu, J. (1994). "The collapsed Gibbs sampler in Bayesian computations with application to a gene regulation problem." Journal of the American Statistical Association, 89(427): 958–966.
  • Nagaraja, G., Othman, M., Fox, B., Alsaber, R., Pellegrino, C., Zeng, Y., Khanna, R., Tamburini, P., Swaroop, A., and Kandpal, R. (2006). "Gene expression signatures and biomarkers of noninvasive and invasive breast cancer cells: comprehensive profiles by representational difference analysis, microarrays and proteomics." Oncogene, 25(16): 2328–2338.
  • O'Hara, R. and Sillanpää, M. (2009). "A review of Bayesian variable selection methods: What, How and Which." Bayesian Analysis, 4(1): 85–118.
  • Rae, J., Johnson, M., Scheys, J., Cordero, K., Larios, J., and Lippman, M. (2005). "GREB 1 is a critical regulator of hormone dependent breast cancer growth." Breast Cancer Research and Treatment, 92(2): 141–149.
  • Raftery, A., Madigan, D., and Hoeting, J. (1997). "Bayesian model averaging for linear regression models." Journal of the American Statistical Association, 92: 179–191.
  • Robert, C. and Casella, G. (2004). Monte Carlo statistical methods, second edition. Springer.
  • Roberts, G. and Rosenthal, J. (2006). "Harris recurrence of Metropolis-within-Gibbs and trans-dimensional Markov chains." The Annals of Applied Probability, 16(4): 2123–2139.
  • Sha, N., Vannucci, M., Tadesse, M., Brown, P., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004). "Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage." Biometrics, 60: 812–819.
  • Singh, D. and Chaturvedi, R. (2009). "Recent patents on genes and gene sequences useful for developing breast cancer detection systems." Recent Patents on DNA and Gene Sequences, 3(2): 139–147.
  • Smith, M. and Kohn, R. (1997). "Non parametric regression using Bayesian variable selection." Journal of Econometrics, 75: 317–344.
  • Somol, P. and Novovicova, J. (2008). "Evaluating the stability of feature selectors that optimize feature subset cardinality." In da Vitoria Lobo, N. et al. (eds.), Lecture Notes in Computer Science, vol 5342, 956–966. Springer-Verlag Berlin Heidelberg.
  • Tadesse, M., Sha, N., and Vannucci, M. (2005). "Bayesian variable selection in clustering high-dimensional data." Journal of the American Statistical Association, 100: 602–617.
  • Townson, S. and O'Connell, P. (2006). "Identification of estrogen receptor alpha variants in breast tumors: implications for predicting response to hormonal therapies." Journal of Surgical Oncology, 94(4): 271–273.
  • Tüchler, R. (2008). "Bayesian variable selection for logistic models using auxiliary mixture sampling." Journal of Computational and Graphical Statistics, 17(1): 76–94.
  • van Dyk, D. and Park, T. (2008). "Partially Collapsed Gibbs Samplers: Theory and Methods." Journal of the American Statistical Association, 103: 790–796.
  • Yang, A. and Song, X. (2010). "Bayesian variable selection for disease classification using gene expression data." Bioinformatics, 26(2): 215–222.
  • Zellner, A. (1986). "On assessing prior distributions and Bayesian regression analysis with g-prior distributions." In Bayesian Inference and Decision Techniques: Essays in Honour of Bruno de Finetti, 233–243. Amsterdam.
  • Zellner, A. and Siow, A. (1980). "Posterior Odds Ratios for Selected Regression Hypotheses." In Bayesian Statistics: Proceedings of the First International Meeting Held in Valencia, 585–603. University of Valencia Press.
  • Zhou, X., Liu, K., and Wong, S. (2004). "Cancer classification and prediction using logistic regression with Bayesian gene selection." Journal of Biomedical Informatics, 37: 249–259.
  • Zhou, X., Wang, X., and Dougherty, E. (2004). "A Bayesian approach to non linear probit gene selection and classification." Journal of the Franklin Institute, 341: 137–156.
  • Zhou, X., Wang, X., and Dougherty, E. (2004). "Gene prediction using multinomial probit regression with Bayesian gene selection." EURASIP Journal on Applied Signal Processing, (1): 115–124.