The Annals of Applied Statistics

Model-based clustering with data correction for removing artifacts in gene expression data

William Chad Young, Adrian E. Raftery, and Ka Yee Yeung

Full-text: Open access


The NIH Library of Integrated Network-based Cellular Signatures (LINCS) contains gene expression data from over a million experiments, using Luminex Bead technology. Only 500 colors are used to measure the expression levels of the 1000 landmark genes, so each color is shared by a pair of genes and the data for each pair must be deconvolved. The raw data are sometimes inadequate for reliable deconvolution, leading to artifacts in the final processed data. These artifacts are of three kinds: the expression levels of paired genes being flipped, paired genes being given the same value, and clusters of values that are not at the true expression level. We propose a new method called model-based clustering with data correction (MCDC) that is able to identify and correct these three kinds of artifacts simultaneously. We show that MCDC improves the resulting gene expression data in terms of agreement with external baselines, as well as improving results from subsequent analysis.
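As a rough illustration of the flipping artifact only, the following Python sketch fits a toy two-component mixture in which the second component is the first with the two coordinates of a gene pair swapped, and then swaps back the points assigned to it. This is not the MCDC algorithm as published: the function name `correct_flips`, the spherical shared variance, the median-based starting values, and the example data are all assumptions made for this sketch, and the other two kinds of artifact are not handled.

```python
import math
from statistics import median


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def correct_flips(pairs, n_iter=50):
    """Toy EM for flipped gene pairs (hypothetical helper, not MCDC itself).

    pairs: list of (a, b) expression values for one gene pair.
    Returns (corrected_pairs, flagged), where flagged[i] is True if
    observation i was judged to be flipped and has been swapped back.
    """
    n = len(pairs)
    # Assumed starting values: coordinate-wise medians, which are close to
    # the true levels when fewer than half of the observations are flipped.
    mu_a = median(a for a, _ in pairs)
    mu_b = median(b for _, b in pairs)
    var, pi_flip = 1.0, 0.1
    resp = [0.0] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each observation is flipped.
        for i, (a, b) in enumerate(pairs):
            d_ok = (a - mu_a) ** 2 + (b - mu_b) ** 2
            d_flip = (b - mu_a) ** 2 + (a - mu_b) ** 2
            z = math.log(pi_flip / (1.0 - pi_flip)) + (d_ok - d_flip) / (2.0 * var)
            resp[i] = _sigmoid(z)
        # M-step: update the mean, variance and flip proportion using the
        # data as if each point had been un-flipped with probability resp[i].
        mu_a = sum((1 - r) * a + r * b for r, (a, b) in zip(resp, pairs)) / n
        mu_b = sum((1 - r) * b + r * a for r, (a, b) in zip(resp, pairs)) / n
        ss = sum(
            (1 - r) * ((a - mu_a) ** 2 + (b - mu_b) ** 2)
            + r * ((b - mu_a) ** 2 + (a - mu_b) ** 2)
            for r, (a, b) in zip(resp, pairs)
        )
        var = max(ss / (2.0 * n), 1e-6)
        pi_flip = min(max(sum(resp) / n, 1e-6), 1.0 - 1e-6)
    flagged = [r > 0.5 for r in resp]
    corrected = [(b, a) if f else (a, b) for f, (a, b) in zip(flagged, pairs)]
    return corrected, flagged


# Made-up example: six observations near (8, 5) and two flipped ones near (5, 8).
pairs = [(8.1, 5.0), (7.9, 4.8), (8.2, 5.1), (8.0, 5.2),
         (7.8, 4.9), (8.3, 5.0), (5.1, 7.9), (4.9, 8.1)]
corrected, flagged = correct_flips(pairs)
print(flagged)
print(corrected)
```

The sketch relies on the flipped observations being a minority, so that the medians start near the unflipped cluster. If the two genes in a pair have similar expression levels, the two components overlap and flips cannot be distinguished this way.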

Article information

Ann. Appl. Stat. Volume 11, Number 4 (2017), 1998-2026.

Received: February 2016
Revised: April 2017
First available in Project Euclid: 28 December 2017

Digital Object Identifier: doi:10.1214/17-AOAS1051

Keywords: Model-based clustering; MCDC; gene regulatory network; LINCS


Young, William Chad; Raftery, Adrian E.; Yeung, Ka Yee. Model-based clustering with data correction for removing artifacts in gene expression data. Ann. Appl. Stat. 11 (2017), no. 4, 1998-2026. doi:10.1214/17-AOAS1051.

