Open Access
December 2017
Model-based clustering with data correction for removing artifacts in gene expression data
William Chad Young, Adrian E. Raftery, Ka Yee Yeung
Ann. Appl. Stat. 11(4): 1998-2026 (December 2017). DOI: 10.1214/17-AOAS1051


The NIH Library of Integrated Network-based Cellular Signatures (LINCS) contains gene expression data from over a million experiments, using Luminex Bead technology. Only 500 colors are used to measure the expression levels of the 1000 landmark genes, and the data for the resulting pairs of genes are deconvolved. The raw data are sometimes inadequate for reliable deconvolution, leading to artifacts in the final processed data. These include the expression levels of paired genes being flipped or given the same value, and clusters of values that are not at the true expression level. We propose a new method called model-based clustering with data correction (MCDC) that is able to identify and correct these three kinds of artifacts simultaneously. We show that MCDC improves the resulting gene expression data in terms of agreement with external baselines, and that it improves the results of subsequent analysis.
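To make the "flipped" artifact concrete, the following is a minimal Python sketch on simulated data. It is not the MCDC algorithm from the paper. It is a simplified hard-assignment illustration that assumes a single bivariate Gaussian for each gene pair, that the two genes have clearly different mean expression levels, and that most observations are not flipped. The function names (`log_gauss`, `correct_flips`) and all simulation parameters are hypothetical.

```python
import numpy as np

def log_gauss(points, mean, cov):
    """Row-wise log density of a bivariate Gaussian."""
    diff = points - mean
    inv = np.linalg.inv(cov)
    maha = np.einsum("ij,jk,ik->i", diff, inv, diff)  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + 2 * np.log(2 * np.pi))

def correct_flips(pairs, n_iter=20):
    """Illustrative only: swap back observations that fit better when flipped."""
    pairs = np.asarray(pairs, dtype=float)
    flipped = np.zeros(len(pairs), dtype=bool)
    for _ in range(n_iter):
        # Apply the current flip labels, then refit the Gaussian.
        corrected = np.where(flipped[:, None], pairs[:, ::-1], pairs)
        mean = corrected.mean(axis=0)
        cov = np.cov(corrected, rowvar=False)
        # Relabel: a point is "flipped" if its swapped version is more likely.
        swapped_better = log_gauss(pairs[:, ::-1], mean, cov) > log_gauss(pairs, mean, cov)
        if np.array_equal(swapped_better, flipped):
            break
        flipped = swapped_better
    corrected = np.where(flipped[:, None], pairs[:, ::-1], pairs)
    return corrected, flipped

# Simulated gene pair (assumed values): means 8 and 5, with about 10% of rows flipped.
rng = np.random.default_rng(0)
truth = rng.normal(loc=[8.0, 5.0], scale=0.5, size=(500, 2))
is_flipped = rng.random(500) < 0.1
observed = np.where(is_flipped[:, None], truth[:, ::-1], truth)

corrected, detected = correct_flips(observed)
print("fraction of flip labels recovered:", (detected == is_flipped).mean())
```

This sketch covers only the flipping artifact. Handling pairs assigned the same value, or clusters away from the true expression level, would need additional components that are not shown here.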


William Chad Young, Adrian E. Raftery, Ka Yee Yeung. "Model-based clustering with data correction for removing artifacts in gene expression data." Ann. Appl. Stat. 11 (4) 1998-2026, December 2017.


Received: 1 February 2016; Revised: 1 April 2017; Published: December 2017
First available in Project Euclid: 28 December 2017

zbMATH: 1383.62299
MathSciNet: MR3743286
Digital Object Identifier: 10.1214/17-AOAS1051

Keywords: gene regulatory network, LINCS, MCDC, model-based clustering

Rights: Copyright © 2017 Institute of Mathematical Statistics

Vol.11 • No. 4 • December 2017