The Annals of Applied Statistics
- Ann. Appl. Stat.
- Volume 11, Number 1 (2017), 114-138.
A mixed-effects model for incomplete data from labeling-based quantitative proteomics experiments
In mass spectrometry (MS)-based quantitative proteomics research, the emerging iTRAQ (isobaric tag for relative and absolute quantitation) and TMT (tandem mass tags) techniques have been widely adopted for high-throughput protein profiling. In a typical iTRAQ/TMT proteomics study, samples are grouped into batches, and each batch is processed by one multiplex experiment, in which the abundances of thousands of proteins/peptides in a batch of samples can be measured simultaneously. The multiplex labeling technique greatly enhances the throughput of protein quantification. However, the technical variation across different iTRAQ/TMT multiplex experiments is often large due to the dynamic nature of MS instruments. This leads to strong batch effects in the iTRAQ/TMT data. Moreover, the iTRAQ/TMT data often contain substantial batch-level nonignorable missing entries. Specifically, the abundance measures of a given protein/peptide are often either observed or missing altogether in all the samples from the same batch, with the missing probability depending on the combined batch-level abundances. We term this unique missing-data mechanism the Batch-level Abundance-Dependent Missing-data Mechanism (BADMM). We introduce a new method, mixEMM, for analyzing iTRAQ/TMT data with batch effects and batch-level nonignorable missingness. The mixEMM method employs a linear mixed-effects model and explicitly models the batch effects and the BADMM. In simulation studies, we showed that, compared with existing approaches that utilize relative abundances and ignore the missing batches under the missing-completely-at-random assumption, the mixEMM method achieves more accurate parameter estimation and inference. We applied the method to an iTRAQ proteomics data set from a breast cancer study and identified phosphopeptides differentially expressed between different breast cancer subtypes.
The method can be applied to general clustered data with cluster-level nonignorable missing-data mechanisms.
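To make the batch-level missingness described in the abstract concrete, the following is a minimal simulation sketch, not the mixEMM method or the authors' code. It assumes, purely for illustration, a single peptide, a normally distributed random batch effect, and a logistic link under which batches with lower mean abundance are more likely to be missing as a whole. The function name `simulate_badmm` and the parameters `gamma0`, `gamma1`, `sigma_batch`, and `sigma_noise` are hypothetical and do not come from the paper.

```python
import math
import random


def simulate_badmm(n_batches=2000, batch_size=4, sigma_batch=1.0,
                   sigma_noise=0.5, gamma0=0.0, gamma1=2.0, seed=0):
    """Simulate one peptide's log-abundances with batch-level missingness.

    Illustrative assumptions (not taken from the paper): the true mean is 0,
    batch effects and noise are Gaussian, and the probability that a whole
    batch is observed is a logistic function of the batch mean abundance.
    """
    rng = random.Random(seed)
    batches = []
    for _ in range(n_batches):
        batch_effect = rng.gauss(0.0, sigma_batch)  # shared by all samples in the batch
        values = [batch_effect + rng.gauss(0.0, sigma_noise)
                  for _ in range(batch_size)]
        batch_mean = sum(values) / batch_size
        # Assumed logistic link: higher batch-level abundance -> more likely observed.
        p_observed = 1.0 / (1.0 + math.exp(-(gamma0 + gamma1 * batch_mean)))
        # One draw per batch: the batch is observed or missing altogether.
        observed = rng.random() < p_observed
        batches.append((values, observed))
    return batches


def mean(xs):
    return sum(xs) / len(xs)


batches = simulate_badmm()
full_mean = mean([v for values, _ in batches for v in values])
observed_mean = mean([v for values, obs in batches if obs for v in values])
n_missing = sum(1 for _, obs in batches if not obs)

print(f"missing batches: {n_missing} of {len(batches)}")
print(f"mean over all batches:      {full_mean:.3f}")
print(f"mean over observed batches: {observed_mean:.3f}")
```

Under these assumptions, averaging only the observed batches overstates the abundance, because low-abundance batches are preferentially dropped. This is the kind of bias that can arise when missing batches are ignored as if they were missing completely at random.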
Received: June 2015
Revised: September 2016
First available in Project Euclid: 8 April 2017
Chen, Lin S.; Wang, Jiebiao; Wang, Xianlong; Wang, Pei. A mixed-effects model for incomplete data from labeling-based quantitative proteomics experiments. Ann. Appl. Stat. 11 (2017), no. 1, 114-138. doi:10.1214/16-AOAS994. http://projecteuclid.org/euclid.aoas/1491616874.