The Annals of Applied Statistics

A mixed-effects model for incomplete data from labeling-based quantitative proteomics experiments

Lin S. Chen, Jiebiao Wang, Xianlong Wang, and Pei Wang


In mass spectrometry (MS) based quantitative proteomics research, the emerging iTRAQ (isobaric tag for relative and absolute quantitation) and TMT (tandem mass tags) techniques have been widely adopted for high-throughput protein profiling. In a typical iTRAQ/TMT proteomics study, samples are grouped into batches, and each batch is processed in one multiplex experiment, in which the abundances of thousands of proteins/peptides in a batch of samples are measured simultaneously. The multiplex labeling technique greatly enhances the throughput of protein quantification. However, the technical variation across different iTRAQ/TMT multiplex experiments is often large due to the dynamic nature of MS instruments, which leads to strong batch effects in iTRAQ/TMT data. Moreover, iTRAQ/TMT data often contain substantial batch-level nonignorable missing entries. Specifically, the abundance measures of a given protein/peptide are often either observed or missing altogether in all the samples from the same batch, with the missing probability depending on the combined batch-level abundances. We term this missing-data mechanism the Batch-level Abundance-Dependent Missing-data Mechanism (BADMM). We introduce a new method, mixEMM, for analyzing iTRAQ/TMT data with batch effects and batch-level nonignorable missingness. The mixEMM method employs a linear mixed-effects model and explicitly models the batch effects and the BADMM. In simulation studies, we showed that, compared with existing approaches that use relative abundances and ignore the missing batches under the missing-completely-at-random assumption, the mixEMM method achieves more accurate parameter estimation and inference. We applied the method to an iTRAQ proteomics data set from a breast cancer study and identified phosphopeptides differentially expressed between breast cancer subtypes. The method can be applied to general clustered data with cluster-level nonignorable missing-data mechanisms.
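
To make the batch-level missingness described in the abstract concrete, the following is a minimal simulation sketch, not the mixEMM method itself. It generates abundances for one peptide from a random-intercept model and then drops whole batches with a probability that depends on the batch mean. The logistic form of the missingness probability, the function name `simulate_badmm`, and all parameter names and values (`gamma0`, `gamma1`, `sigma_b`, `sigma_e`) are assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np


def simulate_badmm(n_batches=5000, batch_size=4, mu=0.0,
                   sigma_b=1.0, sigma_e=0.5,
                   gamma0=0.0, gamma1=2.0, seed=0):
    """Simulate one peptide across batches with batch-level,
    abundance-dependent missingness (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)

    # Random-intercept model: y_ij = mu + b_i + e_ij
    b = rng.normal(0.0, sigma_b, size=(n_batches, 1))          # batch effects
    e = rng.normal(0.0, sigma_e, size=(n_batches, batch_size))  # sample noise
    y = mu + b + e

    # Assumed mechanism: a batch is observed with a probability that
    # increases with its mean abundance (logistic link, hypothetical).
    batch_mean = y.mean(axis=1)
    p_observed = 1.0 / (1.0 + np.exp(-(gamma0 + gamma1 * batch_mean)))
    observed = rng.random(n_batches) < p_observed

    # Missingness applies to the whole batch: all samples or none.
    y_obs = y.copy()
    y_obs[~observed, :] = np.nan
    return y, y_obs, observed


y_full, y_obs, observed = simulate_badmm()

full_mean = y_full.mean()          # estimate if nothing were missing
naive_mean = np.nanmean(y_obs)     # estimate that ignores missing batches

print(f"fraction of batches observed: {observed.mean():.2f}")
print(f"mean using all data:          {full_mean:.3f}")
print(f"mean ignoring missing batches: {naive_mean:.3f}")
```

Under these assumed settings, low-abundance batches are lost more often, so the mean computed from the observed batches alone overestimates the true mean `mu`. This is the kind of bias that motivates modeling the missing-data mechanism explicitly rather than treating missing batches as missing completely at random.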

Article information

Ann. Appl. Stat. Volume 11, Number 1 (2017), 114-138.

Received: June 2015
Revised: September 2016
First available in Project Euclid: 8 April 2017

Digital Object Identifier: doi:10.1214/16-AOAS994

Keywords: Mixed-effects models; the expectation-conditional-maximization (ECM) algorithm; Batch-level Abundance-Dependent Missing-data Mechanism (BADMM)


Chen, Lin S.; Wang, Jiebiao; Wang, Xianlong; Wang, Pei. A mixed-effects model for incomplete data from labeling-based quantitative proteomics experiments. Ann. Appl. Stat. 11 (2017), no. 1, 114–138. doi:10.1214/16-AOAS994.
