The Annals of Applied Statistics

The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments

Jonathon J. O’Brien, Harsha P. Gunawardena, Joao A. Paulo, Xian Chen, Joseph G. Ibrahim, Steven P. Gygi, and Bahjat F. Qaqish



An idealized version of a label-free discovery mass spectrometry proteomics experiment would provide absolute abundance measurements for a whole proteome, across varying conditions. Unfortunately, this ideal is not realized. Measurements are made on peptides, so an inferential step is required to obtain protein-level estimates. The inference is complicated by experimental factors that necessitate relative abundance estimation and result in widespread nonignorable missing data. Relative abundance on the log scale takes the form of parameter contrasts. In a complete-case analysis, contrast estimates may be biased by missing data, and a substantial amount of useful information will often go unused.
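The claim that a complete-case analysis can bias contrast estimates can be illustrated with a small simulation. The sketch below is not the authors' model: it assumes normally distributed log-intensities in two conditions and a hypothetical logistic observation mechanism in which low intensities are more likely to be missing. The function name, parameter values, and the logistic form are all assumptions made for illustration. The contrast of interest is the difference in mean log-intensity, which corresponds to the log of the relative abundance.

```python
import math
import random


def simulate_complete_case_bias(n=20000, mu_a=1.0, mu_b=0.0, sd=1.0,
                                slope=2.0, midpoint=0.0, seed=0):
    """Return (true contrast, complete-case estimate).

    Hypothetical setup: log-intensities are normal in each condition, and
    the probability that a value is observed increases with the value
    itself (a logistic curve), so the missingness is nonignorable.
    """
    rng = random.Random(seed)

    def observed_mean(mu):
        kept = []
        for _ in range(n):
            y = rng.gauss(mu, sd)
            # Assumed observation probability: low intensities are dropped more often.
            p_obs = 1.0 / (1.0 + math.exp(-slope * (y - midpoint)))
            if rng.random() < p_obs:
                kept.append(y)
        # Complete-case mean: average over the observed values only.
        return sum(kept) / len(kept)

    estimate = observed_mean(mu_a) - observed_mean(mu_b)
    return mu_a - mu_b, estimate


true_contrast, cc_estimate = simulate_complete_case_bias()
print(f"true contrast: {true_contrast:.2f}, "
      f"complete-case estimate: {cc_estimate:.2f}")
```

Under these assumptions the lower-abundance condition loses more of its low values, so its observed mean is inflated more than that of the higher-abundance condition, and the complete-case contrast is pulled toward zero.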

To avoid problems with missing data, many analysts have turned to single imputation solutions. Unfortunately, these methods often create further difficulties by hiding inestimable contrasts, preventing the recovery of interblock information and failing to account for imputation uncertainty. To mitigate many of the problems caused by missing values, we propose the use of a Bayesian selection model. Our model is tested on simulated data, on real data with simulated missing values, and on a ground truth dilution experiment where all of the true relative changes are known. The analysis suggests that our model, compared with various imputation strategies and complete-case analyses, can increase accuracy and provide substantial improvements to interval coverage.
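For readers unfamiliar with the term, a selection model in the general missing-data sense factors the joint distribution of the data and the missingness indicators as shown below. This is the generic textbook form, not the specific likelihood used in the article; the symbols $y$ (intensities), $r$ (observation indicators), $\theta$ and $\psi$ (parameters) are illustrative notation.

```latex
\[
  f(y, r \mid \theta, \psi) \;=\; f(y \mid \theta)\, f(r \mid y, \psi)
\]
```

The missingness is nonignorable when $f(r \mid y, \psi)$ depends on the unobserved components of $y$, so the observation mechanism cannot be dropped from the likelihood without risking biased inference about $\theta$.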

Article information

Ann. Appl. Stat., Volume 12, Number 4 (2018), 2075-2095.

Received: March 2017
Revised: January 2018
First available in Project Euclid: 13 November 2018


Keywords: Data dependent analysis; estimable contrasts; selection model; Bayesian inference; imputation; interval coverage


O’Brien, Jonathon J.; Gunawardena, Harsha P.; Paulo, Joao A.; Chen, Xian; Ibrahim, Joseph G.; Gygi, Steven P.; Qaqish, Bahjat F. The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments. Ann. Appl. Stat. 12 (2018), no. 4, 2075–2095. doi:10.1214/18-AOAS1144.




Supplemental materials

  • Supplementary material: Additional text containing proofs, simulation details, algorithmic and experimental procedures.
  • W2W16 Tables: Data tables and results pertaining to the two sample breast cancer data.
  • Dilution Tables part 1: Data tables and results pertaining to the ground truth dilution experiment.
  • Dilution Tables part 2: Data tables and results pertaining to the ground truth dilution experiment.
  • Dilution Tables part 3: Data tables and results pertaining to the ground truth dilution experiment.