The Annals of Applied Statistics

A likelihood-based scoring method for peptide identification using mass spectrometry

Qunhua Li, Jimmy K. Eng, and Matthew Stephens

Full-text: Open access


Mass spectrometry provides a high-throughput approach to identify proteins in biological samples. A key step in the analysis of mass spectrometry data is to identify the peptide sequence that, most probably, gave rise to each observed spectrum. This is often tackled using a database search: each observed spectrum is compared against a large number of theoretical “expected” spectra predicted from candidate peptide sequences in a database, and the best match is identified using some heuristic scoring criterion. Here we provide a more principled, likelihood-based, scoring criterion for this problem. Specifically, we introduce a probabilistic model that allows one to assess, for each theoretical spectrum, the probability that it would produce the observed spectrum. This probabilistic model takes account of peak locations and intensities, in both observed and theoretical spectra, which enables incorporation of detailed knowledge of chemical plausibility in peptide identification. Besides placing peptide scoring on a sounder theoretical footing, the likelihood-based score also has important practical benefits: it provides natural measures for assessing the uncertainty of each identification, and in comparisons on benchmark data it produced more accurate peptide identifications than other methods, including SEQUEST. Although we focus here on peptide identification, our scoring rule could easily be integrated into any downstream analyses that require peptide-spectrum match scores.
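To make the database-search setup described in the abstract concrete, the following is a minimal sketch of scoring candidate peptides by a log-likelihood ratio. It is not the model developed in the article. It assumes a much simpler toy model in which each theoretical singly charged b- or y-ion peak is observed within a mass tolerance with probability p_signal if the candidate is correct and with probability p_noise otherwise, and it ignores peak intensities entirely. The function names, the parameter values (tol, p_signal, p_noise), and the example peptides are all hypothetical choices made for illustration.

```python
import math

# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def theoretical_peaks(peptide):
    """Return sorted m/z values of singly charged b- and y-ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(masses)):
        peaks.append(sum(masses[:i]) + PROTON)          # b-ion (prefix)
        peaks.append(sum(masses[i:]) + WATER + PROTON)  # y-ion (suffix)
    return sorted(peaks)


def log_likelihood_ratio(observed_mz, peptide, tol=0.5,
                         p_signal=0.7, p_noise=0.05):
    """Toy score: log P(matches | candidate correct) - log P(matches | noise).

    Assumed model (illustrative only): each theoretical peak is matched
    independently, with probability p_signal under the candidate and
    p_noise under the null.
    """
    score = 0.0
    for mz in theoretical_peaks(peptide):
        matched = any(abs(mz - obs) <= tol for obs in observed_mz)
        if matched:
            score += math.log(p_signal / p_noise)
        else:
            score += math.log((1.0 - p_signal) / (1.0 - p_noise))
    return score


def best_match(observed_mz, candidates, **kwargs):
    """Return (score, peptide) for the highest-scoring candidate."""
    scored = [(log_likelihood_ratio(observed_mz, pep, **kwargs), pep)
              for pep in candidates]
    return max(scored)


# Usage: simulate an observed spectrum from a known peptide, dropping the
# two largest fragment peaks and adding one spurious noise peak.
observed = theoretical_peaks("PEPTIDE")[:-2] + [300.0]
candidates = ["PEPTIDE", "PEPSIDE", "GASVLK"]
score, peptide = best_match(observed, candidates)
print(peptide, round(score, 2))
```

Because the score is a sum of per-peak log-likelihood ratios, differences between candidate scores can be read directly as log odds under the assumed model, which is one reason a likelihood-based criterion lends itself to uncertainty assessment more readily than a heuristic score.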

Article information

Ann. Appl. Stat. Volume 6, Number 4 (2012), 1775-1794.

First available in Project Euclid: 27 December 2012


Keywords

Generative model; maximum likelihood; peptide identification; proteomics


Li, Qunhua; Eng, Jimmy K.; Stephens, Matthew. A likelihood-based scoring method for peptide identification using mass spectrometry. Ann. Appl. Stat. 6 (2012), no. 4, 1775–1794. doi:10.1214/12-AOAS568.



