The Annals of Applied Statistics

Bayesian nonparametric models for peak identification in MALDI-TOF mass spectroscopy

Leanna L. House, Merlise A. Clyde, and Robert L. Wolpert

Full-text: Open access


We present a novel nonparametric Bayesian approach based on Lévy Adaptive Regression Kernels (LARK) to model spectral data arising from MALDI-TOF (Matrix-Assisted Laser Desorption Ionization Time-of-Flight) mass spectrometry. This model-based approach identifies and quantifies proteins through model parameters that are directly interpretable as the number of proteins, the mass and abundance of each protein, and the peak resolution, while retaining the ability to adapt to unknown smoothness, as wavelet-based methods do. Informative prior distributions on resolution are key to distinguishing true peaks from background noise and to resolving broad peaks into individual peaks for multiple protein species. Posterior distributions are obtained using a reversible jump Markov chain Monte Carlo algorithm and provide inference about the number of peaks (proteins), their masses and their abundances. We show through simulation studies that the procedure has desirable true-positive and false-discovery rates. Finally, we illustrate the method on five example spectra: a blank spectrum; a spectrum containing only the matrix, a low-molecular-weight substance used to embed target proteins; a spectrum with known proteins; and a single spectrum and an average of ten spectra from an individual lung cancer patient.
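
To make the interpretable parameters concrete, the following is a minimal sketch, not the authors' implementation, of a mean spectrum written as a sum of kernels, one per protein. It assumes a Gaussian kernel of unit height and the usual mass-spectrometry definition of resolution as mass divided by the peak's full width at half maximum (FWHM); the abstract specifies neither the kernel nor the parametrization. The function names, peak list and parameter values are hypothetical, and the sketch covers only the forward model, not the priors or the reversible jump sampler.

```python
import math

# Factor converting a Gaussian's full width at half maximum to its standard deviation.
FWHM_TO_SIGMA = 1.0 / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def peak_sigma(mass, resolution):
    """Standard deviation of a Gaussian peak whose FWHM is mass / resolution."""
    return (mass / resolution) * FWHM_TO_SIGMA


def mean_spectrum(mz_grid, peaks, resolution):
    """Evaluate f(m) = sum_j abundance_j * k(m; mass_j, resolution) on a grid.

    `peaks` is a list of (mass, abundance) pairs; its length plays the role
    of the number of proteins. The kernel k is a Gaussian with unit height
    at its centre (an assumption made for this sketch).
    """
    values = []
    for m in mz_grid:
        total = 0.0
        for mass, abundance in peaks:
            s = peak_sigma(mass, resolution)
            total += abundance * math.exp(-0.5 * ((m - mass) / s) ** 2)
        values.append(total)
    return values


# Hypothetical example: two species close in mass, plus one isolated species.
peaks = [(5000.0, 10.0), (5020.0, 6.0), (8000.0, 4.0)]
mz_grid = [4900.0 + 0.5 * i for i in range(6401)]  # 4900 to 8100 Da in 0.5 Da steps
spectrum = mean_spectrum(mz_grid, peaks, resolution=500.0)
```

In this toy setting, lowering `resolution` (for example to 100) widens every peak enough that the two neighbouring species blend into a single broad peak. That is the kind of ambiguity the abstract says informative priors on resolution help to resolve.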

Article information

Ann. Appl. Stat. Volume 5, Number 2B (2011), 1488–1511.

First available in Project Euclid: 13 July 2011

Keywords: Gamma random field; kernel regression; Lévy random fields; reversible jump Markov chain Monte Carlo; wavelets


House, Leanna L.; Clyde, Merlise A.; Wolpert, Robert L. Bayesian nonparametric models for peak identification in MALDI-TOF mass spectroscopy. Ann. Appl. Stat. 5 (2011), no. 2B, 1488–1511. doi:10.1214/10-AOAS450.



Supplemental materials

  • Supplementary material: Additional results for the simulation study. True positive rates for LARK estimates from the simulation study broken down by peak prevalence and average intensity of peaks across samples.