The Annals of Applied Statistics

A decision-theoretic approach for segmental classification

Christopher Yau and Christopher C. Holmes

Abstract

This paper is concerned with statistical methods for the segmental classification of linear sequence data, where the task is to segment and classify the data according to an underlying hidden discrete state sequence. Such analysis is commonplace in the empirical sciences, including genomics, finance and speech processing. In particular, we are interested in answering the following question: given data $y$ and a statistical model $\pi(x,y)$ of the hidden states $x$, what should we report as the prediction $\hat{x}$ under the posterior distribution $\pi(x|y)$? That is, how should one predict the underlying states? We demonstrate that traditional approaches, such as reporting the most probable state sequence or the most probable set of marginal predictions, can give undesirable classification artefacts and offer limited control over the properties of the prediction. We propose a decision-theoretic approach using a novel class of Markov loss functions and report $\hat{x}$ via the principle of minimum expected loss (maximum expected utility). We demonstrate that the sequence of minimum expected loss under the Markov loss function can be enumerated exactly using dynamic programming methods and that it offers flexibility and performance improvements over existing techniques. The result is generic and applicable to any probabilistic model on a sequence, such as hidden Markov models, change point models or product partition models.
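
As a rough illustration of how a minimum expected loss sequence can be found by dynamic programming, the Python sketch below assumes a loss that is a sum of terms over adjacent pairs of positions, $\sum_t l(x_t, x_{t+1}, \hat{x}_t, \hat{x}_{t+1})$. The abstract does not state the exact form of the Markov loss, so this pairwise form is an assumption, as are the names `pair_post` (pairwise posterior marginals $\pi(x_t, x_{t+1}|y)$, taken as given), `loss`, and every function name. Under that assumption the expected loss decomposes into a cost per adjacent pair of predicted states, and a Viterbi-style recursion minimises it exactly.

```python
import itertools
import numpy as np


def expected_pair_cost(pair_post, loss):
    """cost[t, a, b] = sum_{i,j} pair_post[t, i, j] * loss[i, j, a, b].

    pair_post: shape (T-1, K, K), assumed posterior of (x_t, x_{t+1}) given y.
    loss:      shape (K, K, K, K), hypothetical loss for true pair (i, j)
               when the reported pair is (a, b).
    """
    return np.einsum("tij,ijab->tab", pair_post, loss)


def min_expected_loss_path(pair_post, loss):
    """Return (path, value) minimising the summed expected pairwise loss."""
    cost = expected_pair_cost(pair_post, loss)        # shape (T-1, K, K)
    n_steps, n_states, _ = cost.shape
    best = np.zeros(n_states)                         # best prefix cost per end state
    back = np.zeros((n_steps, n_states), dtype=int)   # best predecessor per step
    for t in range(n_steps):
        total = best[:, None] + cost[t]               # total[a, b]
        back[t] = total.argmin(axis=0)
        best = total.min(axis=0)
    # Trace the optimal sequence backwards from the best final state.
    path = [int(best.argmin())]
    for t in range(n_steps - 1, -1, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return path, float(best.min())


def brute_force(pair_post, loss):
    """Exhaustive search over all K**T sequences, for checking small cases."""
    cost = expected_pair_cost(pair_post, loss)
    n_steps, n_states, _ = cost.shape

    def value(seq):
        return sum(cost[t, seq[t], seq[t + 1]] for t in range(n_steps))

    best_seq = min(itertools.product(range(n_states), repeat=n_steps + 1), key=value)
    return list(best_seq), float(value(best_seq))
```

The recursion costs on the order of $TK^2$ operations once the expected pair costs are formed, against $K^T$ for exhaustive search; `brute_force` is included only so the recursion can be checked on small inputs. How the pairwise posteriors are obtained depends on the model and is outside this sketch.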

Article information

Source
Ann. Appl. Stat. Volume 7, Number 3 (2013), 1814-1835.

Dates
First available in Project Euclid: 3 October 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1380804818

Digital Object Identifier
doi:10.1214/13-AOAS657

Mathematical Reviews number (MathSciNet)
MR3127970

Zentralblatt MATH identifier
06237199

Keywords
Segmental classification, decision theory, Bayesian

Citation

Yau, Christopher; Holmes, Christopher C. A decision-theoretic approach for segmental classification. Ann. Appl. Stat. 7 (2013), no. 3, 1814–1835. doi:10.1214/13-AOAS657. https://projecteuclid.org/euclid.aoas/1380804818

