The Annals of Applied Statistics

Hidden Markov models for the assessment of chromosomal alterations using high-throughput SNP arrays

Robert B. Scharpf, Giovanni Parmigiani, Jonathan Pevsner, and Ingo Ruczinski

Full-text: Open access

Abstract

Chromosomal DNA is characterized by variation between individuals at the level of entire chromosomes (e.g., aneuploidy, in which the chromosome copy number is altered), segmental changes (including insertions, deletions, inversions, and translocations), and changes to small genomic regions (including single nucleotide polymorphisms). A variety of alterations that occur in chromosomal DNA, many of which can be detected using high-density single nucleotide polymorphism (SNP) microarrays, are linked to normal variation as well as disease and are therefore of particular interest. These include changes in copy number (deletions and duplications) and genotype (e.g., the occurrence of regions of homozygosity). Hidden Markov models (HMMs) are particularly useful for detecting such alterations because they model the spatial dependence between neighboring SNPs. Here, we improve on previous approaches that use HMM frameworks for inference in high-throughput SNP arrays by integrating copy number, genotype calls, and, when available, the corresponding measures of uncertainty. In particular, using simulated and experimental data, we demonstrate how confidence scores control smoothing in a probabilistic framework. Software for fitting HMMs to SNP array data is available in the R package VanillaICE.
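
As a rough illustration of how a confidence score can control smoothing in an HMM, the sketch below runs a plain Viterbi decoder over log2 copy number ratios. It is not the VanillaICE implementation and not the authors' model: the three states, their means, the transition probability `stay`, and the idea of passing a per-SNP standard deviation `sds` (large when confidence is low) are all assumptions made for this example.

```python
import math

# Hypothetical hidden states and mean log2 copy number ratios (illustrative values only).
STATES = ("deletion", "normal", "amplification")
MEANS = (-1.0, 0.0, 0.58)


def log_gauss(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(obs, sds, stay=0.99):
    """Return the most likely state path.

    obs[t] is the observed log2 ratio at SNP t; sds[t] is its assumed standard
    deviation, so a low-confidence SNP (large sd) carries little weight.
    """
    k = len(STATES)
    log_stay = math.log(stay)
    log_move = math.log((1 - stay) / (k - 1))

    # Uniform initial state distribution.
    score = [math.log(1 / k) + log_gauss(obs[0], MEANS[s], sds[0]) for s in range(k)]
    back = []
    for t in range(1, len(obs)):
        new_score, ptr = [], []
        for s in range(k):
            cands = [score[r] + (log_stay if r == s else log_move) for r in range(k)]
            best = max(range(k), key=lambda r: cands[r])
            ptr.append(best)
            new_score.append(cands[best] + log_gauss(obs[t], MEANS[s], sds[t]))
        score = new_score
        back.append(ptr)

    # Trace the best path backwards from the best final state.
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


# Seven SNPs with an apparent three-SNP deletion in the middle.
obs = [0.0, 0.05, -1.0, -0.95, -1.05, 0.02, 0.0]
confident = viterbi(obs, [0.1] * 7)
uncertain = viterbi(obs, [0.1, 0.1, 1.0, 1.0, 1.0, 0.1, 0.1])
print(confident)
print(uncertain)
```

With a small standard deviation at every SNP, the dip is decoded as a deletion. When the same three SNPs are given a large standard deviation, the evidence is too weak to pay for two state transitions and the path is smoothed to normal throughout.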

Article information

Source
Ann. Appl. Stat. Volume 2, Number 2 (2008), 687-713.

Dates
First available in Project Euclid: 3 July 2008

Permanent link to this document
http://projecteuclid.org/euclid.aoas/1215118534

Digital Object Identifier
doi:10.1214/07-AOAS155

Mathematical Reviews number (MathSciNet)
MR2524352

Zentralblatt MATH identifier
05591294

Citation

Scharpf, Robert B.; Parmigiani, Giovanni; Pevsner, Jonathan; Ruczinski, Ingo. Hidden Markov models for the assessment of chromosomal alterations using high-throughput SNP arrays. Ann. Appl. Stat. 2 (2008), no. 2, 687–713. doi:10.1214/07-AOAS155. http://projecteuclid.org/euclid.aoas/1215118534.


