The Annals of Applied Statistics

Survival analysis of DNA mutation motifs with penalized proportional hazards

Jean Feng, David A. Shaw, Vladimir N. Minin, Noah Simon, and Frederick A. Matsen IV

Full-text: Open access


Antibodies, an essential part of our immune system, develop through an intricate process to bind a wide array of pathogens. This process involves randomly mutating the DNA sequences encoding these antibodies to find variants with improved binding, though mutations are not distributed uniformly across sequence sites. Immunologists observe this nonuniformity to be consistent with “mutation motifs,” which are short DNA subsequences that affect how likely a given site is to experience a mutation. Quantifying the effect of motifs on mutation rates is challenging: the large number of possible motifs makes the statistical problem high dimensional, while the unobserved history of the mutation process leads to a nontrivial missing data problem. We introduce an $\ell_{1}$-penalized proportional hazards model to infer mutation motifs and their effects. To estimate model parameters, our method uses a Monte Carlo EM algorithm to marginalize over the unknown ordering of mutations. We show that our method performs better than current methods on simulated data and leads to more parsimonious models. The application of proportional hazards to mutation processes is, to our knowledge, novel; it formalizes the current methods in a statistical framework that can be easily extended to analyze the effect of other biological features on mutation rates.
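
As a rough illustration of the kind of model the abstract describes, the following Python sketch fits an $\ell_{1}$-penalized proportional hazards likelihood for a single, known mutation order by proximal gradient descent. It is not the authors' implementation. It rests on simplifying assumptions: the mutation order is treated as observed (the paper instead marginalizes over it with Monte Carlo EM), site features stay fixed after each mutation, and all names (`soft_threshold`, `neg_log_likelihood`, `gradient`, `fit_lasso`) as well as the step size and iteration count are hypothetical choices made for this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1 (shrinks each entry toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def neg_log_likelihood(theta, X, order):
    """Negative log-likelihood of one assumed-known mutation order.

    X[j] is the feature vector of site j (for example, motif indicators).
    The chance that site j mutates next is exp(X[j] @ theta) divided by the
    sum of exp(X[k] @ theta) over sites k that have not yet mutated.
    Simplification: features are held fixed as mutations accumulate.
    """
    scores = X @ theta
    at_risk = np.ones(len(X), dtype=bool)
    nll = 0.0
    for j in order:
        nll -= scores[j] - np.log(np.sum(np.exp(scores[at_risk])))
        at_risk[j] = False  # a mutated site leaves the risk set
    return nll

def gradient(theta, X, order):
    """Gradient of neg_log_likelihood with respect to theta."""
    scores = X @ theta
    at_risk = np.ones(len(X), dtype=bool)
    grad = np.zeros_like(theta)
    for j in order:
        w = np.exp(scores[at_risk])
        w /= w.sum()  # probability of each at-risk site mutating next
        grad -= X[j] - w @ X[at_risk]
        at_risk[j] = False
    return grad

def fit_lasso(X, order, lam, step=0.1, iters=500):
    """Proximal gradient descent on neg_log_likelihood + lam * ||theta||_1."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = soft_threshold(theta - step * gradient(theta, X, order),
                               step * lam)
    return theta

# Toy data: four sites, two hypothetical motif indicator features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
order = [0, 2]  # sites 0 and 2 mutated, in that order
theta_hat = fit_lasso(X, order, lam=0.05)
```

Increasing `lam` drives more coefficients exactly to zero, which is the mechanism by which an $\ell_{1}$ penalty yields sparser, more parsimonious models.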

Article information

Ann. Appl. Stat., Volume 13, Number 2 (2019), 1268–1294.

Received: November 2017
Revised: September 2018
First available in Project Euclid: 17 June 2019

Digital Object Identifier: doi:10.1214/18-AOAS1233

Keywords: antibody maturation; survival analysis; Monte Carlo expectation–maximization; lasso; somatic hypermutation


Feng, Jean; Shaw, David A.; Minin, Vladimir N.; Simon, Noah; Matsen IV, Frederick A. Survival analysis of DNA mutation motifs with penalized proportional hazards. Ann. Appl. Stat. 13 (2019), no. 2, 1268–1294. doi:10.1214/18-AOAS1233.



