The Annals of Applied Statistics

“Virus hunting” using radial distance weighted discrimination

Jie Xiong, D. P. Dittmer, and J. S. Marron

Full-text: Open access

Abstract

Motivated by the challenge of using DNA-seq data to identify viruses in human blood samples, we propose a novel classification algorithm called “Radial Distance Weighted Discrimination” (or Radial DWD). This classifier is designed for binary classification, assuming one class is surrounded by the other class in very diverse radial directions, which is seen to be typical for our virus detection data. This separation of the two classes in multiple radial directions naturally motivates the development of Radial DWD. While classical machine learning methods such as the Support Vector Machine and linear Distance Weighted Discrimination can sometimes give reasonable answers for a given data set, their generalizability is severely compromised by the linear separating boundary. Radial DWD addresses this challenge by using a more appropriate (in this particular case) spherical separating boundary. Simulations show that, in appropriate radial contexts, this gives much better generalizability than linear methods, and also than conventional kernel-based (nonlinear) Support Vector Machines, because the latter essentially use up much of the information in the data to determine the shape of the separating boundary. The effectiveness of Radial DWD is demonstrated for real virus detection.
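The geometric idea in the abstract, an inner class enclosed by an outer class in many radial directions, separated by a sphere rather than a hyperplane, can be illustrated with a minimal sketch. This is an illustration only: the actual Radial DWD method is fit by solving a second-order cone program (see Alizadeh and Goldfarb, 2003, in the references), not by the simple centroid-and-radius rule used here, and the function names are hypothetical.

```python
import numpy as np

def fit_radial_boundary(X_inner, X_outer):
    """Estimate a center and radius for a spherical separating boundary.

    Assumes the inner class surrounds its own centroid and the outer
    class lies farther out in diverse radial directions.
    """
    center = X_inner.mean(axis=0)
    # Largest inner-class distance and smallest outer-class distance
    # from the estimated center.
    r_in = np.linalg.norm(X_inner - center, axis=1).max()
    r_out = np.linalg.norm(X_outer - center, axis=1).min()
    # Place the sphere midway between the two classes.
    radius = 0.5 * (r_in + r_out)
    return center, radius

def predict_radial(X, center, radius):
    """Label +1 (inner class) inside the sphere, -1 outside."""
    dist = np.linalg.norm(X - center, axis=1)
    return np.where(dist <= radius, 1, -1)
```

A linear classifier cannot separate such classes at all, since no hyperplane encloses a region; the spherical boundary does so with a single center and radius, which is why it generalizes well in this radial setting.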

Article information

Source
Ann. Appl. Stat., Volume 9, Number 4 (2015), 2090-2109.

Dates
Received: May 2014
Revised: August 2015
First available in Project Euclid: 28 January 2016

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1453994193

Digital Object Identifier
doi:10.1214/15-AOAS869

Mathematical Reviews number (MathSciNet)
MR3456367

Zentralblatt MATH identifier
06560823

Keywords
Virus hunting; nonlinear classification; high-dimension low-sample size data analysis; DNA sequencing

Citation

Xiong, Jie; Dittmer, D. P.; Marron, J. S. “Virus hunting” using radial distance weighted discrimination. Ann. Appl. Stat. 9 (2015), no. 4, 2090--2109. doi:10.1214/15-AOAS869. https://projecteuclid.org/euclid.aoas/1453994193



References

  • Alizadeh, F. and Goldfarb, D. (2003). Second-order cone programming. Math. Program. 95 3–51.
  • Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2 121–167.
  • Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20 101–148.
  • Goldstein, D. B., Allen, A., Keebler, J., Margulies, E. H., Petrou, S., Petrovski, S. and Sunyaev, S. (2013). Sequencing studies in human genetics: Design and interpretation. Nat. Rev. Genet. 14 460–470.
  • Grada, A. and Weinbrecht, K. (2013). Next-generation sequencing: Methodology and application. J. Invest. Dermatol. 133 e11.
  • Hall, P., Marron, J. S. and Neeman, A. (2005). Geometric representation of high dimension, low sample size data. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 427–444.
  • Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
  • Jiang, J., Marron, J. S. and Jiang, X. (2009). Robust centroid based classification with minimum error rates for high dimension, low sample size data. J. Statist. Plann. Inference 139 2571–2580.
  • Jung, S. and Marron, J. S. (2009). PCA consistency in high dimension, low sample size context. Ann. Statist. 37 4104–4130.
  • Liu, Y., Hayes, D. N., Nobel, A. and Marron, J. S. (2008). Statistical significance of clustering for high-dimension, low-sample size data. J. Amer. Statist. Assoc. 103 1281–1293.
  • Marron, J. S., Todd, M. J. and Ahn, J. (2007). Distance-weighted discrimination. J. Amer. Statist. Assoc. 102 1267–1271.
  • Metzker, M. L. (2010). Sequencing technologies—the next generation. Nat. Rev. Genet. 11 31–46.
  • Mwenifumbo, J. C. and Marra, M. A. (2013). Cancer genome-sequencing study design. Nat. Rev. Genet. 14 321–332.
  • Qiao, X., Zhang, H. H., Liu, Y., Todd, M. J. and Marron, J. S. (2010). Weighted distance weighted discrimination and its asymptotic properties. J. Amer. Statist. Assoc. 105 401–414.
  • Rehm, H. L. (2013). Disease-targeted sequencing: A cornerstone in the clinic. Nat. Rev. Genet. 14 295–300.
  • Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press, Cambridge, MA.
  • Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge Univ. Press, Cambridge.
  • Shen, D., Shen, H. and Marron, J. S. (2013). Consistency of sparse PCA in high dimension, low sample size contexts. J. Multivariate Anal. 115 317–333.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • Tütüncü, R. H., Toh, K. C. and Todd, M. J. (2001). SDPT3—a MATLAB software package for semidefinite-quadratic-linear programming. Available at http://www.math.cmu.edu/users/reha/home.html.
  • Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
  • World Health Organization (WHO) (2014). Middle East respiratory syndrome coronavirus (MERS-CoV) summary and literature update, as of 9 May 2014. Available at http://www.who.int/csr/disease/coronavirus_infections/MERS_CoV_Update_09_May_2014.pdf.
  • Xiong, J., Dittmer, D. P. and Marron, J. S. (2015). Supplement to: “Virus hunting” using Radial Distance Weighted Discrimination. DOI:10.1214/15-AOAS869SUPP.
  • Yata, K. and Aoshima, M. (2013). PCA consistency for the power spiked model in high-dimensional settings. J. Multivariate Anal. 122 334–354.

Supplemental materials

  • Supplement to: “Virus hunting” using Radial Distance Weighted Discrimination. In the supplementary materials, we first introduce useful biology background for virus detection in Section 1 and the DNA alignment process in Section 2, and then discuss insights into the Dirichlet distribution in Section 3. Real data examples and simulation studies are included in Sections 4 and 5, respectively. Theorems and proofs are in Section 6.