Source: Ann. Appl. Stat. Volume 5, Number 2A (2011), 628–644.
We introduce a new version of forward stepwise regression. Our
modification finds solutions to regression problems where the
selected predictors appear in a structured pattern, with respect
to a predefined distance measure over the candidate predictors.
Our method is motivated by the problem of predicting HIV-1 drug
resistance from protein sequences. We find that our method
improves the interpretability of drug resistance models while
achieving predictive accuracy comparable to standard methods. We also
demonstrate our method in a simulation study and present some
theoretical results and connections.
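
To make the idea of structured selection concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes a plain greedy forward selection in which each candidate's absolute inner product with the current residual is reweighted by a hypothetical proximity term: the fraction of already-selected predictors lying within a distance `bandwidth` of the candidate. The function name `structured_forward_stepwise`, the parameters `bandwidth` and `mix`, and the weighting rule itself are illustrative assumptions.

```python
import numpy as np


def structured_forward_stepwise(X, y, dist, n_steps, bandwidth=1.0, mix=0.1):
    """Greedy forward selection with a hypothetical proximity weighting.

    X: (n, p) design matrix; y: (n,) response;
    dist: (p, p) predefined pairwise distances between candidate predictors.
    Returns the list of selected column indices, in selection order.
    """
    n, p = X.shape
    # Center and scale columns so inner products with the residual are comparable.
    Xc = X - X.mean(axis=0)
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = 1.0
    Xs = Xc / norms
    yc = y - y.mean()

    resid = yc.copy()
    selected = []
    for _ in range(min(n_steps, p)):
        score = np.abs(Xs.T @ resid)
        if selected:
            # Assumed weighting: fraction of selected predictors within
            # `bandwidth`, mixed with a constant so distant predictors
            # can still enter the model.
            near = (dist[:, selected] <= bandwidth).mean(axis=1)
            score = score * (mix + (1.0 - mix) * near)
            score[selected] = -np.inf  # never reselect a predictor
        j = int(np.argmax(score))
        selected.append(j)
        # Refit ordinary least squares on the selected set, update the residual.
        beta, *_ = np.linalg.lstsq(Xs[:, selected], yc, rcond=None)
        resid = yc - Xs[:, selected] @ beta
    return selected
```

For sequence data, `dist` could be the absolute difference between positions along the protein, so that selected mutations tend to cluster at nearby positions. Setting `mix=1.0` removes the proximity weighting, leaving a standard residual-correlation forward selection.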