## Electronic Journal of Statistics

### Designing penalty functions in high dimensional problems: The role of tuning parameters

#### Abstract

Various forms of penalty functions have been developed for regularized estimation and variable selection. Screening approaches are often used to reduce the number of covariates before penalized estimation. However, in certain problems, the number of covariates remains large after screening. For example, in genome-wide association (GWA) studies, the purpose is to identify Single Nucleotide Polymorphisms (SNPs) that are associated with certain traits, and typically there are millions of SNPs and thousands of samples. Because of the strong correlation of nearby SNPs, screening can only reduce the number of SNPs from millions to tens of thousands, and the variable selection problem remains very challenging. Several penalty functions have been proposed for such high dimensional data. However, it is unclear which class of penalty functions is the appropriate choice for a particular application. In this paper, we conduct a theoretical analysis to relate the ranges of tuning parameters of various penalty functions to the dimensionality of the problem and the minimum effect size. We illustrate our theoretical results for several penalty functions. The results suggest that a class of penalty functions that bridges $L_{0}$ and $L_{1}$ penalties requires less restrictive conditions on dimensionality and minimum effect sizes in order to attain the two fundamental goals of penalized estimation: to penalize all the noise to be zero and to obtain unbiased estimation of the true signals. Penalties such as SICA and Log belong to this class, but they have not been used often in applications. The simulation and real data analysis using GWAS data suggest the promising applicability of this class of penalties.
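As a rough illustration of what "bridging $L_{0}$ and $L_{1}$" can mean, the sketch below evaluates a SICA-type penalty in the commonly cited form $\rho_a(t) = (a+1)|t|/(a+|t|)$ with shape parameter $a > 0$. This form, the function names, and the parameter name `a` are assumptions made for illustration; the abstract does not state the parametrization used in the paper, which may differ (for example, by including a separate regularization parameter).

```python
def l0_penalty(t):
    """L0 penalty: 1 for any nonzero coefficient, 0 otherwise."""
    return 0.0 if t == 0 else 1.0


def l1_penalty(t):
    """L1 (lasso) penalty: absolute value of the coefficient."""
    return abs(t)


def sica_penalty(t, a):
    """SICA-type penalty (a + 1)|t| / (a + |t|), assumed form, a > 0.

    Small a pushes the penalty toward L0; large a pushes it toward L1.
    """
    if a <= 0:
        raise ValueError("shape parameter a must be positive")
    t = abs(t)
    return (a + 1.0) * t / (a + t)


if __name__ == "__main__":
    t = 0.5  # hypothetical coefficient value
    for a in (1e-8, 1.0, 1e8):
        print(f"a={a:g}: sica={sica_penalty(t, a):.6f}")
    print(f"L0={l0_penalty(t):.6f}, L1={l1_penalty(t):.6f}")
```

Under this assumed form, a very small `a` gives a penalty close to the $L_{0}$ value at any nonzero coefficient, while a very large `a` gives a penalty close to the $L_{1}$ value, so the shape parameter interpolates between the two.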

#### Article information

Source
Electron. J. Statist., Volume 10, Number 2 (2016), 2312-2328.

Dates
First available in Project Euclid: 29 August 2016

https://projecteuclid.org/euclid.ejs/1472498029

Digital Object Identifier
doi:10.1214/16-EJS1169

Mathematical Reviews number (MathSciNet)
MR3541973

Zentralblatt MATH identifier
06624518

#### Citation

Chen, Ting-Huei; Sun, Wei; Fine, Jason P. Designing penalty functions in high dimensional problems: The role of tuning parameters. Electron. J. Statist. 10 (2016), no. 2, 2312–2328. doi:10.1214/16-EJS1169. https://projecteuclid.org/euclid.ejs/1472498029
