Electronic Journal of Statistics

A provable smoothing approach for high dimensional generalized regression with applications in genomics

Fang Han, Hongkai Ji, Zhicheng Ji, and Honglang Wang

Full-text: Open access

Abstract

In many applications, linear models fit the data poorly. This article studies an appealing alternative, the generalized regression model. This model only assumes that there exists an unknown monotonically increasing link function connecting the response $Y$ to a single index $\boldsymbol{X} ^{\mathsf{T}}\boldsymbol{\beta } ^{*}$ of explanatory variables $\boldsymbol{X} \in{\mathbb{R}} ^{d}$. The generalized regression model is flexible and covers many widely used statistical models. It fits the data generating mechanisms well in many real problems, which makes it useful in a variety of applications where regression models are regularly employed. In low dimensions, rank-based M-estimators are recommended to deal with the generalized regression model, giving root-$n$ consistent estimators of $\boldsymbol{\beta } ^{*}$. Applications of these estimators to high dimensional data, however, are questionable. This article studies, both theoretically and practically, a simple yet powerful smoothing approach to handle the high dimensional generalized regression model. Theoretically, a family of smoothing functions is provided, and the amount of smoothing necessary for efficient inference is carefully calculated. Practically, our study is motivated by an important and challenging scientific problem: decoding gene regulation by predicting transcription factors that bind to cis-regulatory elements. Applying our proposed method to this problem shows substantial improvement over the state-of-the-art alternative in real data.
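To make the idea of smoothing a rank-based objective concrete, here is a minimal sketch. It is not the estimator analyzed in the article. The unsmoothed function follows the maximum rank correlation objective of Han (1987), which counts pairs whose responses and indices $\boldsymbol{X}^{\mathsf{T}}\boldsymbol{\beta}$ are ordered the same way. The smoothed version replaces the index indicator with a sigmoid. The sigmoid, the bandwidth `h`, and the function names are assumptions chosen for illustration; the abstract does not specify the article's family of smoothing functions, its amount of smoothing, or any penalty term.

```python
import numpy as np


def mrc_objective(beta, X, y):
    """Unsmoothed rank correlation objective: the fraction of ordered pairs
    (i, j), i != j, with y_i > y_j and x_i' beta > x_j' beta."""
    index = X @ beta
    y_greater = y[:, None] > y[None, :]
    index_greater = index[:, None] > index[None, :]
    n = len(y)
    return (y_greater & index_greater).sum() / (n * (n - 1))


def smoothed_mrc_objective(beta, X, y, h=0.1):
    """Smoothed variant: the indicator on the index difference is replaced by
    a sigmoid with bandwidth h (an illustrative choice, not the article's)."""
    index = X @ beta
    scaled_diff = (index[:, None] - index[None, :]) / h
    # Numerically stable sigmoid: 1 / (1 + exp(-t)) = 0.5 * (1 + tanh(t / 2)).
    smooth_indicator = 0.5 * (1.0 + np.tanh(0.5 * scaled_diff))
    y_greater = y[:, None] > y[None, :]
    n = len(y)
    return (y_greater * smooth_indicator).sum() / (n * (n - 1))


if __name__ == "__main__":
    # Simulated data with an increasing link (exp) that the objective never uses.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    beta_star = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
    y = np.exp(X @ beta_star + 0.1 * rng.normal(size=200))

    print("unsmoothed at beta*: ", mrc_objective(beta_star, X, y))
    print("unsmoothed at -beta*:", mrc_objective(-beta_star, X, y))
    print("smoothed at beta*:   ", smoothed_mrc_objective(beta_star, X, y, h=0.01))
```

The unsmoothed objective is piecewise constant in `beta`, so gradient-based optimization cannot be applied to it directly. The smoothed objective is differentiable, and as `h` shrinks it approaches the unsmoothed one. Both depend on the response only through pairwise orderings, so any increasing link function yields the same values.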

Article information

Source
Electron. J. Statist., Volume 11, Number 2 (2017), 4347-4403.

Dates
Received: December 2016
First available in Project Euclid: 16 November 2017

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1510801790

Digital Object Identifier
doi:10.1214/17-EJS1352

Mathematical Reviews number (MathSciNet)
MR3724223

Zentralblatt MATH identifier
06816619

Subjects
Primary: 47N30: Applications in probability theory and statistics

Keywords
Semiparametric regression; generalized regression model; rank-based M-estimator; smoothing approximation; transcription factor binding

Rights
Creative Commons Attribution 4.0 International License.

Citation

Han, Fang; Ji, Hongkai; Ji, Zhicheng; Wang, Honglang. A provable smoothing approach for high dimensional generalized regression with applications in genomics. Electron. J. Statist. 11 (2017), no. 2, 4347–4403. doi:10.1214/17-EJS1352. https://projecteuclid.org/euclid.ejs/1510801790


