## Electronic Journal of Statistics

### Heterogeneity adjustment with applications to graphical model inference

#### Abstract

Heterogeneity is unwanted variation that arises when analyzing aggregated datasets from multiple sources. Though different methods have been proposed for heterogeneity adjustment, no systematic theory exists to justify these methods. In this work, we propose a generic framework named ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) to model, estimate, and adjust heterogeneity from the original data. Once the heterogeneity is adjusted for, we can remove the batch effects and enhance inferential power by aggregating the homogeneous residuals from multiple sources. Under a pervasiveness assumption that the latent heterogeneity factors simultaneously affect a fraction of the observed variables, we provide a rigorous theory to justify the proposed framework. Our framework also allows the incorporation of informative covariates and appeals to the "blessing of dimensionality". As an illustrative application of this generic framework, we consider the problem of estimating a high-dimensional precision matrix for graphical model inference based on multiple datasets. We also provide thorough numerical studies on both synthetic datasets and a brain imaging dataset to demonstrate the efficacy of the developed theory and methods.
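The adjust-then-aggregate idea can be illustrated with a minimal NumPy sketch. It is not the authors' estimator: it assumes each source follows a low-rank-plus-noise model, treats the number of latent factors `k` as known, removes the leading principal components of each source by a plain truncated SVD, and inverts the pooled residual covariance directly (which is only sensible when the pooled sample size exceeds the dimension). The function names `remove_low_rank` and `pooled_residual_covariance` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def remove_low_rank(X, k):
    """Center the rows of a p x n data matrix and subtract its best rank-k approximation."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]   # estimated heterogeneity component
    return Xc - low_rank                        # estimated homogeneous residual

def pooled_residual_covariance(batches, ks):
    """Adjust each source separately, then pool the residuals into one covariance estimate."""
    residuals = [remove_low_rank(X, k) for X, k in zip(batches, ks)]
    R = np.hstack(residuals)                    # p x (n_1 + ... + n_m)
    return R @ R.T / R.shape[1]

# Synthetic example: three sources sharing the same noise law but with
# source-specific latent factors (all sizes here are assumptions for the demo).
rng = np.random.default_rng(0)
p, n, k = 10, 400, 2
batches = []
for _ in range(3):
    loadings = 3.0 * rng.normal(size=(p, k))
    factors = rng.normal(size=(k, n))
    noise = rng.normal(size=(p, n))
    batches.append(loadings @ factors + noise)

sigma_adj = pooled_residual_covariance(batches, [k] * 3)
sigma_raw = np.cov(np.hstack(batches))          # naive pooling without adjustment

# Plain inversion stands in for a high-dimensional precision matrix estimator.
omega_adj = np.linalg.inv(sigma_adj)
```

In this toy setting the trace of `sigma_adj` is far smaller than that of `sigma_raw`, because the source-specific factor variation has been projected out before pooling. In a genuinely high-dimensional problem the final inversion would be replaced by a regularized precision matrix estimator.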

#### Article information

Source
Electron. J. Statist., Volume 12, Number 2 (2018), 3908-3952.

Dates
Received: September 2017
First available in Project Euclid: 5 December 2018

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1543979030

Digital Object Identifier
doi:10.1214/18-EJS1466

#### Citation

Fan, Jianqing; Liu, Han; Wang, Weichen; Zhu, Ziwei. Heterogeneity adjustment with applications to graphical model inference. Electron. J. Statist. 12 (2018), no. 2, 3908--3952. doi:10.1214/18-EJS1466. https://projecteuclid.org/euclid.ejs/1543979030
