Electronic Journal of Statistics
- Electron. J. Statist.
- Volume 12, Number 2 (2018), 3908-3952.
Heterogeneity adjustment with applications to graphical model inference
Jianqing Fan, Han Liu, Weichen Wang, and Ziwei Zhu
Full-text: Open access
Abstract
Heterogeneity is an unwanted variation when analyzing aggregated datasets from multiple sources. Though different methods have been proposed for heterogeneity adjustment, no systematic theory exists to justify these methods. In this work, we propose a generic framework named ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) to model, estimate, and adjust heterogeneity from the original data. Once the heterogeneity is adjusted, we are able to remove the batch effects and to enhance the inferential power by aggregating the homogeneous residuals from multiple sources. Under a pervasive assumption that the latent heterogeneity factors simultaneously affect a fraction of observed variables, we provide a rigorous theory to justify the proposed framework. Our framework also allows the incorporation of informative covariates and appeals to the ‘Blessing of Dimensionality’. As an illustrative application of this generic framework, we consider the problem of estimating a high-dimensional precision matrix for graphical model inference based on multiple datasets. We also provide thorough numerical studies on both synthetic datasets and a brain imaging dataset to demonstrate the efficacy of the developed theory and methods.
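The adjust-then-aggregate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' ALPHA procedure: it assumes plain PCA on each batch, a known number of heterogeneity factors `k`, and simulated Gaussian data. The function names (`remove_top_factors`, `adjust_and_pool`, `simulate`) and all simulation parameters are hypothetical choices made for illustration only.

```python
import numpy as np


def remove_top_factors(X, k):
    """Center one (n x p) batch and subtract its rank-k PCA approximation.

    Assumption: the batch-specific heterogeneity is captured by the
    top-k principal components, with k known in advance.
    """
    Xc = X - X.mean(axis=0)  # remove the batch-specific mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k]  # estimated low-rank component
    return Xc - low_rank  # residuals after adjustment


def adjust_and_pool(batches, k):
    """Adjust every batch separately, then stack the residuals."""
    return np.vstack([remove_top_factors(X, k) for X in batches])


def simulate(seed=0, p=30, n=200, n_batches=3, k=2):
    """Hypothetical data: shared unit-variance noise plus batch-specific
    mean shifts and strong (pervasive) latent factors."""
    rng = np.random.default_rng(seed)
    batches = []
    for _ in range(n_batches):
        loadings = 3.0 * rng.normal(size=(p, k))
        factors = rng.normal(size=(n, k))
        shift = 5.0 * rng.normal(size=p)
        noise = rng.normal(size=(n, p))
        batches.append(shift + factors @ loadings.T + noise)
    return batches


batches = simulate()
raw = np.vstack(batches)                # naive pooling, no adjustment
pooled = adjust_and_pool(batches, k=2)  # pooling after adjustment

# Largest eigenvalue of the sample covariance, before and after adjustment.
raw_top = np.linalg.eigvalsh(np.cov(raw, rowvar=False))[-1]
adj_top = np.linalg.eigvalsh(np.cov(pooled, rowvar=False))[-1]
print(f"top eigenvalue, raw pooled data:      {raw_top:.2f}")
print(f"top eigenvalue, adjusted pooled data: {adj_top:.2f}")
```

In this toy setting the pooled residuals could then be passed to a sparse precision matrix estimator of the reader's choice. Choosing `k` from the data, and using covariates as the abstract mentions, are beyond this sketch.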
Article information
Source
Electron. J. Statist., Volume 12, Number 2 (2018), 3908-3952.
Dates
Received: September 2017
First available in Project Euclid: 5 December 2018
Permanent link to this document
https://projecteuclid.org/euclid.ejs/1543979030
Digital Object Identifier
doi:10.1214/18-EJS1466
Zentralblatt MATH identifier
07003233
Keywords
Multiple sourcing; batch effect; semiparametric factor model; principal component analysis; brain image network
Rights
Creative Commons Attribution 4.0 International License.
Citation
Fan, Jianqing; Liu, Han; Wang, Weichen; Zhu, Ziwei. Heterogeneity adjustment with applications to graphical model inference. Electron. J. Statist. 12 (2018), no. 2, 3908--3952. doi:10.1214/18-EJS1466. https://projecteuclid.org/euclid.ejs/1543979030
The Institute of Mathematical Statistics and the Bernoulli Society
