## The Annals of Applied Statistics

### A method for generating realistic correlation matrices

#### Abstract

Simulating sample correlation matrices is important in many areas of statistics. Approaches such as generating Gaussian data and finding their sample correlation matrix or generating random uniform $[-1,1]$ deviates as pairwise correlations both have drawbacks. We develop an algorithm for adding noise, in a highly controlled manner, to general correlation matrices. In many instances, our method yields results which are superior to those obtained by simply simulating Gaussian data. Moreover, we demonstrate how our general algorithm can be tailored to a number of different correlation models. Using our results with a few different applications, we show that simulating correlation matrices can help assess statistical methodology.
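The two baseline approaches named in the abstract can be sketched as follows. This is only an illustration of those baselines, not the algorithm developed in the paper; the function names, matrix dimension, sample size, and seed are assumptions chosen for the example. It builds a sample correlation matrix from simulated Gaussian data, builds a symmetric matrix whose off-diagonal entries are uniform on $[-1,1]$, and reports the smallest eigenvalue of each.

```python
import numpy as np


def sample_correlation_from_gaussian(n_obs, dim, rng):
    """Simulate independent standard Gaussian data and return its sample correlation matrix."""
    data = rng.standard_normal((n_obs, dim))
    # rowvar=False: columns are variables, rows are observations
    return np.corrcoef(data, rowvar=False)


def uniform_pairwise_matrix(dim, rng):
    """Fill the off-diagonal entries with uniform [-1, 1] draws and put ones on the diagonal."""
    upper = np.triu(rng.uniform(-1.0, 1.0, size=(dim, dim)), k=1)
    return upper + upper.T + np.eye(dim)


def min_eigenvalue(matrix):
    """Smallest eigenvalue of a symmetric matrix; a negative value means it is not a valid correlation matrix."""
    return float(np.linalg.eigvalsh(matrix).min())


rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
gaussian_corr = sample_correlation_from_gaussian(n_obs=200, dim=10, rng=rng)
uniform_mat = uniform_pairwise_matrix(dim=10, rng=rng)

print("Gaussian sample correlation, min eigenvalue:", min_eigenvalue(gaussian_corr))
print("Uniform pairwise matrix, min eigenvalue:", min_eigenvalue(uniform_mat))
```

The Gaussian-based matrix is always positive semidefinite, but its structure is limited to whatever population model generated the data. The uniform matrix will typically have a negative smallest eigenvalue once the dimension is more than a handful, so it is usually not a correlation matrix at all, which is one of the drawbacks the abstract alludes to.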

#### Article information

Source
Ann. Appl. Stat., Volume 7, Number 3 (2013), 1733-1762.

Dates
First available in Project Euclid: 3 October 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1380804814

Digital Object Identifier
doi:10.1214/13-AOAS638

Mathematical Reviews number (MathSciNet)
MR3127966

Zentralblatt MATH identifier
06237195

#### Citation

Hardin, Johanna; Garcia, Stephan Ramon; Golan, David. A method for generating realistic correlation matrices. Ann. Appl. Stat. 7 (2013), no. 3, 1733–1762. doi:10.1214/13-AOAS638. https://projecteuclid.org/euclid.aoas/1380804814
