Statistical Science

Quantifying the Fraction of Missing Information for Hypothesis Testing in Statistical and Genetic Studies

Dan L. Nicolae, Xiao-Li Meng, and Augustine Kong


Abstract

Many practical studies rely on hypothesis testing procedures applied to data sets with missing information. An important part of the analysis is to determine the impact of the missing data on the performance of the test, and this can be done by properly quantifying the amount of available information relative to the complete data. The problem is directly motivated by applications to studies, such as linkage analyses and haplotype-based association projects, designed to identify genetic contributions to complex diseases. In genetic studies, relative information measures are needed for experimental design, technology comparison, interpretation of the data, and understanding the behavior of some of the inference tools. The central difficulties in constructing such information measures arise from the multiple, and sometimes conflicting, aims in practice. For large samples, we show that a satisfactory, likelihood-based general solution exists by using appropriate forms of the relative Kullback–Leibler information, and that the proposed measures are computationally inexpensive given the maximized likelihoods with the observed data. Two measures are introduced, under the null and alternative hypotheses, respectively. We illustrate the measures on data from mapping studies of inflammatory bowel disease and diabetes. For small-sample problems, which appear rather frequently in practice and sometimes in disguised forms (e.g., measuring individual contributions to a large study), the robust Bayesian approach holds great promise, though the choice of a general-purpose “default prior” is a very challenging problem. We also report several intriguing connections encountered in our investigation, such as the connection with the fundamental identity for the EM algorithm, the connection with the second CR (Chapman–Robbins) lower information bound, the connection with entropy, and connections between likelihood ratios and Bayes factors. We hope that these seemingly unrelated connections, as well as our specific proposals, will stimulate general discussion and research in this theoretically fascinating and practically needed area.

Article information

Source
Statist. Sci. Volume 23, Number 3 (2008), 287-312.

Dates
First available in Project Euclid: 28 January 2009

Permanent link to this document
https://projecteuclid.org/euclid.ss/1233153057

Digital Object Identifier
doi:10.1214/07-STS244

Mathematical Reviews number (MathSciNet)
MR2483902

Zentralblatt MATH identifier
1329.62092

Keywords
EM algorithm; entropy; Fisher information; genetic linkage studies; haplotype-based association studies; noninformative prior; Kullback–Leibler information; relative information; Cox regression; partial likelihood

Citation

Nicolae, Dan L.; Meng, Xiao-Li; Kong, Augustine. Quantifying the Fraction of Missing Information for Hypothesis Testing in Statistical and Genetic Studies. Statist. Sci. 23 (2008), no. 3, 287–312. doi:10.1214/07-STS244. https://projecteuclid.org/euclid.ss/1233153057


