Aluísio Pinheiro, Hildete P. Pinheiro, Samara Kiihl
An important parameter in the study of population evolution is θ = 4Nν, where N is the effective population size and ν is the rate of mutation per locus per generation. Therefore, θ represents the mean number of mutations per site per generation. There are many estimators of θ, one of them being the mean number of pairwise nucleotide differences, which we call T2. Other estimators are T1, based on the number of segregating sites, and T3, based on the number of singletons. The concept of selective neutrality can be interpreted as a differentiated nucleotide distribution for mutant sites when compared to the overall nucleotide distribution. Tajima (1989) proposed a test of selective neutrality, known as Tajima’s test, based on T2−T1. Its complex empirical behavior (Kiihl, 2005) motivates us to propose a test statistic based solely on T2. Using U-statistics theory, we prove asymptotic normality of this statistic under different assumptions on the number of sequences and the number of sites.
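As an illustration of the quantities named above, the following minimal sketch computes the mean number of pairwise nucleotide differences (T2, a degree-2 U-statistic whose kernel is the Hamming distance) and the number of segregating sites from a set of aligned sequences. The function names, the toy alignment, and the harmonic-number normalization used for the segregating-sites estimator are assumptions made for this example and are not taken from the paper; the paper's exact definition of T1 may differ.

```python
from itertools import combinations
from math import comb


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_differences(seqs):
    """T2-style estimate: average Hamming distance over all unordered pairs."""
    n = len(seqs)
    total = sum(hamming(a, b) for a, b in combinations(seqs, 2))
    return total / comb(n, 2)


def segregating_sites(seqs):
    """Number of alignment columns showing more than one nucleotide."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def segregating_sites_estimate(seqs):
    """Assumed T1-style estimate: segregating sites divided by the
    harmonic sum 1 + 1/2 + ... + 1/(n-1). This normalization is an
    assumption of this sketch, not a definition quoted from the paper."""
    n = len(seqs)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n


# Hypothetical toy alignment: four sequences, eight sites.
seqs = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]
t2 = mean_pairwise_differences(seqs)
t1 = segregating_sites_estimate(seqs)
print(t2, t1, t2 - t1)
```

Both values here are totals over the whole alignment; dividing by the number of sites would give per-site versions. The difference `t2 - t1` is the numerator on which a Tajima-type test is built, whereas a statistic based only on T2 uses the first function alone.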
References
[1] Arvesen, J. N. (1969). Jackknifing U-statistics. Ann. Math. Statist. 40 2076–2100. MathSciNet: MR264805
[2] Chakraborty, R. and Rao, C. R. (1991). Measurement of genetic variation for evolutionary studies. In Handbook of Statistics (C. R. Rao and R. Chakraborty, eds.) 8. North-Holland, Amsterdam.
[3] Feller, W. (1971). An Introduction to Probability Theory and Its Applications. II, 2nd. ed. Wiley, New York.
[4] Fitch, W. M. and Margoliash, E. (1967). A method for estimating the number of invariant amino acid coding positions in a gene using cytochrome c as a model case. Biochem. Genet. 1 65–71.
[5] Fu, Y.-X. (1994). A phylogenetic estimator of effective population size or mutation rate. Genetics 136 685–692.
[6] Fu, Y.-X. (1995). Statistical properties of segregating sites. Theoretical Population Biology 48 172–197.
[7] Fu, Y.-X. and Li, W. H. (1993). Statistical tests of neutrality of mutations. Genetics 133 693–709.
[8] Gini, C. W. (1912). Variabilita e mutabilita. Studi Economico-Giuridici della R. Universita di Cagliari 3 3–159.
[9] Hartl, D. and Clark, A. (1997). Principles of Population Genetics, 3rd ed. Sinauer Associates, Sunderland, Massachusetts.
[10] Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statistics 19 293–325. MathSciNet: MR26294
[11] Holmes, E. C. and Brown, A. J. (1992). Convergent and divergent sequence evolution in the surface envelope of glycoprotein of human immunodeficiency virus type 1 within a single infected patient. PNAS 89 4835–4839.
[12] Jukes, T. H. and Cantor, C. R. (1969). Evolution of protein molecules. In Mammalian Protein Metabolism III (H. N. Munro, ed.) 21–132. Academic Press, New York.
[13] Kiihl, S. F. (2005). Análise estatística de polimorfismo molecular em seqüências de DNA utilizando informações filogenéticas. Master’s thesis, Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica.
[14] Korolyuk, V. S. and Borovskikh, Yu. V. (1985). Approximation of nondegenerate U-statistics. J. Theory Probab. Appl. 30 417–426. MathSciNet: MR805294
[15] Pinheiro, A., Pinheiro, H. P. and Sen, P. K. (2005). The use of Hamming distance in bioinformatics. In Handbook of Statist. Bioinformatics. To appear.
[16] Pinheiro, A., Sen, P. K. and Pinheiro, H. P. (2006). Decomposability of high-dimensional diversity measures: Quasi U-statistics, martingales and nonstandard asymptotics. Submitted for publication.
[17] Pinheiro, H. P., Seillier-Moiseiwitsch, F., Sen, P. K. and Eron, Jr., J. (2000). Genomic sequences and quasi-multivariate CATANOVA. In Bioenvironmental and Public Health Statistics. Handbook of Statist. 18 713–746. North-Holland, Amsterdam.
[18] Pinheiro, H. P., Seillier-Moiseiwitsch, F. and Sen, P. K. (2001). Analysis of variance for Hamming distance applied to unbalanced designs. Technical Report 30/01, Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica, Brazil.
[19] Pinheiro, H. P., Pinheiro, A. and Sen, P. K. (2005). Comparison of genomic sequences using the Hamming distance. J. Statist. Plann. Inference 130 325–339.
[20] Rao, C. R. (1982a). Diversity and dissimilarity coefficients: A unified approach. Theoretical Population Biology 21 24–43.
[21] Rao, C. R. (1982b). Diversity: Its measurement, decomposition, apportionment and analysis. Sankhyā Ser. A 44 1–22. MathSciNet: MR753075
[22] Sen, P. K. (1960). On some convergence properties of U-statistics. Calcutta Statist. Assoc. Bull. 10 1–18. MathSciNet: MR119286
[23] Sen, P. K. (1999). Utility-oriented Simpson-type indexes and inequality measures. Calcutta Statist. Assoc. Bull. 49 1–22.
[24] Sen, P. K. (2001). Excursions in biostochastic: Biometry to biostatistics to bioinformatics. Lecture Notes, Academia Sinica Inst. Statist. Sci. Taipei, ROC.
[25] Sen, P. K. (2006). Robust statistical inference for high-dimensional data models with applications to genomics. Austrian J. Statist. Probab. 36 197–211.
[26] Sen, P. K., Tsai, M.-T. and Jou, Y.-S. (2005). High dimension low sample size perspectives in constrained statistical inference: The SARSCoV RNA genome in illustration. Submitted for publication.
[27] Simpson, E. H. (1949). The Measurement of Diversity. Nature 163 688.
[28] Souza, F. L., Cunha, A. F., Oliveira, M. A., Pereira, G. A. G. and Reis, S. F. (2003). Preliminary phylogeographic analysis of the neotropical freshwater turtle Hydromedusa maximiliani (Chelidae). J. Herpetology 37 199–205.
[29] Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123 585–595.
[30] Tihomirov, A. N. (1980). Convergence rate in the central limit theorem for weakly dependent random variables. J. Theory Probab. Appl. 25 800–818. MathSciNet: MR595140
[31] Utev, S. A. (1990). On the central limit theorem for ϕ-mixing arrays of random variables. J. Theory Probab. Appl. 35 131–139.
[32] Uzzell, T. and Corbin, K. W. (1971). Fitting discrete probability distributions to evolutionary events. Science 172 1089–1096.
[33] Withers, C. S. (1981). Central limit theorems for dependent variables. I. Z. Wahrsch. Verw. Gebiete 57 509–534. MathSciNet: MR631374
[34] Yang, Z. (1996). Among-site variation and its impact on phylogenetic analyses. TREE 11 367–372.
[35] Yoshihara, K.-I. (1984). The Berry-Esseen theorems for U-statistics generated by absolutely regular processes. Yokohama Math. J. 32 89–111. MathSciNet: MR772908