Electronic Journal of Statistics

Efficient distribution estimation for data with unobserved sub-population identifiers

Yanyuan Ma and Yuanjia Wang

Full-text: Open access


We study efficient nonparametric estimation of the distribution functions of several scientifically meaningful sub-populations from mixed samples in which the sub-population identifiers are missing. Only the probability that each observation belongs to each sub-population is available. The problem arises in several biomedical settings, such as quantitative trait locus (QTL) analysis and genetic studies with ungenotyped relatives, where the scientific interest lies in estimating the cumulative distribution function of a trait given a specific genotype. In these studies, however, subjects’ genotypes may not be directly observed, so the distribution of the trait outcome is a mixture of several genotype-specific distributions. We characterize the complete class of consistent estimators, which includes one type of nonparametric maximum likelihood estimator (NPMLE) as well as least squares and weighted least squares estimators. We identify the efficient estimator in this class, which reaches the semiparametric efficiency bound, and we implement it using a simple procedure that remains consistent even if several components of the estimator are misspecified. In addition, close inspection of two NPMLEs commonly used in these problems yields a surprising result: one form of the NPMLE is highly inefficient, while the other is inconsistent. We provide simulation studies to illustrate the theoretical results and demonstrate the proposed methods on two real data examples.
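To make the setup concrete, the following is a minimal sketch of the ordinary least squares member of the class of consistent estimators mentioned above, not the efficient estimator proposed in the paper. It assumes that, given the membership probabilities q_i, the mixture distribution satisfies P(Y_i <= t | q_i) = sum_j q_ij F_j(t), so that regressing the indicator 1{Y_i <= t} on q_i recovers F_j(t). The function names, the two-component normal design, and the three hypothetical probability vectors are illustrative assumptions and do not come from the article.

```python
import numpy as np


def ols_subpop_cdf(y, q, grid):
    """Least squares estimate of sub-population CDFs on a grid.

    y    : (n,) observed outcomes
    q    : (n, p) known membership probabilities, rows summing to 1
    grid : (m,) points t at which the CDFs are evaluated
    Returns an (m, p) array whose [k, j] entry estimates F_j(grid[k]).
    """
    y = np.asarray(y, dtype=float)
    q = np.asarray(q, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Indicator responses 1{y_i <= t_k}, one column per grid point.
    indicators = (y[:, None] <= grid[None, :]).astype(float)  # (n, m)
    # Solve min_F ||indicators - q F||^2 for all grid points at once.
    coef, *_ = np.linalg.lstsq(q, indicators, rcond=None)  # (p, m)
    return coef.T


def simulate(n=20000, seed=0):
    """Hypothetical mixed sample: N(0, 1) versus N(2, 1) sub-populations."""
    rng = np.random.default_rng(seed)
    # Assumed design: each subject carries one of three probability vectors.
    designs = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
    q = designs[rng.integers(0, 3, size=n)]
    # Latent labels are drawn from q and then discarded, as in the real problem.
    in_second = rng.random(n) < q[:, 1]
    y = np.where(in_second, rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n))
    return y, q


if __name__ == "__main__":
    y, q = simulate()
    estimates = ols_subpop_cdf(y, q, grid=[0.0, 2.0])
    print(estimates)
```

Because the regression is unconstrained, the pointwise estimates need not be monotone in t or lie in [0, 1]; a practical implementation would typically add a post-processing step to enforce these properties.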

Article information

Electron. J. Statist., Volume 6 (2012), 710-737.

First available in Project Euclid: 3 May 2012

Primary: 62G05: Estimation; 62G20: Asymptotic properties
Secondary: 62G99: None of the above, but in this section

Keywords: Finite mixed samples; robustness; semiparametric efficiency; nonparametric maximum likelihood estimator (NPMLE)


Ma, Yanyuan; Wang, Yuanjia. Efficient distribution estimation for data with unobserved sub-population identifiers. Electron. J. Statist. 6 (2012), 710-737. doi:10.1214/12-EJS690. https://projecteuclid.org/euclid.ejs/1336049812

