Open Access
2012 Efficient distribution estimation for data with unobserved sub-population identifiers
Yanyuan Ma, Yuanjia Wang
Electron. J. Statist. 6: 710-737 (2012). DOI: 10.1214/12-EJS690

Abstract

We study efficient nonparametric estimation of distribution functions of several scientifically meaningful sub-populations from data consisting of mixed samples where the sub-population identifiers are missing. Only the probabilities of each observation belonging to a sub-population are available. The problem arises in several biomedical studies, such as quantitative trait locus (QTL) analysis and genetic studies with ungenotyped relatives, where the scientific interest lies in estimating the cumulative distribution function of a trait given a specific genotype. In these studies, however, subjects’ genotypes may not be directly observed, so the distribution of the trait outcome is a mixture of several genotype-specific distributions. We characterize the complete class of consistent estimators, which includes members such as one type of nonparametric maximum likelihood estimator (NPMLE) and least squares or weighted least squares estimators. We identify the efficient estimator in the class that reaches the semiparametric efficiency bound, and we implement it using a simple procedure that remains consistent even if several components of the estimator are misspecified. In addition, our close inspection of two commonly used NPMLEs in these problems yields the surprising result that one form of the NPMLE is highly inefficient, while the other is inconsistent. We provide simulation procedures to illustrate the theoretical results and demonstrate the proposed methods through two real data examples.
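To make the data structure concrete, the following is a minimal sketch of an ordinary least squares estimator of the kind the abstract lists as a member of the consistent class. It is not the paper's efficient estimator. It relies only on the mixture structure described above: if q_i is the known vector of membership probabilities for observation i, then E[1(Y_i <= t) | q_i] is the q_i-weighted sum of the sub-population distribution functions at t, so regressing the indicators on the q_i recovers them. The function names, the two-component normal simulation, and the choice of membership probabilities are all illustrative assumptions and do not come from the paper.

```python
import random


def solve(a, b):
    """Solve the small square system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_cdf(y, q, t):
    """OLS estimate of (F_1(t), ..., F_p(t)).

    y: observed outcomes; q: one list of p membership probabilities per
    observation. Solves the normal equations (Q'Q) F = Q' 1(y <= t).
    """
    p = len(q[0])
    ind = [1.0 if yi <= t else 0.0 for yi in y]
    qtq = [[sum(qi[j] * qi[k] for qi in q) for k in range(p)] for j in range(p)]
    qty = [sum(qi[j] * di for qi, di in zip(q, ind)) for j in range(p)]
    return solve(qtq, qty)


def simulate(n, seed=0):
    """Hypothetical mixed sample: two sub-populations, N(0, 1) and N(2, 1)."""
    rng = random.Random(seed)
    y, q = [], []
    for _ in range(n):
        q1 = rng.choice([0.1, 0.5, 0.9])         # known membership probability
        label = 0 if rng.random() < q1 else 1    # true identifier, never used below
        y.append(rng.gauss(0.0 if label == 0 else 2.0, 1.0))
        q.append([q1, 1.0 - q1])
    return y, q


y, q = simulate(20000)
f1_hat, f2_hat = ols_cdf(y, q, t=1.0)
print(f1_hat, f2_hat)
```

In this simulated setting the two targets are the standard normal distribution function evaluated at 1 and at -1, roughly 0.84 and 0.16, and the estimates should land close to them. Because the estimate is a linear combination of indicators, it is not guaranteed to lie in [0, 1] or to be monotone in t.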

Citation

Yanyuan Ma, Yuanjia Wang. "Efficient distribution estimation for data with unobserved sub-population identifiers." Electron. J. Statist. 6: 710-737, 2012. https://doi.org/10.1214/12-EJS690

Information

Published: 2012
First available in Project Euclid: 3 May 2012

zbMATH: 1274.62250
MathSciNet: MR2988426
Digital Object Identifier: 10.1214/12-EJS690

Subjects:
Primary: 62G05, 62G20
Secondary: 62G99

Keywords: finite mixed samples, nonparametric maximum likelihood estimator (NPMLE), robustness, semiparametric efficiency

Rights: Copyright © 2012 The Institute of Mathematical Statistics and the Bernoulli Society
