
We study efficient nonparametric estimation of distribution functions of several scientifically meaningful sub-populations from data consisting of mixed samples where the sub-population identifiers are missing. Only probabilities of each observation belonging to a sub-population are available. The problem arises from several biomedical studies such as quantitative trait locus (QTL) analysis and genetic studies with ungenotyped relatives where the scientific interest lies in estimating the cumulative distribution function of a trait given a specific genotype. However, in these studies subjects' genotypes may not be directly observed. The distribution of the trait outcome is therefore a mixture of several genotype-specific distributions. We characterize the complete class of consistent estimators which includes members such as one type of nonparametric maximum likelihood estimator (NPMLE) and least squares or weighted least squares estimators. We identify the efficient estimator in the class that reaches the semiparametric efficiency bound, and we implement it using a simple procedure that remains consistent even if several components of the estimator are mis-specified. In addition, our close inspections on two commonly used NPMLEs in these problems show the surprising results that the NPMLE in one form is highly inefficient, while in the other form is inconsistent. We provide simulation procedures to illustrate the theoretical results and demonstrate the proposed methods through two real data examples.


Introduction
In many scientific studies, data arise from a mixture of scientifically meaningful distributions. For example, in a quantitative trait locus (QTL) study, the goal is to identify, map and estimate the effect of a QTL predisposing the trait. However, the genomic location of the QTL is unknown, so subjects' genotypes at the QTL are not observed. Mixture models are widely used to map QTLs using molecular markers with known locations, such as single nucleotide polymorphisms (SNPs) or microsatellite markers; see Lander and Botstein (1989) and Wu et al. (2007).
Another example where a mixture model is useful is genetic studies in which genotypes of relatives of an initial sample (probands) are not collected (Marder et al., 2003; Wang et al., 2008). In these studies, the scientific interest is in estimating the conditional distribution of a trait given a genotype (or penetrance, Khoury et al., 1993). Genotype information in the initial sample of probands is collected. However, because of the high cost of administering in-person interviews to relatives, their genotype information is commonly not collected. For example, in Wacholder et al. (1998) and Wang et al. (2007, 2008), only the probands were genotyped, and none of the first-degree relatives of the probands was genotyped. The distribution of possible genotypes of a relative, however, can easily be obtained given the relationship between the relative and the proband and the genotype of the proband. The relatives' disease history or trait information is usually obtained by administering a systematic and reliable phone interview (Marder et al., 2003). The distribution of the trait in a relative is then a mixture of conditional distributions of the trait given the relative's genotype, and these relatives form the main analysis sample.
A concrete example of such genetic studies is an investigation of the association between the APOE gene and LDL concentrations in young children (Shea et al., 1999). There are three common alleles at the APOE locus (ε2, ε3, ε4). The APOE ε3 is the most prevalent allele in the general population, with frequency 75% to 80%. Previous studies have suggested that the APOE ε4 allele may be associated with higher LDL cholesterol levels in adults (Davignon et al., 1988). Of interest is the association between the APOE ε4 allele and the LDL cholesterol distribution in children.
Subjects included in the study were recruited from a cross-sectional biomarker study of children conducted from 1994 to 1998 (Shea et al., 1999). Proband children were recruited from lists of cardiac patients generated through the Presbyterian Hospital Clinical Information System, private cardiology practices, lipid clinics and pediatric practices. Families with at least one healthy child 4 to 25 years of age were eligible for participation. Siblings of proband children were recruited to the study. The availability of the APOE genotype information of the probands, together with the sibling relationship, enables the calculation of each sibling's probability of carrying the ε4 allele. The cumulative distribution functions of LDL concentration for carriers of the ε4 allele (carrying one or two copies of ε4) and for non-carriers (carrying no copies of ε4) are of primary interest in this study.
Traditional statistical analysis of mixture data specifies a parametric form for the conditional distribution of an outcome given group membership (e.g., Gaussian mixture model, Wu et al., 2007) and estimates the mixture probabilities and the parameters of the conditional distribution by maximum likelihood through an EM algorithm (McLachlan and Peel, 2000). In this work, we provide nonparametric estimation in the sense that we make no distributional assumption on the conditional distributions. One common feature of the two examples introduced above is that the mixture probabilities are known or easily calculated without using the outcome data, and the mixture populations are scientifically meaningful (e.g., subjects carrying a certain genotype). Treating these mixture probabilities as random variables, each observation in the data consists of a vector of mixture probabilities and a continuous outcome, and the observations are assumed to be independent and identically distributed (i.i.d.).
To fix ideas, let $Q$ denote a $p$-dimensional vector of random mixture probabilities, and let $p_Q$ denote the probability mass function of $Q$, which has finite support $u_1, \ldots, u_m$. Let $S$ denote a random outcome, let $L$ denote the unobserved group membership (or genotype), and let $f(s)$ denote the $p$-dimensional vector of conditional densities of $S$ given $L$. For simplicity, we assume that $f(s)$ is supported on a compact interval, say $[T_1, T_2]$. For the $i$th subject, $i = 1, \ldots, n$, we observe $(q_i, s_i)$, where the joint density of $(Q, S)$ at $Q = q_i$ and $S = s_i$ is

$$p_Q(q_i)\, q_i^T f(s_i). \qquad (1)$$

Here $f(s)$ is a length-$p$ vector, where the $j$th component $f_j(s)$ represents the conditional probability density function (PDF) of $s$ given that it belongs to the $j$th genotype group, $j = 1, \ldots, p$. Each component $f_j(s)$ is the PDF of a trait at time $t$ given that the gene mutation status in a relative is of the $j$th kind (for example, $j = 1$ denotes carriers and $j = 2$ denotes non-carriers), or the PDF of a quantitative trait given that the QTL genotype is of the $j$th kind. Let $F(\cdot)$ denote the corresponding $p$-dimensional cumulative distribution function (CDF) of $f(\cdot)$. Our interest is in estimating $F$ at any fixed time $t$. The vector $q_i$ represents the probabilities that a relative carries a certain genotype given the proband's genotype, or the probabilities of a subject having a certain QTL genotype given the flanking markers. Obviously $\sum_{j=1}^{p} q_{ij} = 1$. The distribution of $q_i$ (i.e., $p_Q$) depends on the study design and can easily be estimated consistently from the empirical distribution of the $q_i$. For example, for a backcross QTL experiment, $q_i$ takes four different values depending on the marker genotype frequencies (e.g., Table 10.3 of Wu et al., 2007). The vector of density functions $f$ is completely unspecified; thus $f$ is an infinite-dimensional nuisance parameter of length $p$.
Here, we characterize the complete class of consistent estimators, which includes those of Fine et al. (2004) and Chatterjee and Wacholder (2001). We show that any weighted least squares estimator is a member of this class and hence is consistent. In addition, we construct a special subclass that attains the minimum estimation variance and reaches the semiparametric efficiency bound. We inspect two types of widely used NPMLEs and report the surprising finding that they are either inefficient or even inconsistent. Although the second type of NPMLE is commonly applied in clinical studies (Sigurdson et al., 2004; Hauptmann et al., 2003; Webb et al., 2006a,b; Hartge et al., 2002), its inconsistency has not previously been reported in the literature.
The remainder of the paper is organized as follows. In Section 2, weighted least squares estimators are introduced and a complete class of consistent estimators encompassing the least squares estimators is defined. The optimal member of the class is identified and shown to reach the semiparametric efficiency bound. In Section 3, an algorithm to implement the efficient estimator is developed and asymptotic properties of the estimator are proved. In Section 4, two types of commonly used NPMLEs are investigated; one type is found to be inefficient while the other is inconsistent. In Section 5, simulation experiments are conducted to investigate the finite sample performance of the developed methods, and several estimators, including the efficient estimator, the least squares estimators and the NPMLEs, are compared. In Section 6, the proposed methods are used to analyze two data examples, one from a genetic linkage study of rice plant height and the other from a study of the association between plasma low-density lipoprotein (LDL) cholesterol level and the apolipoprotein-E (APOE) gene. In Section 7, possible extensions of the proposed methods are discussed.

A class of weighted least squares estimators
Although the traditional approach to estimating $F(t)$ is maximum likelihood for a parametric model or the NPMLE for a nonparametric model, a very simple weighted estimator can be used if we formulate the same problem from a different angle. Observe that the model in (1) implies $q^T F(t) = E\{I(S \le t) \mid q\}$, where $I(\cdot)$ denotes an indicator function. Therefore, viewing the $q_i$'s as covariates and $y_i = I(s_i \le t)$ as response variables, the covariates and the responses are linked by $F(t)$ via the familiar linear regression model

$$y_i = q_i^T F(t) + e_i,$$

where $E(e_i \mid q_i) = 0$, $i = 1, \ldots, n$. It is straightforward that the $e_i$'s are independent conditional on the $q_i$'s and have variances $v_i = q_i^T F(t)\{1 - q_i^T F(t)\}$. Thus, weighted least squares (WLS) can be used to estimate $F(t)$. Denote by $M$ an arbitrary $n \times n$ diagonal matrix. Let $A = (q_1, \ldots, q_n)^T \in R^{n \times p}$, $Y = (y_1, \ldots, y_n)^T \in R^n$, and $e = (e_1, \ldots, e_n)^T \in R^n$. Then we obtain the general WLS estimator

$$\hat F(t) = (A^T M A)^{-1} A^T M Y.$$

The simplest estimator is the ordinary least squares (OLS) estimator, where we set $M = I_n$; it was also derived in Fine et al. (2004) using a different formulation. The most efficient WLS estimator is obtained when $M$ is the diagonal matrix whose $i$th diagonal entry equals $v_i^{-1}$. A standard iteratively re-weighted estimation procedure can be used to obtain this optimal WLS (OWLS) estimator. The presence of the matrix $M$ also allows the flexibility to derive other WLS estimators with desired properties such as robustness.
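As a minimal sketch of the WLS family described above (not the authors' implementation), the Python code below computes $(A^T M A)^{-1} A^T M Y$ with $y_i = I(s_i \le t)$ and iterates the weights $v_i^{-1}$ to approximate the OWLS estimator. The function names, the clipping constant `eps`, the fixed number of iterations and the toy two-component uniform mixture are assumptions made purely for illustration.

```python
import numpy as np

def wls_cdf(q, s, t, weights=None):
    """WLS estimate (A^T M A)^{-1} A^T M Y of F(t), with y_i = I(s_i <= t)."""
    A = np.asarray(q, dtype=float)                    # n x p mixture probabilities
    y = (np.asarray(s, dtype=float) <= t).astype(float)
    w = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)
    AtM = A.T * w                                     # A^T M for a diagonal M
    return np.linalg.solve(AtM @ A, AtM @ y)

def owls_cdf(q, s, t, n_iter=20, eps=1e-3):
    """Iteratively re-weighted WLS with weights 1 / [q_i^T F {1 - q_i^T F}]."""
    A = np.asarray(q, dtype=float)
    F = wls_cdf(A, s, t)                              # start from OLS (M = I_n)
    for _ in range(n_iter):
        mu = np.clip(A @ F, eps, 1.0 - eps)           # keep the variances positive
        F = wls_cdf(A, s, t, weights=1.0 / (mu * (1.0 - mu)))
    return F

# Toy data (illustrative assumption): component 1 is Uniform(0, 1) and
# component 2 is Uniform(0, 2), so the true value is F(0.5) = (0.5, 0.25).
rng = np.random.default_rng(0)
n = 20000
support = np.array([[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]])
q = support[rng.integers(len(support), size=n)]
from_first = rng.random(n) < q[:, 0]                  # unobserved membership
s = np.where(from_first, rng.uniform(0, 1, n), rng.uniform(0, 2, n))

F_ols = wls_cdf(q, s, t=0.5)
F_owls = owls_cdf(q, s, t=0.5)
```

Passing `weights=None` reproduces the OLS choice $M = I_n$; any other positive weight vector gives another member of the WLS family.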

The complete class of consistent estimators
Although simple to derive and easy to implement, it is unclear whether the class of WLS is complete and whether OWLS is the optimal estimator among all consistent estimators of F (t).To answer these questions and to provide easy variance estimation for any consistent estimator, we perform a formal semiparametric analysis to characterize the complete class of consistent estimators.We derive in Appendix A.1 that the family of all influence functions is where I p is a p-dimensional identity matrix, C is an arbitrary p × p constant matrix, and 1 p is a p-dimensional vector with all elements being one.
For any qualified b-function as described in S IF , an estimator for F (t) is where we use C b = b(q, s)q T p Q (q)dµ(q) − I(s ≤ t)I p to denote the constant matrix corresponding to this b-function.For example, a convenient choice of b(q, s) is where h 1 (q, s), h 2 (q, s), and h 3 (q) can be arbitrary functions in R p such that h 1 (q, s)q T p Q (q)dµ(q) and h 2 (q, s)q T p Q (q)dµ(q) are invertible, and B is an arbitrary constant matrix.This characterization provides a simple construction of a very rich class of estimators.
Since $S_{IF}$ contains all the influence functions, any regular asymptotically linear (RAL, Newey, 1990) estimator can be written in the form of (3). For example, we show in Appendix A.2 that the influence function of any WLS estimator is

$$\phi_{WLS}(q, s) = \{E(W Q Q^T)\}^{-1} w q \{I(s \le t) - q^T F(t)\}.$$

Here, $w$ is a weight variable. For the $i$th individual, $w = w_i$ is the $i$th diagonal entry of $M$. We use $W$ to denote the weight variable when it is considered as a random variable. It is easy to see that this corresponds to choosing $h_1 = wq$, $h_2 = 0$, and $h_3 = -\{E(W Q Q^T)\}^{-1} w q q^T F(t) + F(t)$; hence any WLS estimator is indeed a member of $S_{IF}$. In addition, comparing the forms of $\phi_{WLS}$ and $S_{IF}$ indicates that the WLS estimators are only a subset of the consistent estimators that can be constructed. To further study whether the optimal WLS estimator is the most efficient among all consistent estimators of $F(t)$, we need to derive the efficient influence function.

The semiparametric efficient estimator
Projecting an arbitrary influence function φ onto the tangent space Λ T yields an efficient influence function (Newey, 1990).In Appendix A.3, we derive the form of Λ T and its orthogonal complement, which enables us to derive the following theorem.
Theorem 1.The efficient influence function is where and .
The proof of Theorem 1 is in Appendix A.4.
It is straightforward to see that the construction of the efficient estimator requires correct specification of the nuisance parameter f (s), which is not always easy to obtain.If we unknowingly mis-specify f (s) as f * (s) and follow the same construction in Theorem 1 to obtain φ * ef f , then the result is no longer a valid influence function.To see this, note that φ = {I(s≤t)−K * }A * −1 (s)q , where We thus robustify the influence function by constructing Regardless of the form of f * , (5) always yields a valid influence function.In addition, φ = φ ef f when f * (s) = f 0 (s) and φ can be used to estimate F (t) via Remark 1.In (6), we can replace K * by an arbitrary constant matrix.The resulting estimator remains consistent, and the corresponding φ is still a valid influence function.However, since different K * corresponds to different influence function, the estimators have different variances.
In practice, since $f(s)$ is usually either posited or estimated and may therefore differ from $f_0(s)$, it is always safer to use (6) to obtain $\hat F(t)$. We will show in Section 3 that as long as $f(s)$ is consistently estimated, the estimator (6) is guaranteed to be efficient for $F(t)$.

Analytic comparison between OWLS and the efficient estimator
We are now ready to assess whether the OWLS is efficient. Comparing $\phi_{eff}$ with $\phi_{OWLS}$ obtained in Appendix A.2, we find that although the OWLS is optimal among the WLS family, it does not reach the semiparametric efficiency bound. We prove this claim by contradiction. Suppose that the OWLS is efficient. Then we would have $\phi_{eff} = \phi_{OWLS} + o_p(1)$, which, for all $(q, s)$ pairs, leads to $q^T F(t) A^{-1}(s) q = K A^{-1}(s) q$. The left-hand side is a quadratic function of $q$, while the right-hand side is linear, so the above equality can never hold, since $q$ cannot be a constant vector of zeros.

Efficient estimator and its asymptotic properties
As we have pointed out, the efficient influence function derived in Theorem 1 involves the unknown nuisance parameter $f(s)$ and therefore cannot be directly used to construct an efficient estimator for $F(t)$. Using (6) will provide a robust and locally efficient estimator, in the sense that if $f^*(s) = f_0(s)$, the estimator is indeed efficient; otherwise, the estimator is still guaranteed to be consistent. We now propose a method to construct an estimator that is always efficient. This method avoids estimating the $p$-dimensional PDF $f(s)$ directly, and is simple to implement.

Algorithm for implementing the efficient estimator
We propose to use the following procedure to construct the efficient estimator.
1. Randomly split the data into two sets. The second set has size $n_2 = n^{5/6}$, and the first set has size $n_1 = n - n_2$. Assume that the first set contains $(q_1, s_1), \ldots, (q_{n_1}, s_{n_1})$ and the second set contains $(q_{n_1+1}, s_{n_1+1}), \ldots, (q_n, s_n)$.
2. Obtain the empirical estimator of $q^T f(s)$, denoted $q^T \hat f(s)$, from the second set of size $n_2$. Recall that the random vector $Q$ can take $m$ different vector values $u_1, \ldots, u_m$, so for each $k = 1, \ldots, m$, we can calculate a kernel estimate of $u_k^T f(s)$ from the observations in the second set with $q_i = u_k$.
, where E Q stands for expectation with respect to Q.We construct A −1 (s; q T f )ds using numerical integration, and form and let the estimator be The estimation procedure described above is straightforward to implement.
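The first two steps of the procedure can be sketched as follows. The exact kernel formula is not reproduced in the text above, so this sketch assumes a Gaussian kernel with a user-supplied bandwidth; the function names `split_sample` and `group_kernel_density` and the small demonstration data are hypothetical and serve only to show the mechanics of estimating $u_k^T f(s)$ within each group of identical $q$ values.

```python
import numpy as np

def split_sample(n, rng):
    """Randomly split indices 0..n-1; the second set has about n^(5/6) points."""
    n2 = int(round(n ** (5.0 / 6.0)))
    perm = rng.permutation(n)
    return perm[: n - n2], perm[n - n2:]

def group_kernel_density(q, s, bandwidth):
    """Gaussian kernel estimate of u_k^T f(.) within each distinct q value."""
    q = np.asarray(q, dtype=float)
    s = np.asarray(s, dtype=float)
    support, labels = np.unique(q, axis=0, return_inverse=True)
    labels = labels.ravel()                       # one group label per observation

    def density(k, x):
        sk = s[labels == k]                       # outcomes with q_i = u_k
        z = (np.atleast_1d(x)[:, None] - sk[None, :]) / bandwidth
        kern = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
        return kern.sum(axis=1) / (len(sk) * bandwidth)

    return support, density

# Small demonstration with two distinct q values (assumed data).
q_demo = np.array([[0.2, 0.8]] * 3 + [[0.7, 0.3]] * 2)
s_demo = np.array([1.0, 2.0, 3.0, 5.0, 6.0])
support_demo, density_demo = group_kernel_density(q_demo, s_demo, bandwidth=0.5)

grid = np.linspace(-10.0, 20.0, 3001)
mass = density_demo(0, grid).sum() * (grid[1] - grid[0])  # Riemann sum of the estimate
first_idx, second_idx = split_sample(64, np.random.default_rng(1))
```

In an actual analysis the density would be fitted on the second subsample only, i.e. on `q[second_idx]` and `s[second_idx]`, and then evaluated at the outcomes of the first subsample.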
Compared with many other semiparametric problems, where the efficient estimator often involves solving integral equations (Rabinowitz, 2000) or iterative procedures (Tsiatis and Ma, 2004), the estimator here is very simple. In addition, unlike most semiparametric problems, where the nonparametric functions have to be estimated at a certain rate, sometimes using an under-smoothed bandwidth (Liang and Wang, 2005; Li and Liang, 2008) to reach optimality, we do not have such estimation constraints. In fact, we will show that any consistent estimate of $f(s)$ is asymptotically as good as the true $f(s)$. Since consistency can be obtained with a wide range of bandwidths, one typically does not have to go through a computationally intensive cross-validation procedure to choose an optimal bandwidth. Finally, we point out that the splitting of the data is solely to facilitate the later theoretical proof and is not mandatory. In practice, one can certainly use the whole data set both to estimate $f(s)$ and to form $\hat F(t)$ in (7).

Asymptotics and inferences
We present the asymptotic property of the proposed efficient estimator in the following theorem: Theorem 2. The estimator constructed in (7) achieves the semiparametric efficiency bound.Specifically, for n → ∞, √ n{ F (t) − F (t)} → N (0, V ) in distribution, where V = var(φ ef f ) and can be consistently estimated as Intuitively, the reason that (7) can reach the semiparametric efficiency is because it solves the estimating equation formed by summing over the robustified influence functions (5) while replacing the unspecified quantities K * , q T f * (s) and A * by their corresponding optimal choices which are, respectively, the nonparametric estimates of K, q T f (s) and A(s, q T f ).The rigorous proof of Theorem 2 is in Appendix A.5.
Since we are able to construct the optimal estimators and estimate their variances, it is straightforward to make inferences based on these results.For example, we can construct a locally most powerful test for the hypothesis where , and V ij is the (i, j)th element of the covariance matrix V stated in Theorem 2. It is straightforward that when n → ∞, T has a chi-square distribution with one degree of freedom under H 0 .Under the local alternative, say F 1 (t)−F 2 (t) = δ/ √ n, T has a noncentral chi-square distribution with one degree of freedom and noncentrality parameter (δ In some applications, one may be interested in testing whether F 1 (t)−F 2 (t) = δ t at several different t values simultaneously, say at t 1 , . . ., t J .Letting a T = (1, −1){F (t 1 ), . . .F (t J )} − ∆ T 0 , where ∆ 0 = (δ t1 , . . ., δ tJ ) T .This can be written as a problem of testing H 0 : a = 0 versus H 1 : a = 0, Under H 0 , a has a multivariate normal random distribution with mean zero and variance-covariance matrix n −1 Σ, where Here, cov{ F (t j ), F (t k )} can be estimated using where ψ ef f (q i , s i , ; t j , q T f ) and F (t j ) denote ψ ef f and F evaluated at the ith observation and calculated at time t j .Thus, we can construct the test statistic When n → ∞, under H 0 , T has a chi-square distribution with J degrees of freedom.Under a local alternative, say a = ∆/ √ n for some length J vector ∆, T has a noncentral chi-square distribution with noncentrality parameter ∆ T Σ −1 ∆.
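To illustrate how a one-degree-of-freedom test of $F_1(t) = F_2(t)$ can be computed from an estimate and its covariance, the sketch below assumes a Wald-type statistic $T = n\{\hat F_1(t) - \hat F_2(t)\}^2 / (V_{11} + V_{22} - 2V_{12})$. This form is an assumption chosen to be consistent with the chi-square limit and the $V_{ij}$ notation above; it should not be read as a verbatim copy of the paper's statistic. The function name and the numerical inputs are hypothetical.

```python
import math

def wald_test_equal_cdfs(F_hat, V_hat, n):
    """Wald-type test of H0: F_1(t) = F_2(t).

    F_hat : length-2 sequence with the estimates of F_1(t) and F_2(t)
    V_hat : 2 x 2 asymptotic covariance of sqrt(n) * (F_hat - F)
    n     : sample size
    """
    diff = F_hat[0] - F_hat[1]
    var_diff = (V_hat[0][0] + V_hat[1][1] - 2.0 * V_hat[0][1]) / n
    T = diff ** 2 / var_diff
    # For a chi-square variable with one degree of freedom,
    # P(X > T) = erfc(sqrt(T / 2)).
    p_value = math.erfc(math.sqrt(T / 2.0))
    return T, p_value

# Hypothetical inputs, for illustration only.
T_demo, p_demo = wald_test_equal_cdfs([0.6, 0.4], [[1.0, 0.0], [0.0, 1.0]], n=200)
```

The p-value uses the identity between the one-degree-of-freedom chi-square tail and the complementary error function, so no external statistics library is required.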

Understanding the NPMLEs
For many nonparametric models, the NPMLE is a widely used estimation procedure. In the literature, two types of NPMLE have been proposed (Wacholder et al., 1998; Chatterjee and Wacholder, 2001). The first type of NPMLE treats each $u_j^T f(s)$, $j = 1, \ldots, m$, as an unknown PDF, while the second type treats $f(s)$ as a $p$-dimensional unknown PDF. To explain these two NPMLEs in detail, group the observations so that the first $r_1$ observations form a first subset in which every observation has the same $q$ value, equal to $u_1$, the next $r_2$ observations form a second subset with common $q$ value $u_2$, and so on. The last $r_m$ observations form the $m$th subset and have $q$ values equal to $u_m$. We use $\tilde F(t)$ to denote the type I NPMLE of $F(t)$, and $\check F(t)$ the type II NPMLE.
The type I NPMLE, written here as $\tilde F(t)$, maximizes

$$\sum_{i=1}^{n} \log\{q_i^T f(s_i)\}$$

with respect to $q_i^T f(s_i)$ for the $i$th subject in the $j$th subset, subject to $q_i^T f(s_i) \ge 0$ and $\sum_{i=1}^{n} q_i^T f(s_i) I(q_i = u_j) = 1$ for $j = 1, \ldots, m$. This is essentially equivalent to performing an empirical density estimation in each of the $m$ groups, where in each group the $q_i$ values are identical. The resulting estimate of $q^T f(s)$ in the $j$th group is an empirical PDF with weights $r_j^{-1}$ at the observed values. The procedure then uses $u_j^T F(t) = G_j(t)$, $j = 1, \ldots, m$, to recover $F(t)$ by least squares,

$$\tilde F(t) = (U^T U)^{-1} U^T G(t) = \Big(\sum_{i=1}^{n} w_i q_i q_i^T\Big)^{-1} \sum_{i=1}^{n} w_i q_i I(s_i \le t),$$

where we denote $U = (u_1, \ldots, u_m)^T$, $G(t)$ is a length-$m$ vector whose $j$th component equals $G_j(t) = r_j^{-1} \sum_{i=1}^{n} I(s_i \le t) I(q_i = u_j)$, and $w_i = r_j^{-1}$ if $q_i = u_j$. Thus, the type I NPMLE belongs to the family of WLS estimators (and is therefore a member of class (2)), where the weights are taken to be $r_j^{-1}$, the inverse of the number of observations in the $j$th group with the same $q_i$ value. However, the weights of this WLS estimator are clearly not optimal. Moreover, such a choice of weights is intuitively unreasonable, because it down-weights the contributions from a larger subset. One would rather down-weight the contribution from observations with less estimation precision, and the quality of the estimation of $F(t)$ from each observation has no definitive link with its subset size.
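The claim that the type I NPMLE is a WLS estimator with weights $r_j^{-1}$ can be checked numerically. The sketch below assumes that the group-wise empirical CDFs $G_j(t)$ are combined by least squares, $(U^T U)^{-1} U^T G(t)$, and compares the result with WLS using weights equal to the inverse group sizes. The function names and the small data set are hypothetical.

```python
import numpy as np

def npmle_type1(q, s, t):
    """Solve U F = G(t) by least squares, with G_j the empirical CDF in group j."""
    q = np.asarray(q, dtype=float)
    s = np.asarray(s, dtype=float)
    U, labels = np.unique(q, axis=0, return_inverse=True)
    labels = labels.ravel()
    G = np.array([np.mean(s[labels == j] <= t) for j in range(len(U))])
    return np.linalg.solve(U.T @ U, U.T @ G)

def wls_inverse_group_size(q, s, t):
    """WLS with weight 1 / r_j for every observation in group j."""
    q = np.asarray(q, dtype=float)
    s = np.asarray(s, dtype=float)
    _, labels = np.unique(q, axis=0, return_inverse=True)
    labels = labels.ravel()
    r = np.bincount(labels)                  # group sizes r_1, ..., r_m
    w = 1.0 / r[labels]                      # per-observation weights
    y = (s <= t).astype(float)
    qw = q.T * w                             # A^T M for diagonal M
    return np.linalg.solve(qw @ q, qw @ y)

# Small illustrative data set with three distinct q values (assumed).
q_demo = np.array([[0.1, 0.9]] * 4 + [[0.5, 0.5]] * 3 + [[0.9, 0.1]] * 2)
s_demo = np.array([0.2, 1.5, 0.7, 1.9, 0.4, 0.9, 1.2, 0.3, 0.6])
F_type1 = npmle_type1(q_demo, s_demo, t=0.8)
F_wls = wls_inverse_group_size(q_demo, s_demo, t=0.8)
```

The two results agree because $\sum_i w_i q_i q_i^T = U^T U$ and $\sum_i w_i q_i I(s_i \le t) = U^T G(t)$ when $w_i = r_j^{-1}$ within group $j$.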
The type II NPMLE, written here as $\check F(t)$, maximizes the same log likelihood, but with respect to $f(s_i)$, subject to $\sum_{i=1}^{n} f(s_i) = 1_p$ and $f(s_i) \ge 0$ component-wise. It is easy to see that the maximum is attained when the $r_j$ values of $f(s_i)$ corresponding to the same $u_j$ are equal. We denote this common $f(s_i)$ value by $h_j$, for $j = 1, \ldots, m$. We thus maximize $\sum_{j=1}^{m} r_j \log(u_j^T h_j)$ with respect to the $h_j$'s, subject to $\sum_{j=1}^{m} r_j h_j = 1_p$ and $h_j \ge 0$ component-wise. In general, no closed-form solution exists for the $h_j$'s, and the EM algorithm is often used to solve this optimization problem and obtain the estimates $\hat h_j$. The NPMLE then proceeds to form

$$\check F(t) = \sum_{i=1}^{n} \hat f(s_i) I(s_i \le t),$$

where $\hat f(s_i) = \hat h_j$ if $q_i = u_j$. The type II NPMLE differs from the type I NPMLE in that here the term "nonparametric" refers to $f(s)$, not to $u_j^T f(s)$. In the literature, the type II estimator is considered an improvement over the type I NPMLE. However, our careful investigation reveals that the type II NPMLE is not even consistent, which is a rather counterintuitive result. In Appendix A.6, we give a detailed calculation in a concrete case to explicitly illustrate the inconsistency, and in Section 5 we demonstrate the bias of the type II NPMLE in a moderately large sample through simulations.
We now give a more general demonstration of why the type II NPMLE is inconsistent. Suppose the solution to the constrained maximization problem is $\hat h_1, \ldots, \hat h_m$. Then the type II NPMLE, written $\check F(t)$, is

$$\check F(t) = H G(t),$$

where $H = (r_1 \hat h_1, \ldots, r_m \hat h_m)$, and $U$, $G(t)$ and the type I NPMLE $\tilde F(t)$ are as defined before. We already know that $\tilde F(t)$ is a consistent estimator of $F(t)$. If $\check F(t)$ were also consistent, then, because $G(t)$ converges to $U F(t)$, we would have $HU \to I_p$ as $n \to \infty$. This is a much stronger condition than the original constraints of the maximization problem and is in general not satisfied. In fact, this condition means that the type II NPMLE is asymptotically equivalent to the type I NPMLE, which contradicts the original goal of developing a type II estimator. In other words, as an estimator distinct from the type I NPMLE, the type II NPMLE is inconsistent.

Simulations
To study the finite sample performance of the proposed estimators, we conducted several simulation studies.In all the simulations, the dimension of F (t) is p = 2, and the number of simulation iterations is 1000.
We studied eight different estimators. The efficient estimator with the true $f(s)$ inserted (hence unrealistic) is denoted ORACLE, while the one with the estimated $f(s)$ inserted is denoted EFF. Thus EFF is the implemented efficient estimator. Two kinds of robust estimators are considered: ROB1 used a mis-specified $f(t)$, and ROB2 not only used a mis-specified $f(t)$ but also had $K = 0$ plugged in. Specifically, in ROB1, we used the true $f_1(t)$ as the proposed model for $f_2(t)$, and the true $f_2(t)$ as the proposed model for $f_1(t)$. In ROB2, we proposed a uniform model for both $f_1(t)$ and $f_2(t)$. These two estimators are expected to be consistent, reflecting robustness to mis-specification of the PDFs. We also investigated the proposed OWLS estimator. For comparison, we implemented the OLS, NPMLE1 and NPMLE2 estimators that are used in the literature. We implemented the estimation procedures at $t = 6.8$. The resulting means of the estimates, the sample and estimated standard errors, and the coverage of the 95% confidence intervals are summarized in Table 1.
It can be seen that all the consistent estimators perform well in finite samples, and the estimated variances are very close to the empirical variances. This indicates that the asymptotic results are relevant for a moderate sample size of n = 300. It is very clear that the type II NPMLE yields very large bias. We emphasize that this bias is not a reflection of small sample size, because the bias persists when we increase the sample size to 1000.
We can also see that the type I NPMLE and OLS do not make a very good choice of weights, hence their estimation standard errors are both larger than that of the OWLS. This is especially prominent for the type I NPMLE, which performs even worse than the simple OLS estimator. The two robust estimators (ROB1 and ROB2) perform very similarly, and both have minimal bias, reflecting the desired robustness property with respect to the PDF estimation. Finally, although in theory the efficient estimator (EFF) should outperform the OWLS estimator, the performance of OWLS is as satisfactory as that of EFF. This appears to be often the case in our other simulations not shown here. Thus, using either the proposed OWLS or EFF in practice is expected to be adequate. We also studied the type I error and power of the test (8) in this situation, and present the results in Table 2. The overall performance of the proposed tests is satisfactory. From the left panel of Table 2, we see that all estimators maintain the correct size. From the right panel of the same table, we see that the OLS and NPMLE1 have lower power than the other estimators because of their larger estimation variances.
The second simulation experiment was conducted to closely mimic the QTL mapping data analyzed in Section 6.1. We generated the data from a mixture of two distributions. The first is a uniform distribution on (3, 10), while the second has CDF $c(1 - e^{-t/2.5})$ on the interval (0, 10). The mixture probability takes four different values, $(0.02, 0.98)^T$, $(0.2, 0.8)^T$, $(0.1, 0.9)^T$ and $(0.98, 0.02)^T$, and the sample size is 100. Based on the performance of the various estimators studied in the first simulation, here we used only the two best estimators, the OWLS and the efficient estimator (EFF), to estimate the two CDFs. We also implemented the type II NPMLE for comparison. We plot the true CDFs, the mean of the estimated CDFs and the 95% pointwise confidence band for each method in Figure 1. As expected, both OWLS and EFF give satisfactory results, while NPMLE2 is clearly biased. Again, we emphasize that the bias of NPMLE2 is not caused by the moderate sample size. In fact, when we increased the sample size to 1000, the bias became even more prominent.
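A data set resembling this second design can be generated as in the following sketch. The text does not state how often each of the four mixture-probability vectors occurs, so the code assumes they are equally likely; the normalizing constant is taken as $c = 1/(1 - e^{-10/2.5})$ so that the second CDF equals one at $t = 10$, and the second component is sampled by inverting its CDF. The function name and the seed are hypothetical.

```python
import numpy as np

def simulate_second_design(n, rng):
    """Draw (q_i, s_i), i = 1..n, from a two-component mixture."""
    support = np.array([[0.02, 0.98], [0.2, 0.8], [0.1, 0.9], [0.98, 0.02]])
    q = support[rng.integers(len(support), size=n)]   # assumed equally likely
    from_first = rng.random(n) < q[:, 0]              # unobserved membership

    s_first = rng.uniform(3.0, 10.0, size=n)          # component 1: Uniform(3, 10)

    c = 1.0 / (1.0 - np.exp(-10.0 / 2.5))             # makes the CDF reach 1 at t = 10
    u = rng.random(n)
    s_second = -2.5 * np.log1p(-u / c)                # inverse of c * (1 - exp(-t / 2.5))

    return q, np.where(from_first, s_first, s_second)

q_sim, s_sim = simulate_second_design(100, np.random.default_rng(2024))
```

Only `q_sim` and `s_sim` would be passed to an estimator; the membership indicator is discarded, mirroring the fact that the sub-population identifiers are unobserved.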
Similarly, the third simulation was conducted to closely mimic the LDL data analyzed in Section 6.2. The first CDF is $c_1/\{1 + e^{-(t-3)/0.5}\}$ on the interval (0, 6), and the second CDF is $c_2/\{1 + e^{-(t-2.5)/0.2}\}$ on the interval (0, 7). Note that these two CDFs cross. Here, the mixture probability takes three different values, $(0.15, 0.85)^T$, $(0.6, 0.4)^T$ and $(0.8, 0.2)^T$, and the sample size is 300. Estimates based on OWLS, EFF and NPMLE2 are computed, and the mean of the estimated CDFs and the 95% pointwise confidence band for each method are presented in Figure 2 together with the true CDFs. As in the second simulation, both OWLS and EFF perform well, while NPMLE2 shows large bias.

Estimation from QTL mapping data
In QTL studies, the trait observations are assumed to be drawn from a mixture of several QTL genotype groups, and the mixture probabilities of a subject having a certain QTL genotype given the flanking markers are calculated from the study design, the marker genotypes and the recombination fraction between the flanking markers, whose locations are known, and the putative QTL (Wu et al., 2007). The first example that we use to illustrate our methods is a genetic linkage study used to map QTLs for rice plant height and grain shape. The identified QTL can be used to produce taller rice plants to increase yield. In Huang et al. (1997), a doubled haploid (DH) population of rice plants was derived from two inbred lines (semi-dwarf IR64 and tall Azucena), creating 123 DH lines, each genotyped with 135 RFLP markers and 40 isozyme and RAPD markers. Several traits such as grain shape and plant height were recorded. A DH population is equivalent to a backcross population in which the two marker genotypes have an approximately 1:1 distribution ratio. The mixture probabilities $q_i$ of a plant carrying a certain QTL genotype given the flanking markers are computed from the marker genotypes and the recombination fraction between the marker and the QTL. The details of the computation of $q_i$ can be found in Table 10.3 of Wu et al. (2007).
Using a Gaussian mixture model, Wu et al. (2007) analyzed the plant height measured at 10 weeks after the rice was transplanted to the field and mapped a QTL for this trait to 199 cM on chromosome 1, between the markers RZ730 and RZ801. Here we estimate the cumulative distribution function of the rice plant height for each of the two QTL genotypes at the same locus (199 cM on chromosome 1) using the model (1).
There were 84 plant height measurements available. Table 3 presents the estimated CDFs and their standard errors for each of the two QTL genotypes at several values of the plant height. We present the efficient estimator (EFF) and the optimal WLS (OWLS). We omitted OLS and the two NPMLEs because of their respective deficiencies. The proposed OWLS and EFF lead to comparable results. The test of $H_0: F_1(t) = F_2(t)$ based on the test statistic (8) was significant at the 5% level for both estimators at three typical values of $t$, indicating a difference in the distribution functions for the two QTL genotypes. In addition, we tested the difference between the two distributions at the three $t$ values simultaneously using the test (9). The null distribution of the test statistic was chi-square with three degrees of freedom, and the p-value was less than 0.01, which indicates a significant difference.
Figure 3 presents the CDFs of rice plant heights for plants carrying each of the two QTL genotypes, estimated by the efficient estimator (EFF). It can be seen that there is a large difference in the CDFs across the entire range of the plant height, and that carrying a risk allele increases the plant height. For example, it was estimated that 90.5% (CI: 78.3%, 100%) of the plants with the Bb QTL genotype will have plant heights greater than 110, compared to 10.5% (CI: 0.7%, 20.3%) in the bb genotype group. This difference is highly significant (p < 0.001). These results are consistent with the analysis conducted in Wu et al. (2007).

Estimation from the LDL data
In the LDL example introduced in Section 1, the association between the APOE ε4 allele and the LDL concentrations in young children is our main research interest. There were 230 subjects included in the data analyses. We show the estimated cumulative distribution function of LDL concentration for carriers of the ε4 allele (carrying one or two copies of ε4) compared to non-carriers (carrying zero copies of ε4) at several values of the LDL levels in Table 4. As in data example 1, we present the EFF and the OWLS. Both estimators yielded similar results. The comparison of the CDFs for carriers versus non-carriers was not significant at the 5% level at LDL = 100 or LDL = 260, but was significant at LDL = 180. Similar to the QTL analysis, we tested the difference between the two distributions at these three typical $t$ values simultaneously by (9). The p-value was 0.29, indicating a non-significant overall difference between the two distributions at these values.
Figure 4 depicts the CDF of LDL for carriers and non-carriers estimated by the efficient estimator, EFF. It can be seen that there is virtually no difference between the two CDFs in the range from 45 to 130. The CDF for carriers is elevated in the interval (130, 200) compared to non-carriers, and the two functions merge again for LDL greater than 200. Previous analyses in the literature focus on the mean LDL concentration. Our analysis shows that the effect of APOE ε4 on LDL manifests in the range of 130 to 200.

Discussion
We have developed nonparametric estimation procedures for mixed samples where the conditional distribution of the outcome given the group membership is completely unspecified and the mixing probabilities are known or can be calculated without using the outcome data. We propose an extremely simple optimal weighted least squares estimator and derive an easy-to-compute efficient estimator which reaches the semiparametric efficiency bound. We illustrate by simulations that the OWLS estimator has good efficiency in many practical situations. We investigate the performance of two types of NPMLE and show the surprising result that neither of them is efficient and one of them is not even consistent. This is in contrast to many other semiparametric problems where the NPMLE is an efficient estimator.
Although the estimators are constructed for CDFs, it is straightforward to adapt these procedures to estimate a quantile function $F^{-1}(\tau)$. This is because we can express all the estimators in terms of solving for $F(t)$ from an estimating equation. When we denote $t = F^{-1}(\tau)$, replace $F(t)$ with $\tau$ in these estimating equations, and solve for $t$ from the known $\tau$ value instead of solving for $F(t)$ from the known $t$ value, we obtain estimators for the quantile functions. For example, the efficient quantile estimator at $\tau$ can be obtained by solving for $t$ from the efficient estimating equation, where $K$ itself is now a function of $t$, hence we use the notation $K(t, q^T f)$.
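The inversion idea can be sketched numerically. The example below is only an illustration under stated assumptions: it uses the OLS estimator $(\sum_i q_i q_i^T)^{-1} \sum_i q_i I(s_i \le t)$ in place of the efficient estimator, a hypothetical two-component design with three known values of $q_i$, and normal component distributions chosen for the simulation. The function names `ols_cdf` and `ols_quantile` are ours.

```python
import numpy as np

def ols_cdf(q, s, t):
    """OLS estimate of the component CDFs at t: (sum q q^T)^{-1} sum q I(s <= t)."""
    y = (s <= t).astype(float)
    return np.linalg.solve(q.T @ q, q.T @ y)

def ols_quantile(q, s, tau, component):
    """Smallest observed s at which the estimated component CDF reaches tau."""
    for t in np.sort(s):
        if ols_cdf(q, s, t)[component] >= tau:
            return t
    return np.max(s)

rng = np.random.default_rng(0)
n = 2000
# Hypothetical design: each subject has one of three known mixing vectors.
levels = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
q = levels[rng.integers(0, 3, size=n)]
# Latent membership, used only to simulate s and never seen by the estimator.
group = (rng.random(n) < q[:, 1]).astype(int)
# Assumed truth for the simulation: N(0, 1) for component 1, N(2, 1) for component 2.
s = rng.normal(loc=np.where(group == 0, 0.0, 2.0), scale=1.0)

median_1 = ols_quantile(q, s, 0.5, component=0)
median_2 = ols_quantile(q, s, 0.5, component=1)
print(median_1, median_2)
```

Scanning the sorted observations and taking the first crossing is one simple way to solve for $t$ when the estimated CDF is a step function that need not be monotone; other root-finding choices are possible.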
The CDFs estimated by the consistent estimators may not be monotone increasing functions of $t$ when the sample size is relatively small. In fact, the type II NPMLE was originally proposed to address this issue, but it unfortunately led to inconsistency. One way to guarantee monotonicity is through reparametrization. For example, we could write $f(t) = e^{g(t)} \exp\{-\int_0^t e^{g(u)}\,du\}$ and treat $g(u)$ as a nuisance parameter, which will guarantee $F(t) = 1 - \exp\{-\int_0^t e^{g(u)}\,du\}$ to be monotone and within 0 and 1. However, the additional complexity may not be worth the gain. Instead, we suggest using a post-estimation adjustment, such as the pool adjacent violators algorithm (Barlow et al., 1972), to modify the results to achieve monotonicity. For a detailed description, see Wang et al. (2007).
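As an illustration of such a post-estimation adjustment, the following is a minimal sketch of the pool adjacent violators algorithm applied to a made-up vector of pointwise CDF estimates. The function name `pava`, the input values, and the final clipping to [0, 1] are our assumptions for the example and are not taken from Barlow et al. (1972) or Wang et al. (2007).

```python
import numpy as np

def pava(y, w=None):
    """Pool adjacent violators: weighted least squares nondecreasing fit to y."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    blocks = []  # each block holds [block mean, block weight, block length]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, c1 + c2])
    return np.concatenate([np.full(c, m) for m, _, c in blocks])

# Hypothetical raw CDF estimates on an increasing grid of t values.
raw = np.array([0.05, 0.20, 0.15, 0.40, 0.38, 0.70, 1.02])
# Clipping to [0, 1] is an extra step added here, separate from the pooling itself.
mono = np.clip(pava(raw), 0.0, 1.0)
print(mono)
```

Each pair of violating neighbours is replaced by its weighted average, so the adjusted curve stays as close as possible to the raw estimates in the least squares sense while being nondecreasing.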
Finally, we point out that one needs to be cautious in interpreting the inconsistency of the type II NPMLE. The inconsistency occurs when a purely nonparametric model is used. Estimators based on parametric models, or on semiparametric models such as the Cox proportional hazards model with a nonparametric baseline or piecewise exponential models, are likely to be consistent. An extension of the proposed methods to handle censoring, based on the full data influence functions discovered here and inverse probability weighting, is underway.
for any parametric submodel. A parametric submodel is a model where the original unknown function $f(s)$ is replaced by a parametric PDF model $f(s; \gamma)$, and it satisfies $f(s; \gamma_0) = f_0(s)$. Here $S_\gamma$ is the score function with respect to $\gamma$ evaluated at $\gamma_0$. The relation in (A.1) indicates that where $\mu(q)$ is the counting measure of $Q$.
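Given any parametric submodel of the form $g(q, s; \gamma) = p_Q(q)\, q^T f(s; \gamma)$, where $\gamma = (\gamma_1, \dots, \gamma_p)^T$ and $f(s; \gamma) = \{f_1(s; \gamma_1), \dots, f_p(s; \gamma_p)\}^T$, the parameter of interest is $\theta\{f(s; \gamma)\} = \int_{T_1}^{t} f(s; \gamma)\,ds$. On the other hand, the score vector $S_\gamma$ evaluated at the truth has $j$th component $q_j f'_{j\gamma_j}(s; \gamma_{j0}) / \{q^T f_0(s)\}$. Recall that (A.1) requires
$$\int \int_{T_1}^{T_2} \phi_j\, q_j\, f'_{j\gamma_j}(s; \gamma_{j0})\, p_Q(q)\,ds\,d\mu(q) = \int_{T_1}^{T_2} I(s \le t)\, f'_{j\gamma_j}(s; \gamma_{j0})\,ds$$
for $j = 1, \dots, p$, and
$$\int \int_{T_1}^{T_2} \phi_k\, q_j\, f'_{j\gamma_j}(s; \gamma_{j0})\, p_Q(q)\,ds\,d\mu(q) = 0$$
for $k \ne j$. Here $\phi_j$ is the $j$th component of $\phi$. Because $f(s)$ is completely unspecified, the function $f'_\gamma(s; \gamma_0)$ can be any function that satisfies $\int_{T_1}^{T_2} f'_\gamma(s; \gamma_0)\,ds = 0$. It then follows almost everywhere that $\int \phi_j q_j p_Q(q)\,d\mu(q) - I(s \le t)$ is a constant and $\int \phi_j q_k p_Q(q)\,d\mu(q)$ is also a constant for $k \ne j$. These requirements can be written concisely as
$$\int \phi(q, s)\, q^T p_Q(q)\,d\mu(q) = I(s \le t) I_p + C. \quad \text{(A.2)}$$
Note that a legitimate influence function also needs to have mean zero, hence the mean of any $b$ satisfying (A.2), namely $E\{b(Q, S)\} = \int_{T_1}^{T_2} \{I(s \le t) I_p + C\} f(s)\,ds = F(t) + C 1_p$, must be subtracted. Thus, we can write $\phi(q, s)$ as $\phi(q, s) = b(q, s) - F(t) - C 1_p$, where $b$ satisfies (A.2). This gives the desired family of influence functions described in (2).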

A.2. Influence function of the WLS
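Denote the $i$th diagonal entry in $M$ as $w_i$ for $i = 1, \dots, n$. When we view the weight $w_i$ as a random variable, we denote it as $W_i$. Since our arguments are general for any $i = 1, \dots, n$, we often omit the subscript $i$, and use $w$ or $W$ for the corresponding quantities. From the definition of the WLS estimator $\hat F(t)$, we have
$$\hat F(t) - F(t) = \left( \sum_{i=1}^n w_i q_i q_i^T \right)^{-1} \sum_{i=1}^n \left\{ w_i q_i I(s_i \le t) - w_i q_i q_i^T F(t) \right\}.$$
Note that
$$\sqrt{n}\{\hat F(t) - F(t)\} = \{E(W Q Q^T)\}^{-1} n^{-1/2} \sum_{i=1}^n \left\{ w_i q_i I(s_i \le t) - w_i q_i q_i^T F(t) \right\} + o_p(1).$$
So the influence function of WLS is
$$\phi_{WLS}(q, s) = \{E(W Q Q^T)\}^{-1} w q \left\{ I(s \le t) - q^T F(t) \right\}.$$
Specifically, for the OLS and the optimal WLS estimators, the influence functions are respectively
$$\phi_{OLS}(q, s) = \{E(Q Q^T)\}^{-1} q \left\{ I(s \le t) - q^T F(t) \right\}$$
and
$$\phi_{OWLS}(q, s) = \left[ E\left\{ \frac{Q Q^T}{Q^T F(t)\{1 - Q^T F(t)\}} \right\} \right]^{-1} \frac{q \left\{ I(s \le t) - q^T F(t) \right\}}{q^T F(t)\{1 - q^T F(t)\}}.$$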
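To make the two estimators behind these influence functions concrete, here is a small simulation sketch. It rests on assumptions that are ours rather than the paper's: the OWLS weights $1 / [q_i^T F(t)\{1 - q_i^T F(t)\}]$ are computed by plugging in an initial OLS estimate, the fitted probabilities are clipped away from 0 and 1 for numerical safety, and the design and normal component distributions are hypothetical. The helper names `wls_cdf`, `owls_cdf` and `norm_cdf` are illustrative.

```python
import math
import numpy as np

def wls_cdf(q, s, t, w):
    """WLS estimate at t: (sum w q q^T)^{-1} sum w q I(s <= t)."""
    y = (s <= t).astype(float)
    qw = q * w[:, None]
    return np.linalg.solve(qw.T @ q, qw.T @ y)

def owls_cdf(q, s, t, eps=1e-3):
    """Two-step plug-in sketch: OLS first, then weights 1 / [m (1 - m)], m = q^T F."""
    f_ols = wls_cdf(q, s, t, np.ones(len(s)))
    m = np.clip(q @ f_ols, eps, 1.0 - eps)  # guard against weights blowing up
    return wls_cdf(q, s, t, 1.0 / (m * (1.0 - m)))

def norm_cdf(x, mean):
    """CDF of N(mean, 1), used only as the known truth of the simulation."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

rng = np.random.default_rng(0)
n = 5000
levels = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])  # hypothetical known q values
q = levels[rng.integers(0, 3, size=n)]
group = (rng.random(n) < q[:, 1]).astype(int)            # latent, not used in estimation
s = rng.normal(loc=np.where(group == 0, 0.0, 2.0), scale=1.0)

t = 1.0
truth = np.array([norm_cdf(t, 0.0), norm_cdf(t, 2.0)])
f_ols = wls_cdf(q, s, t, np.ones(n))
f_owls = owls_cdf(q, s, t)
print(truth, f_ols, f_owls)
```

Setting all weights to one recovers OLS, so a single weighted solver covers both cases; only the choice of $w_i$ differs.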

A.3. Derivation of $\Lambda_T$ and $\Lambda_T^\perp$
We denote the collection of mean zero functions orthogonal to all the elements in $\Lambda_T$ as $\Lambda_T^\perp$. Considering the space of tangent vectors contributed from the $j$th component $f_j(s)$ only, we obtain
$$\Lambda_j = \left\{ \frac{q_j h(s)}{q^T f(s)} : \int h(s)\,ds = 0,\ h \in R^p \right\}.$$
Furthermore, it is easy to see that
$$\Lambda_T^\perp = \left\{ r(q, s) : \int r(q, s)\, q^T p_Q(q)\,d\mu(q) = C,\ C 1_p = 0 \right\},$$
where $C$ is a constant $p \times p$ matrix.

A.4. Proof of Theorem 1
We only need to verify that $\phi_{eff}$ given in Theorem 1 satisfies $\phi_{eff} = \Pi(\phi \mid \Lambda_T) = \phi - \Pi(\phi \mid \Lambda_T^\perp)$, where $\Pi$ denotes an orthogonal projection. To show this, we first point out that $K 1_p = F(t)$. This is because from the definition of $A(s)$, we have
$$f(s) = A^{-1}(s) \int \frac{q q^T f(s)\, p_Q(q)}{q^T f(s)}\,d\mu(q) = A^{-1}(s) \int q\, p_Q(q)\,d\mu(q).$$
Integrating both sides of the above equation from $T_1$ to $T_2$ and from $T_1$ to $t$ respectively, we obtain
$$1_p = \int_{T_1}^{T_2} A^{-1}(s)\,ds \int q\, p_Q(q)\,d\mu(q), \qquad F(t) = \int_{T_1}^{T_2} I(s \le t) A^{-1}(s)\,ds \int q\, p_Q(q)\,d\mu(q),$$
and the result follows.
Now, letting $h_1(q, s) = h_2(q, s) = A^{-1}(s) q / q^T f(s)$, $h_3(q) = K 1_p$ and $B = -K$, we can easily verify that the corresponding $b(q, s)$ in (4) has the form
$$b_{eff} = \{I(s \le t) I_p - K\} \frac{A^{-1}(s) q}{q^T f(s)} + K 1_p.$$
Note that the above expression equals $\phi_{eff}$. Thus, we have shown that $\phi_{eff}$ is a valid influence function, hence $\phi_{eff} \in \Lambda_T$. Now, for any influence function $\phi$, we need to show $\phi - \phi_{eff} \in \Lambda_T^\perp$. We have
$$\int (\phi - \phi_{eff})\, q^T p_Q(q)\,d\mu(q) = \int \left[ \phi - \{I(s \le t) I_p - K\} \frac{A^{-1}(s) q}{q^T f(s)} \right] q^T p_Q(q)\,d\mu(q) = \int \phi\, q^T p_Q(q)\,d\mu(q) - \{I(s \le t) I_p - K\} = C - \{F(t) + C 1_p\} \int q^T p_Q(q)\,d\mu(q) + K,$$
which is a constant matrix. In the last equality, we used the fact that an influence function $\phi$ can be written as $\phi = b - F(t) - C 1_p$, where $\int b\, q^T p_Q(q)\,d\mu(q) = I(s \le t) I_p + C$, as in (A.2). Because the components of $q$ sum to one, $\int q^T p_Q(q)\,d\mu(q)\, 1_p = 1$, so that
$$\left[ C - \{F(t) + C 1_p\} \int q^T p_Q(q)\,d\mu(q) + K \right] 1_p = C 1_p - \{F(t) + C 1_p\} + K 1_p = 0.$$
Following the description of $\Lambda_T^\perp$, we indeed have $\phi - \phi_{eff} \in \Lambda_T^\perp$.

A.5. Proof of Theorem 2
First, we note that all the approximations are caused by $q^T f$, which is estimated using the second subset of the data. No other estimation or approximation is
This can be written as $\hat f_2(s_i) = r_1^{-1} I(q_i = u_1)$ for all $i = 1, \dots, n$. Hence the NPMLE2 for the PDF $f_2(s)$ puts zero weights on observations that are known to be drawn from the first group, and puts equal weights, $r_1^{-1}$, on other observations. Such a result is equivalent to the standard empirical likelihood estimation of a PDF when we are only given observations $s_1, \dots, s_{r_1}$ drawn as a random sample from this PDF. Hence its corresponding CDF estimator $\hat F_2(t) = \sum_{i=1}^n \hat f_2(s_i) I(s_i \le t) = r_1^{-1} \sum_{i=1}^{r_1} I(q_i = u_1) I(s_i \le t)$ is a consistent estimate of the corresponding true CDF. However, $s_1, \dots, s_{r_1}$ is a random sample from a mixture of two populations, where the mixture probability is $u_{11}$ for being from the first population and $u_{12}$ for the second population. In other words, the estimator $\hat F_2(t)$ is a consistent estimator of $u_{11} F_1(t) + u_{12} F_2(t)$. Obviously, $u_{11} F_1(t) + u_{12} F_2(t)$ does not equal $F_2(t)$ unless $u_{11} \equiv 0$. Consequently, the type II NPMLE is not consistent for this simple case.
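The last step of this argument can be checked numerically. The sketch below does not compute the type II NPMLE itself; it only illustrates, under assumed values $u_{11} = 0.3$, $u_{12} = 0.7$ and assumed normal components, that the empirical CDF of a sample drawn from the mixture settles at $u_{11} F_1(t) + u_{12} F_2(t)$ rather than at $F_2(t)$. All names and numbers are hypothetical.

```python
import math
import numpy as np

def norm_cdf(x, mean):
    """CDF of N(mean, 1), the assumed component distributions of this illustration."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

rng = np.random.default_rng(1)
r1 = 20000                 # number of observations sharing the mixing vector u_1
u11, u12 = 0.3, 0.7        # assumed mixing probabilities, u11 + u12 = 1

# Each observation comes from population 2 with probability u12, else population 1.
from_second = rng.random(r1) < u12
s = rng.normal(loc=np.where(from_second, 2.0, 0.0), scale=1.0)

t = 1.0
ecdf_at_t = np.mean(s <= t)                                   # equal weights 1 / r1
mixture_at_t = u11 * norm_cdf(t, 0.0) + u12 * norm_cdf(t, 2.0)
f2_at_t = norm_cdf(t, 2.0)
print(ecdf_at_t, mixture_at_t, f2_at_t)
```

With these assumed values the mixture CDF at $t = 1$ is well separated from $F_2(1)$, so the gap does not shrink as $r_1$ grows.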

Fig 1. Simulation 2. True CDF (solid) and the mean (dashed), 95% pointwise confidence band (upper band dotted, lower band dash-dotted) of the estimated CDFs. The OWLS (left), EFF (middle) and NPMLE2 (right) are plotted. The mean and true CDFs are indistinguishable for the OWLS and EFF estimators. Sample size is 100, and results are based on 1000 simulations.

Fig 2. Simulation 3. True CDF (solid) and the mean (dashed), 95% pointwise confidence band (upper band dotted, lower band dash-dotted) of the estimated CDFs. The OWLS (left), EFF (middle) and NPMLE2 (right) are plotted. The mean and true CDFs are indistinguishable for the OWLS and EFF estimators. Sample size is 300, and results are based on 1000 simulations.

Fig 3. Data example 1. Estimated cumulative distribution function (CDF) of plant height for QTL genotype Bb (solid) and bb (dashed)

Fig 4. Data example 2. Estimated CDF of LDL levels for carriers of APOE ε4 allele (solid) and non-carriers (dashed)

Table 2
Type I error and power of test in Simulation 1, sample size n = 300, 1000 simulations

Table 3
Data example 1. Estimated CDFs of plant height and their standard errors for QTL genotypes bb ($\hat F_1$) and Bb ($\hat F_2$)

Table 4
Data example 2. Estimated CDFs of LDL levels and their standard errors for APOE ε4 carriers ($\hat F_1$) and non-carriers ($\hat F_2$)