Two-component mixtures with independent coordinates as conditional mixtures: Nonparametric identification and estimation

We show how the multivariate two-component mixtures with independent coordinates in each component introduced by Hall and Zhou (2003) can be studied within the framework of conditional mixtures as recently proposed by Henry, Kitamura and Salanié (2010). Here, the conditional distribution of the random variable Y given the vector of regressors Z can be expressed as a two-component mixture in which only the mixture weights depend on the covariates. Under appropriate tail conditions on the characteristic functions and the distribution functions of the mixture components, which allow for flexible location-scale-type mixtures, we show identification and provide asymptotically normal estimators. The main application of our results is to bivariate two-component mixtures with independent coordinates, the case not covered by Hall and Zhou (2003). In a simulation study we investigate the finite-sample performance of the proposed methods. The main new technical ingredient is the estimation of limits of quotients of two characteristic functions in the tails from independent samples, which might be of some independent interest.


Introduction
Finite mixtures are frequently used to model populations with unobserved heterogeneity. While the component distributions are most often chosen from some parametric family, e.g. the normal or t-distributions, cf. McLachlan and Peel (2000), in recent years there has been considerable interest in finite mixtures with nonparametric components; see below for a review of some of the literature.
A prominent example is the multivariate two-component mixture with independent coordinates in each component by Hall and Zhou (2003, HZ in what follows), which they introduced for modeling results of repeated tests on a single person with unknown disease status.
In this paper we show how the model by HZ can be cast into the framework of two-component conditional mixtures by Henry, Kitamura and Salanié (2010, HKS in the following). In particular, our results imply that for the HZ-model in the (in general only partially identified) two-dimensional case, an appropriate representation of the factors in the independent components can still be identified and estimated under some additional tail assumptions.
Suppose that the conditional distribution of the random variable Y given the vector of regressors Z can be expressed as the two-component mixture
F(y | z) = (1 − λ(z)) F_0(y) + λ(z) F_1(y), (1.1)
where only the mixture weights depend on the covariates. Apart from actual dependence of λ(z) on the covariates, identification in model (1.1) requires additional assumptions on the component distributions F_0 and F_1. HKS investigate identifiability and estimation under tail conditions on the distribution functions themselves, which are tailored to location-type mixtures but do not work for scale mixtures. We focus on identification and estimation results for (1.1) under appropriate tail conditions on the characteristic functions of F_0 and F_1, which allow for more flexible location-scale-type mixtures. Indeed, our main technical contribution is the derivation of the large-sample theory for characteristic-function-based estimators.
Let us review some of the literature on mixtures with nonparametric components. In the simple case of finite mixtures of univariate distributions, most theoretical work assumes symmetry of each component distribution. For example, Bordes, Mottelet and Vandekerkhove (2006) and Hunter, Wang and Hettmansperger (2007) present results on identifiability and asymptotically normal estimation in a two-component location mixture of a single symmetric distribution, and Bordes and Vandekerkhove (2010) and Hohmann and Holzmann (2012) obtain similar results in a two-component mixture model with two symmetric components, one of which is completely specified while the other is unknown with an unknown location parameter. For mixtures of regressions, there is a series of works which exploit the additional information provided by covariates. Kasahara and Shimotsu (2009) extend the HZ-approach to the context of switching regressions. Kitamura (2004, unpublished) considers identifiability for univariate mixtures of regressions with regard to the mean functions and obtains full nonparametric identification of the components under tail assumptions on either their characteristic functions or their moment generating functions. Vandekerkhove (2010) considers the more specific case of a linear switching regression model, where the switching error distributions are only assumed to be symmetric.
Our paper is organized as follows. In Section 2 we show how the HZ-model can be studied within the framework of model (1.1). Further, we show identification under tail conditions on the characteristic functions of F_0 and F_1, and in particular conclude that for the HZ-model in the two-dimensional case, an appropriate representation of the component distributions is identified and can be estimated under our assumptions. In Section 3 we construct asymptotically normal estimators of the component distributions and the mixture weights. The main new technical ingredient is the estimation of limits of quotients of two characteristic functions in the tails from independent samples, which we discuss in Section 4, and which might be of some independent interest. The proofs use strong approximation as well as an entropy-type bound for the characteristic process.
A simulation study is conducted in Section 5, where we focus on the HZ-model in two dimensions. We also propose a cross-validation scheme in order to select the tuning parameters of the estimators. Proofs are deferred to an appendix, while some additional technical results are given in the supplementary material Hohmann and Holzmann (2013).

Two-component conditional mixtures
In this section, we discuss the model by HZ as our main example of (1.1). Further, we briefly review the identifiability statements of HKS and extend these to cover tail conditions on the characteristic functions of the component distributions.

Examples and model reduction
Example 1 (Unobserved binary status). Let Y be some endogenous variable affected by an unobservable binary status T ∈ {0, 1}. Instead of T we observe the regressor T*, which affects the status variable T, but such that Y given T is independent of T*. Then (1.1) holds with F_j(y) = P(Y ≤ y | T = j) and λ(z) = P(T = 1 | T* = z). The main example in HKS is the misclassified binary status variable, e.g. the case of binary T*.
Example 2 (Mixtures with independent components). Suppose that X_1, . . ., X_k are the results of k medical tests on a single person, where it is unknown whether this person is indeed affected by a disease or not. Let T denote the indicator of this unknown affection status, i.e., T = 1 if and only if the person is actually diseased. It is reasonable to assume that the k test results are stochastically independent given the status T. Therefore, the observed data can be modeled by the k-variate two-component mixture with α = P(T = 1) and F_{j,i}(x) = P(X_i ≤ x | T = j). This model was investigated by HZ, who established nonparametric identifiability of the cdfs F_{j,i} and the mixture weight α under a mild irreducibility condition on F for the case k ≥ 3. Partial identifiability results were obtained for k = 2. Let i_0 ∈ {1, . . ., k} be an arbitrary fixed index, and define Y = X_{i_0} and Z = (X_i)_{i ≠ i_0}. Due to the conditional independence of the X_i given T, we obtain for y ∈ R and z that the conditional distribution of Y given Z = z is of the form (1.1). Note that the function λ(z) is only of minor interest. We may as well condition on {Z ∈ B} for a Borel set B ⊂ R^{k−1} to obtain (2.1). In particular, the weight π(R^{k−1}) determines the parameter α.
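As a quick sanity check of this reduction, the following sketch (all distributions and the set B are hypothetical choices, not taken from the paper) simulates a bivariate two-component mixture with independent coordinates and verifies empirically that the conditional distribution of Y = X_1 given {Z ∈ B}, with Z = X_2, has the two-component mixture form with weight π(B) = P(T = 1 | Z ∈ B):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)

def Phi(x):
    # standard normal cdf
    return 0.5 * (1 + erf(x / np.sqrt(2)))

# hypothetical two-component model with independent coordinates (k = 2):
# component 0 is N(0,1) x N(0,1), component 1 is N(2,1) x N(2,1), weight alpha
alpha, n = 0.4, 200_000
T = rng.uniform(size=n) < alpha
X1 = rng.normal(2.0 * T, 1.0)
X2 = rng.normal(2.0 * T, 1.0)

# reduction: Y = X1, Z = X2, condition on {Z in B} with B = [1, infinity)
B = X2 >= 1.0
pi_B = np.mean(T[B])              # empirical pi(B) = P(T = 1 | Z in B)

y = 0.5
lhs = np.mean(X1[B] <= y)         # empirical F(y | B)
rhs = (1 - pi_B) * Phi(y) + pi_B * Phi(y - 2.0)   # two-component mixture form
```

The agreement of `lhs` and `rhs` illustrates the reduction; any Borel set B with positive probability behaves the same way.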
Motivated by the HZ-model as well as the case of a misclassified binary status variable, we consider conditioning in model (1.1) on events {Z ∈ B} for Borel sets B ∈ B^p which have positive probability. Specifically, setting π(B) = E(λ(Z) | Z ∈ B) allows us to write (1.1) as
F(y | B) := P(Y ≤ y | Z ∈ B) = (1 − π(B)) F_0(y) + π(B) F_1(y). (2.2)

Identification
Next, we briefly revisit the results on identification in HKS, reformulated in the context of (2.2), and add a condition for identifiability based on quotients of characteristic functions which allows us to identify scale mixtures. Like HKS, we start with two basic assumptions.
Given Assumption A1, set ζ = (1 − π(B_1))/(1 − π(B_0)) and ξ = π(B_1)/π(B_0). Then direct computations show (2.3), and thus (2.4). From (2.3) and (2.4), HKS observe that F_0, F_1 and π (and in particular λ) can be identified from the quantities ξ and ζ and the observable cdf F. So, under Assumption A1, the identification and estimation of ξ and ζ is the crucial step for the mixture (2.2).
In order to achieve full identification, consider the following tail dominance conditions on the component cdfs F_0 and F_1 and their Fourier transforms, denoted by 𝔉_0 and 𝔉_1, respectively.
For i = 2, 3, let M_i denote the class of mixtures of the form (2.2) whose component cdfs satisfy the tail conditions C1 and Ci. HKS state identification under C1 and C2; for convenience, we reformulate their result.
Theorem 3. If F ∈ M_i for i = 2 or i = 3, then, under A1, F_0, F_1 and π are nonparametrically identifiable within this class M_i. Moreover, by C1, ζ = lim_{y→−∞} F(y|B_1)/F(y|B_0), and under C2, ξ = lim_{y→∞} (1 − F(y|B_1))/(1 − F(y|B_0)), while under C3, ξ = lim_{t→∞} 𝔉(t|B_1)/𝔉(t|B_0). Let us comment on and illustrate the conditions C1-C3 as well as the statement of the theorem.
Assumptions C1 and C2 from HKS mean that F_0 dominates the left tail of the distribution F, while F_1 dominates the right tail. This assumption is natural for location mixtures, where it is satisfied for (exponentially) light tails of the underlying distribution. A class of examples is the (skew) normal distribution with equal skewness and scale parameters. For scale mixtures, Assumptions C1 and C2 are not appropriate. For example, for normal distributions, the component with the higher variance dominates both tails. Specifically, consider a normal mixture with σ_0 > σ_1 and µ_0, µ_1 ∈ R arbitrary. Then Assumption C1 is satisfied, and further
𝔉_0(t)/𝔉_1(t) = exp( i(µ_0 − µ_1)t − (σ_0² − σ_1²)t²/2 ) → 0, t → ∞,
so that Assumptions C1 and C3 provide identification. More generally, for a scale mixture of a supersmooth density (for which the characteristic function decays at an exponential rate), C3 is satisfied. Thus, C3 allows one to separate scale mixtures of smooth densities.
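To illustrate C3 numerically, the following sketch (parameter values hypothetical) evaluates the modulus of the characteristic-function quotient for two normal components with σ_0 > σ_1; it decays like exp(−(σ_0² − σ_1²)t²/2) regardless of the means:

```python
import numpy as np

def normal_cf(t, mu, sigma):
    # characteristic function of N(mu, sigma^2)
    return np.exp(1j * mu * t - 0.5 * sigma**2 * t**2)

# hypothetical parameters with sigma0 > sigma1
sigma0, sigma1 = 2.0, 1.0
t = np.array([1.0, 3.0, 5.0])
# |cf0(t)/cf1(t)| = exp(-(sigma0^2 - sigma1^2) t^2 / 2), means drop out
ratio = np.abs(normal_cf(t, 0.5, sigma0) / normal_cf(t, -0.5, sigma1))
```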
Example 2 (continued). We show in the context of Example 2 that Theorem 3 only states that the specific representation of the conditional mixtures for which the components satisfy the conditions C1 and C2 or C3 is identified; there might be further representations of the form (2.2). Nevertheless, as argued above, these representations are quite natural: C1 and C2 are appropriate if F_0 and F_1 dominate distinct tails of the distribution, while C1 and C3 are natural for a scale-type mixture in a smooth density with light tails.
In terms of densities, the model is as follows. Let f_1 and f_2 denote the one-dimensional marginals of f. Then Theorem 4.1 in HZ provides the factorization (2.7), where the functions g_1 and g_2 are uniquely determined up to constant multiples. Now, Theorem 4.2 in HZ states the partial identifiability of (2.7) as follows.
does not depend on j, and such that f_{θ,0,j} := f_j + α_j g_j, f_{θ,1,j} := f_j + β_j g_j (2.8) are non-negative, then these latter functions are probability densities which also fulfill (2.7), with mixture weight 1 − π. Denote by Θ the set of corresponding shifting vectors θ. We show that our tail assumptions identify a unique value in Θ. Indeed, suppose that θ* ∈ Θ satisfies (2.9), corresponding to condition C1 of Section 2 in terms of densities; together with (2.7), this yields the identification of f_1. The arguments under C2 and C3 are similar.
The nonidentifiability without additional assumptions on the components in model (2.2) holds more generally, see Example 1 in Hohmann and Holzmann (2013). Further, without the additional regressor Z, even in the case of a known weight and tail conditions on the components, the model is not identified, see Example 2 in Hohmann and Holzmann (2013).

Estimation
In this section, given i.i.d. observations (Y_1, Z_1), . . ., (Y_n, Z_n) such that the conditional distribution F(y|z) satisfies (2.2), we follow HKS and propose nonparametric estimators of the components F_0, F_1 and of the weight function π. The essential step is to estimate the quantities ζ and ξ; based on (2.3) and (2.4), plug-in estimates are then easily devised. Given the tail conditions (C1) and (C2)/(C3), we estimate ζ and ξ as the limits arising in (2.5) and (2.6). To this end, Section 4 contains the asymptotic distribution theory for limits of quotients of characteristic functions of independent samples in their tails, our major contribution, as well as of distribution functions, which is essentially covered in HKS. Here, we apply this theory to obtain asymptotics for the estimators in model (2.2).

Estimation of ζ and ξ under C1 and C3
Consider the empirical conditional distribution and characteristic functions. Motivated by (2.5) and (2.6), we follow HKS and consider estimators for ζ and ξ in the form of quotients of these empirical quantities, where the levels L_n and R_n need to be chosen appropriately. To this end, assume
A2. The Borel sets B_0 and B_1 in A1 (i) satisfy B_0 ∩ B_1 = ∅ and p_j := P(Z ∈ B_j) > 0, j = 0, 1.
Under A2, we define disjoint subsamples from the observations Y_k with Z_k ∈ B_0 and with Z_k ∈ B_1, respectively. In order to choose R_n, we assume that the characteristic function satisfies 𝔉(t|B_0) → 0 as t → ∞. Let s_n → ∞ be such that s_n/n → 0. Then, for large n, by continuity of 𝔉(·|B_0) there is a (not necessarily unique) solution t_n of the equation |𝔉(t_n|B_0)| = s_n/m_n, and we choose R_n as a solution of the corresponding empirical version of this equation. More precisely, we require
A3. There exist γ > 0 and a non-random sequence t_n → ∞ such that (4.7)-(4.10) (see Section 4) hold true.
For further discussion see Section 4. Finally, assume that the rates r_n and s_n are chosen such that there exist constants β_ζ and β_ξ in R (possibly zero) for which the corresponding limit relations hold as n → ∞.
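The mechanics of the data-driven level R_n and the resulting quotient estimator can be sketched as follows (a toy illustration, not the paper's exact procedure: the components are assumed scale mixtures of N(0, 2²) and N(0, 1) with weights 0.25 and 0.75 on the narrow component, so that the target limit is ξ = 3; sample sizes, the search grid, and the exponent in s_n are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def ecf(t, s):
    # empirical characteristic function of sample s at point t
    return np.mean(np.exp(1j * t * s))

m_n = 100_000
def draw(pi, n):
    # scale mixture: narrow N(0,1) with weight pi, wide N(0,2^2) otherwise
    narrow = rng.uniform(size=n) < pi
    return np.where(narrow, rng.normal(0.0, 1.0, n), rng.normal(0.0, 2.0, n))

y0, y1 = draw(0.25, m_n), draw(0.75, m_n)   # subsamples for B0 and B1

# level R_n: approximate solution of |ecf(R_n | B0)| = s_n / m_n on a grid
s_n = m_n ** 0.7
grid = np.linspace(0.5, 3.5, 150)
mods = np.array([abs(ecf(t, y0)) for t in grid])
R_n = grid[np.argmin(np.abs(mods - s_n / m_n))]

xi_n = (ecf(R_n, y1) / ecf(R_n, y0)).real   # quotient estimator of xi
```

Here R_n approximately solves the empirical version of the level equation, and ξ_n is the quotient of the two empirical characteristic functions at R_n.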
with τ = p_0/p_1. Moreover, if s_n = r_n, then the estimators are asymptotically independent.
The first part of the proposition is as in HKS; the second (which is based on characteristic functions) as well as the asymptotic independence are our main contributions to estimation. Under Assumptions C1 and C2, we obtain results analogous to those in HKS; see Hohmann and Holzmann (2013) for the details.
For further discussion of the tail assumption (3.2) see HKS, Lemmas 7 and 8. Note that it also applies to the characteristic functions if these (and not the distribution functions) satisfy the shape constraints made in HKS.

Estimating the component distributions and the weight function
We now turn to the estimation of the component distributions F_0 and F_1 and the mixture weight function π. We obtain similar, though slightly more refined, results than HKS.
By (2.3), natural estimates of F_0 and F_1 are given by the corresponding plug-in formulas, where ζ_n is obtained using (C1) and ξ_n either from (C2) or from (C3). As a consequence of Proposition 4, (3.3) holds, where the variances σ²_ζ and σ²_ξ are given as in Proposition 4, and possibly r_n = s_n, in which case the estimators are asymptotically independent. Then we have
Theorem 5. Suppose that (3.3) holds, where s_n and r_n satisfy (3.1). Then the estimated component processes converge weakly, where G_i, i = ξ, ζ, are tight Gaussian processes with the stated mean and covariance functions.
We note that relations analogous to (2.3) also hold true for the underlying densities, and hence that similar estimators for the densities could be devised.
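The plug-in reconstruction of the components can be sketched as follows. The explicit formulas below are a reconstruction consistent with the tail-ratio limits of Theorem 3, with ζ = (1 − π(B_1))/(1 − π(B_0)) and ξ = π(B_1)/π(B_0), not a verbatim quote of (2.3); all numerical parameters are hypothetical, and ζ, ξ are taken as known rather than estimated:

```python
import numpy as np

rng = np.random.default_rng(3)

# hypothetical setting: F0 = N(0,1), F1 = N(3,1), pi(B0) = 0.2, pi(B1) = 0.8,
# hence zeta = (1 - 0.8)/(1 - 0.2) = 0.25 and xi = 0.8/0.2 = 4
n = 50_000
def draw(pi):
    comp1 = rng.uniform(size=n) < pi
    return np.where(comp1, rng.normal(3.0, 1.0, n), rng.normal(0.0, 1.0, n))

y0, y1 = draw(0.2), draw(0.8)        # subsamples for B0 and B1
zeta, xi = 0.25, 4.0                 # taken as known; in practice estimated

def F(t, s):
    # empirical cdf of sample s at t
    return np.mean(s <= t)

y = 0.0                              # evaluate the reconstruction at one point
F0_hat = (F(y, y1) - xi * F(y, y0)) / (1 - xi)       # plug-in for F0(y)
F1_hat = (F(y, y1) - zeta * F(y, y0)) / (1 - zeta)   # plug-in for F1(y)
# true values: F0(0) = 0.5 and F1(0) = Phi(-3), approximately 0.00135
```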
Finally, we consider estimation of the mixture weight function π(B) for sets B ∈ B^p with P(Z ∈ B) > 0, the case B = R^p being of particular interest. Fix a y_0 satisfying Assumption A1. From (2.4), a suitable estimator is given by the corresponding plug-in formula.
Theorem 6. Let B ∈ B^p with P(Z ∈ B) > 0, and assume (3.3), where in the case of equal rates we additionally assume asymptotic independence.
Remark. When estimating the mixture weights π(B_0) and π(B_1) of the sets B_0 and B_1 upon which the estimation procedure is based, the asymptotic covariance matrix has a simpler form: since Λ(B_0) = 0 and Λ(B_1) = 1, the expression simplifies accordingly.
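A sketch of the corresponding plug-in for the weight π(B), treating the component cdfs as known stand-ins for their estimates from the previous subsection (all parameters hypothetical; note that A1 requires F_0(y_0) ≠ F_1(y_0)):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(4)

def Phi(x):
    # standard normal cdf
    return 0.5 * (1 + erf(x / 2 ** 0.5))

# hypothetical components F0 = N(0,1), F1 = N(3,1); sample from a set B with
# true weight pi(B) = 0.5; F0, F1 are stand-ins for their estimates here
n, pi_true, y0_pt = 50_000, 0.5, 1.0
comp1 = rng.uniform(size=n) < pi_true
yB = np.where(comp1, rng.normal(3.0, 1.0, n), rng.normal(0.0, 1.0, n))

F_B = np.mean(yB <= y0_pt)   # empirical F(y0 | B)
# plug-in: pi(B) = (F(y0|B) - F0(y0)) / (F1(y0) - F0(y0))
pi_hat = (F_B - Phi(y0_pt)) / (Phi(y0_pt - 3.0) - Phi(y0_pt))
```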

Estimating quotients in the tails
Let X_1, X_2, . . . and Y_1, Y_2, . . . be mutually independent sequences of i.i.d. observations with distribution functions F and G, respectively. Assume that (4.1) or (4.2) (or both) hold for some θ > 0 and η ∈ C \ {0}, where, as above, 𝔉 and 𝔊 denote the characteristic functions of F and G. We shall construct asymptotically normal estimators of θ and η. In the following, suppose that l_n and m_n are sequences in N such that l_n, m_n ≍ n as n → ∞.

Characteristic functions
To estimate η in (4.2), consider the quotient of empirical characteristic functions, with h_n a sequence tending to infinity, and decompose the estimation error into a bias term and a variance term. In order to handle the variance term, write it in terms of the characteristic processes, with s_n → ∞, and assume that s_n satisfies the appropriate rate condition. We shall use strong approximations of the characteristic processes for F_n(y), and similarly for G_n. In order that these processes be sample-continuous and that the strong approximations work, some conditions on F and G are required, see Csörgő (1981). We shall adopt the following sufficient condition: assume that there exists γ > 0 such that (4.7) holds. Finally, we assume that there also exists a non-random sequence t_n → ∞ such that (4.8)-(4.10) hold, with γ determined by (4.7).
Remark. Given a non-random sequence t_n → ∞ of order (4.8), assume that the following separability criterion holds: there is a sequence a_n, either constant or tending to infinity, such that for all ε > 0 the bound (4.11) holds for n sufficiently large. In the case of a supersmooth density, one can show that (4.11) holds with a_n = t_n, and that this rate implies (4.10).
Lemma 7. Assume that (4.7) and (4.11) hold. If t_n is a non-random sequence of order (4.8), then there exists a suitable random sequence. Let t_n be chosen such that |Φ(t_n)| = s_n/m_n. Given ε > 0, the infimum in (4.11) is attained at t*_n = t_n + ε/a_n. A Taylor expansion then yields the required bound, and thus (4.11) holds. As a result, by Lemma 7 there exists a suitable random sequence. Now, again by a Taylor expansion, similar arguments show that the corresponding bound holds for some t̃_n between h_n and t_n. Since the imaginary part of Φ(h_n) − Φ(t_n) can be handled likewise, we also see that (4.10) is fulfilled.

Distribution functions
To estimate θ in (4.1), consider the corresponding quotient of empirical distribution functions, where the level h_n is specified below. Assume that r_n → ∞ satisfies (3.1), and that (4.12) holds. In particular, (4.12) is satisfied if we choose h_n = Y_{m_n}(⌊r_n⌋), where ⌊r_n⌋ is the largest integer not exceeding r_n and Y_{m_n}(⌊r_n⌋) denotes the ⌊r_n⌋-th largest order statistic of the sample Y_1, . . ., Y_{m_n}.
Theorem 10. Suppose that the assumptions of Theorem 9 for s_n = r_n as well as (3.1) and (4.12) hold. If there exists τ > 0 such that m_n/l_n → τ, then the estimators are jointly asymptotically normal.
The asymptotic distribution of the standardized θ_n follows from arguments along the lines of HKS; however, the asymptotic independence requires some additional work. Further applications, discussed in Hohmann and Holzmann (2013), are testing for tail dominance and estimating the exponent of regular variation.
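The order-statistic choice of the level h_n can be sketched as follows (a toy example with exponential tails, so that the tail quotient tends to θ = 1/2; sample sizes and the rate exponent are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)

# hypothetical tail model: 1 - F(y) = e^{-y} and 1 - G(y) = 2 e^{-y} for large y,
# so that (1 - F(y))/(1 - G(y)) -> theta = 1/2 as y -> infinity
l_n = m_n = 100_000
X = rng.exponential(1.0, size=l_n)                 # sample from F
Y = rng.exponential(1.0, size=m_n) + np.log(2.0)   # sample from G

r_n = m_n ** 0.6                                   # r_n -> inf, r_n/m_n -> 0
k = int(np.floor(r_n))
h_n = np.sort(Y)[-k]                               # k-th largest order statistic

# tail-quotient estimator at the data-driven level h_n
theta_n = np.mean(X > h_n) / np.mean(Y > h_n)
```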

Simulation study
We investigate the finite-sample performance of the estimators in a simulation study within the HZ-model. Consider a random vector (Y, Z)′ distributed according to a two-component mixture with normal and skew-normal components, where Φ and Ψ denote the normal and the skew-normal distribution functions, respectively.
The density of the skew-normal distribution is given by f(x) = 2φ(x)Φ(ax) for a shape parameter a ∈ R, where φ denotes the standard normal density, and its characteristic function admits a known closed form. The resulting condition on the parameters is both necessary and sufficient for C1 and C3 to hold true.
For the estimation we further chose B_0 = (0, ∞) and B_1 = (−∞, 0], inducing the true values ζ = 2.3876 and ξ = 0.4502, and set y_0 = 1. The estimation results for ξ, ζ, and p, using different sample sizes n and different rates r_n = s_n = n^δ, are presented in Table 1. The choice of r_n turns out to strongly affect the variance and bias of the estimates. A small r_n leads to small bias but increases the variance, as should be expected from the theory. Therefore, in a second step we use a cross-validation scheme to choose r_n. This can, for example, be done by repeated random sub-sampling validation: one randomly splits the sample of observations into two sub-samples of equal size and uses these sub-samples to estimate both the mixture for the given r_n, where p = π_n(R), and the ordinary empirical distribution function of Y. One then estimates the L_1-distance between these estimates; the cross-validated r_n is the minimizer (on some fixed grid). The estimates for ξ, ζ, and p using cross-validation can be found in Table 1 (d). Also, Figure 1 shows estimates of the distribution functions F_0 and F_1 using cross-validation.
Table 1: Estimates of ζ, ξ, and π in the model of Hall and Zhou (2003).
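The cross-validation scheme just described can be sketched end to end. Everything below is a simplified toy version: the conditional mixture, the sets B_0 and B_1, the plug-in reconstruction, and the grid of rate exponents δ are all illustrative choices, not the paper's exact simulation design:

```python
import numpy as np

rng = np.random.default_rng(6)

# toy conditional mixture: lambda(z) = z with Z ~ U(0,1),
# components F0 = N(0,1) and F1 = N(3,1) (hypothetical choices)
def draw(n):
    z = rng.uniform(size=n)
    t = rng.uniform(size=n) < z
    return np.where(t, rng.normal(3.0, 1.0, n), rng.normal(0.0, 1.0, n)), z

def ecdf(t, s):
    return np.mean(s <= t)

def fitted_cdf(y, z, delta, grid):
    # plug-in mixture cdf for a given rate exponent delta (r_n = n**delta)
    y0, y1 = y[z < 0.5], y[z >= 0.5]            # subsamples for B0 and B1
    k = max(2, int(len(y) ** delta))
    lo, hi = np.sort(y1)[k - 1], np.sort(y0)[-k]             # tail levels
    zeta = ecdf(lo, y1) / max(ecdf(lo, y0), 1e-12)           # left-tail quotient
    xi = (1 - ecdf(hi, y1)) / max(1e-12, 1 - ecdf(hi, y0))   # right-tail quotient
    Fb0 = np.array([ecdf(t, y0) for t in grid])
    Fb1 = np.array([ecdf(t, y1) for t in grid])
    F0h = (Fb1 - xi * Fb0) / (1 - xi)
    F1h = (Fb1 - zeta * Fb0) / (1 - zeta)
    j = len(grid) // 2       # evaluation point with F0 != F1 for the weight
    pi = (ecdf(grid[j], y) - F0h[j]) / (F1h[j] - F0h[j])
    return (1 - pi) * F0h + pi * F1h

# repeated random sub-sampling validation over a grid of exponents
y, z = draw(8000)
grid = np.linspace(-4.0, 7.0, 111)
deltas = [0.3, 0.5, 0.7]
scores = []
for delta in deltas:
    dists = []
    for _ in range(5):
        idx = rng.permutation(len(y))
        a, b = idx[:4000], idx[4000:]
        Fhat = fitted_cdf(y[a], z[a], delta, grid)
        Femp = np.array([ecdf(t, y[b]) for t in grid])
        dists.append(np.mean(np.abs(Fhat - Femp)))   # approximate L1 distance
    scores.append(np.mean(dists))

best_delta = deltas[int(np.argmin(scores))]
```

For each δ, the sample is split in half several times; the mixture cdf fitted on one half is compared, in approximate L_1-distance on a grid, with the empirical cdf of the other half, and the minimizing δ is selected.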

Proof of Theorem 9
The proof of Theorem 9 proceeds in several steps.
Lemma 11. Assume that (4.7) holds. On a sufficiently rich probability space there exist versions of the X_k and Y_k, and independent sequences B_{1,n} and B_{2,n} of standard Brownian bridges on [0, 1], such that the stated approximation holds.
Proof. According to Theorem 4 and Corollary 2 in Csörgő (1981), for all T, δ > 0 and n ∈ N there exists a constant C, which depends only on δ and F, such that the corresponding bound holds, where the sequence q_n satisfies q_n ∼ n^{−γ/(2γ+4)} (log n)^{(γ+1)/(γ+2)}. Hence, for all sequences T_n = o(q_n^{−1}) and all ε > 0 we find that, eventually, the bound holds, where the right-hand side converges to zero as n → ∞.
Remark. The processes C_{i,n} are zero-mean complex Gaussian. In particular, with C defined as in (4.6) and W a standard Brownian motion on [0, 1], basic properties of the Itô integral give the stated covariance structure, and hence
Lemma 12. Let B_n be a sequence of standard Brownian bridges, not necessarily independent of h_n, and define C_n(y) = ∫ exp(iyx) B_n(F(dx)). If (4.7) and (4.9) hold, then the stated convergence holds.
Proof. For any ε, δ > 0, split the probability into two terms. The right probability tends to zero due to (4.9). The left probability can be made arbitrarily small by the choice of δ. In fact, by the maximal inequality given in Lemma 2.1 of Talagrand (1996), there exists a finite constant K such that, for all x > 0, the tail probability of the supremum is bounded in terms of covering numbers, where N(T, η) denotes the smallest number of open balls of radius η with respect to the distance d(s, t) = [E|C_n(s) − C_n(t)|²]^{1/2} (which does not depend on n) that are necessary in order to cover an index set T ⊂ R. Hence, it remains to show that the entropy integral ∫_0^∞ log N(I_{n,δ}, η) dη is finite and can be made arbitrarily small by the choice of δ.
As a result, there exists an absolute constant C such that d(s, t) ≤ C|s − t|^{γ/2} for sufficiently small δ and s, t ∈ I_{n,δ}, so that each η^{2/γ}/C-cover of I_{n,δ} with respect to the absolute-value distance is a valid η-cover with respect to d, yielding the required entropy bound.
Proof of Theorem 9. For all ε > 0 and T_n → ∞, with C_{1,n} as given in (A.1), we obtain the corresponding bound. Below we show (A.4). The conclusion follows by using (A.3), the fact that the CN(0, 1, 0)-distribution is invariant under multiplication by complex non-random numbers z_n with |z_n| = 1, and the independence of C_{1,n} and C_{2,n}.
To show (A.4), from (A.3) it follows that 𝔊_n(h_n) satisfies the corresponding expansion. Applying (4.5), we therefore obtain the claim. The proof of Theorem 10 is given in Hohmann and Holzmann (2013).

A.2. Proofs of Section 2
Proof of Theorem 3. From C1, for i = 0, 1, the corresponding limits hold with probability one. Finally, since the weak limit does not actually depend on the specific realization of the sequence 1_{Z_k ∈ B_j}, the result follows.
Proof of Theorem 5. We only consider the process

Fig 1. Estimates F̂_0 and F̂_1 (dashed) of the component distribution functions F_0 and F_1 (solid) for different sample sizes n. The right column shows estimates of the corresponding densities.
t_n) by the triangle inequality, we further have |𝔊(t_n)| = s_n/m_n (1 + o_P(1)), and thus