Discrete mixture representations of parametric distribution families: geometry and statistics

We investigate existence and properties of discrete mixture representations $P_\theta =\sum_{i\in E} w_\theta(i) \, Q_i$ for a given family $P_\theta$, $\theta\in\Theta$, of probability measures. The noncentral chi-squared distributions provide a classical example. We obtain existence results and results about geometric and statistical aspects of the problem, the latter including loss of Fisher information, Rao-Blackwellization, asymptotic efficiency and nonparametric maximum likelihood estimation of the mixing probabilities.


Introduction
We say that a family $\{P_\theta : \theta \in \Theta\}$ of probability measures on some measurable space $(S, \mathcal S)$ has a mixture representation in terms of a finite or countably infinite family $\{Q_i : i \in E\}$ of probability measures on $(S, \mathcal S)$, the mixing distributions, and a family $\{w_\theta : \theta \in \Theta\}$ of probability mass functions on $E$, the mixing coefficients, if, for all $\theta \in \Theta$,
$$P_\theta(A) \,=\, \sum_{i\in E} w_\theta(i)\, Q_i(A) \quad\text{for all } A \in \mathcal S. \qquad (1)$$
We will generally have a continuous (uncountable) base family $\{P_\theta : \theta \in \Theta\}$ and a parameter set $\Theta$ that is a subset of $\mathbb R^d$ for some $d \ge 1$. Note that we assume that $E$ is discrete; we will therefore refer to (1) as a discrete mixture representation.
A particularly interesting example, with connections to the power of statistical tests and also to the Dynkin isomorphism in the theory of stochastic processes, arises if we start with a standard normal random variable $X$ and let $P_\theta$ be the distribution of $(X+\theta)^2$. For this noncentral chi-squared distribution with one degree of freedom and noncentrality parameter $\theta^2$, the representation (1) holds with $E = \mathbb N_0$, $Q_i$ the central chi-squared distribution with $2i+1$ degrees of freedom, and $w_\theta$ the probability mass function of the Poisson distribution with mean $\theta^2/2$; see also equation (12) below.
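This representation can be checked numerically. The following sketch uses scipy; the evaluation grid, the value of $\theta$ and the truncation of the Poisson sum at 80 terms are arbitrary choices.

```python
import numpy as np
from scipy.stats import chi2, poisson, ncx2

# Check the discrete mixture representation of the noncentral chi-squared
# distribution with one degree of freedom:
#   ncx2(1, theta^2) density == sum_i Po(theta^2/2)({i}) * chi2(2i+1) density
theta = 1.7
lam = theta**2 / 2
xs = np.linspace(0.1, 15.0, 50)

mix = sum(poisson.pmf(i, lam) * chi2.pdf(xs, 2 * i + 1) for i in range(80))
direct = ncx2.pdf(xs, df=1, nc=theta**2)

assert np.allclose(mix, direct, atol=1e-10)
```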
Replacing the standard normal distribution by some other distribution µ on the real line we obtain the noncentral distributions associated with µ. In [2] we obtained representations similar to the classical case for such noncentral families if µ is the logistic distribution or the hyperbolic secant distribution, but our methods there did not lead to similar results for other standard symmetric distributions, such as the double exponential and the Cauchy distribution. Initially, our aim was to develop an alternative approach that can be used in these cases and for other parametric families. It turned out that the problems of existence and properties of discrete mixture representations for a given family of probability measures have some interesting general aspects, of a geometric and statistical nature; these, together with their interaction, are now the major themes of the present paper.
We briefly recall the general situation: With an arbitrary set of mixing distributions we need to replace the sum in (1) by an integral and thus require a measurable structure, i.e. a $\sigma$-field $\mathcal E$, on $E$, together with $\mathcal E$-measurability of all functions $y \mapsto Q(y, A)$, $A \in \mathcal S$, so that $Q$ is a Markov kernel (transition probability) from $(E, \mathcal E)$ to $(S, \mathcal S)$. For a probability measure $\tau$ on $(E, \mathcal E)$ we then obtain $P_\tau$, the mixed distribution with mixing kernel $Q$ and mixing measure $\tau$, by
$$P_\tau(A) \,:=\, \int Q(y, A)\, \tau(dy) \quad\text{for all } A \in \mathcal S. \qquad (2)$$
With $S = \{y_k : k \in \mathbb N_0\}$ countable and $\mathcal S$ the set of all subsets of $S$, a discrete mixture representation always exists for any family $\{P_\theta : \theta \in \Theta\}$: With $E = \mathbb N_0$ we take $Q_k$ to be the probability measure $\delta_{y_k}$ concentrated on $\{y_k\}$ and define the mixing coefficients by $w_\theta(k) = P_\theta(\{y_k\})$. In the general situation we may still use such a construction with $Q(y, \cdot) = \delta_y$ the one-point measure in $y$ and $P_\theta$ itself as the mixing measure (here $E = S$, $\mathcal E = \mathcal S$, and we assume that $\{y\} \in \mathcal S$ for all $y \in S$). In particular, there is a mixture representation for every family $\{P_\theta : \theta \in \Theta\}$ of probability measures on a given measurable space $(S, \mathcal S)$.
A discrete mixture representation, however, might not exist: For example, (1) implies that the family $\{P_\theta : \theta \in \Theta\}$ is dominated by some $\sigma$-finite measure $\nu$ (domination may fail in general, for example if $S$ is uncountable and the family consists of all one-point measures). In fact, for a dominated family (1) may be rewritten in terms of densities as
$$f_\theta \,=\, \sum_{i\in E} w_\theta(i)\, g_i, \qquad (3)$$
where $g_i$, $f_\theta$ are $\nu$-densities of $Q_i$, $P_\theta$. This way, we may regard $\{P_\theta : \theta \in \Theta\}$ as a curve in the space $L^1 = L^1(S, \mathcal S, \nu)$ of $\nu$-integrable functions on $S$.

Mixed distributions are a canonical theme in probability and statistics, and many authors have considered related problems. A standard reference for mixtures, especially from a statistical point of view, is Lindsay's research monograph [18]. The set of mixtures of binomial distributions has been considered in detail in [25] and [26], from a geometric and a statistical angle respectively. In [17] penalized maximum likelihood estimators are proposed for mixed distributions.
A classical mixture result for a nonparametric distribution family appears in connection with Grenander's influential paper [11] about estimation of distributions on R + that have decreasing densities and, more generally, in connection with the structure of unimodal distributions [7, p.158]. The special case of noncentral chi-squared is considered in many papers, see e.g. [10,22,13]. Finally, a strong case for the use of mixture representations is made in [12], where these are related to the removal of constraints in the optimization problems that typically turn up once the estimates have to be calculated.
In Section 2 we provide existence results that can be used to obtain discrete mixture representations in many cases, including the noncentral families associated with the double exponential and the Cauchy distribution mentioned above; see Theorems 1 and 4. The proof of the first theorem is based on a representation of the $\sigma$-field $\mathcal S$ by a filtration that consists of $\sigma$-fields generated by finite or countably infinite partitions of $S$, such as the dyadic partitions if $S$ is the unit interval and $\mathcal S$ its Borel $\sigma$-field. The result, however, is less explicit than in the classical case of noncentral chi-squared distributions or the other cases considered in [2].
In Section 3 we develop the curve view in (3) and relate the existence of discrete mixture representations to continuity properties of the curve for spaces other than $L^1$. Section 4 deals with geometric aspects. The set of all mixtures of a given family of mixing distributions is obviously convex, but it is in fact convex in a stronger sense, which makes it well-suited to Dynkin's approach [5] based on the notion of a barycentric map. In particular, and in contrast to the strategy used in many papers in this area, where Choquet's representation theorem is an important tool, we do not require topological notions for infinite-dimensional linear spaces. Naturally, the (geometric) question arises whether a representation such as (1) is minimal in the sense that the right-hand side is the barycentric convex hull of the parametric family on the left; see Theorem 10 for results in the chi-squared case. The set of extreme points of the mixture family in the representation of uniforms from Section 2 turns out to be empty; see part (a) of Theorem 11.
In Section 5, which contains several subsections, we consider statistical aspects. In particular, from this point of view mixtures such as (1) may be thought of as representing a two-stage experiment: To obtain a random variable X with distribution P θ , we first choose an E-valued random variable T with probability mass function w θ and then, given T = i, choose X with distribution Q i . In particular, finding a discrete mixture representation is essentially the same as finding a discrete sufficient statistic, on a possibly enlarged base space.
In Section 5.1 we relate our general existence result from Section 2 to the construction of an important class of prior distribution families in nonparametric Bayesian inference.
For Sections 5.2 and 5.3 the starting point is the observation that a representation such as (1) relates the parametric families {W θ : θ ∈ Θ} of probability distributions on (E, E), where W θ is given by W θ ({i}) = w θ (i) for all i ∈ E, and the family {P θ : θ ∈ Θ}. On general grounds the passage from the first to the second family entails an information loss. This can be formalized by the respective Fisher information. We obtain an integral expression for the classical case with Poisson distributions and noncentral chi-squared distributions with one degree of freedom, see Proposition 12. It turns out that at least half of the information is lost, and that this bound is asymptotically tight as θ → ∞. Moreover, it follows that the method of moments estimator for the noncentrality parameter in the chi-squared case (which, together with some of its variants, has been considered in several papers) is not efficient. Further, in the general setup, the existence of a sufficient statistic T , on a possibly enlarged base space, also leads to a comparison of experiments by conditioning on T . This is carried out in Proposition 13 for the representation obtained in Section 2 for a family of uniform distributions and the method of moments estimator.
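The information-loss claim for the classical case can be illustrated by a rough Monte Carlo computation (this is not the integral expression of Proposition 12; the value $\theta = 3$, the finite-difference step and the sample size below are arbitrary choices). The Poisson family with mean $\theta^2/2$ has constant Fisher information $2$ in $\theta$, so "at least half the information is lost" means that the information of $\chi^2_1(\theta^2)$ in $\theta$ is at most $1$.

```python
import numpy as np
from scipy.stats import ncx2

# Monte Carlo estimate of the Fisher information I_P(theta) of the
# noncentral chi-squared family chi^2_1(theta^2): average the squared
# score d/dtheta log f_theta(X), approximated by a central difference.
rng = np.random.default_rng(0)
theta, h, n = 3.0, 1e-4, 400_000

x = ncx2.rvs(df=1, nc=theta**2, size=n, random_state=rng)
score = (ncx2.logpdf(x, 1, (theta + h)**2)
         - ncx2.logpdf(x, 1, (theta - h)**2)) / (2 * h)
info = np.mean(score**2)

assert info < 2.0           # information is lost relative to the Poisson stage
assert 0.3 < info <= 1.05   # close to, but not above, the bound 1
```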
In Section 5.4 we note that a discrete mixture representation leads to an embedding of the parametric family into a nonparametric one, meaning that the original parameter set $\Theta$ is replaced by the probability simplex on $(E, \mathcal E)$. In our final result, Theorem 14, we show, again in the classical situation, that the method of moments estimator for the mean functional is then asymptotically efficient at a large class of distributions, including the noncentral chi-squared with one degree of freedom. General aspects of and comments on nonparametric maximum likelihood estimation for such classes, including the use of the EM algorithm, are collected in Section 5.5. Section 6 concludes our work and also mentions some directions for future research. Finally, an appendix contains the proofs of our results.

Existence of discrete mixture representations
We construct a discrete mixture representation for a specific distribution family and then use this result to obtain such representations for other families. We begin the first step by outlining a general approach; see also Section 5.1 for a similar treatment of a problem in nonparametric Bayesian inference.
Suppose that the basic space $(S, \mathcal S)$ is the increasing limit of a sequence of finite spaces, in the sense that $\mathcal S$ is generated by the union of $\mathcal F_n$, $n \in \mathbb N$, where $(\mathcal F_n)_{n\in\mathbb N}$ is a filtration consisting of finite $\sigma$-fields. Then each $\mathcal F_n$ is generated by a finite measurable partition $F_{n,1}, \dots, F_{n,k_n}$ of $S$, and these partitions are nested. For a prospective dominating probability measure $\nu$ we then put
$$E(\nu) \,:=\, \{(n,k) : n \in \mathbb N,\ k = 1, \dots, k_n,\ \nu(F_{n,k}) > 0\},$$
and for $i = (n,k) \in E(\nu)$ we let $Q_i$ be the distribution with $\nu$-density $g_{n,k} := \nu(F_{n,k})^{-1} 1_{F_{n,k}}$. (Here and below $1_A$ denotes the indicator function of the set $A$.) For an arbitrary probability measure $\mu$ on $(S, \mathcal S)$ with $\nu$-density $f$ we then obtain an increasing sequence of subprobability densities
$$f_n \,=\, \sum_{k=1}^{k_n} a_{n,k}\, g_{n,k}, \quad\text{with } a_{n,k} := \sup\{t \ge 0 : \nu(t g_{n,k} \le f) = 1\} \text{ for all } (n,k) \in E(\nu).$$
The desired discrete mixture representation then appears through the corresponding limit as $n \to \infty$. It should be clear that this approach can also be used for two-sided filtrations $(\mathcal F_n)_{n\in\mathbb Z}$ and countably infinite partitions. We now carry this out for a specific parametric family, using the decomposition of the real line into dyadic intervals. Let $\mathbb Q_{\rm bin} = \{k 2^m : k \in \mathbb Z \text{ odd},\ m \in \mathbb Z\}$ be the set of binary rational numbers. By convention, the notation $(a,b)$ may refer to a pair of real numbers or to an open interval, but the meaning should always be clear from the context. Let ${\rm unif}(a,b)$, $-\infty < a < b < \infty$, be the uniform distribution on the interval $(a,b)$, and let $E \subset \mathbb R^2$ be a suitable countable set of such interval parameters. Our first result now shows that the parametric family of uniform distributions on bounded intervals of real numbers has a discrete mixture representation. We give a constructive proof (see the appendix), which will prove instructive later on when we use the representation provided by Theorem 1. The proof is based on a decomposition of finite real intervals into those with binary rational endpoints and length an integer power of 2.
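For $S = [0,1]$ with $\nu$ the Lebesgue measure and the dyadic filtration, the approximating densities $f_n$ can be computed directly: on $F_{n,k}$, the term $a_{n,k}\, g_{n,k}$ equals the infimum of $f$ over $F_{n,k}$. The following sketch uses the example density $f(x) = 2x$ (any bounded density would do) and checks that the $f_n$ increase towards $f$.

```python
import numpy as np

def f(x):
    # example density on [0, 1]
    return 2.0 * x

def lower_dyadic_approx(n, grid=2**12):
    """Step density f_n = sum_k a_{n,k} g_{n,k} for the dyadic partition
    of [0,1) into 2^n cells; on each cell, f_n equals the infimum of f."""
    x = np.linspace(0.0, 1.0, grid, endpoint=False)
    cells = (x * 2**n).astype(int)           # index k of the dyadic cell
    vals = f(x)
    infs = np.array([vals[cells == k].min() for k in range(2**n)])
    return infs[cells]                       # f_n evaluated on the grid

x = np.linspace(0.0, 1.0, 2**12, endpoint=False)
f3, f6 = lower_dyadic_approx(3), lower_dyadic_approx(6)

assert np.all(f3 <= f6 + 1e-12)              # the sequence increases
assert np.all(f6 <= f(x) + 1e-12)            # it stays below f
assert abs(f6.mean() - 1.0) < 0.02           # total mass approaches 1
```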
The condition in the theorem ensures that the decomposition is unique, but see also Theorem 11 (a).
Remark 2. A version of the theorem for the subfamily $\{{\rm unif}(0,\theta) : 0 < \theta < 1\}$ is of separate interest; it is also easier to state: Each $\theta$ has a unique binary expansion $\theta = \sum_{k=1}^{K(\theta)} 2^{-j_k(\theta)}$, with $K(\theta) < \infty$ if $\theta \in \mathbb Q_{\rm bin}$, and then, ignoring the dependence on $\theta$ in the notation,
$${\rm unif}(0,\theta) \,=\, \sum_{k=1}^{K} \frac{2^{-j_k}}{\theta}\, {\rm unif}(a_{k-1}, a_k),$$
with $a_0 := 0$ and $a_k := \sum_{l=1}^{k} 2^{-j_l}$ for $k > 0$. This will be taken up in Section 5.3.

Let $(\Omega, \mathcal A)$ and $(\Omega', \mathcal A')$ be measurable spaces and suppose that $T : \Omega \to \Omega'$ is $(\mathcal A, \mathcal A')$-measurable. We recall that the push-forward $P^T$ of a probability measure $P$ on $(\Omega, \mathcal A)$ under $T$ is the probability measure on $(\Omega', \mathcal A')$ given by $P^T(A') = P(T^{-1}(A'))$, $A' \in \mathcal A'$. This is also known as the image of $P$ under $T$.
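The decomposition in Remark 2 can be sketched directly: greedily extract the binary expansion of $\theta$ and collect the pieces ${\rm unif}(a_{k-1}, a_k)$ with weights $2^{-j_k}/\theta$ (the truncation at finitely many terms and the particular $\theta$ are only for illustration).

```python
import numpy as np

def dyadic_decomposition(theta, max_terms=40):
    """Greedy binary expansion theta = sum_k 2^{-j_k}; returns the
    intervals (a_{k-1}, a_k) and weights 2^{-j_k}/theta of Remark 2."""
    assert 0 < theta < 1
    intervals, weights, a, rest = [], [], 0.0, theta
    j = 1
    while rest > 0 and len(weights) < max_terms:
        if 2.0**-j <= rest:
            intervals.append((a, a + 2.0**-j))
            weights.append(2.0**-j / theta)
            a += 2.0**-j
            rest -= 2.0**-j
        j += 1
    return intervals, weights

theta = 0.8125  # = 2^-1 + 2^-2 + 2^-4, so the expansion is finite
ivs, ws = dyadic_decomposition(theta)

# the mixture of uniforms on the pieces reproduces unif(0, theta):
assert np.isclose(sum(ws), 1.0)
assert np.isclose(ivs[-1][1], theta)
# mixture density at any x in (0, theta) equals 1/theta
x = 0.3
dens = sum(w / (b - a) for (a, b), w in zip(ivs, ws) if a < x < b)
assert np.isclose(dens, 1 / theta)
```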
imsart-ejs ver. 2020/08/06 file: BaringhausGruebelEJSfinal.tex date: June 23, 2022

Remark 3. The following structural property of the above discrete mixture representation is worth noting; it also plays a role in the last part of the proof: A set $\{P_\theta : \theta \in \Theta\}$ of distributions on the Borel subsets of the real line is a location-scale family if the push-forward of any element under an affine-linear transformation $x \mapsto cx + d$, $c \neq 0$, is again an element of the family. Clearly, this holds for the family in Theorem 1. The (countable) subset of mixing distributions that we obtain if the interval bounds are restricted to binary rationals still enjoys this invariance property, provided that we restrict the shift $d$ to $\mathbb Q_{\rm bin}$ and $|c|$ to an integer power of 2.
It should be clear that a family of distributions has a discrete mixture representation if it can be written as the union of a finite or countably infinite collection of families with this property, or as a subset of such a family. Below we repeatedly use two further properties: For the first of these, suppose that (1) holds and that $\tau$ is a probability measure on $(\Theta, \mathcal B(\Theta))$. Then the $\tau$-mixture $R_\tau$ with base family $\{P_\theta : \theta \in \Theta\}$ can be written as
$$R_\tau \,=\, \sum_{i\in E} \tilde w_\tau(i)\, Q_i, \quad\text{with } \tilde w_\tau(i) := \int w_\theta(i)\, \tau(d\theta) \qquad (8)$$
(here we implicitly assume that the functions $\theta \mapsto w_\theta(i)$, $i \in E$, are measurable). Hence a family of mixtures of the original family again has a discrete mixture representation, even with the same family of mixing distributions. For the second property let $(S', \mathcal S')$ be another measurable space, let $T : S \to S'$ be measurable, and assume that $\{P_\theta : \theta \in \Theta\}$ has a mixture representation $P_\theta = \int Q(y, \cdot)\, \mu_\theta(dy)$ as in (2). Then the representation
$$P_\theta^T \,=\, \int Q^T(y, \cdot)\, \mu_\theta(dy) \qquad (9)$$
of the push-forwards holds, where $Q^T(y, \cdot)$ denotes the push-forward of $Q(y, \cdot)$ under $T$. Clearly, the first of these properties can be extended to non-discrete base families, and the second can easily be specialized to the case of discrete $E$. As mentioned in Section 1, noncentral distributions are of particular interest. For these, we start with a distribution $\mu$ on the real line and write $P_\theta$ for the distribution of $(X + \theta)^2$ or $|X + \theta|$, where the random variable $X$ is supposed to have distribution $\mu$. From (9) it is clear that for the existence of a discrete mixture representation it is irrelevant which of these possibilities we choose. Also, if $\mu$ is symmetric then we may assume that $\theta \ge 0$. Below, unless otherwise specified, densities refer to densities with respect to Lebesgue measure.
Theorem 4. Suppose that µ is symmetric and has a density f that is weakly decreasing on R + . Then the associated noncentral family has a discrete mixture representation.
The proof uses a representation of µ as a mixture of uniform distributions on the intervals (−y, y), y > 0.
Both the Cauchy distribution and the double exponential distribution satisfy the assumptions in Theorem 4 which means that, answering a question raised in Section 1, both noncentral families have a discrete mixture representation. In the following example we give some details for the double exponential case.
Example 5. The (standard) double exponential distribution ${\rm DExp}(1)$ is given by its density $x \mapsto e^{-|x|}/2$, $x \in \mathbb R$. It is easily checked that the corresponding representing measure $\mu$ in the proof of Theorem 4 is equal to the gamma distribution $\Gamma(2,1)$ with parameters 2 and 1, which has density function $x \mapsto x e^{-x}$, $x \ge 0$, so that
$${\rm DExp}(1) \,=\, \int_0^\infty {\rm unif}(-y, y)\, \Gamma(2,1)(dy). \qquad (10)$$
Now let $P_\theta$ be the distribution of $|X + \theta|$, where $X \sim {\rm DExp}(1)$ and $\theta \ge 0$. Then (10) implies
$$P_\theta \,=\, \int_0^\infty {\rm unif}(-y, y)^{T_\theta}\, \Gamma(2,1)(dy), \qquad (11)$$
with $T_\theta(x) = |x + \theta|$. It is easy to check that the push-forwards in (11) can all be written as a uniform distribution or as a mixture of two uniform distributions. In both cases Theorem 1 is applicable.
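The identification of the representing measure as $\Gamma(2,1)$ can be checked numerically: the mixture density of uniforms on $(-y, y)$ at a point $x$ is $\int_{|x|}^\infty (2y)^{-1}\, y e^{-y}\, dy$, which should equal $e^{-|x|}/2$. A sketch using scipy's `quad` (the test points are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

# Mixture density of DExp(1) = integral of unif(-y, y) over Gamma(2,1)(dy):
def mixture_density(x):
    val, _ = quad(lambda y: (1.0 / (2 * y)) * y * np.exp(-y), abs(x), np.inf)
    return val

for x in (-2.0, -0.3, 0.0, 1.5):
    assert np.isclose(mixture_density(x), np.exp(-abs(x)) / 2, atol=1e-8)
```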
In Theorem 11 below we consider the mixture family associated with the mixing distributions that appear in (5) in more detail.

Representations with continuity properties
We recall that noncentral chi-squared distributions $\chi^2_n(\theta^2)$ with $n$ degrees of freedom may be written as
$$\chi^2_n(\theta^2) \,=\, \sum_{i=0}^\infty {\rm Po}(\theta^2/2)(\{i\})\, \chi^2_{n+2i}, \qquad (12)$$
where ${\rm Po}(\lambda)$ abbreviates the Poisson distribution with parameter $\lambda$ and $\chi^2_m$ is the (central) chi-squared distribution with $m$ degrees of freedom, i.e. the distribution of $X_1^2 + \cdots + X_m^2$ if $X_1, \dots, X_m$ are independent standard normal random variables. This provides a discrete mixture representation for the noncentral chi-squared distributions $\chi^2_n(\theta^2)$ with arbitrary $(n, \theta^2) \in \mathbb N \times (0, \infty)$, and taking $n = 1$ leads to such a representation for the subfamily addressed in the introduction. In both cases we may take $E$ to be the set of integers $i > 0$ and $\{\chi^2_i : i \in E\}$ as the family of mixing distributions. In contrast to these and the similarly explicit representations obtained in [2], our results in Section 2 with uniforms as mixing distributions seem to be less 'usable'. In particular, they differ with respect to smoothness properties. For example, we may view $\theta \mapsto (i \mapsto w_\theta(i))$ as a function on $\Theta$ with values in the Banach space $(\ell^1, \|\cdot\|_1)$, and it is easy to see that the functions given by the mixing coefficients in (12) are continuous in this sense. Notice though that in Theorem 1 and the results built on it we do not require the mixing coefficients to be continuous. What happens if we impose continuity assumptions on the discrete mixture representation? We suggest to formalize these by regarding (1) as taking place in a certain Banach space $(B, \|\cdot\|)$; indeed, we have already done so when passing from (1) to (3). In the classical case (12) with $n > 1$ we may for example use the space of bounded continuous functions on $\mathbb R_+$ with the supremum norm. The continuous densities of $\chi^2_n$, $n > 1$, are all bounded, and it is easy to see that the respective norms tend to 0 as $n \to \infty$, which implies boundedness of the whole family in $(B, \|\cdot\|)$. For the subfamily with $n = 1$ we may similarly use $C_b([\epsilon, \infty))$ with an arbitrary $\epsilon > 0$ (the continuous density of $\chi^2_1$ is not bounded near 0).
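The boundedness claim for the central chi-squared densities can be checked directly; the grid and the selected degrees of freedom below are arbitrary choices.

```python
import numpy as np
from scipy.stats import chi2

# Sup norms of the central chi-squared densities for n > 1: they are
# bounded and tend to 0 as n grows, so {chi^2_n : n > 1} is a bounded
# subset of C_b(R_+) with the supremum norm.
x = np.linspace(0.0, 200.0, 200001)
sup_norm = {n: chi2.pdf(x, n).max() for n in (2, 5, 10, 50, 100)}

assert sup_norm[2] > sup_norm[5] > sup_norm[10] > sup_norm[50] > sup_norm[100]
assert sup_norm[100] < 0.03
```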
The following simple result can be used to show that for a given family $\{P_\theta : \theta \in \Theta\}$ a representation of the type (1) is only possible if the corresponding smoothness assumptions are not too strong. In it, we regard $\theta \mapsto w_\theta$ as a function on $\Theta$ with values in the Banach space $(\ell^1, \|\cdot\|_1)$.

Proposition 6. Suppose that the representation (1) holds in some Banach space $(B, \|\cdot\|)$, and that

(C) the function $\theta \mapsto w_\theta$ is continuous,

(B) the family $\{Q_i : i \in E\}$ is bounded in $(B, \|\cdot\|)$.

Then the function $\theta \mapsto P_\theta$ on $\Theta$ with values in $B$ is continuous.
Variations of this result are easily obtained. If (C) is amplified to Lipschitz continuity, for example, then $\theta \mapsto P_\theta$ is Lipschitz continuous too. Further, if $E = \mathbb N$ then exponential weights can be introduced: (B) can be relaxed or amplified to the boundedness of $\rho^k Q_k$, $k \in \mathbb N$, for some $\rho > 0$, if a corresponding bound is assumed to hold for the mixing coefficients.
(b) Continuity of course also depends on the topology chosen on $\Theta$. In fact, from a probabilistic point of view, especially in connection with the standard model for an infinitely repeated toss of a fair coin, one might argue for $\Theta = \{0,1\}^\infty$ instead of $\Theta = (0,1)$ in the context of the family $\{{\rm unif}(0,\theta) : 0 < \theta < 1\}$. With the discrete topology on $\{0,1\}$ and the product topology on the new $\Theta$, the function $\theta \mapsto (i \mapsto w_\theta(i))$ then turns out to be continuous; see also Proposition 13 (b).
We revisit two noncentral families under this continuity perspective. Below we write $l$ for the Lebesgue measure on the Borel subsets $\mathcal B$ of $\mathbb R$ and $l_+$ for its restriction to the Borel subsets $\mathcal B_+$ of $\mathbb R_+$ or $\mathbb R_+ \setminus \{0\}$. Continuity of a mixture representation refers to a notion of convergence for probability measures. In (1) convergence of the series is the convergence of real numbers. As the mixing coefficients are nonnegative and summable, this automatically amplifies to convergence in total variation norm of the partial sums if we regard these as measures. For a dominated sequence this in turn is equivalent to $L^1$-convergence of the respective densities. In the special case $(S, \mathcal S) = (\mathbb R, \mathcal B)$ and with dominating measure $l$, the distance of probability measures then refers to the distance of densities, that is, of functions $f : \mathbb R \to \mathbb R_+$. For these, other notions, stronger than $L^1$-convergence, can be used, for example the distance based on the essential supremum, or distances that use smoothness properties of the functions.
Example 8. We consider the uniform distributions ${\rm unif}(\theta, \theta+1)$, $\theta \in \mathbb R$. Clearly, all $P_\theta$ can be interpreted, via their densities, as elements of the Banach space $B = L^\infty(\mathbb R, \mathcal B, l)$ with the essential supremum norm. Further, $\theta \mapsto P_\theta$ is $\|\cdot\|_{\rm ess\,sup}$-bounded. For a bounded set of mixing distributions and with continuity of the mixing coefficients, a representation of the form (1) would imply that $\theta \mapsto P_\theta$ is continuous, which is obviously not the case: For example, $\|P_{\theta+1/n} - P_\theta\|_{\rm ess\,sup} \ge 1$ for all $\theta \in \mathbb R$, $n \in \mathbb N$.
Example 9. Let $\mu = {\rm DExp}(1)$ as in Example 5. We consider the corresponding noncentral distributions $P_\theta$, $\theta \ge 0$, where $P_\theta$ is the distribution of $|X + \theta|$ and $X$ has distribution $\mu$. Then a continuous density of $P_\theta$ is given by
$$f_\theta(y) \,=\, \tfrac12\bigl(e^{-|y-\theta|} + e^{-(y+\theta)}\bigr), \quad y \ge 0.$$
This implies that $f_\theta(y) = f_\theta(0) + \int_0^y g_\theta(z)\, dz$ for all $y \ge 0$, with
$$g_\theta(z) \,=\, \tfrac12\bigl({\rm sgn}(\theta - z)\, e^{-|z-\theta|} - e^{-(z+\theta)}\bigr).$$
In particular, $f_\theta$ is differentiable except at $y = \theta$, where its derivative $g_\theta$ has a jump of size 1. We can interpret all $P_\theta$, via the $g_\theta$, as elements of the Banach space $B = L^\infty(\mathbb R_+, \mathcal B_+, l_+)$. Repeating the step from (1) to (3) we may regard a representation as taking place in $B$. In this space, $\theta \mapsto P_\theta$ is not continuous, so a discrete mixture representation for the noncentral family associated with ${\rm DExp}(1)$ satisfying the assumptions (C) and (B) does not exist.

Geometric aspects
We briefly sketch Dynkin's approach [5] to convex measurable structures in the context of the present situation. Let $M_1 = M_1(S, \mathcal S)$ be the set of all probability measures on $(S, \mathcal S)$ and let $\mathcal B(M_1)$ be the $\sigma$-field on $M_1$ generated by the projections $P \mapsto P(A)$, $A \in \mathcal S$. Then each probability measure $\Xi$ on $(M_1, \mathcal B(M_1))$ defines a probability measure $P = \Psi(\Xi)$ on $(S, \mathcal S)$, the barycenter of $\Xi$, via
$$P(A) \,=\, \int_{M_1} R(A)\, \Xi(dR), \quad A \in \mathcal S. \qquad (14)$$
Note that we integrate with respect to a measure $\Xi$ on a set of probability measures, which means that the integration variable $R$ is a probability measure. We call a set $M \subset M_1$ barycenter convex if $\Psi(\Xi) \in M$ for every probability measure $\Xi$ on $(M_1, \mathcal B(M_1))$ with $\Xi(M) = 1$. The classical notion of convexity appears if we restrict this to measures $\Xi$ that are concentrated on a finite number of points in $M$. The set of all probability measures on $(\mathbb N, \mathcal P(\mathbb N))$ with finite support provides an example of a family that is classically convex but not barycenter convex; here and below $\mathcal P(A)$ denotes the power set associated with a set $A$, i.e. the set of all subsets of $A$. Now let $(E, \mathcal E)$ be another measurable space and let $Q$ be a transition probability from $(E, \mathcal E)$ to $(S, \mathcal S)$. We write ${\rm Mix}\{Q(y, \cdot) : y \in E\}$ for the set of probability measures on $(S, \mathcal S)$ that arise as $Q$-mixtures, see (2). It is easy to see that $\Phi : E \to M_1$, $y \mapsto Q(y, \cdot)$, is $(\mathcal E, \mathcal B(M_1))$-measurable, and that the mixed distribution $P_\tau$ is the barycenter of the push-forward $\Xi = \tau^\Phi$ of $\tau$ under $\Phi$. Further, from the behavior of mixtures under push-forwards, see (9), it follows that such mixture families are barycenter convex; indeed, ${\rm Mix}\{Q(y, \cdot) : y \in E\}$ may be seen as the barycentric convex hull of the family $\{Q(y, \cdot) : y \in E\}$.
The basic equation (1) then says that $\{P_\theta : \theta \in \Theta\}$ is a subset of ${\rm Mix}\{Q_i : i \in E\}$, where we have written $Q_i$ instead of $Q(i, \cdot)$. By the mixture-of-mixtures formula (8) this implies
$${\rm Mix}\{P_\theta : \theta \in \Theta\} \,\subset\, {\rm Mix}\{Q_i : i \in E\},$$
and it seems natural to call a discrete mixture representation (1) minimal if
$${\rm Mix}\{P_\theta : \theta \in \Theta\} \,=\, {\rm Mix}\{Q_i : i \in E\}.$$
In this context, a description of the respective extreme points is of interest. Suppose that $\mu = \sum_{i\in E} p_i Q_i \in {\rm Mix}\{Q_i : i \in E\}$ is a 'true' mixture in the sense that $0 < p_j < 1$ for some $j \in E$. Then $\mu$ can be written as
$$\mu \,=\, p_j Q_j + (1 - p_j) \sum_{i \in E,\, i \neq j} \frac{p_i}{1 - p_j}\, Q_i,$$
a nontrivial convex combination of two elements of ${\rm Mix}\{Q_i : i \in E\}$, so that $\mu$ is not an extreme point of this set. Notice that this argument uses barycenter convexity; indeed, for sets that are convex in this stronger sense the classical notion of extreme points (not a nontrivial finite affine combination of other points of the set) and the barycentric version (not representable in the sense of (14) by some $\Xi$ that is not concentrated at one point) are the same. For noncentral chi-squared distributions we have the following result.
The proof of the first part implies that the family ${\rm Mix}\{\chi^2_1(\theta) : \theta \ge 0\}$ is a simplex, where again the notion refers to general barycenters.
We next consider the situation in Theorem 1. Again, absolute continuity refers to the Lebesgue measure $l$.

Theorem 11. (a) The set of extreme points of ${\rm Mix}\{Q_{(p,q)} : (p,q) \in E\}$ is empty.
(b) Let $\mu$ be a probability measure on $(\mathbb R, \mathcal B)$. If $\mu$ is absolutely continuous with a density that is Riemann integrable on all compact intervals, then $\mu$ is an element of ${\rm Mix}\{Q_{(p,q)} : (p,q) \in E\}$.
(c) There exists a probability measure on $(\mathbb R, \mathcal B)$ that is absolutely continuous and that is not an element of ${\rm Mix}\{Q_{(p,q)} : (p,q) \in E\}$.
The above approach can be used to obtain similar results for other families of distributions. For example, in ${\rm Mix}\{{\rm unif}(0,\theta) : \theta > 0\}$ the mixing distributions are extreme, and the family contains the family $\mathcal F$ of all distributions on $(0, \infty)$ with a weakly decreasing density. Obviously, ${\rm unif}(0,\theta) \in \mathcal F$. Taken together this shows that $\mathcal F$, which has a discrete mixture representation by Theorem 4, does not have a minimal discrete mixture representation: in a minimal representation the extreme points ${\rm unif}(0,\theta)$, $\theta > 0$, would all have to appear among the countably many mixing distributions.
With respect to the general approach in this section we point out that, in contrast to the classical functional-analytic results on convexity and associated representations, see e.g. [20], we have not used any topological concepts (other than those for the real line that are inherent in the Lebesgue integral). Nevertheless, it is interesting to compare mixture families to the closed convex hull of the mixing distributions. This set of course depends on the topology in use. For example, for ${\rm Mix}\{{\rm unif}(0,\theta) : \theta > 0,\ \theta \in \mathbb Q\}$ and the total variation norm (or, equivalently, the $L^1$-norm for the respective densities), the closed convex hull is strictly larger than the mixture family itself. Part (a) of Theorem 11 may also be of interest in this connection; see also the corresponding remarks in the introduction.
Finally, we briefly consider what happens if we drop the assumption that the mixing coefficients are nonnegative. For example, what is the closure of the linear span of the shifted Cauchy distributions in the space $L^1$? A famous result from functional analysis, see e.g. Theorem 9.5 in [21], states that this set is the whole of $L^1$ if the characteristic function (the Fourier transform of the density) has no zeroes. In the Cauchy case, this is the function $x \mapsto e^{-|x|}$, so the condition is satisfied. Clearly, the corresponding mixture family is much smaller.

Statistical aspects
The structural decomposition given by a mixture representation is closely related to the statistical concept of sufficiency. As in (2), let $(E, \mathcal E)$ and $(S, \mathcal S)$ be measurable spaces and let $Q$ be a Markov kernel from $(E, \mathcal E)$ to $(S, \mathcal S)$. For any probability measure $\tau$ on $(E, \mathcal E)$ we may define the probability measure $\tau \otimes Q$ on the product space $(E \times S, \mathcal E \otimes \mathcal S)$ by
$$(\tau \otimes Q)(A \times B) \,=\, \int_A Q(y, B)\, \tau(dy), \quad A \in \mathcal E,\ B \in \mathcal S.$$
Let $T$ and $X$ be the projections $(y, x) \mapsto y$ and $(y, x) \mapsto x$, and suppose that $D$ is a non-empty subset of the set $M_1(E, \mathcal E)$ of probability measures on $(E, \mathcal E)$. By construction, $T$ is then sufficient for the set $\{\tau \otimes Q : \tau \in D\}$, and $\{P_\tau : \tau \in D\}$ with $P_\tau = (\tau \otimes Q)^X$ is the corresponding set of distributions for the second component $X$, with $P_\tau$ as in (2). This shows that, given a mixture representation, it is possible to enlarge the sample space such that a sufficient statistic appears. Of course, this is also quite evident from the two-stage interpretation of mixture experiments. Also, we noted in the remarks following (2) that a mixture representation always exists if we take $Q(y, \cdot) = \delta_y$. Here this corresponds to the statement that $X$ itself is a sufficient statistic. Conversely, suppose that we have a set of probability measures $\{P_\tau : \tau \in D\}$ on $(S, \mathcal S)$, indexed by $\tau \in D \subset M_1(E, \mathcal E)$, with the property that on an enlarged space $(E \times S, \mathcal E \otimes \mathcal S)$ there is a family $\{R_\tau : \tau \in D\}$ of probability measures such that for each $\tau \in D$ the push-forward $R_\tau^X$ under the projection $X$ on the second coordinate is $P_\tau$, the push-forward $R_\tau^T$ under the projection $T$ on the first coordinate is $\tau$, and $T$ is sufficient for $\{R_\tau : \tau \in D\}$. Then, if $(S, \mathcal S)$ is a Borel space, there exists a Markov kernel $Q$ from $(E, \mathcal E)$ to $(S, \mathcal S)$ such that (2) holds for all $\tau \in D$; for more details see [15].
From this point of view we may rephrase our quest for a discrete mixture representation as the search for a discrete sufficient statistic on a possibly enlarged base space. Such an enlargement may indeed be necessary: If $T : S \to E$ is sufficient for a family $\{P_\theta : \theta \in \Theta\}$ of distributions on $(S, \mathcal S)$ dominated by some $\sigma$-finite measure $\nu$, and if $E$ is countable, then the Neyman criterion implies that for $\theta_1, \theta_2 \in \Theta$ with $\theta_1 \neq \theta_2$ the ratio $x \mapsto p_{\theta_1}(x)/p_{\theta_2}(x)$ of the associated $\nu$-densities takes only countably many values. In the classical case, with $P_\theta = \chi^2_1(2\theta)$ for $\theta > 0$, this ratio is a continuous non-constant function, and we would obtain a contradiction to the intermediate value theorem.
A discrete mixture representation such as (1) connects the statistical experiments $E = (E, \mathcal E, \{W_\theta : \theta \in \Theta\})$, with $W_\theta$ again given by $W_\theta(\{i\}) = w_\theta(i)$ for all $i \in E$, and $F = (S, \mathcal S, \{P_\theta : \theta \in \Theta\})$, and also leads to an embedding of the parametric experiment $F$ into the nonparametric experiment
$$M \,=\, \bigl(S, \mathcal S, \{P_\tau : \tau \in M_1(E, \mathcal P(E))\}\bigr).$$
Further, with the notation introduced above, we obtain the experiment
$$R \,=\, \bigl(E \times S, \mathcal E \otimes \mathcal S, \{W_\theta \otimes Q : \theta \in \Theta\}\bigr).$$
After a short subsection on the connection to nonparametric Bayesian inference we investigate various statistical aspects that relate the experiments $E$, $F$, $M$ and $R$. In the context of comparing experiments the connection between geometry and statistics can be seen as an underlying thread, a connection that currently receives much attention under the heading 'information geometry'. A classical instance is the use of the Hilbert space Pythagorean theorem in the proof of the Cramér-Rao lower bound, which plays a role in Subsection 5.3. Also, concepts from differential geometry are increasingly used in statistics, with [1] and [6] being two early influential contributions. A specific aspect of this influence is the interpretation of Fisher information as curvature. In the situation considered here this leads to an interpretation of the results in Subsection 5.2 as a reduction of curvature ('flattening') in the transition from $E$ to $F$. Finally, convexity considerations in the context of sufficiency are basic themes in [14] and [5].

Priors on sets of probability measures
We refer the reader to the recent textbook [9] for a general introduction to nonparametric Bayesian inference.
As in Section 4, let $M$ be a subset of the set $M_1(S, \mathcal S)$ of all probability measures on $(S, \mathcal S)$ and let $\mathcal M$ be the $\sigma$-field on $M$ generated by the maps $P \mapsto P(A)$, $A \in \mathcal S$. The data $x \in S$ are regarded as a realization of a random variable $X$, and $M$ is the set of potential distributions for $X$. Our prior knowledge is formalized by a probability measure $\Xi$ on $(M, \mathcal M)$, the prior distribution. An important problem of nonparametric Bayesian inference is the construction of a set of such measures $\Xi$ that is flexible in the sense that the transition to posterior distributions is manageable and does not leave the set, and for which the Bernstein-von Mises theorem applies. In his seminal paper, Ferguson [8] puts special emphasis on Dirichlet processes for the case that $M$ is the set of all probability measures on $(S, \mathcal S)$. These priors satisfy the above requirements, but they are concentrated on the set of discrete distributions.
Hoff [12] points out the relevance of mixture representations for the construction of prior probabilities on convex sets of probability measures. There, a necessary first step is the construction of a suitable mixture representation; see [12, Section 4] for an interesting variety of worked examples. Here we start with a mixture representation. For example, in the statistical experiment M we have M = {Σ_{i∈E} p(i) Q_i : p ∈ M_1(E, P(E))}, and the construction of probability measures on M is essentially equivalent to the construction of probability measures on M_1(E, P(E)). In the classical situation and in many other cases of interest, E = N₀, and the latter problem can be approached via a stick-breaking procedure. The detour via a discrete mixture representation means that the posterior distribution obtained from the prior and the data would still be absolutely continuous with respect to any measure dominating M (in fact, even smoothness properties of the densities would be retained).
An important alternative to Dirichlet processes are tree-based priors. To see the connection to discrete mixture representations we assume, as at the beginning of Section 2, that S is generated by the union of σ-fields F_n, n ∈ N, where (F_n)_{n∈N} is a filtration and each F_n, n ∈ N, is generated by a finite partition {F_{n,1}, ..., F_{n,k_n}}. This structure leads to a tree with node set (n, k), n ∈ N, k = 1, ..., k_n, and edges between (n, k) and (n+1, j) whenever F_{n,k} ⊃ F_{n+1,j}. A suitable assignment of probabilities p_{(n,k),(n+1,j)} to these edges then defines a probability measure P on (S, S), with P(F_{n+1,j} | F_{n,k}) = p_{(n,k),(n+1,j)} for all edges {(n, k), (n+1, j)}. Choosing random assignments leads to tree-based priors Ξ, such as tail-free processes and Pólya trees.
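For the dyadic partitions of (0, 1], multiplying the conditional edge probabilities along the paths of the tree recovers the masses of the level-n partition sets. A small sketch, with exact rational arithmetic; the helper name `tree_masses` and the example distribution function are ours, purely for illustration:

```python
from fractions import Fraction

def tree_masses(cdf, depth):
    """Multiply conditional edge probabilities p_{(n,k),(n+1,j)} down the
    dyadic tree; returns the masses P(F_{depth,k}) of the level-'depth'
    intervals ((k-1)/2**depth, k/2**depth]."""
    paths = [Fraction(1)]                     # products of edge probabilities
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(depth):
        new_paths, new_intervals = [], []
        for prod, (lo, hi) in zip(paths, intervals):
            mid = (lo + hi) / 2
            mass = cdf(hi) - cdf(lo)
            for child in ((lo, mid), (mid, hi)):
                # conditional probability of the child interval given the parent
                cond = (cdf(child[1]) - cdf(child[0])) / mass if mass else Fraction(0)
                new_paths.append(prod * cond)
                new_intervals.append(child)
        paths, intervals = new_paths, new_intervals
    return paths

# example: distribution with density 2x on (0, 1), i.e. cdf(x) = x**2
masses = tree_masses(lambda x: x * x, 3)
```

The products telescope, so the result coincides with the direct interval masses, illustrating that the depth-n approximation is the restriction of P to F_n.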
Conversely, choosing the weights via the conditional probabilities provides a tree-based representation for a given distribution P. The approximation P_n that results if the conditional probabilities of the nodes are multiplied up to depth n is simply the restriction of P to the σ-field F_n. For a discrete mixture representation, however, we need a sequence of approximations that is increasing. Such a sequence of subprobability measures can be obtained via a dominating measure, as explained at the beginning of Section 2.

Fisher information
We first consider the experiments E and F defined above, where the parameter set Θ is assumed to be an open subset of R^d. In what follows we confine ourselves to giving the definition of the Fisher information matrix for the experiment F. Let f(·, θ) be the density of the distribution P_θ with respect to some σ-finite measure on (S, S). We denote by θ_1, ..., θ_d the components of the column vector θ ∈ Θ. Under suitable standard regularity conditions, see, e.g., [16], the integrals

i_{F;jk}(θ) := ∫ (∂/∂θ_j log f(x, θ)) (∂/∂θ_k log f(x, θ)) P_θ(dx), 1 ≤ j, k ≤ d, (17)

exist, and i_F(θ) = (i_{F;jk}(θ))_{1≤j,k≤d} is a symmetric positive definite d × d matrix, which is called the Fisher information matrix associated with the experiment F. Assuming that the corresponding standard conditions hold for the experiment E, there is a Fisher information matrix i_E(θ) associated with E as well. The experiments E and F are related in the sense that there is a Markov kernel Q from (E, E) to (S, S) such that (1) holds, namely Q(i, ·) = Q_i. Following Le Cam [15], the experiment F is then said to be reproducible from the experiment E (or, equivalently, E is said to be better than F). Notice that the standard terminology that E is sufficient for F is not adopted by Le Cam. Under certain regularity conditions it then follows that

i_E(θ) − i_F(θ) is positive semidefinite, (18)

see [10] or [23]. The latter author deals with the one-dimensional case d = 1. Then (18) reduces to i_F(θ) ≤ i_E(θ). It is of interest to quantify i_E(θ) − i_F(θ), which can be viewed as the loss of information when switching from E to F. We tackle this problem in the special case arising with the mixture representation

χ²₁(2θ) = Σ_{k=0}^∞ (e^{−θ} θ^k / k!) χ²_{2k+1}, θ ∈ Θ := (0, ∞). (19)

The above conditions for (17) are then satisfied for the experiments E and F that are specified in the following result.
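The representation (19) can be checked numerically: the density of (Z + √(2θ))², with Z standard normal, must coincide with the Poisson(θ) mixture of the central chi-squared densities. A minimal sketch; the helper names are ours, and the noncentral density is computed directly from the normal density rather than from the series, so the comparison is not circular:

```python
import math

def chi2_pdf(df, x):
    """Density of the central chi-squared distribution with df degrees of freedom."""
    return x ** (df / 2 - 1) * math.exp(-x / 2) / (2 ** (df / 2) * math.gamma(df / 2))

def noncentral_chi2_1_pdf(lam, x):
    """Density of (Z + sqrt(lam))**2, Z standard normal: 1 df, noncentrality lam."""
    phi = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
    s = math.sqrt(x)
    return (phi(s - math.sqrt(lam)) + phi(s + math.sqrt(lam))) / (2 * s)

def mixture_pdf(theta, x, terms=80):
    """Poisson(theta) mixture of central chi-squared densities with 2k+1 df."""
    return sum(math.exp(-theta) * theta ** k / math.factorial(k) * chi2_pdf(2 * k + 1, x)
               for k in range(terms))
```

Truncating the series at 80 terms is harmless here because the Poisson tail is negligible for moderate θ.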
In particular, r(θ) := θ i_F(θ), θ > 0, defines a strictly increasing function r. It follows from (22) that the transition from E to F causes a loss of information of more than fifty percent, and (21) implies that this bound is asymptotically tight as θ → ∞.
For a general discussion of estimation procedures for estimating the unknown parameter θ based on observations of X we refer to [13].

MSE reduction
A quantitative comparison of F and R, the original and the enlarged experiment, can be obtained through the reduction of the mean squared error (MSE) of estimators by conditioning on the sufficient statistic, a procedure sometimes called 'Rao-Blackwellization'. In the classical case (19), with Θ = (0, ∞) and P_θ = χ²₁(2θ), the enlargement leads to the product space (N₀ × R_{>0}, P(N₀) ⊗ B_{>0}), with the distributions R_θ specified by R_θ({k} × A) = Po(θ)({k}) · χ²_{2k+1}(A) for all k ∈ N₀, A ∈ B_{>0}. Again, T and X are the canonical projections on N₀ and R_{>0} respectively, and T is sufficient for {R_θ : θ ∈ Θ}. In this situation, θ̂ := (X − 1)/2 is an unbiased estimator for the unknown parameter θ, and conditioning on the sufficient statistic leads to the estimator θ̃ := E_θ[θ̂ | T] = T. The MSE reduction may be given explicitly in this situation: MSE_θ(θ̂) = 2θ + 1/2, whereas MSE_θ(θ̃) = var_θ(T) = θ. We note that, in view of the Cramér-Rao lower bound for the variance of an unbiased estimator, the formula for MSE_θ(θ̂) can also be used to obtain a lower bound for the Fisher information i_F considered in the previous subsection. In fact, inspecting the geometric argument in the proof of the bound it is not difficult to show that the inequality is strict, so that the result of the previous subsection may be augmented as follows,

i_F(θ) > (2θ + 1/2)^{−1} for all θ > 0, (23)

see also Figure 1, where the integral in (20) has been computed numerically. Note that (23) leads to an alternative proof of the second limit relation stated in (21). Further, standard arguments show that var_θ(θ̂) > i_F(θ)^{−1} implies that the estimator θ̂ is not asymptotically efficient and thus inferior to the maximum likelihood estimator; see also Subsection 5.4. As another, somewhat less straightforward example we consider the family of uniform distributions P_θ = unif(0, θ), θ ∈ Θ := (0, 1), on (S, S) = ((0, 1), B_{(0,1)}) together with the mixture representation in Remark 2.
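The two mean squared errors in the classical case can be checked by simulation in the enlarged experiment: draw T from Po(θ), then X from the central chi-squared distribution with 2T + 1 degrees of freedom. A minimal Monte Carlo sketch; the function and variable names are our own illustrative choices:

```python
import math
import random

def simulate_mse(theta, n=200_000, seed=1):
    """Monte Carlo MSEs of the moment estimator (X - 1)/2 and of its
    Rao-Blackwellization T in the enlarged experiment R (classical case)."""
    rng = random.Random(seed)

    def poisson(mean):
        # Knuth's multiplication method; adequate for small means
        limit, k, prod = math.exp(-mean), 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        return k

    se_moment = se_rb = 0.0
    for _ in range(n):
        t = poisson(theta)                      # T ~ Po(theta)
        x = rng.gammavariate(t + 0.5, 2.0)      # X | T=t ~ chi-squared, 2t+1 df
        se_moment += ((x - 1) / 2 - theta) ** 2
        se_rb += (t - theta) ** 2
    return se_moment / n, se_rb / n
```

For θ = 2 the empirical values should be close to 2θ + 1/2 = 4.5 and θ = 2 respectively.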
In order to carry out Rao-Blackwellization for the moment estimator θ̂ = 2X, X ~ P_θ, we use the binary representation: For a given θ ∈ Θ, let K(θ), j_k(θ) and a_k(θ) be as in (7). Let E := Θ ∩ Q_bin, and for t ∈ E let d(t) := 2^{−K(t)}, where K(t) is obtained from the binary representation of t. Given θ ∈ Θ we assign the weight w_θ(a_m(θ)) = d(a_m(θ))/θ to each a_m(θ), m = 1, ..., K(θ), and put w_θ(t) = 0 otherwise. Further, with each t ∈ E we associate the mixing distribution Q(t, ·) = unif(t − d(t), t). Then the discrete mixture representation (7) may be written as

P_θ = Σ_{t∈E} w_θ(t) Q(t, ·).

For example, if θ = 3/4 then K(θ) = 2, a_1(θ) = 1/2, a_2(θ) = 3/4, d(a_1(θ)) = 1/2, d(a_2(θ)) = 1/4, w_θ(1/2) = 2/3, w_θ(3/4) = 1/3, and we arrive at

unif(0, 3/4) = (2/3) unif(0, 1/2) + (1/3) unif(1/2, 3/4).

On the product space (E × (0, 1), P(E) ⊗ B_{(0,1)}) we obtain the probability measures R_θ specified by R_θ({t} × A) = w_θ(t) Q(t, A) for all t ∈ E, A ∈ B_{(0,1)}. With T and X the canonical projections on E and (0, 1) respectively, T is sufficient for {R_θ : θ ∈ Θ}. For the moment estimator θ̂ = 2X, conditioning on the sufficient statistic now leads to the estimator θ̃ := E_θ[θ̂ | T] = 2T − d(T). As in Proposition 12 we first give a general formula and then derive some properties of the function of interest.
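For binary rational θ the weights and atoms of this representation can be computed exactly with rational arithmetic; a small sketch (the function name is ours, and we restrict to binary rationals so that the greedy digit expansion terminates):

```python
from fractions import Fraction

def binary_mixture(theta):
    """Mixture representation of unif(0, theta) for a binary rational theta
    in (0, 1): returns the pairs (t, w_theta(t)) with mixing distribution
    unif(t - d(t), t), where d(t) = 2**(-K(t)) from the binary expansion."""
    theta = Fraction(theta)
    assert theta.denominator & (theta.denominator - 1) == 0, "need a binary rational"
    parts, partial, k = [], Fraction(0), 0
    while partial < theta:
        k += 1
        bit = Fraction(1, 2 ** k)
        if partial + bit <= theta:
            partial += bit                     # partial = a_m, bit = d(a_m)
            parts.append((partial, bit / theta))
    return parts
```

For θ = 3/4 this reproduces the weights 2/3 at 1/2 and 1/3 at 3/4 from the example above.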
Proposition 13. (a) With the notation introduced above, the mean squared error of the conditioned moment estimator θ̃ = 2T − d(T) is given by (24). (b) The function θ ↦ φ(θ) := var_θ(θ̃) is continuous on Θ \ E, and on E it is right continuous and has left hand limits. Moreover, (25) gives the jump height if θ is a binary rational of the form θ = q2^{−L}, with L, q ∈ N and q odd.
The property 'right continuous, with left hand limits' is often abbreviated to 'càdlàg'.
For the unconditioned estimator θ̂ we have var_θ(θ̂) = θ²/3. Both estimators are unbiased. Figure 2 shows the variances (mean squared errors) of θ̂ and θ̃ for θ an integer multiple of 2^{−12}, with the interpolation justified by Proposition 13 (b). The figure suggests a self-similarity property: Indeed, as X ~ P_{θ/2} and 2X ~ P_θ are equivalent, it follows that both variance functions have the scaling property φ(θ/2) = φ(θ)/4. Further, if θ = 2^{−m} for some m ∈ N then the variance of θ̃ is zero, as the distribution of T is degenerate for such values of the parameter.

An asymptotically efficient estimator for the mean functional
We consider the classical case (19). In the corresponding mixture model we then have the probability simplex Θ = {p = (p_k)_{k∈N₀} : p_k ≥ 0, Σ_{k=0}^∞ p_k = 1} as parameter set, and the statistical experiment M is based on the family P = {P_p : P_p = Σ_{k=0}^∞ p_k χ²_{2k+1}, p ∈ Θ} of mixture distributions parametrized by p ∈ Θ. The family {χ²₁(2θ) : θ > 0} underlying F is a subfamily of P.
Our aim is to show that the moment estimator for the mean functional, which we know from the previous subsections not to be efficient in the context of F, is asymptotically efficient at each P_p ∈ P that has non-vanishing mixing coefficients with finite variance. We refer the reader to [24, Sections 1.2, 2.1 and 3.1] for an exposition of the general theory needed here. In comparison with the situation in Proposition 12 there are three main differences: First, instead of a parametric family with a parameter θ specifying the distribution, we now have a one-dimensional parameter function κ : P → R, where κ(P) does not fully characterize P. Secondly, instead of the variance of an estimator θ̂ for θ we now consider the variance of the limiting normal distribution in an associated central limit theorem for a sequence of estimators for κ(P). The third point requires some machinery. The basic idea is to consider dominated one-dimensional submodels {P_t : 0 ≤ t < ε} ⊂ P with P_0 = P and then use the derivative at t = 0, in analogy to (17). This leads to a tangent set, where the geometry is that of a Hilbert space of square integrable functions. Again speaking somewhat informally, a lower bound can then be obtained from the supremum of the associated second moments. Finally, an estimator sequence is said to be asymptotically efficient at P for the parameter function κ if asymptotic variance and lower bound are the same for this P.
Let P₀ be the subset of distributions P_p ∈ P with p = (p_k)_{k∈N₀} satisfying the moment condition Σ_{k=0}^∞ k² p_k < ∞ and also the condition that p_k > 0 for each k ∈ N₀. Consider the mean functional κ : P₀ → R defined by κ(P_p) = Σ_{k=0}^∞ k p_k. Let X_1, ..., X_n, ... be a sequence of independent and identically distributed random variables with distribution P_p ∈ P₀, which is assumed to be unknown. Then, generalizing the moment estimator that already appeared in Subsection 5.3, T_n := (1/n) Σ_{i=1}^n (X_i − 1)/2 leads to a sequence (T_n)_{n∈N} of unbiased estimators for κ(P_p), and the central limit theorem shows that, as n → ∞,

√n (T_n − κ(P_p)) → N(0, v(P_p) + κ(P_p) + 1/2) in distribution, (26)

where v(P_p) = Σ_{k=0}^∞ (k − κ(P_p))² p_k.
Theorem 14. For estimating κ(P p ), the estimator sequence (T n ) ∞ n=1 is asymptotically efficient at each P p ∈ P 0 .
If P_p = χ²₁(2θ) then v(P_p) = κ(P_p) = θ, so that the variance of the limit distribution of √n (T_n − κ(P_p)) is equal to 2θ + 1/2, in accordance with the value found in Subsection 5.3.

NPMLE and EM
As we pointed out in the introduction a distribution P θ with the mixture representation P θ = i∈E w θ (i) Q i may be seen as the distribution of the outcome X in a two-stage experiment: First, choose Y in E with distribution (mass function) w θ , then, if Y = i, choose X according to Q i . In a sample x 1 , . . . , x n of size n from P θ we may then think of the corresponding y 1 , . . . , y n as hidden (or latent) variables, or as missing covariates, see [18,Sect. 1.3.5 and 1.3.6]. A non-parametric generalization leads to the problem of estimating the mixing probabilities p i , i ∈ E, in a representation P = i∈E p i Q i of the unknown distribution P , with the Q i known, from the data x 1 , . . . , x n . We may of course regard the sequence p = (p i ) i∈E as a parameter, where the parameter space is now the probability simplex on E, as in the previous subsection.
The EM algorithm, see [4] and [18, Sect. 3.4], can be used to obtain a sequence p(l), l = 1, 2, ..., of approximations to the corresponding non-parametric maximum likelihood estimator (NPMLE). We give the details for the classical case, where E = N₀ and Q_k is the central chi-squared distribution with 2k + 1 degrees of freedom, with continuous density g_k. The log-likelihood function L_X is then given by

L_X(p) = Σ_{i=1}^n log ( Σ_{k=0}^∞ p_k g_k(x_i) ). (27)

With the corresponding y-values known, the log-likelihood function would be

L_{X,Y}(p) = Σ_{i=1}^n log p_{y_i} + r(x_1, ..., x_n; y_1, ..., y_n), (28)

where the function r does not depend on p. In the EM algorithm, we obtain the next approximation p(l+1) for the NPMLE from the current value p(l) as the argmax (M-step) of the conditional expectation (E-step) of L_{X,Y}(p) given x_1, ..., x_n. The E-step boils down to the calculation of E_{p(l)}[log p_{Y_i} | X_i = x_i]. Under P_{p(l)} the joint distribution of X_i and Y_i has density (x, k) ↦ p_k(l) g_k(x) with respect to the product of Lebesgue measure on R_+ and counting measure on N₀, so that the conditional probability of Y_i = k given X_i = x becomes

P_{p(l)}(Y_i = k | X_i = x) = p_k(l) g_k(x) / Σ_{j=0}^∞ p_j(l) g_j(x). (29)

The M-step then requires maximization of the function

p ↦ Σ_{k=0}^∞ q_n(k) log p_k, with q_n(k) := (1/n) Σ_{i=1}^n P_{p(l)}(Y_i = k | X_i = x_i).

As q_n is a probability mass function, Gibbs' inequality can be used to show that the desired argmax is given by p(l+1) = q_n. Using this in (28) we arrive at

p_k(l+1) = (1/n) Σ_{i=1}^n p_k(l) g_k(x_i) / Σ_{j=0}^∞ p_j(l) g_j(x_i), k ∈ N₀.

For chi-squared mixing distributions the factors e^{−x/2} cancel in (29) and, more importantly, the infinite sums can be truncated to the range from 0 to K, where K is the smallest integer with 2K + 1 ≥ max_{1≤i≤n} x_i. To see this we note that g_k(x)/g_{k−1}(x) = x/(2k − 1) for all k ∈ N, x > 0. Hence, if some k > K appears in (27), then the corresponding mass p_k can be shifted to the left without decreasing the value of L_X. The left part of Figure 3 shows the results of four simulation experiments, each with n = 10000 observations, where the base distribution is a noncentral chi-squared with one degree of freedom and noncentrality parameter θ = 2.
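The update p(l+1) = q_n translates directly into a few lines of code; a minimal sketch of one EM iteration for the mixing weights, with a fixed truncation level (the helper names are ours, purely for illustration):

```python
import math

def g(k, x):
    """Density of the central chi-squared distribution with 2k+1 df."""
    a = k + 0.5
    return x ** (a - 1) * math.exp(-x / 2) / (2 ** a * math.gamma(a))

def em_step(p, xs):
    """One EM update for the mixing weights: p(l+1) = q_n."""
    K = len(p)
    q = [0.0] * K
    for x in xs:
        denom = sum(p[j] * g(j, x) for j in range(K))
        for k in range(K):
            q[k] += p[k] * g(k, x) / denom   # posterior weight of component k
    return [qk / len(xs) for qk in q]

def loglik(p, xs):
    """Observed-data log-likelihood L_X(p)."""
    return sum(math.log(sum(p[k] * g(k, x) for k in range(len(p)))) for x in xs)

def em(xs, K, steps=200):
    """Iterate EM from the uniform starting value on {0, ..., K-1}."""
    p = [1.0 / K] * K
    for _ in range(steps):
        p = em_step(p, xs)
    return p
```

Each step keeps the weights in the probability simplex, and the observed-data log-likelihood is non-decreasing along the iterations, as the general EM theory guarantees.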
The vertical red lines represent the probability mass p k of the true mixing distribution, which is Poisson with parameter 2, at the positions k = 0, 1, . . . , 7. The four estimates of these masses are computed numerically by using the EM algorithm and are shown in blue slightly to the right.
For noncentral distribution families the nonparametric version incorporates a variant of the deconvolution problem [18,Sect. 1.3.19]: In the chi-squared case, we regard the data as realizations of random variables X = (Y + Z) 2 with Y, Z independent, Y standard normal, and an unknown distribution µ for Z. The procedure outlined above can be applied and leads to an estimator for the mixed Poisson distribution associated with µ. This in turn can then be used to obtain an estimator for µ, by combining NPMLE and EM again, for example.
The right hand part of Figure 3 shows the results of four simulation experiments where Z = 2√W with W exponentially distributed with parameter 1. The corresponding mixed Poisson distribution is the geometric distribution on N₀ with parameter 1/2. Again, the vertical red lines represent the true mixing distribution, the four estimates are slightly to the right and in blue, and each simulation is based on n = 10000 observations. We should point out that a rather large number n of observations is required in order to obtain estimators with small mean squared error.
Given a sequence X_1, ..., X_n, ... of independent and identically distributed random variables with distribution P_p = Σ_{k=0}^∞ p_k χ²_{1+2k}, where the parameter p = (p_k)_{k∈N₀} in the probability simplex on N₀ is unknown, the non-parametric maximum likelihood estimator p̂_{NPMLE,n} of p based on X_1, ..., X_n exists. Using the general consistency statement given in [19, Theorem 5.3] we deduce that the sequence (p̂_{NPMLE,n})_{n∈N} is strongly consistent for p.

Summary and outlook
We have investigated geometrical and statistical aspects, and their interaction, in the context of discrete mixture representations. It turned out that there are connections to a variety of theoretical and applied topics, including
- tree-based constructions of probability measures, with connections to nonparametric Bayesian inference (Theorem 1, Section 5.1),
- the view of mixture representations as curves in an infinite-dimensional space (Section 3),
- Dynkin's barycentric approach to convexity (Section 4),
- the effect of mixing on Fisher information (Section 5.2) and mean squared error (Section 5.3), and
- algorithmic aspects, notably the use of the EM algorithm (Section 5.5).
We have almost exclusively considered the one-dimensional case, i.e. families of probability distributions on the real line, leaving multivariate situations to a future paper. Another aspect that seems to be worth a separate investigation is the structure of the family {Q i : i ∈ E} of mixing distributions in a representation such as (1). In the classical case (12), with noncentral chi-squared distributions, these are the convolution powers of one specific probability measure, and we noted in Remark 3 that the mixing distributions in Theorem 1 constitute a location-scale family of a specific type. Taken together, the multivariate case and the structure of the mixing family are also of interest for applications in the general area of stochastic processes, with relations to the Dynkin isomorphism (see [2,Section 4.2]). In an applied context it would certainly be interesting to remove the assumption in Section 5.5 that the mixing family is known. This would lead to a statistical inverse problem, and structural assumptions could be used, if applicable, to reduce its complexity.
Acknowledgment. We thank the associate editor and the two referees for their detailed and constructive comments, which have led to a considerable improvement of the paper.

Appendix: Proofs
Proof of Theorem 1. Consider the infinite rooted binary tree in which each node has one left and one right descendant. Any x ∈ (0, 1) that is not a binary rational defines a unique infinite path through this tree, starting at the root node and moving to the left or right descendant if the next digit in its binary expansion is 0 or 1 respectively. With x = 1/3, for example, we would move to the left and to the right in odd and even steps alternatingly. Our first aim is to obtain a decomposition of (a, b) ⊂ (0, 1) into intervals with binary rational endpoints. For this we label the nodes of the tree by appropriately chosen intervals, using (4), and then collect these along the paths given by a and b.
In order to carry this out formally, let a, b ∈ R \ Q_bin, 0 < a < b < 1, be given, with binary expansions a = Σ_{k=1}^∞ a_k 2^{−k} and b = Σ_{k=1}^∞ b_k 2^{−k}, where a_k, b_k ∈ {0, 1} for all k ∈ N. Let j_1 < j_2 < ··· be the positions of the digit 1 in the expansion of b. The corresponding partial sums b(l) := Σ_{k=1}^{j_l} b_k 2^{−k}, l ∈ N, then provide a strictly increasing sequence of binary rational numbers with limit b. By construction, 2^{j_{l+1}} (b(l+1) − b(l)) = 1 and 2^{j_{l+1}} b(l+1) − 1 is even for all l ∈ N. Similarly, and now writing j_1 < j_2 < ··· for the positions of the digit 0 in the expansion of a, and with a(l) := 2^{−j_l} + Σ_{k=1}^{j_l} a_k 2^{−k}, l ∈ N, we obtain a strictly decreasing sequence (a(l))_{l∈N} ⊂ Q_bin with limit a and the property that 2^{j_{l+1}} a(l+1) is odd, and (2^{j_{l+1}} a(l+1) − 1) 2^{−j_{l+1}} = a(l+1) − 2^{−j_{l+1}} < a for all l ∈ N.
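The two approximating sequences can be computed explicitly; the following small sketch (function names ours) works with exact binary digits via rational arithmetic:

```python
from fractions import Fraction

def binary_digits(x, n):
    """First n binary digits of x in (0, 1)."""
    ds = []
    for _ in range(n):
        x *= 2
        d = x.numerator // x.denominator  # integer part, 0 or 1
        ds.append(d)
        x -= d
    return ds

def increasing_approx(b, n):
    """b(l): truncate b at the positions of the digit 1."""
    ds = binary_digits(b, n)
    out, partial = [], Fraction(0)
    for k, d in enumerate(ds, start=1):
        partial += Fraction(d, 2 ** k)
        if d == 1:
            out.append(partial)
    return out

def decreasing_approx(a, n):
    """a(l): round a up at the positions of the digit 0."""
    ds = binary_digits(a, n)
    out, partial = [], Fraction(0)
    for k, d in enumerate(ds, start=1):
        partial += Fraction(d, 2 ** k)
        if d == 0:
            out.append(partial + Fraction(1, 2 ** k))
    return out
```

For a = b = 1/3, with digits 0, 1, 0, 1, ..., this yields b(l) = 1/4, 5/16, 21/64, ... increasing to 1/3 and a(l) = 1/2, 3/8, 11/32, ... decreasing to 1/3.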
It remains to remove the restrictions on a and b. For elements of Q bin the binary representation is not unique, but if we agree to choose the representation with a finite number of digits 1 in the case of b and a finite number of digits 0 in the case of a then the only change in the above argument is that the respective sequence of j-values would be finite.
In order to remove the assumption 0 < a, b < 1 we first describe the way in which the representation interacts with affine transformations T : R → R of the form T(x) = 2^k x + r, with k ∈ Z and r ∈ Q_bin. Such functions are bijections, and their inverse is of the same type. Further, they leave E invariant, if applied to the components of the pair, and it is easy to check that the image of unif(p, q) under T is unif(T(p), T(q)) for every (p, q) ∈ E, with the mixing weights unchanged, which means that the representation then also holds for (T(a), T(b)). Any pair (a, b) ∈ Θ can be transformed by such T's into the unit interval, so the desired representation holds for all (a, b) ∈ Θ.
Proof of Theorem 4. It is well-known, and follows easily from Shepp's theorem [7, p.158], that a probability measure P on the positive half-line with weakly decreasing density f (which we may take to be left continuous) may be written as a mixture of uniform distributions on the intervals [0, y], y > 0. For completeness we give a simple and direct argument: First, a σ-finite measure ρ on (R_+, B_+) can be defined via ρ([y, ∞)) = f(y) for all y > 0. Then let ν be the measure with density y ↦ y with respect to ρ. It is easy to check that ν has total mass 1. Also, for all x > 0,

P((0, x]) = ∫₀ˣ f(u) du = ∫₀ˣ ∫ 1_{[u,∞)}(y) ρ(dy) du = ∫ ( ∫₀ˣ (1/y) 1_{(0,y]}(u) du ) ν(dy),

and the integrand u ↦ (1/y) 1_{(0,y]}(u) in the last expression is a density for unif(0, y).
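For the unit exponential density f(y) = e^{−y}, the measure ρ has density e^{−y} and ν has density y e^{−y}; a numerical sanity check of the mixture formula (helper name and discretization ours):

```python
import math

def mixture_cdf(x, grid_n=20_000, y_max=40.0):
    """P((0, x]) reconstructed as the nu-mixture of unif(0, y) laws,
    for the unit exponential density f(y) = exp(-y); here
    rho(dy) = exp(-y) dy, so nu(dy) = y * exp(-y) dy."""
    h = y_max / grid_n
    total = 0.0
    for i in range(grid_n):
        y = (i + 0.5) * h                        # midpoint rule
        nu_density = y * math.exp(-y)
        total += min(x, y) / y * nu_density * h  # unif(0,y)((0,x]) = min(x,y)/y
    return total
```

The result should agree with the exponential distribution function 1 − e^{−x} up to the discretization error.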
With µ and f as in the statement of the theorem we may write µ = (1/2)(µ₀ + µ₀^T) with T(x) = −x, and with µ₀ a distribution that is concentrated on the positive half-line and has a weakly decreasing density there. Using the first part, we get

µ = ∫ (1/2) (unif(0, y) + unif(−y, 0)) ν(dy), (31)

where ν is the mixing measure for µ₀. We now apply the general principles represented by (8) (repeated mixtures) and (9) (behaviour under push-forwards): First, if X has distribution µ and Q_θ is the distribution of X + θ, representation (31) implies that

Q_θ = ∫ (1/2) (unif(θ, θ + y) + unif(θ − y, θ)) ν(dy).

Thus, the family {Q_θ : θ ∈ R} has a discrete mixture representation in terms of the countable set of uniform distributions on intervals with binary rational endpoints introduced in connection with Theorem 1. The transfer argument for push-forwards, with T(x) = |x| or T(x) = x², now completes the proof.
Proof of Proposition 6. Assume without loss of generality that E = N. Let C := sup_{k∈N} Q_k. Then, for all η, θ ∈ Θ,

Proof of Theorem 10. (a) Suppose that µ ∈ Mix{χ²_{1+2k} : k ∈ N₀} has two mixture representations, µ = Σ_{k=0}^∞ p_k χ²_{1+2k} = Σ_{k=0}^∞ q_k χ²_{1+2k}, with p_k, q_k ≥ 0 and Σ p_k = Σ q_k = 1. Then the respective densities must be equal almost everywhere, so that the continuous versions agree on (0, ∞). This gives

Σ_{k=0}^∞ p_k f_{2k+1}(x) = Σ_{k=0}^∞ q_k f_{2k+1}(x) for all x > 0,

with f_{2k+1} the density of χ²_{2k+1}. Multiplying both sides by x^{1/2} e^{x/2} and using the uniqueness of the coefficients in a power series representation we get p_k = q_k for all k ∈ N₀. Hence the mixture representation of elements of Mix{χ²_{1+2k} : k ∈ N₀} is unique. Now suppose that µ is a mixture of noncentral chi-squared distributions with one degree of freedom and mixing measure ν on the parameter space Θ = [0, ∞). The classical representation (12) with n = 1 then gives

µ = ∫ χ²₁(θ) ν(dθ) = Σ_{k=0}^∞ ( ∫ e^{−θ/2} (θ/2)^k / k! ν(dθ) ) χ²_{1+2k},

so that the mixing coefficients are the probabilities of a mixed Poisson distribution. In order to prove the strict subset relation it is therefore enough to name a distribution on N₀ that is not a mixed Poisson distribution, such as δ₁. In particular, Mix{χ²₁(θ) : θ ≥ 0} is a proper subset of Mix{χ²_{1+2k} : k ∈ N₀}. To finish the proof of part (a) we use the uniqueness of the representation to show that each χ²₁(η), η > 0, is an extreme element of the convex set Mix{χ²₁(θ) : θ ≥ 0}. Indeed, suppose that χ²₁(η) = αµ₁ + (1 − α)µ₂ with 0 < α < 1, and that ν₁, ν₂ are the parameter distributions for µ₁, µ₂. But then χ²₁(η) would itself be a mixture of the distributions χ²₁(θ), θ ≥ 0, with mixing measure ν := αν₁ + (1 − α)ν₂. The uniqueness now implies that ν is concentrated at η, so that µ₁ and µ₂ are equal.
This shows that the set of extreme elements of Mix{χ²₁(θ) : θ ≥ 0} is uncountable, in contrast to the set of extreme elements of Mix{χ²_{1+2k} : k ∈ N₀}. (b) We know from (12) that each χ²_k(θ), θ ≥ 0 and k ∈ N, has a representation in terms of central chi-squared distributions, so that, by the mixture-of-mixtures property (8), the left set is contained in the set on the right side of the assertion. Conversely, χ²_k = χ²_k(0), which means that the mixing distributions are elements of the left set, hence so are their mixtures.
(p, q) ∈ E} by Theorem 1. Further, the differences g M,m+1 − g M,m are linear combinations with nonnegative coefficients of the indicators of I M,m+1,k , k = 1, . . . , 2 m+1 . Taken together this shows that g M,∞ /µ(I M ) is the density of an element of Mix{Q (p,q) : (p, q) ∈ E} whenever µ(I M ) > 0. As µ itself is a mixture of these, an appeal to the repeated mixture property (8) completes the proof of (b).
(c) We start with a familiar construction from measure theory, see e.g. [3, Problem 2.5 b), p. 65]. Let q_k, k ∈ N, be an enumeration of the rational numbers in the interval (0, 1). Choose for each k ∈ N some n_k ∈ N, n_k > k + 1, and let I_k be an open interval of length 2^{−n_k} that contains q_k. Then the union A := ∪_{k=1}^∞ I_k is a non-empty open subset of [0, 1], the Lebesgue measure α := l(A) of which is at most Σ_{k=1}^∞ 2^{−n_k} ≤ Σ_{k=1}^∞ 2^{−k−2} = 1/4 < 1. The set A is dense in [0, 1]. Let µ be the uniform distribution on the complement A^c := [0, 1] \ A, which has Lebesgue measure 1 − α > 0. The support of µ does not contain any interval of positive length, in contrast to the support of the elements of Mix{Q_i : i ∈ E}.
Proof of Proposition 13. (a) Let θ ∈ Θ be given. We first assume that K(θ) = ∞ and suppress the dependence on θ where convenient. In terms of the sequences (j_k)_{k∈N} and (a_m)_{m∈N₀} we have

R_θ(T = a_m) = (a_m − a_{m−1})/θ = 1/(θ 2^{j_m}) for all m ∈ N,

and R_θ(T = t) = 0 for all other elements t of E. Conditionally on T = a_m, X is uniformly distributed on the interval (a_{m−1}, a_m], and θ̃ = 2T − d(T) takes the value 2a_m − (a_m − a_{m−1}) = a_m + a_{m−1} on {T = a_m}. With these definitions we obtain

E_θ θ̃ = θ^{−1} Σ_{m=1}^∞ (a_m + a_{m−1}) (a_m − a_{m−1}) = θ^{−1} Σ_{m=1}^∞ (a_m² − a_{m−1}²) = θ,

which, of course, is also immediate from the tower property of conditional expectations. Similarly,

E_θ θ̃² = θ^{−1} Σ_{m=1}^∞ (a_m + a_{m−1})² (a_m − a_{m−1}) = θ^{−1} Σ_{m=1}^∞ (a_m³ − a_{m−1}³) + θ^{−1} Σ_{m=1}^∞ a_m a_{m−1} (a_m − a_{m−1}).
In the above expression for E_θ θ̃², the first sum telescopes and lim_{m→∞} a_m³ = θ³, which completes the proof of (24) in the case that K(θ) = ∞. For binary rational parameter values the same arguments apply, only that the sums are now finite and no limits are needed.
(b) We first consider the function Ψ(θ) := w_θ that maps Θ to the set of probability measures on N together with the total variation distance or, equivalently via probability mass functions, to the space (ℓ¹(N), ‖·‖₁). Clearly, Ψ = Φ₁ ∘ Φ₂, where Φ₂ maps θ = Σ_{k=1}^∞ b_k 2^{−k} to the sequence b(θ) = (b_k)_{k∈N} ∈ {0, 1}^N of digits in the binary expansion that has infinitely many 0's, and Φ₁(b)({k}) = b_k 2^{−k}/θ, k ∈ N, with θ = Σ_{j=1}^∞ b_j 2^{−j}. On the intermediate space of 0-1 sequences we consider the product topology, where a sequence of sequences converges if the respective components converge, which here means that the components do not change from some sequence index onwards (which may depend on the index of the component; informally, all components eventually 'freeze').
The outer function Φ 1 is continuous in view of Scheffé's lemma. For any θ ∈ Θ \ E there are infinitely many 0's (and 1's) in b(θ). If b k (θ) = 0 and b k+1 (θ) = 1 then θ is strictly inside a binary rational interval of length 2 −k . Hence, if θ n → θ then θ n is also contained in this interval from some n 0 onwards, which in turn implies that the first k digits remain constant. This shows that the inner function Φ 2 is continuous on Θ \ E. In a similar fashion we obtain that Φ 2 is right continuous on E, and that in E its left hand limits exist.
The conditioned estimatorθ = 2T − d(T ) is bounded. Hence Lebesgue's dominated convergence theorem can be applied, so that we finally obtain that θ → var θ (θ) is càdlàg and fully continuous in those parameter values that are not binary rationals.
It remains to prove the formula for the jumps of φ at binary rationals. Using the notation from the first part of the proof it is clear that, for θ ∈ E, θ φ(θ) is a finite sum. From (24) we know that

θ φ(θ) = Σ_{k=1}^{K−1} a_k a_{k−1} (a_k − a_{k−1}) + a_K a_{K−1} (a_K − a_{K−1}).
Using a_K = θ and a_K − a_{K−1} = 2^{−L} we now obtain (25) after a short calculation.
Proof of Theorem 14. Fix P_p ∈ P₀. Denote by Θ_d the subset of elements q = (q_k)_{k∈N₀} of Θ with finitely many non-zero mixing coefficients q_k. For q = (q_k)_{k∈N₀} ∈ Θ_d, 0 ≤ a < ∞ and 0 ≤ t < min(1, 1/a) let P_{p,q;a,t} := P_p + at (P_q − P_p).
Then v_p = Σ_{k=0}^∞ p_k f_{2k+1} is a density of P_p, and v_{p,q;a,t} = v_p + at Σ_{k=0}^∞ (q_k − p_k) f_{2k+1} is a density of P_{p,q;a,t}, both with respect to the Lebesgue measure l_+ on the Borel subsets of the positive half-line. The partial derivative v̇_{p,q;a,t} = (∂/∂t) v_{p,q;a,t} = a Σ_{k=0}^∞ (q_k − p_k) f_{2k+1} does not depend on t, hence t ↦ v_{p,q;a,t}^{1/2} is continuously differentiable on the interval [0, min(1, 1/a)). Note that c_{p,q} := max{1, sup_{k∈N₀} q_k/p_k} < ∞, because only finitely many of the q_k are non-zero. Due to v̇²_{p,q;a,t}/v_{p,q;a,t} ≤ a² (c_{p,q} + 1)² v_p/(1 − at), the path t ↦ P_{p,q;a,t} is differentiable in quadratic mean at t = 0, with score function g_{p,q;a} := v̇_{p,q;a,0}/v_p, and the tangent set Ṗ_{P_p} := {g_{p,q;a} : q ∈ Θ_d, a ≥ 0} of the model P₀ at P_p is a convex cone. Regarding this cone as a subset of L₂(P_p), we claim that κ̃_{P_p}, defined by κ̃_{P_p}(x) := (x − 1)/2 − κ(P_p), is an element of the closed linear span lin Ṗ_{P_p} of Ṗ_{P_p}. To see this, we first note that, with (2k − 1)!! := 1 if k = 0 and (2k − 1)!! := (2k − 1)(2k − 3) ⋯ 3 · 1 if k ∈ N, it holds that f_{2k+1}(x) = (1/(2k − 1)!!) x^k f_1(x).
Thus, for each r ∈ N₀ and δ_r = (δ_{rk})_{k∈N₀} with δ_{rr} = 1 and δ_{rk} = 0 otherwise, the score g_{p,δ_r;1} = f_{2r+1}/v_p − 1 belongs to Ṗ_{P_p}, and κ̃_{P_p} can be written as a series in these functions, where the series converges pointwise and in L₂(P_p). Therefore, as asserted, κ̃_{P_p} ∈ lin Ṗ_{P_p}. Due to κ(P_{p,q;a,t}) − κ(P_p) = t ∫ κ̃_{P_p} g_{p,q;a} dP_p for each q ∈ Θ_d, each a ≥ 0 and each 0 ≤ t < min(1, 1/a), the functional κ is differentiable at P_p relative to the tangent cone Ṗ_{P_p}. Hence κ̃_{P_p} is the efficient influence function for estimating the functional κ at P_p. By (26) and Lemma 2.9 in [24], the sequence of estimators (T_n)_{n=1}^∞ for estimating κ is thus asymptotically efficient at P_p.