Revisiting consistency of a recursive estimator of mixing distributions

Estimation of the mixing distribution under a general mixture model is a very difficult problem, especially when the mixing distribution is assumed to have a density. Predictive recursion (PR) is a fast, recursive algorithm for nonparametric estimation of a mixing distribution/density in general mixture models. However, the existing PR consistency results make rather strong assumptions, some of which fail for a class of mixture models relevant for monotone density estimation, namely, scale mixtures of uniform kernels. In this paper, we develop new consistency results for PR under weaker conditions. Armed with this new theory, we prove that PR is consistent for the scale mixture of uniforms problem, and we show that the corresponding PR mixture density estimator has very good practical performance compared to several existing methods for monotone density estimation.


Introduction
Mixture models are widely used in statistics and machine learning, often for density estimation and clustering. Here we will be considering a general version of the mixture model, where the mixture density is given by

m_P(x) = ∫_U k(x | u) P(du),   (1)

where k is a known kernel, i.e., x → k(x | u) is a density for each u ∈ U, and P is the unknown mixing distribution on (the Borel σ-algebra of) U. An advantage of this general form is its flexibility: depending on the kernel, the mixture density m_P can take virtually any shape (e.g., DasGupta 2008, p. 572), making such mixtures a powerful modeling tool for robust, nonparametric density estimation. Here we will assume that we have independent and identically distributed observations from a density m. Our motivating application is monotone density estimation, where monotone non-increasing densities on [0, ∞) can be characterized as mixtures of the form (1) with U = [0, ∞) and k(x | u) = u^{-1} 1_{[0,u]}(x), the uniform kernel Unif(x | 0, u), where 1_A(x) denotes the indicator function of a set A. But for this particular kernel, it is not possible to check the sufficient conditions for consistency of the PR estimator required in, e.g., Theorem 4.5 of Martin and Tokdar (2009). Similar issues arise in other mixture model applications. Motivated by this deficiency in the state of the art, the focus of the present paper is to establish new asymptotic consistency properties for the PR estimator under weaker and more easily verified conditions. Following a brief review of the existing theory for PR in Section 2, we establish convergence properties of the PR estimator, both the mixture density and the mixing distribution, under weaker conditions in Section 3. We then apply these new results in Section 4 to our motivating example, namely, monotone density estimation via mixtures of uniform kernels. There we first give a characterization of the best mixing distribution and mixture density within a special class of uniform mixtures.
This characterization suggests a particular formulation of the PR algorithm, and we use the general results presented in Section 3 to prove that PR consistently estimates this best mixture. Our choice to focus on a special class of uniform mixtures generally introduces some model misspecification bias, but we show that this bias is a vanishing function of two user-specified parameters. Therefore, the bias has no practical impact on PR's performance, as our numerical examples confirm. Finally, some concluding remarks are given in Section 5. Technical details and proofs are presented in the Appendix.

Background on PR
As mentioned briefly above, PR is a stochastic algorithm designed for fast, nonparametric estimation of mixing distributions. The algorithm's inputs include the kernel k, an initial guess P₀ of the mixing distribution, supported on U, a rule for defining a sequence of weights i → w_i ∈ (0, 1), and a sequence of data points X_1, X_2, …. Then the recursive updates first presented in Newton et al. (1998) define the PR algorithm:

P_i(du) = (1 − w_i) P_{i−1}(du) + w_i · k(X_i | u) P_{i−1}(du) / ∫_U k(X_i | v) P_{i−1}(dv),   u ∈ U,  i ≥ 1.   (2)
After n data points have been observed, the mixing distribution estimator is P_n, and the corresponding mixture density estimator is m_n = m_{P_n}, defined according to (1). To understand the motivation behind PR, observe that the i-th PR update is just a weighted average of P_{i−1}(du) and the posterior for U with prior P_{i−1}(du) and kernel likelihood k(X_i | u). The weights, w_i, need to be decreasing in i, but not too quickly; this will be made more precise below. Some recent and novel applications of PR can be found in Scott et al. (2015), Tansey et al. (2018), and Woody et al. (2021).
If the user has a specific dominating measure µ on U in mind, then that information can be incorporated into the algorithm. That is, the updates in (2) can be expressed in terms of the density or Radon-Nikodym derivative p_i = dP_i/dµ as

p_i(u) = p_{i−1}(u) { (1 − w_i) + w_i · k(X_i | u) / ∫_U k(X_i | v) p_{i−1}(v) µ(dv) },   u ∈ U,  i ≥ 1,

where p₀ = dP₀/dµ is the initial guess. Therefore, PR can be used to estimate a mixing density, in contrast to the nonparametric MLE, which is almost surely discrete. Moreover, when the densities are evaluated on a fixed grid in U, and the normalizing constant in the denominator is evaluated using quadrature, computation of the PR estimate, P_n, is fast and simple, done in O(n) operations, compared to the nonparametric MLE or a Bayesian estimate based on Markov chain Monte Carlo (MCMC).
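To make the grid-based implementation concrete, here is a minimal sketch of the recursion on a fixed grid. The normal kernel, the uniform initial guess, and weights decaying like i^{-0.67} are our illustrative choices, and the function names are ours, not from any PR software.

```python
import numpy as np

def pr_estimate(x_data, kernel, u_grid, p0, gamma=0.67):
    """Predictive recursion on a fixed grid, with Riemann-sum quadrature.

    x_data : data points, processed one at a time
    kernel : kernel(x, u_grid) -> values of k(x | u) on the grid
    u_grid : equally spaced grid approximating the support U
    p0     : initial mixing density evaluated on u_grid
    gamma  : weight decay exponent; w_i = (i + 1)^(-gamma)
    """
    du = u_grid[1] - u_grid[0]
    p = p0.copy()
    for i, x in enumerate(x_data, start=1):
        w = (i + 1.0) ** (-gamma)
        k = kernel(x, u_grid)
        denom = np.sum(k * p) * du          # normalizing constant m_{i-1}(x)
        p = (1 - w) * p + w * k * p / denom
    return p                                # density of P_n on u_grid

def normal_kernel(x, u):
    # k(x | u) = N(x | u, 1)
    return np.exp(-0.5 * (x - u) ** 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.5, size=500)          # data; the true density need not be in the model
grid = np.linspace(-3.0, 3.0, 200)
p0 = np.ones(grid.size)
p0 /= p0.sum() * (grid[1] - grid[0])        # uniform initial guess, normalized on the grid
p_n = pr_estimate(x, normal_kernel, grid, p0)
```

Each update costs one pass over the grid, so the full estimate indeed takes O(n) work, in contrast with MCMC.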
The above algorithm is described for the case when data points are arriving one at a time but, of course, the same procedure can be carried out when the data X_1, …, X_n come in a batch. When data are both batched and iid, as we consider here, one might be troubled by the fact that P_n depends on the order in which the data are processed. In particular, while there are some potential advantages to PR's order-dependence (see Dixit and Martin 2019), it implies that P_n is not a function of a minimal sufficient statistic. To overcome this, Newton (2002) suggested that one could evaluate the estimator P_n separately on several random permutations of the data sequence and then take averages over permutations. This can be seen as a Monte Carlo estimate of the Rao-Blackwellized estimator, the average over all permutations. It has been shown empirically (e.g., Martin and Tokdar 2012) that it only takes a few random permutations to remove the order-dependence, so, with the inherent computational efficiency of PR, the permutation-averaged version is still much faster than, say, MCMC.
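The permutation-averaging fix can be sketched as follows. The single-pass routine mirrors the recursion described above with a normal kernel, but the function names, the weight exponent, and the default of 25 permutations are our illustrative choices.

```python
import numpy as np

def pr_pass(x_seq, grid, p0, gamma=0.67):
    """One predictive recursion pass (normal kernel, unit sd) in the given data order."""
    du = grid[1] - grid[0]
    p = p0.copy()
    for i, x in enumerate(x_seq, start=1):
        w = (i + 1.0) ** (-gamma)
        k = np.exp(-0.5 * (x - grid) ** 2) / np.sqrt(2 * np.pi)
        p = (1 - w) * p + w * k * p / (np.sum(k * p) * du)
    return p

def pr_permutation_average(x, grid, p0, n_perm=25, seed=0):
    """Average the order-dependent PR estimate over random data orderings,
    a Monte Carlo stand-in for the Rao-Blackwellized, order-free estimator."""
    rng = np.random.default_rng(seed)
    out = np.zeros_like(p0)
    for _ in range(n_perm):
        out += pr_pass(rng.permutation(x), grid, p0)
    return out / n_perm

rng = np.random.default_rng(1)
x = rng.normal(size=200)
grid = np.linspace(-4.0, 4.0, 161)
p0 = np.ones(grid.size)
p0 /= p0.sum() * (grid[1] - grid[0])
p_bar = pr_permutation_average(x, grid, p0, n_perm=5)
```

Since each pass is O(n), averaging over a handful of permutations multiplies the cost by only a small constant.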
Since they are neither Bayesian nor maximum likelihood estimators, it is not immediately obvious that the PR estimates, P_n and m_n, would have any desirable statistical properties. It has, however, been shown that, under certain conditions, both P_n and m_n are consistent estimators. Before stating these sufficient conditions for consistency, we need to describe what the PR estimates are estimating in general.
Suppose the true density of the iid data X_1, …, X_n is m. Of course, there is generally no way to know whether m can be expressed as a mixture of the form (1) for a particular kernel k. When the mixture model is incorrectly specified, there is no "P" for the PR estimator P_n to converge to, and we cannot expect m_n to be a consistent estimator of m. Instead, there may be a mixture density, m†(x) = ∫ k(x | u) P†(du), that is "closest" to m, and then P_n and m_n would converge to P† and m†, respectively. Proximity here is measured in terms of the Kullback-Leibler divergence,

K(f, g) = ∫ f(x) log{f(x)/g(x)} dx.

More precisely, let P denote (a possibly proper subset of) the collection of probability distributions P on U, and define the corresponding set of mixtures of the form (1) for a given kernel k,

M = {m_P : P ∈ P̄},

where P̄ is the closure of P with respect to the weak topology, i.e., P plus all possible limits of weakly convergent sequences in P. To avoid vacuous cases, we will assume that K(m, m′) is finite for at least one m′ ∈ M. This is not a trivial assumption, however; see Section 4. In this case, the "best approximation" of m in M is the Kullback-Leibler minimizer, m†, that satisfies

K(m, m†) = inf{K(m, m′) : m′ ∈ M}.

A relevant question is whether such a minimizer exists and whether it is unique. Assuming that K(m, m′) is finite for at least one m′ ∈ M, and given that m′ → K(m, m′) is a convex function, we can expect that a minimizer m† exists and is unique. Existence of a P† corresponding to m† is guaranteed by assuming certain conditions on k and U; see Conditions A1 and A2 in Martin and Tokdar (2009) and, more generally, Liese and Vajda (1987, Ch. 8). However, uniqueness of P† requires identifiability of the mixture model (1) in P̄. In Tokdar et al. (2009), consistency of the PR estimators was established in the case where the mixture model is correctly specified, i.e., when m ∈ M, so that there exists a true P ∈ P̄. That is, under certain conditions, they showed that K(m, m_n) → 0 almost surely and that P_n → P weakly almost surely.
Martin and Tokdar (2009) extended these consistency results to the case where the mixture model is not necessarily correctly specified, i.e., where possibly m ∉ M. This extension is a practically important one, as it provides a theoretical basis for the PR-based marginal likelihood estimation framework developed in Martin and Tokdar (2011) and later applied in, e.g., Martin and Han (2016) and Dixit and Martin (2020). Under conditions slightly stronger than those given in Tokdar et al. (2009) for the correctly specified case, they showed that K(m, m_n) → K(m, m†) and P_n → P† weakly, both almost surely. This implies, for example, that the PR estimates do the best they can, asymptotically, relative to the specified model. It turns out, however, that the sufficient conditions stated in Martin and Tokdar (2009), very similar to those in Tokdar et al. (2009), are rather restrictive. The most problematic of those assumptions is the following:

sup_{u,ũ ∈ U} ∫ {k(x | u) / k(x | ũ)}² m(x) dx < ∞.   (4)

For nice kernels like k(x | u) = N(x | u, σ²), for fixed σ² > 0, if U is compact and m has Gaussian-like tails, then (4) can be satisfied. However, if m is heavier-tailed, then (4) could easily fail. More concerning is the case of a not-so-nice kernel, such as the uniform: k(x | u) = Unif(x | 0, u), for x > 0 and u > 0, which is the natural kernel when m is monotone non-increasing on [0, ∞). In this case, the u-dependent support implies that the ratio in the above display is infinite on an open interval of x values and, hence, (4) obviously fails. The difficulty of verifying condition (4) in several practical applications is what motivated our investigation into potentially weaker sufficient conditions and, in turn, the present paper.

Conditions
The goal is to develop a new set of sufficient conditions for PR consistency that are weak enough that they can be checked in the applications we mentioned above, in particular, the case of uniform kernels for monotone density estimation. First we make clear the setup/conditions, and then we present the main results.
Condition 3. The kernel, the initial guess P₀, with corresponding m₀ = m_{P₀}, and the true density m satisfy the following integrability property:

sup_{u∈U} ∫ k(x | u) m(x) / m₀(x)² dx < ∞.   (5)

In the previous literature on this topic, and also in the literature on stochastic approximation more generally, the weights/step sizes are assumed to satisfy

Σ_{i≥1} w_i = ∞   and   Σ_{i≥1} w_i² < ∞.

Of course, the specific weights in Condition 1, which are of the same form as the weights used in Hahn et al. (2018), satisfy these conditions, but others do too. The reason we adopt this specific choice is that it allows us to replace (4) with the weaker bound (5), discussed more below. And since the choice of weights is entirely in the hands of the user, while the choice of kernel may be determined by the context of the problem and m is a choice made by "Nature" and hidden from the user, it is best to sacrifice generality in the directions the user can control. Condition 2 assumes that the mixing distribution support is compact, but this is not much of a restriction in practice, since the support can be taken as large as the user pleases. Compactness of U is not strictly needed for the results presented below, but (a) some more complicated notion of compactness is needed, as we briefly discuss in the paragraph leading up to Corollary 2, and (b) Condition 3 might be difficult to check without U being compact. For these reasons, we opt for the simpler, albeit slightly more restrictive, compactness condition listed above.
Finally, the most complicated assumption is Condition 3, concerning integrability. To understand it better, it may help to re-express the integrand in (5) as

k(x | u) m(x) / m₀(x)² = {k(x | u) / m₀(x)} × {m(x) / m₀(x)}.

First, if the PR initial guess P₀ is not too tightly concentrated, then the mixture m₀ will be heavier-tailed than any individual kernel k(· | u). In that case, the first ratio in the above display is bounded, or at least does not increase too rapidly. Second, we cannot expect PR, or any mixture model-based method for that matter, to do a good job of estimating m if a mixture with a relatively diffuse mixing distribution cannot adequately cover the support of m. So the heart of Condition 3 is an assumption that the posited mixture model can adequately cover the support of m, in the sense that the second ratio in the above display is not blowing up too rapidly. Finally, if the two ratios are well controlled, then the integral with respect to k(· | u) will be bounded uniformly in u. We shall see below, in Section 4, that (5) can be checked for uniform kernels while condition (4) of Martin and Tokdar (2009) cannot.

Main results
Our goal in this section is to show that the PR estimator, m_n = m_{P_n}, of m is consistent in the sense that K(m, m_n) converges almost surely to inf_{m′∈M} K(m, m′), the minimum Kullback-Leibler divergence over the posited mixture model class M. In the special case where m ∈ M, this implies consistency in the usual sense: K(m, m_n) → 0 almost surely. In either case, it says that the PR estimator, m_n, is close to the best possible mixture approximation of m, at least asymptotically. We will also show how consistency of the mixing distribution estimator can be established from consistency of the mixture, but this will require further explanation; see below.
Theorem 1. Under Conditions 1-3, the PR estimator, m_n, of the density m satisfies K(m, m_n) → inf_{m′∈M} K(m, m′) almost surely. In particular, if m ∈ M, then K(m, m_n) → 0 almost surely.
Here we give a very rough sketch of the proof strategy. Start by writing K_n = K(m, m_n) − inf_{m′∈M} K(m, m′), and let A_i denote the σ-algebra generated by the observations X_1, …, X_i, for i = 1, 2, …. We show in the proof that

E(K_n | A_{n−1}) ≤ K_{n−1} − w_n T(P_{n−1}) + Z_n,

where T is the functional of the mixing distribution defined in (6) and Z_n is a "remainder" term defined in the appendix. It follows from Jensen's inequality that T(P) ≥ 0, with equality if and only if P = P†, the Kullback-Leibler minimizer. If we could ignore the remainder term, then K_n would be a non-negative supermartingale and, therefore, would converge almost surely to some K_∞. Of course, the remainder term cannot be ignored, so we will use the "almost supermartingale" results in Robbins and Siegmund (1971) to accommodate it. Moreover, to show that K_∞ is 0 almost surely, we will use some new and useful properties of the function T in (6) which were overlooked in the analysis presented in Martin and Tokdar (2009).

When the mixture model is correctly specified, so that m† = m, it follows from Theorem 1 and the familiar properties of the Kullback-Leibler divergence that m_n → m almost surely in Hellinger or total variation distance, i.e., that ∫ (m_n^{1/2} − m^{1/2})² dx and ∫ |m_n − m| dx both go to 0 almost surely. In the general case where the mixture model is misspecified, Theorem 1 still strongly suggests that m_n → m†, but some effort is required to connect the Kullback-Leibler difference to a distance between m_n and m†. Towards this, define the Hellinger contrast ρ(m_1, m_2), which depends implicitly on m and m†, by

ρ(m_1, m_2)² = ∫ {m_1(x)^{1/2} − m_2(x)^{1/2}}² {m(x)/m†(x)} dx.

This is just a weighted version of the ordinary Hellinger distance, with weight function m/m†, so it is a proper metric. Clearly, if the mixture model is correctly specified, so that m† = m, then ρ is exactly the Hellinger distance. See Patilea (2001) and Kleijn and van der Vaart (2006) for further details on the Hellinger contrast.
The following result establishes that ρ(m†, m_n) → 0 almost surely, which implies that the limit m_∞ of m_n satisfies m_∞ = m† almost everywhere with respect to the measure with Lebesgue density m. Under an additional condition, namely, that m† is suitably close to m, the PR estimator m_n converges to m† in total variation distance, so the limit equals m† almost everywhere with respect to Lebesgue measure.

Corollary 1. Under the conditions of Theorem 1, ρ(m†, m_n) → 0 almost surely. Moreover, if m†/m ∈ L_∞(m), then m_n → m† almost surely in total variation.
Proof. See the proof of Corollary 4.10 in Martin and Tokdar (2009).
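For intuition, the Hellinger contrast can be evaluated numerically on a grid. The sketch below, with illustrative exponential densities of our choosing, checks the reduction to the ordinary Hellinger distance in the well-specified case m† = m; the function names are ours.

```python
import numpy as np

def trap(f, x):
    # trapezoidal rule on a grid (version-safe alternative to np.trapezoid)
    return float(np.sum((f[1:] + f[:-1]) * np.diff(x)) / 2)

def hellinger_contrast(m1, m2, m_true, m_dag, x):
    """rho(m1, m2)^2 = integral of (sqrt(m1) - sqrt(m2))^2 * (m_true / m_dag) dx."""
    w = m_true / m_dag
    return np.sqrt(trap((np.sqrt(m1) - np.sqrt(m2)) ** 2 * w, x))

x = np.linspace(0.0, 40.0, 20001)
m1 = np.exp(-x)              # Exp(1) density
m2 = 2.0 * np.exp(-2.0 * x)  # Exp(2) density

# Well-specified case: m_dag = m_true, so rho is the ordinary Hellinger distance;
# analytically, the squared distance is 2 - 2 * integral of sqrt(m1 * m2), i.e.
# 2 - 2 * sqrt(2) / 1.5 for this pair of exponentials.
rho = hellinger_contrast(m1, m2, m1, m1, x)
```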
Finally, what can be said about the convergence of the mixing distribution estimator, P n ? Again, Theorem 1 strongly suggests that P n is converging to P † in some sense, but we cannot make that leap immediately. In particular, without additional assumptions, there is no guarantee that P † is unique or even that P n converges at all. For this, we will need identifiability of the mixture model (1) and tightness of (P n ). Under Condition 2, as we assume here, tightness of P n follows from Prokhorov's theorem. If compactness of U is not a feasible assumption, then one can instead verify the more general sufficient condition, namely, Condition A6 in Martin and Tokdar (2009), for tightness of P n .
We will also require the following fairly abstract condition on the kernel density k, written in terms of a generic sequence of mixing distributions (Q_t) on U:

Q_t → Q weakly implies m_{Q_t}(x) → m_Q(x) for almost all x.   (7)

In words, (7) states that the kernel is such that weak convergence of mixing distributions implies almost-everywhere pointwise convergence of the corresponding mixture densities. This holds immediately if u → k(x | u) is bounded and continuous for almost all x, as was assumed in Martin and Tokdar (2009) and elsewhere. However, in some examples, like that in Section 4 below, strict continuity of the kernel fails, but condition (7) can still be verified.
Corollary 2. In addition to the conditions of Theorem 1, assume that
• the model is identifiable, i.e., m_P = m_{P′} almost everywhere implies P = P′;
• the kernel is such that (7) holds;
• m†/m ∈ L_∞(m).
Then the Kullback-Leibler minimizer P † is unique and P n → P † weakly almost surely.
Proof. Since P_n is tight, there exists a subsequence P_{n(t)} such that P_{n(t)} → P_∞ weakly, for some P_∞. By (7), we have pointwise convergence of the mixture densities, i.e., m_{n(t)}(x) → m_∞(x) for almost all x, and then m_{n(t)} → m_∞ in total variation distance thanks to Scheffé's theorem. But Corollary 1 already gives us m_n → m† almost surely in total variation distance on the full/original sequence. Therefore, it must be that m_∞ = m† almost everywhere and, by identifiability, that P_∞ = P†. Since any such convergent subsequence of P_n has the same weak limit, P†, it must be that P_n itself converges weakly, almost surely, to P†, as claimed.
The boundedness assumption on m†/m, as in Corollary 1, is needed simply to convert convergence of m_n to m† in the Hellinger contrast into convergence in total variation. Identifiability of the mixture model m_P in P̄ is non-trivial. Additively-closed one-parameter families of distributions were proved to be identifiable in Teicher (1961), and identifiability of finite mixtures of gamma and of Gaussian distributions was proved in Teicher (1963). Scale mixtures of uniform distributions, as we discuss in Section 4 below, were shown to be identifiable in Williamson (1956). More generally, identifiability of mixture models needs to be checked on a case-by-case basis.

Application: Monotone density estimation

Background
Any monotone non-increasing density can be written as a scale mixture of uniforms (Williamson 1956); that is, for any monotone density m defined on X = [0, ∞), there exists a mixing distribution P, supported on U = [0, ∞), such that

m(x) = ∫ k(x | u) P(du),   where k(x | u) = u^{-1} 1_{[0,u]}(x)   (8)

is the uniform kernel density. Therefore, the problem of estimating a monotone density can, at least in principle, be solved through the use of mixture density estimation methods, such as the PR algorithm. Let X_1, …, X_n be iid from a monotone non-increasing density m. One approach to estimating m is to calculate the nonparametric MLE, also known as the Grenander estimator (Grenander 1956), which is the left derivative of the least concave majorant of the empirical distribution function. The Grenander estimator is consistent, with consistency results obtained in Rao (1969) and Groeneboom (1985). However, as shown in, e.g., Woodroofe and Sun (1993), the Grenander estimator tends to overestimate near the origin and, in particular, is inconsistent at the origin. The same authors proposed a penalized likelihood estimator that penalizes the Grenander estimator at the origin and is consistent, including at the origin.
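A bare-bones version of the Grenander estimator, the slopes of the least concave majorant of the empirical distribution function, can be computed with a convex-hull scan. The paper's numerical work uses the R package fdrtool, so the following Python sketch, with names of our choosing, is only illustrative.

```python
import numpy as np

def grenander(x):
    """Grenander estimator: left derivative of the least concave majorant (LCM)
    of the empirical CDF. Assumes distinct data points.

    Returns (knots, slopes): the estimated density equals slopes[j] on
    the interval (knots[j], knots[j + 1]].
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # ECDF points (0, 0), (x_(1), 1/n), ..., (x_(n), 1); the LCM is their
    # upper concave hull, found by a left-to-right monotone-chain scan.
    pts = np.column_stack([np.concatenate([[0.0], x]), np.arange(n + 1) / n])
    hull = []
    for px, py in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # drop the middle point if it lies on or below the chord; the
            # hull's slopes must be strictly decreasing for concavity
            if (x2 - x1) * (py - y1) >= (px - x1) * (y2 - y1):
                hull.pop()
            else:
                break
        hull.append((px, py))
    hull = np.asarray(hull)
    return hull[:, 0], np.diff(hull[:, 1]) / np.diff(hull[:, 0])

rng = np.random.default_rng(2)
knots, slopes = grenander(rng.exponential(size=200))
```

By construction, the slope sequence is non-increasing (a valid monotone density) and the piecewise-constant density integrates to exactly 1, since the hull rises from 0 to 1.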
Another approach is Bayesian, whereby a prior distribution on m is induced by the mixture characterization in (8) along with a suitable prior on the mixing distribution P. A natural choice is a Dirichlet process prior on P, leading to a Dirichlet process mixture of uniforms model for the density m; see Bornkamp and Ickstadt (2009). Although this approach seems straightforward, obtaining asymptotic consistency results for the posterior distribution is made difficult by the uniform kernel's varying support. In particular, if the support of the mixing distribution is not suitably chosen, then the Kullback-Leibler divergence of a posited mixture model from the true density will be infinite, which creates problems for verifying the so-called "Kullback-Leibler property" (Schwartz 1965; Wu and Ghosal 2008) in the classical Bayesian consistency theory. Some strategies have been suggested in, e.g., Salomond (2014), who showed that the Bayesian posterior distribution under the Dirichlet process mixture prior has a near-optimal concentration rate in total variation. More recently, Martin (2019) proposed an empirical, or data-driven, prior for which the prior support conditions required for asymptotic consistency are automatically satisfied, and showed that the corresponding empirical Bayes posterior distribution concentrates around the true monotone density at nearly the optimal minimax rate. But the fully Bayesian solutions are computationally non-trivial and somewhat time-consuming; moreover, the estimates tend to be relatively rough. The PR algorithm, which is computationally fast and tends to produce smooth estimates, is a natural alternative to the aforementioned methods.

PR for uniform mixtures
Suppose that the true density m is any monotone density supported on [0, ∞). We know that m can be written as a mixture in (8), so there exists a mixing distribution P , which is also supported on [0, ∞). This point is relevant because of the following unique feature of uniform mixtures: if m P is a mixture model as in (8) with P supported on [0, L), then m P (x) = 0 for all x > L and, hence, if L < ∞, then K(m , m P ) ≡ ∞. Therefore, the upper bound of U being ∞ creates some serious challenges. For practical implementation of the PR algorithm, and for the theory as discussed above, a compact mixing distribution support is needed. This calls for a different approach.
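As a concrete instance of the representation (8) and of the truncation problem just described, the standard exponential density is the scale mixture of uniforms with Gamma(2, 1) mixing density p(u) = u e^{-u}. This example is ours, chosen because the mixture integral has a closed form; the snippet checks it numerically and then illustrates how truncating the mixing support zeroes the mixture beyond L.

```python
import numpy as np

# The standard exponential is the uniform scale mixture with Gamma(2, 1)
# mixing density p(u) = u * exp(-u):
#   m(x) = integral of u^{-1} 1(x <= u) p(u) du = integral_x^inf exp(-u) du = exp(-x).
u = np.linspace(1e-6, 60.0, 600001)
du = u[1] - u[0]
p = u * np.exp(-u)

def mix(x, mixing):
    """m_P(x) = integral of u^{-1} 1(x <= u) mixing(u) du, by Riemann sum on the u grid."""
    return float(np.sum(np.where(u >= x, mixing / u, 0.0)) * du)

m_at_1 = mix(1.0, p)                 # should be close to exp(-1)

# Truncating the mixing support to [eps, L] forces m_P(x) = 0 for x > L,
# which is why K(m, m_P) = infinity whenever m has mass beyond L.
L = 5.0
p_trunc = np.where((u >= 1e-3) & (u <= L), p, 0.0)
m_beyond_L = mix(6.0, p_trunc)       # identically zero
```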
For a fixed L ∈ (0, ∞), define a new target, m_L, which is simply m restricted to [0, L) and renormalized. That is, if M denotes the distribution function corresponding to the density m, then

m_L(x) = m(x) / M(L),   x ∈ [0, L).

Alternatively, m_L can be viewed as the conditional density of X, given X ≤ L; see below. The point of this adjustment is that m_L has a known and bounded support, so a mixture model with mixing distribution supported on (a large subset of) [0, L) can be fit with the PR algorithm to efficiently and accurately estimate this new target m_L. Note that m_L can be made arbitrarily close to m by choosing L sufficiently large (see below), so this modification has no practical consequences. For technical and practical reasons, we cannot use the PR algorithm when the support of the mixing distribution contains u = 0, so we also introduce a lower bound ε ∈ (0, L), which can be taken arbitrarily small. Then the proposed mixture model to be fit by PR is

m_P(x) = ∫_{[ε,L]} u^{-1} 1_{[0,u]}(x) P(du),   P supported on U = [ε, L].   (9)

While both m_P above and the adjusted target m_L are supported on [0, L], the model in (9) is still slightly misspecified through the introduction of the lower bound ε > 0 on the mixing distribution support. In particular, note that m_P(x) is constant for x ∈ [0, ε]. But the fact that ε can be taken arbitrarily small means that there are no practical consequences to this misspecification. It does complicate the convergence analysis but, fortunately, the theory presented in Section 3 above is general enough to handle this.
Given that the mixture model (9) is slightly misspecified, it is important to know what we can expect the PR algorithm to do. Theorem 1 states that, roughly, the PR estimator m_n will converge to the Kullback-Leibler minimizer m†. Since the supports of m_L and the model densities m_P in (9) agree, we avoid the "K(m_L, m_P) ≡ ∞" problem, so minimizing the Kullback-Leibler divergence is well-defined. To understand the bias coming from model misspecification, it will be important to understand what m† looks like. Incidentally, Williamson (1956) established that uniform mixtures are identifiable, so there is a unique mixing distribution, P†, supported on U, at which the minimum Kullback-Leibler divergence is attained. The following lemma gives the details.

Lemma 1. If the mixture model is as in (9), then the unique minimizer, P† = P†_{ε,L}, of the Kullback-Leibler divergence P → K(m_L, m_P) takes the form

P† = a_ε δ_{ε} + P|_U + a_L δ_{L},   (10)

where δ_{t} is the Dirac point-mass at t, P|_U is P restricted to U = [ε, L], and the coefficients a_ε and a_L are determined by M, the distribution function corresponding to m. The best approximation of m_L under model (9) is then m† = m_{P†}.

Proof. See Appendix A.2.
The characterization result in Lemma 1 is intuitive. There is a true P that characterizes the true monotone mixture density m, both generally supported on [0, ∞). Our proposed model, however, effectively restricts the mixing distribution's support to [ε, L], so it makes sense that the best approximation would agree with P on [ε, L] and then suitably allocate the remaining mass to the endpoints ε and L.
From Section 2, recall that the implementation of the PR algorithm begins with an initial guess P₀, and that this effectively determines the dominating measure with respect to which P_n has a density. PR's flexibility in the choice of the underlying dominating measure comes in handy in cases like this, where we know that the target mixing distribution, P†, has an "unusual" dominating measure. From Lemma 1, we know that the best mixing distribution for fitting the mixture model (9) to m_L puts point masses at the endpoints, ε and L, of U, and has a density with respect to Lebesgue measure on the interior of U. So, naturally, we can initialize the PR algorithm with a starting guess P₀ that has a density with respect to the dominating measure δ_{ε} + λ_U + δ_{L}, where λ_U denotes Lebesgue measure on U. Specifically, our proposal is to initialize the PR algorithm at

P₀ = p_{0,ε} δ_{ε} + (1 − p_{0,ε} − p_{0,L}) P_{0,U} + p_{0,L} δ_{L},

where p_{0,ε} and p_{0,L} are positive with sum strictly less than 1, and P_{0,U} has a density with respect to Lebesgue measure; e.g., P_{0,U} could just be a uniform distribution on U. Then the estimate, P_n, after the n-th iteration has the same form,

P_n = p_{n,ε} δ_{ε} + (1 − p_{n,ε} − p_{n,L}) P_{n,U} + p_{n,L} δ_{L},

and the corresponding mixture density estimate, m_n, is obtained as usual by integrating the kernel with respect to the mixing distribution P_n.
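The updates under this mixed dominating measure can be sketched by tracking the two endpoint masses alongside a gridded interior density. All names, the grid size, the initial atom masses, and the weight exponent below are our illustrative choices, not the paper's.

```python
import numpy as np

def pr_uniform_mixture(x_data, eps, L, n_grid=400, gamma=0.67):
    """Predictive recursion for the uniform scale mixture on U = [eps, L], with
    mixing measure dominated by delta_eps + Lebesgue + delta_L.

    State: (a_eps, p, a_L) = mass at eps, density on the interior grid, mass at L.
    """
    grid = np.linspace(eps, L, n_grid)
    du = grid[1] - grid[0]
    a_eps, a_L = 0.05, 0.05                    # small initial atoms at the endpoints
    p = np.full(n_grid, (1.0 - a_eps - a_L) / (L - eps))
    for i, x in enumerate(x_data, start=1):
        w = (i + 1.0) ** (-gamma)
        k_grid = np.where(grid >= x, 1.0 / grid, 0.0)   # uniform kernel k(x | u)
        k_eps = 1.0 / eps if x <= eps else 0.0
        k_L = 1.0 / L                                   # valid since the data satisfy x <= L
        denom = a_eps * k_eps + a_L * k_L + np.sum(k_grid * p) * du
        a_eps = (1 - w) * a_eps + w * a_eps * k_eps / denom
        a_L = (1 - w) * a_L + w * a_L * k_L / denom
        p = (1 - w) * p + w * k_grid * p / denom
    return a_eps, grid, p, a_L

rng = np.random.default_rng(3)
x = rng.exponential(size=300)
L = float(x.max())                   # data-driven upper bound, as in the numerical section
a_eps, grid, p, a_L = pr_uniform_mixture(x[x <= L], 1e-5, L)
```

Note that each update is the same weighted-average form as (2), applied simultaneously to the two atoms and the Lebesgue part, so the total mass is preserved at each step.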

Theoretical results
Now that we know what the PR algorithm ought to converge to, we are ready to state our main result of this section. First, a word about the notation/terminology that follows. In our previous results, when we wrote "almost surely," this was referring to the law that corresponds to iid sampling from m. In the results below, m_L is the target, so we will write "m_L-almost surely" to be clear that it is with respect to the law corresponding to iid sampling from m_L. Recall that m_L is the conditional density of X, given X ≤ L, so this modified law can be interpreted as iid sampling from m, but throwing away any data points that exceed L. Again, since L can be taken arbitrarily large, there are no practical consequences of this restriction. In fact, a bound on the bias induced by both the L- and ε-restrictions is given in Proposition 1 below.
Theorem 2. If m is renormalized to m_L supported on [0, L], and if the proposed mixture model m_P is as in (9), then the PR estimator m_n satisfies K(m_L, m_n) → K(m_L, m†), m_L-almost surely, where m† is as given in Lemma 1. Moreover, m_n converges m_L-almost surely to m† in total variation distance, and the mixing distribution estimator P_n converges weakly, m_L-almost surely, to P† in (10).

Proof. See Appendix A.3.
Our choice to restrict the mixing distribution's support to U = [ε, L] introduces some bias. That is, the limit m† of the sequence of PR estimators is the Kullback-Leibler minimizer over all mixtures supported on U = [ε, L], but it is different from m_L, which in turn is different from m. Intuitively, if ε ≈ 0 and L ≈ ∞, then the bias ought to be negligible. The next result confirms this intuition by bounding the bias as a function of (ε, L).
Proposition 1. The L₁ distance between the true monotone density m and the best approximation m† under the restricted model (9) satisfies

∫ |m(x) − m†(x)| dx ≲ {1 − M(L)} + P([0, ε]),   (12)

where ≲ denotes inequality up to a universal constant.

Proof. See Appendix A.4.
To make the bound in (12) more concrete, we consider a specific case. A common choice in the literature (e.g., Martin 2019; Salomond 2014) is to assume that m has tails that vanish exponentially fast, so that m(x) ≤ exp(−b x^r) for all large x and some positive constants b and r; the case r = ∞ corresponds to m having bounded support. From this, and standard asymptotic bounds on the incomplete gamma function, it follows that 1 − M(L) ≲ L^{1−r} exp(−b L^r) for large L. Furthermore, if, e.g., P has a bounded density at 0, then we have P([0, ε]) ≲ ε. Combining these two, we arrive at the following, more explicit bound on the L₁ bias as a function of (ε, L):

∫ |m(x) − m†(x)| dx ≲ ε + L^{1−r} exp(−b L^r).

Clearly, by taking ε small and L even just moderately large, the overall bias resulting from the restriction to U = [ε, L] can be made negligibly small.
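As a quick numerical sanity check of the tail asymptotics (our example, with b = 1/2 and r = 2, a Gaussian-type tail), the exact tail integral of exp(−x²/2) can be compared with the asymptotic form L^{1−r} exp(−b L^r) = L^{-1} exp(−L²/2):

```python
import math

# Tail of exp(-b x^r) with b = 1/2, r = 2: exact value via the complementary
# error function,
#   integral_L^inf exp(-x^2 / 2) dx = sqrt(pi / 2) * erfc(L / sqrt(2)),
# versus the asymptotic form exp(-L^2 / 2) / L.
L = 6.0
exact = math.sqrt(math.pi / 2.0) * math.erfc(L / math.sqrt(2.0))
approx = math.exp(-L * L / 2.0) / L
ratio = exact / approx               # tends to 1 as L grows (Mills ratio)
```

Already at L = 6 the two quantities agree to within a few percent, consistent with the claim that even a moderately large L makes the tail contribution to the bias negligible.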
As a final technical detail in this section, we consider the problem of estimating m(0), the value of the density at its mode, the origin. This is an interesting and challenging problem with a variety of applications; see, e.g., Vardi (1989). In particular, Woodroofe and Sun (1993) highlight examples, such as the time between breakdowns of a system and the distribution of galaxies, that require estimation of this modal value m(0). The PR algorithm gives an obvious estimator of m(0), namely, m_n(0). The following result gives a theoretical basis for using this estimator, and simulations in Section 4.4 show that the proposed estimate at 0 performs well when compared to existing methods.

Numerical illustrations
In this section we compare our PR-based method with existing methods for monotone density estimation. The four alternatives we consider are the Grenander estimator, a Bayesian approach using a Dirichlet process prior, a Bayesian approach using an empirical prior, and a method based on maximizing a penalized likelihood. The Grenander estimator is based on the nonparametric MLE and can be calculated easily using the R package fdrtool (Klaus and Strimmer 2015). Settings for the Dirichlet process mixture and the empirical Bayes method were based on those suggested in Martin (2019) and computed using the R code provided on his website. The penalized likelihood maximization was based on Woodroofe and Sun (1993), and we used one of the values recommended by those authors for their penalization parameter, i.e., α = n^{-1} log n. For PR, we take the mixing distribution support to be U = [ε, L], with ε = 10^{-5} and L = max(X). The initial guess P₀ is taken to be uniform on U. To reduce the dependence of the PR estimator on the data order, we average the PR estimates over 25 random permutations of the data. For the comparisons below, we consider both real and simulated data sets.
First, we consider data from a study of suicide risks reported in Silverman (1986), consisting of the lengths of psychiatric treatment for n = 86 patients used as controls. As per the detailed study of suicide risks in Copas and Fryer (1980), the risk of suicide is higher in the early stages of treatment, so modeling these data with a monotone density is appropriate. Figure 1 shows a comparison of the four monotone density estimation methods discussed above with PR, over a histogram of the data. PR gives a smooth estimate of the monotone density in a very short amount of time, much faster than the Bayes and empirical Bayes estimates that require Markov chain Monte Carlo. The take-away message is that PR's misspecification bias, due to the choice of ε and L, can be easily controlled, and that PR gives a quality estimate compared to the other four methods. In fact, the PR estimate in this case is smoother than the others, a desirable feature in applied data analysis. The simulations below give a clearer picture of how PR performs compared to the other four methods.
Second, we consider two true monotone densities m*, namely, the standard exponential and the half standard normal. We carry out the simulation study over sample sizes n = 50, 100, 200. For each n, we generate 200 data sets of size n and produce the five different estimates on each data set. As our metric of comparison, we use the total variation (or L¹) distance between the true density and the estimate. Additionally, since inconsistency of the Grenander estimator at the origin is a well-known complication, we also look at the ratio m̂(0)/m*(0) for each method. Boxplots summarizing both the L¹ distance and the at-the-origin ratio for the two simulations are shown in Figures 2 and 3. Consider first the boxplots summarizing the L¹ distance. As the sample size increases, the boxplots for all five methods shrink towards 0, as expected. Notably, the performance of PR is better than that of the Grenander estimator over all sample sizes. PR is also faster than the two Bayesian estimates, with slightly better performance, and is comparable to the penalized likelihood estimate. For estimating the density at 0, we compare PR with only the state-of-the-art estimates, namely, the one based on penalizing the nonparametric MLE near 0 and the DP mixture. Even though PR is not tailored specifically for estimation at 0, as the penalized likelihood estimator is, its performance is competitive with the other methods.
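The comparison metric itself is straightforward to compute on a grid. The sketch below is a simple Riemann-sum approximation (the grid choice is ours), illustrated on the two true densities used in the simulations.

```python
import numpy as np

def l1_distance(f, g, grid):
    """Riemann-sum approximation of \\int |f(x) - g(x)| dx on a uniform grid."""
    dx = grid[1] - grid[0]
    return float(np.sum(np.abs(f(grid) - g(grid))) * dx)

grid = np.linspace(0.0, 20.0, 20001)
expo = lambda x: np.exp(-x)                                        # standard exponential
halfnorm = lambda x: np.sqrt(2.0 / np.pi) * np.exp(-x ** 2 / 2.0)  # half standard normal
print(l1_distance(expo, halfnorm, grid))
```

In the simulation study, `g` would be the fitted density estimate rather than a second known density.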

Conclusion
Estimation of mixing distributions in mixture models is a challenging problem, one for which there are very few satisfactory methods available. To our knowledge, the PR algorithm is the only general method available that is both fast and capable of nonparametrically estimating a mixing distribution having a density with respect to any user-specified dominating measure. Despite the simple and fast implementation of the PR algorithm, and the strong empirical performance observed in numerous applications, its theoretical analysis and justification are non-trivial because of the recursive structure. Previous work has established consistency of the PR estimates under relatively strong conditions. Most concerning is that there are known examples, such as monotone density estimation using uniform mixtures, for which the sufficient conditions in previous work do not hold. The main focus of the present paper was to weaken those overly strong conditions in order to broaden the range of problems in which PR can be applied. In particular, the new sufficient conditions can be checked for mixtures of uniform kernels, which puts PR in a position to solve the non-trivial problem of monotone density estimation on [0, ∞).
There are a number of possible extensions and open problems that could be considered. First, from a practical or methodological point of view, there is a natural extension: what can be done if the location of the mode itself is unknown? This is a non-trivial problem and has been investigated by a number of researchers, including Liu and Ghosh (2020). In the PR framework, the natural approach would be to treat the mode as an unknown, non-mixing parameter in the kernel and apply the PR marginal likelihood strategy in Martin and Tokdar (2011) to estimate both the mode and the mode-specific mixing distribution. How this proposal compares to existing methods remains to be investigated. Second, from a theoretical point of view, it is undesirable to work with a fixed and compact mixing distribution support U. A natural extension would be to introduce a type of sieve, allowing the support to depend on the sample size, i.e., U = U_n. The use of an n-dependent support U_n, however, is difficult and awkward in the context of PR. First, unlike the usual likelihood-based methods that assume all the data are available at once, PR is technically meant for recursive estimation with online data; in that setting, a sample-size-dependent support is unnatural, since the sample size is not set in advance. But even if we ignore PR's recursive structure and treat it as being applied to batch data, the analysis is based on martingales that implicitly treat the data points one by one in a sequence, so having any n-specific components in the algorithm itself is awkward. Beyond awkwardness, there is a specific technical obstacle: much of the analysis depends on properties of the functional T defined in (6).
This functional depends on U and so, if U is made to depend on n, then we end up with a sequence, T_n, of functionals applied to the PR sequence of estimates, P_n, and new techniques would be needed in order to analyze a sequence of random variables like T_n(P_{n−1}).

A.1 Proof of Theorem 1

The proof starts from a second-order Taylor approximation of the logarithm, log(1 + z) = z − z² R(z), where the remainder term R satisfies 0 ≤ R(z) ≤ max{1, (1 + z)^{−2}}; this remainder bound will be important later. Let K_n = K(m*, m_n). Then, from the recursive form of the mixture density updates, m_n(x) = (1 − w_n) m_{n−1}(x) + w_n h_{n,X_n}(x), and this Taylor approximation, it can be shown that E(K_n | A_{n−1}) ≤ K_{n−1} − w_n T(P_{n−1}) + w_n²(1 + Z_n), up to a multiplicative constant on the last term, where Z_n is a non-negative remainder. Since we only need an upper bound up to a multiplicative constant, we ignore the constant lumped inside "≲" in what follows; we also ignore the leading "1 +", since the bound ultimately gets multiplied by w_n², which is summable by assumption. From this bound, plug in the definition of h_{n,X_n} and apply the Cauchy–Schwarz inequality. Next, we focus on the terms in the denominator, say, m_{n−1}(x). From the recursive form of the mixture density updates, we immediately see that m_n(x) ≥ m_0(x) ∏_{i=1}^n (1 − w_i), for any x.
Plug in this lower bound for both terms in the denominator of the bound for Z_n. Now take conditional expectation with respect to A_{n−1} and interchange the order of integration, which is allowed since the integrand is non-negative. By Condition 3, the expression inside the curly braces is bounded, uniformly in u, by a constant. Next we use the assumed form of the weight sequence, in Condition 1, to bound the product ∏_{i=1}^n (1 − w_i)^{−1}. Using the standard bound − log(1 − w) ≤ w(1 − w)^{−1}, and the fact that the w_i's are decreasing, the logarithm of the product is bounded by a constant multiple of ∑_{i=1}^n w_i. Since w_i = a(i + 1)^{−1} under Condition 1, this summation is of order log n, so the product is bounded by a power of n. Putting everything together, since a < 2/9, the exponent in the resulting bound is less than −1, hence the upper bound is summable almost surely, verifying the hypothesis of the Robbins–Siegmund theorem. Consequently, we conclude that K_n → K_∞ and ∑_n w_n T(P_{n−1}) < ∞, almost surely.
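For completeness, the almost-supermartingale theorem invoked here can be stated as follows; in the present application, Z_n is K_n, the b-sequence is the summable w_n² term, and the c-sequence is w_n T(P_{n−1}).

```latex
\textbf{Theorem (Robbins--Siegmund).}
Let $(Z_n)$, $(a_n)$, $(b_n)$, and $(c_n)$ be sequences of non-negative,
$\mathcal{A}_n$-measurable random variables such that
\[
\mathsf{E}(Z_n \mid \mathcal{A}_{n-1})
  \le Z_{n-1}(1 + a_{n-1}) + b_{n-1} - c_{n-1},
  \qquad n \ge 1.
\]
If $\sum_n a_n < \infty$ and $\sum_n b_n < \infty$ almost surely, then, almost
surely, $Z_n$ converges to a finite limit and $\sum_n c_n < \infty$.
```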
It remains to show that the limit, K_∞, is 0 almost surely.
The key to proving this last claim is an understanding of the properties of the functional T. For a generic mixing distribution P supported on U, rewrite T in its variational form. For any bounded and continuous function h: U → R, it follows from the standard bound |∫ ⋯ du| ≤ ∫ |⋯| du and the Cauchy–Schwarz inequality that T(P) admits a lower bound given by a supremum, where the supremum is over all bounded and continuous functions h with ∫ h² dP = 1. For an alternative look at the integral in the curly braces, define the operator φ that maps a probability measure P on U to a new probability measure, φ(P), on U; then the expression in the curly braces is simply ∫ h dφ(P) − ∫ h dP.
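The role of the fixed points of φ can be illustrated numerically on a grid. The sketch below assumes the standard form of the map, φ(P)(du) = P(du) ∫ k(x | u) {m*(x)/m_P(x)} dx, with the uniform kernel (the grid sizes and the choice of P are illustrative): if m* = m_P, then every u receives weight ∫ k(x | u) dx = 1, so φ(P) = P up to discretization error.

```python
import numpy as np

# Grids over U = [eps, L] (mixing variable u) and [0, L] (data variable x).
eps, L = 0.1, 5.0
u = np.linspace(eps, L, 300);  du = u[1] - u[0]
x = np.linspace(0.0, L, 2000); dx = x[1] - x[0]
K = np.where(x[:, None] <= u[None, :], 1.0 / u[None, :], 0.0)  # Unif(x | 0, u)

p = np.exp(-u); p /= p.sum() * du      # a mixing density P on the u-grid
m_p = K @ p * du                       # its mixture density m_P on the x-grid
m_star = m_p.copy()                    # take m* = m_P, so P should be a fixed point

# phi(P)(du) = P(du) * \int k(x | u) m*(x)/m_P(x) dx, renormalized on the grid.
weight = (K * (m_star / m_p)[:, None]).sum(axis=0) * dx
q = p * weight
q /= q.sum() * du

print(np.max(np.abs(q - p)))           # small: P is (numerically) a fixed point
```

Replacing m_star with a mixture density different from m_P moves mass and φ(P) ≠ P, which is exactly the mechanism exploited in the subsequence argument below.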
A consequence of the Robbins–Siegmund theorem is that ∑_n w_n T(P_{n−1}) < ∞ almost surely. Since w_n itself vanishes too slowly to be summable, there must exist a subsequence P_{n(t)} such that T(P_{n(t)}) → 0 almost surely. Therefore, {∫ h dφ(P_{n(t)}) − ∫ h dP_{n(t)}}² → 0, almost surely.
Since the original sequence P_n is tight, there is a sub-subsequence P_{n(t_s)} with a weak limit, and the above result implies that this limit is a fixed point of φ. However, the only fixed points of this mapping are Kullback–Leibler minimizers, say, P†; see, for example, Lemma 3.4 in Shyamalkumar (1996). This implies that K_{n(t_s)} vanishes almost surely. But by the Robbins–Siegmund theorem, the original sequence K_n converges almost surely to some K_∞, and if a convergent sequence has limit 0 along a subsequence, then K_∞ = 0 almost surely. Putting everything together, we have shown that K_n = K(m*, m_n) − K(m*, m†) → 0 almost surely, which implies K(m*, m_n) → K(m*, m†), completing the proof.

A.2 Proof of Lemma 1
The proof proceeds in two steps. First we express the modified target m*_L as a uniform mixture and identify the corresponding mixing distribution, denoted by P*_L. Then we solve the optimization problem of identifying the mixing distribution P† = P†_{ε,L}, supported on U = [ε, L], that minimizes P → K(m*_L, m_P). For the first step, recall the definition of m*_L, where M* is the distribution function corresponding to the density m*. Evaluating the denominator and the numerator by direct calculation and, after a bit of algebra, simplifying their ratio, we can write m*_L as a uniform mixture with mixing distribution P*_L. That is, m*_L is a uniform mixture, where the mixing distribution P*_L is not just P* restricted and renormalized to [0, L], but a mixture of that and a point mass at L. For step 2, we want to find the minimizer of P → κ(P) := K(m*_L, m_P) over all mixing distributions supported on U = [ε, L], where m*_L has the mixture form just described. Using the above notation, the lemma's claim is that the minimizer combines P*_L|_U, i.e., P*_L restricted (but not renormalized) from [0, L] to U = [ε, L], with the remaining mass ω = P*_L([0, ε]). If we can show that the Gâteaux derivative of κ, evaluated at P†, in the direction of any other distribution H on U, vanishes, then we will have proved the claim. Let m† = m_{P†}. On the interval x ∈ (ε, L], it is clear that m†(x) = m*_L(x). Next, since both P† and H are supported on U = [ε, L], the two mixture densities m† and m_H are constant on the interval x ∈ [0, ε]. We claim that the resulting integral on the right-hand side is 0.
To see this, first integrate m† over [0, ε] and, similarly, integrate m*_L over [0, ε]. The two integrals are clearly the same. Plugging the resulting relations (20) and (21) into the left-hand side of (19) proves the claim, i.e., that the Gâteaux derivative of κ at P† vanishes in all directions H, which implies that P† is the minimizer of the Kullback–Leibler divergence.

A.3 Proof of Theorem 2
To prove K(m*_L, m_n) → K(m*_L, m†), we apply Theorem 1. Condition 1 is in the user's control and, hence, is easy to satisfy. Condition 2 requires the support of the mixing distribution to be compact, which is clearly satisfied by U = [ε, L]. Condition 3 is the only non-trivial condition; it concerns a ratio involving m_0, the mixture density corresponding to the initial guess P_0, which contains point masses. The key point is that, thanks to the point mass at L, m_0(x) ≥ p_{0,L} Unif(x | 0, L) = p_{0,L} L^{−1}, for x ∈ [0, L].
Since the denominator above is uniformly bounded away from 0 and, similarly, the numerator is uniformly bounded by ε^{−1}, Condition 3 clearly holds.
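Both bounds in this verification, the floor p_{0,L} L^{−1} from the atom at L and the ceiling ε^{−1} from the kernel, are easy to confirm numerically. The sketch below uses an illustrative initial guess P_0 with a continuous part on [ε, L] plus an atom of mass p_{0,L} = 0.2 at L (the specific numbers are ours).

```python
import numpy as np

eps, L = 1e-2, 4.0
p_point = 0.2                                        # mass of P0 at the point {L}
u = np.linspace(eps, L, 500); du = u[1] - u[0]
cont = np.full_like(u, (1.0 - p_point) / (L - eps))  # continuous part of P0

def m0(x):
    """m0(x) = \\int u^{-1} 1[0,u](x) P0(du), with the atom at L included."""
    dens = float(np.sum(np.where(u >= x, 1.0 / u, 0.0) * cont) * du)
    return dens + p_point / L                        # atom's kernel: 1/L for x <= L

xs = np.linspace(0.0, L, 200)
vals = np.array([m0(x) for x in xs])
print(vals.min() >= p_point / L)   # floor from the atom at L
print(vals.max() <= 1.0 / eps)     # ceiling: kernel is at most 1/eps on [eps, L]
```

The floor holds for every x ∈ [0, L] because the continuous part only adds non-negative mass on top of the atom's contribution p_{0,L}/L.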
Next, the claim about convergence of m_n to m† in total variation follows immediately from Corollary 1 and the fact that m*_L is bounded away from 0. Finally, for the claim about weak convergence of P_n to P†, we apply Corollary 2. We have already noted that m†/m*_L ∈ L_∞, since m*_L is bounded away from 0. So all that remains is to check that the uniform kernel satisfies the abstract condition (7), which we do next. Consider a generic sequence of mixing distributions Q_t, supported on U = [ε, L], converging weakly to Q_∞. Condition (7) concerns the behavior of the mixture density m_{Q_t}(x). Note that the uniform kernel is not a continuous function of u for a given x, but it is upper semicontinuous. Recall, too, that the mixture densities are constant for x ∈ [0, ε]. This means that the value of the mixture density on a set of positive measure is determined by its value at x = ε, so some care is needed below; in particular, we treat the cases x ∈ [0, ε] and x ∈ (ε, L] separately.
Start with the case x ∈ (ε, L]. The kernel u → Unif(x | 0, u) is bounded and continuous except for the jump discontinuity at u = x. It is possible that the limit Q_∞ of the sequence Q_t puts positive mass at u = x, i.e., that x is a discontinuity point of Q_∞. In such cases, m_{Q_t}(x) may not converge or, even if it does converge, the limit may not equal m_{Q_∞}(x). However, the set of discontinuity points of Q_∞ has Lebesgue measure 0. For any x ∈ (ε, L] that is not a discontinuity point of Q_∞, the kernel is effectively bounded and continuous, so Q_t → Q_∞ weakly implies m_{Q_t}(x) → m_{Q_∞}(x). This verifies (7) for the range x ∈ (ε, L]. For the case x ∈ [0, ε], again, the mixture density is constant in x. Therefore, if there were an issue with convergence of the mixture density at x = ε, then there would be an issue on a set of positive Lebesgue measure, and (7) would fail. However, while the kernel is only upper semicontinuous in general, u → Unif(ε | 0, u) is bounded and continuous on the support of the Q_t sequence, so m_{Q_t}(ε) → m_{Q_∞}(ε) follows automatically from the definition of weak convergence. This implies the same for all x ∈ [0, ε], so (7) holds there too.

A.4 Proof of Proposition 1
By the triangle inequality, we have the upper bound (22), and we now consider each term in turn. For the first term in (22), we borrow the calculations from the proof of Lemma 1 above. In particular, on the interval x ∈ [ε, L] the two densities agree, while on the interval x ∈ [0, ε) the absolute difference between the densities is bounded by ∫ Unif(x | 0, u) P*_L(du), for x ∈ [0, ε).

Now integrate this bound over x ∈ [0, ε) to get the first term's contribution.
Combining the two bounds proves the claim.

A.5 Proof of Proposition 2
As shown in the proof of Theorem 2, m_n(ε) → m†(ε) almost surely with respect to m*_L. Since m_n(0) = m_n(ε) and m†(0) = m†(ε) by Equation (9), the proof of the first claim is complete. To bound the bias, i.e., the difference between the quantity being estimated, m†(0), and the true density at the origin, m*(0), we proceed as follows.
Using the definitions of a_ε, a_U, and a_L, the bound P*([0, ε]) ≲ ε, and the fact that ∫_U u^{−1} P*(du) = O(1) as a function of (ε, L), it is easy to check that each of the three terms on the right-hand side above can be bounded by 1 − M*(L).