A review of uncertainty quantification for density estimation

It is often useful to conduct inference for probability densities by constructing “plausible” sets in which the unknown density of given data may lie. Examples of such sets include pointwise intervals, simultaneous bands, or balls in a function space, and they may be frequentist or Bayesian in interpretation. For almost any density estimator, there are multiple approaches to inference available in the literature. Here we review such literature, providing a thorough overview of existing methods for density uncertainty quantification. The literature considered here comprises a spectrum from theoretical to practical ideas, and for some methods there is little commonality between these two extremes. After detailing some of the key concepts of nonparametric inference – the different types of “plausible” sets, and their interpretation and behaviour – we list the most prominent density estimators and the corresponding uncertainty quantification methods for each.

from more "classical" approaches [159] to the most advanced modern techniques [23]. Estimation, however, is only one piece of the puzzle: as in any statistical problem, it is desirable to also conduct inference, providing some quantification of uncertainty in addition to single estimates. Broadly speaking, uncertainty is quantified using sets of "plausible" values -for example, confidence intervals for frequentist methods and credible intervals for Bayesian ones. Although not as abundant as other areas in nonparametric statistics, there is a sizeable body of literature on uncertainty quantification (UQ) for density estimation, ranging from rigorously theoretical to extremely practical.
The following sections provide more detail on various types of "uncertainty sets", then outline several density estimation methods and review available literature dealing with UQ for each one. Although some combinations of estimation and inference ideas are not represented in the literature (in particular, a substantial gap exists between theoretical and practical UQ developments in many cases), in principle, one could always obtain some kind of uncertainty bounds on a density estimate, either by bootstrapping a frequentist method or taking quantiles of MCMC output for a Bayesian one. Whether or not such bounds have suitable coverage properties or otherwise perform adequately is another question for which the answers are not always known. Despite some of these limitations, this paper presents a comprehensive review of the work done thus far in unknown density UQ, and suggests promising areas to extend the research or "fill in the gaps".

Overview and notation
Let X = (X 1 , . . . , X n ) be a sample from some unknown "true density" f 0 . The majority of discussion here will assume i.i.d. samples, but other data structures will also be considered as warranted. One structure that is common enough to justify mentioning here is the case of "noisy" observations Y i = X i + Z i , where the errors Z i have known distribution and the true X-values are unknown. Estimating the density of X in this case is called deconvolution density estimation. In the present context,f will denote a specific "point" estimate (in the sense that it is a single element of a function space) of f 0 , such as a MLE or posterior mean; while f will typically be used to discuss classes or function spaces of estimators in more generality.
As mentioned in the introduction, UQ arises by considering "uncertainty sets". Such sets are random through their dependence on X, but for brevity the notation here does not reflect this. As f 0 is a function, there are several ways to define uncertainty sets, each with different implications and advantages. Perhaps the most obvious examples are pointwise intervals C x = [L(x), U(x)], defined separately for each point x in the domain of f 0 . A common special case is when the intervals are symmetric about an estimatorf : The goal with pointwise intervals is to achieve (possibly only approximately or asymptotically) P (f (x) ∈ C x ) ≥ 1 − α for all x ∈ Dom (f 0 ), where 1 − α is the usual predetermined level. The meaning of the generic placeholders P and f depends on whether the inference is frequentist or Bayesian.
Pointwise intervals tend to be easy to implement, and also have nice theoretical properties for some (but not all) density estimation techniques. However, they are fundamentally limited in their ability to make "global" uncertainty statements. For a given level 1 − α, even if P (f (x) ∈ C x ) ≥ 1 − α ∀x, the stronger and perhaps more meaningful statement P (f (x) ∈ C x ∀x) ≥ 1 − α cannot necessarily be deduced. However, in some cases the "simultaneous" statement does hold, in which case the set C = {(x, y) : x ∈ Dom (f 0 ) , y ∈ C x } is called a confidence or credible band. Like pointwise intervals, bands are often centered at a specific estimator, although this need not be the case. For instance, Hall and Titterington [89] proposed to construct frequentist bands for univariate densities based on simultaneous multinomial confidence intervals for the probability masses within consecutive subintervals of the domain. Classical approximation results allowed them to construct such intervals without a specific density estimator, and they constructed the density bands by interpolation, with modifications depending on f 0 being once or twice differentiable. Such bands are not smooth, but were shown to have suitable coverage properties and optimal asymptotic widths without making further assumptions or restrictions. Hengartner and Stark [94] devised conservative confidence bands for shape-restricted densities (either monotonic or having ≤ k modes for known k, possibly relative to some weight function), also obtained without an estimator. To derive their bands, they started with a confidence region for the cdf F 0 comprised of distributions with densities having the same shape restriction as f 0 , and subsequently showed how to reduce the determination of the band for f 0 to a finite-dimensional linear program while conservatively preserving coverage probability.
If C x has the same width for all x in a band, then it is uniform and is therefore a L ∞ -ball in a suitable function space F. Thus, a uniform band is a special case of a more general idea: using a ball C in some pseudo-metric space of functions (F, d) as an uncertainty set. Analogously to pointwise intervals and bands, here the goal is to have P (f ∈ C) ≥ 1 − α. For choices of d such as the Hellinger or L 2 distances, such sets arise in nonparametric literature due to their satisfying theoretical properties. However, their practicality is somewhat limited: an L 2 -ball of functions, for instance, does not provide error bounds that can be easily visualized or understood, short of simply plotting a large number of functions from the ball alongsidef . For example, Szabó, van der Vaart and van Zanten [198] visualized L 2 -balls from an empirical Bayesian model for nonparametric regression by sampling functions from the posterior and plotting the 100 (1 − α) % of draws closest in the L 2 sense to the posterior mean. In a discussion of this paper, Low and Ma [133] suggested using this procedure to generate bands for the regression function whose boundaries are simply the pointwise maxima and minima of these closest posterior draws. Their simulations showed that the bands thus obtained performed quite well with respect to the framework of Cai, Low and Ma [20]. Beyond the aforementioned examples and those in Section 4.4.3, discussion of these "uncertainty balls" is limited, although Chapter 6 of Csörgo and Révész [35] contains theorems on the asymp-totic distributions of the L 2 -errors of several "classical" frequentist estimators (KDE's, histograms, and certain orthonormal basis expansions). These results could be relevant towards the construction of confidence balls, but this seems not to have been done in practice, likely due to their limited visual utility. On the other hand, if d is the L ∞ distance, one recovers the meaningful and easilyvisualized UQ given by bands, at the expense of nice theory in some cases. As before, for any pseudo-metric a common special case arises by taking the associated sets to be centered at some estimator: In frequentist inference, uncertainty quantification relies on confidence sets of any of the forms described above, typically obtained in practice using asymptotic arguments and/or bootstrapping. Confidence sets are designed in view of the "ground truth" X ∼ f 0 : letting P 0 denote the probability law associated with f 0 , the goal is to achieve coverage probability P 0 (C x f 0 (x)) ≥ 1 − α ∀x in the pointwise case, or P 0 (C f 0 ) ≥ 1 − α for bands or function balls. The Bayesian approach employs credible sets instead: using Π (· | X) as generic notation for the posterior over a space of densities f , the sets of interest are either pointwise intervals such that Π (f (x) ∈ C x | X) ≥ 1 − α ∀x, or bands/balls such that Π (f ∈ C | X) ≥ 1 − α. To facilitate validation and comparison, it is possible to view Bayesian methods through a frequentist lens by acknowledging the existence of the "ground truth" f 0 , in which case the posterior Π (· | X) is considered as a random measure due to its dependence on X ∼ P 0 . This leads to a similar interpretation of credible sets as functions of the data. It is then natural to ask if they achieve coverage in the aforementioned frequentist sense. Put another way, can credible sets also serve as valid confidence sets? The difficulty of answering this question for nonparametric Bayesian methods is well-known and an active area of research; discussion of coverage therefore tends to be easier in the frequentist paradigm.
Naturally, the best possible inference produces small sets with high coverage probability. To this end, the concepts of honesty and adaptivity are relevant. Consider a confidence set C n , where the subscript n is added to emphasize limiting behaviour with respect to sample size. The remainder of this section ignores the distinction between pointwise intervals, bands, and balls.
In the context of density estimation, C n is honest at level 1 − α if lim inf where F is once again a suitable function space of interest [95]. In words, an honest confidence set asymptotically achieves the desired coverage level uniformly over all possible "ground truths". Honesty is crucial for practical finite-sample inference: without it, it is possible in some cases for the infimum of coverage probability over F to be zero for any n [123]. The precise definitions and presentations underpinning the notion of adaptivity vary throughout the nonparametric literature [19, 69, 95, for instance].
The present discussion will focus as narrowly as possible on material relevant to density UQ. Suppose F = ∪ s∈S F s for some ordered index set S, where for s > t it holds that F s ⊆ F t and the elements of F s are smoother than those of F t \F s . Typically each subset F s is, say, a ball in a suitable Besov space of regularity s, with an associated minimax-optimal contraction rate r n (s) decreasing in both n and s [70]. Following Hoffmann and Nickl [95], call C n adaptive if there exists L > 0 such that, for all s ∈ S and for all n large enough, where the expectation is with respect to P 0 , and |C n | is the diameter of C n with respect to the metric by which it is defined (typically L 2 or L ∞ in this context). Naturally, less uncertainty is expected in the estimation of smoother functions. Adaptive confidence sets take advantage of this fact: they are optimal in the sense that, for every level of smoothness under consideration, their "maximum" expected size contracts at the optimal rate with respect to n. This is especially useful since the actual smoothness of the true density is likely to be unknown, and it does not have to be specified for adaptive C n . Unfortunately, adaptivity is an elusive goal which cannot be achieved without caveats, especially if honesty is also desired. As it pertains to density estimation, one of the earliest results to this effect came from Low [132], who considered pointwise inference for f 0 with uniformly bounded k th derivatives. They showed that, over this space, an honest confidence interval could achieve the worst-case contraction rate for any f 0 , regardless of its true smoothness. Confidence sets in L ∞ are particularly tricky: full adaptivity over finitely many smoothness levels can only be achieved by swapping the lim inf and inf in (2) (i.e. considering "dishonest" bands) [70,95], but Bull [15] showed that even with this modification it is still impossible to adapt over a continuous range of smoothness levels in the white noise model. Dümbgen [41] defined density confidence bands using a test statistic depending on the cdf values at order statistics and showed some adaptivity results based on local smoothness, but they are only valid over sets of shape-restricted (e.g. unimodal or monotonic) densities. Such difficulties are pervasive for all types of confidence sets: to achieve honesty and adaptivity together, it is necessary to assume additional restrictions on the smoothness classes under consideration or the functions therein. The theory shows that L 2 confidence sets are less restrictive in this regard than confidence bands, but neither are without their difficulties. Section 8.3 of Giné and Nickl's textbook [70] is an excellent and comprehensive discussion of these ideas, and the references in their notes provide further details. The authors explored adaptation theory for the white noise model, but noted that it can be made to apply to density estimation. Adaptivity and honesty are central to the theory of nonparametric inference, but to many practitioners they may ultimately be less important than the aforementioned visual aspect of UQ. Figure 1 shows how to graphically represent the uncertainty associated with a density estimate by plotting multiple estimators and corresponding UQ methods, all based on the same simulated dataset. The figure includes both frequentist and Bayesian inference methods, and demon- strates the differences between pointwise intervals and simultaneous bands (in particular, the latter are wider than the former, as one would expect to be necessary for this stronger type of inference). The methods shown in Figure 1 are among the many described in the following sections, each of which explores UQ in terms of the concepts described above. The figure itself is discussed in more detail in Section 9 as well as the supplementary material [141].

Kernel density estimators
KDE's are one of the most used and well-studied density estimation methods, at least in the frequentist literature. They are ubiquitous enough that their properties are arguably "common knowledge", receiving extensive documentation in textbooks, undergraduate course material, and review papers unto themselves [e.g. 28, whose review informs much of the discussion in this section]. Recall that a kernel density estimate for a density on R d is of the form where K is some (typically symmetric) kernel function and h is a bandwidth which controls smoothing, or bias/variance tradeoff. Asymptotic theory for estimation is typically based on h decaying to zero in some "big-O" relationship with n that optimizes MSE, or integrated MSE. Practical methods for obtaining h include cross-validation, plug-in methods, rules of thumb, and bootstrapping [102]. Note that, as the estimator is little more than a sample mean, it is equivalent to a conditional expectation with respect to the random measure F n , the empirical distribution function of X.

Pointwise inference
For pointwise inference, it is well-known that KDE's are asymptotically normal: withf as defined in (4), for all x ∈ R d it holds that Furthermore, the distributions for a finite collection of points are asymptotically independent [18]. Using this fact, it follows that the endpoints for pointwise confidence intervals should be roughly of the formf (x) In practice, intervals can be computed by estimating σ x : either by using one of its asymptoticallyequivalent formulae ("plugging in"f in place of f 0 [28,87] or replacing expectations with sample averages [96,57]) or bootstrapping [28]. Many papers also replace the standard normal quantiles with those of bootstrap t-statistics [e.g. 96,84]. Such studentized or "percentile-t" confidence intervals seem to be the most commonly-discussed in the literature, but any method of bootstrap confidence interval construction should be valid -for instance, Chen [28] discussed a bootstrap interval based entirely on the percentiles of absolute deviations. Hall and Kang [87] showed that re-calculating the bandwidth for each bootstrap sample does not provide worthwhile improvements to the accuracy of inference 1 , so computational difficulty is avoided by using the same bandwidth across all replications. In the univariate case with a compactly-supported kernel, Chen [27] considered the construction of confidence intervals based on empirical likelihood, a nonparametric analogue to the standard methods of profile log-likelihood ratios. The theory is similar to the parametric case: viewingf (x) as a sample mean of random variables where is the profile empirical log-likelihood ratio. This allows for pointwise intervals of the form {y : (y) ≤ c 1−α }, where c 1−α is the 1 − α quantile of the χ 2 1 distribution. Chen showed that such intervals have asymptotic performance comparable to the percentile-t bootstrap and can outperform it in simulations, especially with Bartlett correction. Note that everything discussed thus far is based on (5), which is centered about E f instead of f 0 . This poses a problem for inference when choosing an "optimal" bandwidth minimizing (integrated) MSE or some proxy. The aforementioned intervals provide asymptotically-correct coverage for the expectation off (x); in order for this to hold for f 0 (x) instead, the quantities in the numerator of (5) must be interchanged. This is only possible if the ratio of bias and asymptotic standard deviation goes to zero. However, the optimal asymptotic error rate is achieved when h is set proportional to some power of n such that the squared bias and variance decay at equal rates [28,159]. Thus, with an "optimal" bandwidth, the ratio above tends to a nonzero constant so that confidence intervals do not have the correct coverage properties 2 . There are two main ways to handle this. The first is undersmoothing, where a lower-than-optimal bandwidth is selected. In the univariate case, Horowitz [96] suggested taking h proportional to a higher power of n than usual, thereby allowing the squared bias to decay faster than the variance; while Hall [84] multiplied a rule-of-thumb bandwidth by a constant c ∈ (0, 1). Chen [27] used a version of the former when a confidence interval at only one point is desired: first obtaining kernel estimates of f 0 and its second derivative at the point with approximations of the (local) MSE-optimal bandwidths, then using these to estimate the bias. Chen suggested simply using the optimal bandwidth in confidence interval construction when the estimated relative bias is small, or an estimate of a coverage-optimal undersmoothing bandwidth when it is large. Because a smaller h means higher variance, confidence intervals based on undersmoothing may be wider than one would prefer [28]. The second method is therefore to estimate the bias term withb and replacef with the bias-corrected estimatorf −b. Assuming a kernel of order 3 r is used, the bias depends on the r th -order derivatives of f 0 , assuming these are bounded and continuous [26]. These derivatives can also be estimated with kernel methods, but require higher bandwidths for optimality than the density estimator itself; for this reason, traditional bias correction uses an oversmoothed KDE to obtainb [84,28]. Hall [84] showed through both asymptotics and simulations that undersmoothing with a higher-order kernel results in percentile-t bootstrap confidence intervals with smaller coverage errors than those based on such "oversmoothing" bias corrections. However, Calonico, Cattaneo and Farrell [21] developed a "robust" bias correction, in which the variance estimate used in confidence interval construction is modified to account for the correction, and showed that it can perform as well as undersmoothing-based intervals, with more robustness to bandwidth selection. Notably, their results hold when using the MSE-optimal bandwidth and second-order kernels for bothf and the bias correction, which they noted to be convenient automatic choices. While lower error rates and narrower intervals are desirable, it should be noted that the bias-corrected centersf −b are not necessarily nonnegative. Also note that the aforementioned results are based on a kernel with compact support.
Hall and Horowitz [86] devised another novel bootstrap approach. Starting with the original KDEf , they repeatedly drew "bootstrap" samples fromf (some papers call this the smoothed bootstrap, e.g. [148]) and used these to create Gaussian plug-in intervals at each point x in the domain, with some nominal confidence level 1 − β(x). They set each β(x) to ensure that the actual coverage (as estimated with bootstrap replicates) achieved the desired level 1−α. Lettinĝ β δ be the ξ-quantile, for some low ξ, of the β(x)-values over a fine grid of x's with edge width δ, they tookβ as the limit ofβ δ as δ → 0 and finally used standard normal quantiles z 1−β/2 to construct pointwise plug-in intervals centered atf . Their theory and simulation studies focused on nonparametric regression, and in this case they showed asymptotic pointwise coverage of at least 1 − α at roughly (1 − ξ)100% of points in the domain. However, they suggested that all results would translate to KDE's as well.
Similar ideas to those discussed above extend to situations besides a single i.i.d. sample X. Louani [130] derived theoretical results for the case of randomly right-censored data: when there exists another sample Y of size n, and only Z i = min {X i , Y i } and 1 (X i ≤ Y i ) are observed. They considered a modified KDE defined by integrating with respect to a Kaplan-Meier estimate of the cdf, rather than the usual edf F n . Relaxing some of the conditions required for previous similar results [142,127] (in particular, assuming only one bounded continuous derivative of f 0 ), Louani showed pointwise asymptotic normality for this estimator when using a kernel of compact support. The asymptotic standard deviation is similar to the left-side factor in (5), but with an extra factor of 1 − G(x), where G is the cdf of Y . Another theoretical extension came from Giné and Mason [68], who considered kernel-based U-statistic estimators for the densities of functions g (X 1 , . . . , X m ) with m > 1. Analogously to other results described here, they derived central limit theorems for such estimators, noting the bias can be eliminated if the bandwidth decays appropriately in the special case where g is additive in its arguments (see also [185] for related results). Schick and Wefelmeyer [186] studied the case when the data is a linear process: X i = ∞ s=0 a s i−s for zero-mean s and absolutely convergent {a s }. The asymptotic mean in their limiting normal distribution is the convolution of the true density with the kernel. For a more practically-oriented extension, Wang and Wertelecki [214] considered data observed with rounding errors. They proposed a multi-step process to estimate the density of X: first deriving a rough, convolution-based estimate for the cdf of the non-rounded data; then using this to generate a sample from the estimated distribution of the rounding errors; and finally subtracting the simulated errors from the rounded data and constructing a KDE from the resulting quantities. Because the procedure involves simulated sampling, it naturally lends itself to bootstrap-style uncertainty quantification, which the authors showed in the form of pointwise confidence intervals for real data.
Further results exist for deconvolution density estimation. In this case, it is common to use a specialized kernel deconvolution density estimator, replacing the "standard" kernel in (4) with a deconvolution kernel : the Fourier transform of the ratio between the characteristic functions of some kernel function and the known error distribution. Fan [49] provided asymptotic normality results for such estimators in two cases: ordinary smooth deconvolution (where the tails of the error characteristic function decay at a polynomial rate) and supersmooth deconvolution (where they decay exponentially). In addition to the usual corollary of bias removal with undersmoothing, Fan also showed that the asymptotic variance (which depends on the true unknown density of the noisy data in the ordinary smooth case, and does not have a general expression in the supersmooth case) in the pivotal quantity could be replaced by a sample-dependent term: either the sum of squared deconvolution kernel values, or their sample variance (only the former was considered for the supersmooth case). Zhang [217] showed similar results for a similar estimator. Fan and Liu [51] later relaxed the conditions assumed in [49] for the ordinary smooth case, allowing the asymptotic normality results to apply to a wider variety of commonly-used error distributions. In a later paper, van Es and Uh [205] showed for a subset of supersmooth error densities that, under certain conditions on the kernel, the asymptotic variance of the estimator does not depend on the data or the true density. They noted that this allows in this case for the construction of pointwise confidence intervals without data-dependent standardization, although they do not address the issue of bias. Further asymptotic normality results with known variances are given in van Es and Uh [204] and Uh [201] for somewhat more general kernels and subsets of supersmooth error densities. Masry [139] generalized the classical results of Fan and Zhang to inference on the joint density of stationary process data based on observations with i.i.d. additive noise. They showed asymptotic normality for various types of mixing with both ordinary smooth and supersmooth error distributions, but only considered undersmoothing-based bias removal and sample-based standardization for the former. For both i.i.d. and strongly-mixing data, Zu [219] proved asymptotic normality of the estimator when the noise is logarithmic chi-squared, a case not covered by the assumptions in the previous literature. The asymptotic variance in this case depends once again on the true density of the noisy data; Zu suggested that it could be consistently estimated by a classical KDE to facilitate construction of (biased) confidence intervals.
Returning once more to the case of an i.i.d. sample, a final extension is the adaptive kernel density estimator implemented in Stata by Van Kerm [206]. This method starts with a "pilot" density estimate of fixed bandwidth; its values at the sample points are used to assign individual bandwidths to each of the kernels in (4), which can also be given individual weights. These variable bandwidths reduce variance in regions where data is sparse, and bias in regions where it is dense. As with the normal KDE, it is an easy matter to get a plug-in estimate of standard error; this is how Van Kerm's software implements simple pointwise inference.

Simultaneous inference
Moving beyond the pointwise case, consider simultaneous UQ on the entire support or some subset thereof. The aforementioned undersmoothing and/or bias-correction principles still apply, and the rest of this section will largely take the application of such principles for granted. Bickel and Rosenblatt [9] provided perhaps the first results to this effect for univariate KDE's, showing that, under suitable technical conditions, for suitable sequences A n and d n (the latter being a function of the former), with the supremum taken over a compact interval (say, [0, 1] w.l.o.g.) on which f 0 is bounded away from 0. They further showed that with moderate undersmoothing, it is possible to replace Ef and √ f 0 in (6) by f 0 and f , respectively, Note that using a differently-scaled A n , the factor of 2 in the exponent of the limiting distribution can be eliminated, thereby turning it into the c.d.f. of a standard Gumbel random variable [e.g. 69, who derived such a result for an undersmoothed data-driven bandwidth choice and plug-in estimator for √ f 0 ; see Section 4.4 for further details]. In either case, the limiting probability law is of the extreme value or "double exponential" form. Rosenblatt [177] expanded upon Bickel and Rosenblatt's results, slightly relaxing the conditions under which (6) holds in the univariate case and generalizing to the multivariate case. However, their multivariate results required rather strong restrictions on the bandwidth, differentiability of f 0 , and moments of K. Rio [171] gave another rather technical result on the limiting distributions of suprema over closed subsets of (0, 1) d for d-dimensional densities. Additional generalizations of the Bickel-Rosenblatt results in the univariate case were provided by Giné, Koltchinskii and Sakhanenko [66], who gave conditions for results similar to (6) to hold with a different weight function Ψ replacing the factor √ f 0 −1 or the supremum taken over a data-dependent set. The same authors provided further theory to this end in a companion paper, in which they considered suprema over the whole real line [67]. Sakhanenko [183] further modified and extended these results to multivariate densities. Using moderate deviations principles, Mokkadem and Pelletier [143] showed that it is actually possible to construct Bickel/Rosenblatt-style confidence bands with asymptotic coverage level equal to 1 by using separate bandwidths for the KDE's in the mean and variance estimates (i.e. in the quantities used to define, respectively, the centre and margins of the bands). Further technical refinements allowed them to achieve this with narrower bands, at the expense of a slower convergence rate. While remarkable, these results have not been applied in practice in literature to date. A drawback of using these asymptotics in practice is that convergence to the extreme value distribution is known to be very slow [e.g. 83]. Thus, it may be advisable to use bootstrapping for confidence bands. In what follows, let f * denote a KDE based on a bootstrap resample of X. Hall [85] considered bands (over compact intervals) of the type This band is based on the "studentized" quantity f − Ef / Ef , but differs from others by not using any kind of estimator for the denominator. Hall found in simulations that this interval had better coverage for Ef than a bias-corrected translation had for f 0 , presumably due to inaccuracy in the bias correction. Hall and Owen [88] recommended the bootstrap to construct simultaneous confidence bands on [0, 1] with profile empirical likelihood methods. Recalling the notation for empirical likelihood in Section 3.1, they found an extreme value limiting result for sup x E f (x) similar to (6), but recommended using percentile bootstrap methods to find a suitable boundĉ for a band of the form {f : (f (x)) ≤ĉ ∀x ∈ [0, 1]}. The technicalities and variations they considered are too cumbersome to discuss further here; see [88] for the full details. They found their intervals to be disappointingly wide when applied to real data, but suspected that this was due to the inherent variability of the density estimation itself. Neumann [148] gave quite general theoretical results for uniform-width percentile bootstrap bands of the formf ± t * α , where t * α is the bootstrap quantile of sup x |f * − Ef * |. Their results are valid for multivariate densities, suprema over all of R d , and weakly-dependent data. Neumann used compactly-supported kernels and the smoothed bootstrap: generating X * from a possibly-different KDE based on the original sample, rather than from the empirical distribution. In a recent paper, Cheng and Chen [29] used the debiased estimator of Calonico, Cattaneo and Farrell [21] to derive asymptotically correct bands, via the bootstrap, of either uniform width (using quantiles of sup x f * −f ) or variable width (using quantiles of sup x f * −f /σ * multiplied byσ(·)), where the bootstrap density and associated variance estimates were all computed based on the bias-correction approach. Their results extend to the multivariate case and assume a compactly-supported f 0 . Their simulation study showed that their bands achieved better coverage and narrower width than those based on the standard KDE, although some undercoverage still occurred for small samples without undersmoothing.
Yeh [215] used the bootstrap to create a rather novel type of confidence band. Generating a large number of KDE's from bootstrap samples of X, they retained the 100 (1 − α) % of them with the largest curve depth (a way of ranking functions "from the centre out" based on some distance from a central curve, in this case the KDEf ). Simulation studies showed that such bands had reasonable performance compared to the asymptotic methods discussed in this section.
As is the case for pointwise inference, for simultaneous bands there are analogous results for deconvolution KDE's, described by Bissantz et al. [10]. A necessary assumption for these results is that the characteristic function of the error density decays as t −β for large |t| and some known constant β > 0. Their asymptotic results for bands over a compact interval are nearly equivalent to those derived from (6), although in the asymptotic standard deviation they divided by an extra factor of h β and replacedf withĝ, where the latter is an estimate of the density g of the observed Y -values (a standard KDE suffices). However, they noted slightly better coverage probability (especially in terms of robustness to model misspecification) can be achieved with percentile bootstrap confidence bands of variable width based on the quantiles of f * (x) −f (x) /ĝ(x), where f * is a deconvolution estimator from a bootstrap sample of the observed noisy data.

Miscellaneous
Aside from some technical considerations in the previous section, not much consideration is given to the support of the true density. Indeed, the issue of KDE boundary bias for f 0 of restricted support is well-known and several mitigating strategies exist [e.g. 101, and discussion therein], but this is rarely discussed in the context of uncertainty quantification. One exception is given by Bouezmarni and Rombouts [11], who considered the gamma kernel estimator for time series data on [0, ∞). The gamma kernel has a shape parameter varying with x and leads to an estimator (asymptotically) free of boundary bias. The authors showed pointwise asymptotic normality analogously to the results discussed above, based on the behaviour of the gamma scale parameter which acts as a bandwidth. In practice, it can be selected by cross-validation; the authors did so and constructed confidence intervals for real data based on their asymptotic results. This section concludes by discussing a paper on large-sample Bayesian methods by Lo [126]. The key observation for this discussion is to recall that one can view the KDE (4) as a (conditional) expectation with respect to F n . Lo's ideas are based on replacing F n in this expectation with a different random distribution F conditional on X. One such example is the empirical distribution of a bootstrap sample; this is equivalent to a probability measure with atoms at the sample values and weights randomly selected from {1/n, 2/n, . . . , 1}. Lo also considered the Bayesian bootstrap, where the weights on the atoms are drawn from a uniform Dirichlet distribution [180]. This is equivalent to a draw from the posterior when X ∼ F , and F is given an improper Dirichlet Process prior with zero base measure. Finally, Lo generalized this to allow for a non-zero base measure in the DP prior. They showed an asymptotic result analogous to (6) for all three aforementioned KDE variants, where the limit holds for f 0 -almost all X. This allowed them to use extreme value asymptotics to derive appropriate Bayesian bands for f centred at the usual KDEf . In practice one may prefer not to do this, given the substantial developments in Bayesian computation since the time of Lo's paper.

Adaptive basis expansion methods
This section considers estimates for f 0 of the form where the B j,K 's are a suitable set of fixed nonnegative "basis functions" for a given K. . The coefficient vector b ∈ R K is constrained such that f is a valid density. The remainder of this section will useb to denote the coefficients associated with a specific estimatorf . The dimensionality K is of particular interest, serving as a smoothing parameter that controls the bias-variance tradeoff of the estimator. The basis functions corresponding to higher K-values are typically "narrower", allowing for more intricate shape detail to be captured in estimates. For instance, taking a high K-value for the histogram corresponds to using a larger number of narrower bins. Conversely, a value that is too high will result in a high-variance estimator that is unacceptably noisy. In general, higher K-values are required for larger samples to capture the true density.
One can choose K in a data-driven way. Many theoretical results for this approach rely on K increasing with n, usually appealing to some "big-O" conditions on its growth [e.g. 3,69,200]. In practice, a value could be chosen by cross-validation [120], changepoint methods [81], or appealing to known asymptotic theory [3, who derived nice properties for a method with K = o (n/ log n) and then simply used K = n/ log n in a simulation study]. In the theoretical Bayesian context, Rousseau and Szabó [179] considered maximum marginal likelihood (MML) estimates for K, marginalizing over a prior for b. Further discussion of such ideas is beyond the scope of this paper; see van de Wiel, Te Beest and Münch [202] for details on practical implementation of MML.
In the Bayesian literature, methods involving a data-driven choice of a single K-value are often called empirical [179]. All frequentist methods discussed in this section are of this type. On the Bayesian side, such methods contrast with hierarchical ones, which use a suitable discrete prior on K and allow it to "vary" [179]. In general terms, the K-values obtained with any approach will reflect what is necessary to capture the true shape of f 0 . It is in this respect that these estimators are said to be "adaptive".

Histograms
Perhaps the simplest of density estimators, a histogram (sometimes referred to as an empirical density [169]) is piecewise constant over some division of the support into disjoint subsets, or "bins". In the most general form with countably many bins {A j }, it can be written as with the constants c j chosen to ensure that the estimator is a valid density. The regularity of a histogram is controlled by adjusting the sizes of the bins. For instance, a very common form for the univariate case iŝ where n j is the number of sample values in the interval j−1 K , j K and K provides the needed regularity control. Assuming a bounded support, say [0, 1], the sum in (9) is over j = 1, . . . , K and is equivalent to (7) using a basis of indicator functions: Temporarily ignoring the notion of empirical or hierarchical approaches to K, suppose for now that it is fixed at some arbitrary value irrespective of everything else. Then the histogram simply becomes a problem of multinomial inference: the coefficient b j in (7) is an estimate of the probability that X ∼ f 0 falls in the j th bin, say p j . In this respect, the "traditional" histogram, whereb j = n j /n as in (9), is a MLE. Here the object of inferential interest is not necessarily f 0 , but rather the so-called theoretical histogram f [194], a piecewise-constant density equal to Kp j in the j th bin. With this view, (piecewise-constant) pointwise intervals arise by considering the single binomial proportion p j , and simultaneous bands by considering the vector of multinomial probabilities p = (p 1 , . . . , p K ). Frequentist and Bayesian methods for both are well-studied; see Vermeesch [207] for some practical applications to histograms.

Simultaneous frequentist inference
To discuss inference for f 0 itself, it is necessary to return to the adaptive paradigm. Much of the frequentist literature for histogram density UQ is theoretical and predates developments such as the bootstrap, relying on extreme value asymptotics similar to those for KDE's. One of the first such papers is by Smirnov [194]. They derived a limiting result much like (6) for the normalized quantity where the supremum is over a compact interval on which f 0 is bounded away from 0 and has total mass less than 1. Smirnov claimed that it was not possible to replace f in the denominator with f 0 due to the systematic difference between them dominating the error. However, they stated that it is possible to do so by replacing the histogram with a frequency polygon (a linear interpolation between the histogram values at the sample points) and imposing some extra conditions on the relationship between K and n. Although Smirnov did not provide proofs for these results, they will be shown later to be a special case of proven results for wavelets [69]. For f 0 supported on a compact interval, Révész [169] was able to prove a somewhat modified extreme value limit for the distribution of a quantity similar to (10), exceptf can be either the traditional histogram or a frequency polygon (with a slightly different interpolation scheme than that considered by Smirnov), f 0 replaces f in the denominator, and the supremum is taken over an interval converging to the whole support of f 0 . Révész also derived a similar result with the absolute value removed from the supremum. Further results to this effect were given by Freedman and Diaconis [58]. For everywhere-positive densities with a unique maximum, they considered a quantity similar to (10) without the absolute value (i.e. considering only the maximum positive deviation, although they claimed their proofs can be adapted to the maximum absolute deviation), the supremum taken over the whole real line, and the factor of f (x) in the denominator replaced by the maximal value of f 0 (a fixed constant). Their limiting results are quite similar to those of Révész. The three papers just discussed allow for (using a moderate amount of algebra) the construction of asymptotically correct simultaneous confidence bands for univariate densities satisfying suitable technical conditions, provided K increases at a suitably fast rate with respect to n (this roughly corresponds to the notion of undersmoothing discussed in Section 3). However, these papers did not concern themselves with the practicality of these ideas applied to actual data. It seems reasonable to suspect that slow convergence could be an issue which could be rectified with bootstrap methods, as was the case with KDE's.

Pointwise frequentist inference
Consider now the issue of frequentist pointwise intervals. For the "traditional" histogram (of the form (9)), Laloë and Servien [115] showed conditions on K and f 0 (x) for the quantity to have a limiting distribution, which they proved to be a standard Gaussian when it does exist. Their proof applied the Lindberg-Feller Central Limit Theorem to the histogram values (recall that these are scaled binomial random variables for each K). Their conditions were more general (but also more technical) than those in many other papers: in particular, they did not even require f 0 to be continuous. Some literature provides results for non-traditional histogram variants with non-uniform bin spacing. For univariate densities, Kim and Van Ryzin [106] showed pointwise asymptotic normality for a histogram with randomlyspaced bins. They required bin spacings to meet certain conditions for their results to hold; one valid option is to fix the number of observations in each bin and determine their widths by the spacings of the sample's order statistics.
The same authors showed analogous results for an extension to the bivariate case [107]. Another variant for the univariate case is the maximum entropy histogram estimator (MEHE), which works by dividing the real line into K ≤ n subintervals (with the first and K th respectively extending to −∞ and +∞ and f having some suitable tail behaviour there) and choosing the spacing of their boundaries to maximize entropy subject to preservation of sample means and mass in the subintervals. Rodriguez and Van Ryzin [175] considered this estimator and a "symmetrized" variant and showed pointwise asymptotic normality of the quantity (11) for both. Their conditions on continuity and growth of the number of subintervals were slightly different for the symmetrized version, and the limiting law does not concentrate around zero as it does for the regular MEHE, presumably necessitating bias correction. Stadtmüller [195] considered asymptotics for yet another variant of the form (9), first considered by Gawronski and Stadtmüller [59], in which the indicator functions in the summands are replaced by the values of lattice distributions to yield a smoothed estimate.
They gave a few suitable examples: Note that the lattice distributions do not necessarily constitute probability distributions with respect to x. Thus, density estimators of this type may not integrate to 1, although some examples presented in [59,195] certainly will. For these estimators, Stadtmüller [195] showed pointwise asymptotic normality of the quantity where σ 2 depends on the lattice distributions. Additionally, they showed extreme value limiting results [somewhat similar in form to those in 9, as usual] for the supremum of this quantity (as well as the supremum of its absolute value) over compact intervals, under some regularity conditions on f 0 and the lattice distributions. They also noted that it is possible, as usual, to replace Ef by f 0 (thereby achieving correct asymptotic coverage for confidence intervals or bands) by undersmoothing -in this case, increasing K at a higher-than-optimal rate with respect to n.

A Bayesian approach
Recently, Rousseau and Szabó [179] discussed theory for Bayesian UQ of histogram estimators, assuming univariate f 0 supported on a compact interval. For this, return to the form (7), with the basis functions equal to indicators for equally-spaced bins. Rousseau and Szabó considered credible sets of the form (1), where d is the Hellinger distance andf is a suitable centering point such as the posterior mean. They showed that, under some regularity conditions on f 0 (it must be bounded away from zero and sufficiently smooth, and satisfy a "general polished tail assumption" defined by the authors and briefly described below) and the prior (a suitable K-dimensional Dirichlet for b | K, and others omitted here for brevity), posterior credible sets of this type have arbitrarily high asymptotic frequentist coverage if their diameter is increased by an appropriate factor. In mathematical terms, for any ∈ (0, 1), there exists L > 0 such that where In fact, they showed the stronger honesty result that this limit inferior holds uniformly over a certain class of functions, and that the uninflated credible sets are also almost adaptive over this class (save for a logarithmic factor in the diameter contraction rate). The densities comprising this class are those in an arbitrary union of Hölder balls of equal radius and regularities in (1/2, 1]. They must also satisfy the aforementioned general polished tail assumption, which essentially controls their high-resolution behaviour. Further ideas of this type will emerge in Section 4.4. Rousseau and Szabó's results hold for both the empirical and hierarchical approaches to K. For the latter case, a geometric or Poisson prior for K satisfies the relevant conditions. The authors noted that the "blow-up factor" of √ log n is unfortunate, but they believe it is necessary to prevent coverage from decaying to zero in certain cases. Although it is quite pleasant to have such theoretical guarantees, it may be a challenge to put them towards a practical end due to the blow-up factor. Given an MCMC method to generate posterior simulations from this model, a credible set can be roughly visualized by plotting the (1 − α)100% of f draws closest in Hellinger distance tof , but plotting draws from the blown-up set is another matter since we are not aware of any way to estimate L .
Given the popularity of histograms, it is somewhat surprising that practical implementations and demonstrations of UQ for them appear so rare in the literature. For practitioners thorough enough to quantify errors in density estimation, it is perhaps reasonable to conclude that histograms have been superseded by KDE's and other methods that produce smooth estimates. Certainly, smoothness is advantageous for interpretation, especially when one wishes to account for uncertainty.
The following sections will contain a few more results which are applicable to histograms, arising as special cases of other methods.

Bernstein polynomials
One of the earliest non-histogram methods of the type (7) was proposed by Vitale [208] for densities supported on [0, 1]. They took The basis functions are Beta densities with integer parameters (equivalently, scaled Bernstein polynomials), and the coefficients are equal to the proportion of sample values in each interval j−1 K , j K . In this respect, Vitale's estimator is essentially a smoothed histogram. In fact, aside from a different scaling factor it is almost the same as the lattice-smoothed histogram of Gawronski and Stadtmüller [59]; it therefore seems reasonable to suspect that one could derive confidence bands from similar asymptotic arguments as Stadtmüller [195]. An equivalent way to interpret this estimator is as a mixture of Beta densities.
Babu, Canty and Chaubey [3] provided pointwise asymptotic normality results for this estimator under mild conditions, from which one can presumably derive expressions for approximate pointwise confidence intervals (subject to the usual handling of bias and variance terms). At interior points x ∈ (0, 1), Vitale's estimator is quite similar to the KDE: optimal MSE behaviour occurs with K such that the asymptotic orders of variance and squared bias match [208]. With this choice, confidence intervals -based on either plug-in or bootstrap methods -will not have the correct asymptotic coverage, concentrating around E f (x) instead of f 0 (x) [3]. "Correct" intervals could be obtained by undersmoothing, choosing a higher-than-optimal K [asymptotic conditions in 3]. Alternatively, noting that the bias term is a known function of the first two derivatives of f 0 , it may be reasonable to estimate a bias correction with plug-in methods, again in analogy with KDE's. Tenbusch [200] proved analogous results for Vitale-style estimates of bivariate densities defined on triangular or rectangular regions, with some generalizations for the latter provided by Babu and Chaubey [4]. As they are quite similar to the univariate case, they are not repeated here. The aforementioned pointwise results are valid for interior points x, but these estimators are known to have different asymptotic behaviour at the boundaries [208,200] and so UQ may also work differently there.
There are methods besides Vitale's for estimating a density with Bernstein polynomials. For another frequentist method, take the coefficientsb to be MLE's. Guan [81] claimed pointwise asymptotic normality results for this approach, but it is not clear how to turn these results into appropriate pointwise intervals.
Theory for Bayesian estimates of this type typically depends on the idea of viewing the coefficients b as increments of some unknown c.d.f.
s estimator fits this framework for F equal to the edf of X). To that end, Petrone [160,161] considered a hierarchical Bayesian formulation with a discrete prior on K and a Dirichlet Process prior on F . For practical implementation, they devised an equivalent formulation making use of the aforementioned "mixture-of-betas" interpretation. They introduced a vector of latent variables Y = (Y 1 , . . . , Y n ) ∼ F , which provide "mixture labels" for the samples conditional on K: [161] for more details on the properties of this construction, as well as a Gibbs sampling algorithm for posterior inference. In principle, this formulation gives everything needed to obtain, at the very least, pointwise credible intervals -indeed, Petrone did so in these papers. Note that practical implementations of this model require the truncation of the prior for K for computations to be possible. This has theoretical implications, but is not an issue in practice provided the maximum value for K is reasonably high. Petrone and Veronese [162] generalized these ideas for data not necessarily in [0, 1]; see Section 7.3.1 for elaboration on this.
Following the analogous KDE ideas in Lo [126], Ghosal [64] considered an alternative "posterior" based on a (generalized) Bayesian bootstrap approach, where it is assumed that X ∼ F and F is a random distribution from a Dirichlet Process with base measure α(·)+ δ Xi . They conjectured pointwise asymptotic normality (concentrating around the Bernstein density with coefficients from F = F 0 , rather than f 0 itself), but could not adapt the results in Lo [126] to prove this.

B-splines
B-splines are another option for the basis functions in (7). They are piecewise polynomials, characterized by a set of points in the domain called knots at which the values of the piecewise functions and a certain number of their derivatives must match. The number of basis functions K depends on both the number of knots and the polynomial degree chosen. Cubic splines are the most common choice, but there are others: for instance, a Bernstein basis of size K is a special case of B-splines of degree K − 1 and no interior knots [45]. Using interior knots allows B-splines to be sharper-peaked in general than Bernstein polynomials, even at lower degrees. Literature about B-splines abounds; see Dias [40] for one of many introductions.
Although there will be plenty of discussion of splines in Sections 5.1 and 6, their use in estimators of the form (7) is limited in the literature. UQ for such estimators appears limited to practically-oriented Bayesian papers, although we suspect that it may be possible to translate some of the theory pertaining to histograms or Bernstein polynomials to this type of basis. Note that it is necessary to normalize each B-spline so it integrates to 1, thereby preserving the "mixture-of-basis-densities" view of (7). As another technical note, here attention is restricted to compactly-supported densities and estimators.
Shen and Ghosal [190] considered the hierarchical Bayesian setup (as defined at the beginning of Section 4), with K having a suitable discrete prior and b | K having a conditional K-dimensional Dirichlet prior. Like most practicallyoriented papers with a hierarchical framework, they noted that the prior on K must be truncated for computation. They gave a closed-form expression for the posterior mean of f and claimed similar expressions existed for higher posterior moments, allowing them to construct approximate credible intervals (presumably by a Gaussian-style "mean ± 2*standard deviation" approximation). Their expression for the posterior mean is a ratio of sums, each of which has a number of terms increasing exponentially in n for splines of degree ≥ 1. Thus, the authors suggested randomly sampling a reasonable number of summands, say 3000, to approximate it. This is not an issue for splines of degree 0, as many terms cancel out due to the basis functions having non-overlapping supports. In this case, the estimator is simply a histogram and simplicity arises at the expense of smoothness. Shen and Ghosal found in their simulation study that the credible intervals were more appealing with cubic splines than with constant ones, although both had some difficulty capturing some of the true density's shape.
Edwards  31] and B-splines for estimating the spectral density of a stationary time series. This use case differs from the probability density estimation considered here, but some of their ideas are nevertheless interesting for our purposes. In addition to pointwise credible intervals, they also considered simultaneous bands generated from median absolute deviations: wheref is the posterior median, the pointwise MAD's are taken over MCMC draws, and ξ α is the (1 − α)-quantile (obtained via MCMC draws) of In a simulation study, they found that such bands had vastly superior coverage using B-splines instead of the Bernstein basis. Pointwise intervals for B-splines tended to be wider, but both these and simultaneous bands captured intricate shape details more effectively than when the Bernstein basis was used. This is because the compact support of B-splines allows them to more effectively capture sharp peaks. The authors noted, however, that B-splines resulted in longer computation times than the Bernstein basis. Lopes and Dias [129] used a semiparametric Bayesian model for densities, combining a mixture of normalized B-splines (with Dirichlet-distributed coefficients) with a mixture of parametric densities. As usual, a straightforward Gibbs sampler allowed them to obtain pointwise credible intervals from MCMC output.

Orthonormal wavelets
Briefly, the idea behind estimation with orthonormal wavelets is to express a square-integrable function f in the form where ψ kj (x) = 2 k/2 ψ 2 k x − j for some suitable function ψ called the mother wavelet. The mother wavelet is such that {ψ kj } is an orthonormal basis of L 2 (R), so that c kj = fψ kj . For most of the literature discussed in this section, it can be assumed unless otherwise noted that the domain is indeed all of R. However, in some cases it is desirable to modify the wavelets so that they form a basis of, say, L 2 ([0, 1]), and there are multiple approaches to this [e.g. 33].
It is often more convenient to express f as where φ kj (x) = 2 k/2 φ 2 k x − j for a scaling function or father wavelet φ such that {φ k0j , ψ kj : j ∈ Z, k ≥ k 0 } is also an orthonormal basis for L 2 (R). The number k 0 corresponds to the "coarsest" level of detail under consideration. In most literature explored below, it is either left arbitrary or set to 0 when the domain is R. When modifying the wavelets for use on [0, 1], the dimensionality of the basis will depend on k 0 and, for some methods, the "regularity" of ψ [33]. In this setting, k 0 may therefore be chosen to provide an appropriate set of basis functions [e.g. 14, 25]. The simplest wavelet example is the Haar wavelet, where φ = 1 [0,1) and ψ = 1 [0,1/2) − 1 [1/2,1) . In general, φ and ψ must be selected to mutually satisfy certain functional equations. For further detail on wavelet theory and examples, refer to Kaiser's excellent book on the subject [103].
In practice, to estimate a density with wavelets, one must truncate the sum over k in the second term of (15) to some upper limit K. In this respect, K is a bandwidth or "resolution": higher values introduce thinner wavelets into the sum that capture finer details, thereby reducing bias and increasing variance. In this respect, wavelets differ from other basis expansion methods in which the shapes of the basis functions themselves change with K. As previously mentioned, the coefficients in the wavelet expansion are simply inner products between the density and the basis functions. Thus, to obtain a point estimatef , the natural choice is to estimate d j and c kj by their empirical versions: the sample means of φ k0j (X) and ψ kj (X), respectively. It is now clear that density estimators based on the Haar wavelet are simply histograms with evenly-spaced bins.

Frequentist L ∞ inference
Giné and Nickl [69] derived some theoretical results for confidence bands over a compact subinterval, taken w.l.o.g to be [0, 1], by treating certain types of wavelet estimators in a unified framework with KDE's. In their approach, X is split into two subsamples: one of which is used for a data-driven bandwidth selection procedure (as always, their arguments involved undersmoothing to ensure correct coverage), with the other used to obtain the estimatef with this bandwidth. Letting K denote the number obtained from the bandwidth selection procedure (the details of which can be read in [69]), their framework encompasses both 1. kernel density estimators with kernel κ(x, y) = κ(x − y) and bandwidth 2 −K ; and 2. wavelet estimators in the form of (15) with k 0 = 0, and the sum over k in the second term truncated to K terms. To unify these estimators with KDE's, the authors invoked a projection kernel defined in terms of the father wavelet: For a final piece of notation, let c = sup x κ 2 (x, y)dy. Giné for suitable (known) sequences A n and d n . Just as in Section 3.2, it is straightforward to use this limit to get asymptotically-correct confidence bands. The authors further showed that these bands are honest and nearly 4 adaptive in a range of Hölder balls over all but a nowhere-dense (w.r.t. the Hölder norm) subset of the function space. However, they noted that their work is theoretically oriented and therefore cautioned against using these bands without assessing their finite-sample performance. Furthermore, the only wavelets they showed to fit into their framework were the Battle-Lemarié wavelets of order 1, 2, 3, and 4. The scaling function for the Battle-Lemarié wavelet of order r is a B-spline of order r [36], so for r = 1 it reduces to the Haar wavelet. Because (16) generalizes the histogram results first discussed by Smirnov [194] and the KDE results shown by Bickel and Rosenblatt [9], results of this type are often called Smirnov-Bickel-Rosenblatt theorems. Bull [16] showed that a Smirnov-Bickel-Rosenblatt result holds in the white noise model using symlets and Daubechies wavelets. The Daubechies wavelet of order r is a Haar wavelet for r = 1 and has increasing regularity for higher orders, but unlike the Battle-Lemarié wavelet it has the advantage of being compactly supported [16,103]. Bull verified their results for orders 6 ≤ r ≤ 20, using bases on both R and [0, 1]. Although the white noise model is not the focus of this review, they noted that these results could translate to the density estimation context via some of the Gaussian process theory in [69]. Indeed, the notion of equivalence between the white noise model and density estimation is established [e.g. 154], but the details are beyond the scope of this paper.
It was noted above that a Smirnov-Bickel-Rosenblatt confidence band could achieve honesty and adaptivity under certain conditions and restrictions on the function space. More broadly, discussion of these concepts often uses wavelet theory as a starting point, due to the nice theoretical properties of an orthonormal basis. Hoffmann and Nickl [95] considered another approach to ensuring the existence of adaptive and honest confidence bands in finitely many nested Hölder balls: removing subsets of functions from the lower-regularity ones to ensure "separation" from the smoother classes. By connecting this idea to hypothesis tests for the smoothness of f 0 , they showed that, in the case of finitely many smoothness levels, such separation conditions are necessary and sufficient for the existence of honest and adaptive bands, and that these conditions are weaker than those imposed by [69]. The constructive part of their argument involved a uniform band centered at an estimator satisfying certain properties; their paper and the references therein suggested that a wavelet estimator would be a good choice for both L 2 (R) and L 2 ([0, 1]). Unfortunately, the radii of these bands depend on properties of the Hölder balls that are unlikely to be known in practice, rendering application implausible. Nevertheless, these results are useful to inform theoretical discussion of the behaviour of confidence sets. Bull [15] considered inference on a union of Hölder balls with diameters and regularities both varying over a continuum. The conditions they imposed on the function sets are similar to those in [69], but somewhat weaker. Specifically, they required the densities under consideration to be self-similar. Briefly, self-similarity is a property of a function's wavelet expansion ensuring that it exhibits similar regularity at both small and large scales. Note that the general polished tail condition of [179] (see Section 4.1.3) is a generalization of this. Bull showed that this restriction excludes only a "negligible" set of functions in both the topological and probabilistic 5 sense, and that it is necessary and sufficient to achieve honest and adaptive confidence bands over a continuous union of Hölder balls. Refer to Sections 8.3.3 -8.3.4 of [70] for a more in-depth discussion of the role self-similarity plays in nonparametric inference.
Bull described a rather complex procedure to construct such a uniform band centered at a truncated empirical wavelet estimator, using Daubechies wavelets or symlets of order 6 ≤ r ≤ 20, modified to form a basis of L 2 ([0, 1]). The procedure exploits self-similarity to estimate the true smoothness of f 0 . Unlike the construction of Giné and Nickl [69], it does not require sample-splitting.

A practical approach
None of the literature discussed thus far in this section concerns itself with applications to real data. To the extent that there have been constructive results, they have tended in most cases to be rather complicated. For an example of somewhat more practically-oriented material, Chernozhukov, Chetverikov and Kato [30] developed 1 − α confidence bands of the form over a compact subset of R d . The subscriptl is a particular value of l, which is used to denote bandwidth (l replaces the usual letter K here for more streamlined notation as in the original paper). Much like Giné and Nickl [69], these authors cast both KDE's and wavelet estimators into the larger framework of estimatorsf l based on some kernel κ l (note that, unlike [69], they folded the bandwidth into the definition of the kernel). In fact, their framework also encompasses estimators based on nonwavelet projection kernels using other orthonormal bases such as Legendre polynomials. They considered univariate kernels and wavelets, and extended to the multivariate case by using elementwise products. Returning to (17),σ l is an estimate of the standard deviation off l , obtained using sample mean analogues of the relevant expectations [e.g. 57]. Letting L n denote the space of possible bandwidths,ĉ n (α) is an estimate of the 1 − α quantile of .
The authors suggested obtainingĉ n (α) by using the Gaussian multiplier bootstrap: whereas the normal bootstrap takes repeated samples of size n from the empirical distribution of X, this version repeatedly samples n i.i.d. standard normal variables ξ 1 , . . . , ξ n . Subsequently, is calculated, andĉ n (α) is taken to be the 1 − α quantile of this quantity over bootstrap replications. The numbers c n andl are chosen based on a separate application of the Gaussian multiplier bootstrap: the former is a scaled quantile of a different Gaussian multiplier process, and the latter is based on a modified application of the popular Lepskiȋ's method [121]. Chernozhukov, Chetverikov and Kato showed that -under some conditions on the necessary intermediate quantities, L n , and κ l -the bands (17) constructed in this way are asymptotically honest and adaptive over a range of Hölder balls, subject to global upper and lower bounds on the densities and a modified version of the "selfsimilarity" notion mentioned previously. Furthermore, they showed that the worst-case coverage probability of their bands converges to the nominal level at a polynomial rate, asymptotically faster than the logarithmic rate associated with Smirnov-Bickel-Rosenblatt results. These theoretical results hold for KDE's with compactly-supported kernels and estimators using either compact or Battle-Lemarié wavelets, with regularity conditions based on the maximal degree of Hölder smoothness to which one wishes to adapt. In their supplementary material, the authors conducted a small simulation study. Although most of the intermediate quantities used to construct (17) must meet certain conditions (primarily in terms of their behaviour with respect to n), they simply experimented with predetermined numerical values for their simulations.

Frequentist L 2 inference
Robins and van der Vaart [172] investigated the construction of L 2 confidence sets for conventional frequentist wavelet estimators. For a given wavelet basis 6 , let θ(f ) denote the coefficients of the basis expansion for an arbitrary f , and let f be the usual empirical wavelet density estimator. Then their confidence sets are of the form where K = K(n) is a suitably-increasing bandwidth and the termsτ ,B, and R are estimates for variance, bias, and θ (f 0 ) − θ f 2 , respectively. They used sample splitting, assuming that the data used to calculatef is independent from that used for the other terms. These sets were shown to be honest at level 1 − α, and adaptive to the fullest extent allowed by the theory without further restrictions 7 . Robins and van der Vaart mainly concerned themselves with the theoretical properties of such sets in various contexts. In practice, they may be difficult to construct due to the (likely unknown) quantities required for the calculation of the various terms in Equation 19. Bull and Nickl [17] further expanded upon the results of Robins and van der Vaart in the L 2 ([0, 1]) case, showing that honest and adaptive L 2 confidence sets over a wider range of regularity classes and Sobolev ball radii are possible by discretizing the smoothness range and using the "separation" approach from [69]. They constructed such a set in their proofs, somewhat similar in form to (19). Although they acknowledged the possibility of replacing some of the unknown terms in their construction by certain data-driven approximations, they did not consider applications to real data. Lerasle [122] provided a different approach to L 2 confidence balls, the full intricacies of which are omitted here. They used a model selection approach to determine the best approximation space (they dealt more generally with projection estimators on linear subspaces of an L 2 space, but for our purposes it suffices to consider the special case of wavelet estimators where the selection is for the truncation level) and a resampling method to estimate an L 2 norm needed in the radius of the set, thereby avoiding the sample splitting needed by some of the other literature discussed here. They showed that their confidence balls have the same adaptation properties as in [172], and that they are additionally non-asymptotic: they have correct coverage probability for any sample size n, not just in the limit.

Some extensions and Bayesian ideas
Lounici and Nickl [131] defined a wavelet-based deconvolution density estimator analogous to the kernel one described in Section 3, based on a deconvolution kernel using Fourier transforms of the error density and wavelet basis functions. They used concentration inequalities and Rademacher processes to construct a confidence band, the radius of which is a complicated expression depending on the unknown density of the observed noisy data (although they noted that it can be replaced by the deconvolution estimator in practice). Under some conditions on the bandwidth, error density, and smoothness of f , it is possible to control the probability with which the bands cover f 0 over all of R.
For another somewhat unconventional theoretical example, Kerkyacharian, Nickl and Picard [105] developed an estimator for densities on homogenous compact manifolds such as spheres. Their estimator is based on a needlet expansion of the density, where needlets form a basis with multiresolution properties similar to wavelets. They proposed a confidence band of random uniform width, discussed its (limited) adaptivity, and showed that its coverage probability can be controlled by undersmoothing.
Bayesian UQ literature for density estimators of this type is generally scarce, but some Bayesian results for histograms by Castillo and Nickl [25] can be more easily explained with the machinery of Haar wavelets. They used wavelet expansions of roughly the form (15) with k 0 = 0, the sums over j restricted to j = 0, . . . , 2 k − 1 to ensure the Haar system forms a basis of L 2 ([0, 1]), and the sum over k truncated to some upper limit K. This is equivalent to the basis function estimator (7) with 2 K piecewise constant basis functions instead of K. In the latter form, Castillo and Nickl placed a 2 K -dimensional Dirichlet prior on the histogram coefficients b, with K = K n chosen as a deterministic function of n and the assumed Hölder regularity of f 0 (for regularities in the range (1/2, 1]). They proposed credible sets C based on a "multiscale" approach: where the inner product is the standard one on L 2 ([0, 1]),f is the usual empirical wavelet estimator of f 0 , w k is a sequence such that w k / √ k → ∞ as k → ∞, and R n is such that Π (f ∈ C | X) = 1−α. Note that R n can be computed explicitly due to the conjugacy of the Dirichlet prior, since the likelihood depends only on the counts of observations in each "bin". Castillo and Nickl showed that the posterior over densities satisfies a sort of nonparametric Bernstein-von Mises property, and that these sets therefore have asymptotically correct frequentist coverage: P 0 (C f 0 ) → 1 − α as n → ∞. With a further refinement to their definition, their L ∞ -diameters also contract at a nearly-optimal rate in the "big-O in P 0 " sense. Unlike many of the other methods in this section, honesty and adaptivity are not implied here as the authors did not show the asymptotics to be uniform over all f 0 in some class. Although the choice of bandwidth K depends on the regularity of the unknown f 0 , they suggested that one could estimate a suitable bandwidth under a self-similarity assumption as in [69]. The geometry of these sets does not lend itself to visualizable error bounds. Instead, one can simulate from the posterior with MCMC, discard the 5% of function draws with the highest values for the multiscale quantity on the left-hand side of (20), and plot the remaining 95% to get a visual representation of the sets. This is the approach taken in, for example, the simulation study of [167], who considered similar theoretical ideas for Bayesian UQ in the context of white noise and conjectured that they may be applicable to densities.

Adaptive basis expansion methods for log densities
An adaptive basis expansion does not have to be applied to the density itself as in the preceding section. Rather, it can serve as a model of the logarithm of the density, provided normalizing constant c is incorporated: Modelling the logarithm as a sum has a few nice consequences. In particular, it allows f to be viewed as a member of an exponential family with sufficient statistics i B j,K (X i ), which makes it very easy to obtain an MLEf by maximizing i log f (X i ) with respect to b using (21) [e.g . 110]. Additionally, it is no longer necessary to constrain the coefficients.

Logsplines
One of the best-studied methods of this type is logspline density estimation.
Assuming the density is supported on an interval and letting L and U denote its endpoints, let {B j,K : j = 1, . . . , K} be a B-spline basis with knot sequence L < t 1 < . . . < t m < U (recall Section 4.3). Although cubic splines are the most common choice, lower orders are possible; in particular, using splines of "order" 1 (equivalently, degree 0) corresponds to a histogram [197]. It is common to put some constraint on the tail behaviour of the estimate when using cubic splines, especially (but not exclusively) when (L, U ) = R, in which case the MLE logf is typically required to be linear on (L, t 1 ] and [t m , U) [91,110]. If the support is a compact interval, another option is to require logf to be zero at L and U to reduce variance near the endpoints [112]. Stone [197] discussed some asymptotic theory for the maximum likelihood logspline density estimator, assuming the support is a compact interval ([0, 1] w.l.o.g.) and the knots are equally spaced. They showed that, when K increases to ∞ with n,f for all x ∈ [0, 1], where SE f (x) is a standard error estimate involving values of the basis functions and derivatives of c with respect to b (actual expression omitted for brevity), and f is the deterministic logspline density obtained by maximizing the expected (with respect to f 0 ) log-likelihood. In a related technical report, Stone [196] noted that this result can be used to obtain asymptotic confidence intervals for f 0 , provided K increases with respect to n at a suitable rate depending on some underlying differentiability assumptions on f 0 . As in many other cases, K must increase faster than the error-optimizing rate for this, leading to undersmoothing.
A more comprehensive and practical treatment of pointwise inference for logsplines was given by Kooperberg and Stone [112]. They considered more involved knot placement schemes: one that involves stepwise selection, addition, and deletion, ultimately selecting the number of knots to optimize a generalized AIC [see 112, and references therein]; and a free knot placement scheme where knot locations and coefficients are jointly maximized, with the dimensionality again chosen by AIC. In either case, it is possible to estimate a standard error for logf and use it to get approximate Gaussian pointwise intervals for the log density. By exponentiating the endpoints of these, Kooperberg and Stone obtained approximate confidence intervals for f 0 . The only difference between the two knot selection procedures in this regard is the dimensionality of the gradients and Hessians required for the standard error estimate: the free knot procedure requires more components, since it is necessary to include derivatives with respect to knot locations. Additionally, for the stepwise procedure (in which the knots are considered fixed), the authors considered confidence intervals obtained via the bootstrap: either using percentile intervals, or plugging a bootstrap estimate of the standard error into the Gaussian interval approximation. The final UQ option they considered was a fully Bayesian approach, in which they put a hierarchical prior on K, knot placement (conditional on K), and coefficients b (conditional on knot placement and K). They simply took simulation quantiles from a reversible-jump MCMC procedure as pointwise credible intervals. In their simulation study, Kooperberg and Stone found that, for non-bootstrap methods, intervals based on the free knot procedure had higher coverage than those based on the stepwise procedure, but all non-bootstrap frequentist approaches consistently undercovered. Bootstrap methods based on the stepwise procedure were much better, although the percentile bootstrap tended to overcover (i.e. the intervals were perhaps too wide). Using a bootstrap standard error estimate with the stepwise procedure therefore appeared to be the best option, especially due to computational savings since fewer resamples were required than for the percentile bootstrap. They reserved analysis of the Bayesian approach for a real dataset, where they found that the credible intervals were much narrower than the "bootstrap standard error" confidence intervals, suggesting that the Bayesian approach may undercover. In a different publication [111], Kooperberg and Stone expanded somewhat on these results. There they found that the non-bootstrap frequentist intervals could achieve appropriate coverage on average when their widths were modified by some uniform scaling factor. Factors between 1.34 and 1.55 sufficed in their simulations depending on the specifics of the standard error calculations, but it was not clear how well these would generalize. They also found once again that the Bayesian intervals appeared too small when applied to practical data, even with a larger prior covariance on the coefficients. Hansen and Kooperberg [91, in rejoinder to discussions] noted their challenges with UQ in Bayesian logspline estimation: they found it difficult to select priors that led to good point estimates and sensible credible intervals. More broadly, some authors have expressed skepticism about the usefulness of UQ for logspline density estimation, stating their view that pointwise confidence intervals do not generally provide useful shape information [110,140].

General orthonormal bases
A few theoretical Bayesian papers discussed in Section 4 also provided analogous results for log density basis methods. Castillo and Nickl [25] modelled log densities with wavelets modified to form a basis of L 2 ([0, 1]) instead of L 2 (R). The coefficients were given independent and identical priors -either Gaussian, Laplace, or something heavier-tailed -with a scale parameter depending on the Hölder regularity of log f 0 (assumed to be > 1). Similarly to their histogram approach described in Section 4.4, the authors used a deterministic bandwidth choice and showed that multiscale credible sets of the form (20) have correct asymptotic frequentist coverage, with near-optimal diameter contraction possible with further refinements. The same comments about practicality made in Section 4.4 apply here. In a similar vein, Rousseau and Szabó [179] considered density estimators (supported on [0, 1]) of the form (21) with an orthonormal basis of L 2 ([0, 1]) such that B 1 ≡ 1, with the subscript K removed since they did not consider basis functions changing with K. Among other technical conditions omitted here for brevity, they assumed log f 0 has (up to a normalizing factor) an infinite series representation in terms of this basis; equivalently, that the true density is a member of an infinite-dimensional exponential family. With a suitable prior on b | K (a normal distribution with independent components is one example satisfying their conditions), the authors showed that (12) holds with Hellinger balls for both empirical and hierarchical approaches to K, just as it does for the histogram model. As in that case, honesty and near-adaptivity (up to a logarithmic factor) results hold over functions in a Sobolev ball of regularity > 1/2 satisfying their general polished tail condition. Unfortunately, their results remain difficult to put into practice due to the existence of the "blow-up factor" in the diameter of the sets.

Roughness penalty methods
Some of the frequentist estimators considered in Sections 4 -5 were MLE's. In the i.i.d. case, one choosesf to maximize n i=1 log f (X i ) over all f in some predetermined class of possible estimators -generally those that can be expressed in the form of (7) or (21) -so that obtaining the estimate is simply a matter of optimizing the coefficients. In some cases it is advantageous to impose a further restriction onf to reduce variance or otherwise enforce some desirable "baseline" shape properties. In this case, instead choosef to maximize over the estimator class, where the functional J is some roughness penalty. This term forcesf or logf (depending on the context) to more closely resemble a function in the null space of J to an extent controlled by the smoothing parameter λ. A common choice for J is the integrated square of some linear differential operator: for instance, if J : f → D 3 log f 2 , then as λ → ∞, logf is forced towards a quadratic shape, and thereforef towards a Gaussian [193]. For brevity, this case may be described as "penalizing [the size of] the third derivative" [166] on the log scale. As indicated above, roughness penalties most commonly appear in the context of basis expansion methods, particularly spline fitting. When using splines with equally-spaced knots that do not repeat at the endpoints [47], an integrated squared k th -order derivative penalty can be approximated by the sum of squared k th -order differences between the coefficients. This simpler penalty gives rise to so-called P-splines, devised by Eilers and Marx [46]. In any case, such penalties are equivalent to quadratic forms in the basis function coefficients -for instance, the associated matrix for the aforementioned third derivative penalty consists of inner products between the third derivatives of the basis functions.
A Bayesian approach to roughness penalties is quite natural: it comes from viewing (22) as a log-posterior, with the first and second terms respectively corresponding to likelihood and prior. In this respect, the Bayesian methods of the previous two sections technically fit into this framework, but the focus in this section is on literature with a stronger emphasis on specific shape and smoothness restrictions imposed by the prior or penalty. The benefits of expressing penalties as quadratic forms as described above is now apparent: such a penalty is equivalent to an improper Gaussian prior on the spline coefficients (e.g. the P-spline penalty corresponds to a random walk), with λ commonly given a Gamma hyperprior [e.g. 118]. Note that this type of prior is only suitable when modelling the log-density with basis functions -when using a basis expansion for the density itself, care must be taken to ensure that it is nonnegative and integrates to one. Some examples of this approach are given in Section 6.2.

Penalty methods for log-scale basis expansions
Although roughness penalty density estimators had already been developed by Good and Gaskins [72], Silverman [193] appears to have provided some of the earliest results for the approximate distributions of such estimators. Letting g = log f and taking J(g) (in a slight abuse of notation) to be the integrated square of some m th -order linear differential operator on g, they considered the estimatorĝ := logf which minimized (22) over all g such that 1. g has piecewise differentiable (m − 1) th derivatives, 2. J(g) < ∞, and 3. e g < ∞.
Silverman showed that, for bounded f 0 on a bounded univariate domain,ĝ is asymptotically normal under suitable conditions on the higher-order derivatives of log f 0 and the rate at which λ → 0 as a function of n and m. In principle this result could lead to some type of pointwise confidence intervals, but Silverman did not pursue this further. The mean and covariance functions for the limiting Gaussian process depend on eigenvalues of an inner product space of estimators, and it is not clear how to approximate these in practice. O'Sullivan [155] expanded further on Silverman's original ideas for univariate densities on compact intervals, and justified approximatingĝ by cubic B-splines with knots at order statistics of X. They proposed to calculate λ by approximations to either a cross-validation score or an AIC-type quantity, and penalized the second derivatives of the log-densities. For uncertainty quantification, O'Sullivan adapted an idea from the non-parametric regression setting [210]: treating (22) as a log-posterior for the coefficients b in order to obtain "approximate Bayesian pointwise intervals". In the density case, O'Sullivan took a second-order Taylor series approximation of the unpenalized likelihood component log f (X i ). This lead to an approximate Gaussian log-posterior, from which they derived pointwise intervals on the log scale of the form where B(t) is a vector of basis function evaluations,Ĥ is the Hessian (with respect to b) of the unpenalized likelihood atb, and Ω is the matrix of inner products associated with the roughness penalty. Presumably, confidence intervals for f 0 could be obtained by exponentiating the above expression. O'Sullivan did not comment on the performance of these intervals in their simulation study, but noted that they were found to have good coverage properties in the nonparametric regression setting by Wahba [210]. There are other formulations besides the Silverman approach for density estimation with roughness penalties. One such Bayesian approach came from Lambert and Eilers [117], who essentially used logistic regression to produce a smoothed estimate of a histogram. Suppose the density is supported on a bounded interval, which is partitioned into J bins. Let u j and m j respectively denote the center of, and number of observations in, the j th bin I j . Then Lam-bert and Eilers proposed the model where the B k 's are B-splines with equally-spaced knots, b K = − K−1 k=1 b k for identifiability, Λ is a matrix of finite difference coefficients encoding a P-spline penalty, and τ is a precision parameter with a gamma hyperprior. For x ∈ I j , one can take f (x) = π j / (I j ) as a density estimate, where denotes the length of the interval. This penalized spline structure, combined with a high number of reasonably narrow bins, ensures the appearance of smooth estimates. Lambert and Eilers proposed this framework as a flexible way to handle grouped data by dividing the support into a smaller number of "wide bins" and replacing (24) with a multinomial model for wide bin counts, the probabilities for which are sums of the corresponding fine-grid π-values. Using a modified Langevin-Hastings algorithm to generate posterior samples, Lambert and Eilers applied this model to simulated and real data, using a moderately-sized cubic spline basis (K = 20). Unsurprisingly, their pointwise credible intervals (obtained from MCMC draws) exhibited higher variance when using larger "wide bins". In an earlier technical report, the same authors considered extensions of this model to multivariate densities by simply using products of B-spline bases, possibly allowing different dimensionalities and roughness penalties in each dimension [116].

Penalty methods for direct basis expansions
Roughness penalties can also be applied when modelling the density itself, rather than the log density, with basis functions. Komárek, Lesaffre and Hilton [109] considered such a formulation to estimate the error density in accelerated failure time models. Rather than splines, they used Gaussian densities at fixed locations, which they noted to be the limiting case for B-splines as their degree tends to infinity. The number of basis functions in their model is determined by the desired distance between their means (which serve the same purpose as equally-spaced knots for splines), as is their standard deviation. To ensure their estimates were valid densities, the authors used a softmax transformation to obtain the coefficients b: For identifiability, it is necessary to fix, say, a K = 0; a few other constraints on a are also necessary to ensure identifiability of other parameters in the failure time model. The roughness penalty, based on second-or third-order finite differences, is imposed directly on a. Estimation and inference follow from similar ideas as in O'Sullivan [155]: Komárek, Lesaffre and Hilton took a penalized maximum-likelihood estimate choosing the smoothing parameter with an approximate cross-validation score, and used a second-order Taylor expansion to obtain approximate pointwise "posterior" intervals for the density. They noted that in a simulation study (which they did not show), this method of constructing pointwise intervals yielded better coverage results than asymptotic methods.
Komárek and Lesaffre [108] used a Bayesian version of this construction to model the errors and random effects in an accelerated failure time model with intervalcensored data. As one might expect, the "logistic-scale" coefficients a in (26) were given (aside from a single identifiability constraint) a Gaussian prior with a (third-order) finite difference covariance structure, the scale of which is controlled by a smoothing parameter with a diffuse Gamma prior. Specifying the model in this way leads to related closed forms for estimated survival functions and densities of onset and event times. These functions can be simulated in an MCMC run, leading to pointwise credible intervals and means corresponding to posterior predictive functions. The simulation study conducted by Komárek and Lesaffre [108] showed that such credible intervals did a good job of capturing the true densities of event and onset times, although their smoothness varied with different combinations of true random effect and error densities. Sharef et al. [189] provided an even more flexible Bayesian approach of this type to estimate the frailty density in a proportional hazards frailty model. They used a mixture of normalized B-splines and an optional parametric term, constrained to ensure the density has mean one. The authors considered the use of fixed splines, as well as a reversible-jump MCMC procedure allowing the number and location of knots (and therefore, of basis functions) to vary adaptively. For the latter, they put some truncated discrete prior on the number of knots, with their locations given a discrete uniform prior over a larger set of "candidate knots". Conditioned on dimensionality, they expressed the coefficients for the spline part of the model as in (26). They considered multiple choices for a smoothness-imposing prior on a | K, listed below.
1. Simply taking the components of a to be i.i.d. Gaussians. The authors used this prior with adaptive knot selection, since the latter procedure controls smoothness automatically. 2. Taking a to be Gaussian with a covariance structure corresponding to second-order finite differences. The authors noted that this is only guaranteed to enforce smoothness for equally-spaced (fixed) knots. 3. Directly penalizing the second derivative of the spline mixture. This amounts to using a log-prior that is a quadratic form in exp (a) (with an associated matrix of inner products between B-spline second derivatives), divided by ( k e a k ) 2 .
In all cases, the prior for a has a scale parameter with an inverse-Gamma prior to control smoothing. The authors applied their approach to both simulated and real data, quantifying uncertainty with pointwise credible intervals from MCMC quantiles. Their simulation study showed that the adaptive knot se-lection approach without parametric component effectively captured the true frailty densities, although it required a sufficient quantity of data to do so (in particular, too few data clusters lead to wide pointwise intervals that did not adequately capture true shape information). On a real dataset with a modest number of clusters, they compared the fixed-knot version of their model (with second derivative penalty) to the adaptive knot procedure with different prior choices for the parametric component weight and number of knots. They found that the adaptive version with parametric components encouraged more smoothness in the posterior mean density and its credible intervals, to an extent determined by the choices of priors. However, the fixed-knot version with second derivative penalty performed best in terms of a modified Deviance Information Criterion. This section concludes with a rather novel frequentist approach from Sardy and Tseng [184] which is better-suited to densities that may not be smooth in the sense of piecewise differentiability. They used estimators which are either piecewise linear between the order statistics of X, or piecewise constant between their midpoints (equivalently, splines of degree 1 or 0, respectively), and total variation as their roughness penalty. The penalty is easily computed since their estimators ensure piecewise monotonicity, so that total variation is simply the sum of absolute differences between function values at consecutive order statistics. The authors devised two approaches for selecting the smoothing parameter: a universal one (depending only on sample size, not sample values) engineered to control the behaviour off when the true density is uniform; and one based on a sparsity 1 information criterion, in which λ andf are jointly estimated. They used the latter approach on real datasets with some tied values due to rounding, and obtained 95% pointwise confidence intervals by bootstrapping. The pointwise intervals had reasonable width and shape, and the authors noted that they may allude to the existence of additional modes not captured in the "point estimates" of the densities.

Random measure mixture methods
This section explores uncertainty quantification for the canonical nonparametric Bayesian method of density estimation. In the general case, this method employs (conditional) mixtures of the form where κ is some kernel with parameters θ and φ, and the integral constitutes a mixture over the domain of θ with respect to a random probability distribution G. The bulk of the nonparametric Bayesian literature uses infinite-dimensional discrete mixing distributions: where the weights and locations of the atoms -respectively, w and Z -are random sequences. The centrepiece of this Bayesian mixture model is the infinitedimensional prior on G: a "distribution on distributions". As it pertains to density inference, the locations and weights are usually independent, with the former distributed according to some continuous "base measure" and the latter having a prior from one of two commonly-used broad classes.
1. Normalized random measures with independent increments, or NRMI's [168], in which unnormalized weights are generated from a Poisson point process [100] and subsequently normalized. The measure with unnormalized weights is a completely random measure (CRM). 2. Gibbs-type random measures of type 8 σ ∈ (0, 1), which are equivalent to σ-stable Poisson-Kingman processes [71,124]. Briefly, these arise from NRMI's with intensity measure corresponding to the σ-stable subordinator [65, p. 604] by conditioning the distribution of the weights on their sum T , then mixing over an arbitrary distribution for T [163].
Assuming independence between weights and locations, each approach is a special case of the larger set of Poisson-Kingman models [163,124], which are in turn a type of species sampling model. The normalized generalized gamma (NGG) processes comprise the intersection of these approaches [124], whereas the Pitman-Yor process [164] is an example of the second but not the first, as noted by Favaro and Teh [54]. It is well-known that both the NGG and Pitman-Yor processes admit the Dirichlet process as a limiting case when the parameter σ → 0 [as mentioned in 128, for instance]; many Bayesian density inference papers are specifically devoted to so-called Dirichlet process mixtures. For the interested reader, Chapter 14 of Ghosal and van der Vaart [65] is an excellent exploration of the relationships between such discrete nonparametric priors. For any of these priors on G, it is easily seen that its specification in the form (28) leads to another expression equivalent to (27): Discussion of the theoretical aspects of UQ, such as asymptotic coverage probability, appears scarce in the literature for such estimators. Instead, the focus is on practical generation of uncertainty sets (usually pointwise credible intervals) from posterior samples obtained via MCMC. As one might expect, difficulty arises here due to the nonparametric nature of the quantity of interest -in particular, since the posterior distribution of G (this section hereafter adopts the bracket notation of Gelfand and Smith [61], denoting this posterior by [G | X]) will be infinite-dimensional. The key to most ideas for MCMC sampling of this model is to reformulate it in a hierarchical way: If there are additional hyperparameters φ and ψ, they are typically given their own independent priors, but these are not a main focus here. The latter encodes all parameters of the prior for G: for instance, for a Dirichlet Process prior with Gaussian base measure it may include the concentration parameter, as well as the location and scale of said base. Note that by the almost-sure discreteness of G, there is positive probability that θ i = θ j for some i = j. In this respect, the model imposes a random partitioning or clustering of the data, where each cluster is comprised of all observations with the same θ-value. With this formulation in mind, the known MCMC strategies divide into two main groups -marginal and conditional, depending on the way in which the infinite-dimensional parameter G is handled. The sections below briefly explain, and discuss the UQ implications for, each of these groups.

Marginal sampling methods
Marginal methods rely on integrating G out of the model and being able to obtain approximate samples from [θ | X]. Algorithms for this purpose are readily available when using the Dirichlet Process prior; see Neal [147] for a seminal review of them. In this case, it is easy to obtain a Monte Carlo estimate of the posterior mean density (denoted here as f (· | X), in keeping with the rest of the Bayesian notation in this section), as discussed by Escobar and West [48]. Letting θ * denote the parameter for a hypothetical new observation and θ = (θ 1 , · · · , θ n ), note that f (· | θ) = f (· | θ * ) dΠ (θ * | θ). The integrand is simply the kernel κ, and the distribution [θ * | θ] is readily available. Assuming the base measure of the Dirichlet process is conjugate to the kernel (as in the Gaussian case, for instance), this integral has an analytic closed form. From there, the Monte Carlo estimate of f (· | X) = f (· | θ) dΠ (θ | X) is an average of the above quantity over posterior MCMC draws of θ. By the same token, it is easy in the conjugate case to quantify uncertainty with respect to [f (· | θ) | X]. This is essentially the approach suggested by Wang and Dunson [213] to find pointwise confidence intervals, although they further simplified inference by using a greedy algorithm to find an optimal partition of the data. They noted that the deterministic nature of their algorithm results in an underestimation of uncertainty.
Inference of this nature ether ignores or marginalizes out uncertainty in the weights of G. For marginal samplers, this seems to be fairly standard practice when obtaining posterior density estimates to construct credible sets. Shi et al. [192] used one of Neal's nonconjugate algorithms [147] and obtained posterior density draws by taking the mean of the κ (· | θ i )'s for each MCMC draw of θ [see also their R package 191, and its source code 9 ]. This is equivalent to taking a mixture of cluster-specific kernels, each weighted by the number of observations in its corresponding cluster. Using kernel-and data-specific scaling and a low-information prior, Shi et al. obtained pointwise credible intervals for simulated and real data. Their framework can accommodate censored data, with pointwise uncertainty increasing in the presence of censoring as expected. In their simulation studies, the credible intervals did a good job of capturing the true densities, covering them throughout the domains for all one-dimensional examples and at roughly 98% of domain points for their two-dimensional example. Favaro and Teh [54] and Favaro, Lomeli and Teh [53] devised marginal Gibbs samplers for NRMI's and a specific subclass of σ-stable Poisson-Kingman models, respectively. For both classes, their density draw computations appear 10 to be based on truncation, and marginalization of the distribution of G given θ and the auxiliary variables of the sampler. To elaborate, the density draws are a sum of cluster-specific kernels, each given weight proportional to its (conditional) expected mass; and ten "new" kernels with parameters taken from the prior base measure, each given weight proportional to the expected total mass divided by ten. In the latter paper, the authors showed a pointwise credible interval for the density of a dataset of galaxy velocities, noting that the results were satisfactory and consistent with previous work.
It could be argued that the aforementioned approaches to density inference are inherently "incomplete". Indeed, marginalizing or otherwise deterministically approximating the random weights of G fails to account for some of the uncertainty in (29). If the goal is full uncertainty quantification in this regard, the focus must be on [f (· | G) | X] if possible. As noted by Gelfand and Kottas [60], it holds that This reveals the key to fully meaningful inference with a marginal sampler: Gelfand and Kottas [60] noted that this is easy for the Dirichlet process prior by conjugacy, since [G | θ] is a Dirichlet process with updated parameters. Of course, in practice the infinite sum in (29) must be somehow truncated to obtain actual density draws. Gelfand and Kottas did this by choosing the number of terms to satisfy a predetermined expected error threshold, then replacing the final weight to ensure that the truncated sum integrates to one. Kottas [114] later used this approach in the context of survival analysis, as did Griffin [74] when comparing different approaches to hyperpriors in the Dirichlet process model. Such methodology is not typically used for more general random measure priors, despite relevant distributional results existing in the literature [54,53]. This is likely a computational matter: to directly sample the weights w of a random measure, it is typically necessary to employ a stick-breaking process, in which they are represented as ∼ Beta(1, M) [188]. However, such representations for the general classes of random measure considered here are more recent developments, and the densities of the V i 's are quite complicated [52,50].

Conditional samplers
In contrast to the approaches described above, conditional methods do produce posterior samples of the weights in (28), allowing for "full" inference on functionals such as (29). There are several ways to avoid the problem of having to sample infinitely many weights. Early conditional samplers simply replaced G by a finite approximation, choosing the deterministic truncation level a priori. Discussion of such methods is deferred to Section 7.4; this section focuses on alternatives that better incorporate the infinite-dimensional nature of the model. Perhaps the most common approach to this end is to introduce some auxiliary variables such that the full conditionals of G are finite-dimensional. This ensures that Gibbs samplers target the correct posterior without the need for approximation, aside from the inevitable truncation to calculate the density draws themselves.
The retrospective sampler of Papaspiliopoulos and Roberts [157] was one of the earliest methods of this type for Dirichlet process mixtures. It involves the introduction of allocation variables K = (K 1 , . . . , K n ) such that K i = j iff θ i = Z j , with Z j as in (28). At each step of the chain, first draw only max (K) := max i {K 1 , . . . , K n } of the atoms and weights in G. A certain condition involving auxiliary standard uniform variables and the full conditionals of K is then checked. If the condition is met, perform a Metropolis-Hastings update of K and resume sampling as normal; otherwise, simulate additional components of G one at a time (from their priors, as they represent clusters with no allocated observations) until the condition is met. Note that the number of components is therefore variable across iterations. The authors noted that posterior draws for any linear functional of G are equal in distribution to a deterministic function of prior draws and the first max (K) components from one retrospective sampling iteration. Thus, full posterior inference for f (· | G) is quite straightforward.
Another popular approach which avoids some of the computational burden of the retrospective algorithm is slice sampling, first used in this context by Walker [211]. Briefly, Walker's original idea involved introducing new latent variables U i , i = 1, . . . , n such that, with K i again denoting the allocation variable as above, the joint likelihood for observation i is Integrating out K i and U i reduces this to (29). Furthermore, these variables ensure that all full conditionals in the Gibbs sampler -including those for the necessary components of G -are finite-dimensional. Numerous adaptations of the algorithm exist: for instance, Kalli, Griffin and Walker [104] altered it for greater efficiency. Technical details aside, the main point is to simulate the finitely (but randomly) many components of G needed for the other sampler variables; this can exceed n, in which case some components will correspond to clusters with no data allocated. Favaro and Walker [55] adapted the algorithm of Kalli et al. to the larger class of σ-stable Poisson-Kingman models, using their stick-breaking representation to devise a method for sampling the weights. They applied their method with mixtures of Gaussians with common variance and means from the random measure. Density draws were calculated by first adding together the components obtained from the sampler, then allocating the remaining mass (which the authors noted was usually quite small) to a Gaussian kernel with the sampled posterior variance centered at the prior mean of the base distribution. One can then extract posterior sets from these draws in the usual way. In the same paper discussed in Section 7.1, Favaro and Teh [54] considered a slice sampler for NRMI mixtures; here they sampled the unnormalized masses of the random measure. They showed pointwise credible intervals for the densities of some real datasets that were reasonable in shape and variability. Their source code suggests that they used the same formula for density draws here as they did for their aforementioned marginal samplers, with ten additional "new" kernels as described in the previous section. Although finite-sum approximations are always necessary for density estimation, the approaches described above are noteworthy because the samplers themselves introduce no truncation error; all of their full conditionals are truly finite-dimensional. This is not the case for all papers which use conditional samplers for density inference. For instance, Barrios et al. [7] used a conditional algorithm for NRMI's that does not induce a finite-dimensional full conditional for G. Instead, they used a representation which allowed them to sample the masses in decreasing order. This allowed them to select the number of components sampled based on a relative error criterion, and to calculate density draws from only these (normalized by the sum of the sampled masses). They obtained pointwise credible intervals for a real dataset, demonstrating that the choice of both kernel and NRMI prior can moderately affect the smoothness of said intervals. Argiento, Bianchini and Guglielmi [1] folded random truncation into a modification of the NRMI prior itself by discarding all unnormalized weights smaller than some threshold . The resulting random measures have finite and random dimension, and converge in distribution to the corresponding NRMI's as → 0. The authors recommended fixing some small value for (it is possible to place a prior on it, but they warned that the computational cost may be unreasonable). They derived a conditional sampling algorithm, introduced a new class of NRMI's with a Bessel function in the intensity measure, and applied their method to real and simulated data. Their pointwise credible intervals showed a pleasing degree of smoothness and reasonable faithfulness to the true density of a simulated sample. Griffin [75] proposed an adaptive truncation method based on sequential Monte Carlo. The method involves iteratively resampling and increasing the dimension of the approximate model until a discrepancy measure falls below some threshold. Griffin applied this approach to a variety of nonparametric models, including Dirichlet process mixtures. Although they did not show credible intervals for densities, they did so in the context of time series modelling, indicating that density inference is indeed possible in this framework.

Feller-Dirichlet priors
This Bayesian model from Petrone and Veronese [162] generalizes the Dirichlet process mixture model, although it also serves as an extension of Petrone's ideas [160,161] from Section 4.2. Recall from that section that Petrone put a prior on K and introduced latent variables Y 1 , . . . , Y n from a random distribution F with DP prior such that Feller-Dirichlet prior generalizes this by replacing the latter Beta densities by some kernels g K (·; Y i ), leading to a density model of the form Petrone and Veronese provided several examples beyond the original Bernstein model that are suitable for data on [0, ∞) or R. For instance, take g K (·; θ) to be an inverse Gamma density with parameters (K, Kθ) with a Gamma base measure for the prior on F , or use a N θ, σ 2 /K density for the kernel with a Gaussian base measure. These examples illuminate the idea that the Feller-Dirichlet prior -a "mixture of DP mixtures" -bridges the gap between Dirichlet process mixture models and the Bernstein polynomial models explored previously. For inference, Petrone and Veronese truncated the DP to a large finite number of components and used a Gibbs sampler similar to that of Ishwaran and Zarepour [99] to obtain density estimates and pointwise credible intervals.

Extensions for non-i.i.d. data
Several extensions to the random measure density model also exist for data structures besides an i.i.d. sample X, most of which are based on the Dirichlet process instead of the more general measures. Müller and Rodriguez [145] and the references therein provide an excellent overview of such extensions; this section details some examples for which uncertainty quantification has been done in literature. In broad terms, these examples all involve inference for a family of densities {f (· | G t ) : t ∈ T }, where the random measures are indexed by some set T and share some form of dependence. In many cases, this will mean modelling the density for a "response variable" X with associated covariate t, effectively building yet another bridge between density estimation and nonparametric regression.
The dependent Dirichlet process (DDP) first introduced by MacEachern [134] is the basis for many useful models. DDP mixtures are similar in construction to (29), except that the weights {w tj } and locations {Z tj } may both vary with t ∈ T . For instance, De Iorio et al. [37] considered a model for survival analysis when there are covariates t i associated with each observation X i . The weights do not vary with t, but the Z tj 's correspond to location-scale pairs with the former component equal to a linear model in t: for example, if t = (u, v) for categorical u and continuous v, then Z tj = m j , A uj , B j v, σ 2 for j ∈ N. Inference proceeds by reformulating the model into the conventional DP mixture framework, replacing the top line of the hierarchy in (30) by where d i is a design vector so that θ i d i = m j + A uj + B j v, σ 2 when θ i = j and t i = (u, v). De Iorio et al. used this so-called linear DDP to analyze the densities of log survival times with various combinations of treatments and other factors. They showed some pointwise credible intervals for survivor and hazard functions, and although they did not do so for densities, it should be no more difficult. However, their inferential approach [as described in 38] is the same as that suggested by Escobar and West [48], where the weights are marginalized so that inference is based on The above formulation is somewhat similar to the density regression model considered by Dunson, Pillai and Park [43] for modelling the density of a continuous response variable. The model assumes a set of continuous covariates associated to each observation and a structure similar to (34), except that the random measure governing the θ-value for an observation now depends on the corresponding covariate vector t: it is a finite mixture of n i.i.d. Dirichlet processes, with the i th weight based on the distance between t and t i . Dunson et al. used a marginal sampler, so that posterior inference for the predictive density of a "new" observation (given some covariate vector) was once again based only on the finite-dimensional parameters. Draws for these densities have closed forms due to conjugacy: they are mixtures of cluster-specific kernels and one using posterior draws of the hyperparameters of the base distribution. For both real and simulated data, the authors showed pointwise credible intervals for such densities conditioned on various values for the covariates. In the latter case, the intervals did a good job of capturing the true densities. Dunson and Park [42] subsequently developed the kernel stick-breaking process (KSBP) to model an uncountable collection of probability distributions (with particular focus on the density regression application), generalizing and expanding upon some of the ideas in [43]. In this model, the covariate-dependent distribution for an observation's θ-value is an infinite mixture of "basis" random measures (typically either point masses or draws from a Dirichlet process) with stickbreaking mixture weights. To induce dependence on the covariates, the beta random variables defining the stick-breaking process are weighted by kernels evaluated at the covariate value and centered at random locations with some arbitrary prior distribution. Dunson and Park's MCMC algorithm for pointwise UQ was a hybrid between marginal and conditional: like [43], they marginalized over the basis random measures; but at the t th step of the chain they sampled M t mixture weights, where M t is the highest index of an occupied cluster across the first t iterations. The authors repeated the same simulation study as in [43], showing that the pointwise credible intervals from the KSBP model enveloped the true densities. Norets and Pelenis [153] explored the same simulated data model, showing how changes in the KSBP hyperparameters affected the quality of inference.
The formulation in the preceding paragraph directly models the conditional density of X given t by specifying a covariate-dependent random measure. Alternatively, it is possible to first model the joint distribution of X and t as a mixture of a kernel κ (X, t | θ, ψ) = κ (X | t, θ) κ (t | ψ) with respect to a random distribution on the product space for (θ, ψ), then obtain the desired (conditional) density estimates by standard calculations. This approach is used by Park and Dunson [158], who put a Dirichlet process prior on the product measure; and Wade et al. [209], who gave separate DP priors to G θ and G ψ|θ to allow for greater flexibility. Both used marginal samplers for inference (again, with uncertainty only in terms of the finite-dimensional parameters), with the latter finding pointwise credible intervals to be much narrower and more accurate than those resulting from a DP on the product measure.
Returning to the DDP, note that it is also a suitable starting point when there are multiple sets of observations from different discrete time points, in which case the density is a random process evolving through time. Nieto-Barajas et al. [150] used this approach in such a context, making the atom locations independent of time but introducing dependence into the weights through their stick-breaking construction. They achieved the latter by introducing latent variables Y tj dependent on the stick-breaking proportion V tj , such that V (t+1)j is in turn dependent on Y tj and the usual Dirichlet process is recovered by marginalizing out the Y 's. They applied this construction in a mixed-effects model for protein activation over time, using a partially marginalized algorithm which exploited conjugacy to sample only atoms corresponding to clusters containing data. Müller and Rodriguez [145] showed densities with pointwise credible intervals from the same application, presumably using the same algorithm. Gutiérrez, Mena and Ruggiero [82] used a different approach to introduce dependence in the stick-breaking process: with random probability p having a Beta prior, they sampled V (t+1)j from its usual distribution, and set it equal to V tj otherwise. They used slice sampling for inference, but did not specify if their density draws incorporated any components beyond those sampled (recall that this was the case for the Favaro-authored papers in Section 7.2). Their simulation study showed that their method was much more effective than one based on spline regression at capturing the true shape of their density, but their pointwise credible intervals did a much better job at enveloping the true density at later time points than at earlier ones.
Finally, there may be multiple samples X 1 , · · · , X m for which it makes sense to "share information", assigning mutually dependent densities to each sample. The hierarchical methods discussed in Müller and Rodriguez [145] and its refer-ences are perhaps the most natural ways of doing this, but there does not appear to be existing literature which specifically conducts UQ with these methods. Griffin, Kolossiatis and Steel [76] developed an interesting model: starting with p underlying i.i.d. CRM's, the mixing distribution for each density is the normalized sum of some sample-specific subset of the underlying measures. Griffin et al. called this the correlated NRMI model, and implemented it with a combination of slice sampling and a split-merge step (in which clusters are moved between the underlying measures to address posterior multimodality). Although the main purpose of their model was assessing differences between distributions, they did show pointwise intervals for survival functions fitted from intervalcensored data; as always, it seems reasonable to assume that density inference is possible by similar means.

Finite mixtures
As previously mentioned, one way around the difficulties of infinite-dimensional models is to simply truncate the sum in (29) at some level N . This case leads to a vector of weights w = (w 1 , . . . , w N ) on the (N − 1)-dimensional probability simplex. This was the approach taken by the early conditional samplers of Ishwaran and Zarepour [99] and Ishwaran and James [98], who considered generalized Dirichlet priors on w to approximate random measures with stickbreaking representations (namely, those for which the stick-breaking variables V j in (32) have beta distributions). For instance, to approximate a Dirichlet process mixture with concentration parameter α, they would either give w a symmetric Dirichlet prior with parameters α/N ; or truncate its stick-breaking representation, setting V N = 1 to ensure the N weights summed to one. They gave asymptotic justifications (as N grows large) for both options. With the conditional samplers devised in these papers, approximate posterior inference is obviously possible. Of course, extensions to the types of data structures considered in Section 7.3 are also possible. For instance, Chung and Dunson [32] modelled covariate-dependent densities using truncated random measures with stick-breaking weights derived from a probit model. Their structure for the weights incorporated a variable selection component, resulting in a rather flexible density regression framework. Finucane et al. [56] conducted a meta-analysis of child nutrition data by modelling the study-specific densities of interest with finite mixtures of normals, using probit model stick-breaking weights which incorporated individual time and location effects. Norets and Pelenis [152] modelled the joint distribution of a response variable and covariates with a finite Gaussian mixture, obtaining the conditional response densities with standard calculations. Their model allows for any number of discrete variables by mapping them to continuous latent variables. The pointwise credible intervals obtained in these papers showed reasonably good uncertainty quantification, although the choice of a fixed finite number of components naturally reduces their flexibility somewhat.
The focus thus far in this section has been overwhelmingly Bayesian. Frequentist approaches to mixture models do exist in the literature, but it is rare to see them consider density UQ as it is defined here. Roeder [176] provided one rather novel exception for mixture-of-Gaussians estimators with finitely supported mixing distributions. Given some bandwidth h for the Gaussian kernel κ, the mixing distributionĜ h is uniquely chosen to optimize an asymptotically normal statistic based on sample spacings 11 . This statistic is nonincreasing in h, and so it is possible to find a range of h-values such that the statistic falls within the (α/2)-and (1 − α/2)-quantiles of the standard Normal distribution. The confidence set defined by Roeder is then the set of all estimators f · |Ĝ h as h varies through this range. This set is comprised entirely of finite mixtures (although the number of components for each is random), is easy to visualize, and provides correct coverage if the true density is assumed to be a mixture of Gaussians.
In addition to the KDE connection, it is easy to see parallels between finite mixtures and some of the basis expansion methods discussed earlier. Indeed, even if one were to put a prior on N (e.g. Norets and Pati [151], whose inference involved modelling conditional densities using covariate-dependent multinomial logit mixture weights), the model would be similar in principle to the fully Bayesian approaches in Section 4. Thus, beyond what has already been explored, there is little else to discuss here. The interested reader may refer to Gelman et al. [62] for some more details on working with models of this type.

Other methods
This section explores uncertainty quantification for an assortment of density estimation methods for which literature is too scarce to warrant separate sections.

Nearest neighbour methods
This classical density estimator is closely related to the KDE and is applicable to any density on R d . Let k = k(n) be an integer increasing with sample size n, let · be some norm on R d (typically Euclidean, but some other norms also satisfy the required conditions for some of the results discussed here), and for an arbitrary point x ∈ R d let R(k, x) be the · -distance between x and the k thclosest value in X. Then for a kernel K, the nearest neighbour density estimator as defined by Mack [135] iŝ Unless otherwise stated, all results in this section require K to equal 0 outside of the unit · -ball. A particularly common case arises from the uniform kernel: where V (k, x) is the volume of the · -ball centered at x with radius R(k, x). Nearly all of the literature on NN density inference is theoretical, and closely mirrors the results discussed previously for KDE's 12 . For instance, Theorem 9.3.7 in Csörgő [34] is essentially a Smirnov-Bickel-Rosenblatt result for univariate NN estimators. Unlike the KDE and wavelet theorems, their formulation would lead to confidence bands over a certain random interval defined by order statistics of the sample, but they noted that this interval converges to the full support as n → ∞. Moore and Yackel [144] provided what appear to be the first asymptotic normality results for (35), showing that the limiting distribution could be made to center at f 0 under some conditions on its properties and the asymptotic behaviour of k. They also noted that the asymptotic variance of the NN estimator is smaller than that of the KDE at points x where f 0 (x) is small, claiming that this makes it more efficient for estimating density tails. Mack and Rosenblatt [136] expanded on this, noting that the NN estimator can be much more biased than the KDE in the tails, with the opposite relations holding for large values of f 0 (x). These observations, combined with the non-monotonic dependence of asymptotic bias on k, make error analysis here somewhat less straightforward than it is for the KDE.
Mack [135] derived slightly different asymptotic normality results than Moore and Yackel, centering at E f instead of f 0 . This allows for less restrictive conditions: for instance, theirs are the only results here which do not require K to vanish outside of the unit ball. Pointwise Gaussian limits centered on f 0 with some variant of usual conditions (among others, as needed) are also available for univariate NN density estimates with non-i.i.d. data structures, such as randomly right-censored data [142], observations from an α-mixing sequence [125, only for the uniform kernel], or randomly left-truncated samples [216, who actually implemented confidence intervals in practice using a plug-in estimator of the asymptotic variance].
A technical report by Rodríguez [173] [see also 174] made an interesting connection between NN estimators of the form (36) and KDE's: the former allocates the fixed mass k/n to the random volume V (k, x), while the latter can be rewritten to show that it essentially does the opposite, spreading a random mass over a fixed volume. This observation motivated Rodríguez to view the two estimators as endpoints on a "continuum" of estimators of the form where ω is a distribution on [0, 1] with mean c, and the (possibly random) number μ and function h meet certain technical conditions. Rodríguez showed how KDE's and uniform NN estimators arise as special cases and described every-thing in-between as "double smoothing": in the numerator (resp. denominator), the mass (resp. volume) given by F n (resp. h d ) is smoothed with K (resp. ω). Rodríguez proved asymptotic normality for certain subclasses of these estimators in this report, as did Biau et al. [8] for another variant. Both cases are generalized NN estimators, and the results hold even using the "optimal" k(n) with given asymptotic biases. It is possible to eliminate the bias and center at f 0 with a suboptimal k n , although the conditions for this are less restrictive here than in [144] at the expense of stricter smoothness assumptions on f 0 .

Logistic Gaussian process estimators
This approach is usually Bayesian and involves density estimates of the form where the latent function g is given a zero-mean Gaussian process (GP) prior with hyperparameters γ governing the covariance kernel. The "logistic" transformation of g ensures that the estimates are valid densities: nonnegative and integrating to one. Riihimäki and Vehtari [170] explored some approaches for approximate Bayesian inference with this model with 1-or 2-dimensional densities. Technically, they assumed that g would be the sum of a Gaussian process and a parametric polynomial component, but they integrated out the coefficients for the latter so that the basis function values and hyperparameters could simply fold into the mean and variance of the GP. Similarly to Lambert and Eilers [117] (see Section 6.1), Riihimäki and Vehtari discretized the model, replacing the actual data with observation counts in a fine, equally-spaced partition of the domain. Assuming that the partition consists of J subregions and letting m and g respectively denote the vectors of observation counts and latent function values within each subregion, the likelihood P (m | g) is essentially the same as (24 -25) [117], except the B-spline values in (25) are replaced by the latent function values g j for j = 1, . . . , J. In turn, the prior Π (g | γ) for the latent values is simply the multivariate normal distribution induced by evaluating the GP prior at the center points of the subregions. The main object of inference is then the conditional posterior of g given the observation counts and hyperparameters (and, technically, conditioned on the chosen partition as well), This posterior is not analytically tractable, so approximate methods must be used to employ this model in practice. As an alternative to MCMC, Riihimäki and Vehtari proposed the use of a Laplace approximation: a Gaussian distribution for g based on a second-order Taylor approximation to the log of (38). In order to quantify uncertainty in f , samples must be drawn from this approximate Gaussian posterior and transformed via (37). To this end, the authors showed that importance sampling can improve performance, and rejection sampling can also be incorporated to ensure appropriate tail behaviour if necessary.
The model is completed by putting a prior on γ, but Riihimäki and Vehtari also considered the possibility of ignoring the uncertainty in these hyperparameters: marginalizing the approximate Laplace posterior over g, maximizing it with respect to γ, and simply plugging in the resulting approximate MAP point estimate for γ. They found that their method performed (in terms of mean log predictive density, evaluated with cross-validation for real data or w.r.t. the true distribution for simulated data) comparably with MCMC targeting the true joint posterior of (g, γ), as well as the Dirichlet process mixture models of Griffin [74]. The pointwise credible intervals for real and simulated data provided reasonable practical visualization for uncertainty quantification. However, one of their simulations showed that densities with varying amounts of smoothness throughout the domain can be challenging, as the MAP parameters needed to capture more narrow features can result in excessive roughness elsewhere.
The authors also showed how their method can extend to density regression, modelling densities conditional on covariate values.

Pólya trees
The Pólya tree (PT) prior is a Bayesian nonparametric method for constructing a random probability measure, discussed in [119] and the first few references therein. The construction is based on a recursive partitioning of the domain and is most easily explained when the domain is an interval in R. At the m th level of partitioning, the interval is split into 2 m subintervals. It is common to set the partition boundaries to the dyadic quantiles of some base measure G 0 (i.e. G −1 0 (j/2 m ) , j = 0, . . . , 2 m ), thus "centering" the random measures drawn from the PT prior around this base [119,145] to B , where B is the subinterval associated to binary number = 1 . . . m . Iterating this process over all m ∈ N results in a draw from the Pólya tree prior (so named because the recursive partitioning defines a tree with nodes corresponding to subsets), defined by the sequence of partitions and Beta parameters. A special case for the latter gives rise to the Dirichlet process, but they can also be tailored to almost surely produce absolutely continuous distributions [e.g. 119,146], which is obviously more appealing for density inference.
It is possible to extend this construction to d-dimensional domains, for instance by using the construction of Hanson [92]. At the m th level, the domain is partitioned into 2 md subsets, indexed by base-2 d numbers = 1 . . . m ∈ {0, . . . , 2 d − 1} m [145]. These subsets are formed by taking Cartesian products of the subintervals used in the univariate construction, then applying a suitable affine transformation. Probabilities are assigned to each subset in an analogous way to the univariate case, except that for a fixed 1 . . . m−1 , the variables Y 1 ... m−1 e , e = 0, . . . , 2 d − 1 have a 2 d -dimensional Dirichlet distribution. Literature on multivariate PT's rarely entails any density UQ, so the remainder of this section focuses primarily on the univariate case.
Castillo [24] provided theoretical results for posterior inference with such priors on the unit interval, with partition boundaries at the dyadic rationals. In particular, they showed that, when f 0 is Hölder with regularity β ∈ (0, 1] and bounded away from zero, a type of Bernstein-von Mises result holds (i.e. the posterior weakly [25] converges in P 0 -probability to a Gaussian process) when the Beta parameters of the prior grow suitably fast with m. The posterior must be centered at some estimator for f 0 for this to hold: either the posterior mean or, when the Beta parameters grow suitably slowly depending on β (note that this corresponds to "undersmoothing" of the posterior), a "canonical" estimate based on the Haar wavelet expansion of the empirical measure. Castillo noted that this result can lead to similar results to some of those discussed earlier for wavelet estimators [25]: namely, multiscale credible bands similar to (20) with Pólya trees should have correct frequentist coverage.
Practical implementations of Pólya tree models involve truncating the partitioning at some finite "terminal" level, rather than continuing it infinitely. By a well-known conjugacy result [e.g. 146,92], the posterior for the PT prior is simply an updated PT, with the same partition and updated Beta (or Dirichlet, in the multivariate case) parameters for the Y 's. With the aforementioned truncation, density samples from this posterior can be obtained by allocating the mass proportion within each terminal subset either uniformly [65, chapter 3] or according to the density of the base measure [as in 92]. The resulting densities will be discontinuous at the partition boundaries [146,65] and are therefore perhaps not as "well-behaved" as one may prefer. In a survival model with longitudinal data and a PT prior on event times, Zhang, Müller and Do [218] addressed this issue by applying kernel smoothing to the actual posterior PT draws to obtain event time densities. There are other ways around this which change the structure of the model itself: mixing the prior over the parameters of the base distribution [92], adding random "jitter" to the partition boundaries [156], or mixing a kernel with respect to a PT measure [12]. Surprisingly, literature employing such methods does not tend to address UQ for densities. On the other hand, Nieto-Barajas and Müller [149] did so for their rubbery Pólya tree (rPT) prior, introducing dependence amongst the Y 1 ... m−1 0 's at level m (i.e. all "left nodes" in the tree at a given depth) through latent variables. The construction resembles that used to introduce dependence for time-series DDP's by Nieto-Barajas et al. [150] as discussed in Section 7.3, and recovers the usual PT prior by marginalizing over the latent variables. Conditional conjugacy allows for an easy Gibbs sampler, which Nieto-Barajas and Müller implemented by truncating the partitioning process at some depth (using a depth between 5-8 in all experiments) and allocating the mass uniformly within each of the terminal subsets. Pointwise credible intervals in their simulation study fully contained the true densities, but were not smooth. Indeed, the rPT only "smooths" the estimates in the sense of reducing jump sizes between masses in neighbouring partition sets. Its dependence structure addresses variability, not continuity. Nieto-Barajas and Müller suggested mixing (either over a kernel w.r.t. a rPT prior, or over the parameters of the rPT's base distribution) when more smoothness is desired, but did not attempt uncertainty quantification with such models.
A different extension of the model came from Hanson, Zhou and Inácio De Carvalho [93], when each X i is observed at a spatial location t i . Their object of interest was the predictive density (i.e. marginalizing over G) for a new X, and they proposed to modify the usual formula by weighting the contribution of each observation by some distance between their locations and that of the new X. Uncertainty was with respect to the (hyper)parameters of the PT prior and the distance function and was quantified with MCMC output. Their pointwise credible intervals appeared quite smooth; it is unclear whether this is the result of an actual procedure or merely the plotting functions used.

Multiscale estimators
This rather novel Bayesian approach from Canale and Dunson [23] uses multiscale mixtures of Bernstein polynomials as estimates: The weights π sh are constructed in terms of a stochastic process defined on an infinite binary tree. For the h th node at tree depth s (h = 1, . . . , 2 s ), let S sh be the probability of stopping at that node and R sh be the probability (conditional on not stopping) of moving to the right daughter of node (s, h). These probabilities define a sort of "random climb" on the branches of the tree, which at each step either stops with some probability or else moves on to the next depth, randomly choosing either the left or right path. The weight π sh is then the probability of the process taking the path to node (s, h) (starting from the root of the tree) and then stopping there. For instance, π 12 = (1 − S 00 ) R 00 S 12 , and π 23 = (1 − S 00 ) R 00 (1 − S 12 ) (1 − R 12 ) S 23 Beta (b, b), where a and b can be fixed or given their own hyperpriors.
Canale and Dunson noted that this model induces an interesting multiscale clustering on the data: two data points may be assigned to the same tree node at some depth s, meaning they are sufficiently similar to be clustered together at this scale; but may occupy different nodes at a depth r > s, so that they are separated at a higher "resolution". For practical inference, they truncated or "pruned" to a maximum tree depth S by simply setting the stopping probabilities S Sh = 1 for all h. Using a slice sampling approach, they devised an MCMC algorithm for inference, which alternates between two steps: assigning each observation to a tree node (equivalently, to a "multiscale cluster") given the π sh 's, and updating the S sh 's and R sh 's given these allocations. Posterior density samples can then be obtained by plugging these probabilities into (39), truncated accordingly. Although Canale and Dunson did not show any credible intervals for densities in their paper, the corresponding R package for this type of model implements them readily [22].

Shape-restricted methods
If an a priori assumption can be made about the shape of the true density f 0 (for instance, that it is monotone or unimodal), one may wish to incorporate this into estimation and inference. A solid body of literature exists on the use of such shape constraints in nonparametric estimation, but only a subset of this literature specifically considers UQ for densities.
Perhaps the best-studied shape constraint is monotonicity, in which f 0 is assumed to be non-decreasing. In the frequentist setting, the so-called Grenander estimator is the canonical choice for estimation of f 0 , and is also the MLE subject to the monotonicity constraint [73]. Letting F n denote the empirical distribution function of X, letF be the least concave majorant of F n : the smallest concave c.d.f. such thatF ≥ F n throughout the entire support (typically assumed w.l.o.g. to be [0, 1], or [0, ∞)) [77]. The Grenander estimatorf is then the left derivative ofF , which turns out to be a step function with jumps at sample values andf (x) = 0 for x ≤ 0 and x > X (n) [165]. Prekasa Rao [165] derived a pointwise limiting distribution for this estimator, showing that with suitable standardization it is asymptotically equivalent to a particular functional of Brownian motion 13 . Groeneboom and Jongbloed [78] leveraged this fact to derive the asymptotic distribution of a likelihood ratio test statistic for f 0 (x) when f 0 has nonzero derivative in a neighbourhood of x ∈ (0, ∞). The limiting distribution is that of a different functional of Brownian motion derived by Banerjee and Wellner [6]. The authors of that paper did not find an analytic form for this distribution, but provided estimates of its quantiles from simulation-based methods. Groeneboom and Jongbloed used these estimated quantiles to obtain pointwise confidence intervals with asymptotically correct coverage by inverting their likelihood ratio test. They also considered pointwise bootstrap intervals based on a boundary-corrected kernel (under-)smoothing of the Grenander estimator. The use of the bootstrap in this way is at least partially justified by an asymptotic normality result for this smoothed Grenander estimator [80] (indeed, there are a few modifications to this method that result in smooth, asymptotically normal estimators; see also [203]). Unfortunately, the bootstrap is unsuitable for inference with the unaltered estimator, due to the inconsistency results shown by Kosorok [113] and Sen, Banerjee and Woodroofe [187] and demonstrated in practice by the latter. However, both papers showed that consistency can be restored with a smoothed bootstrap (i.e. resampling from a modified kernel estimate of f 0 , rather than from the empirical distribution). Kosorok further showed that smoothed bootstrap methods could be used to define an L 1 -ball of functions centered atf with correct asymptotic coverage, based on the known asymptotic normality of the L 1 -error [77]. Recall, however, that such sets are limited in visual interpretability. Uniform confidence bands were briefly considered by Durot, Kulikov and Lopuhaä [44], who derived a Gumbel limiting distribution similar to the Smirnov-Bickel-Rosenblatt results of Section 4.4.1. However, they believed that the technicalities required for datadriven construction of such a band were not worth exploring further. A recent preprint by Deng, Han and Zhang [39] proposed another method to construct pointwise intervals, based on the adaptation of an analogous method for inference in isotonic regression (see their manuscript for details). They suggested that their method, which involves suitable estimates of nuisance parameters in the limiting distribution, could be tailored to adapt to the smoothness of f 0 more readily than the method of Groeneboom and Jongbloed [78], but both methods require simulation-based estimates for quantiles of the complicated limiting distribution.
As an alternative to frequentist inference methods based on the Grenander estimate (or some modification thereof), Bayesian methods are also available. For instance, Martin [138] proposed an empirical prior in which the density is modelled as a finite scale mixture of uniform densities. The mixture weights and uniform density scales are respectively given Dirichlet and Pareto priors, both of which are calibrated so the prior over densities is centered at some predetermined mode. This mode (and the dimensionality of the mixture) can either arise from a sieve MLE (i.e. the MLE over a space whose size depends on n) or the Grenander estimate. Using simulated data and MCMC, Martin compared the pointwise credible intervals from this model to those obtained from a Dirichlet process mixture, and found that the empirical model resulted in higher coverage probability and shorter intervals on average.
The second most common shape constraint explored in the literature is arguably log-concavity, in which log f 0 is assumed to be concave. As in the monotone case, frequentist UQ for log-concave densities typically centers on the MLÊ f . Rufibach [182] showed that the log off is piecewise linear with breaks at sample values, andf supported on the range of X. Balabdaoui, Rufibach and Wellner [5] obtained a limiting distribution for this estimator under some regularity conditions on f 0 . Much like the monotone case, the MLE for log-concave densities converges in distribution to a certain functional of Brownian motion, scaled by nuisance parameters that depend on the value of f 0 and its derivatives. Azadbakhsh, Jankowski and Gao [2] translated these results into practical means of constructing pointwise confidence intervals. They estimated the necessary quantiles of the limiting distribution by simulation, and considered several methods (kernel-based and plug-in) to estimate the f 0 -dependent nuisance parameters. The intervals thus obtained performed reasonably well in a simulation study, although the best overall results came from standard bootstrap percentile intervals. Compared to the bootstrap intervals, the pointwise intervals based on asymptotics generally had a somewhat higher propensity for undercoverage in some parts of the domain, and for overcoverage (i.e. coverage probability exceeding the desired nominal level, leading to wider intervals than necessary) in other parts. Despite these promising empirical results, the authors cautioned that there were no theoretical results justifying bootstrap methods for this purpose.
Mariucci, Ray and Szabó [137] developed a Bayesian model for log-concave densities f : The function w is piecewise linear with m break-points, where m is a predetermined number dependent on sample size. The weights p 1 , . . . , p m can either be given a Dirichlet prior, or a prior based on truncating the stick-breaking representation of the Dirichlet process (32). The support [a, b] can be deterministic (based on n), empirical (a = X (1) , b = X (n) ), or hierarchical (a and b − a given their own priors). Priors on γ 1 ≥ 0, γ 2 ∈ R, and θ 1 , . . . , θ m ∈ [0, b − a] complete the model, and posterior density draws can be obtained from MCMC samples of these parameters using (40). See Mariucci, Ray and Szabó for technical details, as well as motivation for (40). Pointwise credible intervals obtained with MCMC did a good job capturing true densities in their simulation studies, although in some cases they underperformed somewhat around boundaries or modes. The authors also evaluated the coverage probability of the intervals in one example, showing a tendency for undercoverage in some parts of the domain but overall reasonable performance with increasing sample sizes. Similarly to [165] and [5], complicated limiting distributions have been derived for density estimation under different shape constraints. Examples include monotonicity with right-censored data [97], convexity [79], and s-concavity [90]. In principle these limiting distributions could be used to derive practical UQ methods for densities as in the examples described above, but there does not appear to be any literature directly doing so.

Connections to nonparametric regression
Various parts of this paper have suggested that some uncertainty quantification ideas from other nonparametric models could apply for density estimation. Indeed, there exist a great deal of theoretical results showing that many such models are "equivalent" in a sense involving asymptotic convergence of their risks [e.g. 154, 13, and references therein, especially those by Lucien Le Cam]. Brown et al. [14] offered a practical way of leveraging these ideas. They proposed the root-unroot algorithm to estimate a density on, say, [0, 1] via nonparametric regression. The algorithm proceeds as follows.
1. Divide the domain, assumed wlog to be [0, 1], into T equal subintervals. 2. For j = 1, . . . , T , let Y j = Q j + 1/4, where Q j is the count of X i 's in the j th subinterval and the offset of 1/4 gives optimal bias and variance properties. 3. Treat the Y j 's as response variables and use any suitable method to fit the corresponding smooth regression function m. 4. Takef (·) = [ m (·)] 2 / [ m (t)] 2 dt as the density estimate.
Wang [212] used the root-unroot algorithm for Bayesian density inference, using integrated nested Laplace approximations (INLA) for the posterior of the regression model. The details of INLA -first given by Rue, Martino and Chopin [181] -are omitted here, but it suffices to note that it uses Gaussian approximations and numerical integration to approximate the posterior, allowing for inference without MCMC being necessary. In the above algorithm, Wang tookm to be the posterior mean from the INLA model. Letting γ denote the normalizing integral in the denominator, they divided the INLA quantiles of m by γ to obtain approximate pointwise credible intervals for f . Such intervals did an excellent job capturing true density shapes in their simulation studies. More broadly, one may exploit the connections described here to quantify density uncertainty with any number of methods originally devised for nonparametric regression. Examples include confidence bands based on coverage of surrogate functions [63], or on relaxed notions of coverage that still try to minimize the extent to which the band excludes the true function but allow for nice adaptivity properties [20].

Simulation study
Recall Figure 1 from Section 2, which shows select combinations of density estimation and UQ methods for a simulated dataset. Having described many such methods in the preceding sections, a more thorough discussion of the figure is presented here.
The dataset X is a sample of size n = 1000 from the mixture density f 0 = 0.5N 1 2 , 1 49 + 0.5N 5 7 , 1 490 . This is a bimodal, everywhere-positive density with almost all of its mass contained in the interval [0, 1], and its magnitude and curvature are close to zero at the boundaries of this interval. Thus, it "approximately satisfies" the assumptions made by many of the UQ methods discussed here, while having a fairly interesting shape which provides a good test for UQ methods.
The methods applied to X and shown in Figure 1 are as follows.
1. KDE with pointwise bias-corrected confidence intervals as in Calonico, Cattaneo and Farrell [21], and fixed-width bootstrap confidence bands based on the same bias correction [29]. The bandwidth was selected to minimize estimated integrated MSE (instead of pointwise MSE as in the former reference) in order to ensure a smooth estimator.
2. Adaptive basis expansion with Bernstein polynomials as in Petrone [160], with pointwise credible intervals and credible bands based on median absolute deviations [45]. 3. Logspline estimation with stepwise knot selection [110] and exponentiated pointwise Gaussian confidence intervals using bootstrap standard error estimates [111]. 4. A Dirichlet process mixture of Gaussians with a Normal-Inverse Gamma base measure. A marginal MCMC sampler was used (see Section 7.1) but pointwise credible intervals incorporated "full uncertainty" by using posterior draws of the Dirichlet process obtained by conditional conjugacy [60].
All Bayesian methods were based on output from an appropriate MCMC sampler, and the level for all UQ methods was taken to be 1 − α = 0.95. The simulation study was conducted with the R programming language [199], and further details can be found in the supplementary material for this manuscript [141]. As noted in Section 2, the bands are expectedly wider than the pointwise intervals for both estimation methods shown on the top row of Figure 1. Note that the confidence sets for the KDE are not centered at the estimator due to the bias correction, and are in fact closer to the true density. However, they still fail to fully reach the height of the main mode. Certainly no conclusions can be made about the coverage probability of any UQ method when it is applied to only a single dataset, but further simulations (not shown; see [141]) suggested that this deficiency is typical for samples taken from the true density f 0 , even when using pointwise instead of integrated MSE to select bandwidths. In fact, sample sizes in the millions were necessary to attain good coverage probability at the main mode, although the performance was much better at the smaller mode even for n = 1000. To some degree this is to be expected as the coverage error depends on higher-order derivatives of f 0 [21], but it is infeasible to fully predict this error in practice. This leads to an important point to be made about the difference between asymptotic and finite-sample behaviour: although Calonico, Cattaneo and Farrell [21] showed that these confidence intervals have coverage error ultimately decaying at the optimal rate with respect to n, there are no concrete guarantees for any finite sample size when using data-driven methods.
Recall from Section 3.2 that Cheng and Chen [29] provided bootstrap methods for both fixed-and variable-width bias-corrected confidence bands for KDE's. Here the former was used, as the latter involves bootstrapping a quantity which can have a zero denominator when using a compact kernel, as was the case here [141]. In contrast, the credible band used for the Bernstein polynomial estimator has variable width (see (13)). However, the band shown in the top-right plot of Figure 1 extends over the subinterval [0.01, 0.99], as the band taken over all of [0, 1] was far too wide to be graphically meaningful. This is because a sizeable proportion of MCMC draws had absolute deviations near the boundaries that were much larger than the MAD there, so that the quantile ξ α in (13) was very large. In turn, the MAD was comparatively small at the boundaries because, like f 0 itself, most MCMC draws had tail values near zero. These examples demonstrate that variable-width bands may not be an ideal choice unless f 0 is bounded suitably far away from zero.
Interpretation of the bottom row of Figure 1 is straightforward. The pointwise intervals for the logspline estimator are noticeably less smooth than those for the other estimation methods. Recall that the width of the interval (on the log scale) is determined by the pointwise sample variance of bootstrap density estimates [111]; evidently this induces some roughness. The pointwise credible intervals for the DP mixture are quite similar to those for the Bernstein polynomial estimator: both are quite narrow and smooth and encompass f 0 throughout nearly the entire domain.

Conclusion
There is a vast, sprawling body of work on density uncertainty quantification, dating back over half a century and spanning across many different methods for both estimation and inference. Reviewing the literature -from classical approaches like KDE's and histograms, to the spline methods of the late twentieth century, to modern nonparametric methods -one notices that the gap between theoretical and practical ideas seems to have widened over time. KDE's and related methods are extremely well-studied, with a litany of theoretical and practical results for all relevant types of UQ. Turning the focus to the past two decades of developments, one sees that UQ in the literature for random mixtures is entirely practical, with almost no regard for asymptotic properties; conversely, the advanced wavelet-based papers comprising the core of new theoretical developments often include no data studies whatsoever. It seems natural to wonder whether it is possible to "bridge the gap": perhaps introducing greater theoretical justification for some of the most commonly-used practical methods, or facilitating applications of some of the more obscure asymptotic arguments. However, such developments may be hampered by issues intrinsic to the problems at hand, such as the known complexities of asymptotics in nonparametric Bayesian inference [e.g. the review of 178]. The importance of these considerations is certainly a subjective matter, and as modern practitioners turn their focus to larger datasets and more overt "data science" approaches, there is perhaps a case to be made that applications could provide "all the proof we need".
Based on the simulation study described in Section 9, Figure 1 shows finitesample results for a few of the methods discussed throughout this paper. The code for these experiments is available in the supplementary material [141], and there is certainly merit to further comparative analysis beyond that considered here.
Of interest for future work are extensions to frameworks beyond a single i.i.d. sample, particularly hierarchical modelling of multiple related densities. Bayesian nonparametric methods are emerging as a promising approach to such frameworks, and we are eager to explore the improvements which further developments can provide.