Wasserstein Gradients for the Temporal Evolution of Probability Distributions

Many studies have been conducted on flows of probability measures, often in terms of gradient flows. We utilize a generalized notion of derivatives with respect to time to model the instantaneous evolution of empirically observed one-dimensional distributions that vary over time and develop consistent estimates for these derivatives. Employing local Fr\'echet regression and working in local tangent spaces with regard to the Wasserstein metric, we derive the rate of convergence of the proposed estimators. The resulting time dynamics are illustrated with time-varying distribution data that include yearly income distributions and the evolution of mortality over calendar years.


Introduction
There exists a sizeable literature on flows of probability measures, often described in terms of gradient flows (Ambrosio et al., 2008;Santambrogio, 2017). However, the statistical modeling of the instantaneous evolution of observed distributions that are indexed by time has not yet been explored. Figure 1 shows an example of time-indexed densities, which correspond to demographic age-at-death distributions from 1936 to 2010 in the US, for females and males respectively. Motivated by this and similar data, we study temporal flows for one-dimensional probability distributions. Recently, there has been intensive interest in comparing distributions with the Wasserstein distance, both in theory and applications (e.g. Bolstad et al., 2003;Bigot et al., 2017;Galichon, 2017;Cazelles et al., 2018;Bigot et al., 2019), and in visualization (e.g. Delicado and Vieu, 2017). In the one-dimensional case that we consider here, it is well known that the Wasserstein transport can also be expressed in terms of quantile functions (Hoeffding, 1940;Zhang and Müller, 2011;Chowdhury and Chaudhuri, 2016). Our goal is to develop statistical models that reflect instantaneous evolution of such temporal flows of distributions. Starting with the Monge-Kantorovich problem (e.g., Ambrosio, 2003;Villani, 2003Villani, , 2008, given two probability measures p 1 and p 2 , one aims to transport the pile of mass distributed as in p 1 to that as in p 2 while minimizing the transport cost. The transport map attaining the minimum transport cost defines the optimal transport from p 1 to p 2 . Based on such optimal transport maps, basic concepts such as tangent bundles and exponential and logarithmic maps in Riemannian manifolds can be generalized to the space of univariate probability distributions endowed with the Wasserstein distance, which form a quasi-Riemannian manifold (e.g., Ambrosio et al., 2008;Bigot et al., 2017;Zemel and Panaretos, 2019). The log map, defined as the difference between the optimal transport and identity maps, captures the direction and distance of each small element of mass along the order-preserving transport from the starting probability measure to the target measure and can be used to quantify the change between the two probability measures. Hence, we utilize temporal derivatives of log maps, the Wasserstein temporal gradients, to model the instantaneous temporal evolution of distributions. For this purpose, we harness local Fréchet regression (Petersen and Müller, 2019a) to first smooth the observed probability measures over time due to the discrepancy between the true conditional Fréchet mean and the observed distributions and then estimate the Wasserstein temporal gradients by difference quotients based on the local Fréchet regression estimates.
The Wasserstein temporal gradients that we target are introduced in Section 2, with estimation and asymptotic theory in Section 3. In Section 4, we discuss implementation details, followed by a simulation study. Applications are demonstrated in Section 5 for longitudinal household income and human mortality data.

Optimal Transport in the Wasserstein Space
Given a compact interval D in R, we focus on the Wasserstein space W " WpDq of distributions on D (for which second moments are finite), endowed with the L 2 -Wasserstein distance d W . This metric is related to the solution of Monge's optimal transport problem (Villani, 2003) and has repeatedly been rediscovered, with examples including Mallow's distance (Mallows, 1972), earth mover's distance (Rubner et al., 1997) or quantile normalization (Bolstad et al., 2003).
Specifically, the L 2 -Wasserstein distance between any p 1 , p 2 P W is the square root of where g#p is a push-forward measure such that g#ppAq " pptx : gpxq P Auq, for any measurable function g : R Ñ R, distribution p P W and set A Ď R. It is well known (e.g., Cambanis et al., 1976) that if p 1 is atomless, the minimum in (1) is attained at the optimal transport map Υ p 1 ,p 2 " F´1 2˝F 1 from p 1 to p 2 , and is d 2 W pp 1 , p 2 q " ş 1 0 rF´1 1 puq´F´1 2 puqs 2 du. Here, F l and F´1 l are the cumulative distribution function and quantile function of p l for l " 1, 2, where cumulative distribution functions are considered to be right continuous and quantile functions to be left continuous. A distribution p is atomless if it has a continuous cumulative distribution function.
Basic concepts of Riemannian manifolds can be analogously defined in the Wasserstein space W based on optimal transport maps (e.g., Ambrosio et al., 2008;Bigot et al., 2017;Zemel and Panaretos, 2019). Suppose that p 0 is an atomless reference probability measure in W. The tangent space at p 0 is defined as (Equation (8.5.1), Ambrosio et al., 2008) T p 0 " tηpΥ p 0 ,p´i dq : p P W, η ą 0u where L 2 p 0 " L 2 p 0 pDq is the Hilbert space of p 0 -square-integrable functions on D Ă R, with inner product x¨,¨y p 0 and norm }¨} p 0 ; we reserve the notations without subscripts x¨,¨y and }¨} for the the inner product and norm corresponding to the Lebesgue measure. Due to the atomlessness of p 0 , the tangent space T p 0 is a subspace of L 2 p 0 equipped with the same inner product and induced norm. The exponential map Exp p 0 : T p 0 Ñ W is then defined by Although the exponential map here is not a local homeomorphism as in Riemannian manifolds (Ambrosio et al., 2004), any p P W can be recovered from p 0 by Exp p 0 pΥ p 0 ,p´i dq, which motivates the definition of the inverse of the exponential map, i.e., the logarithmic map Log p 0 : W Ñ T p 0 , The tangent vector given by log maps quantifies the difference between p 0 and p. Indeed, }Log p 0 p} p 0 " d W pp 0 , pq. Furthermore, the difference between optimal transport maps and the identity map reveals how mass is transported between distributions provided that the order is preserved. Specifically, given x P D, if Υ p 1 ,p 2 pxq ą x (respectively, Υ p 1 ,p 2 pxq ă x), then x should be moved to the right (respectively, left) to Υ p 1 ,p 2 pxq in order to keep its rank, i.e., F 2 pΥ p 1 ,p 2 pxqq " F 1 pxq.

Wasserstein Temporal Gradients
Let pT, P q be a pair of random elements in TˆW with joint distribution F, where T Ď R is the time domain. We assume (A1) P is atomless almost surely.
Note that Erd 2 W pP, pq | T " ts ď diampDq 2 ă 8 for all p P W and t P T . Since the Wasserstein space W is a Hadamard space (Kloeckner, 2010), there exists a unique minimizer of Erd 2 W pP,¨q | T " ts (Sturm, 2003). Thus, the conditional Fréchet mean µ ' ptq of P given T " t is well-defined; specifically, µ ' ptq " argmin pPW M pp, tq, with M pp, tq :" Erd 2 W pP, pq | T " ts, and the quantile function of µ ' ptq is given by F´1 µ ' ptq p¨q " ErF´1 P p¨q | T " ts, where F´1 P is the quantile function of P .
To model the instantaneous temporal evolution of probability distributions, we are aiming to generalize the notion of derivatives, which are used to quantify the instantaneous change of differentiable real-valued functions, to the scenario of temporal distribution flows. As discussed in Section 2.1, log maps quantify the discrepancy between two probability distributions. We note that the atomlessness of µ ' ptq is guaranteed by (A1). Hence, a measure of instantaneous temporal evolution of distributions, the Wasserstein temporal gradient at time t P T can be defined by provided that the bivariate function pt, uq Þ Ñ F´1 µ ' ptq puq is differentiable with respect to t. Here, F µ ' psq and F´1 µ ' psq are the cumulative distribution function and quantile function of µ ' psq for s P T . If there exists g P L 1 pT q such that d W pµ ' psq, µ ' ptqq ď ş t s gpxq dx, then µ ' is an absolutely continuous curve in the Wasserstein space, and V t is also referred to as the velocity vector of µ ' (Ambrosio et al., 2004).
Example 1. For t P T , let µ ' ptq " N r0,1s pζ t , υ 2 t q be a truncated Gaussian distribution on the interval r0, 1s. Then the Wasserstein temporal gradient at t is where F µ ' ptq pxq " rΦppx´ζ t q{υ t q´Φp´ζ t {υ t qs{rΦpp1´ζ t q{υ t q´Φp´ζ t {υ t qs, Φ and ϕ are the cumulative distribution function and density of standard normal distributions, respectively, and we use the notation g 1 t " pd{dtqg t " pd{dtqgptq for a function g. Densities and Wasserstein temporal gradients of N r0,1s pζ t , υ 2 t q with different values of ζ t and υ t are shown in Figure 2.
Example 2. For t P T , let µ ' ptq be atomless distributions in a location-scale family with location and scale parameters being ξ t and ς t , respectively. Specifically, the cumulative distribution function of µ ' ptq is given by x Þ Ñ Gppx´ξ t q{ς t q, where G : R Ñ r0, 1s is a template cumulative distribution function. Then the Wasserstein temporal gradient at t P T is For a real-valued differentiable function g : T Ñ R, Thus, comparing the actual flow g 1 ptq for a given longitudinal trajectory gptq with the optimal flow V t pgptqq provides insights into how the rank of gptq changes at each time t. If g 1 ptq ą V t pgptqq (respectively, g 1 ptq ă V t pgptqq), then d dt F µ ' ptq pgptqq is positive (respectively, negative), i.e., the rank of gptq increases (respectively, decreases) instantaneously at time t.

Distribution Estimation
In practice, distributions are usually not fully observed. This creates an additional challenge for the implementation of the Wasserstein temporal gradients. This issue can be addressed, for example, by estimating cumulative distribution functions (e.g., Aggarwal, 1955;Read, 1972;Falk, 1983;Leblanc, 2012), or estimating quantile functions (e.g., Parzen, 1979;Falk, 1984;Yang, 1985;Cheng and Parzen, 1997) of the underlying distributions from which the observed data are sampled. Note that with any quantile function estimator p F´1 (respectively, cumulative distribution function estimator p F ), the corresponding cumulative distribution function (respectively, quantile function) can be obtained by right (respectively, left) continuous inversion, Alternatively, one can first estimate densities (Panaretos and Zemel, 2016;Petersen and Müller, 2016) and then obtain the cumulative distribution functions and quantile functions by integration and inversion. Suppose tpT i , P i qu n i"1 are n independent realizations of pT, P q. Available observations are samples of independent measurements tX ij u m i j"1 generated from P i , respectively, where m i are the sample sizes which may vary across distributions P i , for i " 1, . . . , n. Note that the observed data X ij result from two independent random mechanisms: The first of these generates independently and identically distributed pairs pT i , P i q; the second generates samples of observations tX ij u m i j"1 according to each distribution P i , i.e., X ij " P i independently.
For a given distribution p P W, with a cumulative distribution function estimate p F obtained by any estimation method based on a random sample generated from p, we denote by p p " πp p F q the distribution associated with p F . We make the following assumption on the discrepancy of the estimated and true probability distributions for the theoretical analysis of the Wasserstein temporal gradient estimation.
(D1) For any distribution p P W, with nonnegative decreasing sequence α m " op1q as m Ñ 8, the corresponding estimate p p based on a sample of size m generated from p satisfies sup pPW Erd 2 W pp p, pqs " Opα m q.
We note that this assumption can be easily satisfied. For example, the density estimator proposed by Panaretos and Zemel (2016) where f p is the density function of a distribution p P W, supppf p q " D is the support of p and C ą 0 is a constant, then the empirical measure satisfies (D1) with α m " m´1.
In order to deal with the estimation of n distributions simultaneously, we also require (D2) There exists a sequence m " mpnq such that min 1ďiďn tm i u ě m and m Ñ 8 as n Ñ 8.

Estimation of Wasserstein Temporal Gradients
We assume that for each i " 1, . . . , n, we obtain an estimate p F P i of the cumulative distribution function of P i by one of the methods discussed in Section 3.1 from the observed data tX ij u m i j"1 . Denote by p P i " πp p F P i q the distribution associated with p F P i . Since the discrepancy Erd 2 W pP i , µ ' pT i qq | T i s between the random distributions P i and the conditional Fréchet means µ ' pT i q does not vanish as n Ñ 8, difference quotients based on the estimated distributions p P i are not directly suitable as an estimate of Wasserstein temporal gradients.
Accordingly, we utilize local Fréchet regression (Petersen and Müller, 2019a) to smooth the distributions t p P i u over time, which yields consistent estimates of µ ' ptq, for any t P T . Following Petersen and Müller (2019a), we define the localized Fréchet mean by Here, wps, t, hq " K h ps´tqrκ 2 ptq´κ 1 ptqps´tqs{σ 2 0 ptq, where κ z ptq " ErK h pT´tqpT´tq z s, for z " 0, 1, 2, σ 2 0 ptq " κ 0 ptqκ 2 ptq´κ 1 ptq 2 , K h p¨q " Kp¨{hq{h, K is a smoothing kernel, i.e., a density function symmetric around zero, and h " hpnq ą 0 is a bandwidth sequence. If assuming the distributions P i are fully observed, setting p wps, t, hq " K h ps´tqrp κ 2 ptq´p κ 1 ptqps´tqs{p σ 2 0 ptq, where p κ z ptq " n´1 ř n i"1 K h pT i´t qpT i´t q z , for z " 0, 1, 2, and p σ 2 0 ptq " p κ 0 ptqp κ 2 ptq´p κ 1 ptq 2 , an oracle local Fréchet regression estimate is In practice, we usually only observe random samples of measurements X ij generated from P i . Replacing P i with the corresponding estimates p P i as discussed in Section 3.1, a data-based local Fréchet regression estimate is For simplicity, we assume for theoretical analysis that the support of the marginal density f T of T , i.e., T " tt P R : f T ptq ą 0u, is connected. Let T˝be the interior of T . Furthermore, we require the following assumptions for the asymptotic analysis of the weights wpT, t, hq and p wpT i , t, hq in ν ' ptq and p ν ' ptq, respectively.
(R1) The kernel K is a probability density function, symmetric around zero and continuous on r´1, 1s, such that Kpxq " 0, for all |x| ą 1.
(R2) The marginal density f T of T exists and is continuous on T and twice continuously differentiable on T˝. The second-order derivative f 2 T is bounded, sup tPT˝| f 2 T ptq| ă 8.
For any t P T˝, with the local Fréchet regression estimate p ν ' ptq as per (5) and some small ∆ ą 0, an estimate of the Wasserstein temporal gradient V t in (2) is then given by where F p ν ' psq and F´1 p ν ' psq are the cumulative distribution function and quantile function of p ν ' psq for s P T .

Parallel Transport
Note that the true and estimated Wasserstein temporal gradients lie in different tangent spaces; specifically, V t P T µ ' ptq and p V t,∆ P T p ν ' ptq . To quantify the estimation discrepancy of p V t,∆ , an expedient tool is parallel transport, which is commonly used for manifold-valued data (e.g., Yuan et al., 2012;Lin and Yao, 2018;Petersen and Müller, 2019b;Chen et al., 2020). For two probability measures p 1 , p 2 P W, a parallel transport operator Γ p 1 ,p 2 : L 2 p 1 Ñ L 2 p 2 is defined by where F 2 and F´1 1 are the cumulative distribution function of p 2 and quantile function of p 1 , respectively.
Note that since p 1 and p 2 are atomless, the tangent spaces satisfy T p k Ă L 2 p k for k " 1, 2, and the parallel transport operator Γ p 1 ,p 2 | Tp 1 restricted to the tangent space T p 1 defines the parallel transport between tangent spaces T p 1 and T p 2 . Furthermore, the parallel transport operator Γ p 2 ,p 1 from L 2 p 2 to L 2 p 1 is the adjoint operator of Γ p 1 ,p 2 , i.e., xΓ p 1 ,p 2 g 1 , g 2 y p 2 " xg 1 , Γ p 2 ,p 1 g 2 y p 1 . Thus, the discrepancy between functions g 1 P T p 1 and g 2 P T p 2 can be quantified by }Γ p 2 ,p 1 g 2´g1 } p 1 .

Asymptotic Theory
As discussed in Section 3.3, in order to justify }Γ p ν ' ptq,µ ' ptq p V t,∆´Vt } µ ' ptq as a measure of estimation discrepancy of p V t,∆ , we require the atomlessness of µ ' ptq and p ν ' ptq. The former follows from (A1). However, the latter is not guaranteed in general. For theoretical derivations, we instead consider an atomless variant q ν ' ptq of p ν ' ptq, which is defined as follows. Suppose min D " x 0 ă x 1 ă¨¨¨ă x B " max D is an equidistant grid on D with increment b. Then the cumulative distribution function of q ν ' ptq is given by F q ν ' ptq pxq " F p ν ' ptq px l´1 q`b´1px´x l´1 qrF p ν ' ptq px l q´F p ν ' ptq px l´1 qs, for x P rx l´1 , x l q; F q ν ' ptq pxq " 0 and 1 for x ă x 0 and x ě x B , respectively. We assume that b " bpnq is a positive sequence such that b Ñ 0 as n Ñ 8. Hence, an estimate of the Wasserstein temporal gradient at time t based on q ν ' psq with s P T is given by To obtain the convergence rate of q V t,∆ , we also require the following assumption.
We take ∆ " h; this choice, together with suitable values for h, b and ∆, will lead to Wasserstein temporal gradient estimates q V t,∆ with an asymptotic rate of convergence that matches the wellknown optimal rate of derivative estimation for nonparametric regression for the case of real-valued responses assuming twice continuous differentiability of the regression function. This optimal rate is for example achieved by derivative estimates based on local polynomial fitting (Müller, 1987;Fan and Gijbels, 1996).
Proofs are in the appendix.

Implementation and Simulations
There are two tuning parameters for implementation of Wasserstein temporal gradients, namely the bandwidth h involved in the local Fréchet regression as per (5) and the time increment ∆ used in the difference quotient estimator as per (6). As suggested by the theoretical analysis in Section 3.4, we take ∆ " h in practice. We choose the bandwidth h by leave-one-out cross validation, where the objective function to be minimized is the mean discrepancy between the local Fréchet regression estimates and the observed distributions; specifically, where p ν´i 'h 1 pT i q is the local Fréchet regression estimate of µ ' pT i q obtained with bandwidth h 1 based on the sample excluding the ith pair pT i , p P i q, i.e., and p P i is the estimate of P i based on the observed measurements tX ij u m i j"1 as discussed in Section 3.1. In practice, we replace leave-one-out cross validation by 10-fold cross validation when n ą 30.
Step 3: Draw an independently and identically distributed sample tX ij u m j"1 of size m from each of the distributions tP i u n i"1 .
Four cases were considered with n P t50, 200u and m P t25, 500u. We simulated 500 runs for each pair pn, mq. To evaluate the performance of the Wasserstein temporal gradient estimate based on the local Fréchet regression as per (6), we computed the integrated error (IE) for given t P r0, 1s; specifically, The results are summarized in the boxplots of IEs in Figure 3. It can be seen that the estimation error decreases as n or m increases.

Applications
In this section, we will demonstrate the proposed Wasserstein gradients for time-dependent household income and human mortality data. As mentioned before, the underlying densities are practically never known and need to be estimated from data that they generate. In the household income and mortality examples, the data are reported in the form of histograms, respectively life tables. Our methods can be applied in a straightforward way to histogram data; specifically we estimate the densities by applying a smoothing step, e.g., using local linear regression. For local Fréchet regression, we use the Epanechnikov kernel function Kptq " 0.75p1´t 2 q1 r´1,1s ptq and choose smoothing bandwidths h by cross validation.

Household Income Data
Many studies have been conducted on income distribution and inequality (Jones, 1997;Heathcote et al., 2010), since this is a major measure of economic equality/inequality. The evolution of income distributions over time is of particular interest as it provides quantification of the directions in which income inequality is evolving. The US Census Bureau provides histogram data of US household income over calendar years from 1994 to 2016, available at https://census.gov. To make incomes of different years comparable, adjustments for inflation have been made, using the year 2000 as baseline for constant dollars.
We focus on incomes less than $300, 000. The data require some preprocessing, as the width of the histogram bins changed between 2000 and 2001; for 2010, due to census changes, two datasets are available based on both census 2010 and 2000 populations; for 2013, two sets of data are also available and one of them is based on a redesigned questionnaire which has been used since then. To  Figure 4 reveals not much variation in the income distributions over time except around 2008. The estimated Wasserstein temporal gradients as per (6) (with bandwidths h " 1.99, 1.55, 1.50 and 1.50 years for the four periods, respectively, chosen by cross validation, see Section 4 and time increment ∆ " h) demonstrate how the income of poor, middle-class and rich households evolved for the four periods in Figure 5. The Wasserstein temporal gradients for the ending years of each period cannot be well estimated due to the relative large value of ∆, and hence the results for 2000, "2010 pop00", "2013" and 2016 are not displayed. Since the local Fréchet regression has increased variance near endpoints, the estimated Wasserstein temporal gradients on the two ends of each period are somewhat unreliable.  It can be seen in Figure 5 that for the period 1994-1999, incomes of households at the same percentile levels increased almost throughout, except for relatively poor households whose incomes tended to decrease in 1999. Incomes of households earning more than $150,000 per year increased much faster than the other incomes. For the second period 2001-2010, the economic status of the lower and middle earners was stable in the first three years, rose in 2004-2006, and then declined starting in 2007. Higher incomes declined until 2002, and beginning in 2003, a divide manifested itself in the higher income levels: The lower tier of higher incomes was associated with declining income, whereas the higher tier was associated with increasing income, except for 2007 and 2008. Note that in 2007 and 2008, all household incomes tended to decrease, coinciding with the financial crisis. For the last two periods, it can be seen that household incomes gradually recovered from the crisis. While top incomes above 240,000 US dollars always gained, households with relatively low incomes did not recover until around 2014.

Human Mortality Data
The analysis of mortality data across countries and species has found interest in demography and statistics (Carey et al., 1992;Chiou and Müller, 2009;Ouellette and Bourbeau, 2011;Hyndman et al., 2013;Shang and Hyndman, 2017). Of particular interest is how the distribution of age-ofdeath evolves over time. The Human Mortality Database (http://www.mortality.org) provides data of yearly life tables for 37 countries, from which the distributions of ages-at-death in terms of histograms can be extracted.
We focus on ages-at-death in the age interval r0, 100s (in years) and take Russia, Sweden and the United States as three examples. The densities obtained by smoothing the histogram data for females and males separately are shown in Figures 6, 7 and 1, respectively. It can been seen that densities of mortality and their changes vary across these three countries, which is partly due to the different domains in terms of calendar years during which country-specific mortality has been recorded, which goes much further into the past for Sweden than for the other countries. Estimates of the Wasserstein temporal gradients have been obtained with bandwidths h chosen by cross validation as discussed in Section 4 per gender and country (see Table 1 for details) and ∆ " h. For Russia, the densities of ages-at-death from 1961 to 2010 are shown in the top two panels in Figure 6. The age-at-death distributions are quite different between females and males; female adults tend to live longer than males. The estimates of the Wasserstein temporal gradients for 1961-2010 for Russia are shown in the bottom two panels in Figure 6, and were obtained based on data from 1959 to 2014; the estimated gradients for the first and last two years were excluded due to boundary effects. Between 1970 and 2000, the movement of mortality to higher ages and thus longer life was interrupted, with a lot of variation during this period, and resumed only in the 2000s, where substantial improvement occurs in children's mortality. In the 1990s, there was a remarkable reversal in the trend of longevity increase, as the estimates of the Wasserstein temporal gradients were negative for those years, especially for young females and mid-age males.
For Sweden, as shown in Figure 7, the densities of ages-at-death of females and males are quite similar, indicating a general increase in longevity over the years. The estimated Wasserstein temporal gradients for Sweden in Figure 7 from 1754 to 2012, which are obtained based on data from 1751 to 2016, show some volatility in the age-at-death distributions for both females and males, especially before 1950. Compared to Russia, the evolution of the age-at-death distributions in Sweden is more balanced-years where the distribution moves to the left (right) are followed by years with a rightward (leftward) movement in the distribution. The Wasserstein temporal gradients for Sweden vary in a much larger range than Russia, which is partly due to the inclusion of early calendar years, where the variation of mortality from year to year was much larger, compared to more recent calendar years. For example, the top orange curve for 1773 demonstrates a massive increasing trend in life span for both females and males while the bottom orange curves for 1770-1771 demonstrate a strongly decreasing trend. For the US, the age-at-death distributions are somewhat similar across genders. The estimates of the Wasserstein temporal gradients from 1936 to 2010 obtained based on data from 1933 to 2015 for the US in Figure 8 indicate that age-at-death distributions tend to move to the right in almost all years, suggesting increasing longevity. However, for several of the years since the 1980s, reversals can be found for both females and males. A major reversal can be found for the males from young adults to middle age during 1983-1987. This puzzling reversal has been attributed to drug use (e.g., Case and Deaton, 2015). Appendix: Derivation of Theorem 1 For t P T , we define q t p¨q " E " wpT, t, hqF´1 P p¨q ‰ , where F´1 P , F´1 P i and p F´1 P i are the quantile functions of P , P i and p P i , respectively. Considering any fixed t P T˝, we will show that For q t , we will show that with sufficiently small h, Bq t puq{Bu ą 0, for all u P p0, 1q, (A.2) i.e., q t is a quantile function, whence we will show that Furthermore, we will show that }p q t`h´r q t`h } " O p´p α m hq 1{2¯, Similar results hold when replacing t`h with t´h. We note that for all t P T , }F´1 q ν ' ptq´F´1 p ν ' ptq } ď b, a.s. In conjunction with the atomlessness of µ ' ptq and q ν ' ptq, taking ∆ " h yields whence (7) follows with h " n´1 {5 , b " Opn´2 {5 q and α m " Opn´3 {5 q. Next, we will prove (A.1)-(A.4), respectively. For any given t 0 P T˝, there exists ρ ą 0 such that rt 0´ρ , t 0`ρ s Ă T . We note that under (R1) and (R2), as h Ñ 0, it holds for z P t0, 1u that E rK h pT´tqpT´tq z s 2 " Oph 2z´1 q, where K k,l " ş 1 1 Kpsq k s l ds, for k, l P N and the O and o terms are uniform in t P rt 0´ρ , t 0`ρ s. For the following proofs, we define κz ptq " ErK h pT i´t q|T i´t | z s and p κz ptq " n´1 ř n i"1 rK h pT it q|T i´t | z s, for z " 0, 1, 2, 3. .