The Annals of Applied Statistics

Gamma shape mixtures for heavy-tailed distributions

Sergio Venturini, Francesca Dominici, and Giovanni Parmigiani
Source: Ann. Appl. Stat. Volume 2, Number 2 (2008), 756-776.

Abstract

An important question in health services research is the estimation of the proportion of medical expenditures that exceed a given threshold. Typically, medical expenditures present highly skewed, heavy-tailed distributions, for which (a) simple variable transformations are insufficient to achieve a tractable low-dimensional parametric form and (b) nonparametric methods are not efficient in estimating exceedance probabilities for large thresholds. Motivated by this context, in this paper we propose a general Bayesian approach for the estimation of tail probabilities of heavy-tailed distributions, based on a mixture of gamma distributions in which the mixing occurs over the shape parameter. This family provides a flexible and novel approach for modeling heavy-tailed distributions, is computationally efficient, and requires specifying a prior distribution for only a single parameter. Using simulation studies, we compare our approach with commonly used methods, such as the log-normal model and nonparametric alternatives. We found that the mixture-gamma model significantly improves predictive performance in estimating tail probabilities compared to these alternatives. We also applied our method to the Medicare Current Beneficiary Survey (MCBS), for which we estimate the probability of exceeding a given hospitalization cost for smoking-attributable diseases. We have implemented the method in the open source GSM package, available from the Comprehensive R Archive Network.
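
As a rough illustration of the quantity the abstract targets, the sketch below computes an exceedance probability P(Y > x) under a finite mixture of gamma distributions that share one rate parameter and differ only in shape. The abstract does not give the parameterization, so the integer shapes 1, ..., J, the common rate `rate`, and the function names are assumptions made for this example. It is not the GSM package's interface, and it evaluates the tail for fixed weights and rate rather than averaging over posterior draws as a Bayesian analysis would.

```python
import math


def gamma_sf_integer_shape(x, shape, rate):
    """P(X > x) for X ~ Gamma(shape, rate) with a positive integer shape.

    For integer shape the survival function is a finite Poisson sum:
    sum_{k=0}^{shape-1} exp(-rate*x) * (rate*x)^k / k!.
    """
    if shape < 1 or int(shape) != shape:
        raise ValueError("shape must be a positive integer")
    if x <= 0:
        return 1.0
    lam = rate * x
    term = math.exp(-lam)  # k = 0 term of the Poisson sum
    total = term
    for k in range(1, int(shape)):
        term *= lam / k  # turns the (k-1)-th term into the k-th term
        total += term
    return total


def mixture_tail_prob(x, weights, rate):
    """P(Y > x) when Y follows a gamma shape mixture.

    weights[j - 1] is the mixing weight on the component with shape j
    (assumed layout); all components share the same rate.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(
        w * gamma_sf_integer_shape(x, j, rate)
        for j, w in enumerate(weights, start=1)
    )


# Hypothetical inputs: three components and a made-up cost threshold.
example_tail = mixture_tail_prob(5000.0, [0.5, 0.3, 0.2], rate=0.001)
```

Because the mixture's survival function is a weighted sum of the component survival functions, the tail probability needs no numerical integration under these assumptions.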

Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1215118537
Digital Object Identifier: doi:10.1214/07-AOAS156
Zentralblatt MATH identifier: 05591297
Mathematical Reviews number (MathSciNet): MR2524355


2013 © Institute of Mathematical Statistics
