The Annals of Applied Statistics

Gamma shape mixtures for heavy-tailed distributions

Sergio Venturini, Francesca Dominici, and Giovanni Parmigiani

Full-text: Open access


An important question in health services research is the estimation of the proportion of medical expenditures that exceed a given threshold. Medical expenditures typically have highly skewed, heavy-tailed distributions, for which (a) simple variable transformations are insufficient to achieve a tractable low-dimensional parametric form and (b) nonparametric methods are not efficient in estimating exceedance probabilities for large thresholds. Motivated by this context, we propose a general Bayesian approach for estimating tail probabilities of heavy-tailed distributions, based on a mixture of gamma distributions in which the mixing occurs over the shape parameter. This family provides a flexible and novel approach for modeling heavy-tailed distributions; it is computationally efficient and requires specifying a prior distribution for only a single parameter. In simulation studies, we compare our approach with commonly used methods, such as the log-normal model and nonparametric alternatives. We find that the mixture-gamma model significantly improves predictive performance in estimating tail probabilities compared to these alternatives. We also apply our method to the Medicare Current Beneficiary Survey (MCBS), for which we estimate the probability of exceeding a given hospitalization cost for smoking-attributable diseases. We have implemented the method in the open-source GSM package, available from the Comprehensive R Archive Network.
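As a rough illustration of the kind of quantity the abstract describes, the sketch below computes an exceedance probability under a gamma shape mixture. It assumes a form the abstract does not spell out: component shapes equal to the integers 1, ..., J, a single common rate parameter, and fixed mixture weights. The function names and the numerical values are hypothetical, and this is not the GSM package's interface or the authors' Bayesian estimation procedure, only a plug-in calculation for given parameter values.

```python
import math


def gamma_survival_int_shape(t, shape, rate):
    """P(X > t) for X ~ Gamma(shape, rate) with a positive integer shape.

    For integer shape the survival function equals a Poisson CDF:
    sum_{k=0}^{shape-1} exp(-rate*t) * (rate*t)**k / k!
    """
    x = rate * t
    term = math.exp(-x)  # k = 0 term
    total = term
    for k in range(1, shape):
        term *= x / k  # builds exp(-x) * x**k / k! incrementally
        total += term
    return total


def mixture_tail_probability(t, weights, rate):
    """P(X > t) under the mixture sum_j weights[j-1] * Gamma(j, rate).

    Assumption: component j has shape j (j = 1..J) and all components
    share the same rate; only the shape varies across components.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * gamma_survival_int_shape(t, j, rate)
        for j, w in enumerate(weights, start=1)
    )


# Hypothetical values, chosen only to exercise the functions.
weights = [0.5, 0.3, 0.2]  # weights on shapes 1, 2, 3
rate = 0.5                 # common rate parameter
tail = mixture_tail_probability(10.0, weights, rate)
print(f"P(X > 10) = {tail:.4f}")
```

Because the tail probability is a weighted sum of component survival functions, uncertainty about the weights and the rate could in principle be propagated by evaluating the same function over posterior draws, though how the paper does this is not stated in the abstract.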

Article information

Ann. Appl. Stat. Volume 2, Number 2 (2008), 756-776.

First available: 3 July 2008


Venturini, Sergio; Dominici, Francesca; Parmigiani, Giovanni. Gamma shape mixtures for heavy-tailed distributions. The Annals of Applied Statistics 2 (2008), no. 2, 756–776. doi:10.1214/07-AOAS156.



