An important question in health services research is the
estimation of the proportion of medical expenditures that exceed
a given threshold. Medical expenditures typically have
highly skewed, heavy-tailed distributions, for which (a) simple
variable transformations are insufficient to achieve a tractable
low-dimensional parametric form and (b) nonparametric methods
are not efficient in estimating exceedance probabilities for
large thresholds. Motivated by this context, in this paper we
propose a general Bayesian approach for the estimation of tail
probabilities of heavy-tailed distributions, based on a mixture
of gamma distributions in which the mixing occurs over the shape
parameter. This family provides a flexible and novel approach
for modeling heavy-tailed distributions, is computationally
efficient, and requires specifying a prior distribution for only
a single parameter. In simulation studies, we
compare our approach with commonly used methods, such as the
log-normal model and nonparametric alternatives, and find that
the mixture-gamma model significantly improves predictive
performance in estimating tail probabilities compared to these
alternatives. We also apply our method to the Medicare Current
Beneficiary Survey (MCBS), for which we estimate the probability
of exceeding a given hospitalization cost for
smoking-attributable diseases. We have implemented the method in the
open-source GSM package, available from the Comprehensive R
Archive Network (CRAN).