Statistical Science

Bayesian model averaging: a tutorial (with comments by M. Clyde, David Draper and E. I. George, and a rejoinder by the authors)

Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky


Abstract

Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.
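The "coherent mechanism" referred to in the abstract is averaging over the candidate models themselves. For orientation (this is the standard statement, corresponding to equation (1) of the paper), the posterior distribution of a quantity of interest Δ given data D, under candidate models M_1, ..., M_K, is:

    % BMA posterior of a quantity of interest \Delta given data D:
    \[
      p(\Delta \mid D) = \sum_{k=1}^{K} p(\Delta \mid M_k, D)\, p(M_k \mid D),
    \]
    % with posterior model probabilities obtained from the integrated
    % likelihood of each model via Bayes' theorem:
    \[
      p(M_k \mid D) =
        \frac{p(D \mid M_k)\, p(M_k)}{\sum_{l=1}^{K} p(D \mid M_l)\, p(M_l)},
      \qquad
      p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k.
    \]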

Article information

Source
Statist. Sci. Volume 14, Number 4 (1999), 382-417.

Dates
First available in Project Euclid: 24 December 2001

Permanent link to this document
http://projecteuclid.org/euclid.ss/1009212519

Digital Object Identifier
doi:10.1214/ss/1009212519

Mathematical Reviews number (MathSciNet)
MR1765176

Keywords
Bayesian model averaging; Bayesian graphical models; learning; model uncertainty; Markov chain Monte Carlo

Citation

Hoeting, Jennifer A.; Madigan, David; Raftery, Adrian E.; Volinsky, Chris T. Bayesian model averaging: a tutorial (with comments by M. Clyde, David Draper and E. I. George, and a rejoinder by the authors). Statist. Sci. 14 (1999), no. 4, 382--417. doi:10.1214/ss/1009212519. http://projecteuclid.org/euclid.ss/1009212519.



References

  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (B. N. Petrov and F. Csáki, eds.) 267-281. Akadémiai Kiadó, Budapest.
  • Barnard, G. A. (1963). New methods of quality control. J. Roy. Statist. Soc. Ser. A 126 255.
  • Bates, J. M. and Granger, C. W. J. (1969). The combination of forecasts. Operational Research Quarterly 20 451-468.
  • Berger, J. O. and Delampady, M. (1987). Testing precise hypotheses. Statist. Sci. 2 317-352.
  • Berger, J. O. and Sellke, T. (1987). Testing a point null hypothesis (with discussion). J. Amer. Statist. Assoc. 82 112-122.
  • Bernardo, J. and Smith, A. (1994). Bayesian Theory. Wiley, Chichester.
  • Besag, J. E., Green, P., Higdon, D. and Mengersen, K. (1995). Bayesian computation and stochastic systems. Statist. Sci. 10 3-66.
  • Breiman, L. (1996). Bagging predictors. Machine Learning 24 123-140.
  • Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). J. Amer. Statist. Assoc. 80 580-619.
  • Brozek, J., Grande, F., Anderson, J. and Keys, A. (1963). Densitometric analysis of body composition: revision of some quantitative assumptions. Ann. New York Acad. Sci. 110 113-140.
  • Buckland, S. T., Burnham, K. P. and Augustin, N. H. (1997). Model selection: an integral part of inference. Biometrics 53 275-290.
  • Buntine, W. (1992). Learning classification trees. Statist. Comput. 2 63-73.
  • Carlin, B. P. and Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods. J. Roy. Statist. Soc. Ser. B 57 473-484.
  • Carlin, B. P. and Polson, N. G. (1991). Inference for nonconjugate Bayesian models using the Gibbs sampler. Canad. J. Statist. 19 399-405.
  • Chan, P. K. and Stolfo, S. J. (1996). On the accuracy of metalearning for scalable data mining. J. Intelligent Integration of Information 8 5-28.
  • Chatfield, C. (1995). Model uncertainty, data mining, and statistical inference (with discussion). J. Roy. Statist. Soc. Ser. A 158 419-466.
  • Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. Amer. Statist. 49 327-335.
  • Clemen, R. T. (1989). Combining forecasts: a review and annotated bibliography. Internat. J. Forecasting 5 559-583.
  • Clyde, M., DeSimone, H. and Parmigiani, G. (1996). Prediction via orthogonalized model mixing. J. Amer. Statist. Assoc. 91 1197-1208.
  • Cox, D. R. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. B 34 187-220.
  • Dawid, A. P. (1984). Statistical theory: the prequential approach. J. Roy. Statist. Soc. Ser. A 147 278-292.
  • Dickinson, J. P. (1973). Some statistical results on the combination of forecasts. Operational Research Quarterly 24 253-260.
  • Dijkstra, T. K. (1988). On Model Uncertainty and Its Statistical Implications. Springer, Berlin.
  • Draper, D. (1995). Assessment and propagation of model uncertainty. J. Roy. Statist. Soc. Ser. B 57 45-97.
  • Draper, D., Gaver, D. P., Goel, P. K., Greenhouse, J. B., Hedges, L. V., Morris, C. N., Tucker, J. and Waternaux, C. (1993). Combining information: National Research Council Panel on Statistical Issues and Opportunities for Research in the Combination of Information. National Academy Press, Washington, DC.
  • Draper, D., Hodges, J. S., Leamer, E. E., Morris, C. N. and Rubin, D. B. (1987). A research agenda for assessment and propagation of model uncertainty. Technical Report Rand Note N-2683-RC, RAND Corporation, Santa Monica, California.
  • Edwards, W., Lindman, H. and Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review 70 193-242.
  • Fernández, C., Ley, E. and Steel, M. F. (1997). Statistical modeling of fishing activities in the North Atlantic. Technical report, Dept. Econometrics, Tilburg Univ., The Netherlands.
  • Fernández, C., Ley, E. and Steel, M. F. (1998). Benchmark priors for Bayesian model averaging. Technical report, Dept. Econometrics, Tilburg Univ., The Netherlands.
  • Fleming, T. R. and Harrington, D. H. (1991). Counting Processes and Survival Analysis. Wiley, New York.
  • Freedman, D. A., Navidi, W. and Peters, S. C. (1988). On the impact of variable selection in fitting regression equations. In On Model Uncertainty and Its Statistical Implications (T. K. Dijkstra, ed.) 1-16. Springer, Berlin.
  • Freund, Y. (1995). Boosting a weak learning algorithm by majority. Inform. and Comput. 121 256-285.
  • Fried, L. P., Borhani, N. O., Enright, P., Furberg, C. D., Gardin, J. M., Kronmal, R. A., Kuller, L. H., Manolio, T. A., Mittelmark, M. B., Newman, A., O'Leary, D. H., Psaty, B., Rautaharju, P., Tracy, R. P. and Weiler, P. G. (1991). The cardiovascular health study: design and rationale. Annals of Epidemiology 1 263-276.
  • Furnival, G. M. and Wilson, R. W. (1974). Regression by leaps and bounds. Technometrics 16 499-511.
  • Geisser, S. (1980). Discussion on sampling and Bayes' inference in scientific modeling and robustness (by GEPB). J. Roy. Statist. Soc. Ser. A 143 416-417.
  • George, E. and McCulloch, R. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88 881-889.
  • George, E. I. (1986a). Combining minimax shrinkage estimators. J. Amer. Statist. Assoc. 81 437-445.
  • George, E. I. (1986b). A formal Bayes multiple shrinkage estimator. Comm. Statist. Theory Methods (special issue on Stein-type multivariate estimation) 15 2099-2114.
  • George, E. I. (1986c). Minimax multiple shrinkage estimation. Ann. Statist. 14 188-205.
  • George, E. I. (1999). Bayesian model selection. In Encyclopedia of Statistical Sciences Update 3. Wiley, New York. To appear.
  • Good, I. J. (1950). Probability and the Weighing of Evidence. Griffin, London.
  • Good, I. J. (1952). Rational decisions. J. Roy. Statist. Soc. Ser. B 14 107-114.
  • Grambsch, P. M., Dickson, E. R., Kaplan, M., LeSage, G., Fleming, T. R. and Langworthy, A. L. (1989). Extramural cross-validation of the Mayo primary biliary cirrhosis survival model establishes its generalizability. Hepatology 10 846-850.
  • Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 711-732.
  • Heckerman, D., Geiger, D. and Chickering, D. M. (1994). Learning Bayesian networks: the combination of knowledge and statistical data. In Uncertainty in Artificial Intelligence, Proceedings of the Tenth Conference (B. L. de Mantaras and D. Poole, eds.) 293-301. Morgan Kaufmann, San Francisco.
  • Hodges, J. S. (1987). Uncertainty, policy analysis, and statistics. Statist. Sci. 2 259-291.
  • Hoeting, J. A. (1994). Accounting for model uncertainty in linear regression. Ph.D. dissertation, Univ. Washington, Seattle.
  • Hoeting, J. A., Raftery, A. E. and Madigan, D. (1996). A method for simultaneous variable selection and outlier identification in linear regression. J. Comput. Statist. 22 251-271.
  • Hoeting, J. A., Raftery, A. E. and Madigan, D. (1999). Bayesian simultaneous variable and transformation selection in linear regression. Technical Report 9905, Dept. Statistics, Colorado State Univ. Available at www.stat.colostate.edu.
  • Ibrahim, J. G. and Laud, P. W. (1994). A predictive approach to the analysis of designed experiments. J. Amer. Statist. Assoc. 89 309-319.
  • Johnson, R. W. (1996). Fitting percentage of body fat to simple body measurements. J. Statistics Education 4.
  • Kass, R. E. and Raftery, A. E. (1995). Bayes factors. J. Amer. Statist. Assoc. 90 773-795.
  • Kass, R. E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses with large samples. J. Amer. Statist. Assoc. 90 928-934.
  • Katch, F. and McArdle, W. (1993). Nutrition, Weight Control, and Exercise, 4th ed. Williams and Wilkins, Philadelphia.
  • Kearns, M. J., Schapire, R. E. and Sellie, L. M. (1994). Toward efficient agnostic learning. Machine Learning 17 115-142.
  • Kincaid, D. and Cheney, W. (1991). Numerical Analysis. Brooks/Cole, Pacific Grove, CA.
  • Kuk, A. Y. C. (1984). All subsets regression in a proportional hazards model. Biometrika 71 587-592.
  • Kwok, S. and Carter, C. (1990). Multiple decision trees. In Uncertainty in Artificial Intelligence (R. Shachter, T. Levitt, L. Kanal and J. Lemmer, eds.) 4 323-349. North-Holland, Amsterdam.
  • Lauritzen, S. L. (1996). Graphical Models. Clarendon Press, Oxford.
  • Lauritzen, S. L., Thiesson, B. and Spiegelhalter, D. J. (1994). Diagnostic systems created by model selection methods: a case study. In Uncertainty in Artificial Intelligence (P. Cheeseman and W. Oldford, eds.) 4 143-152. Springer, Berlin.
  • Lawless, J. and Singhal, K. (1978). Efficient screening of nonnormal regression models. Biometrics 34 318-327.
  • Leamer, E. E. (1978). Specification Searches. Wiley, New York.
  • Lohman, T. (1992). Advances in Body Composition Assessment. Current Issues in Exercise Science. Human Kinetics Publishers, Champaign, IL.
  • Madigan, D., Andersson, S. A., Perlman, M. D. and Volinsky, C. T. (1996). Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs. Comm. Statist. Theory Methods 25 2493-2519.
  • Madigan, D., Gavrin, J. and Raftery, A. E. (1995). Eliciting prior information to enhance the predictive performance of Bayesian graphical models. Comm. Statist. Theory Methods 24 2271-2292.
  • Madigan, D. and Raftery, A. E. (1991). Model selection and accounting for model uncertainty in graphical models using Occam's window. Technical Report 213, Univ. Washington, Seattle.
  • Madigan, D. and Raftery, A. E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. J. Amer. Statist. Assoc. 89 1535-1546.
  • Madigan, D., Raftery, A. E., York, J. C., Bradshaw, J. M. and Almond, R. G. (1994). Strategies for graphical model selection. In Selecting Models from Data: Artificial Intelligence and Statistics (P. Cheeseman and W. Oldford, eds.) 4 91-100. Springer, Berlin.
  • Madigan, D. and York, J. (1995). Bayesian graphical models for discrete data. Internat. Statist. Rev. 63 215-232.
  • Markus, B. H., Dickson, E. R., Grambsch, P. M., Fleming, T. R., Mazzaferro, V., Klintmalm, G., Weisner, R. H., Van Thiel, D. H. and Starzl, T. E. (1989). Efficacy of liver transplantation in patients with primary biliary cirrhosis. New England J. Medicine 320 1709-1713.
  • Matheson, J. E. and Winkler, R. L. (1976). Scoring rules for continuous probability distributions. Management Science 22 1087-1096.
  • McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, 2nd ed. Chapman & Hall, London.
  • Miller, A. J. (1990). Subset Selection in Regression. Chapman and Hall, London.
  • Penrose, K., Nelson, A. and Fisher, A. (1985). Generalized body composition prediction equation for men using simple measurement techniques (abstract). Medicine and Science in Sports and Exercise 17 189.
  • Phillips, D. B. and Smith, A. F. M. (1994). Bayesian model comparison via jump diffusions. Technical Report 94-20, Imperial College, London.
  • Raftery, A. E. (1993). Bayesian model selection in structural equation models. In Testing Structural Equation Models (K. Bollen and J. Long, eds.) 163-180. Sage, Newbury Park, CA.
  • Raftery, A. E. (1995). Bayesian model selection in social research (with discussion). In Sociological Methodology 1995 (P. V. Marsden, ed.) 111-195. Blackwell, Cambridge, MA.
  • Raftery, A. E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83 251-266.
  • Raftery, A. E., Madigan, D. and Hoeting, J. (1997). Bayesian model averaging for linear regression models. J. Amer. Statist. Assoc. 92 179-191.
  • Raftery, A. E., Madigan, D. and Volinsky, C. T. (1996). Accounting for model uncertainty in survival analysis improves predictive performance (with discussion). In Bayesian Statistics 5 (J. Bernardo, J. Berger, A. Dawid and A. Smith, eds.) 323-349. Oxford Univ. Press.
  • Rao, J. S. and Tibshirani, R. (1997). The out-of-bootstrap method for model averaging and selection. Technical report, Dept. Statistics, Univ. Toronto.
  • Regal, R. and Hook, E. B. (1991). The effects of model selection on confidence intervals for the size of a closed population. Statistics in Medicine 10 717-721.
  • Roberts, H. V. (1965). Probabilistic prediction. J. Amer. Statist. Assoc. 60 50-62.
  • Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464.
  • Smith, A. F. M. and Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). J. Roy. Statist. Soc. Ser. B 55 3-23.
  • Spiegelhalter, D. J. (1986). Probabilistic prediction in patient management and clinical trials. Statistics in Medicine 5 421-433.
  • Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S. L. and Cowell, R. G. (1993). Bayesian analysis in expert systems (with discussion). Statist. Sci. 8 219-283.
  • Spiegelhalter, D. J. and Lauritzen, S. (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks 20 579-605.
  • Stewart, L. (1987). Hierarchical Bayesian analysis using Monte Carlo integration: computing posterior distributions when there are many possible models. The Statistician 36 211-219.
  • Taplin, R. H. (1993). Robust likelihood calculation for time series. J. Roy. Statist. Soc. Ser. B 55 829-836.
  • Thompson, E. A. and Wijsman, E. M. (1990). Monte Carlo methods for the genetic analysis of complex traits. Technical Report 193, Dept. Statistics, Univ. Washington, Seattle.
  • Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc. 81 82-86.
  • Volinsky, C. T. (1997). Bayesian model averaging for censored survival models. Ph.D. dissertation, Univ. Washington, Seattle.
  • Volinsky, C. T., Madigan, D., Raftery, A. E. and Kronmal, R. A. (1997). Bayesian model averaging in proportional hazard models: assessing the risk of a stroke. J. Roy. Statist. Soc. Ser. C 46 433-448.
  • Weisberg, S. (1985). Applied Linear Regression, 2nd ed. Wiley, New York.
  • Wolpert, D. H. (1992). Stacked generalization. Neural Networks 5 241-259.
  • York, J., Madigan, D., Heuch, I. and Lie, R. T. (1995). Estimating a proportion of birth defects by double sampling: a Bayesian approach incorporating covariates and model uncertainty. J. Roy. Statist. Soc. Ser. C 44 227-242.
Comments and rejoinder (excerpts)

  • … (Hansen and Kooperberg, 1999). Markov chain Monte Carlo (MCMC) methods provide a stochastic method of obtaining samples from the posterior distributions f(M_k | Y) and f(θ_{M_k} | M_k, Y), and many of the algorithms that the authors mention can be viewed as special cases of reversible jump MCMC algorithms.
  • … (Clyde, Parmigiani and Vidakovic, 1998). Sampling models and σ² in conjunction with the use of Rao-Blackwellized estimators does appear to be more efficient in terms of mean squared error when there is substantial uncertainty in the error variance (i.e., small sample sizes or low signal-to-noise ratio) or important prior information. Recently, Holmes and Mallick (1998) adapted perfect sampling (Propp and Wilson, 1996) to the context of orthogonal regression. While more computationally intensive per iteration, this may prove to be more efficient for estimation than SSVS or MC3 in problems where the method is applicable and sampling is necessary. While Gibbs and MCMC sampling have worked well in high-dimensional orthogonal problems, Wong, Hansen, Kohn and Smith (1997) found that in high-dimensional problems such as nonparametric regression using nonorthogonal basis functions, Gibbs samplers were unsuitable, both on computational-efficiency grounds and for numerical reasons, because the sampler tends to get stuck in local modes. Their proposed sampler "focuses" on variables that are more "active" at each iteration and in simulation studies provided better MSE performance than other classical nonparametric approaches or Bayesian approaches using Gibbs or reversible jump (Holmes and Mallick, 1997) sampling. With the exception of a deterministic search, most methods for implementing BMA rely on algorithms that sample models with replacement and use ergodic averages to compute expectations, as in (7). In problems, such as linear models, where posterior model probabilities are known up to the normalizing constant, it may be more efficient to devise estimators using renormalized posterior model probabilities (Clyde, DeSimone and Parmigiani, 1996; Clyde, 1999a), sketched below, and to devise algorithms based on sampling models without replacement. Based on current work with M. Littman, this appears to be a promising direction for implementation of BMA. While many recent developments have greatly advanced the class of problems that can be handled using BMA, implementing BMA in high-dimensional problems with correlated variables, such as nonparametric regression, is still a challenge, both computationally and in the choice of prior distributions.
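A minimal sketch of the renormalization idea described in this excerpt, assuming a setting (such as linear models) where log posterior model probabilities are available up to a common normalizing constant; the NumPy implementation and function names are illustrative, not taken from the discussion:

    import numpy as np

    def renormalized_model_probs(log_marglik, log_prior):
        """Posterior probabilities of a sampled subset of models,
        renormalized to sum to one over the subset (cf. Clyde,
        DeSimone and Parmigiani, 1996; Clyde, 1999a)."""
        logpost = log_marglik + log_prior       # known up to a constant
        logpost = logpost - logpost.max()       # stabilize the exponentials
        w = np.exp(logpost)
        return w / w.sum()

    def bma_estimate(model_estimates, probs):
        """Average model-specific posterior estimates of a quantity of
        interest using the renormalized weights, in place of the
        ergodic average over models sampled with replacement."""
        return np.dot(probs, model_estimates)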
  • … AIC, BIC, and RIC (Clyde and George, 1998, 1999; George and Foster, 1997; Hansen and Yu, 1999) for both model selection and BMA; the three penalty forms are recalled below.
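For orientation (these are the standard definitions, not quoted from the comment), the three criteria penalize the maximized log-likelihood of a model with d parameters differently, with n the number of observations and p the number of candidate predictors:

    % Standard forms of the three penalized criteria (smaller is better):
    \[
      \mathrm{AIC} = -2\log\hat L + 2d, \qquad
      \mathrm{BIC} = -2\log\hat L + d\log n, \qquad
      \mathrm{RIC} = -2\log\hat L + 2d\log p.
    \]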
  • MODEL AVERAGING, MAYBE. This paper offers a good review of one approach to dealing with statistical model uncertainty, an important topic and one which has only begun to come into focus for us as a profession in this decade (largely because of the availability of Markov chain Monte Carlo computing methods). The authors, who together might be said to have founded the Seattle school of model uncertainty, are to be commended for taking this issue forward so vigorously over the past five years. I have eight comments on the paper, some general and some specific to the body-fat example (Jennifer Hoeting kindly sent me the data, which are well worth looking at; the data set, and a full description of it, may be obtained by emailing the message send jse/v4n1/datasets.johnson to archive@jse.stat.ncsu.edu).
  • … (Draper and Fouskakis, 1999).

5. What characteristics of a statistical example predict when BMA will lead to large gains? The only obvious answer I know is the ratio n/p of observations to predictors (with tens of thousands of observations and only dozens of predictors to evaluate, intuitively the price paid for shopping around in the data for a model should be small). Are the authors aware of any other simple answers to this question? As an instance of the n/p effect, in regression-style problems like the cirrhosis example, where p is in the low dozens and n is in the hundreds, the effect of model averaging on the predictive scale can be modest. HMRV are stretching a bit when they say, in this example, that "the people assigned to the high risk group by BMA had a higher death rate than did those assigned high risk by other methods; similarly those assigned to the low and medium risk groups by BMA had a lower total death rate"; this can be seen by attaching uncertainty bands to the estimates in Table 5. Over the single random split into build and test data reported in that table, and assuming (at least approximate) independence of the 152 yes/no classifications aggregated in the table, death rates in the high-risk group, with binomial standard errors, are 81% ± 5%, 75% ± 6% and 72% ± 6% for the BMA, stepwise, and top PMP methods, and combining the low and medium risk groups yields 18% ± 4%, 19% ± 4% and 17% ± 4% for the three methods, respectively (the standard-error computation is sketched below), hardly a rousing victory for BMA. It is probable that by averaging over many random build-test splits a "statistically significant" difference would emerge, but the predictive advantage of BMA in this example is not large in practical terms.

6. Following on from item (4) above, now that the topic of model choice is on the table, why are we doing variable selection in regression at all? People who think that you have to choose a subset of the predictors typically appeal to vague concepts like "parsimony," while neglecting to mention that the "full model" containing all the predictors may well have better out-of-sample predictive performance than many models based on subsets of the x_j. With the body-fat data, for instance, on the same build-test split used by HMRV, the model that uses all 13 predictors in the authors' Table 7 (fitted by least squares, i.e., Gaussian maximum likelihood) has actual coverage of nominal 90% predictive intervals of 95.0 ± 1.8% and 86.4 ± 3.3% in the build and test data subsets, respectively; this out-of-sample figure is better than any of the standard variable-selection methods tried by HMRV (though not better than BMA in this example). To make a connection with item (5) above, I generated a data set 10 times as big but with the same mean and covariance structure as the body-fat data; with 2,510 total observations the actual coverage of nominal 90% intervals within the 1,420 data values used to fit the model was 90.6 ± 0.8%, and on the other 1,090 observations it was 89.2 ± 0.9%. Thus with only 251 data points and 13 predictors, the "full model" overfits the cases used for estimation and underfits the out-of-sample cases, but this effect disappears with large n for fixed p (the rate at which this occurs could be studied systematically as a function of n and p). (I put "full model" in quotes because the concept of a full model is unclear when things like quadratics and interactions in the available predictors are considered.)
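A small computation showing how the binomial uncertainty bands quoted in item 5 arise. The group size used here is hypothetical (Table 5 is not reproduced in this excerpt); 60 is chosen only because it makes the standard error come out near the quoted 81% ± 5%:

    import math

    def binomial_se(p_hat, n):
        """Standard error of an observed proportion, assuming the n
        yes/no classifications are (approximately) independent."""
        return math.sqrt(p_hat * (1.0 - p_hat) / n)

    # Hypothetical high-risk group size of 60 out of the 152 test cases:
    print(round(binomial_se(0.81, 60), 3))   # ~0.051, i.e. about +/- 5%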
There is another sense in which the "full model" is hard to beat: one can create a rather accurate approximation to the output of the complex, and computationally intensive, HMRV regression machinery in the following closed-form Luddite manner (sketched in code below). (1) Convert y and all of the x_j to standard units, by subtracting off their means and dividing by their SDs, obtaining ỹ and x̃_j (say). This goes some distance toward putting the predictors on a common scale. (2) Use least squares (Gaussian maximum likelihood) to regress ỹ on all [or almost all] the x̃_j, resolving collinearity problems by simply dropping out of the model altogether any x's that are highly correlated with other x's (when in doubt, drop the x in a pair of such predictors that is more weakly correlated with y). This …
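A sketch of the two steps just described, under assumptions the excerpt leaves open: the cutoff 0.95 for "highly correlated" is invented for illustration, and the NumPy least-squares call stands in for least squares fitting. This illustrates the recipe; it is not Draper's actual code:

    import numpy as np

    def luddite_fit(X, y, corr_cutoff=0.95):
        """Steps (1)-(2) of the closed-form approximation sketched above."""
        # Step (1): convert y and every x_j to standard units.
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)
        ys = (y - y.mean()) / y.std()
        # Resolve collinearity: within any pair of predictors correlated
        # above the cutoff, drop the one more weakly correlated with y.
        p = Xs.shape[1]
        ry = np.abs([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(p)])
        R = np.corrcoef(Xs, rowvar=False)
        keep = set(range(p))
        for j in range(p):
            for k in range(j + 1, p):
                if j in keep and k in keep and abs(R[j, k]) > corr_cutoff:
                    keep.discard(j if ry[j] < ry[k] else k)
        cols = sorted(keep)
        # Step (2): least squares on all surviving standardized predictors.
        beta, *_ = np.linalg.lstsq(Xs[:, cols], ys, rcond=None)
        return cols, beta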
  • … estimators in George (1986a, b, c, 1987). However, by going outside the proper prior realm, norming constants …
  • … George and McCulloch (1998). I am currently developing dilution priors for multiple regression and will report on these elsewhere.
  • … 1997). For the purpose of approximating BMA*, I am less sanguine about Occam's window, which is fundamentally a heuristic search algorithm. By restricting attention to the "best" models, the subset of models selected by Occam's window is unlikely to be representative, and may severely bias the approximation away from BMA*. For example, suppose substantial posterior probability were diluted over a large subset of similar models, as discussed earlier. Although MCMC methods would tend to sample such subsets, they would be entirely missed by Occam's window (the selection rule is recalled below). A possible correction for this problem might be to base selection on a uniform prior, i.e., Bayes factors, but then use a dilution prior for the averaging. However, in spite of its limitations as an approximation to BMA*, the heuristics which motivate Occam's window are intuitively very appealing. Perhaps it would simply be appropriate to treat and interpret BMA under Occam's window as a conditional Bayes procedure.
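The selection rule at issue, as defined by Madigan and Raftery (1994): only models whose posterior probability is within a factor C of the best model's are retained (the full version also discards any model containing a more probable submodel), and the averaging is then restricted to this set:

    % Occam's window (Madigan and Raftery, 1994):
    \[
      \mathcal{A}' = \Bigl\{ M_k :
        \frac{\max_l\, p(M_l \mid D)}{p(M_k \mid D)} \le C \Bigr\},
    \]
    % with p(\Delta \mid D) approximated by averaging over M_k \in \mathcal{A}'.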
  • … (DiCiccio et al., 1997; Oh, 1999). For BMA, it is desirable that the prior on the parameters be spread out enough that it is relatively flat over the region of parameter space where the likelihood is substantial (i.e., that we be in the "stable estimation" situation described by Edwards, Lindman and Savage, 1963). It is also desirable that the prior not be much more spread out than is necessary to achieve this. This is because the integrated likelihood for a model declines roughly as σ^(-d) as the prior standard deviation σ increases, where d is the number of parameters in the model. Several default priors along these lines have been proposed. One such proposal is the prior used for linear models by Raftery, Madigan and Hoeting (1997). A second such proposal is the unit information prior (UIP), which is a multivariate normal prior centered at the maximum likelihood estimate with variance matrix equal to the inverse of the mean observed Fisher information in one observation. Under regularity conditions, this yields the simple BIC approximation given by equation (13) in our paper (Kass and Wasserman, 1995; Raftery, 1995); the approximation is restated below.

The unit information prior, and hence BIC, have been criticized as being too conservative (i.e., too likely to favor simple models). Cox (1995) suggested that the prior standard deviation should decrease with sample size. Weakliem (1999) gave sociological examples where the UIP is clearly too spread out, and Viallefont et al. (1998) have shown how a more informative prior can lead to better performance of BMA in the analysis of epidemiological case-control studies. The UIP is a proper prior but seems to provide a conservative solution. This suggests that if BMA based on BIC favors an "effect," we can feel on solid ground in asserting that the data provide evidence for its existence (Raftery, 1999). Thus BMA results based on BIC could be routinely reported as a baseline reference analysis, along with results from other priors if available.

A third approach is to allow the data to estimate the prior variance of the parameters. Lindley and Smith (1972) showed that this is essentially what ridge regression does for linear regression, and Volinsky (1997) pointed out that ridge regression has consistently outperformed other estimation methods in simulation studies. Volinsky (1997) proposed combining BMA and ridge regression by using a "ridge regression prior" in BMA. This is closely related to empirical Bayes BMA, which Clyde and George (1999) have shown to work well for wavelets, a special case of orthogonal regression. Clyde, Raftery, Walsh and Volinsky (2000) show that this good performance of empirical Bayes BMA extends to (nonorthogonal) linear regression.
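The BIC approximation referred to above, in its usual form (cf. Kass and Wasserman, 1995); this is the standard statement, and whether it matches the paper's equation (13) symbol for symbol cannot be checked from this excerpt alone:

    % BIC approximation to the integrated likelihood under the unit
    % information prior, for a model M_k with d_k parameters, MLE
    % \hat\theta_k and n observations:
    \[
      \log p(D \mid M_k) \approx
        \log p(D \mid \hat\theta_k, M_k) - \frac{d_k}{2}\log n.
    \]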
  • … income (Featherman and Hauser, 1977). X1 and X2 are highly correlated, but the mechanisms by which they might impact Y are quite different, so all four models are plausible a priori. The posterior model probabilities are saying that at least one of X1 and …
  • … a LISREL-type model (Bollen, 1989). BMA and Bayesian model selection can still be applied in this context (e.g., Hauser and Kuo, 1998).
  • … 1995). Draper says that model choice is a decision problem, and that the use to which the model is to be put should be taken into account explicitly in the model selection process. This is true, of course, but in practice it seems rather difficult to implement. This was first advocated by Kadane and Dickey (1980) but has not been done much in practice, perhaps because specifying utilities and carrying out the full utility maximization is burdensome, and also introduces a whole new set of sensitivity concerns. We do agree with Draper's suggestion that the analysis of the body fat data would be enhanced by a cost-benefit analysis which took account of both predictive accuracy and data collection costs.

In practical decision-making contexts, the choice of statistical model is often not the question of primary interest, and the real decision to be made is something else. Then the issue is decision-making in the presence of model uncertainty, and BMA provides a solution to this. In equation (1) of our article, let Δ be the utility of a course of action, and choose the action for which E(Δ | D) is maximized.

Draper does not like our Figure 4. However, we see it as a way of depicting on the same graph the answers to two separate questions: is wrist circumference associated with body fat after controlling for the other variables? And if so, how strong is the association? The posterior distribution of β13 has two components corresponding to these two questions. The answer to the first question is "no" (i.e., the effect is zero or small) with probability 38%, represented by the solid bar in Figure 4. The answer to the second question is summarized by the continuous curve. Figure 4 shows double shrinkage, with both discrete and continuous components: the posterior distribution of β13, given that β13 ≠ 0, is shrunk continuously towards zero via its prior distribution, and the posterior is then further shrunk (discretely this time) by taking account of the probability that β13 = 0. (This two-component form is written out below.) The displays in Clyde (1999b) convey essentially the same information, and some may find them more appealing than our Figure 4.

Draper suggests the use of a practical significance caliper and points out that for one choice, this gives similar results to BMA. Of course the big question here is how the caliper is chosen. BMA can itself be viewed as a significance caliper, where the choice of caliper is based on the data. Draper's Table 1 is encouraging for BMA, because it suggests that BMA does coincide with practical significance. It has often been observed that P-values are at odds with "practical" significance, leading to strong distinctions being made in textbooks between statistical and practical significance. This seems rather unsatisfactory for our discipline: if statistical and practical significance do not at least approximately coincide, what is the use of statistical testing? We have found that BMA often gives results closer to the practical significance judgments of practitioners than do P-values.
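The two-component posterior described in the reply to Draper, written out explicitly; the 38% weight is the value quoted above for the body-fat example:

    % Mixed discrete/continuous posterior for the wrist-circumference
    % coefficient \beta_{13} depicted in Figure 4:
    \[
      p(\beta_{13} \mid D) =
        \Pr(\beta_{13} = 0 \mid D)\,\delta_0(\beta_{13})
        + \Pr(\beta_{13} \neq 0 \mid D)\,
          p(\beta_{13} \mid \beta_{13} \neq 0,\, D),
    \]
    % with \Pr(\beta_{13} = 0 \mid D) \approx 0.38 here.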
References (comments and rejoinder)

  • Bollen, K. A. (1989). Structural Equations with Latent Variables. Wiley, New York.
  • Browne, W. J. (1995). Applications of Hierarchical Modelling. M.Sc. dissertation, Dept. Mathematical Sciences, Univ. Bath, UK.
  • Chipman, H., George, E. I. and McCulloch, R. E. (1998). Bayesian CART model search (with discussion). J. Amer. Statist. Assoc. 93 935-960.
  • Clyde, M. (1999a). Bayesian model averaging and model search strategies (with discussion). In Bayesian Statistics 6 (J. M. Bernardo, A. P. Dawid, J. O. Berger and A. F. M. Smith, eds.) 157-185. Oxford Univ. Press.
  • Clyde, M. (1999b). Model uncertainty and health effect studies for particulate matter. ISDS Discussion Paper 99-28. Available at www.isds.duke.edu.
  • Clyde, M. and DeSimone-Sasinowska, H. (1997). Accounting for model uncertainty in Poisson regression models: does particulate matter particularly matter? ISDS Discussion Paper 97-06. Available at www.isds.duke.edu.
  • Clyde, M. and George, E. I. (1998). Flexible empirical Bayes estimation for wavelets. ISDS Discussion Paper 98-21. Available at www.isds.duke.edu.
  • Clyde, M. and George, E. I. (1999). Empirical Bayes estimation in wavelet nonparametric regression. In Bayesian Inference in Wavelet-Based Models (P. Müller and B. Vidakovic, eds.) 309-322. Springer, Berlin.
  • Clyde, M., Parmigiani, G. and Vidakovic, B. (1998). Multiple shrinkage and subset selection in wavelets. Biometrika 85 391-402.
  • Clyde, M., Raftery, A. E., Walsh, D. and Volinsky, C. T. (2000). Technical report. Available at www.stat.washington.edu/tech.reports.
  • Copas, J. B. (1983). Regression, prediction, and shrinkage (with discussion). J. Roy. Statist. Soc. Ser. B 45 311-354.
  • Cox, D. R. (1995). The relation between theory and application in statistics (with discussion). Test 4 207-261.
  • de Finetti, B. (1931). Funzione caratteristica di un fenomeno aleatorio. Atti Acad. Naz. Lincei 4 86-133.
  • de Finetti, B. (1974, 1975). Theory of Probability 1 and 2. (Trans. by A. F. M. Smith and A. Machì.) Wiley, New York.
  • Dellaportas, P. and Forster, J. J. (1996). Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Technical Report, Faculty of Mathematics, Southampton Univ., UK.
  • DiCiccio, T. J., Kass, R. E., Raftery, A. E. and Wasserman, L. (1997). Computing Bayes factors by combining simulation and asymptotic approximations. J. Amer. Statist. Assoc. 92 903-915.
  • Draper, D. (1999a). Discussion of "Decision models in screening for breast cancer" by G. Parmigiani. In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 541-543. Oxford Univ. Press.
  • Draper, D. (1999b). Hierarchical modeling, variable selection, and utility. Technical Report, Dept. Mathematical Sciences, Univ. Bath, UK.
  • Draper, D. and Fouskakis, D. (1999). Stochastic optimization methods for cost-effective quality assessment in health. Unpublished manuscript.
  • Featherman, D. and Hauser, R. (1977). Opportunity and Change. Academic Press, New York.
  • Gelfand, A. E., Dey, D. K. and Chang, H. (1992). Model determination using predictive distributions, with implementation via sampling-based methods (with discussion). In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 147-167. Oxford Univ. Press.
  • Gelman, A., Meng, X.-L. and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statist. Sinica 6 733-760.
  • George, E. I. (1987). Multiple shrinkage generalizations of the James-Stein estimator. In Contributions to the Theory and Applications of Statistics A Volume in Honor of Herbert Solomon (A. E. Gelfand, ed.) 397-428. Academic Press, New York.
  • George, E. I. (1999). Discussion of "Bayesian model averaging and model search strategies" by M. Clyde. In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 157-185. Oxford Univ. Press.
  • George, E. I. and Foster, D. P. (1997). Calibration and empirical Bayes variable selection. Technical Report, Dept. MSIS, Univ. Texas, Austin.
  • George, E. I. and McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statist. Sinica 7 339-373.
  • Godsill, S. (1998). On the relationship between MCMC model uncertainty methods. Technical report, Univ. Cambridge.
  • Good, I. J. (1983). Good Thinking: The Foundations of Probability and Its Applications. Univ. Minnesota Press, Minneapolis.
  • Granger, C. W. J. and Newbold, P. (1976). The use of R2 to determine the appropriate transformation of regression variables. J. Econometrics 4 205-210.
  • Greenland, S. (1993). Methods for epidemiologic analyses of multiple exposures-a review and comparative study of maximum-likelihood, preliminary testing, and empirical Bayes regression. Statistics in Medicine 12 717-736.
  • Hacking, I. (1975). The Emergence of Probability. Cambridge University Press.
  • Hansen, M. H. and Kooperberg, C. (1999). Spline adaptation in extended linear models. Bell Labs Technical Report. Available at cm.bell-labs.com/who/cocteau/papers.
  • Hansen, M. H. and Yu, B. (1999). Model selection and the principle of minimum description length. Bell Labs Technical Report. Available at cm.bell-labs.com/who/cocteau/papers.
  • Hauser, R. and Kuo, H. (1998). Does the gender composition of sibships affect women's educational attainment? Journal of Human Resources 33 644-657.
  • Holmes, C. C. and Mallick, B. K. (1997). Bayesian radial basis functions of unknown dimension. Dept. Mathematics technical report, Imperial College, London.
  • Holmes, C. C. and Mallick, B. K. (1998). Perfect simulation for orthogonal model mixing. Dept. Mathematics technical report, Imperial College, London.
  • Kadane, J. B. and Dickey, J. M. (1980). Bayesian decision theory and the simplification of models. In Evaluation of Econometric Models (J. Kmenta and J. Ramsey, eds.) Academic Press, New York.
  • Key, J. T., Pericchi, L. R. and Smith, A. F. M. (1999). Bayesian model choice: what and why? (with discussion). In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 343-370. Oxford Univ. Press.
  • Lindley, D. V. and Smith, A. F. M. (1972). Bayes estimates for the linear model (with discussion). J. Roy. Statist. Soc. Ser. B 34 1-41.
  • Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA.
  • Oh, M.-S. (1999). Estimation of posterior density functions from a posterior sample. Comput. Statist. Data Anal. 29 411-427.
  • Propp, J. G. and Wilson, D. B. (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures Algorithms 9 223-252.
  • Raftery, A. E. (1996a). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83 251-266.
  • Raftery, A. E. (1996b). Hypothesis testing and model selection. In Markov Chain Monte Carlo in Practice (W. R. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) 163-188. Chapman and Hall, London.
  • Raftery, A. E. (1999). Bayes factors and BIC: Comment on "A Critique of the Bayesian information criterion for model selection." Sociological Methods and Research 27 411-427.
  • Sclove, S. L., Morris, C. N. and Radhakrishnan, R. (1972). Nonoptimality of preliminary-test estimators for the mean of a multivariate normal distribution. Ann. Math. Statist. 43 1481-1490.
  • Viallefont, V., Raftery, A. E. and Richardson, S. (1998). Variable selection and Bayesian Model Averaging in case-control studies. Technical Report 343, Dept. Statistics, Univ. Washington.
  • Wasserman, L. (1998). Asymptotic inference for mixture models using data dependent priors. Technical Report 677, Dept. Statistics, Carnegie-Mellon Univ.
  • Weakliem, D. L. (1999). A critique of the Bayesian information criterion for model selection. Sociological Methods and Research 27 359-397.
  • Western, B. (1996). Vague theory and model uncertainty in macrosociology. Sociological Methodology 26 165-192.
  • Wong, F., Hansen, M. H., Kohn, R. and Smith, M. (1997). Focused sampling and its application to nonparametric and robust regression. Bell Labs technical report. Available at cm.bell-labs.com/who/cocteau/papers.