Statistical Science

Integrated likelihood methods for eliminating nuisance parameters

James O. Berger, Brunero Liseo, and Robert L. Wolpert

Source: Statist. Sci. Volume 14, Number 1 (1999), 1-28.

Abstract

Elimination of nuisance parameters is a central problem in statistical inference and has been formally studied in virtually all approaches to inference. Perhaps the least studied approach is elimination of nuisance parameters through integration, in the sense that this is viewed as an almost incidental byproduct of Bayesian analysis and is hence not something which is deemed to require separate study. There is, however, considerable value in considering integrated likelihood on its own, especially versions arising from default or noninformative priors. In this paper, we review such common integrated likelihoods and discuss their strengths and weaknesses relative to other methods.

Keywords: Marginal likelihood; nuisance parameters; profile likelihood; reference priors

Full-text: Open access

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.ss/1009211804
Mathematical Reviews number (MathSciNet): MR1702200
Digital Object Identifier: doi:10.1214/ss/1009211804
Zentralblatt MATH identifier: 02068895

References

Aitkin, M. and Stasinopoulos, M. (1989). Likelihood analysis of a binomial sample size problem. In Contributions to Probability and Statistics (L. J. Gleser, M. D. Perlman, S. J. Press and A. Sampson, eds.) Springer, New York.
Mathematical Reviews (MathSciNet): MR90k:62054
Barnard, G. A., Jenkins, G. M. and Winsten, C. B. (1962). Likelihood inference and time series (with discussion). J. Roy. Statist. Soc. Ser. A 125 321-372.
Barndorff-Nielsen, O. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika 70 343-365.
Mathematical Reviews (MathSciNet): MR85b:62023
Zentralblatt MATH: 0532.62006
Barndorff-Nielsen, O. (1988). Parametric Statistical Models and Likelihood. Lecture Notes in Statist. 50. Springer, New York.
Mathematical Reviews (MathSciNet): MR90h:62001
Barndorff-Nielsen, O. (1991). Likelihood theory. In Statistical Theory and Modelling: In Honour of Sir D.R. Cox. Chapman and Hall, London.
Bartlett, M. (1937). Properties of sufficiency and statistical tests. Proc. Roy. Soc. London Ser. A 160 268-282.
Zentralblatt MATH: 0016.41201
Basu, D. (1975). Statistical information and likelihood (with discussion). Sankhyā Ser. A 37 1-71.
Basu, D. (1977). On the elimination of nuisance parameters. J. Amer. Statist. Assoc. 72 355-366.
Zentralblatt MATH: 0395.62003
Bayarri, M. J., DeGroot, M. H. and Kadane, J. B. (1988). What is the likelihood function? In Statistical Decision Theory and Related Topics IV (S. S. Gupta and J. O. Berger, eds.) 2 3-27. Springer, New York.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer, New York.
Berger, J. O. and Bernardo, J. M. (1989). Estimating a product of means: Bayesian analysis with reference priors. J. Amer. Statist. Assoc. 84 200-207.
Berger, J. O. and Bernardo, J. M. (1992). Ordered group reference priors with applications to a multinomial problem. Biometrika 79 25-37.
Berger, J. O. and Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist 76 159-165.
Berger, J. O., Philippe, A. and Robert, C. (1998). Estimation of quadratic functions: noninformative priors for non-centrality parameters. Statist. Sinica 8 359-376.
Berger, J. O. and Strawderman, W. (1996). Choice of hierarchical priors: admissibility in estimation of normal means. Ann. Statist. 24 931-951.
Berger, J. O. and Wolpert, R. L. (1988). The Likelihood Principle: A Review, Generalizations, and Statistical Implications, 2nd ed. IMS, Hayward, CA.
Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference (with discussion). J. Roy. Statist. Soc. Ser. B 41 113-147.
Mathematical Reviews (MathSciNet): MR84i:62005
Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Wiley, New York.
Mathematical Reviews (MathSciNet): MR96a:62006
Bjørnstad, J. (1996). On the generalization of the likelihood function and the likelihood principle. J. Amer. Statist. Assoc. 91 791-806.
Zentralblatt MATH: 0871.62006
Butler, R. W. (1988). A likely answer to "What is the likelihood function?" In Statistical Decision Theory and Related Topics IV (S. S. Gupta and J. O. Berger, eds.) 2 21-26. Springer, New York.
Carroll, R. J. and Lombard, F. (1985). A note on N estimators for the Binomial distribution. J. Amer. Statist. Assoc. 80 423-426.
Mathematical Reviews (MathSciNet): MR792743
Chang, T. and Eaves, D. (1990). Reference priors for the orbit in a group model. Ann. Statist. 18 1595-1614.
Zentralblatt MATH: 0725.62003
Cox, D. R. (1975). Partial likelihood. Biometrika 62 269-276.
Zentralblatt MATH: 0312.62002
Cox, D. R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference (with discussion). J. Roy. Statist. Soc. Ser. B 49 1-39.
Zentralblatt MATH: 0616.62006
Cruddas, A. M., Reid, N. and Cox, D. R. (1989). A time series illustration of approximate conditional likelihood. Biometrika 76 231.
Mathematical Reviews (MathSciNet): MR90j:62138
Zentralblatt MATH: 0666.62090
Datta, G. S. and Ghosh, J. K. (1995a). Noninformative priors for maximal invariant parameter in group models. Test 4 95-114.
Datta, G. S. and Ghosh, J. K. (1995b). On priors providing frequentist validity for Bayesian inference. Biometrika 82 37-45.
Datta, G. S. and Ghosh, J. K. (1996). On the invariance of noninformative priors. Ann. Statist. 24 141-159.
Mathematical Reviews (MathSciNet): MR97b:62004
Dawid, A. P., Stone, M. and Zidek, J. V. (1973). Marginalization paradoxes in Bayesian and structural inference. J. Roy. Statist. Soc. Ser. B 35 180-233.
Mathematical Reviews (MathSciNet): MR51:2057
de Alba, E. and Mendoza, M. (1996). A discrete model for Bayesian forecasting with stable seasonal patterns. In Advances in Econometrics II (R. Carter Hill, ed.) 267-281. JAI Press.
Draper, N. and Guttman, I. (1971). Bayesian estimation of the binomial parameter. Technometrics 13 667-673.
Zentralblatt MATH: 0223.62042
Eaton, M. L. (1989). Group Invariance Applications in Statistics. IMS, Hayward, CA.
Mathematical Reviews (MathSciNet): MR92a:62002
Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10 507.
Fisher, R. A. (1921). On the "probable error" of a coefficient of correlation deduced from a small sample. Metron 1 3-32.
Fisher, R. A. (1935). The fiducial argument in statistical inference. Ann. Eugenics 6 391-398.
Fraser, D. A. S. and Reid, N. (1989). Adjustments to profile likelihood. Biometrika 76 477-488.
Mathematical Reviews (MathSciNet): MR91f:62003
Zentralblatt MATH: 0678.62008
Ghosh, J. K., ed. (1988). Statistical Information and Likelihood. A Collection of Critical Essays by D. Basu. Springer, New York.
Ghosh, J. K. and Mukerjee, R. (1992). Noninformative priors. In Bayesian Statistics 4 (J. O. Berger, J. M. Bernardo, A. P. Dawid and A. F. M. Smith, eds.) 195-203. Oxford Univ. Press.
Gleser, L. and Hwang, J. T. (1987). The nonexistence of 100(1 − α)% confidence sets of finite expected diameter in errors-in-variables and related models. Ann. Statist. 15 1351-1362.
Mathematical Reviews (MathSciNet): MR88k:62058
Good, I. J. (1983). Good Thinking: The Foundations of Probability and Its Applications. Univ. Minnesota Press.
Hui, S. and Berger, J. O. (1983). Empirical Bayes estimation of rates in longitudinal studies. J. Amer. Statist. Assoc. 78 753-760.
Zentralblatt MATH: 0532.62086
Jeffreys, H. (1961). Theory of Probability. Oxford Univ. Press.
Mathematical Reviews (MathSciNet): MR32:4710
Kahn, W. D. (1987). A cautionary note for Bayesian estimation of the binomial parameter n. Amer. Statist. 41 38-39.
Mathematical Reviews (MathSciNet): MR882767
Kalbfleisch, J. D. and Sprott, D. A. (1970). Application of likelihood methods to models involving large numbers of parameters. J. Roy. Statist. Soc. Ser. B 32 175-208.
Mathematical Reviews (MathSciNet): MR42:5362
Kalbfleisch, J. D. and Sprott, D. A. (1974). Marginal and conditional likelihood. Sankhyā Ser. A 35 311-328.
Laplace, P. S. (1812). Théorie Analytique des Probabilités. Courcier, Paris.
Lavine, M. and Wasserman, L. A. (1992). Can we estimate N? Technical Report 546, Dept. Statistics, Carnegie Mellon Univ.
Liseo, B. (1993). Elimination of nuisance parameters with reference priors. Biometrika 80 295-304.
Mathematical Reviews (MathSciNet): MR95j:62019
Zentralblatt MATH: 0778.62025
McCullagh, P. and Tibshirani, R. (1990). A simple method for the adjustment of profile likelihoods. J. Roy. Statist. Soc. Ser. B 52 325-344.
Mathematical Reviews (MathSciNet): MR91f:62006
Moreno, E. and Girón, F. Y. (1995). Estimating with incomplete count data: a Bayesian approach. Technical report, Univ. Granada, Spain.
Neyman, J. and Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica 16 1-32.
Mathematical Reviews (MathSciNet): MR9,600d
Zentralblatt MATH: 0034.07602
Olkin, I., Petkau, A. J. and Zidek, J. V. (1981). A comparison of n estimators for the binomial distribution. J. Amer. Statist. Assoc. 76 637-642.
Mathematical Reviews (MathSciNet): MR82i:62050
Raftery, A. E. (1988). Inference for the binomial N parameter: a hierarchical Bayes approach. Biometrika 75 223-228.
Reid, N. (1995). The roles of conditioning in inference. Statist. Sci. 10 138-157.
Zentralblatt MATH: 0955.62524
Reid, N. (1996). Likelihood and Bayesian approximation methods. In Bayesian Statistics 5 (J. O. Berger, J. M. Bernardo, A. P. Dawid and A. F. M. Smith, eds.) 351-369. Oxford Univ. Press.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Ann. Statist. 11 416-431.
Mathematical Reviews (MathSciNet): MR85i:62002
Savage, L. J. (1976). On rereading R. A. Fisher. Ann. Statist. 4 441-500.
Mathematical Reviews (MathSciNet): MR53:7698
Zentralblatt MATH: 0325.62008
Sun, D. (1994). Integrable expansions for posterior distributions for a two parameter exponential family. Ann. Statist. 22 1808-1830.
Zentralblatt MATH: 0828.62071
Sun, D. and Berger, J. O. (1998). Reference priors with partial information. Biometrika 85 55-71.
Zentralblatt MATH: 01192362
Sweeting, T. (1995a). A framework for Bayesian and likelihood approximations in statistics. Biometrika 82 1-24.
Sweeting, T. (1995b). A Bayesian approach to approximate conditional inference. Biometrika 82 25-36.
Sweeting, T. (1996). Approximate Bayesian computation based on signed roots of log-density ratios. In Bayesian Statistics 5 (J. O. Berger, J. M. Bernardo, A. P. Dawid and A. F. M. Smith, eds.) 427-444. Oxford Univ. Press.
Mathematical Reviews (MathSciNet): MR97g:62054
Ye, K. and Berger, J. O. (1991). Non-informative priors for inference in exponential regression models. Biometrika 78 645-656.
Zabell, S. L. (1989). R.A. Fisher on the history of inverse probability. Statist. Sci. 4 247-263.
Zentralblatt MATH: 0955.01501
likelihood introduced by Efron (1993). As noted by Berger, Liseo and Wolpert, in non-Bayesian inference one cannot argue for integrated likelihood solely on grounds of rationality and coherency. However, one can at least say that, as a likelihood method, integrated likelihood satisfies the likelihood principle and is a proper likelihood. That is, if two likelihood functions for $(\theta, \lambda)$ are proportional, then so are the corresponding integrated likelihoods for $\theta$. The main inferential issue in eliminating the nuisance parameter $\lambda$ from the likelihood function is to take into account the uncertainty in $\lambda$. It is well known that replacing $\lambda$ by an estimate $\hat\lambda$, or even by the conditional estimate $\hat\lambda_\theta$ (giving us the profile likelihood), ignores the uncertainty in $\lambda$. This can be especially serious if the dimension of $\lambda$ is large. The resulting likelihood for $\theta$ can then be much too accurate, giving the impression that we have more information about $\theta$ than is warranted. I think that the single most important reason for using an integrated likelihood is, as emphasized in the paper, that this partial likelihood automatically and naturally takes into account parameter uncertainty in $\lambda$. A central theme in the paper is the comparison of the operations of integration and maximization. One of the main messages I read from the paper is that any reasonable integrated likelihood will typically outperform the profile likelihood. It seems quite clear that integration is a safer operation than maximization, so if it is obvious what kind of noninformative prior $\pi(\lambda \mid \theta)$ to use, integration would clearly be preferable. In fact, it seems to me that maybe the best thing is to do what Laplace suggested, choose parametrizations of the nuisance
likelihood by Harris (1989). Instead of integrating $L(\theta, \lambda)$ with respect to a conditional noninformative prior $\pi(\lambda \mid \theta)$, one can use a data-based weight function for $\lambda$, for example the distribution of the MLE at ,
1996). Thus, in the case that h = c in (24), and $\det I_{22}$ are equal up to first order and thus the integrated and profile likelihoods are the same up to first order. Hence for the reference priors used here, one expects similar profile and integrated likelihoods with large samples and regular models. Appropriately, the examples considered involve small samples or irregular models. The comments here will be restricted primarily to Examples 3, 4 and 5, with some additional comments about likelihood-based methods in models with large numbers of nuisance parameters. Example 3 illustrates a general situation where profile likelihood methods can be expected to perform poorly. The problem with using profile
efficiency issues see also Lindsay, 1980). As a different example of the use of random effects integrated likelihoods, consider Example 5. The problems associated with the use of profile likelihood methods appear to be due to the relatively large number of nuisance parameters. A random effects assumption seems natural here given the parameter of interest. Consider the same random effects model as in Example 2: the $\mu_i$ are i.i.d. $N(\xi, \tau^2)$. In this case, $\lambda = (\xi, \tau^2)$ and $\theta$ is a random variable ( in the notation of Section 1.3.1). The authors' recommendation of (6) as a likelihood in this setting is closely connected with the discussion about empirical Bayes methods in Section 3.3. If (6) is taken as the likelihood, the profile likelihood for $\theta = \sum_i \mu_i^2/p$ would be obtained by substituting an estimate $\hat\lambda$ into (6). It is more common to substitute an estimate of $\lambda$ based upon the likelihood for $\lambda$: the function of $\lambda$ obtained by integrating over $\theta$ in (6). In this case the estimate for $\lambda$ is $\hat\xi = \bar{x}$, $\hat\tau^2 = \max(0, s^2 - 1)$. Substituting this estimate of $\lambda$ into (6) gives an integrated likelihood that is proportional to an estimate of the conditional density of $\theta$ given the observed data. Since the $\mu_i \mid x$ are independently $N(m_i, v)$, where $v = \hat\tau^2/(1 + \hat\tau^2)$ and $m_i = v x_i + (1 - v)\bar{x}$, the estimate of the conditional distribution for $p\theta/v$ is a noncentral $\chi^2$ distribution with $p$ degrees of freedom and noncentrality parameter $v^{-1} \sum_i m_i^2/2$. The additional random effects assumption is used in the above derivation but, even with the original assumption that the $\mu_i$ are a fixed sequence, assuming that $\theta$ converges to a limit, one can show that this conditional distribution converges to a point mass at this limit. Statistical models with a large number of nuisance parameters arise frequently, and profile likelihoods for these models often give unreasonable parameter of interest inferences. Often a natural alternative to a model with a large number of nuisance parameters is a random effects or mixture model.
The analysis arising from the use of a random effects model can be viewed as an integrated likelihood method and frequently gives reasonable parameter of interest inferences that are robust to the assumption of randomness about the nuisance parameter. In many other statistical models, profile and integrated likelihoods tend to be similar. However, in models with badly behaved likelihood contributions, as in Example 3, profile likelihoods can give unreasonable parameter of interest inferences where integrated likelihoods give reasonable solutions. In models with a large amount of uncertainty about the nuisance parameter, as in Example 4, integrated likelihoods can give misleading results. Thus the difference between the results of the two methods is informative in itself, which suggests that in important and complex problems it might be of value to integrate the two as a form of sensitivity analysis.
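The empirical Bayes steps described above for Example 5 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the discussant's or the authors' code: it assumes the model $x_i \sim N(\mu_i, 1)$ with the $\mu_i$ i.i.d. $N(\xi, \tau^2)$, takes $s^2$ to be the sample variance with divisor $p$ (the text does not state the divisor), and uses hypothetical names (`fit_random_effects`, `sample_theta`). The noncentrality is stored as $\sum_i m_i^2/v$; the text's value carries an extra factor of 1/2, which may reflect a different convention.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EBFit:
    """Plug-in quantities for the estimated conditional law of theta given x."""
    xi_hat: float         # estimate of xi (the sample mean)
    tau2_hat: float       # estimate of tau^2, truncated at zero
    v: float              # conditional variance of each mu_i given x
    m: List[float]        # conditional means m_i = v*x_i + (1 - v)*xbar
    p: int                # number of observations (degrees of freedom)
    noncentrality: float  # sum(m_i^2)/v (assumed convention); nan when v == 0


def fit_random_effects(x: Sequence[float]) -> EBFit:
    """Estimate (xi, tau^2) from the marginal model x_i ~ N(xi, 1 + tau^2)."""
    p = len(x)
    xbar = sum(x) / p
    s2 = sum((xi - xbar) ** 2 for xi in x) / p  # assumption: divisor p
    tau2_hat = max(0.0, s2 - 1.0)
    v = tau2_hat / (1.0 + tau2_hat)
    m = [v * xi + (1.0 - v) * xbar for xi in x]
    nc = sum(mi ** 2 for mi in m) / v if v > 0 else float("nan")
    return EBFit(xbar, tau2_hat, v, m, p, nc)


def conditional_mean_theta(fit: EBFit) -> float:
    """E[theta | x] for theta = sum(mu_i^2)/p with mu_i | x ~ N(m_i, v)."""
    return fit.v + sum(mi ** 2 for mi in fit.m) / fit.p


def sample_theta(fit: EBFit, n_draws: int, seed: int = 0) -> List[float]:
    """Draw theta from the estimated conditional law by simulating the mu_i."""
    rng = random.Random(seed)
    sd = fit.v ** 0.5
    return [
        sum(rng.gauss(mi, sd) ** 2 for mi in fit.m) / fit.p
        for _ in range(n_draws)
    ]


if __name__ == "__main__":
    data = [3.0, -1.0, 2.0, 0.0, 4.0, -2.0]  # made-up illustrative data
    fit = fit_random_effects(data)
    draws = sample_theta(fit, 20000)
    print("tau2_hat:", fit.tau2_hat, "v:", fit.v)
    print("analytic mean:", conditional_mean_theta(fit))
    print("simulated mean:", sum(draws) / len(draws))
```

Simulating the $\mu_i$ directly, rather than calling a noncentral chi-square routine, sidesteps the question of which noncentrality convention applies. When $s^2 \le 1$ the sketch gives $v = 0$ and every draw equals $\bar{x}^2$, a degenerate point mass.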
(1989). We are somewhat concerned with the apparent double use of the data in the definitions of Lest and L est (using the data once in the likelihood function and again in the weight function for integration). Indeed, these weighted likelihoods are somewhat hard to compute. We were able to compute L est for Example 3, and it turned out to be the same as the (inadequate) profile likelihood, which does not instill confidence in the method.
Efron, B. (1993). Bayes and likelihood calculations from confidence intervals. Biometrika 80 3-26.
Mathematical Reviews (MathSciNet): MR94m:62082
Zentralblatt MATH: 0773.62021
Harris, I. (1989). Predictive fit for natural exponential families. Biometrika 76 675-684.
Zentralblatt MATH: 0679.62021
Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist. 27 887-906.
Mathematical Reviews (MathSciNet): MR19,189a
Leonard, T. (1982). Comment on "A simple predictive density function" by M. Lejeune and G. D. Faulkenberry. J. Amer. Statist. Assoc. 77 657-658.
Mathematical Reviews (MathSciNet): MR84f:62046
Liseo, B. and Sun, D. (1998). A general method of comparison for likelihoods. ISDS discussion paper, Duke Univ.
Tierney, L. J. and Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc. 81 82-86.
Mathematical Reviews (MathSciNet): MR830567

2009 © Institute of Mathematical Statistics