Integrated likelihoods in models with stratum nuisance parameters

Frequentist inference about a parameter of interest in presence of a nuisance parameter can be based on an integrated likelihood function. We analyze the behaviour of inferential quantities based on such a pseudo-likelihood in a two-index asymptotics framework, in which both sample size and dimension of the nuisance parameter may di-verge to inﬁnity. We show that a properly chose integrated likelihood largely outperforms standard likelihood methods, such as the proﬁle likelihood. These results are conﬁrmed by simulation studies, in which comparisons with modiﬁed proﬁle likelihood are also considered.


Introduction
Consider stratified data y = (y 1 , . . ., y q ), where y i is a realization of an m−dimensional random variable Y i with density p i (•; ψ, λ i ).Suppose that Y 1 , . . ., Y q are independent and consider ψ as the parameter of interest, with λ = (λ 1 , . . ., λ q ) as a nuisance parameter.We assume that each λ i has the same meaning and the same parameter space, Λ.
It is well known that, in models in which the dimension of the nuisance parameter is large relative to the sample size, methods of likelihood inference, such as those based on the profile likelihood, can perform poorly.To deal with this fact, several modifications to profile likelihood have been proposed; see Barndorff-Nielsen & Cox (1994, Chapter 8) and Severini (2000, Chapters 8, 9) for general discussion of these methods and further references.An alternative solution is offered by integrated likelihood functions (Kalbfleisch & Sprott, 1970), which are formed by integrating the likelihood function with respect to a weight function for the nuisance parameter.In our setting, a reasonable form for the weight function is q i=1 g(λ i ; ψ), thus leading to an integrated likelihood While this approach is quite common in a Bayesian framework, where g(λ i ; ψ) is a conditional prior density of λ given ψ, it is less used under the frequentist paradigm.A notable exception is inference on the index parameter of a composite group family, where a marginal likelihood for the index parameter can be obtained from the distribution of the maximal invariant.Such marginal likelihood can be expressed in the form of an integrated likelihood (Pace & Salvan, 1997, Section 7.7.2).More generally, an integrated likelihood could be an attractive tool for inference since elimination of nuisance parameters is based on averaging, and this could avoid problems related to maximization that may arise with methods based on the profile likelihood, or its modifications (see, for example, Berger et al., 1999).From a frequentist point of view, inference on ψ proceeds by treating the integrated likelihood as a genuine likelihood for ψ.In particular, we will study the properties of the usual quantities such as the integrated log likelihood l I (ψ) = log L I (ψ), its maximizer ψ, and the likelihood ratio statistic, W = 2{l I ( ψ) − l I (ψ)}.When ψ is scalar, as often in the paper, also the integrated signed likelihood ratio statistic, R = sgn( ψ − ψ) W 1 2 will be considered.
In the following, we focus on frequentist properties of a special class of integrated likelihoods, as defined in Severini (2007Severini ( , 2010Severini ( , 2011)), in models with many stratum nuisance parameters.Formally, we consider stratified models in which both m, the within-stratum sample size, and q, the number of strata, which is also the number of nuisance parameters, approach infinity.
This type of "two-index asymptotics" is more relevant to cases in which the number of strata is large relative to the total sample size; see, e.g., Barndorff-Nielsen (1996) for a general discussion of two-index asymptotics and Sartori (2003) for discussion of the properties of profile and modified profile likelihoods in this setting.
A central role in the construction of the integrated likelihoods is played by the weight function g(λ i ; ψ).In a standard asymptotic setting, i.e. with q fixed, Severini (2007Severini ( , 2010) ) shows the relative insensitivity of the integrated likelihood function to the choice of the weight function when the nuisance parameter is strongly unrelated to the parameter of interest and when such weight function does not depend on ψ.In this case, indeed, the integrated likelihood behaves asymptotically as a proper likelihood for ψ.Here we investigate the properties of the integrated likelihoods in the "two-index asymptotics" setting, providing both a specific class of weight functions and a condition on m and q that guarantee the usual asymptotic behaviour of likelihood quantities based on the integrated likelihood.For instance, we show that, using one of these weight functions, R has an asymptotic standard normal distribution even in extreme settings, i.e. large q and small m, when the same results do not hold for R, the usual signed likelihood ratio statistic based on the profile likelihood.The results for the integrated likelihood are equivalent to those for the modified profile likelihood in Sartori (2003).We give however some examples in which the integrated likelihood outperforms the modified profile likelihood, thus suggesting that it may be a useful alternative.
Although the present paper does not deal with Bayesian inference, we note that the proposed integrated likelihood may be used in conjunction with a prior for ψ only in order to obtain an approximate marginal posterior distribution for ψ, as suggested for instance in Efron (1993) andVentura et al. (2009).In particular, Ventura et al. (2009) propose a valid matching prior for ψ that can be used with the modified profile likelihood for ψ.Given the connection between the integrated likelihood used here and the modified profile likelihood, the same matching prior could be used also with the inte-grated likelihood in order to obtain Bayesian inference with good frequentist properties.
The paper is organized as follows: in Section 2 we describe the target likelihood, an unavailable ideal likelihood for ψ, that we approximate using a suitable integrated likelihood; the latter is based on an appropriate choice of the weight function, which is discussed in Section 3. The asymptotic properties of the likelihood quantities based on the integrated likelihoods, such as the integrated score function, the maximum integrated likelihood estimator and the integrated likelihood ratio statistics, are studied in Section 4. Examples are presented in Section 5, in which comparisons with profile and modified profile likelihoods are considered.
We will study the asymptotic properties of integrated likelihoods by relating them to the least favourable target likelihood (Pace & Salvan, 2006).
Let E 0 denote expectation with respect to the true parameter (ψ 0 , λ 0 ), where in λ i for fixed ψ; the target log-likelihood is given by l T (ψ) = q i=1 l i (ψ, λ iψ ).The target likelihood is a function of ψ, the data, and the true parameter value (ψ 0 , λ 0 ); it is a genuine likelihood for ψ, but it is not available in practice since it depends on (ψ 0 , λ 0 ).In some sense, it is analogous to a "true value" for a likelihood for a parameter of interest.Note that likelihood quantities based on the target likelihood have the usual asymptotic properties of functions of a likelihood for the parameter ψ; for instance, the target score, l T ψ , divided by √ mq, has an asymptotic normal distribution with mean 0 and variance equal to the expected information, regardless of the rates of divergence of q and m.Thus, relating statistical procedures based on an integrated likelihood to the ones based on the target likelihood is a useful and convenient way to establish asymptotic results in complex settings such as the one considered here.The target likelihood is also closely related to the modified profile likelihood and related pseudolikelihood functions (Pace & Salvan, 2006).
Exact integration of the likelihood function is possible only in exceptional cases; in all other cases integration in practice is performed either numerically or relying on analytical approximations.In deriving the properties of procedures based on the integrated likelihood, here we follow the latter approach, exploiting the features of the Laplace approximations.Since each λ i appears only in a single stratum, the analytical approximation of l I (ψ) = log L I (ψ) can be obtained by using a Laplace approximation in each stratum and then combining the results, Expansion of the target likelihood (Pace & Salvan, 2006) yields The second summand of the right hand side of this formula can be approximated (Pace & Salvan, 2006) by while the third summand is a term of order O p (max{ q/m, q/m}).The latter, rather unconventional, expression is required by the unconventional two-index asymptotics, in which q and m may diverge at different rates (Sartori, 2003).Hence (2) Combining ( 1) and (2), log( λiψ −λ iψ ) 2 +O p (max{ q/m, q/m}). (3) This equation can be used to relate quantities based on the integrated likelihood to the corresponding quantities based on the target likelihood.
3 Target weight functions To reduce the order of the error term in this relationship, we can choose a weight function g such that log g( λψi ; under this condition, l I (ψ) = l T (ψ) + O p (max{ q/m, q/m}).We will refer to weight functions which satisfy this requirement as target weight functions.
From the expansion of the term ( λiψ −λ iψ ), using the results of Pace & Salvan (2006), it is possible to show that functions of the form fulfill (4) and therefore are example of target weight functions.Here The derivation of these functions, however, may be cumbersome.It may be easier, therefore, to rely on an orthogonal parameterization to construct an integrated likelihood satisfying l I (ψ) = l T (ψ) + O p (max{ q/m, q/m}).
If λ i is orthogonal to ψ, therefore, any weight function for λ i not depending on ψ is a target weight function, and the integrated likelihood agrees to the target likelihood up to order O p (max{ q/m, q/m}).In particular, the weight function g(λ; ψ) = 1 is useful in this context; that is, it is sufficient to integrate the likelihood with respect to λ to obtain the required integrated likelihood.
Here, the orthogonal parameter can be taken to be the informationorthogonal parameter discussed in detail by Cox & Reid (1987), and obtained by solving a differential equation based on the expected information matrix.Alternatively, it may be taken to be the zero-score expectation (ZSE) parameter used in Severini (2007), which is obtained by solving to obtain expression for φ i in terms of ψ, λ i , ψ 0 ; ψ 0 is then replaced by an estimator, such as the maximum likelihood estimator.
The integrated likelihood based on a target weight function for the ZSE parameter has the advantage that it approximately satisfies the second Bartlett identity; this is not true if the integrated likelihood is based on a target weight function for an information-orthogonal parameter (Severini, 2007).One consequence of this is that, in some cases, inferences based on an integrated likelihood using a target weight function for the ZSE parameter are preferable to inferences based on an integrated likelihood using a target weight function for the information-orthogonal parameter.On the other hand, the ZSE parameter requires a reliable estimator of ψ, which may not be available in the setting considered here.Also, for some models the information-orthogonal parameter is easier to obtain, while in other models the reverse is true.Thus, both approaches are useful in practice; this is illustrated in the examples.
4 Asymptotic properties of some integrated likelihood-based quantities + O p (max{ q/m, q/m}).
The term is in general of order O p (q).If g is a target weight function, then D(ψ) = O p (q/m) and the discrepancy between the two score functions is of order O p (max{q/m, q/m}).
Since, by definition, the target score is unbiased, this result allows us to study the bias of the integrated score.In general, E[D(ψ)] is O(q), which means that l I (ψ) has score bias of order O(q); this is the same order as the score bias of the profile likelihood (Sartori, 2003).With a target weight function, the order of the integrated score bias is O(q/m), that is the same as the order of the score bias of the modified profile likelihood (Sartori, 2003).

Maximum integrated likelihood estimator
Let ψ and ψT denote the maximizers of l I (ψ) and l T (ψ), respectively.Note that √ mq( ψT − ψ) is asymptotically normally distributed, with 0 mean and variance equal to the inverse of the partial expected information for ψ which is the lower bound for the asymptotic covariance matrix of a regular estimator of ψ when λ is unknown (Bahadur, 1964).
Recall that the maximizer ψ of a log-likelihood l(ψ) based on n observations can be expanded (e.g., Severini, 2000, Section 5.3) where Applying this expansion to ψ and ψT , we have where the symbol ¯denotes quantities based on l I (ψ), while ˜denotes quantities based on l T (ψ).
Since, in general, the following relations hold Hence, in general, we have On the other hand, for an integrated likelihood based on a target weight function, the analogous relations are Hence, It follows that, in general, ψ has the same asymptotic distribution as ψT and, hence, is asymptotically efficient, provided that q/m = o(1).When the integrated likelihood is based on a target weight function, instead, ψ has the same asymptotic distribution as ψT and it is asymptotically efficient provided that q/m 3 = o(1).

Integrated likelihood ratio statistics
We now consider likelihood-ratio-type statistics based on the integrated likelihood.Since the target likelihood is a likelihood for ψ based on a sample of size mq, the likelihood ratio statistic for ψ based on the target likelihood, W T , is asymptotically distributed according to a chi-squared distribution with p degrees-of-freedom, where p = dim(ψ).Using the relationships between derivatives of l I and derivatives of l T , along with the relationship between ψ and ψT , it is straightforward to show that where, in general, ∆ q,m = O p ( q/m) and for a target weight function, Let W denote the likelihood ratio statistic for ψ based on the integrated likelihood.It follows that, in general, and, hence, that W is in general asymptotically chi-squared-distributed, provided that q/m = o(1).For an integrated likelihood based on a target weight function, this condition is weakened to q/m 3 = o(1).
The same approach can used for signed likelihood ratio statistics for a scalar parameter ψ.Using the results above, it is easily shown that the signed integrated likelihood ratio statistic R is approximately equal to R T , the signed likelihood ratio statistic based on the target likelihood, with error ∆ q,m .Since R T is asymptotically standard normal with error 1/ √ mq it follows that, in general, R is asymptotically standard normal provided that q/m = o(1); again, if the integrated likelihood is based on a target weight function, the required condition is q/m 3 = o(1).
The target weight function used to form the integrated likelihood can be based on either the ZSE parameter or on an information-orthogonal parameter and the properties of R discussed in this section hold in either case.
However, there is a sense in which the properties of R are better if a ZSEbased integrated likelihood is used.Specifically, in this case, the standard deviation of R can be expanded as 1 + O(1/m 2 ) + O(1/(mq)) while for the information-orthogonal-based integrated likelihood, the standard deviation of R can be expanded as 1 + O(1/m) + O(1/(mq)).However, in both cases, E[ R] can be expanded as O( q/m 3 ) + O(1/ √ mq) and these terms will often be more important than the error in the standard deviation.Moreover, issues related to the ease in solving for the parameter and the necessity of finding a useful estimator of ψ to use in forming the ZSE parameter, as discussed in Section 2, typically play a more important role in choosing a parameterization.
The results above for the integrated likelihoods can be compared to those for likelihood ratio statistics based on the profile and modified profile likelihoods.For example, Sartori (2003) shows that, if q/m = o(1), then the usual signed likelihood ratio statistic is asymptotically normally distributed and that the signed likelihood ratio statistic based on the modified profile likelihood is asymptotically normal provided that q/m 3 = o(1).Hence, from a theoretical point of view, an integrated likelihood based on a target weight function guarantees the same inferential accuracy as the modified profile likelihood.On the other hand, from a practical point of view, there may be instances in which the integrated likelihood approach may be preferable, both in terms of accuracy and computational requirements, as illustrad in some of the examples below.Another example is small sample meta-analysis, as described in Bellio & Guolo (2014).
Writing s = u−m q i=1 log v i , where u = q i=1 m j=1 log y ij and v i = m j=1 y ij are the components of the sufficient statistic, the conditional and profile loglikelihoods are l C (ψ) = ψs + q log Γ(mψ) − mq log Γ(ψ), l P (ψ) = ψs + mqψ log mψ − mqψ − mq log Γ(ψ), while, if we use as a weight function the density function of an exponential random variable with mean 1, the integrated log-likelihood is The integrated likelihood with target weight function, instead, is computed by applying a constant weight function to the likelihood reparameterized with the zero-score-expectation parameter (5), which is φ i = ψλ i /ψ.In this case, we obtain an integrated likelihood that is exactly equivalent to the the conditional likelihood.It is worth noting that the conditional likelihood for the shape parameter of a gamma is also a marginal likelihood.Hence, the same result can be achieved also using as a weight function φ −1 i , the weight function related with the right invariant measure (see Pace & Salvan, 1997, Example 7.29).As a final remark, we note that the modified profile likelihood in this example is not exactly equivalent to the conditional likelihood (Sartori, 2003, Example 2).
We perform a simulation study with q = 1000 and m = 10.We chose rather extreme values of m and q in order to investigate the importance of the asymptotic condition on q/m 3 in practice.Table 1 shows the empirical coverages of several signed root likelihood ratio statistics based on 8,000 replications, where the nuisance parameters λ i have been generated from seen, it coincides with R C , the conditional signed likelihood ratio statistic, while the performance of R M P , the modified profile signed likelihood ratio statistic, is overall slightly worse due to the inability of the modified profile log-likelihood to recover l C (ψ).The same conclusions can be drawn from bias and root mean squared error of the corresponding estimators, reported in Table 2.
Example 2: matched binomial.Let us consider Y i1 and Y i2 , i = 1, . . ., q, two independent random variables with distribution Bi(m, p i1 ) and Bi(1, p i2 ) respectively.Let λ i = log{p i1 /(1 − p i1 )} be the stratum nuisance parameter and ψ = log{p i2 /(1 − p i2 )} − log{p i1 /(1 − p i1 )} be the parameter of interest, common among strata.We may deal with a model like this in case-control studies where we are interested in studying the effect of a certain factor by the comparison among one case and m controls (Sartori, 2003, Example 3).
The likelihood is e (y i1 +y i2 )λ+y i2 ψ )/{(1 + e λ ) m (1 + e ψ+λ )} q , while the conditional likelihood is a noncentral hypergeometric distribution, see, for instance, Davison (1988, Example 6.1).In order to find a target weight function, we use here an idea suggested by Cox & Reid (1993), i.e., we choose a weight function based on the original parameterization that would act like a uniform one in an orthogonal parameterization, (ψ, ξ i ).Since the model is a full exponential family, (ψ, ξ i ) might be given by the mixed parameterization.Hence, we have ∂ξ i /∂λ i = me λ i /(1 + e λ i ) 2 + e ψ+λ i /(1 + e ψ+λ i ) 2 , which leads to an integrated likelihood After a change of variable λ i (ω i ) = log{ω i /(1 − ω i )} and some algebra, we obtain where 2 F (a, b, c, z & Stegun, 1964, formula 15.3.1, page 558).We also use the procedure based on the ZSE parameterization (5).Exploiting the exponen-  prove substantially over R. The close agreement between RI and R M P can be explained by the results for exponential families in Severini (2007).The same indication can be found looking at bias and root mean squared error of the corresponding estimators in Table 4.
When the time series in each stratum are stationary, that is when we assume y i0 ∼ N {0, σ 2 /(1 − ρ 2 )}, then λ i is orthogonal to ψ = (ρ, σ 2 ), and the modified profile likelihood and the integrated likelihood proposed in this paper are equivalent and they both coincide with a marginal likelihood.The latter yields consistent estimates for ψ when q diverges, even for fixed m (Bartolucci et al., 2014, Example 1).
Here, we consider the non stationary case, which appears to be the dominant one in the econometric literature (see, for instance, Lancaster, 2002, Section 3).This means that we condition on the observed initial value y i0 and permit the autoregressive parameter to equal or exceed unity.Without loss of generality, in the following we will assume that y i0 = 0, i = 1, . . ., q.
Indeed, this corresponds to assuming model ( 10) for the differences y ij − y i0 , with λ i reparameterized as The log likelihood for model ( 10) is the sum of q independent components of the form Lancaster (2002, page 655) shows that an information orthogonal parameterization is given by ξ i = λ i exp{b(ρ)}, where The parameter ξ i is orthogonal to both ρ and σ 2 , with the latter two being orthogonal to each other.
Alternatively, we can use the ZSE parameterization (5), with φ i solution of Using again the results in Lancaster (2002), we find is the first derivative of (11).For this model computation of profile and integrated log likelihoods is straightforward since all maximization and integration involved can be easily done analytically.In particular, focusing interest on the parameter ρ, we have where SS(ρ) = q i=1 m j=1 {w ij (ρ) − wi (ρ)} 2 , with w ij (ρ) = y ij − ρy ij−1 , and wi (ρ) = m −1 m j=1 w ij (ρ).Formulae ( 13) and ( 14) are the integrated loglikelihoods with the orthogonal parameters ξ i and with the ZSE parameters φ i , respectively.In both cases we used a constant weight function for the incidental parameters and for log σ.These integrated log likelihoods could be also obtained by first integrating out the incidental parameters, thus obtaining the integrated log likelihoods for (ρ, σ 2 ), and then profiling out σ 2 .Since the maximum likelihood estimate is generally highly biased, this could have an effect on the accuracy of the integrated likelihood (14).A possible solution could be given by using in (12) alternative estimates for ρ and σ 2 in place of ρ and σ2 .One solution could be the use of a parametric bootstrap bias corrected version of ρ and σ2 .Alternatively, one could use a different estimate, such as for instance the maximizer of ( 13), or the maximizer of ( 14) itself, leading to a two-step solution.In the numerical example and in the simulations below we used the former option, thus obtaining the new ZSE parameter φ , where ρO denote the maximizer of ( 13).The corresponding integrated likelihood has the form ( 14), with ρ replaced by ρO , and will be denoted by lI (ρ).
Sometimes l O (ρ) can be monotonic increasing for large values of ρ.On the other hand, for values of m, q and ρ of practical interest, it has a local maximum for ρ ∈ (−ρ l , ρ u ), where ρ l , ρ u > 0 are threshold values that can exceed one.Lancaster (2002), developing the integrated likelihood from a Bayesian perspective, shows that such a local maximum is a consistent estimator of ρ for large q, even for fixed m.Also lI (ρ) and lI (ρ) can be monotonic increasing for large values of ρ, and this problem seems to occur "sooner" than for l O (ρ).
Moreover, the second term in the right hand side of ( 14) cannot be computed for values of ρ greater than ρ + 1/b (ρ) (which is however always greater than 1).A similar comment applies also to lI (ρ).Even in these cases, in practice, this has not proven to be a problem for maximization and inference.
As a numerical illustration, Figure 1 shows the relative log likelihoods for a simulated sample with m = 8, q = 500, ρ = 0.9, σ 2 = 1 and λ i   We also run some simulation studies comparing the empirical coverage probabilities for the signed likelihood ratio statistics based on l P (ρ), l O (ρ), lI (ρ), and lI (ρ), which are denoted by R, R O , RI and RI , respectively.Also bias and mean squared errors of the corresponding estimators have been considered.Tables 5 and 6 report the results for the same setting of Figure 1, with 10,000 simulated samples.The results indicate that lI (ρ), although largely improving over l P (ρ), is far from having the same accuracy of l O (ρ).
On the other hand, RI seems to be slightly more accurate than R O , in particular in the tails.Results in more extreme settings, such as with m = 4 and q = 1000, confirm these findings.Finally, we note that the modified profile likelihood is not straightforward to obtain in this model.A possibility is to use the approximation of Severini (1998), avoiding the sometimes cumbersome analytical calculation of required expected value by means of Monte Carlo simulation.This approach is quite general, although computationally more intensive, and has been used also by Bartolucci et al. (2014) for a dynamic regression model for binary data.
Claudia Di Caterina, in an unpublished Master Thesis of the University of Padova, proved that Severini's approximation of the mixed derivative is linear in ρ.This implies that the modified profile log likelihood does not exist for certain values of ρ, similarly to lI (ρ).Moreover, it also shares the other drawbacks of lI (ρ), i.e., it could be monotonic increasing for not very large values of ρ, and the normal approximation for the corresponding signed likelihood ratio statistic has an accuracy very close to that of RI , which is not very satisfactory for practical purposes when m is moderate and q is very large.We note that also the modified profile likelihood depends on the maximum likelihood estimates.Therefore it is likely that the use of better estimates could improve also its accuracy, as for the integrated likelihood lI (ρ), although, to our knowledge, this has not been investigated yet.

4. 1
Score functionFirst consider the relationship between the score function based on an integrated likelihood function and the score function based on the target likelihood.Using (3),

Figure 1 :
Figure1: Example 3. Relative log likelihoods for simulated data of the nonstationary autoregressive model: m = 8, q = 500, ρ = 0.9, σ 2 = 1 and λ i ∼ N (1, 1).The solid line corresponds to l P (ρ), the dashed line to l O (ρ), the dot-dashed line to lI (ρ), and the long-dashed line to lI (ρ).The vertical dotted line indicates the true parameter value, while the horizontal dotted line provides confidence intervals of level 0.95 based on the corresponding likelihood ratio statistics.The left panel shows the uncostrained plot, while the right panel shows a zoomed version in a region of interest.

Table 2 :
Example 1. Bias and Root Mean Squared Error (RMSE) for different estimators.with 10 degrees of freedom and considered fixed in each replication, while the true value of ψ is 2. The empirical coverages of R and RG are, as expected, very poor.Here, with RG we denote the integrated signed likeli- hood ratio statistic based on (6), i.e., constructed with a non-target weight function.Using a target weight function in the construction of integrated likelihood, instead, we obtain empirical coverages for the integrated signed likelihood ratio statistics, RI , very close to the nominal ones.Indeed, as

Table 3 :
Example 2. Empirical coverage of R, R C , RI ,RO , and

Table 4 :
Example 2. Bias and root mean squared error (RMSE) of different estimators.

Table 5 :
Example 3. Empirical coverage of R, R O , RI and RI .