A unifying approach to the estimation of the conditional Akaike information in generalized linear mixed models

The conditional Akaike information criterion, AIC, has been frequently used for model selection in linear mixed models. We develop a general framework for the calculation of the conditional AIC for different exponential family distributions. This unified framework incorporates the conditional AIC for the Gaussian case, gives a new justification for Poisson distributed data and yields a new conditional AIC for exponentially distributed responses but cannot be applied to the binomial and gamma distributions. The proposed conditional Akaike information criteria are unbiased for finite samples, do not rely on a particular estimation method and do not assume that the variance-covariance matrix of the random effects is known. The theoretical results are investigated in a simulation study. The practical use of the method is illustrated by application to a data set on tree growth. AMS 2000 subject classifications: Primary 62J12; secondary 62J07.


Introduction
Generalized linear mixed models (Breslow & Clayton (1993)), GLMMs, provide a broad range of models for the analysis of grouped data.They extend the idea of linear mixed models to non-normal data.In recent years, GLMMs have also been used as a representation of generalized additive models (e.g.Ruppert et al. (2003)).This increase in flexibility and complexity leads to extended need for model selection.
The Akaike information criterion, AIC (Akaike (1973)), is a well known information based criterion for model selection.There have been several extensions to the AIC.For example in the case of small sample size or highly overparameterized models Hurvich & Tsai (1989) proposed a corrected criterion called AIC C .In linear mixed models, a natural choice would be to base the AIC on the marginal model, i.e. the model with the random effects integrated out.This leads to a biased criterion (Greven & Kneib (2010)).An AIC based on the conditional likelihood was introduced by Vaida & Blanchard (2005) but was derived assuming the variance parameters of the random effects to be known.Plugging in estimated variance-covariance matrices induces a bias that leads to a preference for larger models with more random effects (Greven & Kneib (2010)).
A correction to avoid that bias was proposed by Liang et al. (2008) by use of an identity known from Stein (1972).
An extension of the conditional AIC to GLMMs has for example been proposed by Yu & Yau (2012).They suggested an asymptotically unbiased conditional AIC, where the estimation of the variance parameters of the random effects is based on maximum likelihood and that of the fixed and random effects on maximizing the joint likelihood.Another conditional AIC was proposed by Donohue et al. (2011).It is also asymptotically unbiased and in addition requires that the covariance structure of the random effects is known.In this report, we suggest a method for deriving unbiased estimates of the conditional Akaike information for exponential family distributions even if the sample size is finite and the covariance structure of the random effects is unknown.This unified framework for the conditional AIC in GLMMs contains the known estimators for the normal and Poisson distribution as special cases and provides a more general derivation for the Poisson case than previously given (Lian (2011)), which highlights the connection to the normal case.We also extend this idea to the exponential distribution.In addition to the theoretical results, we illustrate the performance of the new estimator in a simulation study and in an application to tree growth data.Proofs of new results are given in the appendix.

Generalized linear mixed models
Consider a GLMM with predictor η = Xβ + Zu with the full column rank (n × p) and (n × r) design matrices X and Z, the fixed effects β and random effects u.The random effects are assumed to be normally distributed, i.e. u ∼ N (0, G(ϑ)), where ϑ contains all q variance parameters in the covariance matrix G.The responses y 1 , . . ., y n have conditional expectation with response function h(•).Moreover the responses conditioned on the random effects u follow an exponential family distribution, i.e. the conditional density of y i is given by log where b(•) only depends on θ, c(•) only on y i and φ, φ is a scale parameter, and θ is the canonical parameter of the distribution as in the generalized linear model context (Nelder & Wedderburn (1972)).In the marginal density, the random effects are integrated out where f (u|ϑ) is the density of the random effects.In the following, we denote by β, θ and û estimators of β, θ and u, respectively, e.g. the maximum likelihood estimator, the restricted maximum likelihood estimator and the empirical Bayes estimator.If we want to emphasize the dependence on the data y, we write β(y) and so forth.

Akaike information criterion
The Akaike information is defined as twice the expected relative Kullback-Leibler distance −2E y (E z (log f (z|γ(y)))), with independent replications z and y from the underlying model and parameter vector γ.In standard regression settings, if certain regularity conditions are fulfilled, the Akaike information criterion with ν = dim(γ) is an asymptotically unbiased estimator for the Akaike information.A direct extension of the AIC to GLMMs based on the marginal model would be the marginal AIC, where f (y| β, θ) is the maximized marginal likelihood.If the dispersion parameter φ is estimated, the bias correction in (2.3) changes to 2(p + q + 1).Using the marginal model implies that the focus is on the fixed effects and that new data z does not share the random effects of y.However, the marginal AIC may be inappropriate for variable selection in linear mixed effect models if the focus is on clusters rather than on the population, as stated in Vaida & Blanchard (2005).Even under the marginal model it is not an (asymptotically) unbiased estimator of the Akaike information as shown for the Gaussian case by Greven & Kneib (2010).
Use of the conditional model formulation focuses on the random effects and implies shared random effects between y and z.The conditional Akaike information is where g(y, u) = g(y|u)g(u) is the (true) joint density of y and u (Vaida & Blanchard (2005)).For (conditionally) Gaussian responses and known random effects variance parameters ϑ they show that an asymptotically unbiased estimator of the conditional Akaike information is are the effective degrees of freedom (Hodges & Sargent (2001)).Liang et al. (2008) introduced a bias correction that takes the estimation uncertainty of ϑ into account.For known error variance σ 2 they replace 2ρ by (2.4) They propose a similar but lengthy formula for unknown error variance.Following the findings of Greven & Kneib (2010), the estimation uncertainty of the error variance can be ignored.

Bias correction
For GLMMs with responses following an exponential family distribution as in (2.1) and unknown random effects variance parameters ϑ, we derive the following bias correction.
Proposition 2.1.In GLMMs with responses following an exponential family distribution and unknown ϑ, the bias correction for −2 log f (y| β, û) as an estimator of cAI is If φ is estimated, φ in the first expression is replaced by φ.A proof for this result is given in the Technical details section.

Continuous distributions
The proposed bias correction in (2.5) suffers from the use of the true but unknown mean µ and therefore cannot be applied directly.Liang et al. (2008) solved this problem by the use of a formula known from Stein (1972) which turns (2.5) into (2.4).The following result extends the idea of Stein to continuous exponential family distributions and is a slight modification of Hudson (1978).
Theorem 3.1.Let y be continuous and have density given by (2.1).For a differentiable function m : R → R that vanishes on the limits of the support of y if the limits of the support are finite and If y is Gaussian, formula (3.1) simplifies to the formula known from Stein.Applied to the bias correction (2.5) this yields the bias correction 2Φ 0 known from Liang et al. (2008).The theorem can also be applied to obtain bias corrections for other exponential family distributions as stated in the following.For y exponentially distributed with mean µ, y ∼ E( 1 µ ), and letting m(y) = (3.2) We use this equation to derive an analytically accessible version of (2.5).
Corollary 3.1.Let y i |u ∼ E( 1 µi ).Then an unbiased estimator of the cAI is where y −i is the vector of observed responses without the i−th observation and hence θi (y −i , x) is the estimator based on (y 1 , . . ., y i−1 , x, y i+1 , . . ., y n ).
A proof of this result is outlined in the appendix.In Section 5 numerical integration is used to evaluate 3.3.

Discrete distributions
A similar identity to Theorem 3.1 also holds for discrete random variables from an exponential family distribution.The following Theorem is also due to Hudson (1978).
Corollary 3.2.Let y i |u ∼ P(λ i ).Then an unbiased estimator of the cAI is where y −i is the vector of observed responses without the i−th observation and y i is the i−th observation with y i θi (y −i , y i − 1) = 0 if y i = 0 by convention.
Corollary 3.2 gives an alternative derivation of the result in Lian (2011), which highlights the connection to the normal case.

Limits of the approach
Theorem 3.1 and Theorem 3.2 can be extended to further distributions.For instance the generalized SURE formula (Lemma 2) in Shen & Huang (2006) is a generalisation of Theorem 3.1 and Theorem 3.2 to distributions not necessarily from the exponential family.Although the formula has been obtained in a different context, it is closely related to the findings in Section 3 and gives further insight on how identities for further distributions could potentially be derived.On the other hand, formulas as in Theorems 3.1 and 3.2 do not necessarily lead to bias correction terms computable from observable quantities for all distributions, as discussed in the following.

Continuous distributions
For example if y follows a gamma distribution with mean µ and scale parameter ν, i.e. y ∼ G(µ, ν) identity (3.1) is This can be rewritten in terms of µ In contrast to the ν = 1 case, this identity cannot be used to remove the true but unknown parameter µ i in the bias correction term (2.5) unless we could rewrite the estimator of the canonical parameter in (2.5) by θi (y for a function m(•) fulfilling the requirements in Theorem 3.1.Since this seems to be not possible, Theorem 3.1 cannot be used to rewrite the bias correction term (2.5) for a gamma distribution with ν = 1.

Discrete distributions
Similarly, applying Theorem 3.2 to the negative binomial distribution where y has the probability function with m(y − 1) = 0 for y = 0.In terms of the mean µ, the identity above is The second part of the bias correction (2.5), i.e. µ i E y,u ( θi (y)) could therefore only be replaced if the estimator for the canonical parameter θi (•) can be written as θi for some arbitrary function m(•) as in Theorem 3.2.This is not possible.Theorem 3.2 cannot be applied to the binomial distribution B(n, p) since a binomially distributed random variable only takes values in {0, 1, . . ., n} ⊂ N 0 .Extending the distribution by defining P (y = n + k) = 0 ∀k ∈ N does not yield an identity which could be applied to the bias correction (2.5), for the same reason as in the case of the negative binomial distribution.

Simulation study
In the first part of this simulation study, we concentrate on random intercept models.The bias corrections (3.3) and (3.6) are analysed in two different ways.First we compare the precision and the variability of different bias corrections as estimators of the correction term obtained by estimating the relative Kullback-Leibler distance with the log-likelihood.In a second step, the model choice behaviour of the bias correction for exponential responses (3.3) and Poisson distributed responses (3.6) is assessed.
The second part of the simulation study is concerned with the model choice behaviour of the proposed estimators for smoothing spline models.We therefore use a common link between mixed-effects models and smoothing spline models.

Exponential distribution
First we will focus on the precision and the variability of the proposed bias correction (3.3).We therefore consider a model with an exponentially distributed response y ij and a random intercept u i with where u i ∼ N (0, τ 2 ), β 0 = 0.1, β 1 = 0.2 and x j = j.Different numbers of clusters, cluster sizes and random effect variances are considered: m = 5, 10, n i = 5, 10 for i = 1, . . ., m and the random effect variances are τ 2 = 0, 0.5, 1.For each of the settings, 1,000 data sets are generated and the mean and the standard deviations of the different bias correction terms are calculated.The model is fitted by the PQL method as introduced by Breslow & Clayton (1993).We use an implementation in R based on Wood (2006).
We compare the proposed estimator for the bias correction Ψ obtained from refitting the model for each i with the true bias BC defined by (2.5), the asymptotically unbiased estimator ρml proposed by Yu & Yau (2012) and the estimator ρDon of Donohue et al. (2011).The true bias correction BC is derived by averaging 30,000 samples of (2.5) based on model (5.1).This criterion used as a benchmark is not available in practice since for its calculation the true mean µ has to be known.
For the proposed bias correction Ψ as in (3.3), an integral needs to be evaluated.Since this can not be done analytically it is approximated by adaptive quadrature.The resulting bias correction is used to obtain the proposed cAIC.
The cAIC suggested by Yu & Yau (2012) is included to assess the performance of an asymptotically unbiased estimator of the cAI in finite sample settings.Similarly to the cAIC suggested by Vaida & Blanchard (2005) for Gaussian responses, the cAIC proposed by Donohue et al. (2011) requires known random effects variance parameters.For known random effects variance parameters, the criterion is consistent.In our simulated random intercept model, τ 2 would need to be known.Since in many applications this will not be the case, we use the proposed bias correction of Donohue et al. (2011) with the estimated variance parameter τ 2 taken as truth.
In the calculation of ρml , the bias correction proposed by Yu & Yau (2012), numerical difficulties occurred.We therefore excluded all results in which the bias correction exceeded a threshold of 200.This excluded between 0 and 5 observations per setting.
Table 1 Mean estimated values of four different estimators of the bias correction (2.5) and the corresponding standard deviations (indicated by σ with corresponding index) of model (5.1) for different cluster sizes (n i ), number of clusters (m) and variances of random effects (τ 2 ).
The true bias correction BC is derived by (2.5), the estimator Ψ is directly calculated by (3.3), ρml is the estimator proposed by Yu & Yau (2012)  Table 1 shows the results.They suggest, that the proposed estimator performs well although numerical integration was used.The estimator ρml has the tendency to underestimate the true bias correction for positive true τ 2 and to overestimate it for true τ 2 = 0.This may be due to the fact that a non-canonical link function was used, while the authors derive their estimator only for canonical links.Furthermore the authors do not use PQL as fitting method, see 5.3 for a short remark.
The estimator ρDon consistently underestimates BC, as it ignores variability due to the estimation of the variance components.The last four columns give the standard deviations of each estimator.The standard deviation of the proposed estimator is low, which also speaks in favour of the estimator.The standard deviation of ρml is very high especially for low random effects variance, despite the exclusion of extreme values.
We now consider the behaviour of the proposed bias correction (3.3) when selecting random effects.Therefore consider the same settings as in model (5.1) but with the random effect variances as τ 2 = 0, 0.1, 0.2, . . ., 1.8, respectively.For each of the settings, 1000 data sets are generated and one model containing a random intercept (τ 2 ≥ 0) and another (generalized linear) model without random effects are fitted to each data set.The random effects model is fitted by PQL, see Breslow & Clayton (1993) and Wood (2006).
We compute the frequency of selecting the model including the random intercept (τ 2 > 0), which is chosen whenever the proposed AIC is smaller than an AIC, derived from the model without a random intercept (τ 2 = 0).As reference AICs for the model without random intercept we use (2.2) for the marginal AIC, Donohue's cAIC and Yu & Yau's cAIC.For the proposed cAIC we use formula (3.3) with a generalized linear model as reference.Thus, for each AIC we use as a reference the AIC it reduces to in the null model without intercept.The marginal AIC as defined in (2.3) requires the marginal log-likelihood, which is obtained by Laplace approximation.The results for different settings and AICs are displayed in Figure 1.
The mAIC behaves similarly to the mAIC with Gaussian responses as investigated in Greven & Kneib (2010).For small τ 2 the mAIC never chooses the model including the random effects.When the sample size increases, a preference for the smaller model remains.The other AICs select the more complex model in a higher proportion of cases.Both the proposed AIC and Yu and Yau's proposal exhibit increasing sensitivity as well as specificity as sample size increases, with the asymptotic criterion showing a stronger preference for larger models when the variance is zero or small.The estimator suggested by Donohue et al. (2011) shows a behaviour similar to the cAIC of Vaida & Blanchard (2005), observed by Greven & Kneib (2010): It chooses the model including the random effects far more often than the other criteria do.This might have been expected, since similarly to the cAIC by Vaida & Blanchard (2005), this criterion needs the variance-covariance matrices of the random effects to be known and using a plug-in estimator introduces a bias.

Poisson distribution
Investigating the precision and variability of the bias correction (3.6), we consider a random intercept model with Poisson distributed responses and subject where u i ∼ N (0, τ 2 ), β 0 = 0.1, β 1 = 0.2 and x j = j.Different numbers of clusters, cluster sizes and random effect variances are considered: m = 5, 10, n i = 5, 10 for i = 1, . . ., m and the random effect variances are τ 2 = 0, 0.3, 0.6, respectively.The differing values of τ 2 , compared to the model with exponentially distributed responses, are chosen due to the changed signal-to-noise ratio.
We generate 1,000 data sets for each setting and calculate the mean and the standard deviations of the different bias corrections.The true bias correction is derived the same way as for the exponential responses.
The results are shown in Table 2.The proposed estimator Ψ combines high precision with low variance.Compared to the estimates with exponentially distributed responses, ρml performs well although it shows a tendency towards overestimation and has high variances especially for a larger number of small clusters.The estimator ρDon underestimates the true bias correction as it did in the previous setting.
As for the simulation study with exponentially distributed responses, we also assess the model choice behaviour of bias correction (3.6).The settings are the same as in model (5.2) except the random effects variance, that is τ 2 = 0, . . ., 0.8.Then 1000 data sets for each setting are generated.Two models are fit to the data, one model containing a random intercept (τ 2 ≥ 0) and another model without random effects (τ 2 = 0).The frequency of selecting the more complex model, including the random effects is computed for different AICs.Just as for the exponential responses, PQL was used as model fitting method.The different proposed AICs are the same as in the exponential model (5.1): 1) the proposed bias correction Ψ as in (3.6); 2) the cAIC suggested by Yu & Yau (2012); 3) the cAIC proposed by Donohue et al. (2011), with the estimated variance parameter τ 2 plugged in as true τ 2 ; 4) the marginal AIC as defined in (2.3), which is obtained by Laplace approximation.
The results are displayed in Figure 2.They are similar to the results observed for exponential responses.The marginal AIC chooses the model including the random effects only very rarely even for random effects variances larger than zero.On the other hand, the AIC proposed by Donohue et al. (2011) chooses to include random effects very often, even if the model was simulated without random effects.The proposed criterion and Yu and Yau's asymptotic critierion behave similar, with a stronger preference for the larger model when the variance is zero or small for Yu and Yau's AIC.The asymptotically unbiased criterion proposed by Yu & Yau (2012) behaves as expected.For larger cluster sizes and increasing number of clusters the model choice behaviour gets better.

Penalized spline smoothing
It is well known that penalized spline models have a mixed model representation, see for example Wood (2006) and Ruppert et al. (2003).In this part of the simulation study, we assess the performance of different criteria for model selection in penalized spline models using the mixed model representation.
We investigate models where the mean µ is linked to a smooth function m(•): are generated, and a linear and a nonlinear model are fitted to the data.The frequency of selecting the more complex, nonlinear model for each criterion is computed.
Figure 3   bias correction (3.3) shows similar behaviour to the other settings.For increasing sample size, Yu & Yau (2012) show an unexpected behaviour.The cAIC by Yu & Yau (2012) selects the nonlinear model with a proportion increasing with sample size, even for zero or small variances τ 2 , and for the largest sample size more often than the cAIC proposed by Donohue et al. (2011).Since this behaviour seems to contradict the findings of Yu & Yau (2012), a short discussion is given in 5.3.

Poisson distribution
For Poisson distributed responses y i ∼ P(µ i ), model (5.3) stays the same but, due to a different signal-to-noise ratio, we choose a different sequence of nonlinearity parameters.In order to compare the precision and variability of the different bias corrections, we choose the nonlinearity parameter d = 0, 0.8, 1.6.
For each level of nonlinearity and for the sample sizes n = 25, 50, 75, 100 the estimated bias corrections are listed in Table 4.The results show, that the proposed estimator Ψ is close to the bias correction BC derived by 30,000 times reestimating model (5.3) with Poisson distributed responses and calculating 2.5.The BC bias correction is not applicable in practice since the true unknown mean µ has to be known for its calculation.The high variance of the estimator ρml , proposed by Yu & Yau (2012) indicates some very large values which seem to be due to numerical instabilities.
The selection frequencies are derived for nonlinearity levels d = 0, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6.They are shown in Figure 4.They behave similar to the ones observed for the smoothing spline model with exponentially distributed responses.The unexpected behaviour of the cAICs proposed by Yu & Yau (2012)  and Donohue et al. (2011) are not as pronounced as for exponentially distributed responses.Nevertheless the bias correction of Yu & Yau (2012) is occasionally smaller than the one proposed by Donohue et al. (2011) in this setting as well.

General remarks
The cAIC proposed by Yu & Yau (2012) is used here as an ad-hoc criterion since it is one of the few available benchmarks for model selection in generalized linear mixed models.The criterion was derived for ML estimation of the variance parameters based on McGilchrist (1994), while our models were fitted using the REML based PQL method proposed by Breslow & Clayton (1993).Despite the difference between REML and ML, the two approaches are similar to each other in maximizing the joint likelihood of y and u as mentioned by McGilchrist (1994).However the main objection to the application of the cAIC proposed by Yu & Yau (2012) may be that the models (5.3) and (5.1) have a non-canonical link although the criterion of Yu & Yau (2012) requires canonical link functions.Nevertheless the results of our simulation study do not reflect the findings of Yu & Yau (2012), even for Poisson distributed responses with canonical link, since in their simulation study the proposed cAIC can distinguish between τ 2 = 0 and τ 2 > 0 very well, i. e. the proportion of selecting a model with τ 2 > 0, although τ 2 = 0 is the true model, is zero, see Figure 1, p. 637 in Yu & Yau (2012).In our simulation, on the other hand, in at least a quarter of the cases the more complex model (τ 2 > 0) was chosen, independent of the specific settings.
Furthermore in our simulations the bias correction of Yu & Yau (2012) sometimes was smaller than the bias correction proposed by Donohue et al. (2011).This contradicts Remark 3 in Yu & Yau (2012) that says, that their bias correction is equal to the one proposed by Donohue et al. (2011) plus the trace of a positive semi-definite matrix.However in our simulation the matrix which, following Remark 3 in Yu & Yau (2012), is positive semi-definite sometimes has negative eigenvalues.This seems to be due to a boundary issue.When deriving the criterion, the derivative with respect to τ 2 needs to be calculated when τ 2 lies on the boundary of the parameter space.In these cases the trace of the matrix is sometimes negative.
The implementation of the cAIC by Yu & Yau (2012) was adapted from the MATLAB code the authors provided, but simulations were carried out in statistical programming language R. The code of the simulation study can be found in the supplementary material (Saefken et al. (2014)).
A disadvantage regarding the proposed estimator (3.3) when using numeric integration is, that for each datum the integral needs to be calculated.Therefore if for one i in (3.3) the integral can not be calculated the bias correction can not be obtained.This may be a problem particularly in large data sets and for instance, if there is collinearity in the data.
The implementation of the proposed method to derive (3.3) based on numerical integration takes 330 s for model (5.1) with random-effects variance 1 and five clusters with five observations each on a 2.80-GHz personal computer.The computational cost depends on how precise the numerical integration is and on the size of the data set.
For data from model (5.2) with random-effects variance 1 and five clusters with five observations each it takes about 3 s to compute (3.6) on a 2.80-GHz personal computer.This leave-one-out implementation is increasingly time consuming for larger data sets and less time consuming if there are many zeros in the observed responses.

Example: Modelling tree growth with water availability
Tree growth is of high economic importance as it determines the amount of available timber per time.As the trend is turning to a more sustainable silviculture, it becomes even more important to understand the underlying processes under close to natural conditions.
In this case study, we show how the proposed estimator of the Kullback-Leibler distance for exponential responses influences the selection of models for tree growth.The study is based on a sub-sample of 2655 trees, from a 28.5 ha large area that is located in the core zone of the Hainich National Park, Thuringia, Germany.The National Park is part of Germany's largest continuous broad-leaved forest.To estimate tree growth, in 1999 and 2007 for each tree within the study area the Diameter at Breast Height (DBH), i.e. at about 1.30 m, was mapped, see Butler-Manning (2008).The difference in diameter is the dependent variable growth.We only consider beech, which accounted for 90 % of the recorded trees.We included only trees with 10-30 cm DBH, because they can be reasonably assumed to have completed the phase of highest mortality due to competition (self-thinning), without reproducing yet themselves.Furthermore, we excluded trees for which no positive growth was recorded as these measurements seem to be erroneous.
Growth performance is highly influenced by competition for light.Thus, we assumed that neighbours that potentially overshadow the individuals are crucial for predicting growth.Neighbour-processes are included as KRAFT-class (k i ), nearest-and second nearest-neighbour distances (nnd1 i and nnd2 i ).
Water availability is a good proxy for abiotic resource availability on rich soils, because water availability, apart from light, mainly limits tree-growth, influencing the predominance of beech.To estimate spatial variation in water availability due to soil properties, we use the soil depth (sd i ) as covariate.A second available covariate, the Topographic Wetness Index (twi i ), is calculated from a Digital Elevation Model and measures water availability determined by topography, see Boehner et al. (2006).
Our aim is to find a model that best describes the tree growth with the help of the given covariates.Hence we choose the model with the lowest estimated Kullback-Leibler distance from a set of candidate models.We concentrate on the selection of linear versus nonlinear modelling of the continuous covariates.This corresponds to the selection of random effects in the mixed model framework.We model the DBH difference using an exponential distribution, as using a gamma distribution resulted in a dispersion parameter estimate of 0.98 for model 6.1 that is very close to one.

Univariate smooth function
In order to investigate the model choice behaviour of the mAIC and the proposed AIC with bias correction (3.3) in a simple model, we consider a univariate smoothing example, based on the tree growth data.We estimate the effect of soil depth on the tree growth and include the KRAFT-class to account for differing growth potentials due to light availability.For the mean of the tree growth µ, we obtain the following model: The model captures the positive effect of increasing soil depth for water availability.This effect levels off in very deep soils when fine root density is very low.The negative trend in very deep soils is a joint effect of soil depth and change of grain size to silt perceived as dry soils.

Generalized additive model
In a more sophisticated approach, we consider a model incorporating possibly nonlinear effects of three covariates and one linear effect of the KRAFT-class k.Accordingly we extend model (6.1) to a generalized additive model, see Wood (2006): for i = 1, . . ., 2655, depending on whether the first-or second nearest-neighbour distance is included.We only consider one of the two nearest-neighbour distances, since the two variables are collinear.Thus we use the AICs to decide which of the nearest-neighbour distances to include into the model.The functions m 1 , . . ., m 3 may either be linear or nonlinear functions.In the model selection process, we choose between the two possibilities for each of the three functions in the two models.In consequence we can choose from a set of 16 candidate models.We expect all covariates to have an effect on growth and therefore do not include models into the model selection process that completely omit one of the covariates, except nnd1 and nnd2 respectively.All possible models and the corresponding AIC values can be found in Table 5.
The model selection process is based on two criteria, the proposed conditional AIC with associated bias correction (3.3) and the conditional AIC proposed by Donohue et al. (2011).The marginal AIC is omitted because we can not extract the design matrices Z i , i = 1, 2, 3 corresponding to the random effects parametrization of the smoothing splines, that are needed to derive the Laplace approximation.This problem does not occur in univariate smoothing models since there is no need to split up the design matrix Z corresponding to the random effect.The conditional AIC proposed by Yu & Yau (2012) could not be calculated due to the need for inverting matrices that are singular.
The two criteria both choose the model including the second nearest-neighbour distance with all three effects modelled as nonlinear functions.Comparing each specific model whether to include the second or the first nearest-neighbour distance, both criteria in each case favour the model with the second nearestneighbour distance.For all models, except the two models only containing linear effects, the proposed conditional AIC is larger and therefore penalizes more than the conditional AIC proposed by Donohue et al. (2011).This confirms the behaviour observed in the simulation study, that the criterion by Donohue et al. (2011) underestimates the bias correction.
This example additionally highlights that the various criteria for the estimation of the Kullback-Leibler distance can lead to different model choices.For instance, in the comparison of the two models log (µ i ) = β 0 + β 1 k i + m 1 (sd i ) + β 2 twi i + m 3 (nnd1 i ) (6.4) and log (µ i ) = β 0 + β 1 k i + β 2 sd i + m 2 (twi i ) + m 3 (nnd2 i ) , (6.5) our proposed estimator chooses the first model (6.4), while the estimator proposed by Donohue et al. (2011) chooses the latter (6.5).

Discussion
The proposed class of estimators of the conditional Akaike information is unbiased for finite samples, does not require a particular estimation method for GLMMs and does not assume known variance-covariance matrices for the random effects.Therefore it improves on available asymptotic results in terms of bias, variability as well as model selection.The formulation of penalized regression as mixed models allows the model choice techniques considered in this paper to be used for penalized regression models as well.All of these characteristics make these conditional Akaike information criteria appealing to use.For the theoretical derivation, the working model has to be correctly specified in equations (3.1) and (3.4).The behaviour of the proposed methods for misspecified models needs to be investigated in future work.For other exponential family distributions than the ones discussed above, like the gamma and the binomial distribution, the Stein type formulas do not seem to yield criteria that are computable from observable quantities.It may also be of interest to investigate extensions to distributions beyond the exponential family using the generalized SURE formula of Shen & Huang (2006).The behaviour of other information based criteria like the Bayesian information criterion, BIC, for the selection of random effects in GLMMs needs further investigation.

Appendix: Technical details
Here we give outlines of proofs of the main results.The proofs of the essential Theorem 3.1 and Theorem 3.2 can be done by integration by parts; see Hudson (1978) with small modifications.Here y −i is the vector of observed responses without the i−th observation and y i is the i−th observation with y i θi (y −i , y i − 1) = 0 if y i = 0 by convention.

Fig 1 .
Fig 1. Results for the random intercept model with exponentially distributed responses.The y-axis shows the number of simulation replications out of 1000 where the more complex model was favoured by the different AICs.

Fig 2 .
Fig 2. Results for the random intercept model with Poisson distributed responses.The y-axis shows the number of simulation replications out of 1000 where the more complex model was favoured by the different AICs.

Fig 3 .
Fig 3. Results for the spline smoothing model with exponentially distributed responses.The y-axis shows the number of simulation replications where out of 1000 the more complex model was favoured by the different AICs.

Fig 4 .
Fig 4. Results for the spline smoothing model with Poisson distributed responses.The y-axis shows the number of simulation replications out of 1000 where the more complex model was favoured by the different AICs.

Fig 5 .
Fig 5.The linear effect (dotted line) and the nonlinear effect (solid line) with confidence interval (dashed lines) of the soil depth on the tree growth.The pointwise confidence intervals are defined using twice the standard deviation of the estimator.

Table 2
Yu & Yau (2012)values of four different estimators of the bias correction (2.5) and the corresponding standard deviations (indicated by σ with corresponding index) of model (5.2) for different cluster sizes (n i ), number of clusters (m) and variances of random effects (τ 2 ).The true bias correction BC is derived by (2.5), the estimator Ψ is directly calculated by (3.6), ρml is the estimator proposed byYu & Yau (2012)and ρDon is the estimator proposed byDonohue et al. (2011)

Table 3
Yu & Yau (2012)values of four different estimators of the bias correction (2.5) and the corresponding standard deviations (indicated by σ with corresponding index) of model(5.3)fordifferentsamplesizesn and different degrees of nonlinearity d.The estimator Ψ is directly calculated by (3.3), BC is derived by (2.5), ρml is the estimator proposed byYu & Yau (2012)and ρDon is the estimator proposed byDonohue et al. (2011)

Table 4
Yu & Yau (2012)values of four different estimators of the bias correction (2.5) and the corresponding standard deviations (indicated by σ with corresponding index) of model(5.3)withPoissondistributedresponses for different sample sizes n and different degrees of nonlinearity d.The estimator Ψ is directly calculated by (3.6), BC is derived by (2.5), ρml is the estimator proposed byYu & Yau (2012)and ρDon is the estimator proposed byDonohue et al. (2011)

Table 5
Estimated Kullback-Leibler distance for 16 models fitted to the tree growth data.The first four columns indicate if the effects of the covariates are modelled by linear (−) or nonlinear (∼) functions, corresponding to the absence and presence of random effects.Two different estimators of the Kullback-Leibler distance are listed in the table:The AIC based on the bias correction 3.3 (cAIC) and the AIC proposed byDonohue et al. (2011) (dAIC)