Penalized regression , mixed effects models and appropriate modelling

Linear mixed effects methods for the analysis of longitudinal data provide a convenient framework for modelling within-individual correlation across time. Using spline functions allows for flexible modelling of the response as a smooth function of time. A computational connection between linear mixed effects modelling and spline smoothing has resulted in a cross-fertilization of these two fields. The connection has popularized the use of spline functions in longitudinal data analysis and the use of mixed effects software in smoothing analyses. However, care must be taken in exploiting this connection, as resulting estimates of the underlying population mean might not track the data well and associated standard errors might not reflect the true variability in the data. We discuss these shortcomings and suggest some easy-to-compute methods to eliminate them. AMS 2000 subject classifications: Primary 62G08; secondary 62J99.


Introduction
Two approaches have emerged for the analysis of longitudinal data: linear mixed effects modelling and functional data analysis.Linear mixed effects modelling has its roots in parametric statistics, with, for instance, the response variable assumed to be linear in time (see Demidenko [3]).In contrast, functional data analysis has its roots in smoothing (Ramsay [22]).Recently, these two approaches have become intertwined, with researchers in one approach borrowing methods from the other.Linear mixed effects modellers have borrowed from smoothing research by replacing linear response models with more flexible piecewise polynomial response models, using basis functions such as B-splines.Within-individual correlation can be incorporated via random regression coefficients.See, for instance, Verbyla et al [33], Fitzmaurice [8], Ruppert et al ([27,28]) and the extensive work in animal breeding, including work of Meyer ([19,20]).However, correct modelling of the within-individual correlation is rarely straightforward.
Smoothers have made good use of the fact that, for a specific and known covariance structure, there is a computational equivalence between a particular linear mixed effects model and a standard smoothing approach.This equivalence parallels the well-known correspondence between Bayes estimation and penalized regression.See, for instance, the work on cubic smoothing splines by Kimeldorf and Wahba [17].Curve estimates using this covariance structure can be calculated quickly and often, easily with existing software (Ngo [21] using the software PROC MIXED in SAS, and White et al [35] using the software AS-Reml by Gilmour et al [11]).This connection also provides an automatic choice of the amount of smoothing via the estimation of the ratio of variances in the mixed effects model.
The connection between linear mixed effects models and smoothing is not only elegant, but has also proven useful in many applications such as the comparison of human growth curves (Durban et al, [5]).However, care must be taken in exploiting this computational connection.Inappropriate modelling of the mean structure or too much reliance on the assumed specific covariance structure for the random effects can lead to undesirable features in the population effect estimation, in variance parameter estimation, and in specification of standard errors.In particular, the smoothing-based covariance structure should not be completely trusted since typically its form does not come from subject area modelling.
We illustrate the problems and our solution using two data sets.One data set is taken from an evolutionary biology experiment involving fruit flies' metabolic characteristics measured at irregular time points.Figure 1 shows measurements of the metabolic variable velocity (i.e.rate) of oxygen consumption (VO2) of fruit flies placed in individual dessication (drying) chambers.The data were collected by Donna G. Folk in the laboratory of Timothy J. Bradley at University of California Irvine in order to study the evolution of physiological traits in fruit flies that were selectively bred to withstand dry conditions (Folk and Bradley [9]).The researchers provided already processed data giving the oxygen consumption rates.The right panel of Figure 1 shows measurements on eight fruit flies from the selectively bred group while the left panel shows measurements from the eighteen fruit flies in the control group, who were randomly bred.Repeated measurements were made on each individual until death, usually within one to two days.The researchers had compared lifetimes of the two groups, but were also interested in comparing the temporal patterns of the V02 measurements, that is, the shape of the curves.We therefore scaled lifetimes to the unit interval, to allow us to focus on shape and ignore lifetime.This re-scaling is a very simple type of curve registration (see Ramsay and Silverman [22]).Due to the design of the laboratory machinery, fruit flies were measured at different times.This inherent imbalance in the design with respect to the time variable is increased by the individual rescaling of time to the unit interval.Thus, the researchers cannot use pointwise analysis nor standard multivariate analysis to compare the two groups of insects.Our analysis of these data is presented in Section 5.
The other data we analyze were collected in a balanced design.In such a design, we can calculate unbiased estimates of variances and, through direct comparison of variance estimates, we can illustrate some of the problems that might arise with blind application of the standard smoothing/linear mixed model analysis.Also, the widespread availability of this data set, the CanadianWeather in the fda library in the statistical software package R, allows the reader to check our calculations.Figure 2 shows average daily temperatures recorded at 35 Canadian weather stations, where time t = 1 corresponds to January 1.In our analysis, we view the 35 stations as a random sample of all possible Canadian locations, with our goal being inference for the mean daily temperature in Canada.While this is a somewhat unusual goal for these data, it allows us to illustrate the main points of this paper.A more appropriate goal for these data, estimation of a particular station's "typical" weather curve, is not addressed here, although it is briefly mentioned.
We propose fitting data sets such as these in two steps: first, for speed of computation, we fit a standard smoothing-based linear mixed effects model.This yields our proposed estimate of the population curve.Then we use the output of this fit to calculate pointwise sandwich standard errors of our population curve estimate.We make two recommendations for the linear mixed effects model.We recommend that the model used for the individual-level random effects be a submodel of the population curve model, in the sense specified in Theorem 3.1.In addition, we recommend that the covariance structure for individuallevel random effects be based on subject area knowledge.If such a rationale for the covariance is not available, we recommend using a "time-neutral" covariance structure, defined in Section 2. These recommendations will, in some cases, reduce or eliminate bias in the estimate of the population curve.While we do not propose anything further for bias correction of the estimate of the population curve, we do show via simulation studies that, if our recommendations are followed, resulting pointwise confidence intervals have good coverage properties (Section 6).This is further supported by our theoretical calculations in Section 4.2.
Figure 3 contains four estimates of the population mean temperature: one estimate is the daily mean of the temperatures.The other three estimates are from linear mixed effects models.They are calculated with the R library nlme, which uses restricted maximum likelihood (REML) estimates of variance components.For a discussion of restricted maximum likelihood estimators, see, for instance, Demidenko [3].The specifics of our calculations are given in Section 3.4.The three mixed effects estimates are calculated using the same function space to model the population mean and the same function space to model the individual station effects, but the three estimates use different covariance structures for the random effects, covariance structures commonly used in smoothing.One mixed effects estimate uses a covariance structure corresponding to "time running forward", the other uses a covariance structure corresponding to "time running backward".The third mixed effects estimate uses a "time neutral" covariance structure.The first two mixed effects estimates do not track the pointwise average well, with the "forward" estimate deviating slightly from the pointwise average near day 365 and the "backward" estimate deviating near day 1.Clearly, both are poor estimates.The "time neutral" estimate tracks the mean fairly well.
In Figure 4, we compare several methods of computing standard errors.Throughout this paper, we plot error bands as plus or minus one standard error.Panels a) through c) show three of the four estimates from Figure 3 along with pointwise standard errors.Panel a) shows the pointwise average, with standard errors given by the pointwise standard deviation divided by √ 35.Panels b) and c) show model-based standard errors, that is, standard errors calculated using estimates of the assumed covariance structure of the linear mixed effects models.Panel b) corresponds to the "running backward" covariance structure and panel c) to the "running forward" model.Note the widening of the standard error bars in panels b) and c) to values that are much higher than the standard errors of the pointwise averages, standard error bars so wide that they are nonsensical.The "time neutral" covariance structure yields the model-based standard errors shown in panel d), where we also show the standard errors from panel a).Note that these two types of standard errors are similar in magnitude, but the "time neutral" covariance structure yields standard errors that are almost constant, which contradicts the fact that, in the winter, the variance between city temperatures is much larger than in the summer.Note that the standard errors in panel a) of Figure 4 will only be useful when, on each day, temperatures are recorded for a fairly large number of stations.In many applications, this is not the case and these straightforward standard errors cannot be readily calculated.
The observation that the assumed covariance structure of the random effects in a linear mixed effects model can impact mean estimates and standard errors is not completely new.Misspecification of the covariance structure might have an effect on the estimated means, and typically can have a big effect on standard errors and on inference.Unfortunately, currently published work on smoothing in mixed effects models has largely ignored this potential problem.Two notable exceptions are Brumback et al [1] and Djeundje and Currie [4], who both note the serious implications of reliance on the smoothing-induced covariance.The former suggest a computer-intensive bootstrap procedure to rectify the problems.The latter specifically note the problem of fanning as illustrated in Figure 4 and suggest a penalized approach to reduce fanning and also to reduce bias in the estimate of the population curve.This penalized approach should work well when the data are generated from a a specific covariance structure that matches the penalty.Section 4.2 contains further discussion of Djeundje and Currie's method, along with discussion of general issues of assessing variability in an estimated regression function.Our simulation studies in Section 6 indicate that, when the covariance structure does not match the penalty, Djeundje and Currie's variability bands severely underestimate the sampling variability in the estimates of the population mean.
In summary, from Figures 3 and 4, we see that the covariance structure assumed for the random effects can seriously impact both estimates and standard errors.Fortunately, under certain conditions, mean estimates are not effected by the assumed covariance structure (see Theorem 3.1).Appropriate standard errors can be easily constructed via sandwich estimators (see Sections 3.2 and 4.3).Figure 5 contains standard error bars calculated via our recommended methods.These standard error bars do not show the widening as seen in Figure 4 and can be calculated even when we do not have multiple observations on each day.See Liang and Zeger [18] for a discussion of variance misspecification in the context of generalized estimating equations.See also Szpiro et al [32], who propose a sandwich estimator-based solution in a slightly different context.
Section 2 contains notation and the general formulation of the linear mixed effects model.Sections 3 and 4 contain detailed calculations and discussion of estimators, predictors, and standard errors.Section 3 covers the conceptually straightforward model in which the population curve is nonrandom.In Section 4, the population curve is random.This assumption is unorthodox in classical linear mixed effects modelling but is common in smoothing (see, for instance, Ruppert et al [28]) and in Gaussian Process Regression, a popular technique in machine learning (Rasmussen and Williams, [23]).Section 5 contains analysis of the fruit fly data set shown in Figure 1.Section 6 contains the results of a simulation study.

General formulation
Data are collected on N independent subjects, with data on subject i, (t ij , Y ij ), j = 1, . . ., n i , modelled as We model the population curve µ via a set of J P + K P basis functions {ψ P j , 1 ≤ j ≤ J P , φ P k , 1 ≤ k ≤ K P }: We model individual i's deviation g i via basis functions γ Ij , j = 1, . . ., L: In all of our analyses, β = (β[1], . . ., β[J P ]) ′ is a fixed effect and the θ i 's, We assume that θ 1 , . . ., θ N , ǫ 1 , . . ., ǫ N (and δ, when δ is assumed random) are independent, mean zero and normally distributed.When δ is random, we will assume that var(δ) is an unknown positive constant times Σ δ , a known covariance matrix.We denote var(θ i ) = Σ I and we consider two general models for Σ I , one with Σ I unrestricted and the other with Σ I = Σ R I of some particular known form with just a few unknown parameters.The primary purpose of restricting the form of Σ I is to facilitate computation in fitting the linear mixed effects model.In addition, some restricted forms have a connection to smoothing and so have become popular in mixed model smoothing.
We propose fitting the parameters of the model defined by (2.1), (2.2) and (2.3) using a restricted covariance for var(θ i ).However, to avoid the problems in standard errors sometimes caused by mis-specifying the covariance structure, we propose using sandwich-type standard errors based on the unrestricted Σ I .
While our results hold in general, in all of our examples and in our simulation studies, we assume that µ and the g i 's are linear splines with equally spaced knots and we suppose that t ∈ (t min , t max ) where t min = min i,j {t ij } and t max = max i,j {t ij }.We do not restrict the knots for modelling µ to be the same as the knots for modelling the g i 's.A linear spline on the interval [t min , t max ] with K interior knots, K 1 < K 2 < • • • < K K , with K 1 > t min and K K < t max , is continuous and piecewise linear with the "pieces" defined on the subintervals determined by the knots.With K interior knots, we require K + 2 basis functions and we consider the following bases.These bases were also considered by Djeundje & Currie [4].
Splines are widely used for flexible fitting of functions.See, for instance, Ramsay and Silverman [22].See Eilers and Marx [7] and Welham et al [34] for extensive discussion of the connections between the truncated power basis and a Bspline basis in penalized smoothing.We consider four models for

1) -(2.3).
A. µ is fixed, g i is random: assume that β and δ are non-random and var(θ i ) = Σ I is unrestricted.B. µ is fixed, g i is random but with modelled covariance: assume that β and δ are non-random and that var(θ i ) = Σ R I , some restricted form.C. µ is random, g i is random: assume that β is fixed, that δ is N (0, σ 2 P,C Σ δ ) for some known Σ δ and unknown σ 2 P,C , that δ is independent of the θ i 's and ǫ i 's, and that var(θ i ) = Σ I is unrestricted.D. µ is random, g i is random but with modelled covariance: assume that β is fixed, that δ is N (0, σ 2 P Σ δ ) for some known Σ δ and unknown σ 2 P , that δ is independent of the θ i 's and ǫ i 's, and that var(θ i ) = Σ R I , some restricted form.
We use the notation σ 2 P,C for the model C parameter to avoid confusion between estimating the variance of a component of δ under the unrestricted model C versus the restricted model D.
It is important to keep in mind that the model defined in (2.1)-(2.3)has two components: the choice of the basis functions and the assumed covariance structure of the random coefficients.These two elements determine the covariance between µ(t) + g i (t) and µ(s) + g i (s), and it is the structure of this covariance that is crucial in analysis.Clearly, a change of basis without the accompanying change in assumptions about the covariance of the random coefficients may change the covariance of µ+ g i .Notice that, in models A and C, the form for the covariance of the g i 's depends only on the space spanned by the basis functions and not on the choice of basis.In contrast, in models B and D, the basis for the g i 's and the form of Σ R I are both important in defining the model for the covariance of µ + g i .
When modelling µ in (2.2) in our data analyses and simulation study, we take the ψ P j 's and φ P j 's to be either the linear "time running forward" or "time running backward" basis functions.When µ is random, as in models C and D, we take Σ δ equal to the identity.
When modelling the g i 's in (2.3) in our data analyses and simulation studies, we use piecewise linear functions and we consider two restrictions on var(θ i ).When using a power basis for g i , with either time running forward or time running backward, we take Here Σ β is the two by two unrestricted covariance matrix of ( ), the coefficients of γ I1 (t) ≡ 1 and γ I2 (t) = t, and σ 2 I I is the restricted covariance matrix of (θ i [3], . . ., θ i [K + 2]), the coefficients of the power functions.This covariance structure was used by Durban et al [5].When using the Bspline basis for g i , we take var We now discuss the covariance structures induced by these assumptions.For "time running forward" and with covariance structure as in model B or D with var(θ i ) = Σ p I as in (2.5), var an increasing function of t.Similarly, for "time running backward" and with the same var(θ i ), the variance of Assuming more variability in individual random effects at one end of the time scale than the other leads to the "drifting" of the estimate of µ in Figure 3 and to the unacceptable widening of the standard error bands in panels b) and c) of Figure 4.
Consider the covariance induced by the linear B-spline basis with equi-spaced knots, with the difference between knots equal to ∆. Suppose that either model B or model D holds, with var In particular, at the equi-spaced knots the variance is constant: var(g i (K k )) = σ 2 I .On the interval between knots, the variance is quadratic with minimum value at the midpoint.The plot contains four estimates of µ, the typical weather curve, as described in Sections 2 and 3.The solid line is the pointwise average, the dotted line is the "time neutral" estimate, the dot-dashed line is the "time running forward" estimate, and the long dashed line is the "time running backward" estimate.The latter three estimates are based on piecewise linear functions with 41 equispaced interior population knots and 7 equispaced interior individual knots.
We call the model for g i assuming either (2.5) for the linear power basis or (2.6) for the linear Bspline basis a "smooth g i " model.Likewise, we call our piecewise linear model for random µ a "smooth µ model" when Σ δ is the identity matrix.To see why, consider model B with power basis functions and var(θ i ) = Σ p I as in (2.5).Write . By Henderson's justification [26], for fixed σ 2 ǫ , σ 2 I and Σ β , the best linear unbiased predictors (the BLUPs) of the f i 's in (2.1), (2.2) and (2.3) are obtained by minimizing The ith summand is a penalized least squares regression with penalties on β i and δ i .When we model g i as a spline using the power basis, the penalty on δ i with "smoothing parameter" σ 2 ǫ /σ 2 I is one of those proposed by Eilers and Marx [6] for P-spline smoothing, and is recommended by Ruppert et al [28].To consider the effect of the penalty, suppose that the penalty on δ i is large, that is, that σ 2 ǫ /σ 2 I is large.Then the BLUPs of the δ i 's will be close to 0 and so the BLUP of g i will be close to a line.We can thus say that the penalty on δ i shrinks the BLUP of g i to a line, with the amount of shrinkage depending on σ 2 ǫ /σ 2 I .Therefore, the effect of the penalty is similar to penalizing for large second divided differences of g i and so the penalty is similar to the second derivative penalty that yields a smoothing spline estimate [12].In fact, when the knots are equi-spaced with K k = (k − 1)∆, then one can show that δ i [k] is exactly equal to the second divided difference [g i ((k + 2)∆) − 2g i ((k + 1)∆) + g i (k∆)]/∆.See Eilers and Marx [7].
Similarly in model D with Σ δ equal to the identity matrix, by Henderson's justification, the predictors of the f i 's are based on minimizers of Thus, when using the power basis, the restrictions in our linear mixed effects model lead us to P-spline smoothing type estimators of µ and the g i 's.
Conversely, we can rewrite many penalized least squares criteria in terms of the estimation criterion in a linear mixed effects model.Suppose that the f i 's have the basis expansions as in (2.1), (2.2) and (2.3) for some basis functions, not necessarily Bsplines or power functions.Suppose that our estimates of the β[j]'s, δ[k]'s and θ i 's are the minimizers of the penalized least squares criterion where Ω is a known symmetric non-negative definite L×L matrix and λ is unknown.If Ω is invertible, then we can use a linear mixed model in which the covariance matrix of the θ i 's is restricted to be Σ R I ≡ (λΩ) −1 .This is exactly model B or D for θ i .If Ω is not invertible, then we can reparameterize the problem to a linear mixed model that is similar to the model for θ i in model B or D, as follows.Let Ω = QΛQ ′ be the eigendecomposition of Ω, with Λ diagonal with first k diagonal elements equal to 0. Partition Q as and the penalized sum of squares criterion becomes i,j with Λ * a diagonal matrix containing the non-zero eigenvalues of Ω.Thus, the penalized sum of squares criterion is the same as a linear mixed effects model criterion where the θ * i1 's are fixed effects and the θ * i2 's are random effects with covariance matrix (λΛ * ) −1 .Some restrictions on the θ * i2 's may be needed, to ensure identifiability.A slight modification leads us directly to model B or D and avoids the problem of non-identifiability: suppose that the θ * ij 's are independent with means equal to the zero vector and with the covariance of θ * i1 = Σ I , unrestricted, and the covariance matrix of the θ * i2 's equal to the restricted matrix (λΛ * ) −1 .
Special cases of models A and B have been considered elsewhere.In the animal breeding literature see, for instance, Huisman et al [15] and Meyer [19].Rice and Wu [24] consider model A in a medical context.In these models where µ is fixed, the only smoothness of µ comes from the smoothness of the basis functions.We do not include a penalty term, as in the random µ case, where the penalty term forces further smoothness of µ.Our model for fixed µ is a standard model in spline regression.See, for instance, Stone et al [30] or Friedman [10].
Durban et al [5] analyzed growth data by reformulating a penalized spline approach with µ fixed into an analysis using the restricted model D, modelling µ as random, using the degree 1 power basis.They considered a range of models for g i : g i equal to a random intercept, a random line with unspecified covariance matrix for the slope and intercept, and g i piecewise linear with the same knots as µ, with cov(θ i ) = Σ P I as in (2.5).Their main goal was to carry out various hypothesis tests.While their figures do contain prediction bands for their function estimates, they provide no explanation of their calculation.
Details of calculations of estimators and standard errors are given in the following sections.In summary, when µ is a nonrandom function, we estimate µ by the maximum likelihood estimator under the restricted model B. We propose easy-to-compute standard errors of this estimate, valid under the unrestricted model A. When µ is a random function, we use the restricted model D to calculate the BLUP of µ and then use the unrestricted model C for easy-tocompute pointwise prediction bands.Throughout, we ignore any model-based bias, that is, we assume that (2.1) -(2.3) are exact.
In this article, we only consider inference for µ -we do not study prediction of the individual effects.But this prediction is straightforward, predicting g i by substituting our estimates of θ i into (2.3).
We do not consider estimation or prediction of µ under models A or C because fitting linear mixed effects models with so many variance parameters is computationally challenging.For instance, the model with unrestricted Σ I caused us computational problems when the dimension of Σ I wasn't small.In our analysis of the fruit fly data, we found that R's lme would not converge when we used five or more knots in modelling g i , that is, when the dimension of the unrestricted Σ I was 7 by 7 or larger.The convergence was extremely slow when Σ I was 6 by 6.This held for both the power basis and the more computationally stable Bspline basis.In contrast, calculation of estimates with the restricted covariances always converged and converged very quickly, even when ten knots were used to model g i .The small number of parameters in the restricted model can turn a computationally impossible analysis into a possible analysis.
The techniques we use in our calculations are not new.Many calculations in the linear mixed effects model appear in Demidenko [3] and Ruppert et al [28].However we present these calculations in a way that clearly shows when we are relying on the smoothing model covariance structure of models B and D and when we are using the more general models A and C. We also discuss in Section 4 interpretations of various techniques for error bars of a predictor of µ when µ is random.When µ is random, we should assess variability of the predictor about µ, not about E(µ).

Estimation of µ under model B
Consider data generated according to (2.1) through (2.4) under either model A or B. Since δ is a fixed effect, µ is non-random; thus when we talk about an estimate of µ(t) and a standard error of the estimator, our meaning is clear.The estimator of θ = (β ′ , δ ′ ) ′ under the assumptions of model B is the generalized least squares estimate, minimizing with Σ * R i denoting the variance of ǫ * i under the restricted model B, Therefore the estimator of θ for known variance parameters is The estimator of µ(t) for given A linear mixed effects model fit of model B yields (restricted) maximum likelihood variance estimators ΣR I and σ2 ǫ , and thus yields Ĥi = H i ( Σ * R 1 , . . ., Σ * R N ), an estimator of H i .The (restricted) maximum likelihood estimator θ is then equal to Ĥi Y i and the (restricted) maximum likelihood estimator of µ(t), μ(t), is gotten in the obvious way from θ.The method also provides estimators, θi i = 1, . . ., N , of the BLUPs of the θ i 's.These estimators, commonly called the estimated best linear unbiased predictors or EBLUPs, are gotten by substituting covariance estimates into the expressions for the best linear unbiased predictors.We use a tilde when an estimator or predictor is based on known covariance parameters and a hat when estimated covariance parameters are used.
In particular, we use a tilde to denote a BLUP and a hat to denote an EBLUP.

Calculation of standard errors
The estimator μ is derived under the assumption that model B holds.In this section, we calculate the standard deviation of μ(t) valid under the unrestricted model A. We then use this standard deviation to compute a standard error of μ(t) by plugging in variance parameter estimates that are appropriate under model A. We ignore variability caused by estimation of the variance parameters that appear in μ, but acknowledge that doing so is likely to produce standard errors that may be small.This variability could be accounted for by, e.g., methods of Kackar and Harville [16].
The variance of μ(t) is calculated from var( θ) using variance/covariance rules as var( θ) = where Σ * R i is obtained by fitting model B. Expression (3.4) was used to calculate the standard errors shown in panels b), c) and d) of Figure 4. Clearly, we do not want to use the model-based covariances which produced panels b) and c), as doing so gives unrealistic standard errors for our estimate of µ.The problems in panel d) are not as striking, but we still do see the effects of the assumed "time neutral" covariance structure.
To construct better standard errors, we require an estimator of var( and thus we require an estimator of σ 2 ǫ and an unrestricted estimator of Σ I = var(θ i ).We estimate Σ I by S θ , the sample covariance matrix of the θi 's, our estimators of the BLUPs from fitting model B: We estimate σ 2 ǫ by where and df adj corrects for parameter over-counting.For instance, when using power bases at both the population level and the individual level, df adj = 2 + the number of common population and individual interior knots.In the special case that the population knots and individual knots are the same and the t ij 's do not depend on i, our formula for the degrees of freedom simplifies: with n = n i , K = the number of interior knots, df = N n − N (K + 2).The resulting estimates of σ 2 ǫ and var( θ) agree with Demidenko's (pp 61 ff [3]).
Other estimates of σ 2 ǫ and Σ I = var(θ i ) are possible.Since θ and θi are shrinkage estimators one could adjust (3.5) and (3.6) by adjusting the degrees of freedom to account for shrinkage.See Hodges and Sargent [14] for a discussion of a variety of suggestions.On the other hand, a sensible estimate of σ 2 ǫ can be gotten by ordinary least squares, with no shrinkage in estimation of any basis function coefficient.Alternatively, we might consider a method of moments approach, as follows.Let S θi and σ2 ǫ be the analogues of (3.5) and (3.6) but with θ and θi calculated using the true variance parameters.The expected values of S θ and σ2 ǫ are easily calculated and are linear in both σ 2 ǫ and Σ I .To estimate σ 2 ǫ and Σ I , set E(S θ ) = S θ, E(σ 2 ǫ ) = σ2 ǫ and solve.We have not made a careful investigation of these possibilities.
Our estimator of var(ǫ ǫ I, and we use this in (3.3) to estimate the variance of θ under model A: The variance estimator in (3.7) relies on the assumed form of the variance of ǫ * i given in model A, and so we call a standard error for μ based on this variance estimator a half sandwich standard error.If this form of the variance is suspect, if, for instance, the covariance matrix of ǫ i is not a constant times the identity, then the following general sandwich estimator of the variance of θ might be preferred: We call a standard error for μ based on this variance estimator a full sandwich standard error.Robert-Granié, Heude and Foulle [25] consider such a sandwich estimator when fitting a simple random regression model assuming a specific variance structure that depends on covariates.

Balanced design
We call a design for (2.4) balanced if C i ≡ C and C Ii ≡ C I .While many designs are not balanced, considering the balanced design can provide us with insight into linear mixed effects analysis.For a balanced design, the variance of ǫ * i does not depend on i.In this case, the model B estimator of θ in (3.2) does not depend on i, the sandwich variance estimator in (3.8) simplifies as follows: using the facts that ĤC = N −1 I and θ = ĤN Ȳ, we see that Ĥ(Y i −C θ) = Ĥ(Y i − Ȳ) and so var s ( θ) = Ĥ (Y i − Ȳ)(Y i − Ȳ) ′ Ĥ′ .Thus, we see that estimating the variance of the ǫ * i 's in (3.3) via model B based residuals is equivalent to estimating the variance using the sample variance of the Y i 's.
Suppose that the design is balanced and that model (2.4) holds with Σ θ denoting the possibly restricted covariance matrix of θ i .Then the maximum likelihood estimator of θ when variance parameters are known is the generalized least squares estimate Under an additional condition on C and C I , given in the following Theorem, the estimator θG is equal to the ordinary least squares estimator and thus does not depend on the assumed covariance matrix.Under the same condition explicit formulae for the maximum likelihood and restricted maximum likelihood estimators for Model A covariance parameters can be given; see the end of this section.The proof of the Theorem appears in the Appendix.By considering the proof, we see that Theorem 3.1 holds for general matrices C and C I , that is, the Theorem does not require that C and C I depend on basis functions.Theorem 3.1.Suppose that model (2.4) holds with C i ≡ C and C Ii ≡ C I .If the column space of C I is contained in the column space of C, then the θG in (3.9) and θ, the corresponding maximum likelihood estimator when variance parameters are unknown, are equal to the ordinary least squares estimate θO The conditions of the Theorem relate to the choice of function spaces in modelling in (2.1) -(2.4).Suppose that the design is balanced, that the function space modelling the g i 's is a subspace of the function space modelling µ and that θ is non-random.Then the column space of C I is contained in the column space of C and the Theorem states that the MLE for θ is the ordinary least squares estimate, not depending on the covariance structure of the ǫ * i 's.Translating this to our models with piecewise linear functions, if the knots for the g i 's are a subset of the knots for the population curve µ, θ in (3.2) simplifies to the ordinary least squares estimate of θ and so μ does not depend on the specific basis functions or on the assumed covariance structure of the station-specific random effects.Consequently our "forward time", "backward time" and "time neutral" estimates of µ are the same.
Demidenko [3] considers model (2.4) but with general matrices C i and C Ii not necessarily derived from basis functions.He establishes the conclusion of Theorem 3.1 in a balanced design under the stronger condition C = C I ; he then gives, under this same condition, explicit formulae for the maximum likelihood and restricted maximum likelihood estimates of Σ I and σ 2 ǫ when the population δ is not random and when Σ I is unrestricted (pp 61ff).Careful reading of his proof shows that these formulae remain valid whenever generalized least squares reduces to ordinary least squares.Thus under the conditions of Theorem 3.1 we find that the restricted and unrestricted maximum likelihood estimators of σ 2 ǫ under model A are equal and given by where C I is n by L. The maximum likelihood estimator of To get the restricted maximum likelihood estimator of Σ I replace the N in the denominator of S by N − 1. See Demidenko, p 63 [3].

Temperature data analysis with µ nonrandom
In all of the temperature data analyses, we model functions as splines of degree p = 1 using either the power basis or the B-spline representation, as described in Section 2. Knots are equi-spaced with equal distances from the "edges" of 1 and 365: a sequence of K interior knots is constructed with K j = 1 + 364 j/(K + 1), j = 1, . . ., K.
For the analysis of Figure 3, we use 41 interior population knots and 7 interior individual knots, all equispaced, so the model doesn't satisfy the conditions of Theorem 1.As we see, our "time running forward" and "time running backward" and "time neutral" estimates of µ are not the same.
The unacceptably widening pointwise standard error bands in panels b) and c) of Figure 4 were computed using the model-based variance estimate in (3.4).As noted in Section 2, this widening is caused by the covariance assumptions of model B. In panel d), we see that the model-based pointwise standard errors using the "time neutral" covariance structure do not show sufficient heteroscedasticity, also as explained in Section 2.
In Figure 5, estimation of µ involved 41 interior population knots and 6 interior individual knots.When using these knots, by Theorem 3.1, the "time running forward", the "time running backward" and the "time neutral" estimates of µ are all the same.Indeed, when using these knots, the covariance structure of the random effects does not influence the estimate of µ, as the estimate is the ordinary least squares estimate.Panel a) shows the estimate of µ, along with the pointwise averages of the temperatures for comparison.Panel b) displays standard errors based on the sandwich estimators (3.7) and (3.8), using the "time neutral" covariance structure.Panel c) shows that the difference between sandwich standard errors using the "time running forward" and using the "time neutral" covariance structures is negligible.Sandwich estimation appears to have greatly reduced the bias in the standard error estimates, a bias caused by model mis-specification.
We have not plotted the model-based standard errors for this choice of knots, but they exhibit the same undesirable behaviour shown in Figure 4. Indeed, model-based standard errors exhibit undesirable behavior for a wide range of choices of number of knots.
It is important to remember that both model-based and sandwich standard errors for μ are affected by the assumed covariance structure.For instance, even if the "forward" estimator of µ is the same as the "backward" estimator of µ, the model B based standard errors of the "forward time" estimator will, in general, be different from those of the "backward time" estimator.Even sandwich standard errors can be affected by the covariance structure: in plots not shown here, in the "time running forward" and "time running backward" models, the standard errors using (3.7) or (3.8) also exhibit fanning, albeit mild, if the conditions of Theorem 3.1 do hold.Even if the knot conditions implied by the assumptions of the Theorem do hold, some amount of fanning may occur (see Figure 8, where the fly data are analyzed assuming that µ is random).
In an analysis not shown here, we fit model B with the individual random effect g i equal to a line with the assumed covariance matrix of the slope and ).Panel c) shows the difference between sandwich standard errors using the "time running forward" and using the "time neutral" covariance structures.The dotted line is the difference for half-sandwich standard errors and the dashed line is the difference for full sandwich standard errors.
intercept unrestricted.The widening of the standard error bands didn't occur, a result that agrees with the analysis in Smith and Wand [29].
To calculate the estimators of β and the BLUP of δ, we assume that the restricted model D holds.The variance of ǫ * i under model D is Σ * R i as defined in (3.1).Let 0 J,K denote a J × K matrix of zeroes and let Then, for known variance parameters, under model D, β, the maximum likelihood estimator of β, and δ, the BLUP of δ, can be found via Henderson's justification (see Robinson [26]), as the minimizers of Therefore, our predictor of µ(t) in model D for known variance parameters is μ(t) = θ′ (ψ P 1 (t), . . ., ψ P JP (t), φ P 1 (t), . . ., φ P KP (t)) ≡ θ′ f (t).
A linear mixed effects model fit of model D yields (restricted) maximum likelihood estimators of σ 2 P , Σ R I , and σ 2 ǫ , and thus estimators of H i , denoted Ĥi .The fit also produces estimators of the BLUPs of θ and the θ i 's.The estimator of the BLUP of θ is θ = Ĥi Y i .The predictor of µ(t), denoted μ(t), is gotten in the obvious way from θ.

Assessing variability of the predictor of µ(t)
Historically, linear mixed effects models have been used to estimate fixed effects.Using them to predict random effects, as in the prediction of µ, raises conceptual problems in the interpretation of µ and in how one should construct prediction intervals.The appropriate method for assessing variability of μ(t) is not obvious when µ is modelled as random.
To model the variability of μ, we could take a strictly Bayesian approach, since our estimate of µ(t) is based on the BLUP μ(t), which is equal to E(µ(t)| the data).The Bayesian viewpoint would have us construct a credible interval for µ(t) using the posterior variance of µ(t) given the data.However, one must be confident in the choice of prior.In our context, the prior is based on a smoothing trick and not reflective of prior information about µ.That is, the prior and the view that µ is random do not arise from Bayesian principles.Therefore, we prefer a frequentist approach, assessing variability based on mean squared error.Since we are interested in µ(t) and not the population fixed effect E{µ(t)}, we construct intervals based on a measure of the magnitude of μ(t) − µ(t).So, for instance, we do not construct intervals of the form μ(t) ±[var{μ(t)}] 1/2 since var{μ(t)} = var{μ(t) − β j ψ P j (t)} measures variability of μ(t) about the population fixed effect, not about µ(t).
We study two measures of the magnitude of μ(t) − µ(t): e 2 δ (t) = E[ {μ(t) − µ(t)} 2 | δ] and e 2 (t) = E{μ(t) − µ(t)} 2 , and discuss how we might use these measures to construct intervals that are likely to contain µ(t).The measure e 2 δ (t) provides inference that holds for each realization of µ, and thus seems the most sensible, as our data set has been generated by only one realization of µ.
Although our arguments here are in a Bayesian framework with µ random, the conditional approach would appeal to the frequentist, who views µ as fixed and considers the randomness of µ as merely a mechanism for smoothing.In either case, there is really just one µ of interest, leading us to think of µ as fixed in our inference.The measure e 2 (t) provides inference that holds on average over all realizations of µ.It may perform poorly for some realizations of µ and perform well for others.Below we calculate e 2 δ (t) and e 2 (t) assuming that the unrestricted model C holds.In Section 4.3, we present estimators of these two measures, estimators that are appropriate under model C.
If the variance model is correctly specified, that is, if model D holds, then e 2 (t) is equal to the posterior variance of µ given the data.To see this, write μ As noted, this posterior variance is commonly used in a Bayesian approach to assess variability of the posterior mean.
The pointwise standard errors proposed by Djeundje and Currie [4] are equivalent to the square roots of estimates of pointwise posterior variances, under a prior chosen to reduce frequentist bias.However, in our simulation study in Section 6, when we applied their method to data generated under a different, but reasonable, prior, the standard errors proved to be far too narrow.
Standard errors based on the assumed prior are discussed by Ruppert et al [28], but only in the case that N = 1, that is, in the case of P-spline smoothing regression.In Section 6.4 of their book, they discuss the analogues of V θ , e 2 δ (t) and e 2 (t) for this single-curve case.To calculate confidence intervals for µ(t), they compare e 2 (t) and f (t) ′ V θ f (t), and state they prefer the former.They don't consider confidence intervals based on e 2 δ (t).Their calculations assume that the smoothing model D holds, that is, that Σ I is as in (2.5), while our calculations hold under the more general model C. The simulation results in Section 6 show that this method, with its reliance on the assumed covariance structure, doesn't extend well to more than one curve.
Other authors have proposed alternatives to model-based standard errors.Sun, Zhang and Tong [31] study a linear mixed effects model with time-varying coefficients at the population level, proposing a two step estimation procedure to reduce computational cost.The first step is local least squares regression ignoring the covariance structure, to estimate the population mean parameters.This step produces individual level residuals which are then used in a method of moments procedure to estimate variance parameters.For the case that N = 1, Crainiceanu et al [2] take a Bayesian approach to modelling Σ δ .

Estimating e 2
δ (t) and e 2 (t) We have defined μ(t) and μ(t), predictors of µ(t), in the restricted model D and we have defined two measures of the variability of μ(t), e 2 δ (t) in (4.6) and e 2 (t) in (4.7).Our calculations for e 2 δ (t) and e 2 (t) are valid under the unrestricted model C. In this section, we define estimators of e 2 δ (t) and e 2 (t) that are also valid under the unrestricted model C. We then use these estimators to define prediction intervals for µ(t) centered at μ(t).
Our estimates of e 2 δ and e 2 rely on estimators of A in equation (4.2) and the H i 's in equation (4.3).We estimate A and the H i 's by "plugging in" the parameter estimates from our fit of the restricted model D. Our fit of this restricted model also yields σ2 P , the estimate of σ 2 P , and δ, the BLUP of δ.Using all of these in (4.5) and (4.6), we define the full sandwich estimate of e 2 δ as ê2 δ,full (t) = f (t) ′ 1 σ4 To define the full sandwich estimate of e 2 in (4.7), we let σ2 P,C be the sample variance of the components of δ: We also define half sandwich estimates of e 2 δ and e 2 , estimates that assume the covariance structure of the ǫ * i 's is as specified in model C.For these estimates, we simply replace (Y i − C i θ)(Y i − C i θ) ′ in the full sandwich estimators with an estimate of var(ǫ * i ) valid under the unrestricted model C: with S θ and σ2 ǫ as in equations (3.5) and (3.6), but using the θi 's and θ gotten from fitting model D instead of model B.
Thus, we have four different sandwich prediction errors.Using e 2 (t) yields a full sandwich error and a half sandwich error.Using e 2 δ (t) yields a full sandwich error given δ and a half sandwich error given δ.Our simulation studies indicate that the performances of the four estimators are comparable.One might prefer the full sandwich estimators, as they require fewer model assumptions.In a frequentist approach in which the randomness of µ is simply a device for smoothing, one would prefer a conditional mean squared error.Combining these reasons yields ê2 δ,full as the preferred estimator.

Temperature data analysis
We constructed figures (not shown) analogous to Figure 3 and 4, except based on model D with µ random, piecewise linear with "time running forward" covariance structure, that is with the "running forward" power basis and with Σ δ equal to the identity.We constructed predictors of µ using different configurations of knots, and, for the individual level random effects, different covariance structures, that is, for "time running forward", "time running backward" and "time neutral".We also constructed the model-based standard errors and all types of pointwise sandwich standard errors.The results were, for the most part, qualitatively the same as those assuming µ fixed.We found that, when knots weren't chosen according to the principles of Theorem 3.1, the "time running forward" estimate of µ did not track the pointwise average well.Surprisingly, the "time running backward" estimate did track fairly well.The "time neutral" estimate of µ tracked well.Choosing the individual knots to be a subset of the population knots resulted in all estimates tracking the mean.
No matter what the choice of knots, the model-based prediction errors for the "time running forward" and "time running backward" individual level covariance structures were unreasonably wide, just as in Figure 4.The "time neutral" covariance structure gave pointwise model-based standard error bands that were similar to those in Figure 4, not reflecting the fact that variability in temperatures between cities is much lower in the summer than in the winter.
Under all three individual level covariance structure assumptions, the sandwich standard errors provided substantial improvement over the model-based standard errors.This improvement led to satisfactory standard errors when the individual knots were a subset of the population knots.However, when the individual knots were not a subset of the population knots, the sandwich standard errors did not track the raw standard errors as closely as one would like.Indeed, in this case, the sandwich standard errors still showed some slight signs of the underlying assumed covariance structure.We see this phenomenon in the analysis of the fruit fly data set, in the next section.
For further details and plots, see the Supplementary Material [13].

Fruit fly data analysis
We analyzed the fruit fly data using all of the techniques described in Sections 3 and 4. Here, we only present some of the Section 4 analyses, based on random µ.Complete analyses are contained in the Supplementary Material.As always, for µ random, we assumed the "time running forward" covariance structure for µ.Throughout this section, analysis is based on piecewise linear functions with 29 equispaced interior knots for the population mean and 4 equispaced interior knots for the individual effects.Thus the individual knots are a subset of the population knots, and Theorem 3.1 would hold if the design were balanced and µ were modelled as a fixed function.
As an ad hoc method of estimating expected VO2 response within each group, we linearly interpolated the data from each individual and calculated pointwise means at a grid of 100 time points.We also calculated pointwise standard errors of these means, using the pointwise standard deviation divided by the square root of the number of individuals.Note that these standard errors are probably too small, as we have used extra "data points" in their calculation.But the standard errors do give some indication of the pattern of variability since the number of observations is fairly evenly spread over the scaled time interval.If, on the other hand, there were few observations over a portion of the interval, our standard errors might be misleading due to the varying level of certainty in those standard errors.
Figure 6 contains results from an analysis using the individual level "time neutral" structure.The plots show the estimates of the expected VO2 response of fruit flies in the selection and control groups, along with pointwise full-sandwich standard errors based on e 2 (t).The Figure also contains the difference of these estimates, along with pointwise standard errors.
Figure 7 shows all types of "time neutral" standard errors for the selection and control groups.Note that the model-based standard errors indicate an approximate homoscedasticity that is not supported by the data.Note also how the sandwich standard errors have corrected this.
Figure 8 shows all types of "time running forward" standard errors.The model-based standard errors display extreme fanning for t near 1.The sandwich standard errors have corrected much but not all of the fanning problem.However, in plots (not shown), we found that estimates of µ under the individual level "time running forward" model seemed to be biased, drifting from the point-wise mean for t near 1.Perhaps the widening of the standard error bars compensates for this drifting of the estimate of µ.
Figures in the Supplementary Material indicate that, when the knots do not satisfy the conditions of Theorem 3.1, the assumed covariance structure has a stronger effect on the sandwich standard errors.

Simulation study
We simulated and analyzed data sets in order to study the validity of the different pointwise standard errors.The model for the simulation was chosen to produce data that mimicked the Canadian weather data; station i's temperature on day j was Y ij = µ(j) + a i [cos(2πj/365) + 2] + ǫ ij .Here µ is of the form µ(t) = γ 1 + γ 2 t + γ 3 cos(2πt/365) + γ 4 sin(2πt/365) + γ 5 cos(4πt/365) + γ 6 sin(4πt/365) where the γ k 's minimize 365 1 (µ(j) − Ȳj ) 2 , where Ȳj = i Y ij /35.The random effect a i is normal with mean 0 and standard deviation 3.5 and the error ǫ ij is normal with mean 0 and standard deviation 0.842721, which is the estimated value of σ ǫ from the data analysis.To compare the simulated data to the real data, Figure 9 shows one simulated data set, the Canadian weather data, the pointwise means for these two data sets, and the standard errors of these pointwise means.Standard errors of estimates of the population mean V02 in the selection and control groups of flies by techniques of Section 4, using 29 equispaced interior population knots and 4 equispaced interior individual knots and the "time neutral" covariance structure for the individual level random effects.The standard errors portrayed are: the ad hoc using linear interpolation (black line), the full sandwich (red) using êfull , the half sandwich (yellow), the full sandwich given δ (green) using êδ,full , the half sandwich given δ (blue) and the model-based (cyan).
We estimated µ in each simulated data set four ways: assuming µ is fixed and the individual random effects have a "time running forward" structure, assuming µ is fixed and the individual random effects have a "time neutral" covariance structure, assuming µ is random and the individual random effects have a "time running forward" structure, and assuming µ is random and the individual random effects have a "time neutral" covariance structure.For random µ, we assumed the "time running forward" covariance structure with Σ δ equal to the identity.All functions were modelled as piecewise linear.We analyzed the data using 39 interior population knots and 7 interior individual knots, so that the individual knots were a subset of the population knots.equispaced interior individual knots and the "time running forward" covariance structure for the individual level random effects.The standard errors portrayed are: the ad hoc using linear interpolation (black line), the full sandwich (red) using êfull , the half sandwich (yellow), the full sandwich given δ (green) using êδ,full , the half sandwich given δ (blue) and the model-based (cyan).
We found that all methods of estimating µ worked well and that modelbased standard errors always performed poorly.All sandwich standard errors performed well.
We present some plots here from the random µ "time neutral" analysis, with more plots available in the Supplementary Material.Figure 10 shows pointwise quantiles of the model-based standard errors (in black) and the full-sandwich standard errors based on êfull (t) (in red).The dashed blue line is the pointwise standard deviation of the 200 estimates of µ -this is a simulated estimate of the target of our standard errors.We see that the sandwich estimator performs well and the model-based errors do not.Figure 11 shows the proportion of times that a supposed 95% or 98% confidence interval for µ actually contains µ.We calculated intervals as μ(t)± z α/2 SE(t), with z α/2 from the normal distribution.The model-based standard error under-covers during much of the mid-year, as expected from Figure 10.The sandwich standard error has moderate undercoverage, but it may be possible to correct this using a t-distribution instead of the normal when constructing confidence limits.We also compared our method to that of Djeundje and Currie [4], using their posted R-code (doi: 10.1214/10-EJS583SUPP) in their files Bases.r and Masters 3.r, with basis set to c("B", "B"), corresponding to fitting their model M1, as described in and near their equation (3.3).In their paper, they reported unusually small standard errors.Standard errors for our simulated data were also small, much smaller than the pointwise standard deviations of the estimates of µ.The pointwise 5th, 25th, 75th 95th quantiles (dashed lines) and 50th quantiles (solid line) of the 200 standard errors from the simulation study of Section 6, with 39 interior population knots and 7 interior individual knots.The analysis assumes that µ is random and that the individual random effects have a "time neutral" covariance structure.The black lines are the quantiles of the model-based standard errors and the red lines are the quantiles of the full sandwich standard errors, using êfull (t).The dotted cyan line is the empirical standard error, that is, the pointwise standard deviation of the estimates of µ.
The resulting pointwise 95% confidence intervals for µ had abut 50% covererage.Discussion of these results is in the Supplementary Material.

Discussion
The use of linear mixed effects modelling as a smoothing tool in the analysis of longitudinal data has increased, with many researchers taking advantage of readily available mixed effects model software.Incorporating spline functions into the analysis at both the population level and the individual levels allows more flexible estimators than those from traditional parametric methods.Additionally, using smoothing-based models B or D for the individual random effects results in fast computation.However, use of these spline models can lead to an incorrect population mean estimate and incorrect pointwise standard errors.Indeed, over-reliance on any particular covariance model for the individual random effects can cause problems.The impact of the assumed covariance structure of the individual effects on the population mean estimate is sometimes serious but can often be remedied by ensuring that the function space for the individual effects is a subspace of the function space for the population mean curve.Therefore, we recommend that the Coverage of confidence intervals in the simulation study of Section 6, assuming that µ is random and using the time neutral covariance structure.The plots show the proportion of times that µ lies within the pointwise 95% confidence interval (solid line) and the 98th percent confidence interval (dashed line).Horizontal red lines are drawn at 0.95 and 0.98.
individual level knots be a subset of the population level knots.This structure has the conceptual advantage of allowing us to readily view the random effects as deviations from the mean.It also has the appealing property of resulting in the ordinary least squares estimate of µ when the design is balanced for the case that µ is modelled as non-random (see Theorem 3.1).As we have seen, this choice of knots also improves performance of our proposed sandwich estimators of standard errors.We have seen from simulation studies that confidence interval coverage is good, without any further correction of bias in the estimate of µ.
For the choice of the number of population level knots, we agree with Ruppert et al (Section 5.5.3 [27]) who state that "the idea is to choose enough knots to resolve the essential structure of the underlying regression function".When we consider µ random, based on our experience, we also agree with the recommendation of Ruppert et al to choose 20 to 40 population knots or one-fourth the number of distinct t values, whichever is smaller.Note, however, that their recommendation was made in a different context, with only one regression data set.Like Ruppert et al in their context, we found that the exact number of population knots had little effect, provided the number was sufficiently large.This is in contrast to µ non-random, where one would typically use a much smaller number of population knots, as too many knots will result in a rough estimate of µ.
At the individual level, we do not recommend using only a random intercept or a random slope and intercept, as this typically is not rich enough to account for individual to individual variability.We have found that, to capture this variability, it suffices to use just a few individual knots.
As we have seen, the choice of the covariance structure for the population curve and the individual deviations can have a large effect on the standard errors.If possible, one should use a covariance structure that arises from the application.For instance, an appropriate covariance structure for the temperature data should reflect the fact that January 1 is just one day after December 31, and so temperatures on these two days are highly correlated.If an appropriate covariance structure cannot be determined a priori, we recommend using the "time neutral" linear Bspline basis with independent and identically distributed coefficients to estimate µ. even this covariance structure may be incorrect, we still recommend basing standard errors on an unrestricted covariance structure, using sandwich standard errors based on (3.7), (3.8), (4.6) or (4.7).We have found that these sandwich standard errors perform best when we have assumed a "time neutral" covariance structure.
The main contribution of this paper is in drawing attention to the problem of model-based standard errors in the smoothing formulation of random regression, and in presenting a better approach.Our two stage methodology, a smoothing-based linear mixed effects fit followed by method of moment estimation of variance parameters, provides fast and flexible analysis of longitudinal data, analysis that is robust to variance model misspecification.

Supplementary Material
Supplement to "Penalized regression, mixed effects models and appropriate modelling" (doi: 10.1214/00-EJS809SUPP; .zip).The Supplementary Material includes code to obtain estimates of µ and standard errors, as described in Sections 3 and 4. Also included are • code to produce plots from the paper; • code to generate simulated data and to run the simulation for our methods; • code to simulate according to the methods of Djeundje and Currie [4]; • details of the results of the simulation study;

Fig 1 .
Fig 1. Velocity of oxygen consumption (V02) of fruit flies, with lifetimes scaled to the unit interval.The selection group has been bred to withstand dry conditions.Data were collected by Donna G. Folk in the laboratory of Timothy J. Bradley at University of California Irvine.
Fig 3.The plot contains four estimates of µ, the typical weather curve, as described in Sections 2 and 3.The solid line is the pointwise average, the dotted line is the "time neutral" estimate, the dot-dashed line is the "time running forward" estimate, and the long dashed line is the "time running backward" estimate.The latter three estimates are based on piecewise linear functions with 41 equispaced interior population knots and 7 equispaced interior individual knots.

Fig 4 .
Fig 4. Plotted are three of the estimates of µ shown in Figure 3, along with corresponding pointwise standard errors.In panel a) the estimate of µ is the pointwise average and the standard errors are the pointwise standard deviations of the 35 temperatures, divided by √ 35.Panels b) and c) show analogous information but with estimates using model B and standard errors calculated using equation (3.4) based on the restricted model B. Panel b) contains the "time running backward" estimate and panel c) contains the "time running forward".Panel d) compares the pointwise standard errors of panel a) to the model B-based standard errors using (3.4) for the "time neutral" estimate of µ (dashed line).

Fig 5 .
Fig 5. Analysis of the Canadian weather data using the methods of Section 3, with 41 interior population knots and 6 interior individual knots.Panel a) shows the mixed model estimate of µ, which by Theorem 3.1 does not depend on the covariance structure of the individual random effects.Panel a) also shows pointwise averages (red line).Panel b) contains pointwise sandwich standard errors of μ using the "time neutral" covariance structure for individual random effects.The dotted line is the half-sandwich standard error based on (3.7) and the dashed line is the full sandwich standard error based on(3.8).Panel c) shows the difference between sandwich standard errors using the "time running forward" and using the "time neutral" covariance structures.The dotted line is the difference for half-sandwich standard errors and the dashed line is the difference for full sandwich standard errors.

4. Random µ 4 . 1 .
Prediction of µ under model D Suppose data are generated according to (2.1) through (2.4) under either model C or D, so that the population effect δ is random.Under either model we write

DifferenceFig 6 .
Fig 6.Estimates and pointwise full-sandwich standard errors of the population mean V02 in the selection and control groups of flies, by techniques of Section 4, using 29 equispaced interior population knots and 4 equispaced interior individual knots.Analysis used the "time neutral" covariance structure for the individual level random effects and full sandwich standard errors are based on ê2 full (t).Also shown is the difference of the estimates, with pointwise standard errors of the difference.
Fig 7.  Standard errors of estimates of the population mean V02 in the selection and control groups of flies by techniques of Section 4, using 29 equispaced interior population knots and 4 equispaced interior individual knots and the "time neutral" covariance structure for the individual level random effects.The standard errors portrayed are: the ad hoc using linear interpolation (black line), the full sandwich (red) using êfull , the half sandwich (yellow), the full sandwich given δ (green) using êδ,full , the half sandwich given δ (blue) and the model-based (cyan).

Fig 8 .
Fig 8.  Standard errors of estimates of the population mean V02 in the selection and control groups of flies by techniques of Section 4, using 29 equispaced interior population knots and 4 equispaced interior individual knots and the "time running forward" covariance structure for the individual level random effects.The standard errors portrayed are: the ad hoc using linear interpolation (black line), the full sandwich (red) using êfull , the half sandwich (yellow), the full sandwich given δ (green) using êδ,full , the half sandwich given δ (blue) and the model-based (cyan).

SEsFig 9 .
Fig 9.  Comparing the Canadian weather data to data simulated as described in Section 6.The upper left plot shows one of the simulated data sets and the upper right plot contains the Canadian data.The lower left plot shows pointwise means of two simulated data sets (black lines) and of the Canadian data (red line).The lower right plot shows the pointwise standard errors of the means depicted in the lower left.
Fig 10.The pointwise 5th, 25th, 75th 95th quantiles (dashed lines) and 50th quantiles (solid line) of the 200 standard errors from the simulation study of Section 6, with 39 interior population knots and 7 interior individual knots.The analysis assumes that µ is random and that the individual random effects have a "time neutral" covariance structure.The black lines are the quantiles of the model-based standard errors and the red lines are the quantiles of the full sandwich standard errors, using êfull (t).The dotted cyan line is the empirical standard error, that is, the pointwise standard deviation of the estimates of µ.
Fig 11.Coverage of confidence intervals in the simulation study of Section 6, assuming that µ is random and using the time neutral covariance structure.The plots show the proportion of times that µ lies within the pointwise 95% confidence interval (solid line) and the 98th percent confidence interval (dashed line).Horizontal red lines are drawn at 0.95 and 0.98.
If the restricted model B holds, then the covariance matrix of ǫ * i is equal to Σ * R i as in (3.1), and var( θ) simplifies to ( C