Gaussian copula marginal regression

This paper identifies and develops the class of Gaussian copula models for marginal regression analysis of non-normal dependent observations. The class provides a natural extension of traditional linear regression models with normal correlated errors. Any kind of continuous, discrete and categorical responses is allowed. Dependence is conveniently modelled in terms of multivariate normal errors. Inference is performed through a likelihood approach. While the likelihood function is available in closed-form for continuous responses, in the non-continuous setting numerical approximations are used. Residual analysis and a specification test are suggested for validating the adequacy of the assumed multivariate model. Methodology is implemented in a R package called gcmr. Illustrations include simulations and real data applications regarding time series, cross-design data, longitudinal studies, survival analysis and spatial regression.


Introduction
Marginal regression models for non-normal correlated responses are typically fitted by the popular generalized estimating equations approach of Liang and Zeger [34].Despite several theoretical and practical advantages, likelihood analysis of non-normal marginal regression models is much less widespread, see Diggle et al. [13].The main reason is the difficult identification of general classes of multivariate distributions for categorical and discrete responses.Nevertheless, likelihood analysis of marginal models has been considered by many authors, see Molenberghs and Verbeke [39] for some references.
Gaussian copulas [47] provide a flexible general framework for modelling dependent responses of any type.Gaussian copulas combine the simplicity of interpretation in marginal modelling with the flexibility in the specification of the dependence structure.Despite this, Gaussian copula regression had still a limited use since for non-continuous dependent responses the likelihood function requires the approximation of high-dimensional integrals.The intents of this paper are to identify the class of Gaussian copula regression models and to show that methods developed for multivariate probit regression can be usefully adapted to the Gaussian copula models in a way to overcome numerical difficulties of the likelihood inference.
The identified class of models stems from and generalizes previous work on Gaussian copula regression models [47,43] and multivariate probit models [7].In particular, we show that with a proper parameterization of the correlation matrix of the Gaussian copula, the ideas of Song [47] can be applied also to the analysis of time series and spatially correlated observations.We suggest model fitting through maximum likelihood.In the continuous case, the likelihood function is available in closed-form.Otherwise, in the discrete and the categorical cases numerical approximations of the likelihood are needed.We propose to approximate the likelihood by means of an adaptation of the Geweke-Hajivassiliou-Keane importance sampling algorithm [30].Another contribute of this manuscript is to interpret the normal scores of the Gaussian copula as errors and thus develop residuals analysis for model checking.We also suggest to validate the inferential conclusions on the marginal parameters through a Hausman-type specification test [22].
The methodology discussed in this paper has been implemented in the R [44] package gcmr -Gaussian copula marginal regression -available from the Comprehensive R Archive Network at url http://cran.r-project.org/web/packages/gcmr.

Framework
Let Y = (Y 1 , . . ., Y n ) T be a vector of continuous, discrete or categorical dependent random variables and let y = (y 1 , . . ., y n ) T be the corresponding realizations.Dependence may arise in several forms, as for example repeated measurements on the same subject, observations collected sequentially in time, or georeferenced data.We consider situations where the primary scientific objective is evaluating how the distribution of Y i varies according to changes in a vector of p covariates x i = (x i1 , . . ., x ip ) T .Dependence is regarded as a secondary, but significant, aspect.
Denote by p i (y i ; λ) = p(y i |x i ; λ) the density function of Y i given x i , so that covariates are allowed to affect not only the mean of Y i but the entire univariate marginal distribution.Within this framework density p i (y i ; λ) identifies the regression model.Without further assumptions about the nature of the dependence among the responses, inference on the marginal parameters λ can be conducted with the pseudolikelihood constructed under working independence assumptions, If the marginals p i (y i ; λ) are correctly specified, then the maximum independence likelihood estimator λind = argmax λ L ind (λ; y) consistently estimates λ with no specification of the joint distribution of Y given the model matrix X = (x 1 , . . ., x n ) T .Despite this important robustness property, there are various causes of concern with the above estimator.First, it may suffer from considerable loss of efficiency when dependence is appreciable.Second, standard likelihood theory does not apply and corrected standard errors should be based on sandwich-type formulas [57], whose computation can be difficult when the response vector Y does not factorize into independent subvectors.Finally, predictions that do not take into account dependence may be of poor quality.
For these reasons, complementing the regression model p i (y i ; λ) with a dependence structure is important for attaining more precise inferential conclusions and predictions.The ideal model would contain all the possible joint distributions of Y with univariate marginals p i (y i ; λ), i = 1, . . ., n.However, this broader semiparametric model appears too general to be of practical use.Hence, in the following we identify and develop a narrower parametric model which is flexible enough for many applications.

Gaussian copula marginal regression models
In very general terms, a regression model is expressed as where g(•) is a suitable function of the regressors x i and of an unobserved stochastic variable ǫ i , commonly denoted as the error term.It is assumed that the regression model ( 2) is known up to a vector of parameters λ.Among the possible specifications for the function g(•), a useful choice is where ǫ i is a standard normal variable and F i (•; λ) = F(•|x i ; λ) and Φ(•) are the cumulative distribution functions of Y i given x i and of a standard normal variate, respectively.By the integral transformation theorem, the regression model (3) ensures the desired marginal distribution for the response Y i and specifies ǫ i in the familiar terms of a normal error.Specification (3) includes all possible parametric regression models for continuous and noncontinuous responses.For example, the Gaussian linear regression model 3), with λ = (β T , σ) T , while the Poisson log-linear model is obtained by setting where µ i = exp(x T i β), with λ ≡ β.For subsequent developments, it is important to notice that the mapping between the response Y i and the error term ǫ i is one-to-one only in the continuous case, otherwise the mapping is one-to-many and hence the relationship (3) between Y i and ǫ i cannot be inverted.
This manuscript deals with regression analysis in presence of dependence.Model specification is then completed by assuming that the vector of errors where Ω is a correlation matrix.The special case of independent observations corresponds to Ω = I n , the n × n identity matrix.Model identifiability requires that ǫ has zero mean vector and unit variances because univariate characteristics are modelled separately in the marginals p i (y i ; λ).The model specification conveniently separates the marginal component (3) from the dependence component (4), the latter being described in terms of a multivariate normal process.Various forms of dependence in the data can be modelled by suitably parametrizing the correlation matrix Ω as a function of a vector parameter τ , see Section 3.1 for some common examples.The whole parameter vector is denoted by θ = (λ T , τ T ) T .Thereafter, λ is termed the vector of marginal parameters and τ the vector of dependence parameters.
The model ( 3)-( 4) offers a natural interpretation in terms of the copula theory [28] where the normal scores ǫ i are seen as nonlinear transformations of the variables Y i and not as error terms.In this manuscript, we prefer the errors interpretation for the ǫ i because we think this provides a clear connection to other regression approaches and because this interpretation facilitates the presentation of the residuals analysis described later in Section 7.1.
The approach investigated here differs from much of the existing literature about copula modelling where viceversa marginals are treated as nuisance components and interest lies on the dependence parameters of the copula.For an example of the latter use of copulas see the work on semiparametric Gaussian copula modelling by Hoff [23] and the references therein.
As stated in Section 2, ideal inferences on λ should be based on the semiparametric model of all the possible copulas with marginals p i (y i ; λ).The model discussed here restricts to a particular copula, namely the Gaussian copula [47].This choice appears advantageous because it naturally inherits several wellknown properties of the multivariate normal distribution, see e.g.Nikoloulopoulos et al. [41].A limit of the normality assumption is that the full multivariate dependence structure is induced by the bivariate dependencies.Other copulas might be considered as well but at the cost of lessened interpretability.Further, simulation studies as that reported in Section 8 suggest some amount of robustness of the Gaussian copula to local misspecification of the dependence structure.In the rest of the paper, the class of models identified by the pair of equations ( 3)-( 4) is termed Gaussian copula marginal regression (GCMR) models.
Song [47] introduced Gaussian copula generalized linear models for longitudinal data analysis.See also Song et al. [50] for an extension to multivariate longitudinal responses, also of mixed type.Pitt et al. [43] develop efficient Bayesian analysis of multivariate regression models using an unstructured correlation matrix for the copula.These models are examples of GCMR.Alternatively, GCMR may be seen as multivariate probit regression with marginals F i (y i ; λ).The multivariate probit model with logistic marginals considered by Le Cessie and Van Houwelingen [33] for longitudinal binary data and the correlated probit model for joint modelling of clustered observations of mixed-type by Gueorguieva and Agresti [20] may also be interpreted as special cases of the model class discussed in this paper.Recent advances on the joint copula analysis of mixed dependent data can be found in a series of papers by A.R. de Leon and colleagues [11,12,58].

Dependence models
The dependence structure in GCMR is modelled through the specification of an appropriate correlation matrix Ω of the errors vector ǫ.Although any correlation matrix Ω is allowed, we identify here some particular model types that seem likely to have wide application.All the models described below are implemented in our package gcmr.

Longitudinal and clustered data
Suppose observations are grouped in m clusters of dimensions n r , r = 1, . . ., m, with m r=1 n r = n.This is the case of longitudinal or panel data where m subjects are observed on n r occasions each.Under the standard assumption of independence between different subjects or groups, appropriate correlation matrices for the errors are block-diagonal, where Ω r is a n r × n r correlation matrix.Similarly to the method of generalized estimating equations [34], we can identify some useful correlation structures for a generic block Ω r .Consider indices i and j denoting two observations belonging to the same cluster r, i.e. (n Then, possible correlation structures are 1.exchangeable, corr(ǫ i , ǫ j ) = τ for each choice of indices i and j; 2. autoregressive of order one, corr(ǫ i , ǫ j ) = τ |i−j| ; 3. moving average of order q, corr(ǫ i , ǫ j ) = τ |i−j| for |i − j| ≤ q; 4. unstructured, corresponding to a correlation matrix without any restriction.
Song [47,48] studies models of this type for the special case of marginals given by generalized linear models and calls them vector generalized linear models.However, Gaussian copulas can be used to join any type of marginals not only those belonging to the exponential family.As an example of this, censored clustered responses are later analysed with a Weibull regression model in Section 9.3.The work by P. Song focuses on longitudinal data with small size clusters.One of the intents of this paper is to show that Gaussian copula regression models can be also used for the analysis of larger dimensional processes observed in time series and spatial statistics.

Time series
Marginal regression models with stationary time series errors for equi-spaced observations may be specified by assuming that Ω is the correlation matrix induced by an autoregressive and moving average process of orders p and q.An illustration regarding serially correlated counts is discussed in Section 9.1.

Spatial data
Regression analysis with spatial and spatio-temporal dependent responses may be modelled by assuming that errors are generated from a stationary Gaussian random field [10].A flexible choice for Ω in the spatial case is the Matérn isotropic correlation function where s i are the coordinates of the ith observation and B τ2 is the modified Bessel function of order τ 2 .The two parameters τ 1 and τ 2 both need to be strictly positive.For more details on these and other spatial correlation functions see, for example, Cressie [10] and Diggle and Ribeiro [14].An illustration of spatial regression of counts is provided in Section 9.4.

Distributional forms
GCMR models differ from other marginal models in the form of bivariate and higher order dimensional joint distributions.In the continuous case, the mapping (3) between ǫ i and Y i is one-to-one, so that marginal distributions of the responses are readily obtained by standard transformation rules from the distribution of the corresponding errors.For example, in the bivariate case we have p ij (y i , y j ; θ) = p i (y i ; λ)p j (y j ; λ)q(ǫ i , ǫ j ; θ), where is the density of the bivariate Gaussian copula.Given the model assumptions, p(ǫ i ; λ) is a univariate standard normal density, while p(ǫ i , ǫ j ; θ) is a bivariate normal density with zero means, unit variances and correlation given by the element at position (i, j) in matrix Ω.
In the categorical and discrete cases, mapping (3) is many-to-one.It follows that the joint marginal distributions are expressed by multivariate normal integrals.For example, the bivariate marginal distribution is the two-dimensional integral whose domain is the Cartesian product of intervals where F i (y − i ; λ) is the left-hand limit of F i (•; λ) at y i .If the support of Y i is contained in N, as for binomial and Poisson marginals, then y − i = y i − 1.

Dependence properties
In the special case of linear regression models with normally distributed errors, the correlation between pairs of responses, given the corresponding covariates, coincides with the correlation of the corresponding normal errors corr(ǫ i , ǫ j ).
Otherwise, the correlation between Y i and Y j is a nonlinear function of the correlation of ǫ i and ǫ j .This nonlinear function can be computed by a variety of numerical integration methods.Recently, Kugiumtzis and Bora-Senta [32] suggested to approximate these correlations making use of piece-wise linear approximations.
In the copula theory, alternative measures of association based on ranks are often used.The two most popular rank measures are the Kendall τ and the Spearman ρ.A recent illustration of the use of the Kendall τ in longitudinal regression analysis is given by Parzen et al. [42].Song [48, §6.3.3]supplies a detailed discussion of the relative merits of the various association measures in Gaussian copula models.In particular, it is shown that Spearman ρ is very close to corr(ǫ i , ǫ j ) and this is positively correlated with the Kendal τ index.
Interpretation of the dependence structured inherited by the Gaussian copula requires some care because the correlation of the responses is attenuated with respect to the correlation of the errors, as shown by Klaassen and Wellner [31].The attenuation is often considerable, in particular in the noncontinuous case where the margins restrict the range of possible association between the responses.A special case that has received much attention is when the margins identify a probit model and corr(ǫ i , ǫ j ) corresponds to the tetrachoric correlation [21].However, the restricted range of dependence is a common problem for any multivariate analysis of discrete variables.See also Genest and Nešlehová [17] for a detailed discussion about the correct interpretation of dependence measures in copula models for count data.
Although closed-form computation for bivariate, and higher order, moments is not available, some key aspects of the dependence structure of the model (3)-( 4) are easily derived.
Property 1.If errors ǫ i and ǫ j are uncorrelated, then the corresponding pair of responses Y i and Y j are independent given the covariates x i and x j .This coherency quality is obvious from bivariate distribution expressions in the continuous (6) and in the non-continuous (7) cases.See also Bodnar et al. [4] where this result is discussed together with other useful properties of Gaussian copula models.To appreciate Propriety 1, consider the special case of regression with stationary time series errors.The property states that if errors follow a moving average process of order q, then responses that are more than q units apart are independent.

Property 2. The direction of the association between any pair of responses Y i
and Y j given the covariates x i and x j coincides with the sign of the correlation between the corresponding pair of errors ǫ i and ǫ j .This is a direct consequence of mapping (3) being non-decreasing.This property ensures that the correlation structure of the normal errors does not modify the direction of the dependence between the responses given the covariates.As illustrated in the examples in Section 9, this simple result is very useful for interpretation of the fitted model.Property 3. If the error vector ǫ follows a Markovian process of order p and all the marginals p i (y i ; λ) are continuous, also the response vector Y given the model matrix X then follows a Markovian process of order p.
This property is easily verified in the special case of time-series or longitudinal data.In fact, the conditional density of a continuous response Y i given its predecessors is If the errors are Markovian of order p, the above conditional density reduces to the limited memory conditional density, indeed Similarly, but with notational complications, it can be shown that if the errors ǫ are realizations from a Gaussian Markovian random field, then the multivariate continuous responses Y are realizations from a Markovian random field.For time series, if Property 1 states a kind of parallelism between moving average errors and responses, then Property 3 extends the parallelism also to autoregressive processes but only for continuous responses.

Maximum likelihood inference
Suggested fitting of GCMR models is through the method of maximum likelihood.A clear advantage of this approach is that standard tools for hypothesis testing and model selection, such as likelihood ratio statistics and information criteria, can be used.Along the following lines, details of likelihood computations are discussed.For this purpose, it is convenient to treat the continuous case and the discrete or categorical case separately.We start from the former because it is simpler and is propaedeutic to the latter.The one-to-one relationship between responses Y and errors ǫ in the continuous case yields the likelihood for θ as where the copula density can be interpreted as the likelihood ratio between the assumed multivariate normal model for the errors and that under the independence hypothesis.Hence the likelihood L (θ; y) is obtained by sharpening the independence likelihood L ind (λ; y) through a measure q(ǫ; θ) of the evidence for dependence among the errors.
More difficult is the case of discrete or categorical-valued responses.In this case, likelihood evaluation requires the computation of the n-dimensional rectangular integral It is convenient to re-express the above integral by considering the change of variable from Then, likelihood (9) assumes the form whose interpretation is much the same as in the continuous case (8) except for the adjustment term which is an average over likelihood ratios of type q(ǫ; θ) but computed at the randomized errors ǫ i (u i ) given by expression (10).

Likelihood computation
In the non-continuous case the likelihood is expressed in terms of the Gaussian probability integral (9).Whenever this integral factorizes in low-dimensional terms as in longitudinal studies with few observations per subject, we suggest to use precise deterministic approximations as the method of Joe [27] or other recent numerical methods by Miwa et al. [38] and Craig [9].Joe's and Miwa's algorithms are both publicly available through R packages mprobit, authored by Joe, Chou and Zhang, and mvtnorm [24], respectively.For larger dimensions, occurring with time series, spatial data, longer longitudinal studies, the computational cost of deterministic approximations is too large for practical use.Hence, randomized methods need to be considered.An example is the randomized quasi-Monte Carlo method of Genz and Bretz [18] and also included in the above mentioned R package mvtnorm.This numerical method is of general purpose and may not be efficient for the particular models considered in this paper.For this reason, in the next subsection we describe another Monte Carlo approximation of the likelihood for non-continuous responses, tailored to the GCMR model class.

Importance sampling
Expression (11) suggests the following simple Monte Carlo approximation of the likelihood: 1. for k = 1, . . ., K, (a) generate n independent uniform (0,1) variates, u n ; (b) compute the randomized errors ǫ 10); (c) compute the Gaussian copula density Unfortunately, this approximation turns out to be quite inefficient.Indeed, consider the importance sampling approximation of the likelihood, where ǫ (k) is a vector of randomized errors drawn from the importance sampling distribution p IS (ǫ|y; θ).It follows that approximation (12) corresponds to an importance density constructed under the working assumption that the errors ǫ i are independent given the observations y i , Although valid, this importance density may be very inefficient because of the strong independence assumption.Obviously, the ideal importance density would be the exact conditional density p(ǫ|y; θ): with this choice, each term in sum ( 13) is exactly equal to p(y; θ) and thus a single draw (K = 1) would be sufficient to restore the exact likelihood.Unfortunately, using this ideal distribution is unfeasible because the importance sampling weights depend on p(y; θ).By noticing that a draw from the ideal p(ǫ|y; θ) could be obtained by sampling sequentially from p i (ǫ i |y i , . . ., y 1 ; θ), i = 1, . . ., n, we replace this unmanageable importance density with sequential sampling from density For the special case of multivariate probit regression, the above importance sampling density corresponds to the popular Geweke-Hajivassiliou-Keane simulator (GHK), see for example Keane [30].
The extension of the GHK simulator to deal with GCMR is immediate.Under the model assumptions, p(ǫ i |ǫ i−1 , . . ., ǫ 1 ; θ) is a normal density with mean m i = E(ǫ i |ǫ i−1 , . . ., ǫ 1 ) and variance v 2 i = var(ǫ i |ǫ i−1 , . . ., ǫ 1 ).Thus, ( 14) is a truncated normal density over the interval D i (y i ; λ) and a draw from it is obtained by setting where u 1 , . . ., u n are n independent draws from a uniform (0,1) random variable and Hence, the resulting sequential sampler algorithm for approximating the likelihood is 1. for k = 1, . . ., K, (a) generate n independent uniform (0,1) variates, u n ; (b) compute the randomized errors ǫ . A few comments on numerical aspects are in order.Quantities m i and v 2 i can be efficiently computed by the Cholesky factorization of Ω which, in any case, is an integral component of the likelihood computation.Substantial computational saving is achieved by exploiting the fact that the error correlation matrix Ω is the same for all the simulated error vectors ǫ (k) , k = 1, . . .K.
Other importance densities could be considered, for example following the ideas of Durbin and Koopman [16].However, the increase in computational cost may not be justified in terms of numerical precision given the simplicity and the good results obtained using the suggested sampler.See the simulation study in Section 6.
In general, the total complexity for one likelihood approximation with the discussed importance sampling, as well as with any other likelihood approximation for GCMR models, is of order O(n 3 ), because of the necessary inversion of the correlation matrix Ω.However, for specific problems the computational cost is much lower.For example, in time series models with ǫ following an autoregressive moving average process the Cholesky factorization can be efficiently implemented via the Kalman filter and only O(n) computations are needed.If n is large, the general computational cost O(n 3 ) is impractical.Some possible remedies are discussed in the final section within the directions for future research.
Maximum likelihood estimates are better computed from the log-likelihood.Importance sampling is however designed for giving unbiased estimates of the likelihood, while log L IS (θ; y) is a slightly biased estimator of the log-likelihood ℓ(θ; y).In our package gcmr, an approximately unbiased estimator of the loglikelihood is obtained with the correction given by Durbin and Koopman [16, §2.3].
The Monte Carlo size needed for correct inferential conclusions is typically a function of the degree of discreteness, the most difficult case being binary responses.As a general suggestion, it is advisable to repeat the analysis to check the adequacy of the Monte Carlo size, for example starting with a small Monte Carlo size and then increasing it until differences in the parameter estimates become practically insignificant.See Section 9.5 for some guidelines about the choice of the Monte Carlo size.

Simulations under model conditions
We have carried out several simulation studies to investigate the reliability of the importance sampling approximation.Here we only show the results for a time series log-linear regression model with a continuous time-varying covariate and two seasonal terms.Denote by F i (y i ; λ) the cumulative distribution function of a Poisson random variable of mean µ i = E(Y i |x i ).Data are generated from the marginal regression model ∼ N(0, 1), (16) with the correlation matrix of the errors Ω corresponding to that of an autoregressive process of order one with first-order autocorrelation equals to 0.8.For values of the marginal parameters β 0 = 1, β 1 = β 2 = β 3 = 0.2, we generated 300 time series of different lengths.Then, for each simulated series, five different estimates of the parameters were independently computed.In all cases, the log-likelihood was approximated by using K = 300 sequences of uniform pseudo-random numbers u n , k = 1, . . ., K. Columns 2 and 5 of Table 1 display the averages of the 300 × 5 = 1, 500 estimates for time series of lengths n = 100 and n = 500, respectively.Clearly, the averages of the estimates are close to the true values.It is more interesting to evaluate the impact of the Monte Carlo errors.For this purpose, we compute the standard ANOVA decomposition of the total sum of squares into between and within parts, using the 300 time series as grouping factor.Columns 3-4 and 6-7 of Table 1 show the between and within sums of squares measuring the variability due to the true estimation errors and to Monte Carlo errors, respectively.With the chosen Monte Carlo sample size, the variability due to the simulation is much smaller than that due to estimation errors.Similar results have been obtained for other parameter values, other dependence structures and other discrete marginal distributions.

Model checking
Validation of the model assumptions is a crucial aspect in any regression analysis, particularly in our multivariate setting.Model checking by residual analysis and by a Hausman [22] type specification test is discussed below.

Residuals
In the continuous case, the Rosenblatt's transformations [45] M are uniformly and independently distributed in the unit interval.Hence, model adequacy can be checked through residuals which are realizations of n uncorrelated standard normal variables if the model is correctly specified.These type of Cox and Snell residuals [8] are termed (normalized) quantile residuals in Dunn and Smyth [15] and normal pseudo-residuals in Zucchini and MacDonald [63].Normality of the quantile residuals can be inspected by normal probability plots and tests.Assessment of lack of correlation can involve graphical tools as autocorrelation plots for time series and longitudinal studies and variograms for spatial data.
In the non-continuous case, we define residuals r i to be any arbitrary value belonging to the interval where the lower extreme is defined as m − i = F i (y − i |y i−1 , . . ., y 1 ; θ), the left-hand limit of m i at y i .Accordingly, we base model diagnostic on the so-called randomized quantile residuals where u i is a draw from a (0, 1) uniform variate [15].Under model conditions, r rnd i (u i ) are realizations of uncorrelated standard normal variables and thus they can be used as ordinary residuals for checking model assumptions.Since residuals r rnd i (u i ) depend on the uniform variates u i , it is appropriate to inspect several sets of residuals before taking a decision about the model.
Alternatively, it is possible to avoid randomization by considering the mid interval quantile residuals r mid , as suggested by Zucchini and MacDonald [63] in the context of hidden Markov models.These residuals are, however, neither normally distributed nor uncorrelated.Zucchini and MacDonald [63] also suggest diagnostics based on the interval quantile residuals in order to preserve the data discreteness.However, interval quantile residuals are problematic when observations are the minimal or the maximal possible values, since in these cases r int i are left or right open (infinite) intervals.

Hausman-type specification test
In most cases there is scientific interest in the marginal parameters λ or a subset of regression parameters β.The assumed marginals can be checked through a variety of well-developed graphical and numeric methods, thus correct specification of the independence likelihood is generally accomplished.This leads to the dilemma of basing inference on λ on the safe but inefficient independence likelihood or considering the complete model but with the risk of copula misspecification.In other terms, we are interested in assessing the null hypothesis H 0 : the assumed multivariate model is correctly specified, against the alternative H 1 : marginals are correct but the assumed multivariate normal distribution for the errors is not (wrong copula).
The independence likelihood estimator λind is consistent under both null and alternative hypotheses, while the maximum likelihood estimator θ = ( λT , τ T ) T is consistent and efficient under the null hypothesis but inconsistent under the alternative.This suggests validating the correct model specification by the Hausman [22]-type statistic with the variance D = var( λind − λ) computed under the null hypothesis, where h(Y) is distributed as a chi-squared random variable with dim(λ) degrees of freedom.
Remark.The above framework differs from the usual Hausman test, which is focused on the complete parameter θ and not on its subset λ.Consequently, the Hausman orthogonality result does not apply, that is cov( λind − λ, λ) = 0. Hence, D does not simplify into the difference of variances var( λind ) − var( λ).
Under the null hypothesis, the vector of estimates ( λT ind , θT ) T has asymptotic variance matrix where Hence, D = C T VC for a contrast matrix C = (I T , −I T , 0 T ), where the identity blocks I have the dimension of λ.Generally, the components of matrix V are unavailable in closed-form.They can be estimated by Monte Carlo simulation from the assumed model.Alternatively, one can consider a more accurate, but computationally costly, direct estimation of D via parametric bootstrap.In the examples discussed in Section 9, the test statistic is computed via parametric bootstrap.
The Hausman-type test can be used for checking the correctness of inference about the global marginal parameter λ as well as for checking subsets of λ, such as single regressors β i , or combinations of regressors, in conformity with the scientific focus.

Robustness to the misspecification of the error distribution
Despite the residual analysis described in the previous section, a thorough validation of the multivariate normal assumption for the errors ǫ remains a difficult task.Hence, it is natural to ask whether correct inferences for λ can be obtained under local misspecification of the multivariate distribution for the errors.A general answer to this question is also difficult, however simulation studies such those reported below give a preliminary positive indication for robustness against local misspecification of the error distribution.
We illustrate two different simulation studies.In the first study, we investigate the robustness to the misspecification of the ARMA correlation of the errors and to the presence of heavy tails in their marginal distribution.In the second study, we consider the effect of non-linearity in the interdependence among the errors and skewness in their marginal distribution.

Effect of heavy tails and unspecified moving average correlation
We consider the same simulation setting of Section 6 with marginal structure as in formulas ( 16), but with a different error model, namely: 1. the errors follow an ARMA(1,1) normal model; 2. the errors follow an AR(1) model but are distributed as Student t random variables with ν degrees of freedom; 3. the errors follow an ARMA(1,1) model and are distributed as Student t random variables with ν degrees of freedom.
In each case, estimates are obtained from the misspecified model with normal AR(1).Table 2 displays 1, 000 simulated estimates for the covariate parameter β 3 for (i) various choices of the degrees of freedom ν, (ii) with or without a moving average component and (iii) different sample sizes.We choose to show results only for the time-varying covariate parameter for space limitations: results for the other marginal parameters are similar.Table 2 reports (i) the averages of the estimates, (ii) their standard deviations, (iii) the averages of standard errors computed from the Fisher observed information assuming the model is correct and (iv) the empirical coverages of 95% confidence intervals.For comparison, we also report the corresponding estimates using the independence likelihood with standard errors computed from the sandwich heteroskedasticity and autocorrelation consistent estimator for time-series data (Zeileis, 2006) to account for the serial dependence.
The first column of Table 2 corresponds to correct specification of the error distribution.Subsequent columns describe increasing levels of misspecification, the extreme case being errors that follow the ARMA(1,1) process with autoregressive parameter 0.8 and moving average parameter 0.2 and are distributed as Cauchy random variables (ν = 1).
We start by describing the results of the first column, where the two estimation methods are both correctly specified.As expected, maximum independence likelihood estimates show loss of efficiency with respect to estimates based on the fully specified dependence model.It is however more interesting that standard errors of the maximum independence likelihood estimates are strongly biased with n = 50 and 100, leading to over-optimistic confidence intervals.This result illustrates the well-known very slow convergence of robust variance estimation using sandwich-type methods; see Kauermann and Carroll (2001) for a general discussion on this point.
The other five columns of Table 2 show what happens if the dependence model is misspecified.In this case, the maximum independence likelihood estimate λind remains correct, while the estimator θ = ( λT , τ T ) T derived from the incorrect dependence model is inconsistent for the complete parameter θ.However, our simulations show that the component λ of θ relative to the marginal parameters remains essentially correct under local misspecification of the errors.Furthermore, standard errors wrongly computed by inverting the observed Fisher information matrix do not exhibit a practically relevant bias.In contrast, robust sandwich standard errors for the maximum independence likelihood grossly underestimate the uncertainty if the sample size is not large enough.

Effect of skewness and non-linear dependence among the errors
In the second simulation study, we again consider the same simulation setting of Section 6 with marginal structure as in formulas (16).In this case, the errors are generated from the following scaled threshold autoregressive model of order one TAR(1) [52] ∼ N(0, 1). ( This model yields two forms of misspecification with respect to the assumed scaled AR(1) model for the errors ∼ N(0, 1).( 18) First, there is a non-linear dependence among the errors because of the absolute value of the past error in the recursive equation (17).Secondly, the marginal distribution of the errors is not standard normal anymore but it is the skew-normal distribution [3] with location parameter equal to zero, unit scale parameter and skewness parameter τ / Hence, as the autoregressive parameter τ raises in absolute value, both the nonlinear dependence and the skewness of the errors increase together.Figure 1 illustrates the increasing difference between the stationary distribution of the scaled AR(1) process and the scaled TAR(1) process for values of τ ∈ {0.3, 0.6, 0.9}.Table 3 displays

Table 3
Averages, standard deviations (sim.s.e.), averages of estimated standard errors (est.s.e.) and 95% confidence intervals coverages for 1, 000 estimates of parameter β 3 in model ( 16) with errors that follow the TAR(1) model (17) for the previous simulation study, standard errors for the maximum likelihood independence estimator are based on the sandwich heteroskedasticity and autocorrelation consistent estimator, while those of the misspecified Gaussian copula model are obtained from the Fisher observed information as if the model were instead correct.
Once more, the maximum likelihood estimates of the marginal parameters λ based on the Gaussian copula model do not show significant bias despite the misspecification of the errors.For values of τ equal to 0.3 and 0.6, the misspecified maximum likelihood estimator λ has essentially the same efficiency as the maximum independence likelihood estimator λind .However, the standard errors of the latter estimator tend to underestimate the uncertainty, especially for moderately small sample sizes.Accordingly, the empirical coverage of the confidence intervals is sensibly smaller than the nominal value.By comparison, the coverage of the confidence intervals based on the misspecified Gaussian copula model are much closer to the nominal values.For values of τ that approach the limit of one, the maximum likelihood estimates based on the Gaussian copula model are remarkably more efficient than the maximum independence likelihood estimates.
In synthesis, the two simulation studies show how inferences from an incorrect dependence model can be more precise than inferences obtained from a correct model that avoids specification of the dependence structure.This form of robustness is a consequence of the ability of the nuisance parameters τ to accommodate departures in the dependence structure in such a way as to keep inferences on marginal parameters λ broadly correct.Similar findings were obtained for other models.

Examples
In this section we illustrate the model flexibility through a variety of examples covering applications to time series, crossed design experiments, survival analysis and spatial data.All examples are based on well known data sets, in order best to facilitate comparisons with other possible models and fitting methods.Some comments about computational aspects are provided in Section 9.5.

Generalized linear models with time-series errors
The time series of monthly Polio incidences in the USA from 1970 to 1983 has been analyzed by several authors with different observation-and parameterdriven models since Zeger [59].Among many others, some useful references may be found in Song [48, §12].The scientific question is whether or not there is evidence of a decreasing trend of Polio infections in the observation period.
Following previous analyses of these data, we consider a log-linear model with covariates designed to capture possible linear trend and seasonality effects, where t i is the time of the ith observation.Marginals are modelled through a negative binomial distribution with mean E(Y i |x i ) = µ i and variance var(Y i |x i ) = Serial correlation between polio counts is accommodated by assuming an ARMA(p,q) model for the errors.
According to the corrected Akaike information criterion [25], the best fit is provided by GCMR with ARMA(2,1) errors.The first and second order autoregressive parameters are estimated as (standard errors in brackets) −0.53(0.21)and 0.31(0.09),while the moving average parameter as 0.71(0.22).The first two autocorrelations coefficients induced by the fitted ARMA(2,1) model are corr(ǫ i , ǫ i−1 ) = 0.15 and corr(ǫ i , ǫ i−2 ) = 0.24.By Property 2, it follows that, conditionally to the covariates, there is positive association between Y i and Y i−1 and between Y i and Y i−2 .Table 4 summarizes estimates for the marginal parameters from the working independence model and from the ARMA(2,1) dependence model.
Standard errors of the independence likelihood estimates are computed using the heteroskedasticity and autocorrelation consistent (HAC) sandwich estimator for time series of Andrews [2].The estimates of the overdispersion parameter κ are significantly different from zero and thus negative binomial marginals are preferable to Poisson marginals.The point estimate of the trend parameter, that is the parameter of interest, is −4.33 under working independence.This value is very close to the estimate −4.31 obtained via the GCMR model with ARMA(2,1) errors.However, the latter model provides a standard error (s.e.2.30) substantially larger than that derived under working independence assumptions using the HAC sandwich (s.e.1.84).Correspondingly, the Wald test statistic for the one-sided hypothesis of negative trend (H 0 : β 1 = 0 vs. H 1 : β 1 < 0) has a p-value of 0.03 with ARMA(2,1) errors and a p-value of 0.009 under the working independence assumption.In both the cases, the conclusion is in favor of a positive trend, although at a very different level of confidence.
As regard model checking, the Hausman-type test is passed overall (h = 3.49, p-value 0.83) and also for all single marginal parameters as shown in last three columns of Table 4. Randomized quantile residuals computed several times sustain the distributional assumptions and suggest that no residual serial correlation is present in the data.Normal probability and autocorrelation plots for a set of residuals are displayed in Figure 2.

Cross-correlated binary data
The Salamander mating data come from a study on barriers to interbreeding in two geographically isolated species of salamanders called Whiteside (W) and Rough-Butt (R).These much studied data have a rather complicated incomplete, balanced, crossed design.The data consist of three experiments, one conducted in Summer 1986 and two in the Fall of the same year.A total of 80 animals were used: the same 40 salamanders in Summer 1986 and in the first Fall experiment.A new set of 40 salamanders was used in the second Fall experiment.Each experiment involved 10 females and 10 males for each of the two populations.Each female was paired six times: three times with males of her own population and three times with males of the other population.The binary response records whether the mating event was successful or not.See McCullagh and Nelder [37] for more details.These data have been analysed in a number of papers with crossed-random effects logistic or probit models.The random effects are designed to account for the correlation between the results of two matings involving the same female or the same male.The likelihood for this mixed-effects model is very difficult to manage because of the high-dimensional integrals involved.Typical solutions are based on simulation methods.Examples include Markov chain Monte Carlo [60], Monte Carlo expectation-maximization [5] and importance sampling [51] algorithms.
Here we consider a marginal regression analysis of the Salamander data.The proposed model is the marginal counterpart of model "B" of Zeger and Karim [60].For the marginal part we assume the logistic model with x 1i indicating whether the female used in the ith mating belongs to the R race, x 2i whether the male belongs to the R race and x 4i whether the experiment took place in Fall (x 4i = 1) or in summer (x 4i = 0).The correlation matrix Ω of the errors is partitioned in blocks of dimension 120 × 120, Block Ω S describes the cross-correlation structure in the Summer experiment.Apart from the diagonal, the correlation matrix Ω S has zero entries everywhere but in the cells corresponding to two different matings done in summer involving the same female and/or the same male.Thus, there are three types of non-zero entries: the correlation between errors due to the use of the same female in two distinct experiments in Summer is measured by parameter τ 1 , the correlation between errors due to the use of the same male by parameter τ 2 , finally the sum τ 1 + τ 2 measures the correlation between errors when both the same female and male are used in Summer.
Similarly, block Ω F accounts for cross-correlation between errors in the two Fall experiments and it is parameterized by τ 3 and τ 4 , the first parameter related to female cross-correlation, while the second one to male cross-correlation.Finally, block Ω S,F accounts for the association between the Summer and Fall experiments sharing the same animals and it is parameterized by parameters τ 5 and τ 6 for females and males, respectively.
The main scientific focus is however on the marginal probabilities of matings.Denote by π RW the probability that an R female successfully mates with a W male, and similarly denote by π RR , π WR and π WW the remaining probabilities.The estimates of these quantities are easily obtained from the estimated β coefficients.For example the maximum likelihood estimate of π WR is exp( β0 + β1 )/{1 + exp( β0 + β1 )} = 0.27.All the other estimates are listed in Table 5 along with 95% confidence intervals.For comparison, we also report estimates and confidence intervals obtained by Zeger and Karim [60] with a random effects logistic model fitted by Gibbs sampling.Results from the two models are very similar, although the marginal-model confidence intervals are narrower in most cases.The clear conclusion is that the probability of a successful mating is much smaller when a W female is matched with an R male, than in all of the other three cases.

Matched time-to-event data
For an illustration of survival data analysis, we reanalyse the data of Mantel et al. [36] about times to tumor appearance in litter-matched rats.Three female α }, where α denotes the shape parameter, η i = exp(β 0 + β 1 x i ) with x i being the indicator for treatment and λ = (α, β 0 , β 1 ) T .This model corresponds to both an accelerated life model and a proportional hazard model.

Spatial regression with count data
Data on incidence of male lip cancer in Scotland during years 1975-1980 have been analysed by many authors for illustrating varying disease mapping methods, see Waller and Gotway [55] and Wakefield [56].Data consist of observed Y i and expected number of cases e i in each of the 56 counties of Scotland and they are available through the home page of Waller and Gotway's book at http://www.sph.emory.edu/~lwaller/book/ch2/scotland.dat.Interest lies in studying whether excess of cases can be associated with the proportion of the population employed in agriculture, fishing, or forestry (AFF), whose map is displayed in the right panel of Figure 3.Some authors refer to a possible spatial trend effect along the south-north direction, with a demographic interpretation because the most northern counties of Scotland are sparsely populated.The left panel of Figure 3 displays the standardized morbidity ratio (SMR) defined as the ratio of the observed to expected cases.This map appears to confirm the south-north trend.The standard non-spatial model for these data consists in assuming that the observed cases Y i are distributed as a Poisson variable with mean µ i = φ i e i exp(β 0 + β 1 x 1i + β 2 x 2i ) where x 1i is the AFF covariate and x 2i is the latitude coordinate (divided by 100).Quantity φ i is an overdispersion parameter distributed as a Gamma variable with mean 1 and scale κ.In other words, a marginal negative binomial model for Y i is assumed.
Residual spatial dependence is modelled by assuming that the errors ǫ i are realizations of an underlying continuous zero mean Gaussian random field model.Under the assumption of uniform spatial distribution within each county, the correlation between the errors of county i and county j is given by the average where A i denotes the ith county, |A i | its area and ρ(•; τ ) is a spatial correlation function.Given its flexibility, we consider the Matérn spatial correlation function defined in equation (5).
We start the analysis approximating corr(ǫ i , ǫ j ) by the spatial correlation function between the centroids of the two areas, corr(ǫ i , ǫ j ) ≈ ρ( si − sj 2 ), with si denoting the centroid of the ith county.Diggle and Ribeiro [14] affirm that the shape parameter τ 2 of the Matérn correlation function is difficult to identify and suggest choosing its value from the discrete set {0.5, 1.5, 2.5}, which represents various degrees of mean-square differentiability of the underlying signal process.For these data, we found little difference by varying the shape parameter τ 2 .In the following, we present estimates obtained from the model with τ 2 = 0.5, corresponding to the so-called exponential correlation function corr(ǫ i , ǫ j ) = exp (− si − sj 2 /τ 1 ) .The estimated marginal parameters yield the mean response Ê(Y i |x 1i , x 2i ) = e i exp(−20.80 (4.58) + 4.31 (1.43) x 1i + 36.74 (8.06) x 2i ), and the non-spatial overdispersion parameter φ i is distributed as a Gamma variable with mean of one and estimated scale parameter of κ = 0.17 (0.06).Covariate AFF (x 1 ) is significantly positively associated with excess of tumor incidence.The estimated spatial trend coefficient (x 2 ) is also highly significant although this association should be interpreted with care because the most northern counties are very sparsely populated resulting in zero expected cases.
There is evidence of some local residual spatial correlation.The estimate of the correlation parameter of the exponential correlation function is 14.36 Km (6.19 Km).Hence, two counties more than 3 × 14.36 = 43.08Km apart have errors with a correlation lower than 0.05.There are 157 distinct pairs of counties with centroids at distance lower than 43 Km among the observed 56!/(2!54!) = 1, 540 pairs.It follows by Property 1 that there is weak residual spatial dependence in the responses once accounting for the AFF covariates and the south-north trend.
A more precise analysis should consider the geography of Scotland and require the computation of the correlation between the errors generated by an underlying continuous process as in formula (20).However, this more elaborated analysis seems likely to be nonessential because of the weak local spatial dependence once the covariates AIFF and the spatial trend are included.

Computational details
The models in examples 9.1, 9.2 and 9.4 are fitted with the GHK algorithm.Each model is first fitted with K=100 Monte Carlo replications using the maximum independence likelihood estimates as starting values for the marginal parameters.Then, the model is re-fitted with a larger value of K=1,000 replications using the initial estimates as starting values.Our experience is that K=1,000 is a sufficient size to obtain statistically stable estimates with data set up to few hundreds of observations.The computational time needed to fit the various models depends on various factors, not only the number of observations.Other crucial factors are the number of dependence parameters, the degree of dependence and the level of discreteness of the responses.For example, with a MacBook Air notebook with processor 1.8 Ghz Intel Core i7 and 4 Gb of memory we need only 0.08 minutes to fit the spatial model to Scotland Lip Cancer data, which involve 56 observations, only one dependence parameter and the degree of dependence is very low.The computational time for fitting the ARMA(2,1) model to the Polio data is instead 1.2 minutes.This longer time is due to the larger sample size (n = 168), the presence of three dependence parameters and significant serial dependence.The analysis of Salamander data is more time consuming as model fitting needs 4.78 minutes.This time is explained not only by the size of 360 observations, but also because there are six dependence parameters and the responses are binary.

Concluding discussion
In this paper we have discussed the class of Gaussian copula models for marginal regression analysis of correlated non-normal data.The convenient model specification allows for simple interpretation of marginal parameters and great flexibility in specification of the dependence structure.Applications to time series, longitudinal studies, spatial data and survival analysis have been used to illustrate this flexibility.Residual analysis and the Hausman-type specification test can be used to validate the adequacy of the assumed multivariate model and simulation studies reported in Section 8 suggest that a certain level of local misspecification can be tolerated.Investigation of this apparent robustness is perhaps the most interesting direction for future research.Other possible future research directions include the following aspects.
1. High dimensional data.Numerical difficulties may arise from the inversion of matrix Ω with high-dimensional data and from the dimension of the integral needed to compute the likelihood in the non-continuous case.
In these cases it is necessary to simplify either the correlation model for the errors or the estimation procedure.An example of model simplification for spatial applications is the approximation of errors following a Gaussian random field by a Gaussian Markov random field in a fine grid, as shown in Rue and Tjelmeland [46].With this approximation, the computational cost for inversion of Ω roughly reduces from O(n 3 ) to O(n 3/2 ) in two dimensions and O(n 2 ) in three dimensions.An example of simplified estimation procedure is the method of maximum composite marginal likelihood [35,54].This consists of maximizing a pseudolikelihood constructed from a product of marginal densities, typically low-dimensional.The basic example is the pairwise likelihood formed by bivariate densities p ij (y i , y j ; θ) wij , where w ij are suitable weights and the bivariate densities are given in formulas ( 6) and (7).The pairwise likelihood does not require inversion of Ω and has a computational cost of O(n 2 ) if all pairs are used.Furthermore, in the non-continuous case only bivariate integrals (7) need to be evaluated.The computational saving can be substantial and can be further improved by a suitable choice of the weights w ij to remove less informative pairs, such as those formed by too-distant observations in spatial applications.Maximum pairwise likelihood estimators are consistent and asymptotically normal under appropriate regularity conditions, and in many applications they have competitive efficiency.Zhao and Joe [62] study the performance of pairwise likelihood estimators for Gaussian copula models and conclude that this method perform well.2. Elliptic distributions other than normal for the errors.Although simulation results reported in Section 8 suggest that inference is relatively robust to misspecification of the joint distribution for the errors, in some applications it may useful to consider more flexible elliptic distributions for handling heavy tails or skewness of the errors.For example, the assumed Gaussian copula is not appropriate for modelling extreme values events because of its well-known lack of tail dependence.In this case, it may be appropriate to assume instead a Student-t distribution for the errors [40].3. Maximization by parts.Maximization by parts is a numerical iterative algorithm proposed by Song et al. [49] to optimize complex log-likelihoods that can be partitioned into a manageable "working log-likelihood" plus a more complex "remainder log-likelihood".See also Song [48].The algorithm aims to enhance numerical stability relative to the direct numerical optimization of the likelihood function.Among other applications, maximization by parts has been proposed for fitting continuous Gaussian copula regression models.

Maximum simulated likelihood via Markov chain Monte Carlo algorithms.
Recently, Jeliazkov and Lee [26] show that Markov chain Monte Carlo algorithms designed for marginal likelihood computation in Bayesian inference can be used for efficient maximum simulated likelihood analysis.The key ingredient of this proposal is the method of Chib [6] for marginal likelihood estimation from the output of Gibbs sampling algorithms.Translated to the context of this paper, one possibility described in Jeliazkov and Lee [26] is drawing from p(ǫ|y; θ) with the Gibbs sampling algorithm by Geweke [19] and then estimate the marginal likelihood with Chib's method with Rao-Blackwellization. Jeliazkov and Lee [26] discuss other variants of this idea and compare them against the GHK algorithm in the context of multivariate probit models.

Fig 2 .
Fig 2. Polio data.Normal probability (left panel) and autocorrelation (right panel) plots for a set of randomized residuals.

Table 1
(16)ages of 1,500 estimates of the parameter of model(16)with AR(1) errors and time series of lengths n ∈ {100, 500} using the importance sampling approximation with a Monte Carlo sample size of K = 300.Also reported are the sums of squares pertaining to estimation errors (SSest) and to Monte Carlo errors (SS MC )

Table 2
Averages, standard deviations (sim.s.e.), averages of estimated standard errors (est.s.e.) and 95% confidence intervals coverages for 1, 000 estimates of parameter β 3 in model (16) with errors that follow an ARMA(1,1) model and are distributed as Student t random variables with ν degrees of freedom.The estimates are computed either with a misspecified dependence model with normal AR(1) errors (gcmr) or under the working assumption of independence with robust standard errors (ind.).The table displays results for sample sizes n ∈ {50, 100, 300}, . The estimates are computed either with a misspecified dependence model with normal AR(1) errors (gcmr) or under the working assumption of independence with robust standard errors (ind.).The table displays results for sample sizes n ∈ {50, 100, 300} and autoregressive parameter AR ∈ {0.3, 0.6, 0.9}

Table 4
Polio data.Estimates (est.) and standard errors (s.e.) for the marginal parameters from (a) the independence model and (b) from the dependence model with ARMA(2,1) errors.Also displayed are estimates, standard errors and z values for the differences of the estimates

Table 5
[60]mander data.Estimated mating probabilities and their 95% confidence intervals from the marginal model compared with the estimates provided by Zeger and Karim[60]litter mates are observed for each of 50 litters.One of the three mates received a treatment, while the other two serve as controls.Responses consist of either the week of tumor occurrence or the week of death before the instance of any tumor.In any case, all rats were sacrificed after 104 weeks.Scientific interest is addressed to evaluating a possible association between treatment and time to tumor.Let Y i = min{T i , c i } denote the response with T i being the time to tumor and c i the censoring time, i = 1, . . ., 150.Marginally, we assume a Weibull regression model for the survival times T