Gaussian Copula Regression in R

This article describes the R package gcmr for fitting Gaussian copula marginal regression models. The Gaussian copula provides a mathematically convenient framework to handle various forms of dependence in regression models arising, for example, in time series, longitudinal studies or spatial data. The package gcmr implements maximum likelihood inference for Gaussian copula marginal regression. The likelihood function is approximated with a sequential importance sampling algorithm in the discrete case. The package is designed to allow a flexible specification of the regression model and the dependence structure. Illustrations include negative binomial modeling of longitudinal count data, beta regression for time series of rates and logistic regression for spatially correlated binomial data.


Introduction
Copula models (Joe 2014) are often considered to extend univariate regression models assuming independent responses to more general frameworks (e.g., Frees and Valdez 1998; Song 2000; Parsa and Klugman 2011; Kolev and Paiva 2009). The principal merit of the approach is that the specification of the regression model is separated from the dependence structure. This paper focuses on the Gaussian copula regression method, where dependence is conveniently expressed in the familiar form of the correlation matrix of a multivariate Gaussian distribution (Song 2000; Pitt et al. 2006; Masarotto and Varin 2012). Gaussian copula regression models have been successfully employed in several complex applications arising, for example, in longitudinal data analysis (Frees and Valdez 2008; Sun et al. 2008; Shi and Frees 2011; Song et al. 2013), genetics (Li et al. 2006; He et al. 2012), mixed data (Song et al. 2009; de Leon and Wu 2011; Wu and de Leon 2014; Jiryaie et al. 2015), spatial statistics (Kazianka and Pilz 2010; Bai et al. 2014; Hughes 2015; Nikoloulopoulos 2015), and time series (Guolo and Varin 2014).
Various authors have discussed likelihood inference for Gaussian copula models (e.g., Masarotto and Varin 2012; Song et al. 2013; Nikoloulopoulos 2015). While likelihood computations for continuous responses are straightforward, the discrete case is considerably more difficult because the likelihood function involves multidimensional Gaussian integrals. Simulation methods are often employed to approximate the likelihood of Gaussian copula models in the presence of high-dimensional discrete responses. Pitt et al. (2006) developed a Markov chain Monte Carlo algorithm for Bayesian inference, Masarotto and Varin (2012) adopted a sequential importance sampling algorithm, and Nikoloulopoulos (2013) studied simulated maximum likelihood based on randomized quasi-Monte Carlo integration; see also Nikoloulopoulos (2015). Alternatively, composite likelihood methods (Varin et al. 2011) have been used to reduce the integral dimensionality and avoid inversion of high-dimensional covariance matrices (e.g., Zhao and Joe 2005; Bai et al. 2014; Hughes 2015).
Well-known limits of the Gaussian copula approach are the impossibility of dealing with asymmetric dependence and the lack of tail dependence. These limits may impact the use of Gaussian copulas to model forms of dependence arising, for example, in extreme environmental events or in financial data. Conversely, this paper focuses on working Gaussian copulas used to conveniently handle dependence in regression analysis, as described in Masarotto and Varin (2012). In other terms, the parameters of interest are the regression coefficients, while the dependence structure identified by the Gaussian copula is a nuisance component.
The CRAN archive contains several R (R Core Team 2015) packages devoted to copula modeling, but only a few consider copulas for regression modeling. The weightedScores package (Nikoloulopoulos and Joe 2014) is designed for longitudinal modeling of discrete responses. Regression parameters are estimated with optimal estimating equations, while dependence parameters are estimated by maximum composite likelihood based on a working Gaussian copula model. The principal merit of the approach is its robustness against misspecification of the copula distribution. CopulaRegression (Kraemer et al. 2013) uses various copula models to describe the joint distribution of a pair of continuous and discrete random variables. The marginals are defined via generalized linear models and various parametric copulas are fitted to the bivariate joint distribution with the method of maximum likelihood. copCAR implements a Gaussian copula regression model for areal data. Model parameters are estimated with composite likelihood and other types of likelihood approximation. The popular package copula (Hofert et al. 2015) can also be used for multivariate regression with continuous responses, although it does not contain functions directly designed for regression; see the Appendix of Yan (2007).
Package gcmr differs from the above packages in terms of the covered models, the fitting method in the discrete case and the functionalities for evaluation of the fitted model. Available marginal regression models include generalized linear models, negative binomial regression for overdispersed counts and beta regression for rates and proportions. The implemented Gaussian copula correlation matrices make it possible to handle various forms of dependence arising, for example, in longitudinal data analysis, time series and geostatistics. Advanced users may expand gcmr with additional marginal regression models and Gaussian copula correlation matrices, as explained in the Appendix. Models are fitted with the method of maximum likelihood in the continuous case and maximum simulated likelihood in the discrete case. Among the advantages of likelihood inference are the possibility of selecting models using information criteria such as AIC or BIC and of computing the profile log-likelihood for inference on a focus parameter. Given potential concerns about the assumed copula, various types of robust sandwich standard errors are available. Moreover, gcmr implements residuals analysis to evaluate departures from the model assumptions.
The article is organized as follows. Section 2 briefly summarizes the theory of Gaussian copula regression with emphasis on model specification, likelihood inference, quantification of estimation uncertainty and model validation through residuals analysis. Section 3 describes the R implementation available within the gcmr package. Section 4 illustrates various gcmr functionalities using longitudinal data, time series and spatial data. The Appendix provides some guidance to advanced users about how to extend the package capabilities by specification of regression models and dependence structures not yet available in gcmr.

Gaussian copula regression
Consider a vector of n dependent variables Y_1, ..., Y_n. The marginal cumulative distribution of a single variable Y_i is denoted by F(·|x_i) and depends on a p-dimensional vector of covariates x_i. We assume that F(·|x_i) is parameterized in terms of a location parameter μ_i, typically corresponding to the expected value E(Y_i|x_i), that depends on x_i through the relationship

g_1(μ_i) = x_i^T β,    (1)

for a suitable link function g_1(·) and a p-dimensional vector of regression coefficients β. This setting encompasses a variety of popular model classes such as, for example, generalized linear models (McCullagh and Nelder 1989) or beta regression (Cribari-Neto and Zeileis 2010). If the distribution of Y_i includes a dispersion parameter, then the model can be extended to allow for variable dispersion with a second regression model (Cribari-Neto and Zeileis 2010)

g_2(ψ_i) = z_i^T γ,    (2)

where g_2(·) is the dispersion link function, ψ_i is the dispersion parameter associated with Y_i, z_i is the q-dimensional vector of dispersion covariates and γ is the corresponding vector of regression coefficients. For the sake of notational simplicity, hereafter the marginal cumulative univariate distribution of Y_i will be denoted by F(·|x_i) even in the case of variable dispersion, where indeed the model is F(·|x_i, z_i). In Gaussian copula regression the dependence between the variables is modelled with a Gaussian copula, so that the joint cumulative distribution function of the data is given by

Pr(Y_1 ≤ y_1, ..., Y_n ≤ y_n) = Φ_n(ε_1, ..., ε_n; P),

where ε_i = Φ^{-1}{F(y_i|x_i)}, with Φ(·) denoting the univariate standard normal cumulative distribution function and Φ_n(·; P) the n-dimensional multivariate standard normal cumulative distribution function with correlation matrix P.
An equivalent formulation of the Gaussian copula model that emphasizes the regression setting is described in Masarotto and Varin (2012). Consider a regression model that links each variable Y_i to a vector of covariates x_i through the generic relationship

Y_i = h(x_i, ε_i),

where ε_i is a stochastic error. Among the many possible specifications of the function h(·) and the error ε_i, the Gaussian copula regression model assumes that

h(x_i, ε_i) = F^{-1}{Φ(ε_i)|x_i},

and that the vector of "error terms" ε = (ε_1, ..., ε_n) has a multivariate standard normal distribution with correlation matrix P. In other terms, the Gaussian copula identifies a regression model constructed so as to (i) preserve the marginal univariate distributions and (ii) have multivariate normal errors.
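The regression formulation above, in which each response is obtained by transforming a multivariate normal error through its marginal distribution, can be sketched in base R. The exchangeable correlation, the common Poisson mean and all names below are illustrative choices, not values from the article:

```r
# Sketch of simulation from a Gaussian copula regression model in base R.
# The exchangeable correlation rho and common Poisson mean mu are
# illustrative choices, not taken from the article.
set.seed(1)
n   <- 200
mu  <- 4                               # common Poisson marginal mean
rho <- 0.6                             # exchangeable copula correlation
P   <- matrix(rho, n, n)
diag(P) <- 1

# Correlated "errors" eps ~ N_n(0, P) via the Cholesky factor of P
eps <- drop(t(chol(P)) %*% rnorm(n))

# Transform through Phi and the inverse marginal cdf: each component
# keeps the Poisson(mu) marginal, while the dependence comes from P
y <- qpois(pnorm(eps), lambda = mu)
```

Replacing qpois() by another quantile function changes the marginal model while leaving the dependence structure untouched, which is the separation the copula approach exploits.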
An attractive feature of the Gaussian copula approach is that various forms of dependence can be expressed through suitable parameterization of the correlation matrix P. For example, longitudinal data can be modeled assuming the working correlation models considered in generalized estimating equations (Song 2007, § 6). Serial dependence in time series can be described with a correlation matrix corresponding to an autoregressive and moving average process (Guolo and Varin 2014), while spatial dependence can be handled with a correlation matrix induced by a Gaussian random field (Bai et al. 2014).

Likelihood inference
The gcmr package implements maximum likelihood inference for Gaussian copula regression models. Let θ denote the vector of model parameters consisting of the parameters of the univariate marginals and the parameters belonging to the Gaussian copula correlation matrix.
The likelihood function for θ in the continuous case has the closed form (e.g., Song 2000)

L(θ) = φ_n(ε_1, ..., ε_n; P) ∏_{i=1}^n f(y_i|x_i)/φ(ε_i),

where ε_i = Φ^{-1}{F(y_i|x_i)}, φ(·) indicates the univariate standard normal density, φ_n(·; P) the n-dimensional standard normal density with correlation matrix P and f(·|x_i) the density of Y_i given x_i. The discrete case is considerably more involved because the likelihood is given by the n-dimensional normal integral

L(θ) = ∫_A φ_n(u_1, ..., u_n; P) du_1 ··· du_n,    (3)

where the integration domain A is the Cartesian product of the intervals A_i = (Φ^{-1}{F(y_i^-|x_i)}, Φ^{-1}{F(y_i|x_i)}], with F(y^-|x_i) denoting the left-hand limit of the marginal distribution function at y. A remarkable amount of research has been devoted to the numerical approximation of multivariate normal integrals. For example, quasi-Monte Carlo approximations are available through the popular R package mvtnorm (Genz and Bretz 2009; Genz et al. 2012). We refer to Nikoloulopoulos (2013, 2015) for numerical studies about the efficiency of simulated maximum likelihood estimation based on mvtnorm in Gaussian copula models. Masarotto and Varin (2012) suggest that efficient numerical approximations of Equation 3 can be obtained by suitable generalizations of numerical methods for approximate likelihood inference in multivariate probit models. The most popular of such methods is probably the Geweke-Hajivassiliou-Keane (GHK) simulator (Keane 1994). The GHK simulator is a sequential importance sampling algorithm extensively studied in the computational econometrics literature, where it is commonly considered the gold standard for likelihood computation in multivariate probit models; see, for example, Train (2003). The gcmr package implements maximum simulated likelihood estimation based on a variant of the GHK algorithm described in Masarotto and Varin (2012).
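As a rough illustration of the GHK idea (not the variant implemented in gcmr), the following base R sketch estimates a multivariate normal rectangle probability Pr(a < Z ≤ b) by sequential importance sampling; the function name ghk and all inputs are illustrative:

```r
# Didactic GHK-style sequential importance sampler for the rectangle
# probability Pr(a < Z <= b), Z ~ N_n(0, P). A sketch of the general
# idea, not the algorithm variant used inside gcmr.
ghk <- function(a, b, P, M = 1000) {
  L <- t(chol(P))                       # lower-triangular factor, P = L L'
  n <- length(a)
  w <- numeric(M)
  for (m in seq_len(M)) {
    eta  <- numeric(n)
    logw <- 0
    for (i in seq_len(n)) {
      cond <- if (i > 1) sum(L[i, 1:(i - 1)] * eta[1:(i - 1)]) else 0
      lo <- pnorm((a[i] - cond) / L[i, i])
      hi <- pnorm((b[i] - cond) / L[i, i])
      logw <- logw + log(hi - lo)       # accumulate importance weight
      eta[i] <- qnorm(runif(1, lo, hi)) # truncated standard normal draw
    }
    w[m] <- exp(logw)
  }
  mean(w)                               # Monte Carlo estimate of the integral
}
```

Under independence (P equal to the identity) the importance weights are deterministic and the estimate coincides with the exact product of univariate probabilities, which gives a quick sanity check.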
There are different options to evaluate the uncertainty of the maximum likelihood estimator. The first is the classical approach that evaluates the uncertainty with the inverse of the observed Fisher information. Alternatively, the asymptotic variance of the maximum likelihood estimator can be estimated with the outer product of the scores derived from the predictive decomposition of the likelihood. The above options are valid if the Gaussian copula model is correctly specified. Given the potential concerns about the Gaussian copula assumption, it is advisable to compare model-based standard errors with robust sandwich estimators. Significant divergences between model-based and robust standard errors provide an indirect indication of model misspecification.
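The sandwich recipe can be illustrated outside the copula setting with an ordinary Poisson GLM on simulated data: the "bread" is the inverse Fisher information and the "meat" is the outer product of per-observation scores. All data and names below are illustrative, not from the article:

```r
# Sketch of model-based vs sandwich variances with independent data:
# a Poisson GLM fitted to simulated data (all values illustrative).
# The same bread-meat-bread recipe underlies robust standard errors.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rpois(n, exp(0.5 + 0.3 * x))
X <- cbind(1, x)

fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)

# Bread: inverse of the Fisher information H = X' diag(mu) X
H <- crossprod(X * mu, X)
bread <- solve(H)

# Meat: outer product of the per-observation scores s_i = (y_i - mu_i) x_i
S <- X * (y - mu)
meat <- crossprod(S)

model_vcov    <- bread                  # valid under correct specification
sandwich_vcov <- bread %*% meat %*% bread
```

When the model is correct the two variance estimates agree asymptotically; a marked divergence between them signals misspecification, as discussed above.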

Residuals
Masarotto and Varin (2012) suggest validating Gaussian copula regression models for continuous responses with the predictive quantile residuals

r_i = Φ^{-1}{F(y_i|y_1, ..., y_{i-1}; θ̂)},    (4)

where θ̂ denotes the maximum likelihood estimate of θ. Under model conditions, the quantile residuals r_i are, approximately, realizations of uncorrelated standard normal variables and they are unrelated to the covariates x_i. The predictive quantile residuals for continuous responses can be expressed in the familiar form of standardized residuals

r_i = (ε̂_i − m_i)/s_i,

where m_i and s_i are the conditional mean and standard deviation of ε_i given ε_1, ..., ε_{i−1}, evaluated at θ̂. In the discrete case, quantile residuals r_i are defined as any arbitrary value in the interval (Φ^{-1}(a_i), Φ^{-1}(b_i)], with a_i = F(y_i^-|y_1, ..., y_{i−1}; θ̂) and b_i = F(y_i|y_1, ..., y_{i−1}; θ̂). Model checking can be based on the randomized quantile residuals

r_i^rnd(u_i) = Φ^{-1}{a_i + u_i(b_i − a_i)},

where u_i is generated from a uniform random variable on the unit interval (Dunn and Smyth 1996). Randomized quantile residuals r_i^rnd(u_i) are realizations of uncorrelated standard normal variables under model conditions and they can be used as ordinary residuals for checking model assumptions. Since the r_i^rnd(u_i) are randomized, it is opportune to examine several sets of residuals before drawing conclusions about model fit. Zucchini and MacDonald (2009) suggest avoiding randomization with the mid-interval quantile residuals r_i^mid = Φ^{-1}{(a_i + b_i)/2}, which are, however, neither normally distributed nor uncorrelated.
Quantities r_i, r_i^rnd and r_i^mid are examples of conditional quantile residuals, because they involve the conditional distribution of Y_i given the "previous" observations y_1, ..., y_{i−1}. It is also possible to consider "marginal versions" of these residuals based on the univariate marginal distribution of Y_i, obtained, in the continuous case, with Equation 4 replaced by r_i^marg = Φ^{-1}{F(y_i|x_i; θ̂)}. Differently from the conditional versions, marginal quantile residuals are useful for checking the assumptions about the marginal component of the model, but they are uninformative about the correctness of the Gaussian copula assumption.
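For independent discrete data, the randomized and mid-interval quantile residuals reduce to the construction of Dunn and Smyth (1996), which can be sketched in base R; the Poisson mean below is an arbitrary illustrative choice:

```r
# Randomized and mid-interval quantile residuals for independent
# Poisson observations; lambda is an arbitrary illustrative value.
set.seed(3)
lambda <- 5
y <- rpois(100, lambda)
a <- ppois(y - 1, lambda)              # F(y^-), cdf just below y
b <- ppois(y, lambda)                  # F(y), cdf at y
u <- runif(100)

r_rnd <- qnorm(a + u * (b - a))        # randomized quantile residuals
r_mid <- qnorm((a + b) / 2)            # mid-interval quantile residuals
```

With dependent data the cdf values a and b are replaced by the predictive cumulative probabilities given the previous observations, but the randomization step is unchanged.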

Implementation in R
The main function of the gcmr package is gcmr(), which fits Gaussian copula models by maximum likelihood in the continuous case and by maximum simulated likelihood in the discrete case. The arguments of gcmr() are the following:

gcmr(formula, data, subset, offset, marginal, cormat, start, fixed,
     options = gcmr.options(...), model = TRUE, ...)

The function has standard arguments for model-frame specification (Chambers and Hastie 1993) such as a formula, the possibility to restrict the analysis to a subset of the data, to set an offset, or to fix contrasts for factors. The specific arguments of gcmr() include the two key arguments marginal and cormat, which specify the marginal part of the model and the copula correlation matrix, respectively. Finally, there are three optional arguments to supply starting values (start), fix the values of some parameters (fixed) and set the fitting options (options). The rest of this section describes the components of gcmr() and the related methods.

Two-part formulas
The basic formula allowed in gcmr is of type y ~ x1 + x2; it specifies the regression model for the mean response of Equation 1, with the link function g_1(·) defined in the argument marginal as explained in Section 3.2 below. Following the implementation of beta regression in package betareg (Cribari-Neto and Zeileis 2010), the gcmr package also allows specifying a second regression model for the dispersion through a "two-part" formula of type y ~ x1 + x2 | z1 + z2, using functionalities inherited from package Formula (Zeileis and Croissant 2010). In the two-part formula case, the mean regression model is still y ~ x1 + x2, while the dispersion parameter is modeled as a function of the linear predictor ~ z1 + z2. Package gcmr assumes a log-linear model g_2(·) = log(·) for the dispersion regression model of Equation 2.

Specification of the marginal model
The marginal model F(·|x_i) is specified through an object of class marginal.gcmr set in the argument marginal of function gcmr(). The marginal distributions available in gcmr version 0.7.5 are beta, binomial, gamma, Gaussian, negative binomial, Poisson and Weibull; see Table 1. For each of these distributions, it is possible to choose a link function that relates the mean of the response to the linear predictor as in traditional generalized linear models. All the link functions available in the class link-glm are allowed. Gaussian marginals are included in gcmr for completeness, but it is not recommended to use gcmr for fitting multivariate normal models trivially arising from the combination of Gaussian marginals with a Gaussian copula. In fact, the package gcmr is designed to work with Gaussian copula models with generic univariate marginal distributions and thus it is not numerically efficient for inference in multivariate linear Gaussian models, where the availability of analytic results allows for a significant speed-up of computations.
The user may also construct her/his own marginal model by specifying a new object of class marginal.gcmr as explained in Appendix A.1.

Specification of the correlation structure
The correlation matrix P of the Gaussian copula is specified through an object of class cormat.gcmr set in the argument cormat of function gcmr(). Version 0.7.5 of gcmr includes four correlation structures of wide applicability, see Table 2. The working independence correlation option is similar in spirit to that of generalized estimating equations.

marginal.gcmr                       Distribution        Dispersion
beta.marg(link = "logit")           beta                yes
binomial.marg(link = "logit")       binomial            no
Gamma.marg(link = "inverse")        gamma               yes
gaussian.marg(link = "identity")    Gaussian            yes
negbin.marg(link = "log")           negative binomial   yes
poisson.marg(link = "log")          Poisson             no
weibull.marg(link = "log")          Weibull             yes

Table 1: Marginal models available in gcmr version 0.7.5 with the default link function. The column "Dispersion" identifies the distributions with a dispersion parameter.

The other three
correlation structures make it possible to deal with time series, clustered or longitudinal data and spatial data. Clustered and longitudinal data can be analyzed with cluster.cormat(id, type), constructed upon functions inherited from package nlme (Pinheiro et al. 2014). The inputs of cluster.cormat() are the vector of subject id and the type of correlation, with possible options "independence", "ar1", "ma1", "exchangeable" and "unstructured". Subject id is a vector of the same length as the number of observations. Data are assumed to be sorted in such a way that observations from the same subject (or cluster) are contiguous, otherwise gcmr stops and returns an error message. Serial dependence in time series can be described with function arma.cormat(p, q), which receives the orders p and q of the ARMA(p, q) process as input. Spatially correlated data can be modeled by assuming a Matérn spatial correlation function set by function matern.cormat(D, alpha), where D is the matrix of the distances between observations and alpha is the shape parameter (Diggle and Ribeiro 2007). Function matern.cormat() is constructed upon function matern() of the geoR package (Ribeiro and Diggle 2015). The default value for parameter alpha is 0.5, and it corresponds to an exponential correlation model.
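For intuition, the exponential correlation model (the Matérn model with shape alpha = 0.5, the default mentioned above) can be reproduced in base R for planar coordinates; the coordinates and the value of tau below are illustrative:

```r
# The exponential correlation model, i.e., Matern with shape alpha = 0.5:
# entries decay as exp(-d / tau), where tau controls the range of the
# spatial dependence. Coordinates and tau are illustrative values.
set.seed(4)
coords <- cbind(runif(10), runif(10))
D   <- as.matrix(dist(coords))         # matrix of pairwise distances
tau <- 0.3
P   <- exp(-D / tau)                   # exponential correlation matrix

# A valid correlation matrix: unit diagonal, symmetric and positive
# definite, so the Cholesky factorization succeeds
R <- chol(P)
```

Larger values of tau produce a slower decay of the correlation with distance, that is, longer-range spatial dependence.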

Fitting options
The fitting options in gcmr() are set by argument options or by a direct call to function gcmr.options():

gcmr.options(seed = round(runif(1, 1, 1e+05)), nrep = c(100, 1000),
             no.se = FALSE, method = c("BFGS", "Nelder-Mead", "CG"), ...)

Available arguments are seed, for fixing the pseudo-random seed used in the GHK algorithm to approximate the likelihood function with discrete responses, nrep, for setting the number of Monte Carlo replications in the GHK algorithm, no.se, for choosing whether or not to compute standard errors, and method, for selecting the optimization method to be passed to optim(). The default optimization algorithm is the quasi-Newton BFGS algorithm. It is possible to provide a vector of Monte Carlo replications to nrep, so that the model is fitted with a sequence of different Monte Carlo sizes. In this case, the starting values for each likelihood optimization are taken from the previous fit. A reasonable strategy is to fit the model with a small Monte Carlo size to obtain sensible starting values and then refit with a larger Monte Carlo size. The default Monte Carlo size is 100 for the first optimization and 1,000 for the second and definitive optimization. If the responses are continuous, then the likelihood function has a closed-form expression and the values of seed and nrep are ignored.

Methods
The returned fitted-model object of class gcmr is a list that contains, among others, the maximum likelihood estimates, the maximized log-likelihood and numerical estimates of the Hessian and the Jacobian of the log-likelihood computed at the maximum likelihood estimate. A set of standard methods is available to extract information from the fitted model, see Table 3. Most of the functions and methods have standard syntax as in other R packages oriented to regression analysis; see, for example, betareg (Cribari-Neto and Zeileis 2010).
Method plot.gcmr() produces various diagnostic plots of the fitted gcmr object, including scatterplots of the quantile residuals against the indices of the observations or against the linear predictor, the normal probability plot with confidence bands based on the implementation in the car package (Fox and Weisberg 2011), the scatterplot of the predicted values against the observed values, and autocorrelation and partial autocorrelation plots of the residuals. The default behaviour of plot.gcmr() adapts to the type of correlation matrix in such a way that, for example, autocorrelation plots are automatically displayed for the ARMA(p, q) correlation specified by function arma.cormat(p, q).
The profile log-likelihood can be obtained with a call to method profile.gcmr():

profile(fitted, which, low, up, npoints = 10, display = TRUE,
        alpha = 0.05, progress.bar = TRUE, ...)

Argument which is the index of the parameter to be profiled, low and up are the lower and upper limits used in the computation, npoints is the number of points used in the computation of the profile likelihood, alpha is the significance level, display controls whether the profile likelihood should be plotted or not and progress.bar sets a "progress bar" to visualize the progression of the time-consuming profile likelihood computation. If the values of limits low and up are not provided, then they are set equal to the estimated parameter minus and plus three times the standard error, respectively.
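The mechanics of a profile log-likelihood (fix the focus parameter on a grid, maximize over the remaining parameters) can be sketched for a simple independent-data model. The Poisson GLM and all data below are illustrative; the grid mirrors the default low/up limits of profile.gcmr():

```r
# Profiling one coefficient of a simple independent-data model: fix the
# focus parameter on a grid and maximize the log-likelihood over the
# remaining parameters. Data and model are illustrative (Poisson GLM).
set.seed(5)
x <- rnorm(200)
y <- rpois(200, exp(1 + 0.4 * x))
fit  <- glm(y ~ x, family = poisson)
beta <- coef(fit)["x"]
se   <- sqrt(vcov(fit)["x", "x"])

# Grid over estimate +/- 3 standard errors, mirroring the default
# low/up limits of profile.gcmr(), with npoints = 10
grid <- seq(beta - 3 * se, beta + 3 * se, length.out = 10)
prof <- sapply(grid, function(b) {
  # fixing the slope at b via an offset; glm maximizes over the intercept
  as.numeric(logLik(glm(y ~ 1, offset = b * x, family = poisson)))
})
```

The profiled values are bounded above by the maximized log-likelihood and peak near the maximum likelihood estimate, which is what the curve drawn by profile.gcmr() displays for a gcmr fit.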

Applications
The usage of gcmr is illustrated below with three different data sets covering various forms of dependence frequently arising in real applications.

Longitudinal count data
The first example considers the well-known longitudinal study on epileptic seizures described in Diggle et al. (2002):

R> data("epilepsy", package = "gcmr")

The data comprise information about 59 individuals observed at five different occasions each.
The baseline observation consists of the number of epileptic seizures in an eight-week interval, followed by four measurements collected at subsequent visits every two weeks. Available variables are the patient identifier id, the patient age, the indicator trt of whether the patient is treated with progabide (trt = 1) or not (trt = 0), the number of epileptic seizures counts, the observation period time in weeks, that is time = 8 for the baseline and time = 2 for subsequent visits, and the indicator visit of whether the observation corresponds to a visit (visit = 1) or to the baseline (visit = 0). Diggle et al. (2002) analyzed the seizure data with the method of generalized estimating equations, assuming a log-linear regression model for counts with the logarithm of time used as an offset and covariates trt, visit and their interaction. Moreover, Diggle et al. (2002) suggested omitting an "outlier" patient (here corresponding to patient id = 49) with an extremely high seizure count at baseline (151 counts) that even doubled after treatment (302 counts after 8 weeks of measurement). Indeed, the estimated model coefficients vary considerably if this patient is set aside.
The corresponding Gaussian copula analysis described below assumes a negative binomial marginal distribution with mean specified as in Diggle et al. (2002). We start the analysis assuming a working independence correlation matrix for the Gaussian copula. Robust sandwich standard errors essentially confirm the previous results. The strong significance of the dispersion parameter supports the choice of the negative binomial marginal in place of the Poisson distribution.
However, a more accurate description of the data also accounts for the serial correlation of observations from the same subject. For example, the model can be re-estimated with the AR(1) Gaussian copula correlation matrix:

R> mod.ar1 <- update(mod.ind, cormat = cluster.cormat(id, "ar1"),
+    seed = 12345, nrep = 100)

The previous command illustrates the use of the gcmr fitting options. The random number generator seed is fixed to ensure reproducibility of the results, while the number of Monte Carlo replications nrep is set to a number lower than the default, which is useful during the model specification phase.
Robust sandwich standard errors confirm the presence of substantial autocorrelation between observations from the same patient, as reflected by the estimated AR(1) correlation. Differently from the working independence model, the autoregressive model identifies a significant effect of the interaction between visit and treatment. The result qualitatively agrees with that obtained with the generalized estimating equation analysis by Diggle et al. (2002), which can be reproduced with package geepack (Yan 2002; Højsgaard et al. 2006):

R> library("geepack")
R> gee.ar1 <- geeglm(counts ~ offset(log(time)) + visit + trt + visit:trt,
+    data = epilepsy, id = id, subset = (id != 49), family = poisson,
+    corstr = "ar1")
R> summary(gee.ar1)

Call:
geeglm(formula = counts ~ offset(log(time)) + visit + trt + visit:trt,
    family = poisson, data = epilepsy, subset = (id != 49), id = id,
    corstr = "ar1")

Among the advantages of the likelihood analysis implemented in gcmr with respect to non-likelihood methods such as generalized estimating equations is the possibility to compute profile log-likelihoods. Consider, for example, the profile log-likelihood for the interaction effect of visit with treatment, which can be obtained with a call to profile.gcmr() with argument which = 4 because the interaction effect corresponds to the fourth model parameter:

R> profile(mod.ar1, which = 4)

The profile log-likelihood reported in Figure 1 illustrates the significant negative coefficient associated with the interaction of visit with treatment.

Time series of rates
The second example concerns the time series of the hidden unemployment rate (HUR) in São Paulo, Brazil, obtained from the database of the Applied Economic Research Institute (IPEA) of the Brazilian Federal Government (www.ipea.gov.br). Following the analysis made by Rocha and Cribari-Neto (2009), we consider a Gaussian copula model with marginal beta distribution and ARMA(p, q) copula correlation. The mean and precision of the beta marginals are both assumed to depend on a linear trend. In order to avoid numerical instabilities, the trend is centered and scaled:

R> trend <- scale(time(HUR))

Below we illustrate the model with ARMA(1, 3) errors. This model was selected because it has the minimum AIC value among the sixteen ARMA(p, q) models obtained with orders p and q ranging from 0 to 3:

R> mod <- gcmr(HUR ~ trend | trend, marginal = beta.marg,
+    cormat = arma.cormat(1, 3))

The previous command illustrates the use of the extended formula HUR ~ trend | trend to specify that both the mean and the dispersion depend on the (scaled) trend. The summary of the fitted model confirms the presence of a statistically significant trend. Evidence that the assumptions of the above model are met is provided by graphical inspection of the quantile residuals reported in Figure 3:

Spatially correlated binomial data
The last example concerns malaria prevalence in children recorded at 65 villages in Gambia. Differently from the original data presented in Thomson et al. (1999) and available in the geoR package (Ribeiro and Diggle 2015), here we consider data aggregated at the village level, available through gcmr in the data frame malaria:

R> data("malaria", package = "gcmr")

The data contain information about the village coordinates (x, y), the number of sampled children (size) with malaria (cases) in each village, the mean age of the sampled children in each village (age), the frequency of sampled children who regularly sleep under a bed-net in each village (netuse), the frequency of sampled children whose bed-net is treated (treated), a satellite-derived measure of the greenness of vegetation in the immediate proximity of the village (green), the indicator variable denoting the presence (1) or absence (0) of a health center in the village (phc) and an indicator of geographical regions characterized by potentially different malaria risk (area). We refer to Diggle and Ribeiro (2007) for more details. The aim is to model the relationship between the number of cases and the various covariates, while accounting for the potential presence of spatial dependence of malaria spread between the villages.
The first step of the data analysis is the construction of the matrix of the distances between the villages using, for example, the function spDists() from package sp (Pebesma and Bivand 2005; Bivand et al. 2013):

R> library("sp")
R> D <- spDists(cbind(malaria$x, malaria$y)) / 1000

The distances are expressed in kilometers through scaling by a factor of 1,000. Scaling is helpful for avoiding potential numerical instabilities in the estimation of the spatial dependence parameter.
The first model describes the cases of malaria with a spatial Gaussian copula logistic regression model. The covariates are netuse, phc and green scaled by a factor of 100. Spatial dependence is modeled with an exponential correlation matrix corresponding to the default value of the shape parameter (alpha = 0.5) in matern.cormat(D, alpha). Covariates netuse and phc are associated with a significant reduction of malaria cases, while green is associated with a higher risk of disease. The estimate of the dependence parameter tau indicates the presence of very local spatial dependence. In fact, the estimated value of tau identifies a short "practical range" (e.g., Diggle and Ribeiro 2007). The summary confirms that covariate area contains relevant information about the geographic variation of malaria risk in the study region. Indeed, the estimate of the spatial dependence parameter tau in model mod.area shows that the residual spatial dependence is essentially negligible.

Conclusions
This article presented the R implementation of Gaussian copula marginal regression available in the gcmr package. The discussed examples illustrate the capability of the package to handle various types of data and dependence structures. Models are fitted with the method of maximum (simulated) likelihood, which requires repeated Cholesky factorization of the Gaussian copula correlation matrix. In the current version of gcmr, the order of computations needed for likelihood evaluation is O(n^3), with n denoting the number of observations. In the case of large data sets, consisting, for example, of several thousands of observations, the computational cost may prevent routine use of gcmr. However, the Cholesky factorization can be implemented more efficiently for some specific dependence structures. For example, autoregressive and moving average correlation matrices can be factorized in a linear number of computations by exploiting the Kalman filter through the state space representation.
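As a small illustration of why such structures are special, an AR(1) correlation matrix admits a closed-form triangular factor that can be written down without running a generic Cholesky routine; rho and n below are illustrative:

```r
# Closed-form triangular factorization of an AR(1) correlation matrix:
# P = A A' with A built directly from rho, avoiding a generic O(n^3)
# Cholesky. The values of rho and n are illustrative.
n   <- 50
rho <- 0.7
P   <- rho^abs(outer(1:n, 1:n, "-"))   # AR(1) correlation matrix

# A is the map from innovations to the AR(1) process: y_1 = e_1 and
# y_t = rho * y_{t-1} + sqrt(1 - rho^2) * e_t, so that Cov(y) = P
A <- matrix(0, n, n)
A[, 1] <- rho^(0:(n - 1))              # first column: rho^{i-1}
s <- sqrt(1 - rho^2)                   # innovation standard deviation
for (j in 2:n) A[j:n, j] <- s * rho^(0:(n - j))
```

The same state space recursion that generates A row by row is what allows the Kalman filter to evaluate the likelihood of ARMA-correlated models in a linear number of operations.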
Future research will focus on the implementation of computationally convenient methods to handle specific dependence forms within the general framework of Gaussian copula regression. Promising approaches include composite likelihoods, which reduce the computational effort through convenient likelihood factorizations (Varin et al. 2011), and sparse methods designed to approximate the Gaussian copula correlation matrix with a more manageable block-diagonal matrix.
Several authors have exploited Gaussian and t copulas to construct joint regression models for multiple responses, possibly of mixed type (e.g., Frees and Valdez 2008; Song et al. 2009; Wu and de Leon 2014; Jiryaie et al. 2015). Methods for handling multiple responses are planned for inclusion in future versions of gcmr.

A.1. Specify a new marginal model
The simplest way to specify a new object of class marginal.gcmr is to use one of the marginal distributions available in gcmr as a prototype, such as, for example, the Poisson marginal.

A correlation object of class cormat.gcmr has three components. npar returns the number of dependence parameters in the Gaussian copula correlation matrix; in the specific case of the Matérn correlation model, there is a single dependence parameter tau that describes the degree of spatial dependence. start() is a function that returns the vector of starting values for the dependence parameters; in the specific case of the Matérn correlation model with given shape parameter alpha, the spatial dependence parameter tau is set, arbitrarily, equal to the median distance observed in the data. chol() is a function of the vector of dependence parameters tau in the Gaussian copula correlation matrix and of the vector not.na of indices of the observed data; its output is the Cholesky factor of the Gaussian copula correlation matrix. If the Cholesky factorization fails, then chol() returns NULL.
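The component interface described above can be sketched with a minimal, self-contained correlation object for an exponential spatial correlation. The name exp.cormat and the exact component signatures are illustrative assumptions patterned on the description, not the package source.

```r
# Illustrative sketch of a cormat.gcmr-style object (assumed
# interface: npar, start, chol, as described in the text).
exp.cormat <- function(D) {
    ans <- list()
    # one dependence parameter, tau
    ans$npar <- function() 1
    # starting value: the median observed distance
    ans$start <- function() c(tau = median(D[upper.tri(D)]))
    # Cholesky factor of the correlation matrix for the observed
    # indices not.na; NULL if the factorization fails
    ans$chol <- function(tau, not.na) {
        S <- exp(-D[not.na, not.na] / tau)
        tryCatch(chol(S), error = function(e) NULL)
    }
    class(ans) <- "cormat.gcmr"
    ans
}

set.seed(1)
D <- as.matrix(dist(cbind(runif(5), runif(5))))
cm <- exp.cormat(D)
ch <- cm$chol(cm$start(), not.na = 1:5)
```

Returning NULL on a failed factorization lets the fitting routine reject parameter values that yield an invalid correlation matrix.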

Figure 1: Seizure data. Profile log-likelihood for the interaction between visit and treatment.

Figure 3: Hidden unemployment rate data. Standard diagnostic plots for time series data produced by plot.gcmr.
R> poisson.marg
function (link = "log")
{
    fm <- poisson(link = link)
    ans <- list()
    ans$start <- function(y, x, z, offset) {
        lambda <- coef(glm.fit(x, y, offset = offset$mean, family = fm))
        names(lambda) <- dimnames(as.matrix(x))[[2L]]
        lambda
    }
    ans$npar <- function(x, z) NCOL(x)
    ans$dp <- function(y, x, z, offset, lambda) {
        mu <- fm$linkinv(x %*% lambda + offset$mean)
        cbind(dpois(y, mu), ppois(y, mu))
    }
    ans$q <- function(p, x, z, offset, lambda) {
        mu <- fm$linkinv(x %*% lambda + offset$mean)
        qpois(p, mu)
    }
    ans$fitted.val <- function(x, z, offset, lambda) {
        fm$linkinv(x %*% lambda + offset$mean)
    }
    ans$type <- "integer"
    class(ans) <- c("marginal.gcmr")
    ans
}
<environment: namespace:gcmr>

Function poisson.marg() receives as input the link function, as in glm(), and produces as output a list with several components described below. start() is a function of the vector of responses y, the design matrix x and the offset. Among the inputs there is also the design matrix z for the dispersion, although this argument is unused in the Poisson model, which assumes a constant dispersion. The output of start() is the vector of starting values for the parameters in the marginal model. The starting values are typically computed as if the observations were independent; in the specific case of Poisson marginals, they are obtained with a call to glm.fit(). npar() is a function of the design matrix that returns the number of parameters in the marginal model. In the special case of the Poisson model, this corresponds to the number of mean regression coefficients; if the model also includes regression parameters for the dispersion, these are counted as well.

Function matern.cormat(D, alpha) receives as input the matrix D of the distances between the observations and the shape parameter alpha of the Matérn correlation model. The output is a list with three components, npar, start and chol, described above.
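The dp component, which returns the marginal density f(y) paired with the cdf F(y), is exactly what the Gaussian copula needs for discrete margins: each observation contributes through the Gaussian interval with bounds qnorm(F(y) - f(y)) and qnorm(F(y)). The following self-contained sketch illustrates this contract for the Poisson case (the internal use in gcmr follows Masarotto and Varin 2012; the variable names here are illustrative).

```r
# dp-style output for Poisson margins: density and cdf side by side.
mu <- 2.5
y  <- c(0, 1, 4)
dp <- cbind(dpois(y, mu), ppois(y, mu))  # as returned by ans$dp

# Gaussian copula interval bounds for discrete responses:
# F(y) - f(y) equals F(y - 1), so qnorm(...) gives -Inf at y = 0.
lower <- qnorm(dp[, 2] - dp[, 1])
upper <- qnorm(dp[, 2])

# Each marginal probability is recovered as the Gaussian measure
# of the interval: pnorm(upper) - pnorm(lower) equals dpois(y, mu).
pnorm(upper) - pnorm(lower)
```

In the multivariate discrete case the likelihood integrates the latent correlated Gaussian vector over the product of these intervals, which is the high-dimensional Gaussian integral that the sequential importance sampling algorithm approximates.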

Table 2: Correlation models available in gcmr version 0.7.5. As for the marginals, the user is allowed to construct his/her own correlation structure by specifying a new object of class cormat.gcmr; see Appendix A.2.

Table 3: Functions and methods available for objects of class gcmr.
Traditional standard errors derived from the inverse of the observed Fisher information may be inappropriate for these longitudinal data. A more appropriate choice is provided by the sandwich estimator, which can be computed with the sandwich package (Zeileis 2004, 2006) and conveniently visualized with function coeftest() from package lmtest (Zeileis and Hothorn 2002):

R> summary(mod)

The estimated practical range is about 4.5 km (= 1.51 × 3). There are only 37 pairs of villages less than 4.5 km apart. The inclusion of area in the model yields a large drop in the AIC statistic:
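The sandwich idea can be illustrated in a few lines of base R for an ordinary least squares fit; in practice one would rely on sandwich::vcovHC() or sandwich::sandwich() together with lmtest::coeftest(), as in the text. This is a conceptual sketch with simulated data, not the analysis of the seizure example.

```r
# Conceptual HC0 sandwich estimator for OLS, base R only.
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100, sd = abs(x) + 0.5)  # heteroskedastic errors
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)
bread <- solve(crossprod(X))               # (X'X)^{-1}
meat  <- crossprod(X * e)                  # X' diag(e^2) X
vcov_sandwich <- bread %*% meat %*% bread  # "bread-meat-bread"
sqrt(diag(vcov_sandwich))                  # robust standard errors
```

The robust standard errors remain valid when the model for the variance (here, homoskedasticity) is misspecified, which is the same motivation for using the sandwich estimator with gcmr fits to longitudinal data.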