gcKrig: An R Package for the Analysis of Geostatistical Count Data Using Gaussian Copulas

This work describes the R package gcKrig for the analysis of geostatistical count data using Gaussian copulas. The package performs likelihood-based inference and spatial prediction using Gaussian copula models with discrete marginals. Two different classes of methods are implemented to evaluate/approximate the likelihood and the predictive distribution. The package implements the computationally intensive tasks in C++ using an R/C++ interface, and has parallel computing capabilities to predict the response at multiple locations simultaneously. In addition, gcKrig also provides functions to simulate and visualize geostatistical count data, and to compute the correlation function of the counts. It is designed to allow a flexible specification of both the marginals and the spatial correlation function. The principal features of the package are illustrated by three data examples from ecology, agronomy and petrology, and a comparison between gcKrig and two other R packages.


Introduction
Spatial data arise in numerous scientific disciplines. Following the classification of spatial data in Cressie (1993), we consider here geostatistical data with discrete responses, specifically geostatistical count data. This type of data appears in many applications, such as species counts, disease counts and mineral indicators. The number and variety of models for geostatistical count data are quite limited when compared to the models available for geostatistical continuous data, and hence the statistical software options for the former are also limited. The two main classes of models for the analysis of geostatistical count data are the hierarchical models proposed by Diggle, Tawn, and Moyeed (1998), and the Gaussian copula models proposed by Madsen (2009), Kazianka and Pilz (2010), and Kazianka (2013a). In this work we consider Gaussian copula models.
Likelihood-based inference for Gaussian copula models is computationally challenging, because the likelihood function can only be expressed as a high dimensional multivariate normal integral. The initial strategies to circumvent likelihood computations were to base inference on surrogate likelihoods that are less computationally demanding. Prime examples of these for the analysis of geostatistical count data are the pairwise likelihood (PL) method proposed in Kazianka and Pilz (2010), and the generalized quantile transform (GQT) method proposed in Kazianka (2013a). The former uses bivariate copulas only, a particular case of the composite likelihood method (see Varin, Reid, and Firth 2011 for a review), while the latter uses a continuous approximation based on copula densities. As for software implementation, Kazianka (2013b) implemented both methods in the MATLAB (The MathWorks Inc. 2017) toolbox spatialCopula, and Bai et al. (2014) implemented the pairwise likelihood method in the R (R Core Team 2018) package geoCopula for the analysis of spatially clustered data.
A different approach consists of evaluating the likelihood via Monte Carlo simulation, which is called the simulated likelihood method. An example of this approach is the use of a sequential importance sampling algorithm based on a variant of the so-called Geweke-Hajivassiliou-Keane (GHK) simulator (Geweke 1991; Hajivassiliou, McFadden, and Ruud 1996; Keane 1994). Masarotto and Varin (2012) implemented this approach in the R package gcmr (Masarotto and Varin 2017, 2018) to analyze longitudinal, time series and spatial data using Gaussian copulas; see also Han and De Oliveira (2019). Another example of this approach is to use a quasi-Monte Carlo algorithm to evaluate the likelihood, which is implemented in the R package mvtnorm (Genz and Bretz 2009; Genz, Bretz, Miwa, Mi, and Hothorn 2018). Nikoloulopoulos (2013, 2016) used the latter to make inference about Gaussian copula models with discrete marginals.
The aforementioned packages allow one to carry out some of the main tasks of scientific interest in the analysis of geostatistical count data, but not others. In some applications, predictive inference about spatially varying counts at unobserved locations is a task of equal or greater importance than inference about the model parameters, but except for spatialCopula, these packages carry out inference only about model parameters. Other tasks of scientific interest are the computation of the correlation function of the process of observed counts, and the determination of its restrictions for a given set of marginal distributions. Closed-form expressions for these are usually not available for Gaussian copula models, and the required numerical computations are not implemented in the aforementioned packages.
The R package gcKrig (Han 2018) carries out most of the main tasks of scientific interest for the analysis of geostatistical count data based on Gaussian copula models; it is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=gcKrig. It offers tools that range from simulation and visualization to estimation and prediction. First, gcKrig can compute approximate maximum likelihood estimates and confidence intervals for the model parameters, using either the GHK simulator or the less computationally intensive option based on the GQT method; the latter is appealing for the analysis of moderately large datasets. Second, the package can compute plug-in predictors and prediction intervals at a given set of locations, using either the GHK simulator or the GQT method for the computation of predictive distributions. The prediction tasks have parallel computing capabilities, an appealing feature since prediction is usually sought for a large number of locations. Third, gcKrig can compute the correlation function of the process of counts and the upper bound of this correlation for a given set of marginal distributions (the so-called Fréchet-Hoeffding upper bound), using either a series expansion or a Monte Carlo method. Finally, the package also includes functions for the simulation and visualization of spatial data from Gaussian copula models.
This article provides a detailed description of gcKrig, illustrates its use with three real-world geostatistical count datasets, and ends with a comparison between it and two other R packages, mvtnorm and gcmr, which can also perform some of the tasks implemented in gcKrig.

The Gaussian copula random field
Let {Y(s) : s ∈ D}, D ⊂ R², be a random field taking values on N₀ = {0, 1, 2, ...}, and {F_s(·; ψ) : s ∈ D} be a family of marginal cumulative distribution functions (cdfs) with support contained in N₀ and corresponding probability mass functions (pmfs) f_s(·; ψ), where ψ are marginal parameters. We assume that F_s(·; ψ) is parameterized in terms of its expected value E(Y(s)), which is given by

E(Y(s)) = t(s) g⁻¹(β₁f₁(s) + ··· + β_p f_p(s)),  (1)

where g⁻¹(·) is the inverse of a suitable link function, β = (β₁, ..., β_p) are regression parameters, f(s) = (f₁(s), ..., f_p(s)) are known location-dependent covariates, and t(s) is a known "sampling effort". The cdf F_s(·; ψ) can also depend on other marginal parameters, for example on a dispersion parameter σ² > 0, assumed to be constant in D. Typical examples are the negative binomial and zero-inflated Poisson distributions, for which it holds that

VAR(Y(s)) = E(Y(s)){1 + σ²E(Y(s))};  (2)

see Section 3.2 for the marginal families of distributions implemented in gcKrig.
The Gaussian copula random field Y(·) is defined by the property that for every n ∈ N and s₁, ..., s_n ∈ D, the joint cdf of (Y(s₁), ..., Y(s_n)) is given by

P(Y(s₁) ≤ y₁, ..., Y(s_n) ≤ y_n) = Φ_n(Φ⁻¹(F_{s₁}(y₁; ψ)), ..., Φ⁻¹(F_{s_n}(y_n; ψ)); Ψ_ϑ),  (3)

where Φ(·) is the cdf of the standard normal distribution, Φ_n(·; Ψ_ϑ) is the cdf of the n-variate normal distribution with mean vector 0 and correlation matrix Ψ_ϑ, whose (i, j)th entry is

(Ψ_ϑ)_{ij} = K_ϑ(d_{ij}),  d_{ij} = ||s_i − s_j||,  (4)

with K_ϑ(·) the correlation function of a latent Gaussian random field Z(·) = {Z(s) : s ∈ D} with mean 0, variance 1 and correlation function K_ϑ(d_{ij}). Then the Gaussian copula random field Y(·) can be written as

Y(s) = F_s⁻¹(Φ(Z(s)); ψ),  s ∈ D,  (5)

where F_s⁻¹(·; ψ) is the quantile function defined as

F_s⁻¹(u; ψ) = inf{y ∈ N₀ : F_s(y; ψ) ≥ u},  u ∈ (0, 1).  (6)

Unlike the hierarchical models proposed by Diggle et al. (1998), Gaussian copula models allow the separate modeling of the marginal and dependence structures of the data. The marginals (and hence the mean function) of the random field are modeled explicitly by the family of cdfs {F_s(·; ψ) : s ∈ D}, while the dependence structure is modeled through the family of correlation functions K_ϑ(·), albeit not in an explicit way; see Section 3.5 for the correlation function of Y(·).
Remark. Genest and Nešlehová (2007) caution modelers about some potential limitations of using copula models to describe multivariate discrete data, since many of the properties of these models that hold when the marginals are continuous do not hold when the marginals are discrete. One of these is the uniqueness of the copula that couples the marginal cdfs to produce a given joint cdf; this, in general, may pose an identifiability problem. This problem does not arise in model (3) though, since this model restricts consideration to Gaussian copulas, and for any u₁, ..., u_n the copula Φ_n(Φ⁻¹(u₁), ..., Φ⁻¹(u_n); Ψ_ϑ) is a one-to-one function of Ψ_ϑ.
Likelihood inference

Let y = (y₁, ..., y_n) be the counts observed at the sampling locations s₁, ..., s_n, and let η = (ψ, ϑ) collect the marginal and correlation parameters. From Equation 3, the likelihood function is the n-dimensional normal integral

L(η; y) = ∫_{A₁} ··· ∫_{A_n} φ_n(z; Ψ_ϑ) dz,  (7)

where φ_n(·; Ψ_ϑ) is the pdf of the N_n(0, Ψ_ϑ) distribution and the integration intervals are

A_j = (Φ⁻¹(F_{s_j}(y_j − 1; ψ)), Φ⁻¹(F_{s_j}(y_j; ψ))],  j = 1, ..., n.  (8)

Then the GHK simulator approximates L(η; y) by

L̂_GHK(η; y) = (1/M) Σ_{i=1}^{M} Π_{j=1}^{n} P(Z_j ∈ A_j | Z₁ = z₁⁽ⁱ⁾, ..., Z_{j−1} = z_{j−1}⁽ⁱ⁾),  (9)

where (Z₁, ..., Z_n) ~ N_n(0, Ψ_ϑ), M is the Monte Carlo size, and for each i the draws z_j⁽ⁱ⁾ are generated sequentially from the conditional distribution of Z_j given the previous draws, restricted to A_j, using the standard algorithm for simulating from truncated normal distributions. This method is also implemented in the package gcmr (Masarotto and Varin 2017, 2018). The GHK simulator requires an ordering of the sampling locations, which for geostatistical data is arbitrary. Although empirical experiments suggest that parameter estimates are not sensitive to different orderings, gcKrig provides the option to use the ordering obtained by the algorithm in Gibson, Glasbey, and Elston (1994), which orders the sampling locations based on the impact that their corresponding observations have on the likelihood (integral); see Genz and Bretz (2009) and Ridgway (2016) for details.
Many other approaches have been proposed for the approximation of n-dimensional normal integrals like the one in Equation 7. One of these is the quasi-Monte Carlo approximation implemented in the R package mvtnorm (Genz and Bretz 2009; Genz et al. 2018), used for Gaussian copula estimation by Nikoloulopoulos (2013, 2016). Another is a recent approach that uses the minimax tilting algorithm proposed by Botev (2017); see also Genz and Bretz (2009) for a review of classical approximations. A simulation experiment carried out by Han and De Oliveira (2019), comparing the GHK simulator, the quasi-Monte Carlo approximation and a third method, showed that the GHK simulator provides the best balance between statistical efficiency and computational efficiency.
A different strategy consists of avoiding the calculation of the likelihood in Equation 7 altogether, and instead basing inference on a surrogate likelihood. The gcKrig package also implements the so-called generalized quantile transform (GQT) method proposed in Kazianka and Pilz (2010) and Kazianka (2013a,b), which is (partially) implemented in the MATLAB toolbox spatialCopula. In this strategy the likelihood is replaced by the surrogate likelihood

L̂_GQT(η; y) = [φ_n(Φ⁻¹(ũ₁), ..., Φ⁻¹(ũ_n); Ψ_ϑ) / Π_{j=1}^{n} φ(Φ⁻¹(ũ_j))] Π_{j=1}^{n} f_{s_j}(y_j; ψ),  ũ_j = {F_{s_j}(y_j; ψ) + F_{s_j}(y_j − 1; ψ)}/2,  (10)

that is, each discrete observation is mapped to the midpoint of the corresponding jump of the marginal cdf and the Gaussian copula density is evaluated there. The gcKrig package approximates the maximum likelihood estimates by optimizing the right-hand side of either Equation 9 or 10, using the box-constrained quasi-Newton algorithm implemented in the R function optim(). The asymptotic standard errors are computed in the classical way, by calculating the inverse of the (approximate) log-likelihood Hessian matrix evaluated at the estimates. It should be noted that when the "discreteness" in the data is extreme, e.g., when analyzing binary data, the GQT approximation becomes highly inaccurate (Kazianka 2013a), so parameter estimates and their estimated variances are not reliable. In general we recommend the GHK simulator as the preferred option if the size of the data is not too large and the analysis not too time consuming, leaving the GQT method for moderately larger datasets.

Predictive inference
Let η̂ be the vector of parameter estimates, and s₀ ∈ D a location for which Y(·) is not observed. The (plug-in) predictive distribution of Y(s₀) is defined as

p(y₀ | y; η̂) = P(Y(s₀) = y₀ | Y = y; η̂) = P(Y(s₀) = y₀, Y = y; η̂) / P(Y = y; η̂),  y₀ ∈ N₀,  (11)

where Y = (Y(s₁), ..., Y(s_n)). Once the predictive distribution is computed for all y₀ in the (practical) support of Equation 11, one can compute different optimal predictors, depending on the assumed loss function. Under squared error loss, the optimal predictor is Ŷ(s₀) = E(Y(s₀) | Y; η̂), which is guaranteed to be non-negative, but not integer-valued. According to Jeske (1993), the optimal integer-valued predictor is given by

Ŷ(s₀) = ⌊E(Y(s₀) | Y; η̂) + 1/2⌋,  (12)

where ⌊·⌋ is the floor function, and the uncertainty measure associated with this predictor is

E{(Y(s₀) − Ŷ(s₀))² | Y; η̂} = VAR(Y(s₀) | Y; η̂) + {E(Y(s₀) | Y; η̂) − Ŷ(s₀)}².  (13)

Kazianka and Pilz (2010) and Kazianka (2013a) suggested approximating the numerator and denominator of the predictive distribution using the GQT method, as in Equation 10. The computational speed of this alternative is fast, but the approximation error can be substantial in the case of low marginal variance and strong spatial dependence (Nikoloulopoulos 2016). A more precise alternative is to compute the numerator and denominator of the predictive distribution in Equation 11 using the GHK simulator. The package gcKrig implements both alternatives. Since spatial prediction is usually needed for a large number of locations, gcKrig speeds up the computations with the help of parallel computing techniques using the R package snowfall (Knaus 2015); see Section 3.4 for details.

Correlation function of the Gaussian copula random field
Recall that K_ϑ(d) is the correlation function of the latent Gaussian process Z(·) in Equation 5. The correlation function of the Gaussian copula random field Y(·) is not available in closed form for most families of marginals. The package gcKrig implements the series expansion discussed in De Oliveira (2013) to approximate the covariance (and correlation) function of Y(·). Specifically,

COV{Y(s), Y(u)} = Σ_{k=1}^{∞} k! c_k(s) c_k(u) {K_ϑ(||s − u||)}^k,  (14)

with

c_k(s) = (1/k!) ∫_{−∞}^{∞} F_s⁻¹(Φ(t); ψ) φ(t) H_k(t) dt,  (15)

where φ(t) is the pdf of the standard normal distribution and H_k(t) is the (probabilists') Hermite polynomial of degree k (H₁(t) = t, H₂(t) = t² − 1, ...). Hence, COR{Y(s), Y(u)} can be quickly and accurately approximated by truncating the series in Equation 14 and approximating the coefficients in Equation 15 by numerical quadrature (e.g., using the R function integrate()). Han and De Oliveira (2016) used this expansion to investigate properties of this correlation function, finding it to be flexible in several regards; see Section 3.5 for computational details.

Package functionality
The R package gcKrig offers a number of tools for the analysis of geostatistical count data, ranging from simulation and visualization to estimation and prediction, with the two core functions being mlegc() (for parameter estimation) and predgc() (for spatial prediction).

Data simulation and visualization
The function simgc() simulates geostatistical count data at a given set of m locations from Gaussian copula models with different marginal and correlation structures:

simgc(locs, sim.n = 1, marginal, corr, longlat = FALSE)

The argument locs is a numeric m × 2 matrix or data frame containing the coordinates of the locations, and sim.n is a positive integer indicating the number of simulated datasets. Also, marginal is an object of class 'marginal.gc' indicating the family of marginals, and corr is an object of class 'corr.gc' indicating the family of correlation functions; see Section 3.2. If longlat = TRUE, the great circle distance is used in the correlation function; otherwise the Euclidean distance is used. The function returns a list of class 'simgc' containing the simulated data and the corresponding locations.
To visualize a simulated geostatistical count dataset, an S3 method plot is available for objects of class 'simgc':

plot(x, index = 1, plottype = "Text", xlab = "xloc", ylab = "yloc", xlim = NULL, ylim = NULL, pch = 20, textcex = 0.8, plotcex = 1, angle = 60, col = 4, col.regions = gray(90:0/100), ...)

The argument x is an object of class 'simgc' generated by the function simgc(), index is the index of the simulated dataset, plottype is one of "Text", "Dot" or "3D", denoting the type of plot, and col.regions is the color vector used to represent the magnitude of the observations at the locations. The general strategy for a good visualization is to set a color vector with gradually varying colors. The other arguments have the same meanings as in the standard R function plot(). The function can generate three types of plot, allowing the visualization of different aspects of the dataset; see Figure 1 and its legend.
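As a minimal sketch of simulation and visualization (the constructor calls negbin.gc(mu = , od = ) and matern.gc(range = , kappa = , nugget = ) follow the simulation-mode specification of Section 3.2; their exact argument names are assumptions to be checked in the help pages):

R> library("gcKrig")
R> set.seed(1234)
R> grid <- seq(0.05, 0.95, by = 0.1)
R> locs <- expand.grid(xloc = grid, yloc = grid)
R> sim <- simgc(locs = locs, sim.n = 5,
+    marginal = negbin.gc(mu = 5, od = 1),
+    corr = matern.gc(range = 0.3, kappa = 0.5, nugget = 0))
R> plot(sim, index = 2, plottype = "Dot")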

Classes of marginals and correlations
We describe here how to set the families of marginal and correlation functions needed to specify Gaussian copula models in gcKrig. To make the package flexible, the arguments of gcKrig's functions that specify marginals and correlations are of class 'marginal.gc' and 'corr.gc', respectively. These arguments are needed both in the functions simgc() and corrTG(), which are used to simulate geostatistical data and compute correlations between the non-Gaussian variables, and in mlegc() and predgc(), which are used to make inference about the model parameters and to perform spatial prediction. For simulation of geostatistical count data or computation of correlations between the count variables, the parameters need to be set as arguments in the corresponding functions. For inference about the model parameters and spatial prediction the parameters are not set, as they are estimated from the data; this is the default setting for discrete marginals and all correlation functions, and in this case the link functions are also specified as arguments. When fitting a correlation model, the nugget effect can be estimated (nugget = TRUE) or set to zero (nugget = FALSE), with the former being the default. Also, for some correlation functions a shape parameter kappa needs to be specified. Table 1 lists all the marginals and correlation functions available in gcKrig. The two modes of specification are contrasted in the sketch below.
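A sketch of the two modes (constructor and argument names follow Table 1 and the package help pages, and should be checked there):

R> negbin.gc(mu = 5, od = 1)     # simulation/correlation mode: parameters fixed
R> negbin.gc(link = "log")       # fitting mode: mu and od estimated from the data
R> matern.gc(range = 0.2, kappa = 0.5, nugget = 0.1)  # simulation mode
R> matern.gc(kappa = 0.5, nugget = TRUE)    # fitting mode: range and nugget estimated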
Three families of isotropic correlation functions are implemented in gcKrig: the Matérn family given by

K̄(d) = (1/(2^{κ−1}Γ(κ))) (d/φ)^κ K_κ(d/φ),  φ > 0, κ > 0,

where Γ(·) is the gamma function and K_κ(·) is the modified Bessel function of the second kind and order κ; the power exponential family given by

K̄(d) = exp{−(d/φ)^κ},  φ > 0, 0 < κ ≤ 2;

and the spherical family given by

K̄(d) = {1 − 1.5(d/φ) + 0.5(d/φ)³} 1{0 ≤ d ≤ φ},  φ > 0.

For these families φ is called a range parameter, which controls the rate of correlation decay, and κ is called a shape parameter, which controls the smoothness of the random field.

Estimation
The core function for parameter estimation is mlegc(), which computes either maximum simulated likelihood estimates via the GHK simulator (Masarotto and Varin 2012) or maximum surrogate likelihood estimates via the generalized quantile transform (Kazianka and Pilz 2010):

mlegc(y, x = NULL, locs, marginal, corr, effort = 1, longlat = FALSE, distscale = 1, method = "GHK", corrpar0 = NULL, ghkoptions = list(nrep = c(100, 1000), reorder = FALSE, seed = 12345))

The argument y is a vector containing the response counts, x is a matrix or data frame containing the covariates (x = NULL when no covariates are available), and locs is a matrix or data frame containing the coordinates of the sampling locations. By default these are assumed to be Cartesian coordinates; when longitude and latitude coordinates are used, longlat = TRUE should be set. The argument marginal is a function of class 'marginal.gc' that specifies the marginal distributions of the counts, and corr is a function of class 'corr.gc' that specifies the correlation function defining the Gaussian copula; the current version of gcKrig includes the four families of discrete marginal distributions and the three families of isotropic correlation functions listed in Table 1. The argument effort is a vector containing the sampling efforts associated with the sampling locations; for instance, in the case of binomial marginals these are the "total numbers of trials". They are all set to 1 by default. The argument distscale is an optional scalar factor by which the elements of the distance matrix are multiplied. For example, if the original distance is measured in kilometers, it can be converted into meters by setting distscale = 1000.
The package gcKrig includes two fitting methods. If method = "GQT" is set, the likelihood is replaced by the generalized quantile transform; if method = "GHK" is set, the likelihood is approximated by the GHK simulator. In the latter case the arguments in the list ghkoptions need to be specified: nrep sets the Monte Carlo size, seed sets the seed of the pseudo-random number generator, and reorder controls whether or not an ordering of the locations following the algorithm in Gibson et al. (1994) is computed before each likelihood evaluation. As in the R package gcmr, the argument nrep can be a vector with increasing positive integers, in which case the model is fitted several times with different Monte Carlo sizes.
To optimize the likelihood approximation, gcKrig uses the quasi-Newton algorithm with box constraints implemented in the R function optim(), with method = "L-BFGS-B". The starting values of the marginal parameters are set as the estimates obtained by treating the observations as if they were independent. The starting values of the covariance parameters can be set by the argument corrpar0. By default the starting value of the range parameter is set at half the median of the elements of the distance matrix, and the starting value of the nugget parameter is set at 0.2. When the model is fitted several times with different Monte Carlo sizes, the starting values for each fit are set at the parameter estimates from the previous fit. Once the likelihood function evaluated at the maximum likelihood estimates is available, it is possible to conduct model selection using information criteria such as AIC, BIC or AICc; these are computed by gcKrig and illustrated in Section 4.1. A sketch of a complete fit is shown below.
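For illustration, a sketch that fits the model to the data simulated in Section 3.1 (the element sim$data holding the m × sim.n matrix of simulated counts is an assumption about the 'simgc' structure):

R> y <- sim$data[, 2]    # counts of the second simulated dataset (assumed layout)
R> fit <- mlegc(y = y, x = NULL, locs = locs,
+    marginal = negbin.gc(link = "log"),
+    corr = matern.gc(kappa = 0.5, nugget = FALSE),
+    method = "GHK",
+    ghkoptions = list(nrep = c(100, 1000), reorder = FALSE, seed = 12345))
R> summary(fit)
R> vcov(fit)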
The function mlegc() returns an object of class 'mlegc'. The following S3 methods are available for this class of objects: summary() and print(), which extract the basic summaries, following the format of the R package gcmr (Masarotto and Varin 2017, 2018); plot(), which generates contour plots and 3-D scatter plots of the original data and the fitted mean response; and vcov(), which displays the covariances and correlations between parameter estimates, computed from the inverse of the observed Fisher information matrix.
The package gcKrig computes both Wald-type and likelihood-based confidence intervals for the parameters, where the latter are computed by evaluating the profile likelihood and inverting a likelihood ratio test; see Meeker and Escobar (1995) for details. The likelihood-based confidence intervals are implemented by the S3 method profile for 'mlegc' objects. The endpoints of the profile likelihood confidence intervals are computed not by evaluating the profile likelihood on a selected grid of parameter values, as Masarotto and Varin (2017) do, but by using two runs of the Newton-Raphson algorithm, each with a different starting point (see below). The function call is as follows:

profile(fitted, par.index, alpha = 0.05, start.point = NULL, method = "GQT", nrep = 1000, seed = 12345, ...)

where fitted is the fitted model of class 'mlegc' and par.index is the index of the parameter to be profiled; for example, when par.index = 1, the confidence interval of the intercept parameter is calculated. The argument alpha is a scalar between 0 and 1 denoting the confidence level, and start.point is a numeric vector of length 2 indicating the starting points of the Newton-Raphson iterations; by default these are the endpoints of the Wald-type confidence interval. The argument method specifies the likelihood evaluation method, with choices "GQT" and "GHK". If method = "GHK", the Monte Carlo size and the seed of the pseudo-random number generator are specified by nrep and seed, respectively.

Prediction
The core function for spatial prediction is predgc(), which approximates the predictive distribution in Equation 11 and computes optimal predictors at a set of prediction locations. The prediction task can become computationally intensive, since prediction is usually sought for a large number of locations on a grid covering the region of interest. Noting that the denominator in Equation 11 is the same for all prediction locations and that the numerators can be computed independently of each other, gcKrig assigns the computation of the latter to the available cores of the computer. Although by default R uses only one processor regardless of the number of available cores, gcKrig takes advantage of the parallel computing techniques of the R package snowfall (Knaus 2015). This is carried out by the function predgc():

predgc(obs.y, obs.x = NULL, obs.locs, pred.x = NULL, pred.locs, longlat = FALSE, distscale = 1, marginal, corr, obs.effort = 1, pred.effort = 1, method = "GHK", estpar = NULL, corrpar0 = NULL, pred.interval = NULL, parallel = FALSE, ghkoptions = list(nrep = c(100, 1000), reorder = FALSE, seed = 12345), paralleloptions = list(n.cores = 2, cluster.type = "SOCK"))

where many of the arguments are the same as those of mlegc(). The argument obs.y is the vector of observed counts, obs.x and pred.x are the matrices or data frames of covariates at the sampling and prediction locations, respectively, and obs.locs and pred.locs are the matrices or data frames containing the coordinates of the sampling and prediction locations. The arguments obs.effort and pred.effort are the respective efforts at the sampling and prediction locations. The argument estpar is the vector of parameters used to carry out plug-in prediction. It is set to NULL by default, in which case the function first calls mlegc() to fit the model to the observed counts, and then uses the parameter estimates for plug-in prediction. It is also possible to specify estpar as a numeric vector of preliminary estimates, in which case the model is not re-fitted.
The function predgc() can also compute prediction intervals of two types. One is the "equal-tail" prediction interval, which splits the complement of the confidence level equally between the two tails of the predictive distribution in Equation 11. The other is the "highest probability mass" prediction interval, which includes the values in the support of Y(s₀) with the highest probabilities in Equation 11. In general, the latter type of prediction interval is recommended, since predictive distributions are often highly asymmetric. When prediction intervals are needed, the argument pred.interval is set equal to a number between 0 and 1 representing the confidence level, in which case both types of prediction intervals are computed. By default predgc() does not calculate prediction intervals.
When the argument parallel = FALSE is set, a serial version of the function is called, while parallel = TRUE calls for parallel computation. The argument paralleloptions is a list with elements n.cores and cluster.type that set the number of cores and the type of cluster for parallel computing, respectively. Four cluster types are available: "SOCK", "MPI", "PVM" or "NWS", with "SOCK" being the default. More details can be found in the manual of the snowfall package (Knaus 2015).
The output of the function predgc() is an object of class 'predgc'. The S3 method summary() for 'predgc' objects extracts key summary information, including the prediction locations, the mean of the predictive distribution, the predicted discrete response, the estimated prediction variance and, if pred.interval is provided, the two types of prediction intervals. The S3 method plot for 'predgc' objects works similarly to the previously described method for 'simgc' objects and can generate five plots: a 3-D scatter plot with both observed and predicted counts, and four contour plots displaying the observed counts, the estimated means, the predicted responses and the estimated prediction variances. The usage of predgc() is illustrated with an example in Section 4; a minimal sketch follows.
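Continuing the sketch from Section 3.3, prediction on a finer grid with prediction intervals and parallel computing might look as follows (argument values are illustrative):

R> predgrid <- expand.grid(xloc = seq(0, 1, by = 0.05),
+                          yloc = seq(0, 1, by = 0.05))
R> pred <- predgc(obs.y = y, obs.locs = locs, pred.locs = predgrid,
+    marginal = negbin.gc(link = "log"),
+    corr = matern.gc(kappa = 0.5, nugget = FALSE),
+    method = "GHK", pred.interval = 0.95, parallel = TRUE,
+    paralleloptions = list(n.cores = 4, cluster.type = "SOCK"))
R> summary(pred)
R> plot(pred)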

Computation of correlations
For any two locations s and u, the function corrTG() computes the correlation between Y(s) and Y(u), using either the series expansion in Equation 14 or a Monte Carlo method:

corrTG(marg1, marg2, corrGauss = 0.5, method = "integral", nrep = 1000)

The arguments marg1 and marg2 are the marginal distributions of Y(s) and Y(u), of class 'marginal.gc', and corrGauss is the correlation of the latent Gaussian random field Z(·). If method = "integral", the covariance is computed using the series expansion in Equation 14, truncated at the first term whose magnitude is smaller than the square root of the smallest positive floating-point number (sqrt(.Machine$double.eps)); the coefficients in Equation 15 are computed with the R function integrate(). If method = "mc", a Monte Carlo method is used, which consists of two steps. First, a random sample of size nrep is generated from the bivariate Gaussian distribution with means (0, 0), variances (1, 1) and correlation given by Equation 4, using the function mvrnorm() in the R package MASS (Venables and Ripley 2002). Second, the transformation in Equation 5 is applied to the bivariate Gaussian draws, and the desired correlation is approximated by the sample correlation of the transformed draws. A similar approach was used by Demirtas and Hedeker (2011) to compute the Fréchet-Hoeffding upper and lower bounds.
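For example, the correlation between a Poisson and a negative binomial variable induced by a latent Gaussian correlation of 0.5 can be approximated by either method (a sketch; the constructor names follow Table 1, and corrTG() is assumed to return the correlation value):

R> corrTG(marg1 = poisson.gc(lambda = 2), marg2 = negbin.gc(mu = 2, od = 0.5),
+         corrGauss = 0.5, method = "integral")
R> corrTG(marg1 = poisson.gc(lambda = 2), marg2 = negbin.gc(mu = 2, od = 0.5),
+         corrGauss = 0.5, method = "mc", nrep = 10000)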

Computation of Fréchet-Hoeffding upper bounds
The possible correlations between two non-Gaussian random variables are, in general, a subset of [−1, 1]. The maximum and minimum correlations that are attainable for a given pair of marginal distributions are called the Fréchet-Hoeffding bounds (Nelsen 2006). An important property of Gaussian copula models is that the correlation between two random variables following these models always attains the Fréchet-Hoeffding upper bound when sup{K_ϑ(d) : d > 0} = 1 (Nelsen 2006; Grigoriu 2007). The Fréchet-Hoeffding upper bound can be computed for any pair of marginals with the function corrTG(), by setting the argument corrGauss = 1.
When both marginal distributions are discrete, there is an alternative expression for the Fréchet-Hoeffding upper bound, proposed by Nelsen (1987), that may be faster to compute in some cases. It is given by

FHUB = (1/(σ₁σ₂)) Σ_{x=0}^{∞} Σ_{y=0}^{∞} [min{F̄₁(x), F̄₂(y)} − F̄₁(x)F̄₂(y)],  (21)

where F̄_l(x) = 1 − F_l(x) denotes the survival function and σ_l the standard deviation of the l-th marginal, l = 1, 2. The computation of Equation 21 is faster than the computation of Equation 14 when both marginal means are small. This alternative expression for the upper bound is implemented in the function FHUBdiscrete(), where the summation over a two-dimensional grid is coded in C++:

FHUBdiscrete(marg1, marg2, mu1, mu2, od1 = 0, od2 = 0, binomial.size1 = 1, binomial.size2 = 1)

The arguments marg1 and marg2 are the names of the possible discrete marginals: "poisson", "zip", "nb" and "binomial" (for Poisson, zero-inflated Poisson, negative binomial and binomial marginals, respectively). The argument mu1 is the mean of the first marginal, od1 is its overdispersion parameter when it is negative binomial or zero-inflated Poisson, and binomial.size1 is the number of trials when it is binomial. The other arguments have the same interpretation, but for the second marginal.
Below are two examples. The first, using FHUBdiscrete(), computes the Fréchet-Hoeffding upper bounds for the correlations between a negative binomial random variable with mean 10 and overdispersion 0.2, and negative binomial random variables with means varying from 0.01 to 15 and overdispersion 0.2. The second, using corrTG(), computes the Fréchet-Hoeffding upper bounds for the correlations between a gamma random variable with shape 0.5 and rate 1, and gamma random variables with shapes varying from 0.01 to 15 and rate 1. These bounds are plotted in Figure 2, which shows that in some scenarios the Fréchet-Hoeffding upper bound can be much lower than 1.
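Hedged sketches of the two computations (the gamma constructor gm.gc(shape, rate) is an assumption based on the continuous marginals listed in Table 1, and corrTG() is assumed to return a scalar):

R> nbmu <- seq(0.01, 15, by = 0.05)
R> fhub.nb <- sapply(nbmu, function(m)
+    FHUBdiscrete(marg1 = "nb", marg2 = "nb", mu1 = 10, mu2 = m,
+                 od1 = 0.2, od2 = 0.2))
R> gshape <- seq(0.01, 15, by = 0.05)
R> fhub.gm <- sapply(gshape, function(s)
+    corrTG(marg1 = gm.gc(shape = 0.5, rate = 1),
+           marg2 = gm.gc(shape = s, rate = 1), corrGauss = 1))
R> plot(nbmu, fhub.nb, type = "l", xlab = "mean / shape", ylab = "FHUB")
R> lines(gshape, fhub.gm, lty = 2)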

Integrated datasets
The package gcKrig includes the following real-world datasets, which have been previously analyzed in the literature:

AtlanticFish: This dataset contains fish counts sampled at 119 locations in a mid-Atlantic region of the USA, together with several stream characteristics (covariates) related to water quality; the covariates are standardized to have mean 0 and variance 1. It was analyzed by Johnson and Hoeting (2011) to investigate the relation between the abundance of pollution-tolerant fish and several environmental factors.
LansingTrees: This dataset was obtained by aggregating the lansing dataset in the R package spatstat (Baddeley, Turner, and Rubak 2018), which comes from an investigation of a 924 ft × 924 ft (19.6 acres) area in Lansing Woods, Michigan, USA. The original point process data describe the locations of 2,251 trees and their botanical classification (into maples, hickories, black oaks, red oaks, white oaks and miscellaneous trees). The original plot size has been rescaled to the unit square, and the number of trees of each type has been counted within squares of side length 1/16. See Kazianka (2013a) and Han and De Oliveira (2019) for analyses of this dataset.
OilWell: This is a binary dataset that records the locations of successful and unsuccessful drilling oil wells in the northwest shelf of the Delaware basin in New Mexico, USA. This region is densely drilled in some parts, but also has some sparsely drilled subareas. The original dataset was rescaled to a central area of about 65 square kilometers. See Hohn (1999, Chapter 6).
Weed95: This dataset consists of weed counts and the percentages of organic matter at 270 sampling locations in an agricultural field in Denmark. The weed species Viola arvensis was counted within circular frames of 0.25 square meters each, except at 10 missing sites in the first row. See Christensen and Waagepetersen (2002) for an analysis of this dataset.

Examples
In this section we describe brief analyses of the AtlanticFish, Weed95 and OilWell datasets to illustrate the fitting of spatial Gaussian copula models using gcKrig; Han and De Oliveira (2019) provide an analysis of the LansingTrees dataset. All the analyses were run on a 2015 MacBook Pro with 2.8 GHz Intel Core i7 processor, with 8 threads available for parallel processing.

The AtlanticFish data
The AtlanticFish dataset was analyzed by Johnson and Hoeting (2011) with the goal of identifying important covariates that affect fish abundance and estimating their effects. They fitted a spatial generalized linear mixed model using a Bayesian approach. Out of nine potential environmental covariates, three stood out as having a significant influence on fish abundance. The important covariates and their respective posterior inclusion probabilities were the Strahler stream order (ORDER, 0.883), the percentage of watershed classified as disturbed by human activity (DISTOT, 0.787) and the index of fish habitat quality at the stream site (HAB, 0.738). The covariate with the fourth highest posterior inclusion probability was the watershed area (WSA, 0.426), but no strong evidence was found for a significant effect of WSA on fish abundance.
We fit a spatial Gaussian copula model to this dataset using gcKrig. We restrict attention to the aforementioned four environmental covariates, and seek to assess their impacts on fish abundance. The model assumes a family of negative binomial marginals with the log link function, and an isotropic exponential correlation model with a nugget. We fit two models, one with three covariates (ORDER, DISTOT, HAB), and the other with four covariates (ORDER, DISTOT, HAB, WSA); a sketch of the three-covariate fit is shown after this paragraph. All the regression parameters are significant at level 0.05. As in Johnson and Hoeting (2011), ORDER and HAB are found to be positively associated with fish abundance, while DISTOT is negatively associated. In addition, there is strong evidence of overdispersion in the response variable, and the large estimate of the nugget parameter together with the small estimate of the range parameter (the median distance between sampling locations is 267.39 km) indicate that the spatial association is modest. This analysis took about 23 seconds to run.
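A hedged sketch of the three-covariate fit (the column names abundance, lon, lat, ORDER, DISTOT and HAB are assumptions about the packaged data frame; check str(AtlanticFish)). The Matérn family with kappa = 0.5 gives the exponential correlation function:

R> data("AtlanticFish", package = "gcKrig")
R> Fitfish <- mlegc(y = AtlanticFish$abundance,
+    x = AtlanticFish[, c("ORDER", "DISTOT", "HAB")],
+    locs = AtlanticFish[, c("lon", "lat")], longlat = TRUE,
+    marginal = negbin.gc(link = "log"),
+    corr = matern.gc(kappa = 0.5, nugget = TRUE),
+    method = "GHK")
R> summary(Fitfish)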
After the model is fitted, the 95% confidence intervals for the regression and overdispersion parameters were computed with

R> profile(Fitfish, par.index = 1, method = "GHK")

where the argument par.index is varied from 1 to 5. The Wald-type and profile likelihood-based confidence intervals are displayed in Table 2. The two types of confidence intervals are similar, but the Wald-type confidence intervals are narrower.
We also fitted a Gaussian copula model to this dataset, now adding the environmental covariate WSA to the other three; a sketch of the call is shown after this paragraph. This new fit shows that the effect of WSA is not statistically significant. In addition, model comparisons based on any of the information criteria show that the initial model with the three covariates is preferable. These findings based on a Gaussian copula model are consistent with those reached by Johnson and Hoeting (2011) based on a spatial generalized linear mixed model.
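A hedged sketch of the four-covariate fit (the column name WSA is again an assumption):

R> Fitfish2 <- mlegc(y = AtlanticFish$abundance,
+    x = AtlanticFish[, c("ORDER", "DISTOT", "HAB", "WSA")],
+    locs = AtlanticFish[, c("lon", "lat")], longlat = TRUE,
+    marginal = negbin.gc(link = "log"),
+    corr = matern.gc(kappa = 0.5, nugget = TRUE),
+    method = "GHK")
R> summary(Fitfish2)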

The Weed95 data
The Weed95 dataset was analyzed by Christensen and Waagepetersen (2002) using a generalized linear mixed model, with the goal of investigating the relation between weed occurrence and soil properties. It consists of weed counts and percentages of organic matter at 270 small circular frames within an agricultural field. As the response consists mostly of small counts, generalized linear mixed models may not be able to represent the spatial correlation in these data (De Oliveira 2013). Hence, we fit a Gaussian copula model with negative binomial marginals to the Weed95 dataset, since the Gaussian copula random field has a more flexible correlation structure (Han and De Oliveira 2016); a sketch of the analysis is shown below.
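A hedged sketch of the analysis (the split of the sampling locations into estimation and prediction sites via a dummy column, and the column names xloc, yloc, weedcount and dummy, are assumptions; check str(Weed95); the organic matter covariate could be passed via obs.x and pred.x):

R> data("Weed95", package = "gcKrig")
R> est <- Weed95$dummy == 1        # assumed indicator of the estimation sites
R> predWeed <- predgc(obs.y = Weed95$weedcount[est],
+    obs.locs = Weed95[est, c("xloc", "yloc")],
+    pred.locs = Weed95[!est, c("xloc", "yloc")],
+    marginal = negbin.gc(link = "log"),
+    corr = matern.gc(kappa = 0.5, nugget = TRUE),
+    method = "GHK", pred.interval = 0.95, parallel = TRUE,
+    paralleloptions = list(n.cores = 4, cluster.type = "SOCK"))
R> summary(predWeed)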
To assess the accuracy of the predictions, we compute the two types of 95% prediction intervals described in Section 3.4 at all 200 prediction locations. Out of the 200 locations, the "equal-tail" prediction intervals did not include the actual counts at 11 locations, while the "highest probability mass" prediction intervals did not include the actual counts at 9 locations; so the empirical coverages of these prediction intervals were 94.5% and 95.5%, respectively. The average lengths of the two types of prediction intervals were 2.815 and 2.445. These findings indicate that, for this example, the prediction intervals are accurate, with the "highest probability mass" prediction intervals being slightly preferable, as they are on average shorter.

The OilWell data
The OilWell dataset was analyzed by Hohn (1999) with the goal of assessing oil abundance in a field in the northwest shelf of the Delaware basin in New Mexico, USA. Oil wells were drilled at 333 locations, with oil found at 100 locations (coded as 1) and not found at the remaining locations (coded as 0). The result is a spatial binary dataset with no covariates. The original region with irregular borderlines was rescaled to a central area of about 65 square kilometers. The goal of the analysis is to estimate the probability of oil presence at many unsampled locations, which will help engineers decide where to drill in the future to assess the economic potential of the oil field. Preliminary analyses and information criteria suggested a model without a nugget effect, and showed that an exponential correlation function fits this dataset better than the spherical correlation function and other correlation functions from the Matérn family; this is consistent with the analysis in Hohn (1999) using indicator kriging. We compute the predictive distribution of the response at each location of a 40 × 40 grid covering the study area. Due to the binary nature of the response, the predictive distribution of Y(s₀) is determined by a single number, the conditional probability of oil occurrence given the data. This is computed using gcKrig with parallel computing, using both the GHK and GQT methods. The code sketched after this paragraph computes the probability of oil occurrence at the 1600 locations and the corresponding prediction variances using the GHK method, producing the top graphs in Figures 4 and 5; replacing the argument method = "GHK" with method = "GQT" generates the bottom graphs in Figures 4 and 5. The computation of these predictive distributions at 1600 locations based on 4 cores took about 556 seconds for the GHK method and 172 seconds for the GQT method; these times include both estimation and prediction. In general terms both methods provide similar results: a successful drill is more likely in areas where the majority of the existing drills were successful, and unlikely in areas surrounded by dry wells. Also, the predictive variances are smaller where the prediction locations are closer to sampling locations, especially in areas containing mostly one type of well (successful or unsuccessful). But there are two main differences between the maps obtained by these methods: the map of the probability of oil occurrence obtained from the GHK method is smoother than that obtained from the GQT method, and the predictive uncertainty measures from the GQT method are smaller; the latter are likely to underestimate the true prediction uncertainty. This illustrates the inaccuracy of the GQT method, alluded to in Sections 2.1 and 2.2, when the "discreteness" in the data is extreme.
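A hedged sketch of the computation (the column names Easting, Northing and Success are assumptions about the packaged data frame; check str(OilWell)):

R> data("OilWell", package = "gcKrig")
R> gridOil <- expand.grid(
+    x = seq(min(OilWell$Easting), max(OilWell$Easting), length.out = 40),
+    y = seq(min(OilWell$Northing), max(OilWell$Northing), length.out = 40))
R> predOil <- predgc(obs.y = OilWell$Success,
+    obs.locs = OilWell[, c("Easting", "Northing")],
+    pred.locs = gridOil,
+    marginal = binomial.gc(link = "logit"),
+    corr = matern.gc(kappa = 0.5, nugget = FALSE),
+    method = "GHK",           # replace with "GQT" for the bottom panels
+    parallel = TRUE,
+    paralleloptions = list(n.cores = 4, cluster.type = "SOCK"))
R> plot(predOil)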

Comparison with other packages
In this section we provide a comparison between gcKrig and two other R packages, mvtnorm and gcmr. The package mvtnorm (Genz et al. 2018) approximates general n-dimensional integrals of normal (and t) densities, Equation 7 being a special case, using a quasi-Monte Carlo method; Nikoloulopoulos (2013) used it to make inference about Gaussian copula models with discrete marginals. The package gcmr (Masarotto and Varin 2017, 2018) can fit Gaussian copula models to longitudinal, time series and spatial data, with both continuous and discrete marginals; it uses the GHK method when fitting models to discrete data. The packages are compared by fitting a simulated dataset over the region [0, 1] × [0, 1], with 121 sampling locations forming a regular 11 × 11 grid. The model for the simulated data has negative binomial marginals with mean function exp(1 + 0.5x + y), s = (x, y), overdispersion 1, and correlation function of the latent field exp(−||s − u||/0.3) (no nugget); a sketch of the simulation is shown below.
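A sketch of how such a dataset can be generated with simgc() (assuming negbin.gc() accepts a location-specific mean vector):

R> set.seed(12345)
R> locs11 <- expand.grid(x = seq(0, 1, length.out = 11),
+                        y = seq(0, 1, length.out = 11))
R> simcmp <- simgc(locs = locs11, sim.n = 1,
+    marginal = negbin.gc(mu = exp(1 + 0.5 * locs11$x + locs11$y), od = 1),
+    corr = matern.gc(range = 0.3, kappa = 0.5, nugget = 0))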
When comparing gcmr and gcKrig in terms of other capabilities, we note that gcKrig can perform predictive inference at unobserved locations, as well as computation of the correlation function of the random field of counts and Fréchet-Hoeffding upper bounds, while these tasks are not available in gcmr. On the other hand, gcmr computes a type of residuals relevant for assessing the fit of Gaussian copula models, while gcKrig does not (currently) perform this task.
Finally, we note that the current version of gcmr does not seem to handle adequately correlation functions that include a nugget effect. Although the currently available options for the correlation structure in gcmr do not include a nugget effect in the correlation matrix, the user can write a function of class 'cormat.gcmr' that includes a nugget effect; see Masarotto and Varin (2017). But when we explored this for several simulated datasets, the estimated nugget parameter turned out to be negative for some datasets, mostly when the true nugget parameter was close to zero. This is presumably due to the optimization algorithm used in gcmr not constraining the nugget parameter to be in [0, 1].

Conclusions
In this work we described the use of the R package gcKrig for the analysis of geostatistical count data using Gaussian copulas. The package implements most of the main tasks commonly required in the analysis of this kind of data: simulation and visualization, parameter (point and interval) estimation, spatial (point and interval) prediction, and computation of the correlation function of the process of counts. The inferential tasks rely either on an accurate approximation to the likelihood (the GHK simulator) or on a surrogate likelihood (the GQT method); the latter is appealing for the analysis of moderately large datasets. The package implements the computationally intensive tasks in C++ using an R/C++ interface, and has parallel computing capabilities to predict the response at multiple locations simultaneously. Nevertheless, the more accurate method based on the GHK simulator cannot handle large datasets effectively. Hence, we plan to enhance the package in the future with the implementation of methods to base inference on pairwise likelihoods or other efficient surrogate likelihoods.
Another possible modeling approach is the use of copulas based on graphical models called vines, along the lines described in Panagiotelis, Czado, and Joe (2012). Gräler (2014) and Erhardt, Czado, and Schepsmeier (2015a,b) used this approach for the modeling of geostatistical continuous data, which is implemented in the R packages spcopula (Gräler 2017) and VineCopula (Schepsmeier, Stoeber, Brechmann, Graeler, Nagler, and Erhardt 2018). The application of this approach to the modeling of geostatistical count data seems a promising topic of future research.

A. Specification prototypes
This appendix provides prototypes for users to specify marginals, link functions and correlation functions. We first show how to construct a marginal of class 'marginal.gc' with a user-defined link function. Then, we provide source code as an example of how to specify correlation functions of class 'corr.gc'. These prototypes serve as templates and can be easily modified by interested users to construct user-defined marginals, link functions and correlation functions that are not yet available in the package gcKrig.

A.1. Specification of a new marginal of class 'marginal.gc'
This example outlines how to specify the binomial distribution with a Student-t link function with df.t degrees of freedom as a marginal of class 'marginal.gc'. Such a function can be used as input for the argument marginal in the functions simgc(), corrTG(), mlegc() and predgc(), and serves as a prototype for constructing user-defined marginals and link functions. Its components are described next; a skeleton sketch follows the list.
start is a function of the response vector y, the matrix or data frame of the covariates x and the sampling effort effort. The output of start() is a vector of the starting values for marginal parameters.
bounds is a function of the response vector y, the matrix or data frame of the covariates x, the vector of parameters pars and the sampling efforts effort. This function computes the upper and lower limits of the integral in Equation 7.
pdf is a function of the response vector y, the matrix or data frame of the covariates x, the vector of parameters pars and the sampling efforts effort. This function returns a vector of the marginal probabilities of all observations.
cdf is a function of the response vector y, the matrix or data frame of the covariates x, the vector of parameters pars and the sampling efforts effort. This function returns a vector of the marginal cumulative distribution function evaluated at all observations.

margvar returns the variance of the distribution for the given set of parameters.
int.marg is a function that returns the coefficients in Equation 15.
rsample is a function that generates a random sample of size nrep from this distribution.
q is the quantile function, as defined in Equation 6.
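The skeleton below is a sketch consistent with the component descriptions above; the element names and calling conventions expected by gcKrig's internals are assumptions and should be checked against the source of the built-in marginals (e.g., binomial.gc()). For simulation and correlation computations, the components margvar, int.marg and rsample would be added analogously.

binomial.t.gc <- function(df.t = 5) {
  # Sketch of a 'marginal.gc' object: binomial marginals whose success
  # probability is linked to the covariates through a Student-t cdf,
  # prob(s) = pt(f(s)' beta, df = df.t); 'effort' is the binomial size.
  linkinv <- function(x, pars)
    as.vector(pt(cbind(1, as.matrix(x)) %*% pars, df = df.t))
  start <- function(y, x, effort)
    # starting values: probit GLM fit, ignoring the spatial dependence
    coef(glm(cbind(y, effort - y) ~ as.matrix(x),
             family = binomial(link = "probit")))
  bounds <- function(y, x, pars, effort) {
    # lower and upper limits of the integral in Equation 7
    prob <- linkinv(x, pars)
    cbind(lower = qnorm(pbinom(y - 1, size = effort, prob = prob)),
          upper = qnorm(pbinom(y, size = effort, prob = prob)))
  }
  pdf <- function(y, x, pars, effort)
    # marginal pmf of each observation
    dbinom(y, size = effort, prob = linkinv(x, pars))
  cdf <- function(y, x, pars, effort)
    # marginal cdf of each observation
    pbinom(y, size = effort, prob = linkinv(x, pars))
  q <- function(p, x, pars, effort)
    # quantile function (Equation 6)
    qbinom(p, size = effort, prob = linkinv(x, pars))
  structure(list(start = start, bounds = bounds, pdf = pdf, cdf = cdf, q = q),
            class = "marginal.gc")
}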

A.2. Specification of a new correlation of class 'corr.gc'
This example provides source code for specifying the power exponential correlation function as an object of class 'corr.gc', which can be used as input for the argument corr in the functions simgc(), mlegc() and predgc(). It serves as a prototype for constructing different correlation functions for geostatistical data simulation, model fitting and spatial prediction.
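A sketch of such an object follows; the element names (npar.cor, start, corr) are assumptions modeled on the descriptions in this paper, and should be checked against the source of the built-in families (e.g., matern.gc()).

powerexp.user.gc <- function(kappa = 1, nugget = TRUE) {
  # Sketch of a 'corr.gc' object for the power exponential family,
  # K(d) = (1 - tau2) * exp(-(d/phi)^kappa) for d > 0 and K(0) = 1.
  corr <- function(corrpar, D) {
    # D is the matrix of pairwise distances; corrpar = (phi) or (phi, tau2)
    phi <- corrpar[1]
    tau2 <- if (isTRUE(nugget)) corrpar[2] else 0
    S <- (1 - tau2) * exp(-(D / phi)^kappa)
    diag(S) <- 1   # unit variance on the diagonal
    S
  }
  start <- function(D)
    # default starting values, as described in Section 3.3
    c(phi = median(D[upper.tri(D)]) / 2, if (isTRUE(nugget)) c(tau2 = 0.2))
  structure(list(npar.cor = if (isTRUE(nugget)) 2 else 1,
                 start = start, corr = corr, nugget = nugget),
            class = "corr.gc")
}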