sparr : Analyzing Spatial Relative Risk Using Fixed and Adaptive Kernel Density Estimation in R

The estimation of kernel-smoothed relative risk functions is a useful approach to examining the spatial variation of disease risk. Though there exist several options for performing kernel density estimation in statistical software packages, there have been very few contributions to date that have focused on estimation of a relative risk function per se. Use of a variable or adaptive smoothing parameter for estimation of the individual densities has been shown to provide additional benefits in estimating relative risk, yet specific computational tools for this approach are essentially absent. Furthermore, little attention has been given to providing methods in available software for any kind of subsequent analysis with respect to an estimated risk function. To facilitate analyses in the field, the R package sparr is introduced, providing the ability to construct both fixed and adaptive kernel-smoothed densities and risk functions, identify statistically significant fluctuations in an estimated risk function through the use of asymptotic tolerance contours, and visualize these objects in flexible and attractive ways.


Introduction
In epidemiological studies it is often of interest to understand the dispersion of some disease within a given geographical region. A common objective in such analyses is to determine the way in which the 'risk' of contracting the disease varies over the spatial area in which the data have been collected. In order to avoid confounding by the underlying population dispersion, it is necessary to obtain not only the disease case location data, but also control data describing this at-risk distribution. By then finding the ratio of estimated case to control densities (Bithell 1990, 1991), the resulting relative risk function is a common tool for describing the spatial variation in disease risk (see for example Kelsall and Diggle 1995b; Wheeler 2007; Clough, Fenton, French, Miller, and Cook 2009).
In these and indeed most other examples, kernel smoothing is used to estimate the densities. Kernel smoothing provides a flexible approach to modeling the highly heterogeneous spatial distributions encountered in problems in geographical epidemiology, as well as an accessible framework for subsequent analysis. To date, the use of a fixed smoothing parameter (that is, a constant degree of smoothing for all observations) has dominated in the literature. A fixed bandwidth is relatively simple to implement and often effective, but can perform poorly with highly heterogeneous populations. For such distributions, the well-known variable (or adaptive) smoothing parameter method of Abramson (1982) has been shown to provide both theoretical and practical benefits with respect to the final density estimate (Abramson 1982; Hall and Marron 1988; Davies and Hazelton 2010). It is not unreasonable to assume that the increased complexity of the adaptive approach over the fixed, as well as the lack of specific computational tools for performing this kind of relative risk estimation, are at least partly to blame for the somewhat limited number of examples in the literature.
We seek to motivate further interest in this specific field by introducing the package sparr (spatial relative risk) for use with the statistical programming environment R (R Development Core Team 2010); freely available on the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=sparr. There already exist a number of packages that can perform bivariate kernel density/intensity estimation in R; see for example spatstat (Baddeley and Turner 2005), ks (Duong 2007), KernSmooth (Wand and Ripley 2010) and spatialkernel (Zheng and Diggle 2011). With the exception of spatialkernel (which itself provides only fixed smoothing), however, none provide the explicit capability to estimate fixed and adaptive relative risk functions. In addition to this functionality, sparr also implements the unique ability to construct asymptotically derived p value surfaces for both fixed (Hazelton and Davies 2009) and adaptive (Davies and Hazelton 2010) relative risk function estimates. Superimposing tolerance contours (based on these p values) at given significance levels upon a plot of a risk function can help to identify sub-regions of statistically significant departures from uniformity of risk, which are often of interest in studies in geographical epidemiology. For a further review of spatial point pattern analysis in R, see Bivand, Pebesma, and Gómez-Rubio (2008).
This work is organized as follows. Section 2 gives a brief overview of the (bivariate) kernel density estimator (for both fixed and adaptive smoothing), discusses correction for boundary bias and outlines estimation of a relative risk function. Section 3 describes the development and use of asymptotic tolerance contours. Section 4 walks the reader through an analysis using sparr and an example dealing with the spatial distribution of primary biliary cirrhosis in a region of northeast England. Concluding remarks discussing the package and its limitations, as well as future goals, are provided in Section 5.

Kernel density estimation and the risk function
Consider n bivariate observations $X_1, X_2, \ldots, X_n$ drawn from an unknown density $f$. Using kernel smoothing, this density may be estimated by $\hat{f}$, given by
\[
\hat{f}(z) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_i^2}\, K\!\left(\frac{z - X_i}{h_i}\right), \tag{1}
\]
where $K$ is the kernel function, typically chosen to be a radially symmetric probability density function, and $h_i$ is the smoothing parameter or bandwidth for the $i$th observation.
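As an illustration, the estimator in (1) can be evaluated directly for a bivariate Gaussian kernel in a few lines of R. This is a naive sketch for exposition only (sparr itself uses much faster methods, discussed later); `kde` is an illustrative name, `X` is an n x 2 matrix of observations and `h` a vector of n bandwidths.

```r
# Naive direct evaluation of (1) at a single location z with a bivariate
# Gaussian kernel; 'kde', X and h are illustrative, not sparr's internals.
kde <- function(z, X, h) {
  d2 <- (X[, 1] - z[1])^2 + (X[, 2] - z[2])^2   # squared distances to z
  mean(exp(-d2 / (2 * h^2)) / (2 * pi * h^2))   # n^-1 sum h_i^-2 K((z - X_i)/h_i)
}
```

Supplying a constant vector for `h` recovers the fixed-bandwidth estimator, while unequal entries give the adaptive version.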
We may replace $h_i$ in (1) by some constant value $h_F$ to give an explicit definition of the fixed-bandwidth kernel density estimator. For the adaptive estimator, we turn to an approach suggested by Abramson (1982) and calculate the varying bandwidths by
\[
h_i = h_0\, f(X_i)^{-1/2}\, \gamma^{-1}, \tag{2}
\]
where $h_0$ is a secondary smoothing multiplier we refer to as the global bandwidth, and $\gamma$ is simply the geometric mean of the $f(X_i)^{-1/2}$ terms, in place to alleviate the dependency of the $h_i$s on the scale of the recorded data. The form of variable bandwidth calculation in (2) is intuitively natural, since the amount of smoothing depends inversely on the local density of the data.
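The bandwidth calculation in (2) is straightforward to sketch in R. Here `pilot.vals` is assumed to hold a (pilot) density estimate evaluated at each observation; the geometric-mean term ensures that the geometric mean of the returned bandwidths is exactly the global bandwidth `h0`, whatever the scale of the data.

```r
# Sketch of the Abramson bandwidths in (2); 'abramson.h' and its arguments
# are illustrative names, not sparr's internals.
abramson.h <- function(pilot.vals, h0) {
  rec.sqrt <- pilot.vals^(-1 / 2)
  gamma <- exp(mean(log(rec.sqrt)))  # geometric mean of the f(X_i)^(-1/2) terms
  h0 * rec.sqrt / gamma
}
```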
In practice we must replace the unknown density in (2) by a pilot estimate. The pilot density $\tilde{f}_{\tilde{h}}$ is itself a fixed-bandwidth kernel density estimate constructed with the pilot bandwidth $\tilde{h}$.
Suppose we now have not one but two sets of bivariate observations, $X_1, X_2, \ldots, X_{n_1}$ and $Y_1, Y_2, \ldots, Y_{n_2}$, representing Cartesian coordinates of (disease) case and control locations respectively. Bithell (1990, 1991) suggests that the relative risk function may be expressed as the ratio of the (unknown) case and control densities $f$ and $g$ respectively. The function $r$ is therefore written as
\[
r(z) = \frac{f(z)}{g(z)}. \tag{3}
\]
In order to symmetrize our treatment of the two densities, Kelsall and Diggle (1995a,b) advocate the use of the log-risk function $\rho = \log r$. Thus, by replacing $f$ and $g$ in (3) by their fixed or adaptive kernel estimates based on the sampled data, we obtain the estimated relative risk function $\hat{r} = \hat{f}/\hat{g}$ and the corresponding log-risk estimate $\hat{\rho} = \log \hat{r}$.
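Given case and control density estimates evaluated over a common grid, the transform in (3) is a one-liner; a minimal sketch (both surfaces assumed strictly positive, names illustrative):

```r
# Estimated log relative risk as the log-ratio of case and control density
# estimates ('fhat' and 'ghat' are assumed positive, on a common grid).
log.risk <- function(fhat, ghat) log(fhat) - log(ghat)
```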
A common consideration in these problems is the fact that the data have been collected over a restricted geographical region $R$. This is important because a portion of the kernel 'contributions' of observations lying near the region boundary can fall outside the defined region. If ignored, this creates a negative bias around the boundary that can adversely affect the resulting density (and risk function) estimate. To reduce this boundary bias to an asymptotically negligible level, we implement the methods described in Kelsall and Diggle (1995a) (fixed) and Marshall and Hazelton (2010) (adaptive), which involve quantifying the proportion of kernel weight left within the boundary. That is, we correct a given density estimate $\hat{f}$ by dividing it at each location $z$ by the value $q_{h(z)}(z)$, given by
\[
q_{h(z)}(z) = \int_R \frac{1}{h(z)^2}\, K\!\left(\frac{z - u}{h(z)}\right) \mathrm{d}u, \tag{4}
\]
where $h(z)$ is the bandwidth at location $z$; for the fixed density estimate, $h(z) = h_F$ everywhere. The R package sparr includes functions to perform the aforementioned tasks: kernel estimation of bivariate edge-corrected fixed and adaptive densities, as well as estimation of a relative risk function. Code examples are given in Section 4.
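To make (4) concrete, the correction factor can be computed in closed form in the special case of a Gaussian kernel and a rectangular region, where the double integral factorizes into two univariate normal probabilities. Real study regions are polygonal and require numerical integration; `q.fixed` is an illustrative name, not sparr's internal routine.

```r
# Edge-correction factor q_h(z) of (4) for a fixed bandwidth h, Gaussian
# kernel and rectangular region [x0,x1] x [y0,y1] (a simplifying assumption).
q.fixed <- function(z, h, x0, x1, y0, y1) {
  (pnorm(x1, mean = z[1], sd = h) - pnorm(x0, mean = z[1], sd = h)) *
    (pnorm(y1, mean = z[2], sd = h) - pnorm(y0, mean = z[2], sd = h))
}
```

Deep inside the region the factor is essentially 1 (no correction), while at a corner roughly three quarters of the kernel mass falls outside and the density estimate is divided by about 0.25.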

Asymptotic tolerance contours
It is often desirable in analyses involving relative risk functions to be able to identify statistically significant fluctuations in the risk itself. For example, we may wish to determine whether or not a given peak in an estimated surface reflects truly heightened risk or is simply a product of random variation. This can be thought of as an upper-tailed hypothesis test: for a given log relative risk function $\rho$, we test
\[
H_0: \rho(z) = 0 \quad \text{versus} \quad H_1: \rho(z) > 0 \tag{5}
\]
at each location $z$. Pointwise p values over the estimated log relative risk function $\hat{\rho}$ can be calculated with respect to these hypotheses. We may then superimpose upon a plot of $\hat{\rho}$ tolerance contours at given significance levels, which highlight any identified 'extreme' sub-regions of elevated risk.
The question arises as to how to calculate these p value surfaces. Until recently, this was done by Monte-Carlo (MC) permutations for fixed risk functions (see Kelsall and Diggle 1995a). However, this approach would be extremely computationally expensive for the adaptive approach, and findings in Hazelton and Davies (2009) suggested that MC tolerance contours are prone to signifying artefactual risk hotspots in areas where there are no data.
An alternative technique for calculating the p value surfaces was outlined in Hazelton and Davies (2009) (fixed) and Davies and Hazelton (2010) (adaptive). These methods use the asymptotic normality of a kernel density estimate (Parzen 1962), together with asymptotic approximations to the variances of the fixed and adaptive risk function estimates, to calculate pointwise test statistics corresponding to a null hypothesis of uniform risk. From Davies and Hazelton (2010), the test statistic at location $z$ for an (edge-corrected) adaptive risk function estimate is given in (6), where $\gamma_\omega$ is the geometric mean of the pilot $\tilde{\omega}(X_i)^{-1/2}$ terms for the adaptive pooled case-control density estimate $\hat{\omega}$, and $h_{0,(f)}$ and $h_{0,(g)}$ are the global bandwidths used for the case and control density estimates respectively; the integral quantity it involves is defined in (7) for any given density $\nu$. In (7), $h(z; \hat{\nu})$ represents the bandwidth at $z$ based on an adaptive density estimate $\hat{\nu}$ with pilot density $\tilde{\nu}$, and $u_1$ and $u_2$ are the first and second components of the coordinate vector $u$ respectively.
These test statistics, interpretable in the usual fashion with respect to a standard normal distribution, yield the asymptotically derived p value surface given a specific alternative hypothesis such as in (5). The resulting asymptotic tolerance contours are far less computationally expensive than their MC counterparts, and appear more stable in sparsely populated areas for the fixed-bandwidth case.
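Since the test statistics are referred to a standard normal distribution, turning a surface of standardized statistics into an upper-tail p value surface is a single pointwise transform; contouring that surface at 0.05 and 0.01 then yields the tolerance contours. A minimal sketch (`p.surface` is an illustrative name):

```r
# Pointwise upper-tail p values under asymptotic normality: Z may be a
# scalar, vector or matrix of standardized test statistics.
p.surface <- function(Z) pnorm(Z, lower.tail = FALSE)
```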
In addition to the capability of estimating fixed and adaptive risk functions, sparr includes the functionality to produce corresponding asymptotic p value surfaces and their tolerance contours, as well as flexible visualization options. These are put to use in the example in the following section.

Code examples
To illustrate the use of sparr we employ a dataset concerning liver disease in a set of adjacent health regions in northeast England. The data, first presented and analyzed by Prince, Chetwynd, Diggle, Jarner, Metcalf, and James (2001) and included in sparr, comprise the locations of 761 cases of primary biliary cirrhosis (PBC), along with 3020 controls randomly selected using weighted postal zones to represent the at-risk population. The following code produces Figure 1.
R> par(mfrow = c(1, 2))
R> plot(PBC$owin, main = "cases")
R> axis(1)
R> axis(2)
R> title(xlab = "Easting", ylab = "Northing")
R> points(PBC$data[PBC$data$ID == 1, 1:2], cex = 0.8)
R> plot(PBC$owin, main = "controls")
R> axis(1)
R> axis(2)
R> title(xlab = "Easting", ylab = "Northing")
R> points(PBC$data[PBC$data$ID == 0, 1:2], pch = 3, cex = 0.8)
R> par(mfrow = c(1, 1))

The goal is to construct an edge-corrected, adaptive (log) relative risk function, and to calculate an appropriate (asymptotic) p value surface searching for elevated PBC risk 'hotspots' as in (5). Prior to commencing, we must decide upon appropriate bandwidths. For fixed relative risk functions, we are required to select two bandwidths, $h_{F1}$ and $h_{F2}$, for the case ($\hat{f}$) and control ($\hat{g}$) density estimates respectively. Often, a 'jointly optimal' smoothing parameter is chosen for both densities when fixed estimates are used, that is $h_{F1} = h_{F2} = h_J$, due to certain resultant theoretical benefits; this reduces the number of distinct values required to one. Adaptive density estimates, on the other hand, require both a global and a pilot bandwidth. In Davies and Hazelton (2010), the authors used a common value for both case and control global bandwidths, but left distinct the two pilot smoothing parameters. We repeat this procedure for this example.
This leaves the issue of precisely how to calculate the required smoothing parameters. Kelsall and Diggle (1995a,b) and Hazelton (2008) suggested methods for selecting a common bandwidth for a fixed risk function (also applicable to a common global bandwidth in the adaptive version), but we have found these approaches to be unreliable in performance and computationally expensive. Davies and Hazelton (2010) instead made use of methods designed for single density estimates, applied to the pooled case/control dataset. They considered the maximal smoothing principle suggested by Terrell (1990), as well as a least-squares cross-validation approach available in the package sm (as summarized in Bowman and Azzalini 1997). Both of these methods have been made available in sparr.

R> n1 <- sum(PBC$data$ID == 1)
These calculations take under two minutes collectively on a typical desktop machine. We are left with our adaptive density estimates based on the pooled dataset, as well as the case and control data separately, represented as objects of class bivden (bivariate density) by sparr. This class is essentially a named list of components describing the density estimate; individual components are accessible via the familiar $ notation.
Simple S3 support in the form of print, summary and plot methods is available for the bivden class. As usual, entering the name of a stored object within the current workspace will invoke the relevant print command, the appearance of which is seen for the 'case' density object below.

R> pbc.case
Bivariate kernel density estimate

We are now ready to calculate a (log) relative risk function, a trivial task in sparr through the use of the function risk. This function produces a named list object of class rrs, for which the same S3 support is available as described above.
Prior to visualization it is of interest to attempt to identify any sub-regions on our risk function estimate that exhibit statistically significant fluctuations in risk, as opposed to 'unimportant' background noise. In our example, we aim to determine whether or not there exist any areas within the study region that correspond to a significantly heightened risk of PBC.
As mentioned in Section 3, the ability to construct asymptotic tolerance contours is present in sparr; this is important given the limitations of MC contours (particularly with respect to an adaptive risk function). These contours are computed with the function tolerance:

R> pbc.tol <- tolerance(rs = pbc.risk, pooled = pbc.pool, test = "upper")

The function provides a commentary (which can be disabled) as it calculates the integral components required for the asymptotic p value surface; the components named K2 and L2 represent the first and second integrals on the right-hand side of equation (7) respectively. We note a computation time of roughly 20 seconds in this instance. The result is a named list holding the x and y coordinates of the evaluation grid, the standardized Z values and the corresponding P values. The p value component can be supplied to the contour function when adding to a heat plot (for example) of the relevant risk function. This is demonstrated in the following code block, which produces Figure 2.

R> plot(pbc.risk, display = "heat", col = heat.colors(12)[12:1],
+      main = "Adaptive log-risk function, PBC",
+      xlab = "Easting", ylab = "Northing")
R> contour(x = pbc.tol$X, y = pbc.tol$Y, z = pbc.tol$P,
+      levels = c(0.01, 0.05), drawlabels = FALSE, add = TRUE,
+      lwd = c(2, 1.5), lty = c(1, 2))

The contours, drawn at significance levels of 0.05 and 0.01, indicate a significant sub-region of elevated risk on the eastern border of the region.
Though perhaps the most informative display type for a risk function when we wish to include tolerance contours, the heat plot is not the only option available when plotting a risk function (or indeed a single bivariate density estimate) in sparr. Using the impressive capabilities of the rgl package (Adler and Murdoch 2011), we may produce an interactive 3-D plot of a density or risk function through the plot.bivden and plot.rrs methods. To produce a 3-D plot of the risk function in Figure 2, we execute the following command:

R> plot(pbc.risk, display = "3d", col = heat.colors(12)[12:1],
+      main = "", xlab = "Easting", ylab = "Northing", aspect = 1:2)

Two screenshots of the resulting plot are given in Figure 3. Holding the left mouse button allows the user to rotate the plot; the right button enables zooming. These plots are an effective way to visualize a surface, as we are able to focus on specific regions that may be of interest and are more difficult to examine with another plot type. Furthermore, the relative magnitude of peaks and troughs can be clearer than in a 2-dimensional image.
The operations sparr can perform would be computationally expensive if one were to evaluate objects such as density estimates and p value surfaces directly. This is especially true when using an adaptive bandwidth. For this reason, various coding techniques and some approximations have replaced what would otherwise be irritatingly slow code. A key tool in accelerating execution time for both fixed and adaptive kernel density estimates in sparr is the powerful Fast-Fourier Transform code used to evaluate edge-corrected fixed-bandwidth density estimates in spatstat (via its function density.ppp). Though this is a trivial translation for bivariate.density to make in the fixed-bandwidth case, more thought is required to utilize the advantageous speed of the spatstat code for adaptive estimation, where we require bandwidths at all possible grid coordinates in order to edge-correct. In this situation, bivariate.density calculates these bandwidths and finds the unique values of the 100 integer quantiles thereof. Each of these unique values is then treated as a fixed bandwidth, and the corresponding fixed-bandwidth edge-correction factors are extracted from the Fast-Fourier Transform-calculated objects. A given coordinate within the study region is then matched by its bandwidth to the nearest appropriate quantile value, and the edge-correction factor corresponding to that bandwidth and coordinate is assigned. This approximation is significantly faster than directly evaluating factors at each distinct coordinate and associated bandwidth, at a negligible cost in accuracy. Nevertheless, the user can disable this Fast-Fourier Transform assistance by setting the bivariate.density argument use.ppp.methods = FALSE.
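The binning step described above can be sketched schematically as follows; the function and component names here are illustrative, not sparr's actual internals.

```r
# Snap a surface of grid bandwidths to the unique values of their 100
# percentile quantiles, so edge-correction factors need only be computed
# once per retained value rather than once per grid point.
bin.bandwidths <- function(h.grid, nbins = 100) {
  qs <- unique(quantile(h.grid, probs = (1:nbins) / nbins, names = FALSE))
  idx <- vapply(h.grid, function(h) which.min(abs(qs - h)), integer(1))
  list(levels = qs, index = idx)  # h.grid[i] is approximated by levels[index[i]]
}
```

With at most 100 distinct bandwidth values, at most 100 fixed-bandwidth correction surfaces are required, however fine the evaluation grid.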
Other approximations of note are made for the integral components of the asymptotic p value surfaces in tolerance, again using the Fast-Fourier methods of spatstat as well as approaches that involve single evaluations of a function, translated to all possible grid coordinates. See the documentation pages for bivariate.density and tolerance for further information.
In addition, it is worth noting that a reduction in computation time may be had by reducing the resolution of the grid over which we estimate the density or p value surface. The default is a 50 by 50 grid, which we consider a suitable minimum resolution that provides serviceable estimates and images while keeping computational costs relatively low. The user may find that a modest increase in resolution is warranted for aesthetic reasons. In the interests of flexibility it is possible, for example, to evaluate a p value surface over a coarser grid than the corresponding risk function density estimates. This allows tolerance contours to be computed relatively quickly and then superimposed over a more aesthetically pleasing image of the risk surface.
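The mixed-resolution idea can be illustrated with a toy surface in base R graphics, entirely independent of sparr: a fine grid for the image of the surface itself, and a coarser grid for the superimposed contours.

```r
# Fine grid for the image, coarse grid for the contours (toy example).
fine <- seq(0, 1, length.out = 128)
coarse <- seq(0, 1, length.out = 32)
f <- function(x, y) exp(-((x - 0.5)^2 + (y - 0.5)^2) / 0.02)  # toy 'risk' peak
z.fine <- outer(fine, fine, f)
z.coarse <- outer(coarse, coarse, f)
image(fine, fine, z.fine, col = heat.colors(12)[12:1], xlab = "x", ylab = "y")
contour(coarse, coarse, z.coarse, levels = 0.5, add = TRUE)
```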
Finally, we mention that the complexity of the polygon describing the study region can, in extreme cases, result in a prohibitive computational cost for density and p value surface estimation. A polygon with many thousands of vertices (e.g., one obtained from a detailed geographical map) will cause delays, particularly when the complex region must be checked by the function multiple times for calculation of edge-correction factors. We recommend that users working with such regions reduce the order of the polygon by, for example, retaining only every kth vertex. Analysis can then be performed on the reduced polygon at a generally negligible cost in accuracy (depending on the magnitude of k). For imaging, the researcher can then display the efficiently computed results using the original geographical map.
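A simple thinning helper of the kind suggested above takes a two-column matrix of boundary vertices and keeps every kth one; `thin.polygon` is an illustrative name, not a sparr function.

```r
# Retain every k-th vertex of a boundary polygon (the first vertex is
# always kept); 'xy' is a two-column matrix of vertex coordinates.
thin.polygon <- function(xy, k) xy[seq(1, nrow(xy), by = k), , drop = FALSE]
```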

Final comments
Though concentrating on one particular approach for analyzing the spatial dispersion of disease, we hope that sparr will prove a useful tool for epidemiologists and researchers working with point process case-control data in geographical epidemiology. The functions in this package are flexible in the types of arguments they accept and the user can tailor their analyses to a varying level of detail. Indeed, by accepting a number of default values, the user is able to directly estimate a relative risk function by supplying the raw data to risk, bypassing the need to run bivariate.density explicitly.
Though there exist several packages for R that perform kernel density estimation, none (to date) provide the capabilities found here, particularly with the use of the variable bandwidth.
With sparr (version 0.2-0 at the time of writing) the user is provided with flexible functions to perform fixed and adaptive bivariate kernel density estimation, including boundary correction with respect to a complex study region, for both smoothing regimens. In addition, the user can calculate a relative risk function and use asymptotic theory to calculate corresponding p value surfaces and tolerance contours. Finally, sparr provides simple-to-use yet powerful approaches to visualizing the results.
The advancements in this package would not be possible without several other important contributions to CRAN; these are reflected as sparr's package dependencies. As already mentioned, rgl by Adler and Murdoch (2011) provides the ability to plot in 3 dimensions and interact with the device. The comprehensive spatial point pattern package spatstat (see Baddeley and Turner 2005) provides functions to enable efficient region handling, as well as the aforementioned Fast-Fourier Transform implementation. Package sm provides access to the LSCV bandwidth selector; named CV.sm here. The base package MASS (see Venables and Ripley 2002) provides utility support for internal functions.
There remains scope for further extensions to sparr. The issue of bandwidth selection is a difficult problem with respect to risk functions, and the somewhat ad hoc approach taken by most authors (present included) in tackling practical applications of the methodology has shifted the focus of sparr to the actual implementation and analysis of a risk function. Though, as mentioned, the package does provide the functionality of two bandwidth selectors for density estimation (OS and CV.sm), sparr generally aims to leave the bandwidth choice(s) for risk functions up to the user.
As almost all practical applications of the methodology use the Gaussian kernel function, this is the only kernel currently supported for the bivariate kernel density estimation in sparr. The infinite tails of this kernel are useful in areas of a given region with sparse observations, where a bounded kernel can result in a discontinuous estimate. Future versions of this package will, however, endeavor to expand the available choices for K.
Finally, it is worth noting the ever-present computational demands of estimating a kernel-smoothed relative risk function. Producing adaptive rather than fixed estimates increases this cost. The size of the dataset, the decision to edge-correct, and the grid resolution also impact execution time. Though sparr has sought to minimize this cost as much as possible for the initial release, it remains up to the user to find an acceptable balance between the aforementioned issues and computing time for their projects.