SPODT: An R Package to Perform Spatial Partitioning

Spatial cluster detection is a classical question in epidemiology: Are cases located near other cases? In order to classify a study area into zones of different risks and determine their boundaries, we have developed a spatial partitioning method based on oblique decision trees, which is called spatial oblique decision tree (SpODT). This non-parametric method is based on the classification and regression tree (CART) approach introduced by Leo Breiman. Applied to epidemiological spatial data, the algorithm recursively searches among the coordinates for a threshold or a boundary between zones, so that the risks estimated in these zones are as different as possible. While the CART algorithm leads to rectangular zones, providing perpendicular splits of longitudes and latitudes, the SpODT algorithm provides oblique splitting of the study area, which is more appropriate and accurate for spatial epidemiology. Oblique decision trees can be considered as non-parametric regression models. Beyond the basic function, we have developed a set of functions that enable extended analyses of spatial data, providing: inference, graphical representations, spatio-temporal analysis, adjustments on covariates, spatial weighted partition, and the gathering of similar adjacent final classes. In this paper, we propose a new R package, SPODT, which provides an extensible set of functions for partitioning spatial and spatiotemporal data. The implementation and extensions of the algorithm are described. Function usage examples are proposed, looking for clustering malaria episodes in Bandiagara, Mali, and samples showing three different cluster shapes.


Introduction
Spatial cluster detection is a classical question in epidemiology: are cases located near other cases? Among various approaches, general methods allow us to detect high risk zones of unspecified locations within a study area, without specifying any a priori point source (Colonna, Esteve, and Menegoz 1993;Elliott, Martuzzi, and Shaddick 1995;Wakefield, Quinn, and Rabb 2001;Waller and Gotway 2004;Chirpaz, Colonna, and Viel 2004;Gaudart, Ramatriravo, and Giusiano 2006b). Global detection methods, such as Moran's or Tango's ones (Tiefeldorf 2002;Tango 2002), test a statistic estimated over the entire study area, whereas local detection methods, such as Anselin's or Kulldorff's ones (Anselin 1995;Kulldorff 1997), test several statistics estimated over distinct zones within the study area. By scanning the study region with a circular or elliptic window, the SaTScan algorithm (Kulldorff 1997) compares observed and expected cases, inside and outside each potential cluster. It has the advantage of not depending on the underlying spatial architecture, although the choice of windowing is often critical and sensitive to edge effects (Gregorio, Samociuk, DeChello, and Swede 2006). These methods are also sensitive to geographical constraints, such as rivers, mountains, seas, or walls and corridors for outbreaks in buildings (e.g., healthcare-associated infections, or legionellosis).
We have introduced a spatial partitioning method based on oblique decision trees, called spatial oblique decision tree (SpODT), in order to classify a study area into zones of different risks and determine their boundaries, while being less sensitive to edge effects (Gaudart et al. 2006b). This non-parametric method is based on the classification and regression tree (CART) approach introduced by L. Breiman (Breiman, Friedman, Olshen, and Stone 1993). Beyond the basic function, we have developed a set of functions for an extended analysis of spatial data, providing: inference, graphical representations, spatio-temporal analysis, adjustments on quantitative covariates, spatial weighted partition, and the gathering of similar adjacent final classes.
In this paper, we propose a new package SPODT for R (R Core Team 2014) which provides an extensible set of functions for partitioning spatial and spatio-temporal data. The implementation and extensions of the algorithm are described and function usage are proposed based on a field observation datafile (malaria episodes in Mali) (Coulibaly et al. 2013) and samples showing three different cluster shapes. The results are compared to the CART approach using the tree package (Ripley 2014). All results were obtained using R 3.1.0 (Windows 7, Intel Core i7, CPU Q820 @1.73GHz, 64-bit). The SPODT package is freely available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=SPODT.

Basic algorithm
This non-parametric method is based on classification and regression tree (CART) (Breiman et al. 1993;Crichton, Hinde, and Marchini 1997;Gaudart, Poudiougou, Ranque, and Doumbo 2005). For each covariate, the CART algorithm searches for the threshold to split the covariate space into two classes, which optimizes a defined criteria (such as interclass variance). Then, the CART algorithm pursues recursively the binary partition of the covariate space, reaching stopping rules. Applied to epidemiological spatial data, the CART algoritm searches among the planar coordinates {x i , y i } (of each location M i ) for a threshold or boundary between two spatial classes (two geographic zones), so that the risks estimated in these two classes are as different as possible (maximum interclass variance or sum of squared errors SSE inter ). The algorithm then continues splitting recursively each of these two classes, and stops when reaching stopping rules. The root of the resulting regression tree is the entire study area. The final classes are sub-classes splitting the whole study area. Regression trees estimate changing lines of a constant function in each class of R 2 (Gey 2002), interpreted as boundaries between zones (spatial classes) of different risks. However, the CART algorithm leads to rectangular classes (Murthy, Kasif, and Salzberg 1994;Cantu-Paz and Kamath 2003), providing perpendicular splits of the projected longitudes and latitudes. The SpODT (spatial oblique decision tree) algorithm (Gaudart et al. 2005) is a modification of the CART algorithm providing oblique splitting of the study area, which is more appropriate and accurate for spatial epidemiology. Oblique decision trees can be considered as non-parametric regression models. The functional form can be written as: where {x i , y i } are the planar coordinates of each point location M i , i = 1...N , and ε i ∈ R.
These coordinates have to be euclidean coordinates in case of small area (e.g., hospital wards, rooms within buildings) or projections of geographical coordinates. Note that the use of non projected geographical coordinates may lead to erroneous results. The function f (x i , y i ) can be written as: where class j , for j = 1, ...P , are the final P classes after splitting the whole study area; z j = 1 N j M i ∈class j z i is the mean of observed values at N j locations M i ∈ class j. In other words, for each point location M i belonging to a class j, the predicted risk will be z i =z j ± ε i .
The main problem is to determine the class set {class j , j = 1, ..., P }. Boundaries between classes are linear functions s j (x i , y i ) of the planar coordinates (ax i + by i + c = 0). These boundaries, or splitting directions, are recursively determined for each location sample, also called node ξ, corresponding to the whole study area at the beginning of the algorithm, or corresponding to a zone (geographical class) issued from a previous split. This node ξ is split into two classes by the partition direction s j (x i , y i ). If s j (x i , y i ) < 0, then the location M i will belong to the left "child" class jl of the tree. If not, the location M i will belong to the right "child" class jr. For each node ξ constituted by a set of n(ξ) locations, the algorithm searches, among the S set of every linear functions of (x i , y i ), for the function s j (x i , y i ) such as: We have shown (Gaudart et al. 2005;Fichet, Gaudart, and Giusiano 2006) that S, the set of every linear functions splitting a finite set of points in R 2 , is a finite set. There are an infinite number of lines splitting a set of points into two sub-sets. However, several lines lead to the same classification, splitting the point set identically. Therefore, the algorithm has to identify the possible lines to analyze only once each separate partition. For that purpose, the algorithm uses properties related to the order of abscissas of the points to be split, after rotation of the x-axis. Then, the algorithm performs vertical splitting of images of the x-axis  for each rotation. To determine the angles of these rotations, critical angles associated to each pair of points are defined. They allow to define angular sectors within which the image of the axis preserves the order of the point abscissas. Indeed, during a rotation center O of the x-axis, the order of the point abscissas can be changed. For two points M 1 and M 2 , the critical angle θ 12 , associated with the pair (M 1 , M 2 ), defines the minimum angle of rotation to be applied to the x-axis so that points M 1 and M 2 have their abscissas u 1 and u 2 permuted ( Figure 1 and Figure 2). During the passage of the x-axis image from an angular sector to the next, only the points associated to the critical angle, formed by the line delimiting the two angular sectors, have their abscissa order changed. The algorithm splits the plane perpendicularly to x-axis and x-axis images after rotations. Thus, permutations in the abscissa order scan the interval [0, π[, and characterize distinct splits that will be tested to maximize the interclass variance of the generated classes.
After splitting the initial set into 2 classes, the algorithm continues recursively. The number P of final classes (or zones) is recursively defined by the number of terminal nodes of the regression tree, after reaching stopping criteria.
A node ξ is a terminal node if one of the following criteria is reached: , a new partition will not explain enough variance; where R 2 ξ is the explained variance calculated over the split of a node ξ and R 2 c is the minimal explained variance (fixed by the user).
2. n(ξ) ≤ n c1 , where n(ξ) is the size of node ξ and n c1 is the fixed minimal size of a node below which the splitting algorithm is stopped (fixed by the user).

The maximal number of tree levels (fixed by the user).
Once the oblique regression tree is obtained (partition of the entire area into spatial classes of different risks), the main feature of this model is the overall variance explained in the dependent variable by the terminal classification, R 2 global , defined as the ratio of the sum of squared deviations between classes (calculated on the overall terminal classes) to the total sum of squares.
This approach, defined as a general method detecting spatial clusters, can be interpreted either as a global assessment of a spatial structure, or as a local analysis producing a map of the response variable.

Program developments
We have developed different R functions for a complete analysis of spatial data, according to our method. On the basis of the basic algorithm, several extensions have been developed: Spatio-temporal analysis: Integration of splits of a time covariate. The statistical unit is then defined by planar coordinates and a date. On an unique location different values can be observed at different dates. As CART algorithm, SpODT algorithm can thus provide a spatial splitting or a temporal splitting.
Adjustments: Following the same procedure, the SpODT algorithm can provide a classification of different quantitative covariates. For these covariates, the standard CART algorithm is applied (i.e., no oblique split is performed).
Gathering similar adjacent final classes: This option makes possible to gather similar adjacent classes at the end of the recursive splitting algorithm. Indeed, because of the recursiveness of the algorithm, the left branch of the tree ignores the right branch and conversely. This can lead to a final classification with similar adjacent classes, only separated because of the recursion. In this approach, the global R 2 global is calculated after grafting these two adjacent classes, and this grafted new classification is kept if this new global R 2 global is not sufficiently different from the previous one (without grafting classes).
Weighting the classification criterion: In the basic SpODT algorithm, the calculation of the interclass variance doesn't take into account the child class sizes nor the spatial distribution of the locations within each child class. However, a class is all the less important in the analysis as its size is small and its locations are dispersed. We have then developed a weighted sum of squared error The weight function α j has to be a continuous non-decreasing bounded function of the size n(class j ) (size of the class j ∈ 1, 2) and the spatial dispersion δ j . The weight function actually proposed is where δ j = det(V j ) and V j is the variance-covariance matrix for each class j.
Inference: A "test" function has been developed in order to test the final SpODT classification using a Monte-Carlo approach. This test function simulates a specified number of data sets under a specified null hypothesis conditionally to the location, and the spodt function provides a classification tree for each of the simulated data set. The empirical distribution of the global R 2 global under the null hypothesis is obtained and, then, the test function provides a p value.

Basic function
The spodt function performs the classification of the data set.

Arguments:
z~1: A formula, using the formula function, with a response but no interaction terms. The left hand side has to contain a quantitative response variable (numeric). The right hand side should contain the quantitative and qualitative variables to be split according to a non oblique algorithm (e.g., z~V1 + V2). For single spatial analysis (with no cofactor) the right hand side should be z~1.
data: A SpatialPointsDataFrame containing the coordinates and the variables. SpODT functions need planar coordinates. Geographic coordinates have to be projected. Otherwise, euclidian coordinates can be used (for small area analysis such as rooms within buildings).
weight: A logical value indicating whether the interclass variances should be weighted or not.
graft: A numerical value between 0 and 1 indicating the minimal modification of R 2 global required to graft the final classes. If graft = 0 the algorithm will not graft any adjacent classes.
level.max: The maximal level of the regression tree above which the splitting algorithm is stopped.
min.parent: The minimal size of a node below which the splitting algorithm is stopped (n c1 ).
min.child: The minimal size of the children classes below which the split is refused and the algorithm is stopped (n c2 ).
rtwo.min: R 2 c , the minimal value of R 2 ξ above which the node split is refused and the algorithm is stopped. Specified as a numerical value between 0 and 1.
Value: The spodt function computes an object of class spodt with the different components of the classification tree, i.e., i) at each step: the point locations within each class, R 2 ξ , coefficients of the splitting line; ii) global results: the global R 2 global (object@R2), the final partition (object@partition) including the graft results.

spodt.tree(object)
This graphical function provides the tree issued from the spodt function. Each step of the classification is presented with main statistics. object is an object of class spodt, usually a result of a call to spodt. For graphical convenience, grafted classes are not presented but only indicated by their id number.

spodtSpatialLines(object, data)
This function provides the SpatialLines object (see the package sp, Bivand, Pebesma, and Gómez-Rubio 2013) that contains the boundaries of the spatial classification issued from the spodt function. object is an object of class spodt, usually a result of a call to spodt. data is the initial SpatialPointsDataFrame containing the planar coordinates and the variables. The SpatialLines object obtained can be used, for example to obtain maps.

Hypothesis testing
The test.spodt function provides a Monte Carlo hypothesis test of the final classification issued from the spodt function. This function performs simulations of the specified null hypothesis and the classification of each simulated data set, using the same rules as the observed data set classification.

Arguments:
z~1: A formula, such as in the spodt function, with a response but no interaction terms. The left hand side has to contain a quantitative response variable (numeric). The right hand side should contain the quantitative and qualitative variables to be split according to a non oblique algorithm (e.g., z~V1 + V2). For single spatial analysis (with no cofactor) the right hand side should be z~1.
data: A SpatialPointsDataFrame containing the coordinates and the variables. SpODT functions need planar coordinates. Geographic coordinates have to be projected. Otherwise, euclidian coordinates can be used (for small area analysis such as rooms within buildings).
obs.R2: The global R 2 global issued from the previous spodt final classification of the observed data set. Specified as a numerical value between 0 and 1.
rdist: A description of the distribution of the dependent variable under the null hypothesis. This can be a character string naming a random generation of a specified distribution, such as "rnorm" (Gaussian distribution), "rpois" (Poisson distribution), "rbinom" (binomial distribution), "runif" (uniform distribution), . . .

nb.sim:
The number of simulations, specified as a positive integer.
weight, graft, level.max, min.parent, min.child, rtwo.min: These arguments have to be specified, similarly to the previous spodt classification of the observed data set. . This local heterogeneity is driven by a variety of factors including distance to breeding sites, housing constructions and socio-behavioral characteristics (Koram, Bennett, Adiamah, and Greenwood 1995;Coleman, Mabuza, Kok, Coetzee, and Durrheim 2009;Ernst et al. 2009). The study was conducted in Bandiagara, Mali, following a cohort of 300 children, at 168 locations. The household of each child was geo-located (decimal degrees). Approval from Institutional review boards at the Faculty of Medicine, Pharmacy and Dentistry of the University of Mali, community approval and written informed consents from parents were obtained before inclusion (see Coulibaly et al. 2013, for further details). We applied SPODT functions to classify the entire area into different risk zones with homogeneous number of malaria episodes per child at each household, from Figure 3: Classification tree (spodt.tree(object)) of malaria episodes in Bandiagara, Mali. This classification was obtained by using the SPODT package. Each node (excluding terminal nodes) is presented with its id number, mean, variance and local R 2 ξ after splitting, as well as the function of the splitting line. Each terminal node is presented with its id number, number of locations, mean and variance.
The non-grafted tree (Figure 3) showed 12 final classes with different risks before grafting ( Figure 6). Adjacent classes were grafted according to the graft criteria described in Table 1, which finally provides 6 classes, with R 2 global = 0.49 (given by spodt.results@R2). This result shows that spatial variations can explain an important part of the malaria risk variability, although other factors remain such as behaviors, genetic, personal medical history, household characteristics etc. The spatial classification ( Figure 4) highlighted a central low risk cluster (class id 109) with a mean malaria episode of 0.08 per child (95% confidence interval, CI[0.04-0.11]) (Table 2), with a polygonal and asymmetric shape. Around this low risk cluster, the mean malaria episodes per child was higher (0.47 [0.39-0.55]). Note that there is a pond in the north of the city and a river in the south, which are breeding sites for malaria transmission mosquitoes (Coulibaly et al. 2013). The remaining zone showed an alternation of high and low risk clusters.
The test of the tree algorithm was performed using 99 simulated samples following a Poisson distribution and with the same criteria as previously, such as follows (results were obtained in 28.46 seconds): R> test.spodt(z~1, data = dataMALARIA, spodt.results@R2, "rpois", + c(length(dataMALARIA@data$loc), mean(dataMALARIA@data$z)), 99, + weight = TRUE, graft = 0.13, level.max = 7, min.parent = 25, + min.child = 2, rtwo.min = 0.01) With a p value of 0.01, the classification obtained by the spodt function was significantly different from a homogeneous spatial distribution of malaria episodes ( Figure 5).  Among the different tuning parameters of the spodt function, level.max, min.parent, min.child and rtwo.min are similar to those of the tree package, and have to be chosen similarly to CART approaches. In the SPODT package, as we have introduced a gathering option, a graft tuning parameter has been added. In order to assess the sensitivity of the SpODT algorithm to this option, we ran it with different values of graft ranging from 0.0 to 1 (with a step of 0.001), the other tuning parameters being fixed as previously. We also assessed the sensitivity of the SpODT algorithm to rtwo.min values, running the algorithm   with values ranging from 0.0 to 1 (with a step of 0.001), the other tuning parameters being fixed as previously (graft = 0.13).
The R 2 global obtained ranged from 0.6 (12 final classes) to 0.003 (2 final classes), showing a step decrease of the number of classes (Table 3) when graft increased. When rtwo.min increased, the algorithm stopped rapidly with no classification (Table 4). Choice of the tuning parameters has thus to be made between no classes and too many classes, such as for CART approaches. From a practical point of view, together with field knowledge, the number of final classes, the R 2 global and the test procedure provided by this package can guide the user in this choice. Note that the choice of a deep tree will be corrected by the graft parameter.
The results were compared to the CART approach, using the tree package, tuning parameters being set as follows: mincut = 5, minsize = 10, mindev = 0.01. The CART approach showed a less accurate classification with 16 final classes (Table 5 and Figure 7). A central low risk cluster was also detected (class id 27) as well as the alternation of high and low risk    clusters in the South, but this approach failed to detect the polygonal shape and to gather similar adjacent classes (Figure 8). From an epidemiological point of view, numerous small classes is not very useful in this context. Note that changes in the tuning parameters did not change the global interpretation of the results. In the case of a greater mindev value (e.g., > 0.0134), the central low risk cluster was not detected (data not shown).

Different cluster shapes and levels
We assessed the SPODT functions analyzing three different situations, and in comparison to the CART algorithm (tree package). The following situations have been studied: Clustered data with a high level within a centered rotated square, and a low level outside.
Clustered data with a low level inside a centered ball shape, and a high level outside. Clustered data with a high level under a "V" shape border, and a low level above.
A dependent variable following a Gaussian distribution with a constant variance (0.09) and a constant mean for the two level zones: µ 1 = 1 for the low level zone, µ 2 = 1 + β for the high level zone. For each situation, we used four samples: β = 0 (no cluster), β = 0.5, β = 1.5 and β = 2.   As planar coordinates were used, no projection were applyed to the SpatialPointsDataFrame. This provides a warning message when using spodt and test.spodt functions.
For both SpODT and CART approaches, default tuning parameters were used, except for graft = 0.2 (SpODT algorithm). Changing these parameters did not greatly change the interpretation of the comparisons.
Rotated square shape situation: The SpODT algorithm did show the central cluster even for low values in the high level cluster (Figure 9, β = 0.5, β = 1.5 and β = 2 panels). But the obtained shape was only approximatively a rotated square. A contrario, the shape obtained with the CART algorithm was accurate only for higher values (β = 2), but showed no rotated square ( Figure 10).   Ball shape situation: The SpODT and the CART algorithms failed to precisely detect this particular form, but precisely located square clusters (Figures 12, 13, β = 0.5, β = 1.5 and β = 2 panels). Again, CART failed to detect only two levels: it detected few classes in the high level zone only for β = 2.
"V" shape border : The SpODT algorithm detected a very accurate border even for low values in the high level zone (β = 0.5). The CART algorithm failed to detect such a particular shape. Nevertheless, it showed lower values in the north, higher values in the south, and a mitigate central band (with numerous different classes).

Spatial partition with a time covariate
A sample was build that concatenates 6 different situations: 2 rotated square situations (β = 2 and β = 1.5), 2 "no cluster" situations (β = 0), and two "V" shape situations (β = 2 and β = 1.5), which thus form a numeric time covariate (1 unit of time up to 6). The spodt function was used to provide a classification of the area, including this time covariate, with a weighted classification criteria, a maximum of 5 tree levels, a minimal parent size of n c1 = 10, a minimal child size of n c2 = 5, a minimal R 2 c = 0.001, and a grafting option of graf t = 0.2. The function can be written as follows: R> data("dataCOV") R> coordinates(dataCOV) <-c("x", "y") R> spodt.results.cov <-spodt(z~V1, data = dataCOV, weight = TRUE, + graft = 0.2, level.max = 5, min.parent = 10, min.child = 5, + rtwo.min = 0.001) The non-grafted tree (Figure 20), provided by the SpODT algorithm, showed 16 final classes, with 2 time splits: less than 2 and less than 5. These 3 time periods was related to 3 situations: rotated square with high values (β = 2), "no cluster" or rotated square with medium values(β = 1.5), and "V" shape situation (β = 1.5 and β = 2). The graft option led to two main classes (Figure 18), a high level zone in the South and a low level zone in the north, which highlight the impact of the "V" shape situation in this exemple (more locations showing high values in this part of the area at this period). The CART algorithm provided a similar tree with the same time splits (Figure 19), but 15 different spatial classes.

Conclusion
Among the different tools used dedicated to spatial classification (e.g., Assuncao, Neves, Camara, and da Costa Freitas 2006;Oden, Sokal, Fortin, and Goebl 1993), the proposed SPODT package provides a classification of a spatial area based on the spatial variability of a dependant variable. Space splitting can be oblique and this classification can be adjusted on covariates and gather similar adjacent classes. Associated functions (spodt.tree and spodtSpatialLines) are useful for graphical representations of the classification, and the spodt.test function provides a test of the oblique decision tree algorithm. SPODT package is provided with a real example set of malaria cases observed in Mali. Using this set and others, SpODT detected spatial and spatio-temporal clusters more accurately than the CART algorithm in all performed comparisons.