R Package gdistance : Distances and Routes on Geographical Grids

The R package gdistance provides classes and functions to calculate various distance measures and routes in heterogeneous geographic spaces represented as grids. Least-cost distances as well as more complex distances based on (constrained) random walks can be calculated. Also the corresponding routes or probabilities of passing each cell can be determined. The package implements classes to store the data about the probability or cost of transitioning from one cell to another on a grid in a memory-eﬃcient sparse format. These classes make it possible to manipulate the values of cell-to-cell movement directly, which oﬀers ﬂexibility and the possibility to use asymmetric values. The novel distances implemented in the package are used in geographical genetics (applying circuit theory), but also have applications in other ﬁelds of geospatial analysis.


Introduction: The crow, the wolf, and the drunkard
This article describes gdistance (van Etten 2017), a package written for use in the R environment (R Core Team 2016). It provides functionality to calculate various distance measures and routes in heterogeneous geographic spaces represented as grids. Distance is fundamental to geospatial analysis (Tobler 1970). It is closely related to the concept of route. For example, take the great-circle distance, the most commonly used geographic distance measure. This distance represents the shortest line between two points, taking into account the curvature of the earth. Implicit in this distance measure is a route. The great-circle distance could be conceived of as the distance measured along a route of a very efficient traveller who knows where to go and has no obstacles to deal with. In common language, this is referred to as distance "as the crow flies".

neighbours
q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 8 neighbours q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 16 neighbours q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Figure 1: Three commonly used ways to generate neighbor graphs for distance calculations.
analyses are done with calculations using cost, friction or resistance values. In graph theory, weights can also correspond to conductance (1/resistance), which is equivalent to permeability (a term used in landscape ecology). The weights can also represent probabilities of transition.
Graphs are mathematically represented as matrices to do calculations. These can include transition probability matrices, adjacency matrices, resistance/conductance matrices, Laplacian matrices, among others. In gdistance, we refer collectively to these matrices to represent graphs as "transition matrices". These transition matrices are the central object in the package; all distance calculations need one or more transition matrices as an input.
In gdistance, conductance rather than resistance values are expected in the transition matrix. An important advantage of using conductance is that it makes it possible to store values in computer memory very efficiently. Conductance matrices usually contain mainly zeros, because cells are connected only with adjacent cells, and the conductance for direct connections between remote cells is zero. This makes conductance matrices suitable to store in the memory-efficient sparse matrix format. Sparse matrices do not store zero values explicitly in computer memory; they just store the non-zero values and their respective row and column indices and assume that the other values are zero. Sparse matrices do not work for resistance matrices, however, as resistance is infinite (∞) between unconnected cells.
The calculation of the edge weights or conductance values is usually based on the values of the pair of grid cells to be connected. These cell values represent a property of the landscape. For instance, from a grid with altitude values, a value for the ease of walking can be calculated for each transition between cells. In gdistance, users define a function f (i, j) to calculate the transition value for each pair of adjacent cells i and j. With this approach, it is possible to create asymmetric matrices, in which the conductance from cell i to adjacent cell j is not always the same as the conductance from j back to i. This is relevant, among other things, for modeling travel in hilly terrain, as shown in Example 1 below (Section 9). On the same slope, a downslope traveler experiences less friction than an upslope traveler. In this case, the function to calculate conductance values is non-commutative: A problem that arises in grid-based modeling is the choice of weights that should be given to diagonal edges in proportion to orthogonal ones. For least-cost path distance and routes, this is fairly straightforward: weights are given in proportion to the distances between the cell centers. In a grid in which the orthogonal edges have a length of 1, the diagonal edges are √ 2 long. McRae (2006) applies this same idea to random walks. However, as Birch (2006) explains, for random walks this is generally not the best discrete approximation of the dispersal process in continuous space. Different orthogonal and diagonal weights could be considered based on his analytical results.
For random walks on longitude-latitude grids, there is an additional consideration to be made.
Considering the eight neighboring cells in a Moore's neighborhood, the three cells that are located nearer to the equator are larger in area than the three cells that are closer to the nearest pole, as the meridians converge when moving from the equator to either pole. So the cells closer to the poles should have a slightly lower probability of being reached during a random walk from the central cell. More theoretical work is needed to investigate possible solutions to this problem. For projected grids and small areas, we can safely ignore the surface distortion.
When the transition matrix has been constructed, different algorithms to calculate distances and routes are applied.
• The least-cost distance mimics route finding "as the wolf runs", taking into account obstacles and the local "friction" of the landscape. In gdistance the least-cost path between two cells on the grid and the associated distance is obtained with Dijkstra's algorithm (Dijkstra 1959).
• A second type of route-finding is the random walk, which has no predetermined destination (a "drunkard's walk"). Commute distance represents the random walk commute time, e.g., the average number of edges traversed during a random walk from an starting point on the graph to a destination point and back again to the starting point (Chandra, Raghavan, Ruzzo, Smolensky, and Tiwari 1996). Resistance distance reflects the average travel cost during this walk (McRae 2006). When taken on the same graph these two measures differ only in their scaling (Kivimäki, Shimbo, and Saerens 2014). Commute and resistance distances are calculated using the analogy with an electrical circuit (see Doyle and Snell 1984, for an introduction). The algorithm that gdistance uses to calculate commute distances was developed by Fouss, Pirotte, Renders, and Saerens (2007).
• A third type of route-finding is by randomized shortest paths, which are an intermediate form between shortest paths and Brownian random walks, introduced by Saerens, Yen, Fouss, and Achbany (2009). By setting a parameter, θ (theta), in the randomized shortest paths calculation, distances and routes can be made more randomized. A lower value of θ means that walkers explore more around the shortest path. When θ approaches zero, the randomized shortest paths approach a random walk. van Etten and Hijmans (2010) applied randomized shortest paths in geospatial analysis (and see Example 2 in Section 10 below).

Raster basics
Analyses in gdistance start with one or more rasters. For this, it relies on another R package, raster (Hijmans and van Etten 2016). The raster package provides comprehensive geographical grid functionality. Here, we briefly discuss this package, referring the reader to the documentation of raster itself for more information.
The following code shows how to create a raster object.
R> library("gdistance") R> set.seed ( The first line loads the package gdistance, which automatically loads the package raster as well. The second line creates a simple raster with 3 columns and 3 rows. The third line assigns the values 1 to 9 as the values of the cells. The resulting object is inspected in the fourth line. As can be seen in the output, the object does not only hold the cell values, but also holds metadata about the geographical properties of the raster.
It can also be seen that this is an object of the class RasterLayer. This class is for objects that hold only one layer of grid data. There are other classes which allow more than one layer of data: RasterStack and RasterBrick. Collectively, these classes are referred to as Raster*.
A class is a static entity designed to represent objects of a certain type using "slots", which each hold different information about the object. Both raster and gdistance use so-called S4 classes, a formal object-oriented system in R. An advantage of using classes is that data and metadata stay together and remain coherent. Consistent use of classes makes it more difficult to have contradictions in the information about an object. For example, changing the number of rows of a grid also has an effect on the total number of cells. Information about these two types of information of the same object could become contradictory if we were allowed to change one without adjusting the other. Classes make operations more rigid to avoid such contradictions. Operations that are geographically incorrect can also be detected in this way. For example, when the user tries to add the values of two rasters of different projections, the raster package will detect the difference and throw an error.
Classes also make it easier for the users to work with complex data and functions. Since so much information can be stored in a consistent way in objects and passed to functions, these functions need fewer options. Functions can deduce from the class of the object that is given to it, what it needs to do. The use of classes, if well done, tends to produce cleaner, more easily readable, and more consistent scripts.
One important thing to know about raster is how grid data are stored internally in Raster* objects. Consecutive cell numbers in rasters go from left to right and from top to bottom. The 3 × 3 raster we just created with its cell numbers is shown in Figure 2.

Transition* classes
As explained in Section 2 on the theory behind gdistance, transition matrices are the backbone of the package. The key classes in gdistance are TransitionLayer and TransitionStack.
Most functions in the package have an object of one of these classes as input or output.
Transition* objects can be constructed from an object of class Raster*. A Transition* object takes the necessary geographic references (projection, resolution, extent) from the original Raster* object. It also contains a matrix which represents a transition from one cell to another in the grid. Each row and column in the matrix represents a cell in the original Raster* object. Row 1 and column 1 in the transition matrix correspond to cell 1 in the original raster, row 2 and column 2 to cell 2, and so on. For example, the raster we just created would produce a 9 × 9 transition matrix with rows and columns numbered from 1 to 9 (see Figure 3 below).
The matrix is stored in a sparse format, as discussed in Section 2. The package gdistance makes use of sparse matrix classes and methods from the package Matrix, which gives access to fast procedures implemented in the C language (Bates and Maechler 2017).
The construction of a Transition* object from a Raster* object is straightforward. We can define an arbitrary function to calculate the conductance values from the values of each pair of cells to be connected.
In the following chunk of code, we use the RasterLayer that was created above. First, we set all its values to unit. The next line creates a TransitionLayer, setting the transition value between each pair of cells to the mean of the two cell values that are being connected. The directions argument is set to 8, which connects all cells to their 8 neighbors (Moore neighborhood).
R> r[] <-1 R> tr1 <-transition(r, transitionFunction = mean, directions = 8) If we inspect the object we created, we see that the resulting TransitionLayer object retains much information from the original RasterLayer object. To illustrate how to create an asymmetric matrix, we first create a non-commutative distance function, ncdf. We then use this function as an argument in the function transition. To make sure that the resulting transition matrix is indeed asymmetric, we set the symm argument in transition to FALSE. Note the difference between tr1 and tr2 in the slot "matrix class". This slot holds information about the matrix class as defined in the package Matrix (Bates and Maechler 2017). The class dsCMatrix is for matrices that are symmetric. The class dgCMatrix holds an asymmetric matrix.

R>
Different mathematical operations can be done with Transition* objects. This makes it possible to flexibly model different components of landscape friction.
Operations with more than one object require that the different objects have the same resolution and extent. Also, it is possible to extract and replace values in the matrix using indices, in a similar way to the use of indices with simple matrices. [ The functions adjacent (from raster) and adjacencyFromTransition (from gdistance) can be used to create indices. Example 1 below illustrates this (Section 9).
Some functions require that Transition* objects do not contain any isolated "clumps", islands that are not connected to the rest of the raster cells. This can be avoided when creating Transition* objects, for instance by giving conductance values between all adjacent cells a small minimum value. It can be checked visually if there are any clumps. 2 There are several ways to visualize a Transition* object. For the first method, the user can extract the transition matrix with function transitionMatrix. This gives a sparse matrix which can be vizualized with function image. This shows the rows and columns of the transition matrix and indicates which has a non-zero value, which represents a connection between cells (Figure 3).  R> plot(raster(tr3), main = "raster(tr3)", xlab = "Longitude (degrees)", + ylab = "Latitude (degrees)")

Correcting transition matrix values
The function transition calculates transition values based on the values of adjacent cells in the input raster. However, diagonal neighbors are more remote from each other than orthogonal neighbors. Also, on equirectangular (longitude-latitude) grids, West-East connections are longer at the equator and become shorter towards the poles, as the meridians approach each other. Therefore, the values in the matrix need to be corrected for these two types of distance distortion. Both types of distortion can be corrected by dividing each conductance matrix value by the distance between cell centers. This is what function geoCorrection does when type is set to "c".
R> tr1C <-geoCorrection(tr1, type = "c") R> tr2C <-geoCorrection(tr2, type = "c") However, as explained in Section 2 above, for commute distances (random walks) not only distance distortion plays a role, but also surface distortion. When type is set to "r" the function geoCorrection weights the probability of reaching an adjacent cell in a random walk by simply making it proportional to the surface covered by the cell. Computationally, the function corrects the surface distortion by multiplying the North-South transition values with the cosine of the average latitude of the two cell centers that are being connected.
In some cases, Transition* objects with equal resolution and extent need to be corrected many times. For example, determining the optimal landscape friction weights using a genetic algorithm involves repeating the same calculations with many transition matrices that only differ in their values, but not in their resolution or extent (van Etten and Hijmans 2010). In this case, computational effort can be reduced by preparing an object that only needs to be multiplied with the Transition* object to produce a corrected version of it. The following chunk of code is equivalent to the previous one.
R> CorrMatrix <-geoCorrection(tr3, type = "r", multpl = TRUE, scl = TRUE) R> tr3R <-tr3 * CorrMatrix Object CorrMatrix is only calculated once. It can be multiplied with Transition* objects, as long as they have the same extent, resolution, and directions of cell connections. We need to take special care that the geo-correction multiplication matrix (CorrMatrix) contains all non-zero values that are present in the Transition* object with which it will be multiplied (tr3 in this case).

Calculating distances
After obtaining the geographically corrected Transition* object, we can calculate distances between points. It is important to note that all distance functions require a Transition* object with conductance values, even though distances will be expressed in 1/conductance (friction or resistance) units (see Section 2 above).
To calculate distances, we need to have the coordinates of point locations. This is done by creating a two-column matrix of coordinates. Functions will also accept a SpatialPoints object or, if there is only one point, a vector of length two.
R> costDistance(tr3C, sP) The costDistance function relies on the package igraph (Csardi and Nepusz 2006) for the underlying calculation. It gives a symmetric or asymmetric distance matrix, depending on the TransitionLayer that is used as input.
Commute distance is the number of cells traversed during a random walk from a cell i on the grid to a cell j and back to i (Chandra et al. 1996).
rSPDistance gives the cost incurred during the same walk (θ approaches zero, so this is the cost incurred during a random walk, see Section 2). To obtain the commute costs we sum the corresponding off-diagonal elements: d ij + d ji . This is the distance of a commute from i to j and back from j to i. 3

Dispersal paths
To determine dispersal paths of a (constrained) random walk, we use the function passage. This function can be used for both random walks and randomized shortest paths. The function calculates the average number of passages through cells or connections between cells before arriving in the destination cell. Either the total or net number of passages can be calculated. The net number of passages is the number of passages that are not reciprocated by a passage in the opposite direction. In other words, this is the probability of the "last forward" passage going through a cell-to-cell connection (McRae et al. 2008). Figure 5 shows the net passages through each cell, assuming randomized shortest paths with the parameter θ set to 3.

Path overlap and non-overlap
One of the specific uses for which package gdistance was created, is to look at dispersal trajectories of organisms that expand their range coming from a single source (van Etten and Hijmans 2010). The degree of coincidence of two trajectories can be determined by calculating the minimum of the net passages of the two trajectories. With a formula presented in van Etten and Hijmans (2010), we can approximate the non-overlapping part of the trajectory. This is done in the following code.
R> r1 <-passage(tr3C, origin, sP[1,], theta = 1) R> r2 <-passage(tr3C, origin, sP[2,], theta = 1) R> rJoint <-min(r1, r2) R> rDiv <-max(max(r1, r2) * (1 -min(r1, r2)) -min(r1, r2), 0) The first two lines create two different trajectories, both coming from the same point of origin (O) and going to different end destinations (referred to as sP1 and sP2 in the following). The third line obtains the minimum probability as a measure of the overlap between the two trajectories. The resulting raster rJoint is visualized in Figure 6. The fourth line calculates the non-overlapping part of the trajectories with a more complicated formula. The result, rDiv, is shown in Figure 7. With the function pathInc we can calculate measures of path overlap and non-overlap for a large number of points. These measures can be used to predict patterns of diversity if these are due to dispersal from a single common source. If the argument type contains two or more elements, the result is a list of distances matrices. The default for type is to calculate joint and divergent length of the dispersal paths.

Example 1: Hiking around Maunga Whau
The previous examples were theoretical, based on randomly generated values. More realistic examples serve to illustrate the various uses that can be given to this package.
Determining the fastest route between two points in complex terrain is useful for hikers. Tobler's hiking function provides a rough estimate of the maximum hiking speed (s) given the slope of the terrain (m) (Tobler 1993). The maximum speed of off-path hiking (in m/s) is s = 6e −3.5|m+0.05| .
Note that the function is not symmetric around 0 (see Figure 8). Hikers walk fastest on gently downward slopes (m = −0.05), where they can walk faster than on flat terrain (m = 0). We use the hiking function to determine the shortest path to hike around the volcano Maunga Whau (Auckland, New Zealand). First, we read in the altitude data for the volcano. This is the R base dataset (see ?volcano), which has been geo-referenced using the information provided by T. Hengl at http://geomorphometry.org/content/volcano-maungawhau.
R> r <-raster(system.file("external/maungawhau.grd", package = "gdistance")) The hiking function requires the slope (m) as input, which can be calculated from the altitude (z) and distance between cell centers (d), The units of altitude and distance should be identical. Here, we use meters for both. First, we calculate the altitudinal differences between cells. Then we use the geoCorrection function to divide by the distance between cells.
R> altDiff <-function (x)  The transition function throws a warning, because a matrix with negative values cannot be used directly in distance calculations. Here this warning can be safely ignored, however, as the negative values are only present in intermediate steps. Subsequently, we calculate the speed. We need to exercise special care, because the matrix values between non-adjacent cells is 0, but the slope between these cells is not 0! Therefore, we need to restrict the calculation to adjacent cells. We do this by creating an index for adjacent cells (adj) with the function adjacent. Using this index, we extract and replace adjacent cells, without touching the other values.
R> adj <-adjacent(r, cells = 1:ncell(r), pairs = TRUE, directions = 8) R> speed <-slope R> speed[adj] <-6 * exp(-3.5 * abs(slope[adj] + 0.05)) Now we have calculated the speed of movement between adjacent cells. We are close to having the final conductance values. Attainable speed is a measure of the ease of crossing from one cell to another on the grid. However, we also need to take into account the distance between cell centers. Travelling with the same speed, a diagonal connection between cells takes longer to cross than a straight connection. Therefore, we use the function geoCorrection again.

R> Conductance <-geoCorrection(speed)
This gives our final "conductance" values. What do these values mean? The function geoCorrection divides the values in the matrix between the distance between cell centers. So, with our last command we calculated conductance (C) as follows: This looks a lot like a measure that we are more familiar with, travel time (t): In fact, the conductance values we have calculated are the reciprocal of travel time (1/t), Maximizing the reciprocal of travel time is exactly equivalent to minimizing travel time.
Distances calculated with this conductance matrix represent travel time according to the hiking function.
In the next step, we define two coordinates, A and B, and determine the paths between them. We test if the quickest path from A to B is the same as the quickest path from B back to A. The following code creates the shortest paths.

Example 2: Geographical genetics
A correlation between genetic differentiation and geographic distance of individuals and populations is expected due to a mechanism known as isolation by distance (Wright 1943). This correlation is expected when random, symmetric dispersal occurs in homogeneous geographic spaces. For random dispersal in heterogeneous landscapes, recent work has shown that genetic differentiation correlates with the resistance distance between their locations (McRae 2006). In this section, we look at human genetic diversity in Europe, using the data presented by Balaresque et al. (2010).
First, we read in the data: a map of Europe, the coordinates of the populations (see Figure 10) and mutual genetic distances (see ?genDist for more information on these data).
R> Europe <-raster(system.file("external/Europe.grd", + package = "gdistance")) R> Europe[is.na(Europe)] <-0 R> data("genDist") R> data("popCoord") R> pC <-as.matrix(popCoord[c("x", "y")]) Then we create three geographical distance matrices. The first corresponds to the great-circle distance between populations. The second is the least-cost distance between locations. Travel is restricted to the land mass. The third is the commute distance (using the same conductance matrix), which is related to effective resistance between points if we conceive of the grid as an electrical circuit (Chandra et al. 1996;McRae 2006).
[1] 0.5889319 R> cor(genDist, resDist) [1] 0.192114 Among the distance measures evaluated until now, the great-circle distance between points turns out to be the best predictor of genetic distance. The other distance measures incorporate more information about the geographic space in which geneflow takes place, but do not improve the prediction. It follows that prehistoric people in Europe did not move like wolves (least-cost distance) or drunkards (commute or resistance distance), but rather like crows (great-circle distance).
An important assumption behind these distance measures, however, is that dispersal is symmetric. This is often not the case. For example, diffusion from a single origin (Africa) explains much of the current geographical patterns of human genetic diversity (Ramachandran et al. 2005). As a result, the mutual genetic distance between a pair of humans from different parts from the globe depends on the extent they share their prehistoric migration history. Within Europe, genetic diversity is often thought to be a result of the migration of early Neolithic farmers from Anatolia (now part of modern Turkey) to the west.
How well does a geographic wave of expansion from Anatolia explain the spatial pattern? The function pathInc calculates the overlap (and non-overlap) of dispersal paths from a common origin on the grid as a distance measure between points.
R> origin <-unlist(popCoord[22, c("x", "y")]) R> pI <-pathInc(trC, origin = origin, from = pC, functions = list(overlap)) R> cor(genDist, pI[[1]]) [1] -0.7178576 At least at first sight, the overlap of dispersal routes explains the spatial pattern better than any of the previous measures. The negative sign of the last correlation coefficient was expected, as more overlap in routes is associated with lower genetic distance. Additional work would be needed to improve predictions and compare the different models more rigorously.

Future work
Improvements of gdistance and methodological refinements are expected in various areas.
• All measures based on random walks depend critically on solving sparse linear systems. This is the most time-consuming part of the calculations. Faster libraries could improve the gdistance package if they become available in R in the future.
• Research on distances in graph theory is a very dynamic field in the computational sciences. New measures and algorithms could be added to gdistance when they become available.
• More research on the consequences of connecting grids in different ways is necessary, as indicated in Section 2. This should bring more precision to random walk calculations in geospatial analysis. Comparing the results of grid-based calculations to continuous space simulations or analytical solutions would be the way forward (Birch 2006).

Final remarks
Questions about the use of gdistance can be posted on the R-SIG-Geo email list. Bug reports and requests for additional functionality can be mailed to jacobvanetten@yahoo.com.