OasisR: An R Package to Bring Some Order to the World of Segregation Measurement

Interest in social segregation measurement has increased strongly over the years, and the segregation indices proposed in the literature have grown in number and complexity. However, only a few software applications can be used to analyze social segregation, and these are usually available as plug-ins or packages for geographic information system (GIS) software or as limited stand-alone applications. Thus, a package that exploits the power and versatility of the R environment for statistical computing and graphics is desirable. Moreover, analysis of the segregation indices shows that there are ambiguities and errors in the literature and, consequently, in the available software applications. This is an even more important reason to develop a new tool to bring some order to the world of segregation measurement. This paper also contributes an automatic statistical testing methodology for these indices, based on several resampling techniques: randomization tests, bootstrap and jackknife.


Introduction
Segregation refers to the organizational (school, occupation, health, etc.) or spatial (residential) separation of social groups. Social segregation is an important issue in modern society because of its consequences for economic efficiency, social cohesion and equity. Over the past few decades, the political agendas in several countries have set objectives and introduced measures to promote socio-spatial diversity, and segregation analyses are being published in official reports and statistics (Iceland, Weinberg, and Steinmetz 2002; Maurin and Schneider 2015).
Despite the growing use of segregation indices and their increased complexity, few software applications are available to compute them.

Name (Language)    Integration    Indices    Authors
(AML-Splus)        ArcInfo 7      4          Wong and Chong (1998)
(Avenue)           ArcView 3.2    7          Wong (1996, 2003)

The Oasis web platform developed by Tivadar et al. (2014) is novel in that it requires no software installation: Requests are made through the web browser, and computation is carried out on remote servers. Users can run a broad segregation analysis of a territory (including auto-correlation indices, descriptive statistics and web mapping) using either a historical French database or their own data. Another original feature of Oasis is that it allows Monte Carlo simulations (permutation tests) to test the statistical significance of the indices. However, this tool has essentially the same advantages and disadvantages as stand-alone applications.
Other applications have been developed as statistical software packages, but offer only a small number of indices: e.g., Reardon and Firebaugh (2002)'s Stata (StataCorp 2017) module, and Hong and O'Sullivan (2018)'s R package seg. The R package seg differs in that it is able to compute more recent, surface-based measures, developed in response to the so-called modifiable areal unit problem. Like package seg, package OasisR (Tivadar 2019) is implemented in R, an open source software environment for statistical computing and graphics. The main advantage of these tools is that they are flexible and allow control over many input parameters. They also benefit from integration in statistical software, which allows further analysis without exporting results, integration into other software, automation, etc. Their main disadvantage is that they require basic programming knowledge in the statistical software.
The article is structured in two main sections. In Section 2, we provide a brief summary of segregation measurement, focusing on aspects that are relevant to the present work, i.e., definition and computation. Section 3 uses some practical examples to show how to use OasisR. The article ends with conclusions and further developments.

Segregation indices
The objective of this paper is not to provide a comprehensive analysis of segregation indices, but rather to unravel several errors and ambiguities, and to provide clearer definitions. The mathematical formulas are presented in Appendix A.
In line with most of the existing literature, we present the indices developed in OasisR following the five dimensions of segregation defined by Massey and Denton (1988). These dimensions are: evenness (population distribution across units), exposure (potential contact between individuals), concentration (space occupied by social groups), clustering (population concentration in contiguous spatial units) and centralization (spatial distribution around the area's center). This classification has many critics. According to some, the relevance of centralization has diminished due to the contemporary polycentric form of cities (Brown and Chung 2006), while others believe that the five dimensions could be reduced to a two-dimensional continuum: evenness-clustering and isolation-exposure (Reardon and O'Sullivan 2004), or evenness-concentration and clustering-exposure (Brown and Chung 2006), or separation-location (Johnston, Poulsen, and Forrest 2007). A distinction is made between one-group indices (which measure the segregation of a group compared to the rest of the population), between-group indices (which measure the segregation between pairs of groups), multi-group indices (which analyze the distribution of several population groups simultaneously) and social diversity indices (which can be understood as zonal multi-group or local indices).
We compare the results of OasisR with those from two other software implementations. The Geo-Segregation Analyzer (GSA version 1.1), developed by Apparicio et al. (2014), is the most complete automatic tool available so far. The application needs an input shapefile (data and maps), and produces 43 indices. The second tool is the R package seg, version 0.5-1, developed by Hong et al. (2014). It computes 11 indices, most available only for a population composed of two social groups. For comparison, we used hypothetical segregation patterns with two groups (Morrill 1991; Wong 1993; Hong et al. 2014). Theoretical examples are available in OasisR as a data object, adapted from Hong and O'Sullivan (2018). The space is represented by a 10×10 checkerboard, with different distributions of the two social groups in the area.

Evenness
Evenness refers to the distribution of different groups across spatial units and can be interpreted as a form of spatial inequality: The more uneven the group distribution compared to other social groups, the more segregated is that group. This is the reason why several evenness indices are based on a spatial form of the Lorenz inequality curve, also called segregation curve.

Standard evenness indices
Evenness indices were first introduced by Jahn et al. (1947) to measure "ecological" segregation between black and white populations. Duncan and Duncan (1955a) demonstrated the mathematical relationships between these indices, and provided a graphical interpretation. They showed that the information provided by previous indices could be derived from the dissimilarity index D k 1 k 2 and the social group proportions. The dissimilarity index can be interpreted as the share of a group k 1 that would have to move to achieve an even distribution compared to group k 2 . Similar to many other indices, the index was defined in the context of a two-group population (minority vs. majority). Duncan and Duncan (1955b) adapted the dissimilarity index to a one-group form, known as Duncan's segregation index IS k . It measures the dissimilarity between a group k and the rest of the population. In the case of two-group populations, the segregation and dissimilarity indices are identical. The Gorard segregation index GS k (Gorard and Taylor 2002) is a slightly different form which computes the dissimilarity between a group and the total population. Gorard's index has some disadvantages (the upper boundary is less than 1, the index is asymmetric) but has the advantage that, unlike dissimilarity-based indices, it is strongly composition invariant.
Another standard one-group index based on the segregation curve is G k , the spatial version of the Gini index (Gini 1921) adapted by Duncan and Duncan (1955a). We developed a between group form of Gini, G2 k 1 k 2 , by computing the index for a sub-population formed by two groups.
The Atkinson index A k (Atkinson 1970) was adapted to a segregation context by James and Taeuber (1985), with the mathematical formula corrected by Massey and Denton (1988). Compared to other indices based on segregation curves, the Atkinson index allows the researcher to decide the weights of the spatial units in different zones of the segregation curve by introducing an inequality aversion parameter δ with values between 0 and 1. If δ < 0.5, the spatial units where the minority is underrepresented compared to the average contribute more to the segregation. The reverse is valid for δ > 0.5.
The entropy index (or the information index) was proposed by Theil (Theil and Finizza 1971; Theil 1972) as an index of school segregation for a two-group population. It measures the departure from evenness as the population-weighted average deviation of each spatial unit from the area's entropy (or social diversity). In the case of a population with more than two groups, the local and area entropies are calculated for each group in turn, treating that group as the "minority" and the rest of the population as the "majority".
The results obtained using OasisR and GSA are identical apart from the entropy index. According to its definition, the entropy index for a two-group population should have the same value for both groups. If we take the example of complete segregation from the theoretical distributions, the index should be equal to 1 (as in OasisR), while GSA provides H 1 = 0.75 and H 2 = 0.25. With empirical data comprising several social groups, the two tools produce identical results. The Gorard index is computed only in OasisR, and in the case of the Atkinson index, the user is limited in GSA to three standard values of the inequality aversion parameter δ (0.1, 0.5 and 0.9). The seg package computes only the dissimilarity index for individual pairs of groups.
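The symmetry argument for the two-group entropy index can be checked with a few lines of code. The following is a minimal standalone sketch in Python, not OasisR code: it assumes the usual form of the index, in which the deviation of each unit's entropy from the area's entropy is weighted by the unit's population, and the function names and the four-unit data are hypothetical.

```python
import math

def entropy(p):
    """Two-group entropy for a minority share p (0 * log 0 is taken as 0)."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

def entropy_index(x, t):
    """Entropy (information) index for group counts x and unit totals t."""
    X, T = sum(x), sum(t)
    E = entropy(X / T)  # entropy of the whole area
    return sum(ti * (E - entropy(xi / ti)) for xi, ti in zip(x, t)) / (E * T)

# Hypothetical complete segregation: every unit holds a single group.
t = [100, 100, 100, 100]          # unit totals
x = [100, 100, 100, 0]            # group 1 counts
y = [ti - xi for xi, ti in zip(x, t)]  # group 2 counts

h1 = entropy_index(x, t)
h2 = entropy_index(y, t)

# A mixed pattern, to check symmetry away from the extreme case.
x_mixed = [70, 40, 20, 10]
y_mixed = [ti - xi for xi, ti in zip(x_mixed, t)]
h1_mixed = entropy_index(x_mixed, t)
h2_mixed = entropy_index(y_mixed, t)
```

Because the entropy of a share p equals the entropy of 1 - p, swapping the roles of the two groups leaves every term unchanged, which is why both groups receive the same value under this definition.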

Spatial evenness indices
Spatial evenness indices were developed by geographers in response to a major criticism of standard segregation indices: Although satisfactory for organizational segregation studies, they seem less appropriate in a geographical context where segregation is a "separation created by spatial structure" (Wong 1993, p. 559). For instance, if we make random permutations of the populations between spatial units, standard evenness indices do not change, but the social structure of the area is obviously different. Spatial evenness indices are based on the dissimilarity index, and were defined in the context of a two-group population. The first developments were made by Jakubs (1981) and Morgan (1983a) but they require complex linear programming methods. Morrill (1991) developed a contiguity modified dissimilarity index D k 1 k 2 (adj), where the probability of contact between groups is modeled via the contiguity matrix: Interactions between groups emerge if two spatial units are adjacent. In the case of D k 1 k 2 (adj) with a population formed by more than two groups, an ambiguity arises from the spatial interaction term: It is not sufficiently clear how the population proportions across spatial units should be computed. In the original paper, the author uses total populations t i while Wong and Chong (1998) present these totals specifically as the sum of two groups t k 1 k 2 i . This difference has consequences for the index if the population is composed of more than two groups. The only detailed generalized formula is that proposed by Apparicio et al. (2008) but it seems incorrect since the proportions are determined using the entire population. 
Instead, the partial total population should be used because Morrill's index is based on the dissimilarity index which compares the distributions between each pair of groups independently of the others, and it seems logical to assume that spatial potential interactions should also take account only of each pair of groups (see Appendix A).
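The consequence of the two choices for the population proportions can be illustrated numerically. The sketch below is standalone Python, not OasisR code. It assumes the common form of Morrill's index, the dissimilarity index minus the contiguity-weighted mean absolute difference in group proportions between units, and it uses a hypothetical four-unit area with three groups.

```python
def dissimilarity(x, y):
    """Duncan's dissimilarity index between two groups."""
    X, Y = sum(x), sum(y)
    return 0.5 * sum(abs(a / X - b / Y) for a, b in zip(x, y))

def morrill(x, y, contig, totals=None):
    """Contiguity-adjusted dissimilarity (assumed form).

    Proportions are x / (x + y) by default (pairwise totals); passing
    `totals` uses the whole unit population instead.
    """
    base = totals if totals is not None else [a + b for a, b in zip(x, y)]
    z = [a / b for a, b in zip(x, base)]
    n = len(x)
    interaction = sum(contig[i][j] * abs(z[i] - z[j])
                      for i in range(n) for j in range(n))
    return dissimilarity(x, y) - interaction / sum(map(sum, contig))

# Hypothetical data: four units in a row, three groups.
x = [80, 60, 20, 10]
y = [10, 30, 50, 70]
w = [10, 10, 30, 20]                      # third group, ignored by the pair (x, y)
t = [a + b + c for a, b, c in zip(x, y, w)]
contig = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]

pair_xy = morrill(x, y, contig)           # pairwise totals
pair_yx = morrill(y, x, contig)
full_xy = morrill(x, y, contig, totals=t) # whole-population totals
full_yx = morrill(y, x, contig, totals=t)
```

With pairwise totals the two proportions in each unit sum to 1, so the interaction term is the same whichever group comes first and the result is symmetric. With whole-population totals the third group breaks this identity, and the two orderings give different values.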
For the theoretical two-group distributions, OasisR, GSA and seg provide identical results. In the case of more than two groups, Morrill's index is computed only in OasisR and GSA. The results in GSA are incorrect since the spatial interaction term is based on population totals rather than the groups involved in the comparison. An empirical confirmation is provided by the fact that the result matrix is not symmetrical as it should be (the dissimilarity between two groups is by construction symmetrical). Apparicio et al. (2008) adapted the original index to construct Morrill's segregation index IS k (adj) (one-group version of the index). Similar to Duncan's dissimilarity and segregation indices, IS k (adj) can be interpreted as the dissimilarity between group k and the rest of the population, and the use of a group's proportion within the total population of each unit in the spatial interaction is correct (see Appendix A). Computation of the index gives the same results in OasisR and GSA and is not provided in seg.
One limitation of Morrill's indices is that they take account only of direct interactions between adjacent spatial units, and it would be interesting to expand these interactions further in space. One solution is to go beyond first-order contiguity by generalizing Morrill's indices to the kth-order contiguity D k 1 ,k 2 (Kadj). The contiguity order is considered to have a negative effect on spatial interactions. Generalized Morrill's indices are computed only in OasisR, and we propose two forms for the spatial function: negative exponential and reciprocal.
Wong (1993) developed two indices for a population with two groups: D k 1 k 2 (w), where spatial interactions between contiguous spatial units are proportional to the length of the shared boundary, and D k 1 k 2 (s), which also includes the perimeter/area ratio. In Wong's original paper, both indices have errors in their mathematical definition. The first error is the division by 2 of the spatial interaction term and the second is the row standardization of the spatial matrix. Hong (2014) developed scripts to present the R package seg and, similar to OasisR, he defines the spatial matrix using a global standardization. Wong and Chong (1998) presented improved versions of the index formulae, where the proportions in the spatial interaction effect are clearly defined, but the definition of the spatial interaction matrix seems incorrect since Wong and Chong (1998) use a double standardization of the spatial matrix (row standardization followed by overall standardization).
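The difference between row and global standardization is easy to see on a small matrix of shared boundary lengths. This is a standalone Python sketch with hypothetical function names and data. It illustrates the two standardizations only, not the full computation of Wong's indices.

```python
def row_standardize(b):
    """Divide each entry by its row sum, so every row sums to 1."""
    return [[v / sum(row) for v in row] for row in b]

def global_standardize(b):
    """Divide each entry by the grand total, so the whole matrix sums to 1."""
    total = sum(map(sum, b))
    return [[v / total for v in row] for row in b]

def is_symmetric(m, tol=1e-12):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))

# Hypothetical shared boundary lengths between three spatial units.
boundaries = [[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]]

w_row = row_standardize(boundaries)
w_global = global_standardize(boundaries)
```

Shared boundaries are symmetric by nature, and global standardization preserves that property. Row standardization rescales each unit by its own total boundary length, so the weight from unit i to unit j generally differs from the weight from j to i.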
Furthermore, there are ambiguities concerning the definition of each spatial unit's perimeter, necessary for the computation of D k 1 k 2 (s). To obtain the same results as in Wong (1993) and Hong et al. (2014), we need to use only the "internal" perimeter of each spatial unit, defined as the sum of the boundaries shared with other spatial units, and ignore the area's external borders. To overcome this issue, in OasisR the user can choose the perimeter definition.
For the theoretical two-group distribution, using the corrected mathematical formula (see Appendix A) and the "internal" definition of the perimeter, OasisR provides the same results as the seg package and Wong's original paper. The results of GSA are incorrect because the software uses the original, erroneous mathematical definitions. We also generalized Wong's indices to the case of more than two groups, and to a one-group form. Apparicio et al. (2008) define these indices mathematically, but these definitions have similar problems to Morrill's index generalization, and carry the errors from the original definition. Finally, seg and OasisR allow the user to compute a modified version of the index using their own definition of the spatial interaction matrix.

Exposure
Exposure measures the potential contact between members of the same group (isolation) or between members of different groups (interaction) as the probability that they live in the same spatial unit.

Standard exposure indices
The first exposure indices were developed by Shevky and Williams (1949), and were normalized and explained as a probabilistic model by Bell (1954). The notations of these indices were introduced by Lieberson (1981).
The isolation index xP x k is defined as the probability that a group member shares the same spatial unit with another member of the same group. Without the computational power of a computer, it was difficult to calculate the index according to its definition and Bell (1954) provided an approximate version of the index (see Appendix A for details). Presently, there are no particular reasons not to compute its exact value xP x k * , and in OasisR, the user has the possibility to choose between the two versions.
The isolation index can be adjusted to control for the effect of population composition, which has a strong effect on the index value. Bell (1954) also developed the normalized isolation index (an approximate version) which is equivalent to the correlation ratio Eta2 k (White 1986) and to the mean square contingency or phi square¹ for a dichotomous population (Duncan and Duncan 1955a). Since the index can be computed in different ways (Bell 1954; Coleman 1966; Zoloth 1976), debate emerged over its dimension and interpretation (James and Taeuber 1985; Massey and Denton 1988; Stearns and Logan 1986).
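The relationship between the approximate, exact and normalized isolation indices can be sketched as follows. This is standalone Python, not OasisR code. The approximate form and the normalization by the group's overall share follow the standard definitions. The exact form shown here (drawing the second person from the remaining residents of the unit) is an assumption about what "exact" means, and the data are hypothetical.

```python
def isolation(x, t, exact=False):
    """Isolation index for group counts x and unit totals t.

    exact=False: a group member meets a random resident of the same unit.
    exact=True:  assumed exact form, a random *other* resident of the unit
                 (requires every unit total to exceed 1).
    """
    X = sum(x)
    if exact:
        return sum(xi / X * (xi - 1) / (ti - 1) for xi, ti in zip(x, t))
    return sum(xi / X * xi / ti for xi, ti in zip(x, t))

def normalized_isolation(x, t):
    """Isolation adjusted for the group's share P of the total population."""
    P = sum(x) / sum(t)
    return (isolation(x, t) - P) / (1.0 - P)

# Hypothetical data: three spatial units.
x = [30, 5, 15]
t = [50, 50, 100]

approx = isolation(x, t)
exact = isolation(x, t, exact=True)
eta2 = normalized_isolation(x, t)
```

For these data the approximate index is 0.415 and the normalized index is 0.22. The assumed exact version is slightly lower, and the gap shrinks as unit populations grow.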
The interaction index xP y k 1 k 2 (Lieberson 1981) is a between-group segregation measure which computes the probability that a member of a group k 1 shares the same spatial unit with a member of group k 2 . Similar to the isolation index, we can compute its exact or approximate value. The results of all the approximate standard exposure indices are the same in OasisR, seg and GSA. The exact versions can be computed only in OasisR.

Spatial exposure indices
Morgan (1983b) developed two exposure indices that take explicit account of the distance between spatial units, which influences the potential contact between members of the same social group (distance-decay isolation index DP xx k ) or different groups (distance-decay interaction index DP xy k 1 k 2 ). The hypothesis is that people also come into contact outside of their own spatial units, and the number of potential contacts increases with distance, but their intensity decreases.
Similar to the other indices based on distance, the use of a gravity exponential function makes the result sensitive to the distance measure. There are some ambiguities about the definition of distance within a spatial unit since it could be null or a function of the spatial unit shape (area, perimeter). In OasisR, the user can choose between different definitions of the spatial matrix diagonal: null, 0.6 times the square root of the spatial unit's area, as proposed by White (1983), or a user-supplied matrix.
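Both the sensitivity to distance units and the role of the diagonal can be seen in a small numerical experiment. The sketch below is standalone Python. The function name, the 900 m distance and the 0.25 km² unit area are hypothetical, and the code illustrates the negative exponential transformation only, not a complete index.

```python
import math

def decay_weights(dist, unit_factor=1.0, within=0.0):
    """Return exp(-d) weights.

    unit_factor converts the input distances (e.g. 0.001 for m -> km);
    within is the distance used inside a spatial unit (the diagonal).
    """
    n = len(dist)
    return [[math.exp(-(within if i == j else dist[i][j]) * unit_factor)
             for j in range(n)] for i in range(n)]

# Two units whose centroids are 900 m apart.
dist_m = [[0.0, 900.0], [900.0, 0.0]]

w_m = decay_weights(dist_m)                     # metres: exp(-900) underflows to 0.0
w_km = decay_weights(dist_m, unit_factor=0.001) # kilometres: exp(-0.9)

# Non-null diagonal as a function of area: 0.6 * sqrt(area), area = 250000 m^2.
w_area = decay_weights(dist_m, unit_factor=0.001,
                       within=0.6 * math.sqrt(250000.0))
```

Measured in metres, the interaction between the two units is numerically zero, while in kilometres it is about 0.41. A null within-unit distance gives a diagonal weight of exp(0) = 1, the strongest possible interaction, and the area-based definition gives a slightly smaller value.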
In GSA, the distance within a spatial unit is considered null. Results for the linear definition of the distance are identical in GSA and OasisR. For the gravity version of the index, GSA results are incorrect: The diagonal of the distance matrix remains null after the exponential transformation, whereas it should be at its maximum level, exp(0) = 1 (highest spatial interaction). If we compute these indices for theoretical two-group patterns using the wrong null exponential diagonal, we obtain the same results as GSA. For more than two groups, the results provided by the two applications are different. OasisR and GSA provide metric conversion options (measure in and measure out), necessary for comparison between studies, and to avoid situations where indices cannot be computed because the numerical values of the negative exponential distance function rapidly approach zero. These indices are not computed in seg.
Reardon and O'Sullivan (2004) develop several spatial indices, including a spatial version of the exposure/isolation index. The spatial exposure index is computed as the average percentage of a group within the local environment of each member of another group. The spatial isolation of a group is simply the spatial exposure of a group to itself. In OasisR we used only the functions developed by Hong and O'Sullivan (2018) in the seg package, formatting the output as the other OasisR functions.

¹ Williams (1948) defined the mean square contingency or phi square as a conversion of chi square for a population with two groups into an index (from 0 to 1) by dividing it by the total population.

Clustering
In clustering, the more contiguous spatial units occupied by a group (forming an enclave in the area) the more segregated that group. There are arguments in the literature about the need for a separate dimension since modern evenness indices take explicit account of the phenomena of space and clustering (Reardon and O'Sullivan 2004). The distinction between evenness and spatial clustering might be just an artifact of the reliance on spatial subareas at some chosen geographical scale of aggregation (evenness at one level of aggregation is strongly related to clustering at a lower level of aggregation). For Brown and Chung (2006), clustering and exposure constitute a single dimension since high clustering is a manifestation of low exposure, and vice versa: If the members of a group are located close to each other, especially in a large cluster, their exposure to other groups will be reduced.

Proximity measures
The first proximity indices were introduced by White (1983) for two groups, and later generalized to more than two groups by White (1986): the mean proximity between the members of the same group P xx k (one-group index), the mean proximity between two different groups P xy k 1 k 2 (between-group index), and the mean proximity between persons in the area without regard to the group P oo (multi-group index). In the original papers, White proposed considering the distance within a spatial unit as non-null, and advised a function of the area (0.6 √ A), but a null diagonal of the distance matrix is most commonly used in the literature and in software packages (Apparicio et al. 2014; Tivadar et al. 2014). In OasisR, the user can choose between these options or supply their own value. These measures can be determined using a linear function of the distance, in which case the result represents the average distance between individuals (from the same or different groups). With a gravity form such as the exponential of the negative distance, the measure becomes an index. As for the other distance-based measures, the exponential function makes the result sensitive to the units in which distance is measured. Therefore, a metric converter is provided in OasisR.
Index                                        OasisR   GSA                          seg
One-group mean proximity                     ×        only null diagonal, errors   -
Between-group mean proximity                 ×        only null diagonal, errors   -
Multi-group mean proximity                   ×        -                            -
Multi-group mean proximity (between group)

By using a null distance within spatial units and a linear distance matrix in OasisR, the results for P xx k and P xy k 1 k 2 are the same as for GSA². As with the other distance-based indices, their gravity form is incorrect in GSA: The diagonal for the negative exponential distance matrix is null but should be equal to 1. If we use this incorrect spatial definition, we obtain the same results for P xx k but different ones for P xy k 1 k 2 . There is probably an additional error in the GSA computation, as the result matrix is not symmetric as it should be (the spatial proximity is identical if we permute the groups). Moreover, we tested the mathematical properties that these indices should respect for two-group populations (White 1983); they hold only for OasisR. The seg package does not compute these measures.
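The symmetry property mentioned above follows directly from the definition of the proximity measures. The following standalone Python sketch is not OasisR code. It assumes the usual form, in which the mean proximity is the average of the spatial interaction term over all pairs formed by one member of each group, and it uses three hypothetical units on a line.

```python
import math

def mean_proximity(a, b, c):
    """Average of c[i][j] over all pairs with one member of a and one of b."""
    n = len(a)
    pairs = sum(a[i] * b[j] * c[i][j] for i in range(n) for j in range(n))
    return pairs / (sum(a) * sum(b))

# Hypothetical data: three units on a line, one distance unit apart.
coords = [0.0, 1.0, 2.0]
c = [[math.exp(-abs(p - q)) for q in coords] for p in coords]  # gravity form
x = [80, 15, 5]
y = [10, 30, 60]

p_xx = mean_proximity(x, x, c)   # one-group proximity
p_xy = mean_proximity(x, y, c)   # between-group proximity
p_yx = mean_proximity(y, x, c)   # same pair of groups, permuted
```

As long as the interaction matrix is symmetric, permuting the two groups only reorders the terms of the double sum, so the between-group proximity must be the same in both directions.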
Based on proximity measures, White (1983) computes a segregation statistic called spatial proximity which is simply the average of one-group proximities, weighted by the fraction of each group in the population. The initial index was defined for the case of two groups which created certain ambiguities related to its nature (between group or multi-group index) since the result is the same SP = SP 1,2 . White (1986) generalized the index to more than two groups by using the multi-group form of the index but with an error in the mathematical definition since the populations are squared. In contrast, Apparicio et al. (2008) keep the between group definition if the population includes more than two groups, and compute the index ignoring the rest of the population. This means also that the mean proximity between persons regardless of group should have a between group form P oo k 1 k 2 , as used to compute SP k 1 k 2 . The spatial proximity index is computed using only the gravity form, but can also be used with linear distance in the opposite interpretation. If we compute the gravity form of the spatial proximity index using the wrong null diagonal, we obtain the same result as with the seg package (which computes only the multi-group form) and GSA (which provides only the between group version). The index can also be computed as the one-group version SP k , which compares the proximity among the members of a group P xx k and the average proximity of the population P oo.
Clustering indices
Massey and Denton (1988) propose two clustering measures. The absolute clustering index ACL k expresses the average number of members of a group in nearby spatial units as a proportion of the total population in those proximate spatial units. Spatial interactions can be computed using the contiguity matrix, although the ACL k index could produce negative values despite Massey and Denton (1988)'s claim that its values always range between 0 and 1. The problem arises from missing information about the particular form of the contiguity matrix: Contiguity between a spatial unit and itself should be equal to 1 (Konstantinidis and Townshend 1999). As in White (1983), the index also has a gravity form (exponential of the negative distance), and it is recommended to use a non-null diagonal of the spatial interaction matrix as a function of the area. Using the gravity form, the index is subject to the same issue of sensitivity to the distance measure.
GSA provides only the contiguity form of the index, and the results appear incorrect, independent of the contiguity matrix diagonal definition (0 or 1). We used a very simple case of a theoretical 2×2 grid, where the first cell is inhabited exclusively by the minority, and all other cells include only the majority. With a null contiguity matrix diagonal, the index has aberrant values (negative or greater than 1), and with a diagonal equal to 1, the index should be 0 for both groups, which is not the case for GSA.
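The 2×2 example can be reproduced with a short script. The sketch below is standalone Python, not OasisR code. It assumes the absolute clustering formula as it is usually written after Massey and Denton (1988), treats all four cells as mutually contiguous (they touch at least at a corner), and gives each cell a hypothetical population of 100.

```python
def acl(x, t, c):
    """Absolute clustering index (assumed Massey-Denton form)."""
    n, X = len(x), sum(x)
    expected = X / n ** 2 * sum(map(sum, c))
    own = sum(x[i] / X * sum(c[i][j] * x[j] for j in range(n)) for i in range(n))
    everyone = sum(x[i] / X * sum(c[i][j] * t[j] for j in range(n)) for i in range(n))
    return (own - expected) / (everyone - expected)

def full_contiguity(n, diagonal):
    """All units contiguous to each other; `diagonal` is the self-contiguity."""
    return [[diagonal if i == j else 1 for j in range(n)] for i in range(n)]

# 2x2 grid: first cell only minority, other cells only majority.
minority = [100, 0, 0, 0]
majority = [0, 100, 100, 100]
totals = [100, 100, 100, 100]

c_one = full_contiguity(4, diagonal=1)
c_null = full_contiguity(4, diagonal=0)

acl_min_one = acl(minority, totals, c_one)
acl_maj_one = acl(majority, totals, c_one)
acl_min_null = acl(minority, totals, c_null)
acl_maj_null = acl(majority, totals, c_null)
```

Under these assumptions the index is 0 for both groups when the diagonal equals 1, and -1/3 for both groups when the diagonal is null, which is outside the claimed range of 0 to 1.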
The relative clustering index RCL k 1 k 2 is a between-group index based on White's proximity measures which compares the average distance between the members of one group to the average distance between the members of another group. The index is computed using only the gravity form but we can easily adapt the index to linear distance. If we use the wrong null diagonal in the distance matrix for the exponential form of the index, the results in OasisR are similar to GSA.

Concentration
According to Massey and Denton (1988, p. 289) "the concentration refers to relative amount of physical space occupied by a group". The first index to measure spatial concentration is the Delta index ∆ k , proposed by Hoover (1941) and adapted by Duncan, Cuzzort, and Duncan (1961). This is a dissimilarity index between the distribution of a group and the distribution of available space. Massey and Denton (1988) developed an absolute concentration index ACO k , by comparing the average area inhabited by a group to the average land area they would inhabit under maximum spatial concentration (if they were all located in the smallest areal units). The relative concentration index RCO k 1 k 2 (Massey and Denton 1988) takes the ratio of one group concentration to another group concentration, and compares it to the maximum possible ratio that would be obtained if the first group was maximally concentrated and the second minimally concentrated. The index is standardized to obtain values between −1 and 1, but in contrast to what Massey and Denton (1988) claim, the index can be smaller than −1. Moreover, Egan, Anderton, and Weber (1998) identify several mathematical and conceptual problems with that index which is why RCO k 1 k 2 is no longer used in Census Bureau analyses (Iceland et al. 2002). In its mathematical formula, intermediary sums do not have the indices required to identify the maximum/minimum concentrations for each group which can lead to ambiguities. In the original paper, these parameters are presented as "defined as before", but they should differ from one group to another (see Appendix A).
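The Delta index is the simplest of these measures, and its interpretation as a dissimilarity between population and land can be sketched in a few lines. This is standalone Python with hypothetical data, assuming the usual form of the index (half the sum of absolute differences between population shares and area shares).

```python
def delta_index(x, areas):
    """Dissimilarity between a group's distribution and the distribution of land."""
    X, A = sum(x), sum(areas)
    return 0.5 * sum(abs(xi / X - ai / A) for xi, ai in zip(x, areas))

# Hypothetical data: three units of unequal size.
x = [50, 30, 20]          # group counts
areas = [1.0, 1.0, 2.0]   # land area of each unit

delta = delta_index(x, areas)

# A group spread in proportion to land area is not concentrated at all.
delta_even = delta_index([25, 25, 50], areas)
```

Here half of the group lives on a quarter of the land, and the index of 0.3 can be read as the share of the group that would have to move for its density to be uniform across units.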
It is impossible to compute concentration indices for the theoretical distributions (the denominator is null since spatial units have the same size). Thus, we use empirical examples for comparisons between OasisR and GSA. For the one-group indices (∆ k and ACO k ) the results are identical, while for relative concentration only some of the results are the same. There is an error in GSA: Its result matrix is symmetric, which it should not be.

Centralization
According to Massey and Denton (1988), centralization is the degree to which a group is spatially located near the center of an area. The first true centralization index was developed by Duncan and Duncan (1955b) to introduce some spatiality into segregation measurement. The index was presented as a one-group index, so we describe it as Duncan's absolute centralization index. The literature uses the relative centralization index RCE k 1 k 2 , adapted and proposed by Massey and Denton (1988)³, which measures the extent of one group's centralization relative to another. These centralization indices are particular forms of the Gini index, and measure the localization unevenness of two groups around a specific point (the center) by ordering spatial units according to their distance from the center. Massey and Denton (1988) introduced an absolute centralization index ACE k which compares the spatial distribution of a group to the distribution of available land around the area's center. Since the computation of ACE k needs information on area, the results can sometimes contradict RCE k 1 k 2 . For this reason, in OasisR, we also compute the mathematical adaptation of RCE k 1 k 2 , to correspond to Duncan and Duncan (1955b)'s original description (DCE k ). With the exception of Duncan's centralization index, provided only by OasisR, the results of the other indices are similar to GSA.
One of the reasons why centralization lost popularity in the literature is that this dimension has little meaning in relation to increasingly polycentric and sprawling modern cities. To resolve this issue, we adapt the centralization indices to a polycentric spatial configuration. The option retained is to compute the distance between spatial units and each center, and to take account only of the distance to the closest center. This method is implemented in OasisR by the RCEPoly, ACEPoly, and ACEDuncanPoly functions. Following Folch and Rey (2016), we can spatially limit the effect of centrality. We consider two options: defining a parameter k as the number of nearest neighbors affected by each center, or choosing the distance of influence k dist . The constrained version of the index can be computed only for the indices developed by Duncan and Duncan (1955b) (RCEPolyK and ACEDuncanPolyK).
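The polycentric adaptation rests on one simple step: replacing the distance to a single center by the distance to the closest of several centers. The sketch below is standalone Python and does not reproduce the RCEPoly family of functions. All names and coordinates are hypothetical, and the distance threshold only illustrates the idea of a limited zone of influence.

```python
import math

def nearest_center_distance(units, centers):
    """Distance from each unit centroid to its closest center."""
    return [min(math.dist(u, c) for c in centers) for u in units]

# Hypothetical centroids along a corridor with a center at each end.
units = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]

d = nearest_center_distance(units, centers)

# Units ordered from most to least central, as a Gini-type index requires.
order = sorted(range(len(units)), key=lambda i: d[i])

# Constrained variant: keep only units within a chosen distance of influence.
max_dist = 1.0
influenced = [di <= max_dist for di in d]
```

With a single center at the origin the last unit would be the most peripheral. With two centers it is as central as the first, and only the middle unit falls outside the zone of influence.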

Social diversity indices
Diversity indices measure the level of social diversity in an area, without taking account of the spatial distribution of different groups. The Shannon-Wiener index H SW (Shannon 1948) is based on the entropy concept and measures the heterogeneity of a population from perfect homogeneity (0) to maximum heterogeneity (natural logarithm of the number of groups). The normalized version H SW is obtained by dividing it by its maximum. Simpson's interaction index I S measures the probability that individuals selected randomly from the area (regardless of their location) do not belong to the same social group (Simpson 1949). These indices are available only in OasisR.
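The bounds stated above can be verified directly. This is a standalone Python sketch with hypothetical group counts. It assumes the standard definitions: entropy with natural logarithms for Shannon-Wiener, and one minus the sum of squared group shares for Simpson's interaction index (two individuals drawn with replacement).

```python
import math

def shannon_wiener(counts, normalized=False):
    """Entropy of the group shares; optionally divided by its maximum ln(K)."""
    total = sum(counts)
    h = -sum(c / total * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(counts)) if normalized else h

def simpson_interaction(counts):
    """Probability that two randomly drawn individuals belong to different groups."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

groups = [50, 30, 20]                  # hypothetical counts for three groups

h = shannon_wiener(groups)
h_norm = shannon_wiener(groups, normalized=True)
simpson = simpson_interaction(groups)

h_equal = shannon_wiener([10, 10, 10], normalized=True)   # maximum heterogeneity
h_single = shannon_wiener([30, 0, 0])                     # perfect homogeneity
```

Equal group sizes give a normalized value of 1, a population made of a single group gives 0, and the example area lies in between with a Simpson index of 0.62.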

Multi-group indices
We treat multi-group segregation indices separately since they appeared later in the literature, and only some can be attributed to the standard dimensions of segregation. These indices analyze the distribution of several population groups simultaneously. Reardon and O'Sullivan (2004) define a general approach to measuring spatial multi-group segregation for several indices: the multi-group normalized exposure index P* (James and Taeuber 1985; Reardon and Firebaugh 2002) and a set of general multi-group spatial/clustering indices: the multi-group information theory index H* (Theil 1972; Reardon and Firebaugh 2002), the multi-group relative diversity index RD* (Carlson 1992; Reardon 1998), and the multi-group dissimilarity index D* (Morgan 1975; Sakoda 1981). Other multi-group indices provided in OasisR are the multi-group Gini index G* (Reardon 1998) and the squared coefficient of variation C* (Reardon and Firebaugh 2002). Spatial versions of certain multi-group indices (dissimilarity, information theory and relative diversity) were developed by Reardon and O'Sullivan (2004). For these spatial versions of multi-group indices, we formatted only the output of an existing function in the seg package. The results obtained in OasisR, GSA and seg are identical where the indices are available.
In addition to the previous measures, we developed functions to compute two specific types of multi-group segregation indices. Using the variation ratio approach described in Reardon and Firebaugh (2002), Reardon (2009) proposes four indices adapted to the particular case of groups defined by ordered categories: the ordinal information theory index, the ordinal variation ratio index, the ordinal square root index, and the ordinal absolute difference index. Reardon, Firebaugh, O'Sullivan, and Matthews (2006), and then Reardon (2011) and Reardon and Bischoff (2011), developed rank-order indices (rank-order information theory index, rank-order variation ratio index, and rank-order square root index), adapted from the ordinal methodology, to analyze segregation using a continuous variable (but not necessarily one that is interval-scaled) such as income. In practice, data on income distribution are available in classes ordered by income thresholds. Empirically, the methodology includes the following steps: first, for each threshold, we compute the corresponding segregation index (information theory, variation ratio, or square root index) between those above and those below the income threshold; second, we fit a polynomial regression model to approximate the information theory/variation ratio/square root function; third, we use the model's estimated coefficients to compute an estimate of the rank-order indices. All these outputs are included in OasisR.
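The three steps can be sketched in base R for the information theory version. The income table below is invented, the helper names are hypothetical, and the final step integrates the fitted polynomial numerically with the entropy weights of Reardon (2011) instead of using the closed-form coefficient weights. The regression is also unweighted, so this is a simplified illustration and not the computation carried out by rankorderseg.

```r
# Binary entropy in bits, with the convention 0 * log2(0) = 0
entropy2 <- function(p) {
  ifelse(p <= 0 | p >= 1, 0, -(p * log2(p) + (1 - p) * log2(1 - p)))
}

# Information theory index between those below and those above a threshold
binary_H <- function(below, total) {
  P <- sum(below) / sum(total)
  1 - sum(total * entropy2(below / total)) / (sum(total) * entropy2(P))
}

# Invented data: 4 spatial units (rows) by 4 ordered income classes (columns)
counts <- rbind(c(40, 30, 20, 10), c(25, 25, 25, 25),
                c(10, 20, 30, 40), c(25, 25, 25, 25))
total <- rowSums(counts)
cum <- t(apply(counts, 1, cumsum))  # population below each class boundary
n_thr <- ncol(counts) - 1

# Step 1: index and percentile rank at each income threshold
p_thr <- colSums(cum)[seq_len(n_thr)] / sum(counts)
H_thr <- sapply(seq_len(n_thr), function(k) binary_H(cum[, k], total))

# Step 2: least-squares polynomial approximation of H as a function of p
polorder <- 2
b <- qr.solve(outer(p_thr, 0:polorder, "^"), H_thr)
H_hat <- function(p) as.vector(outer(p, 0:polorder, "^") %*% b)

# Step 3: rank-order index as an entropy-weighted average of the fitted curve
H_rank <- integrate(function(p) 2 * log(2) * entropy2(p) * H_hat(p), 0, 1)$value
```

The weights 2 ln(2) E(p) integrate to one over the unit interval, so H_rank stays within the range of the fitted curve and gives most weight to thresholds near the median.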

Local indices
Local indices can be mapped, which allows us to identify spatial patterns in the area. First, the location quotient LQ k i (Isard 1960) identifies spatial units where a group is over-represented or under-represented. Moreover, the social diversity indices can be computed at the local level. The local entropy index H2 i, which is equivalent to H SW (Theil 1972; Theil and Finizza 1971), measures social diversity within each spatial unit (H2 i = 0 for a homogeneous population and H2 i = 1 for maximal diversity, when all groups are equal in size). The results are identical in OasisR and GSA. Shannon's diversity index and Simpson's interaction index can also be adapted to the local level; these local versions are available only in OasisR. GSA provides Poulsen's typology Forrest 2010, 2011) which is not developed in OasisR.
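A short base R sketch shows how the two local measures can be obtained from a distribution table. The data are invented, and the normalization of the local entropy by the logarithm of the number of groups is an assumption made so that the maximum equals 1, as stated above.

```r
# Illustrative table: 3 spatial units (rows) and 2 groups (columns)
x <- cbind(minority = c(100, 50, 0),
           majority = c(0, 50, 100))

t_i <- rowSums(x)         # total population of each unit
p_local <- x / t_i        # group proportions within each unit
P <- colSums(x) / sum(x)  # group proportions in the whole area

# Location quotient: local proportion relative to the area proportion
# (values above 1 indicate over-representation, below 1 under-representation)
LQ <- sweep(p_local, 2, P, "/")

# Local entropy, normalized by its maximum log(number of groups)
local_entropy <- apply(p_local, 1, function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}) / log(ncol(x))
```

Both results have one value per spatial unit, so they can be joined to the unit geometries and mapped. Units with zero population would produce undefined proportions and need to be handled separately.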

Resampling tests
In contrast to other tools, OasisR offers functions that allow the statistical testing of indices using resampling techniques (randomization tests, bootstrap and jackknife) based on individual or unit sampling.

Random population distribution as comparison
The idea that segregation should be analyzed as the deviation from a random pattern rather than from a complete theoretical desegregation was introduced at the end of the 1970s by Cortese et al. (1976) and Winship (1977). Winship (1977) considers the binomial distribution as a natural model for random segregation, while Cortese et al. (1976) and Falk et al. (1978) suggest using a hypergeometric distribution. Under these statistical hypotheses, it is possible to parameterize the distribution of Duncan's dissimilarity index, but a generalization to other segregation indices is not feasible. In a more recent paper, Ransom (2000) examines the sampling distributions of the dissimilarity and Gini indices by deriving their exact sampling distributions and developing asymptotic inference procedures. Allen, Burgess, Davidson, and Windmeijer (2015) developed this framework further, and showed that the use of bootstrap methods can improve test procedures.
Resampling methods are valid, nonparametric alternatives to conventional inferential statistics. These methods are particularly interesting in the context of segregation analysis, because the data used often are a sample of the total population, and even if the analyst has data on the entire population, there is a risk of data collection and manipulation errors. Resampling allows us to create simulated distributions of the indices as the basis for testing different null hypotheses which depend on the resampling technique and the sampling unit (individual or spatial unit).

Randomization tests
Permutation tests (also called randomization tests or exact tests) are statistical significance tests in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the statistic under rearrangements of the labels on the observed data. If the number of possible combinations is too high, we can use asymptotically equivalent tests such as Monte Carlo permutation (or approximate permutation or random permutation) tests. First, we generate random population distributions, and for each replicate we compute the index which gives us the simulated reference distribution. This allows us to compute a pseudo p value that is equal to 1 minus the relative rank of the index in the reference distribution (Anselin 2003) which can be interpreted as a statistical significance test of the hypothesis that the segregation index is the result of random processes.
In the case of individual sampling units, each individual makes a random draw without replacement among spatial units, according to a probability vector. Location probability can be identical or constrained, e.g., by unit population (Cortese et al. 1976), or by area (Tivadar et al. 2014). As in Findlay and Findlay (1984) and Carrington and Troske (1997), we generate random localization patterns by resampling directly from the original data (aggregation of individual independent random draws) instead of sampling with theoretical distribution (Boisso et al. 1994; Tivadar et al. 2014). If we set sampling on spatial units (Feitosa et al. 2007; Tivadar et al. 2014), we have a similar framework to the permutation test developed in spatial auto-correlation analysis (Anselin 1995). Random localization is obtained from permutations of entire populations among spatial units, which allows us to test the significance of the spatial component of the index.
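The Monte Carlo procedure with individual sampling can be sketched in base R as follows. The data, the helper dissim (Duncan's dissimilarity index) and the equiprobable location vector are assumptions made for illustration. The pseudo p value uses the common convention (1 + number of simulated values at least as large as the observed one) / (nsim + 1), which is one way of expressing the relative rank. The individual draws are aggregated with rmultinom.

```r
set.seed(1)

# Duncan's dissimilarity index between the two columns of a units x 2 table
dissim <- function(x) {
  0.5 * sum(abs(x[, 1] / sum(x[, 1]) - x[, 2] / sum(x[, 2])))
}

# 100 units of 100 people; the minority is concentrated in the first 20 units
x <- cbind(minority = rep(c(100, 0), c(20, 80)),
           majority = rep(c(0, 100), c(20, 80)))
obs <- dissim(x)

nsim <- 99
n_units <- nrow(x)
prob <- rep(1 / n_units, n_units)  # equiprobable location (assumption)

# Individual sampling: every member of each group draws a unit independently
sim_ind <- replicate(nsim, {
  xs <- cbind(rmultinom(1, sum(x[, 1]), prob),
              rmultinom(1, sum(x[, 2]), prob))
  dissim(xs)
})

# Pseudo p value of the observed index in the simulated reference distribution
p_ind <- (1 + sum(sim_ind >= obs)) / (nsim + 1)

# Unit sampling: permute whole unit populations among locations
x_perm <- x[sample(n_units), ]
```

An aspatial index such as dissim takes the same value on x_perm as on x, which is why unit permutations are informative only for indices with a spatial component.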

Bootstrapping
Because segregation measures are often computed from sample data, the distributional information required to test hypotheses can be obtained by the bootstrap method (Efron 1979), which estimates the distribution of a statistic by resampling with replacement from the data set. This distribution can be examined to assess how probable the value implied by the null hypothesis is. Thus, bootstrap techniques allow us to construct confidence intervals around the original point estimate, using Efron's percentile method (Efron 1979).
If we consider individual sampling, the bootstrap distributions are constructed by resampling directly from the original data; the alternative is to use draws from a theoretical distribution, as in Boisso et al. (1994). The method is especially appropriate if the initial data are based on a population sample. If the initial data are based on a sample of units, then bootstrap sampling should be applied at the unit level. This technique can be employed for spatial segregation analysis (Lee et al. 2015), but seems more appropriate for the analysis of organizational segregation (Carrington and Troske 1997).
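A minimal base R sketch of the unit-level bootstrap with a percentile interval is given below. The simulated table, the number of replications and the 95% level are assumptions chosen for illustration, and dissim is a helper defined here, not an OasisR function.

```r
set.seed(42)

# Duncan's dissimilarity index between the two columns of a units x 2 table
dissim <- function(x) {
  0.5 * sum(abs(x[, 1] / sum(x[, 1]) - x[, 2] / sum(x[, 2])))
}

# Invented data: 50 units with random minority and majority counts
x <- cbind(minority = rpois(50, 30), majority = rpois(50, 70))
obs <- dissim(x)

# Unit sampling: draw units with replacement and recompute the index
nboot <- 999
boot_D <- replicate(nboot, {
  idx <- sample(nrow(x), replace = TRUE)
  dissim(x[idx, , drop = FALSE])
})

# Percentile confidence interval around the point estimate
ci <- quantile(boot_D, probs = c(0.025, 0.975))
```

For individual sampling, the resampled objects would be the individuals themselves, for instance by redrawing each unit's group composition from the observed counts, while the rest of the procedure stays the same.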

Jackknife
Although the jackknife predates the bootstrap (Quenouille 1956; Tukey 1958), the two methods are similar, and the jackknife is used in statistical inference mainly to estimate the bias and standard error (variance) of a statistic. The simulated index distribution is obtained by systematically recomputing the statistic, excluding one or more observations at a time from the sample set. Jackknife with individual sampling seems less useful in the context of segregation analysis, but the method is particularly interesting in the case of unit sampling because it allows detection of replicates that represent outliers. If a significantly low outlier is found in the index reference distribution, this means that without the corresponding spatial unit the segregation level would be significantly lower, which implies that this unit plays a significant role in segregation. To our knowledge, the jackknife technique has been used in the segregation literature only by Massey (1978), in order to estimate dissimilarity index variances. OasisR provides several automatic standard outlier detection techniques: the boxplot and standard deviation methods, different score methods (normal, Student's t and chi-squared scores), and the median absolute deviation method, based on functions developed in the outliers package (Komsta 2011).
Lee et al. (2015) proposed a new method for estimating the dissimilarity index and quantifying its uncertainty, based on a Bayesian hierarchical modeling approach (with inference based on Markov chain Monte Carlo simulation). The authors consider two distinct models: a globally smooth model (a binomial generalized linear mixed model, where the set of random effects is spatially auto-correlated) and a locally smooth model which allows geographically adjacent areal units to have very similar or very different minority proportions. In both cases an estimate and a 95% credible interval for the dissimilarity index can be obtained by computing the posterior predictive distribution of the index. The model was implemented using the CARBayes package (Lee 2013) in the R software environment. This methodology represents an opportunity for future developments applying the Bayesian spatial modeling approach to other segregation measures.
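Returning to the jackknife with unit sampling, the leave-one-unit-out logic and the boxplot rule can be sketched in base R. The data are invented so that a single unit drives the segregation level, the multiplier 1.5 is the conventional boxplot threshold, and dissim is again a helper defined for this illustration and not part of the package.

```r
# Duncan's dissimilarity index between the two columns of a units x 2 table
dissim <- function(x) {
  0.5 * sum(abs(x[, 1] / sum(x[, 1]) - x[, 2] / sum(x[, 2])))
}

# 30 identical mixed units plus one unit holding a large share of the minority
x <- cbind(minority = c(rep(20, 30), 400),
           majority = c(rep(80, 30), 0))
obs <- dissim(x)

# Leave-one-unit-out replicates of the index
jack <- sapply(seq_len(nrow(x)), function(i) dissim(x[-i, , drop = FALSE]))

# Boxplot rule: flag replicates below Q1 - 1.5 * IQR
q <- unname(quantile(jack, c(0.25, 0.75)))
iqr <- q[2] - q[1]
low_out <- which(jack < q[1] - 1.5 * iqr)

# Jackknife standard error of the index
n <- nrow(x)
se_jack <- sqrt((n - 1) / n * sum((jack - mean(jack))^2))
```

Here removing the last unit makes the remaining units identical in composition, so the index drops to zero and that replicate is flagged as a low outlier. Because all other replicates coincide, the interquartile range is zero in this deliberately extreme example.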

Using OasisR
The R functions developed were designed to address many different situations in the easiest way possible. For all OasisR aspatial segregation functions, the only input required is the population distribution table within units. The table should not include row totals (unit total populations) which could be interpreted as a supplementary social group. For spatial indices, a second necessary input is spatial data, which can be provided in three ways (see the example below). Many functions have specific parameters, but their input is not obligatory since by adopting their default values, functions compute the usual form of indices. For a detailed description of the parameters, see the OasisR manual (Tivadar 2019). Generally, the output of segregation functions is a numeric value for multi-group indices, a vector of each group's index value for one-group indices, and a numeric matrix for between group indices. Values are rounded to four digits.
As support, we use a very simple theoretical example based on a 10×10 grid, which is used in many studies (Morrill 1991; Wong 1993; Lee et al. 2015; Hong et al. 2014); the data are provided by the package. From the various available distributions (Hong 2014), we chose two situations: one with complete segregation of the minority, and one with a particular mix of social groups. The population in the dark gray cells is formed only of minority members, in the white cells it is formed only of majority members, and in the light gray cells it contains a mix of the two groups. In each cell, we consider that the total population is 100 individuals. For more complex examples, see Appendix B.
To compute an aspatial segregation index, the script is basic since we only need to provide the function name and a distribution table.
In the case of spatial indices, there are three ways to introduce spatial data in OasisR. The first solution is to provide a spatial R object, using the spatobj argument. It is also possible to import a shapefile, by using two arguments: (1) folder, to provide the path where the shapefile is located on the drive, and (2) shape, the name of the shapefile (without an extension). The shape import uses the readOGR function in the rgdal package (Bivand, Keitt, and Rowlingson 2019). Finally, spatial information can be provided directly as vectors/matrices of contiguity, common boundaries, areas, distances, etc. This spatial information can be computed within OasisR because the package includes geographical functions based on the spdep (Bivand, Pebesma, and Gómez-Rubio 2013; Bivand 2019) and rgeos (Bivand and Rundel 2019) packages, with appropriate output for the segregation functions.
R> foldername <- system.file("extdata", package = "OasisR")
R> shapename <- "segdata"
R> areavector <- area(segdata)
R> Delta(A, spatobj = segdata)
Certain proximity measures can be computed as multi-group, between-group or one-group indices. The user can choose the index type via the argument itype. In the case of the exposure indices xPx (for group k) and xPy (for groups k1 and k2), the logical argument exact determines whether the indices are computed using the approximate or the exact definition. The functions spatmultiseg and rankorderseg have several arguments specific to the seg package. For rank-order measures, polorder gives the order of the polynomial regression model.
Spatial segregation functions based on distance have particular arguments. Spatial interactions can be defined via the fdist argument: "l" for the linear and "e" for the exponential inverse function of distance. Other distance functions can be used by introducing a user distance matrix in the R functions, and by setting a linear function. Metric conversions are based on the conv_unit function in the birk package (Birk 2016), with distin and distout the respective arguments for the input and output measures. Argument diagval defines the distance within a spatial unit: "0" for the null diagonal and "a" for White's formula (0.6 times the square root of the area). Other versions can be used by introducing a user distance matrix in the function. Examples of how to use these functions are provided in Appendix B.
Indices based on the contiguity matrix have a supplementary logical argument queen, to choose the criterion used for contiguity matrix computation: TRUE for queen and FALSE for rook (by default). For centralization indices, the user must introduce the argument center, which is the number of the spatial unit in the table representing the area's center. For polycentric versions, the input must be a vector.
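The role of the within-unit distance can be illustrated with a user-built distance matrix in base R. The 3×3 grid of unit squares is invented, and the exponential weights at the end are only an assumed reading of the "e" option of fdist. This is not a reproduction of the package's internal computation.

```r
# Centroids and areas of a 3 x 3 grid of unit squares (illustrative data)
grid <- expand.grid(x = 1:3 - 0.5, y = 1:3 - 0.5)
area <- rep(1, nrow(grid))

# Euclidean distances between centroids; the diagonal is zero at this stage,
# which corresponds to the "0" choice described above
d <- as.matrix(dist(grid))

# White's formula for the distance within a unit: 0.6 * sqrt(area)
diag(d) <- 0.6 * sqrt(area)

# Assumed form of an exponential inverse function of distance
w_exp <- exp(-d)
```

A matrix built in this way, or with any other within-unit distance rule, is the kind of user distance matrix that can be passed to the distance-based functions.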
For measures based on the generalized contiguity matrix (K-order matrix), two arguments can shape spatial interactions: argument K represents the order of the contiguity matrix (equal to 2 by default), and argument f designates the function used for the distance decay effect, the negative exponential (by default) or reciprocal function. Argument ptype determines whether Wong's indices are computed using only internal boundaries (ptype = "int") or all the borders of the spatial units (ptype = "all"). For the absolute clustering index ACL, it is possible to define the spatial interactions matrix that will be used, based on the spatmat argument: "c" for contiguity matrix (by default) and "d" for the distance matrix.

With the help of the function ResampleTest, the user can conduct all the statistical tests based on sampling described in the previous section. The main inputs of the function are the population distribution table x, the name of the function to be tested fun, the simulation type simtype ("Boot" to generate bootstrap replications, "Jack" to generate jackknife replications and "MonteCarlo" for a randomization test using Monte Carlo simulations), the number of simulations nsim (equal to 99 by default), and the sampling unit sampleunit ("unit" when the sampling is based on spatial/organizational units and "ind" for individual sampling). In the bootstrapping technique, the argument perc is a vector with the percentiles to be displayed in the output, and the argument samplesize gives the size of the sample used for bootstrapping. If null, the sample size equals the number of spatial units (in the case of unit sampling), or the total population (in the case of individual sampling). For jackknife simulations, there are two specific arguments. When the argument outl is TRUE, the function provides the outliers obtained by jackknife iterations. Argument outmeth defines the outlier detection method: boxplot, standard deviation, normal scores, Student's t scores, chi-squared scores and median absolute deviation. Estimations based on scoring methods are obtained from the outliers package. When outliers are sought, the argument sdtimes is the multiplication factor of the standard deviation used to detect them, and QRrange determines the boxplot thresholds as a multiple of the IQR (interquartile range). The argument proba is used for random location processes that are not equiprobable (a vector of probabilities should be provided). If the jackknife technique is employed, proba indicates the probability (confidence interval) for scoring tests.
If the argument setseed is set to TRUE, a zero seed is set for the random number generator, which is useful to have replicable simulations. In addition, specific arguments such as geographical data and other arguments presented above, should be introduced to allow the segregation function to be tested.
The ResampleTest output is a list of several objects: index name, simulation type, summary statistics of the simulations, simulated values of the index, and simulated population distribution. If outlier detection is used, additional objects are included: the outliers matrix, and the outlier values as a list and a plot. The ResampleTest output can be used by the ResamplePlot function to plot the main results. Certain additional graphic arguments can be used to customize the output: the colors and the legend (position, format and character size). Here we provide a simple script to test the spatial component of Morrill's index; for more examples, see Appendix B.

Conclusions
OasisR is a package implemented in the R software which allows computation of many segregation indices. It was designed to respond to a range of applications in the easiest way possible. This package has the advantage that it is implemented in R which allows total control of the input arguments, most of which have default values that correspond to the standard use of indices. This feature enables less experienced users to conduct segregation analysis with ease. Moreover, there is the possibility to develop further analysis within R, to automate the scripts and to integrate the analysis into other software packages. Another advantage compared to other segregation tools, is its noncommercial use which makes it available to a wide range of individuals who can download it for free directly from the Comprehensive R Archive Network (CRAN) where it is available at https://CRAN.R-project.org/package=OasisR. Since it is an open source package, it can be improved by the scientific community.
As we saw in Section 2, one of the most important benefits of OasisR is that it clarifies many ambiguities concerning definition of the segregation indices by providing proper computation and the possibility to choose among the different forms of the indices in the literature.
Another important contribution is the development of several resampling methods which allow testing of the statistical significance of indices. Three distinct types of simulations are provided: randomization tests, bootstrapping, and jackknife. Additionally, graphic functions are provided for a better visualization of results. It is clear that there is still work to do in this field, especially concerning the application of Bayesian inference.
The OasisR package is currently in its third version. Despite optimization efforts, more work is needed to reduce computation time for certain complex indices, especially in the case of large study areas.

A. Indices definition and use in OasisR
The table in this appendix gives an overview of the definition and use of the indices. The following notation is used in the table:
$n$: number of spatial units;
$x_i^k$: population of group $k$ in spatial unit $i$;
$X^k$: population of group $k$ in the area;
$t_i$: total population in spatial unit $i$;
$T$: total population in the area;
$t_i^{k_1,k_2} = x_i^{k_1} + x_i^{k_2}$: population of groups $k_1$ and $k_2$ in spatial unit $i$;
$T^{k_1,k_2} = X^{k_1} + X^{k_2}$: population of groups $k_1$ and $k_2$ in the area;
$p_i^k$: proportion of population $k$ in spatial unit $i$;
$P^k$: proportion of population $k$ in the area;
$p_i^{k_1,k_2}$: proportion of group $k_1$ in the population of groups $k_1$ and $k_2$ in spatial unit $i$;
$f(\lambda)$: function of contiguity interaction, similar to distance interaction; we propose two forms, reciprocal $f(\lambda) = 1/\lambda$ and exponential $f(\lambda) = \exp(-\beta\lambda)$, where $\beta$ is a distance decay parameter;
$R$: a spatial region;
$q$: points within the region;
$\tau_q$: population density at point $q$;
$\tau_q^k$: population density of group $k$ at point $q$;
$\tilde{\tau}_q^k$: population density of group $k$ in the local environment of point $q$;
$\pi_q^k$: proportion in group $k$ at point $q$;
$\tilde{\pi}_q^k$: proportion in group $k$ in the local environment of point $q$;
$p, q$: percentile ranks in the population of interest corresponding to a continuous variable (such as income);
$n_1^k$: rank of the spatial unit where the sum of all $t_i$ equals or exceeds $X^k$ (from 1 to $n_1^k$), spatial units being ordered by geographic size;
$n_2^k$: rank of the spatial unit where the sum of all $t_i$ equals or exceeds $X^k$ (from $n$ to $n_2^k$), spatial units being ordered by geographic size;
$T_1^k$: sum of all $t_i$ from spatial unit 1 to spatial unit $n_1^k$;
$T_2^k$: sum of all $t_i$ from spatial unit $n_2^k$ to spatial unit $n$;
$X_i^k$: the cumulative percentage of the group $k$ population through the $i$th spatial unit, spatial units being ordered by distance to the center (standard version of centralization) or to the closest center (polycentric version);
$t_i^k$: the cumulative percentage of total population through the $i$th spatial unit, spatial units being ordered by distance to the center (standard version of centralization) or to the closest center (polycentric version);
$\tilde{n}$: the magnitude of the centrality effect, with $\tilde{n} = n$ for unconstrained centralization and $\tilde{n} < n$ for local/constrained centralization indices.

Exposure
Isolation index (exact version)
Interaction index (exact version)
Interaction index (approximate version)
Spatial isolation/exposure index: see Reardon and O'Sullivan (2004) for details.

Multi-group indices
Multi-group dissimilarity
Multi-group Gini
Multi-group normalized exposure: PMulti(x)
Multi-group information theory: HMulti(x)
Multi-group relative diversity: RelDivers(x)
Squared coefficient of variation
Spatial multi-group dissimilarity: see Reardon and O'Sullivan (2004)
Rank-order segregation indices

The next example illustrates the variation of the generalized adjusted dissimilarity index as a function of the contiguity order k (distribution A).

R> xtest <- ResampleTest(B, fun = "ISDuncan", simtype = "Boot",
+    sampleunit = "unit", spatobj = segdata)
R> xtest$Summary