Flexible Scan Statistics for Detecting Spatial Disease Clusters: The rflexscan R Package

The spatial scan statistic is commonly used to detect spatial disease clusters in epidemiological studies. Among the various types of scan statistics, the flexible scan statistic proposed by Tango and Takahashi (2005) is one of the most promising methods to detect arbitrarily-shaped clusters. In this paper, we introduce a new R package, rflexscan (Otani and Takahashi 2021), that provides efficient and easy-to-use methods for analyses of spatial count data using the flexible spatial scan statistic. The package is designed for any of the following interrelated purposes: to evaluate whether reported spatial disease clusters are statistically significant, to test whether a disease is randomly distributed over space, and to perform geographical surveillance of disease to detect areas of significantly high rates. The functionality of the package is demonstrated through an application to a public-domain small-area cancer incidence dataset in New York State, USA, between 2005 and 2009.


Introduction
Evaluating whether a disease is randomly distributed or tends to occur as clusters over space is among the most crucial aspects of epidemiological studies, and this can be primarily performed using disease mapping. As an example, Figure 1 shows a choropleth map of standardized incidence ratios (SIRs) of breast cancer (female) in the Manhattan borough of New York City, comprising 982 census blocks for the years 2005-2009 based on the age-specific incidence rates from the 2010 census counts for New York State (Boscoe, Talbot, and Kulldorff 2016). The total observed number of breast cancer cases for the 5 years was 6,219, and the SIR for the entire area was 1.07. Note that the age adjustment is based on the population structure of New York State, not that of Manhattan. One might interpret from the map that breast cancer cases are clustered in a certain area. However, such a map often does not indicate the existence of meaningful clusters clearly and identifying them objectively is still challenging. Various statistical tests have been proposed to address this issue (Kulldorff 2006;Rogerson and Yamada 2008;Tango 2010). Among those, cluster detection tests (CDTs) investigate whether a disease pattern is completely random over a space, without any prior information, while indicating regions with high disease prevalence. The spatial scan statistic (Kulldorff 1997) based on the maximum likelihood ratio is one of the most powerful methods of the CDT. However, Kulldorff's statistic considers only circular or elliptic shaped clusters and has difficulty in correctly detecting clusters with other shapes. To detect arbitrarily-shaped clusters, Tango and Takahashi (2005) proposed the flexible scan statistic, which is designed so that the detected cluster can be flexible in shape, while concurrently, the cluster is confined within relatively small neighborhoods of each region. Tango and Takahashi (2012) further proposed a flexible scan statistic with a restricted likelihood ratio that consumes much less computation time than the original one and tends to detect clusters of any shape reasonably well as the relative risk of the cluster increases.
Several software for performing tests based on the Kulldorff's scan statistic are available. The SaTScan software developed by Kulldorff and Information Management Services, Inc. (2018) is freely available and has been widely used. R (R Core Team 2021) packages such as rsatscan (Kleinman 2015), SpatialEpi (Kim and Wakefield 2021), scanstatistics (Allévius 2018), and smerc (French 2021) are also available from the Comprehensive R Archive Net- work (CRAN). Other R packages for detection of clusters include the DClusterm package (Gómez-Rubio, Moraga, Molitor, and Rowlingson 2019) which uses a model-based approach and allows the inclusion of covariates, and the SpatialEpiApp package (Moraga 2017) which detects clusters using the spatial scan statistics implemented in SaTScan and allows to visualize the results through interactive maps and tables. Meanwhile, there is no R package that can efficiently conduct tests based on the flexible scan statistic, although a stand-alone application, FleXScan (Takahashi, Yokoyama, and Tango 2013), is freely available. Packages scanstatistics and smerc can detect clusters using the flexible scan statistic, and a wrapper of smerc, FlexScan (Du and Hao 2021), is also available. However, it is difficult to detect large clusters using these packages because of heavy computational load (see Section 5).
In this paper, we introduce a new R package rflexscan (Otani and Takahashi 2021), that is an R implementation of the FleXScan software for performing purely spatial analysis using the flexible scan statistic more efficiently. Package rflexscan is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=rflexscan. The feasibility of the package is demonstrated through an application to the public domain small-area cancer incidence data for New York State during 2005-2009(Boscoe et al. 2016, which are available at https://www.satscan.org/datasets/nyscancer/. Table 1 shows an example of breast cancer cases in Manhattan consisting of the observed and age-adjusted expected number of cases of breast cancer as well as centroid coordinates for each region. The whole dataset contains the observed and the age-and sex-adjusted expected number of cases for 23 anatomic sites of cancer diagnosed in New York State at the census block group level as well as geographic identifiers and centroid coordinates. The paper is organized as follows. Section 2 introduces the spatial scan statistics, and Section 3 describes the functionality of the rflexscan package. In Section 4, we demonstrate the feasibility of the package by applying it to cancer incidence data. Section 5 discusses the computational efficiency of the package and provides a benchmark result, and finally, Section 6 concludes the paper with some points to consider.

Methods
The spatial scan statistic (Kulldorff 1997) tries to identify the most likely cluster (MLC), defined as the set of connected regions, i.e., window, that achieves the maximum likelihood ratio by searching (scanning) over a set of candidates for the hotspot cluster with signifi-cant elevated risk. In this paper, we describe the Poisson model that is based on the observed/expected number of disease cases and can adjust for potential confounders such as sex and age. In contrast, the Binomial model that is based on the observed number of disease cases and the background population at risk in each area is also implemented in rflexscan. For more information on the Binomial model, please refer to Kulldorff (1997).
Let us assume that the entire study area is divided into m regions (e.g., counties or enumeration districts). The number of cases of a particular disease in region i is denoted by the random variable N i with observed value n i (i = 1, . . . , m) and the total number of cases n = n 1 + · · · + n m where m is the number of regions in the study area. Let W be the set of all potential scanning windows of any size. With the use of the notation of window w ∈ W, let us assume that the relative risk is θ w for regions inside of w and is θw for regions outside of w. The N i (i = 1, . . . , m) are independent Poisson variables such that where Poisson(ξ) denotes the Poisson distribution with mean ξ, ξ i is the age-adjusted expected number of cases in region i under the null hypothesis, and n = ξ i + · · · + ξ m . For calculation of the expected number of cases adjusted for potential confounders such as age, we can use indirect standardization (Waller and Gotway 2004).

Spatial scan statistic
The null hypothesis of no clustering is expressed as Meanwhile, under the alternative hypothesis H 1 , there is at least one window w for which the underlying risk is higher inside the window than the outside, that is, Under the Poisson assumption, the likelihood ratio for window w is given by where n(w) is the observed number of cases in the window w. The MLC w * is defined as and a test (scan) statistic to assess the statistical significance of w * is defined as λ * = λ(w * ).
The above statistic is widely used; however, Tango and Takahashi (2005) and Tango (2000) have shown that the scan statistic using the likelihood ratio given in Equation 2 tends to detect an MLC that is much larger than the true cluster by swallowing neighboring regions with non-elevated risk. To avoid or scale back such undesirable phenomena, Tango (2008) proposed the following restricted likelihood ratio by considering each region's risk.
where I() is the indicator function, and p i is the one-tailed p value of the test for H 0 : N i = ξ i given by the middle p value and α 1 is a prespecified significance level for the individual region. The MLC w * using the restricted likelihood ratio is defined as and the test statistic is λ * = λ(w * ). Note that it is equivalent to the original likelihood ratio in Equation 2 when α 1 = 1.

Windows to be scanned
According to the different definition of W, various scan statistics can be derived. The rflexscan package implements the flexible scan statistic (Tango and Takahashi 2005) and Kulldorff's scan statistic (Kulldorff 1997).
The flexible scan statistic imposes a flexibly shaped window on each centroid of the region by connecting its adjacent regions. For any given region i, the method creates the set of flexibly shaped windows with length k consisting of k connected regions including i and lets k move from 1 to the prespecified maximum length K. To avoid detecting a cluster of unlikely peculiar shape, i.e., over-sized oddly shaped with multiple narrow branches, the connected regions are restricted to the subsets of the set of regions i and K-nearest neighbors to the region i. In total, multiple different, but overlapping arbitrarily-shaped windows are created, each with a different location and size, and each being a potential cluster. Let w ik , (k = 1, . . . , K) denote a window composed by the region i and (k − 1) nearest neighbors to i. Also, let w ikl (k = 1, . . . , K, l = 1, . . . , L ik ) denote a flexibly shaped window that is a set of k regions connected starting from the region i, where L ik is the number of windows satisfying w ikl ⊆ w iK . Then, all the windows to be scanned are included in the set Meanwhile, the Kulldorff's original scan statistic imposes a circular window on each centroid of regions. For any of those centroids, the radius of the circle varies from zero to an upper limit defined by the user. If the window contains the centroid of a region, then that whole region is included in the window. In total, as in the flexible spatial scan statistic, several different, but overlapping circular windows are created. In the rflexscan, the upper limit is determined by specifying the maximum length K of nearest neighbors, while the standard option in SaTScan is to specify it as 50% of the population at risk. With the use of the notation of window w ik , all the windows to be scanned are included in the set The scan statistic based on W c will be referred to as circular scan statistic.

Calculating p value
The Monte Carlo test (Dwass 1957) is used to determine the significance of w * . Under the null hypothesis, we randomly generate B samples of the observed number of cases n i in each region i using the Poisson distributed random number generator (rpois method) and calculate test statistics λ b = λ(w * ) (b = 1, 2, . . . , B) based on each sample. For the test statistic λ * derived from the actual data, its p value is approximated bŷ Note that the distribution of the test statistic under the null hypothesis varies depending on the scan statistic used, and the parameters used in constructing the window, and the resulting p value also varies.

Detecting secondary clusters
The aforementioned procedure was intended to identify only the primary cluster, w * 1 = w * . Kulldorff (1997) extended its use for detecting multiple clusters; the procedure was repeatedly used to identify other clusters, i.e., secondary clusters, w * 2 , w * 3 , . . . , among which there were no overlaps, i.e., w * k ∩ w * k = ∅ for k = k . Consequently, their likelihood ratios always followed a descending order, The statistical significance of secondary clusters was evaluated in the same way as that of the MLC, i.e., the likelihood ratio of each secondary cluster was compared with that calculated from randomly generated data sets.
Note that the above procedure does not require a correction for multiple testing. The scan statistic avoids multiplicity by testing whether at least one cluster exists rather than repeating the test for the existence of each clusters. This is also the case for the detection of secondary clusters. The procedure for secondary clusters evaluates these clusters one by one, and each corresponding p value is calculated as if the cluster were the primary one, i.e., the MLC. The interpretation of this approach is that we are evaluating whether the secondary clusters are able to reject the null hypothesis in Equation 1 on their own strength, whether the MLC is a true cluster or not. A drawback of this approach is that the p values are conservative (Kulldorff 1997;Zhang, Assunção, and Kulldorff 2010) with a corresponding loss of statistical power.

The rflexscan package
The R package rflexscan includes functions for analyzing spatial count data using the flexible spatial scan statistics as well as the circular scan statistic. This package can be installed and loaded as usual: R> install.packages("rflexscan", dependencies = TRUE) R> library("rflexscan") Figure 2 shows examples of flexibly shaped clusters that can be detected using rflexscan. SaTScan software and rsatscan package can detect circular and elliptical clusters, but not cross and loop shaped clusters. In contrast, the rflexscan package can detect clusters of any shape shown in Figure 2.
The main function of the package is rflexscan, which takes the following arguments: R> rflexscan(x, y, lat, lon, name, observed, expected, population, nb, + clustersize = 15, radius = 6370, stattype = "ORIGINAL", + scanmethod = "FLEXIBLE", ralpha = 0.2, simcount = 999, + rantype = "MULTINOMIAL", comments = "", verbose = FALSE) Centroid coordinates for each region should be specified by Cartesian coordinates using arguments x and y or by latitudes and longitudes using arguments lat and lon. name specifies identifiers for each region and observed specifies the observed number of cases. For the Poisson model described in Section 2, the expected number of cases under the null hypothesis should be specified using expected. In addition, for the Binomial model (Kulldorff 1997), the background population at risk in each area should be specified using population. These arguments should be specified by vectors that have length m. The i-th element of the vector represents the data of region i.
nb is a list of neighbors or an adjacency matrix that expresses a structure of the connection between regions. When a list is specified by users, it should contain m integer vectors, where the i-th vector contains either the indices in the range from 1 to m of the neighbors of region i, or as.integer(0) to signal no neighbors (see Section 4.1 for details). When users specify an adjacency matrix A, it should be a symmetric m × m matrix with zeros on its diagonal. Its element A ij is one when region i and j (i = j) are connected, and zero when there is no connection.
In addition to the above necessary input data, users can specify the following parameters to control functionality.
• clustersize: The number of maximum spatial cluster size to scan (K), i.e., the maximum number of regions included in the detected cluster.
• radius: Radius of the earth in kilometers to calculate a distance between two sets of latitude and longitude. It is approximately 6370 km in Japan. This is deprecated. The distance calculated using this parameter is not accurate. This feature is implemented to maintain compatibility with FleXScan. It is recommended to transform latitude and longitude onto the Cartesian coordinate system beforehand and use the x and y parameters that are projected coordinates.
• stattype: Statistic type to be used (case-insensitive). If "ORIGINAL" is specified, the likelihood ratio statistic by Kulldorff and Nagarwalla (1995) is used. If "RESTRICTED" is specified, the restricted likelihood ratio statistic by Tango (2008) is used with a preset parameter ralpha for restriction.
• scanmethod: Scanning method to be used (case-insensitive). If "FLEXIBLE" is specified, the flexible scan statistic by Tango and Takahashi (2005) is used. If "CIRCULAR" is specified, the circular scan statistic by Kulldorff (1997) is used.
• ralpha: The prespecified significance level for the individual region used for the restricted likelihood ratio statistic (α 1 ).
• simcount: The number of Monte Carlo replications to calculate a p value for the statistical test.
• rantype: The type of random number for Monte Carlo simulation (case-insensitive). If "MULTINOMIAL" is specified, the total number of cases in the whole area is fixed. This option can be chosen in either Poisson or Binomial model. If "POISSON" is specified, the total number of cases is not fixed. This option can be chosen only in the Poisson model.
• comments: Comments for the analysis which will be written in a log of processing.
The return value of the rflexscan function is an object of 'rflexscan' class that contains analysis results and parameters specified by the user. To output summaries of results and to visualize detected clusters using a graph representation, S3 methods summary and plot are available for this class, respectively. Also, the choropleth method can be used to make a choropleth map displaying detected clusters.

Examples
In this section, we demonstrate the feasibility of the rflexscan package through an application to the small-area cancer incidence dataset (Boscoe et al. 2016), which is a public-domain dataset containing data for 23 anatomic sites of cancer diagnosed in New York State, USA between 2005 and 2009 at the census block group level. Here, we use a dataset provided by the ESRI shapefile format that is freely available from the SaTScan (Kulldorff and Information Management Services, Inc. 2018) website at https://www.satscan.org/datasets/ nyscancer/. The dataset contains the following information for each region as well as geometric information: • Geographical identifier consisting of 12 digits (coded from the state, county, tract, and census block group) based on the 2010 census.
• Centroid coordinates by latitude and longitude.
• Number of diagnosed (observed) cases for specific cancer.
• Expected number of cases for specific cancer adjusted for sex and 5-year age groups up to 85+ years, using the 2010 census counts for New York State.
First, let us load the dataset into the R environment. Although we use the rgdal (Bivand, Keitt, and Rowlingson 2021) package here, other packages for spatial analysis such as the sf (Pebesma 2018) package can also be used.
R> library("rgdal") R> nys <-readOGR("NYS_Cancer/NYSCancer_region.shp", + stringsAsFactors = FALSE) For example, the observed and expected number of breast cancer cases (OBREAST and EBREAST) as well as geographic identifier (DOHREGION) and centroid coordinates (LATITUDE and LONGITUDE) are contained as follows: R> head(nys@data[c("DOHREGION", "LATITUDE", "LONGITUDE", + "OBREAST", "EBREAST")]) Here, we try to analyze the spatial count data of breast cancer cases in Manhattan. Because a prefix of the geographic identifiers for Manhattan is "36061" ("36" is a state code for New York State and "061" is a county code for Manhattan), we can easily extract data of Manhattan from the whole dataset as follows:
Although the rflexscan method has a function to calculate the distance between two sets of latitude (lat) and longitude (lon), it is deprecated because of accuracy issues. This feature is implemented to maintain compatibility with FleXScan. We recommend transforming latitude and longitude onto the Cartesian coordinate system beforehand and use the x and y parameters.

Basic usage
For performing spatial data analysis using the flexible scan statistics, a neighbors list or an adjacency matrix that expresses a connection status between regions should be constructed. Although it is desirable to determine connections based on the actual geographical and/or social factors, here we automatically construct the neighbors list from the SpatialPolygonsDataFrame object using the poly2nb function in the spdep ( In this case, region 1 is connected with regions 2, 3, 4, 11, 12, 13, 30, and 31. Figure 4 shows the resulting neighbors list via a graph representation. The object contains 6,614 connections for the 982 regions. Now, we are ready to perform the main analysis using the flexible scan statistic. This can be easily done with the following lines of code (which might take several seconds): R> fls <-rflexscan(name = manhattan$DOHREGION, + x = coord$x, y = coord$y, nb = nb, + observed = manhattan$OBREAST, expected = manhattan$EBREAST) The return value fls is an object of 'rflexscan' class. The S3 method print for the class briefly shows the results: The first part of the output displays the function call specified by the user. The next part displays identifiers of regions included in the MLC with the p value. In addition, the last part displays the number of secondary clusters.
The member variable cluster in the 'rflexscan' object is a list of 'rflexscanCluster' objects. The 'rflexscanCluster' class contains properties of detected clusters. The following code display properties of the MLC w 1 using the print method for this class: The properties include areas in the cluster, the maximum distance between areas in the cluster, the number of cases, the expected number of cases, the relative risk of the disease, the likelihood ratio statistic, the rank obtained in the Monte Carlo simulation, and the p value.
Properties of secondary clusters can also be displayed with a similar code. For example, the following code displays properties of w 2 :

Scanning method: FLEXIBLE Statistic type: ORIGINAL
The first part is the function call as with the output of the print method for the 'rflexscan' class. The next part labeled as Clusters displays a list of detected clusters. The first row of the list is the MLC, and others are secondary clusters. rflexscan reports clusters with p values calculated by the Monte Carlo test less than one. For each cluster, the number of areas (NumArea), the maximum distance between areas in the cluster (MaxDist), the observed number of cases (Case), the expected number of cases (Expected), the relative risk of disease (RR), the likelihood ratio statistic (Stats), and p value (P) are presented. Statistically significant clusters are marked; in this case, we detected seven significant clusters (p < 0.05). The final part displays some summary statistics and model parameters used.
Users can evaluate whether a disease is randomly distributed over space from the Clusters part. In this case, p value corresponding to the MLC is 0.001 (see the first row of the list), which implies breast cancer cases are clustered rather than randomly distributed over the study area. Further, p values are reported for the MLC and secondary clusters that enable us to evaluate which cluster is statistically significant.
The MLC at the east of Central Park (colored by black) corresponds to a previously reported cluster with unusually high cancer incidence (Boscoe et al. 2016). However, the shape of the MLC is not circular, whereas the reported clusters determined based on the scan statistic of Kulldorff (1997) are all circular. The shapes of secondary clusters are also diverse, although they are located in reported block groups with high incidence (Boscoe et al. 2016).

Restricted likelihood ratio
The rflexscan package also implements the flexible scan statistic with the restricted likelihood ratio (Tango and Takahashi 2012). The following codes can be used to analyze breast cancer data with the significance level α 1 = 0.2 for the individual region, the stattype argument specifies a likelihood ratio statistic to be used, and the ralpha specifies the significance level α 1 : R> fls2 <-rflexscan(name = manhattan$DOHREGION, + x = coord$x, y = coord$y, nb = nb, + observed = manhattan$OBREAST, expected = manhattan$EBREAST, + stattype = "RESTRICTED", ralpha = 0.2) R> summary(fls2) Call: rflexscan(x = coord$x, y = coord$y, name = manhattan$DOHREGION, observed = manhattan$OBREAST, expected = manhattan$EBREAST, nb = nb, stattype = "RESTRICTED", ralpha = 0. In this case, we detected four significant clusters (p < 0.05) and a suggestive cluster (p < 0.1). Also, Figure 6 (a) shows a choropleth map of significant clusters. As mentioned in Section 2.3, the distribution of the test statistic of clusters under the null hypothesis varies depending on the parameters used, and the resulting p value also varies. In this case, the MLC is the same as the one detected by the original flexible scan statistic, while the secondary clusters are slightly more compact than those of the original statistic. The restricted likelihood ratio eliminates neighboring regions with non-elevated risk of disease occurrence (Tango and Takahashi 2012) and avoids creating over-sized clusters that are oddly shaped with multiple narrow branches, i.e., the octopus effect (Costa, Assunção, and Kulldorff 2012;Duczmal and Assunção 2004).
The choice of α 1 varies depending on the situation and/or the user's specific consideration. Tango and Takahashi (2012) shows the following guidance regarding the choice of α 1 for a restricted likelihood ratio statistic of the nominal α level of 0.05: (1) α 1 between 0.10 and 0.20 to detect small clusters with a sharp increase in risk; (2) α 1 between 0.20 and 0.30 to detect small to middle-sized clusters with a moderate increase in risk; and (3) α 1 between 0.30 and 0.40 to detect larger clusters with a slight increase in risk. Tango (2008) further recommends α 1 = 0.20 as a default.

Cluster size
Users can specify the maximum number K of nearest neighbors to be scanned using the clustersize argument of the rflexscan function (default is 15). For the flexible scan statistic with the original likelihood ratio (Tango and Takahashi 2005) given in Equation 2, we recommend K ≤ 30 because of heavy computational load. Meanwhile, the flexible scan statistic with the restricted likelihood ratio (Tango and Takahashi 2012) given in Equation 3 allows us to consider larger clusters efficiently (see Section 5 for details).
As shown in Figure 6, different values of K produce different results. If prior information is not available, set K to as large a value as possible. Package rflexscan will then look for clusters of both small and large sizes without any pre-selection bias in terms of the cluster size. However, K should be set to no more than half of the number of regions m. A cluster of larger size would indicate areas of exceptionally low risks outside the window rather than an area of exceptionally high risks within the window. Besides, Tango and Takahashi (2005) pointed that it seems to be unlikely that the size of the true cluster would be larger than 10%-15% of the total number of regions.

Monte Carlo replications
The simcount argument specifies the number of Monte Carlo replications to calculate p values of detected clusters (default value is 999). For example, the following codes perform a similar analysis as Section 4.3, with 9,999 replications. Note that the computation time increases as the number of replications increases. The resulting 'rflexscan' object includes p values estimated more precisely than with the default value as follows:

Circular scan statistic
In addition to the flexible scan statistic, the rflexscan package implements the original spatial scan statistic proposed by Kulldorff and Nagarwalla (1995) and Kulldorff (1997) that considers only the set of circular-shaped windows W c . In this case, the scanmethod argument should be specified as "CIRCULAR" as follows: R> fls5 <-rflexscan(name = manhattan$DOHREGION, + lat = manhattan$LATITUDE, lon = manhattan$LONGITUDE, nb = nb, + observed = manhattan$OBREAST, expected = manhattan$EBREAST, + scanmethod = "CIRCULAR") R> summary(fls5) In this case, we detected five significant clusters (p < 0.05) and a suggestive cluster (p < 0.1). As shown in Figure 7 (a), all detected clusters are circular and different from those detected via the flexible scan statistic.
Similar with the flexible scan statistic, users can specify the maximum number K of nearest neighbors to be scanned using the clustersize argument:  Users might hesitate in deciding whether it is better to use the scan method "FLEXIBLE" or "CIRCULAR". Although the best choice depends on the situation and/or the user's specific consideration, the power characteristics of each method might be helpful. The circular spatial scan statistic shows better power for detecting circular-shaped clusters, but it has zero power for accurately detecting regions as a single non-circular cluster without any false positive regions or false negative regions, while the flexible scan statistic shows better powers for detecting non-circular clusters. Users can choose a more powerful method according to the expectation of the shape of clusters (based on the shape of regions, for example). Performance evaluations of the flexible scan statistic have been conducted in Tango and Takahashi (2005) and Tango and Takahashi (2012). Package rflexscan gives the same output as presented in these studies, resulting in the same performance for type I error rate and statistical power (and for sensitivity, specificity, and positive predictive value).

Computational load
Because numerous windows will be scanned in the flexible scan statistic for large clusters, the computational load is one of the main concerns in practical uses. The rflexscan package implements an efficient procedure (Tango and Takahashi 2005) used in the original FleXScan software through the Rcpp (Eddelbuettel and François 2011) package. Note that the procedure correctly detects the MLC and secondary clusters considering all possible windows, although unnecessary cluster candidates can be pruned during the scanning. Furthermore, the package also implements the flexible scan statistic with the restricted likelihood ratio (Tango and Takahashi 2012), allowing us to consider large clusters more efficiently.
As a demonstration, we conducted benchmarks using the breast cancer data in Manhattan described in Section 4. Again, Manhattan has 982 regions, and there are 6,614 connections. We measured computation times of the rflexscan method varying the parameter K, the maximum size of clusters to be considered. Other parameters were set to default values, the number of Monte Carlo replication was 999, and the significance level α 1 used for the restricted likelihood ratio was 0.2. Also, we conducted the same analysis using existing R packages scanstatistics (Allévius 2018) and smerc (French 2021). Figure 8 shows the measured computation times of each procedure for each value of K. The computation times of scanstatistics and smerc increase exponentially as K increases because the package enumerates all possible windows without pruning of unnecessary candidates to search the MLC. The rflexscan method with the original likelihood ratio takes comparably lower time than these packages because of the pruning procedure; however, its computation time also increases exponentially. In addition, smerc and scanstatistics required larger amount of memory since these packages constructed a list of the complete set of windows W f . The size of the list of windows increases exponentially with respect to the maximum size of the cluster (K), and therefore, these packages resulted in out of memory for a large K. rflexscan does not construct the list of the complete set of windows and then searches for MLCs, but instead sequentially scans windows using depth-first search to find the MLC avoiding out of memory. Due to this reason, rflexscan does not have a method implemented in smerc and scanstatistics that returns a list of the complete set of windows.
Meanwhile, a marked improvement was achieved by using the restricted likelihood ratio that eliminates neighboring regions with non-elevated risk. The rflexscan method required an almost constant calculation time of <5 seconds regardless of the setting of K in this case. Note that the effect of the restriction depends on the choice of α 1 . Longer computation time will be required when α 1 is increased; setting α 1 = 1 is equivalent to the use of the original likelihood ratio in Equation 2. To illustrate this point, we conducted another benchmark varying the threshold parameter α 1 . The maximum size of clusters K was 50 to have no impact on the computational burden, and the number of Monte Carlo replication was 999. Also, we conducted the same analysis using rflex.test method in smerc (French 2021) that implements the restricted likelihood ratio statistic. Figure 9 shows the measured computation times of each procedure as a function of α 1 . Since the size of the clusters is primarily limited by the value of α 1 , the computation time of rflexscan was less than 3 seconds for α 1 ≤ 0.2 while it increased exponentially as α 1 increased. rflex.test in smerc required more time and was unable to perform the analysis for α 1 > 0.17 because memory usage increased exponentially as α 1 increased.
In addition, we conducted the same analysis using rsatscan (Kleinman 2015) with SaTScan version 9.6 and measured computation times. We used a standard setting for the maximum cluster size, which specifies the upper limit of cluster size as 50% of the population at risk.
For detecting circular-shaped clusters, rsatscan took about 3 seconds as described below.
R> system.time({satscan("NYS_Cancer/", "breast_circular")}) user system elapsed 0.08 0.03 3.05 In contrast, for detecting elliptically-shaped windows, it took more computation time because of the increase in the number of candidate windows. With the standard setting for the maximum cluster size, rsatscan looks for clusters of both small and large sizes exhaustively. Nevertheless, it is highly optimized and is able to detect clusters very fast as described above. However, note that is has a limitation on statistical power to detect non-circular or non-elliptic clusters. Although rflexscan tends to take more computation time than rsatscan, it can detect arbitrarily-shaped clusters.

Conclusions
We presented the rflexscan package that was designed for analyzing spatial count data using the flexible scan statistic. The package is designed for any of the following interrelated purposes: to evaluate whether reported spatial disease clusters are statistically significant, to test whether a disease is randomly distributed over space, and performing geographical surveillance of disease to detect areas of significantly high incidence and prevalence. It is implemented in R, and no external program has to be installed. By combining rflexscan with many useful R packages such as rgdal (Bivand et al. 2021), sf (Pebesma 2018) or spdep (Bivand and Piras 2015;Bivand et al. 2013), users will be able to perform analysis and visualization easily and rapidly.
Some points should be noted for practical uses. As demonstrated, the clusters detected vary depending on the scan statistics and the control parameters used in constructing the window, and the test statistics and the p values will also vary. We recommend checking the robustness of your results using various statistic and control parameters. To check the robustness, users should conduct sensitivity analysis varying values of K and α 1 . When the results of the hypothesis testing are same on any parameter settings, i.e., p value of the MLC are similar, the result will be robust. Users also need to make sure that the clusters are detected in the same location. As mentioned earlier, some guidelines for choosing values of K and α 1 exist. We recommend to perform sensitivity analysis within these ranges. Further, the spatial count data and the geographical information to be input should be constructed appropriately. For example, although we automatically constructed the connections between regions using the poly2nb method for demonstration in the paper, the actual geographical and/or social factors in real-world practice should be carefully considered. However, owing to the presence of obstacles that separate the districts, such as mountains and rivers, it may not be appropriate to treat them as a single cluster, even if the polygons are in contact with each other. There are also areas where polygons are defined but not actually inhabited (e.g., Central Park in Manhattan). Therefore, simply determining a connection from a shapefile can be problematic. The use of inappropriate data and parameters will produce strange results and will lead you to false decisions.
Besides, the use of flexible scan statistic has some disadvantages such as the need to specify K and α 1 in advance, and the high computation time when the method is set to search clusters with a numerous areas. An alternative approach to detect areas of unusually high risk is to detect areas with high exceedance probabilities, i.e., areas that have a high probability that the relative risk exceeds a given threshold (Moraga 2019). When computing exceedance probabilities, the user will obtain the probability that the region exceeds a given threshold. If the probability is small, say less than 0.05, they can highlight the risk of the region as unusual. This approach has the advantage that the highlighted areas can be of any shape, allow the assessment of the effects of covariates, and time is not a restriction when numerous areas are detected. In contrast, the advantage of the spatial scan statistics is that it can evaluate the significance of clusters via statistical hypothesis testing, i.e., it can calculate p value corresponding to the detected clusters.
Currently, the rflexscan package only implements methods for performing purely spatial analysis, while the SaTScan software and scanstatistics allow space-time analysis. The flexibly shaped space-time scan statistic (Takahashi, Kulldorff, Tango, and Yih 2008) will be included in future versions.

Computational details
The analyses were performed on a laptop with an Intel Core i7-8565U CPU at 1.80GHz and 16GB RAM.