SPECIES : An R Package for Species Richness Estimation

We introduce an R package SPECIES for species richness or diversity estimation. This package provides simple R functions to compute point and conﬁdence interval estimates of species number from a few nonparametric and semi-parametric methods. For the methods based on nonparametric maximum likelihood estimation, the R functions are wrappers for Fortran codes for better eﬃciency. All functions in this package are illustrated using real data sets.


Species richness estimation problem
The species problem has a wide range of important applications spanning multiple disciplines including ecology (Fisher et al. 1943;Boulinier et al. 1998), linguistics (Efron and Thisted 1976;McNeil 1973;Thisted and Efron 1987), numismatics (Stam 1987), and genomics (Mao 2002;Acinas et al. 2004;Hong et al. 2006). A typical species data set contains a series of counts x i , i = 1, ..., D, recording the number of individuals observed from a total of D distinct species in the sample. The counts data are often further summarized into the frequency of frequencies data in the form of n = (n 1 , ..., n K ) where n j = D i I{x i = j} (I is the indicator function) is the number of species with j individuals observed, and K = max i x i is the maximum number of individuals observed from any single species. In the following context, we shall reserve i for indexing the individual species, and j for the sample species abundance. The total number of the distinct species N in the underlying population is the parameter for estimation.

Overview of this package
A rich literature exists on this problem. For an excellent review, we recommend Bunge and Fitzpatrick (1993). There are a few well-known software tools available for computing species diversity, including EstimateS (Colwell 2009), SPADE (Chao 2010) and ws2m (Turner et al. 2003). These tools all offer a menu-driven interface to calculate the estimates for a single data set, but none provides the functionality that allows users to calculate the estimates repeatedly from the command line. This feature is appealing when one needs to systematically investigate the behavior of different estimators using Monte-Carlo simulations.
Recently, several new methods have been developed based on nonparametric maximum likelihood (NPML) estimation (Norris and Pollock 1998;Wang 2010). These methods are competitive in performance while all complicated in computing. Therefore it is highly desirable to integrate these methods into a software tool. The R (R Development Core Team 2011) package SPECIES is a creation to this end, having seven main functions including chao1984(), ChaoLee1992(), ChaoBunge(), jackknife(), unpmle(), pnpmle() and pcg(), implementing the lower bound estimator by Chao (1984), two coverage-based estimators by Chao and Lee (1992), the coverage-duplication estimator by Chao and Bunge (2002), the Jackknife estimator by Overton (1978, 1979), the unconditional NPML estimator (NPMLE) by Norris and Pollock (1998), the penalized conditional NPMLE by , and the the Poisson-Compound Gamma estimator by Wang (2010) respectively. The SPECIES package is available from the the Comprehensive R Archive Network at http://CRAN.R-project.org/package=SPECIES.

Methods and package functions
All functions in SPECIES require the input data to be summarized in the format of frequency of frequencies. The input data, denoted as n, must be defined as a two-column matrix or data frame, where the first column is j and the second column is n j for j = 1, ..., K, sorted in the ascending order of j. The zero-frequencies (n j = 0) can be omitted from n. The following example is the famous Malayan butterfly data (Fisher et al. 1943)  In this study, a total of D = 620 distinct butterfly species were observed, of which, 118 were singletons. The frequency n 25 denotes the collapsed count j≥25 n j . Other data sets stored in SPECIES include the expressed sequence tag(EST) data (EST, , the microbial species data (microbial, Hong et al. 2006), the traffic data (traffic, Böhning and Schön 2005), cottontail rabbits data (cottontail, Chao 1987) and the insects data (insects, Burnham and Overton 1979).
The argument n is the input data as described above. The argument method in ChaoLee1992() can be chosen as ACE or ACE-1 (Chao and Lee 1992). One can also specify method = "all" (default) to compute both estimators. The argument t is an integer-valued cut-off that defines the less abundant (j ≤ t) or more abundant species (j > t). The species data are often extremely right skewed. The less abundant species are more informative in predicting the number of the unseen species. The estimators ACE, ACE-1, and ChaoBunge are sensitive to the choice of t. Avoiding an over-large t helps reduce the risk of extreme variance or bias. The default value of t is 10 as suggested by the authors in their original papers. For the confidence interval with a specified level by the argument conf (default 0.95), we used a log-transformation procedure from Chao (1987) for better coverage.
Let D = K j=1 n j , and T = K j=1 j · n j . The chao1984 estimator is a lower-bound estimator as follows:N The estimatorN chao1984 is simple, but typically biased downward as named (see a systematic investigation in Wang and Lindsay 2005, or the traffic data example below). The ACE, ACE-1, and ChaoBunge estimators are all based on a concept called coverage, denoted as C, defined in Good (1953) as follows: where p i is the relative abundance of species i, and x i = 0 if species i is not observed. The term coverage measures the total abundance of observed species in the population. If the species abundance p i is homogeneous across all species, then clearly N = D/C. An estimator of the C from Good (1953) isĈ = 1 − n 1 /T , resulting in the Good's estimator LikeN chao1984 , the Good's estimatorN G is well known for under-estimation because natural species populations are typically heterogeneous (Chao and Lee 1992). To account for the heterogeneity of species abundance, Chao and Lee (1992) proposed two improved estimators by estimating the coefficient of variation (CV) of p i . The resulting estimators arê whereγ andγ are two estimators of CV. In particular the second provides further bias correction beyondγ, but typically incurring larger variance ofN ACE−1 .
The kth order jackknife estimator iŝ The Jackknife order is used to balance bias and variance. A higher order corrects for more bias, but causes larger variance as well. The Jackknife order is specified by the argument k with a default value 5. This function also automatically computes the order using a step-wise testing procedure from the above papers. The argument conf specifies the confidence level not only for the confidence interval, but also for the critical value used for the step-wise Z test to determine the order. If the specified order is larger than the obtained from the test, then the latter is used in the output. If the test-selected order exceeds 10, the estimate at k = 10 will be reported (which though rarely happens in practice).
The rest three functions unpmle(), pnpmle() and pcg() are three variants of the nonparametric maximum likelihood based approaches under a Poisson mixture model. They were all wrapped from Fortran codes for computing speed consideration. Suppose X follows a Poisson mixture distribution f (x; Q) where the mixing distribution is Q for the mean parameter. Then n = (n 1 , ..., n K } follows a multinomial distribution with corresponding cell probabilities f (j; Q), j = 1, ..., K. The resulting likelihood can be factored into two parts as follows: where L m and L c are referred to as the marginal and conditional likelihood respectively (conditioning on that a species is observed, e.g., for n j 's with j > 0).
In this approach, a pair of (N ,Q) is found to maximize the unconditional likelihood L(N, Q; n). The argument t is the same cut-off as described in the ChaoLee1992() function with a default value 15. We recommend to use t ≥ 10. The argument C ( = 1 or 0) specifies whether a confidence interval should be calculated. Since there is no analytical form for the confidence interval, a bootstrap confidence interval of level specified by argument conf is provided (Wang 2010). The arguments b and seed specify how many bootstrap samples to be generated, and what seed to be used in random number generation for bootstrap respectively. If seed is not specified, the R internal random seed is used. These two arguments are ignored if C = 0. The argument method specifies which method to be used to find the unconditional NPMLE of N . The first method is "N-P", in which an iterative algorithm by Böhning and Schön (2005) is used. Sometimes this method can be extremely slow. Alternatively one can use the default method "W-L" by , in which the approximate unconditional NPMLE (with high precision) of Q is found from the following penalized likelihood: The approximate method typically yields identical or nearly identical estimate as the exact method, while it can be drastically faster (see illustrations below). The unconditional NPMLE can be as extreme as ∞. If the point estimate progresses beyond 20 · D within iterations, the algorithm stops and reports the current point estimate. Otherwise the reported unconditional NPMLE can be even more unstable. The last argument dis specifies whether the mixture estimates should be output to the screen. Turning this off by setting dis = 0 allows to avoid overflow of screen information in Monte-Carlo simulations.
To improve the stability of the NPML estimators,  proposed a penalized NPMLE by applying a quadratic penalty function to the conditional likelihood. This method is implemented in function pnpmle(), pnpmle(n, t = 15, C = 0, b = 200, seed = NULL, conf = 0.95, dis = 1).
All the arguments are the same as described in unpmle().
This method was motivated by severe under-estimation observed from popular nonparametric estimators due to interplay of inadequate sampling effort, large heterogeneity and skewness . Unlike unpmle or pnpmle method where the species abundance distribution is estimated by a discrete distribution, a compound Gamma with a unified shape parameter (α) is used in pcg method to bring more bias correction in targeted situations. The unified shape parameter α is chosen by a cross-validation procedure on a grid specified by the argument alpha to balance the bias and variance of the resulting estimates of zero-truncated Poisson mixture probabilities. This function automatically appends α = ∞ onto the grid for cross-validation. We recommend to use a grid with α ≥ 1 to avoid extreme variability. The other arguments are the same as defined in pnpmle() or unpmle(). We also recommend to use a relatively larger t than the unpmle() or pnpmle() (default value is 35) since we are fitting a continuous curve for Q. Caution should be taken if the last count in n, n K , is a collapsed count of species that have x ≥ K. For example, in the butterfly data from this package, n 25 stands for j≥25 n j , and therefore t ≤ 24 should be used.

Illustrations
In this section, we illustrate all main functions using the data sets from original publications. If the result differs from the reported, it is explicitly pointed out below. We first illustrate the chao1984() function using the cottontail rabbit data from Chao (1987, p. 787). This data set was from a capture-recapture experiment. The species number estimation methods also apply to this type of data.
R> library("SPECIES") R> data("cottontail") R> cottontail The reported point estimate and 95% confidence interval in Chao (1987) were 134 and (103,202) respectively. The minor difference in the lower bound is probably due to rounding error. As another illustration, we applied chao1984() to the Taxicab data from Sampling scheme B.g in Table 1 of Chao (1987), the results are identical.
Note the selected order based on the stepwise test is 2 at significance level 0.05. Therefore if the user had specified a higher order, the output would still be the same, e.g., at order = 2. We noticed that results in Table 1 are identical to the original results (with negligible rounding errors) except for T k and P k at k = 1. Further work is need to figure out the cause of this slight but apparent discrepancy. So far the author has not yet found any discrepancy between the output from this function and the published reports in terms of point estimate and standard error.  Table 6 of Burnham and Overton (1979, p. 935). The order k jackknife estimate is denoted asN Jk , and se(N Jk ) is its standard error . The test statistic T k is the Z-statistic and P k is the two-sided p-value for testing order= k vs. order= k + 1.
We illustrate ChaoLee1992(), ChaoBunge(), unpmle(), and pnpmle() using the butterfly data that was analyzed in Chao and Bunge (2002,  R> unpmle(butterfly, t = 15, method = "N-P") Method: Unconditional NPMLE method by Norris andPollock 1996, 1998, using  R> unpmle(butterfly, t = 15, method = "W-L", C = 1) Method: Unconditional NPMLE method by Norris andPollock 1996, 1998, using  For unpmle(), it took about 4 minutes to compute the bootstrap confidence interval based on 200 samples using the approximate method "W-L" on a Mac OSX machine with a 2.93 GHz processor. The exact method "N-P" uses an algorithm by Böhning and Schön (2005), treating the unobserved species as missing data. Its computing time depends on the fraction of the missing information. For example, in the butterfly data, about 14% of the species were not observed (based on the point estimateN = 722). It took the exact method about 20 minutes to finish 200 bootstrap estimates. In the traffic data below, about 83% of the species were not observed. As a result it took about 2 hours to finish 200 bootstrap samples using the exact method in contrast to about 4 minutes using the "W-L" method.
The pnpmle and pcg methods are both based on the conditional likelihood. The NPML estimates (p and pi) from pnpmle() are the mean parameters and their respective weights in the zero-truncated Poisson mixture. For pcg(), the output p and pi are the mean parameters and their weights of the Gamma mixture in the zero-truncated Poisson-compound Gamma model under the selected α model. The computing time for pnpmle() is typically a few minutes based on 200 bootstrap samples. The pcg() procedure is much more computing intensive because of a cross-validation procedure used in model selection. It is common to take more than one hour for 200 bootstrap samples. We illustrate it with the traffic data, which originally appeared in Simar (1976) and was reanalyzed recently by Böhning and Schön (2005) and Wang (2010 R> unpmle(traffic, t = 7, C = 1) Method: Unconditional NPMLE method by Norris andPollock 1996, 1998, using  Clearly N is substantially under-estimated by most of the estimators except unpmle and pcg. The negative estimate was observed for ChaoBunge method probably because the true species abundance distribution deviated from the Gamma distribution assumed in this method (see also Wang andWang 2010).