Mean and Variance Modeling of Under- and Overdispersed Count Data

This article describes the R package CountsEPPM and its use in determining maximum likelihood estimates of the parameters of extended Poisson process models. These provide a Poisson process based family of flexible models that can handle both underdispersion and overdispersion in observed count data, with the negative binomial and Poisson distributions being special cases. Within CountsEPPM, models with mean and variance related to covariates are constructed to match a generalized linear model formulation. Use of the package is illustrated by application to several published datasets.


Introduction
Modeling using extended Poisson process models (EPPMs) was originally developed in Faddy (1997), where the construction of discrete probability distributions having very general dispersion properties was described. The Poisson and negative binomial distributions are special cases of this modeling, which includes both underdispersion and overdispersion relative to the Poisson, with the negative binomial having the most extreme level of overdispersion within the EPPM family. Faddy and Smith (2008) incorporated covariate dependence in the mean via a reparameterization using an approximate form of the mean, and Faddy and Smith (2011) extended this to incorporate covariate dependence in the dispersion, this being achieved by a reparameterization using an approximate form of the variance. The supplementary material for Faddy and Smith (2011) contained R code illustrating the fitting of these models. This R code has been extended and generalized to have inputs and outputs more akin to those of a generalized linear model (GLM) as in the R function glm and the R function betareg (Cribari-Neto and Zeileis 2010; Grün, Kosmidis, and Zeileis 2012). Both Hilbe (2011) and Hilbe (2014) have comments about a software package for EPPMs being developed in the R system (R Core Team 2016); the package CountsEPPM (Smith and Faddy 2016) whose use is described in this article is that software. This article relates to version 2.1 of the package submitted to the Comprehensive R Archive Network (CRAN) in 2016. The important differences between version 1.0 and versions 2.0 and 2.1 involve the use of the arguments formula and data when running the function CountsEPPM.

Extended Poisson process models (EPPMs)
Only a summary of these models is given here; more details, in particular of their theoretical background and development, can be found in Faddy (1997). The defining equation is

    p = (1, 0, . . . , 0) exp(Q),                                    (1)

where p is a row vector of probabilities, exp(Q) is the matrix exponential function, and Q is the (n + 1) × (n + 1) matrix

    Q = [ −λ_0    λ_0     0    · · ·    0
            0    −λ_1    λ_1   · · ·    0
          · · ·  · · ·  · · ·  · · ·  · · ·
            0      0      0    · · ·  −λ_n ]

with the λ parameters being the rates of an extended Poisson process. Constant λ's here correspond to the simple Poisson process, with the distribution of the number of events in a time interval of length t being Poisson with mean λt. The extension allowing the λ's to depend on i, the cumulative number of events occurring, was first suggested in Faddy (1997). The probability p_i of obtaining a count of size i (i = 0, 1, . . . , n) in a time interval of unit length is the (i + 1)th element of p, so discrete probability distributions can be constructed from a sequence of λ's.
A three-parameter function for the λ_i sequence,

    λ_i = a(b + i)^c,  i = 0, 1, 2, . . . ,                          (2)

where a > 0, b > 0 and c ≤ 1, will be used to produce discrete distributions including the Poisson (c = 0) and negative binomial (c = 1) as special cases, the probability mass function of the negative binomial distribution here taking the form

    p_i = Γ(b + i) / (Γ(b) i!) exp(−ab) (1 − exp(−a))^i,  i = 0, 1, 2, . . . .

Generally, c > 0 in Equation 2 results in distributions overdispersed relative to the Poisson distribution, and c < 0 in distributions underdispersed relative to the Poisson. The mean and variance of distributions from EPPMs defined by Equation 2 generally have to be determined numerically, directly from the probability distribution of Equation 1 using a suitable truncation (n); however, approximate forms of the mean m and variance v are available (Faddy 1997; Faddy and Smith 2008; Faddy and Smith 2011), given as Equations 3 and 4. The mean approximation can be used to express a in Equation 2 in terms of m, b and c (Equation 5), and covariates x can then be incorporated in the mean using a log link (Equation 6). Covariate dependence in the variance can be incorporated by having v as a function of another linear predictor and then solving the variance approximation (Equation 7) for c in terms of v, m and b. Such covariate dependence in the variance permits the modeling of datasets where some subsets exhibit underdispersion and others overdispersion. As the right-hand side of Equation 7 is a convex increasing function of c, Newton-Raphson iteration (initial value c_0 = 1) will converge to the solution. If a solution c > 1 is indicated, that parameterization is inadmissible since the resulting probability distribution would be improper, so the parameter space is bounded by c ≤ 1. The above approximations are only used to derive the reparameterization of Equations 5 and 7; exact calculation of means and variances is done numerically from the probability distribution of Equation 1 to minimize the effect of the approximation.
In practice this means that, for c ≠ 0, 1, the actual relationship between log(mean) and the covariates x is not quite linear, but Toscas and Faddy (2003) noted that in their example any departure from linearity was almost imperceptible. Use of exact means (and variances) in this way overcomes, to a great extent, the bias in the approximations noted by Grunwald, Bruce, Jiang, Strand, and Rabinovitch (2011).
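The construction described above can be sketched in a few lines of base R. This is illustrative code, not the package's internals: the helper names mexp and eppm_probs are made up here, and the package itself uses the expm package for the matrix exponential.

```r
# Matrix exponential via scaling-and-squaring with a truncated Taylor
# series (a stand-in for expm::expm, which CountsEPPM uses).
mexp <- function(M, terms = 30) {
  s <- max(0, ceiling(log2(max(1, norm(M, "1")))))
  A <- M / 2^s
  P <- term <- diag(nrow(M))
  for (j in seq_len(terms)) {
    term <- term %*% A / j
    P <- P + term
  }
  for (j in seq_len(s)) P <- P %*% P
  P
}

# Probabilities p_0, ..., p_n of Equation 1 for the rate sequence
# lambda_i = a * (b + i)^c of Equation 2, over a unit time interval:
# p_i is the (i+1)th element of (1, 0, ..., 0) exp(Q).
eppm_probs <- function(a, b, cc, n) {
  lam <- a * (b + 0:n)^cc
  Q <- matrix(0, n + 1, n + 1)
  for (i in 1:n) {
    Q[i, i]     <- -lam[i]
    Q[i, i + 1] <-  lam[i]
  }
  Q[n + 1, n + 1] <- -lam[n + 1]
  as.vector(c(1, rep(0, n)) %*% mexp(Q))
}

# Special cases: c = 0 is Poisson(a); c = 1 is negative binomial with
# size b and success probability exp(-a).
p.pois <- eppm_probs(2.0, 1.0, 0, 25)
p.nb   <- eppm_probs(1.2, 2.5, 1, 40)

# Variance/mean ratio from a probability vector: > 1 (overdispersion)
# for c > 0 and < 1 (underdispersion) for c < 0.
disp <- function(p) {
  i <- seq_along(p) - 1
  m <- sum(i * p)
  (sum(i^2 * p) - m^2) / m
}
```

Comparing p.pois with dpois and p.nb with dnbinom confirms the two special cases numerically; disp applied to rate sequences with c = 0.5 and c = −0.5 illustrates the over- and underdispersion described above.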
The parameter b in Equation 2 is something of a nuisance parameter that is often poorly estimated, and is better estimated as log(b) (Faddy and Smith 2011), so the default model is specified by the two parameters a and c in Equation 2 dependent on covariates, with b always a single scalar-valued constant. This model fits well to data overdispersed with respect to the Poisson distribution, but Faddy and Smith (2011) found that poor convergence to the maximum can occur for underdispersed data if the parameter b becomes large and the hessian at an apparent maximum is poorly conditioned. In such circumstances a limiting form of the λ_i's of Equation 2, obtained by letting b → ∞ with a = αb^(−bβ) and c = bβ, is used (Faddy and Smith 2011):

    λ_i = α exp(iβ),  i = 0, 1, 2, . . . ,                           (8)

with β < 0 corresponding to c < 0 in Equation 2 and underdispersion. The approximate mean and variance are given by Equations 9 and 10, the variance being

    v = (exp(2βm) − 1) / (2β),                                       (10)

with solutions of Equations 9 and 10 achieving the reparameterization in terms of the approximate mean and variance.
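The limiting model can be checked numerically with a sketch along the same lines (again illustrative base-R code with made-up helper names). The approximate mean is written here in the form obtained by solving the deterministic rate equation dm/dt = α exp(βm) over a unit interval, m = −log(1 − αβ)/β, which is an assumption of this sketch; the approximate variance follows Equation 10.

```r
mexp <- function(M, terms = 30) {
  s <- max(0, ceiling(log2(max(1, norm(M, "1")))))
  A <- M / 2^s
  P <- term <- diag(nrow(M))
  for (j in seq_len(terms)) {
    term <- term %*% A / j
    P <- P + term
  }
  for (j in seq_len(s)) P <- P %*% P
  P
}

# Probabilities of Equation 1 for an arbitrary rate sequence.
rate_probs <- function(lam) {
  n <- length(lam) - 1
  Q <- matrix(0, n + 1, n + 1)
  for (i in 1:n) {
    Q[i, i]     <- -lam[i]
    Q[i, i + 1] <-  lam[i]
  }
  Q[n + 1, n + 1] <- -lam[n + 1]
  as.vector(c(1, rep(0, n)) %*% mexp(Q))
}

# Limiting model of Equation 8: lambda_i = alpha * exp(i * beta), beta < 0.
alpha <- 5; beta <- -0.3; n <- 40
p <- rate_probs(alpha * exp(beta * (0:n)))
i <- 0:n
m.exact <- sum(i * p)
v.exact <- sum(i^2 * p) - m.exact^2

# Deterministic-rate-equation mean and the Equation 10 variance.
m.approx <- -log(1 - alpha * beta) / beta
v.approx <- (exp(2 * beta * m.approx) - 1) / (2 * beta)
```

With β < 0 the exact variance/mean ratio is below 1, as expected for this underdispersed limiting form.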
When fitting mean and variance models of Equations 3 and 4 (or Equations 9 and 10) use of a log link function to relate mean and variance to the linear predictors means that the scale factor (variance/mean) can be modeled rather than the variance by simply including log(mean) as an offset in the linear predictor for the variance. An option to do this is included.
An alternative formulation involving only the mean model is available (Faddy and Smith 2008), where only the proportionality constant a in Equation 2 (or α in Equation 8) is dependent on the covariates x; this does not actually quantify the mean, but is based on increasing (decreasing) values of a leading to increased (decreased) rates λ_i of the extended Poisson process, and hence increased (decreased) counts on average.

Models other than EPPMs
Other distributional models exist that can handle under- or overdispersed count data, and for specific datasets these may fit better than the EPPMs. This is more likely for overdispersed data because there are many alternative models. The most commonly used of these alternatives for overdispersed data is the negative binomial. Hilbe (2011) is a comprehensive overview of the various forms and manifestations of the negative binomial, with R code included. Ridout and Besbeas (2004) describe various models for handling count data that show underdispersion with respect to the Poisson distribution, focusing on exponentially weighted Poisson distributions but also looking at others, including a less general EPPM corresponding to Equation 2 with b = 1, which they label a changing birth rate (CBR) model. Sáez-Castillo and Conde-Sánchez (2013) describe hyper-Poisson regression models for count data that can handle under- and overdispersion, with both mean and dispersion dependent on covariates.

Description of the functions
The main function of the package, also named CountsEPPM, is focused on models with two covariate dependences linked to the mean and variance. The input into the function is a formula involving a single response variable and one or two linear-predictor formulae related to the mean and variance models. Although the input formula involves a single response variable, the actual model fitting uses a list of frequency distributions, list.counts, in place of the response variable; this is either input directly or constructed from the input data, according to whether a list or a data.frame is supplied. For all models the GLM link function between response variable (mean, variance) and linear predictor of covariates is log; the log of parameter b is also used, but the parameter c of Equation 2 is untransformed. The full three-parameter version of Equation 2 has been labeled the Faddy distribution, as in Grunwald et al. (2011). Because of possible issues with the parameter b, in version 2.0 variants of the models where b is fixed have been included in the lists of models. This enables profile likelihoods to be produced for this parameter.
Nash (2014) is a recent reference on optimization using R functions and contains information on, and insights into, the methods used. All models are fitted to the data using maximum likelihood, the optimization method being either the R function optim (simplex method of Nelder and Mead 1965) or the R function nlm (a Newton method using derivatives). A facility to change options for these two functions through the argument control was introduced in version 2.0. The elements of this argument are the options for optim or nlm as described in R Core Team (2016). The default values set within CountsEPPM are fnscale = -1, trace = 0, maxit = 1000 for optim, and for nlm they are fscale = 1, print.level = 0, stepmax = 1, gradtol = 1e-8, steptol = 1e-10, iterlim = 500. Although for most datasets the two functions optim and nlm give similar results in terms of log(likelihood), parameter estimates, etc., some results may differ a little depending on particular features of the dataset. The function optim, being an implementation of the simplex method, is robust to discontinuities in the log(likelihood) surface, but is slow to converge. In contrast, the function nlm makes use of derivatives, in this case numerical derivatives, resulting in faster convergence, but it relies on the log(likelihood) surface being smooth, i.e., having no sudden changes in derivative values. Only nlm uses derivatives in the actual model fitting, but both functions have options to calculate a hessian matrix, which can be used to produce standard errors for the parameter estimates. These function options were used in version 1.0 but not in versions 2.0 and 2.1, where the calculation of the numerical derivatives and hessians uses the functions grad and hessian from the package numDeriv (Gilbert and Varadhan 2015). This replacement was made because the derivatives are more accurately calculated, resulting in better model fitting and better-conditioned hessians.
However, as Nash (2014, p. 131) states, numDeriv (Gilbert and Varadhan 2015) takes longer than alternative central difference approximations. The occurrence of NA in the vector of standard errors is an indication of problems with the model fitting, possibly caused by an inappropriate model. This is particularly so when all estimates of parameter standard errors are NA, which results when the hessian matrix cannot be inverted because its determinant is zero or it is otherwise ill-conditioned. Having derivatives available means that they can be reviewed, together with the hessian, at the final parameter estimates to evaluate whether maximum likelihood estimates have been attained.
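The two optimization routes can be sketched on a small grouped frequency distribution. This is illustrative code, not the package's: the frequencies are made up, the helper names are not from CountsEPPM, b is fixed at 1 (the CBR special case noted earlier), and the constraint c ≤ 1 is imposed through the reparameterization c = 1 − exp(θ), a choice of this sketch rather than of the package.

```r
mexp <- function(M, terms = 30) {
  s <- max(0, ceiling(log2(max(1, norm(M, "1")))))
  A <- M / 2^s
  P <- term <- diag(nrow(M))
  for (j in seq_len(terms)) {
    term <- term %*% A / j
    P <- P + term
  }
  for (j in seq_len(s)) P <- P %*% P
  P
}

eppm_probs <- function(a, b, cc, n) {
  lam <- a * (b + 0:n)^cc
  Q <- matrix(0, n + 1, n + 1)
  for (i in 1:n) {
    Q[i, i]     <- -lam[i]
    Q[i, i + 1] <-  lam[i]
  }
  Q[n + 1, n + 1] <- -lam[n + 1]
  as.vector(c(1, rep(0, n)) %*% mexp(Q))
}

counts <- 0:6
freq <- c(10, 27, 28, 18, 9, 5, 3)   # hypothetical grouped counts

# Negative log(likelihood) over unconstrained parameters (log(a), theta),
# with c = 1 - exp(theta) < 1 and b fixed at 1.
negloglik <- function(par) {
  a <- exp(par[1])
  cc <- 1 - exp(par[2])
  p <- pmax(eppm_probs(a, 1, cc, 30)[counts + 1], 1e-300)
  -sum(freq * log(p))
}

fit.optim <- optim(c(0, 0), negloglik)   # Nelder-Mead simplex (default)
fit.nlm <- nlm(negloglik, c(0, 0))       # Newton-type, numerical derivatives
```

For a smooth, well-behaved likelihood such as this one, the two optimizers should arrive at essentially the same maximum; comparing fit.optim$value with fit.nlm$minimum (and the return codes) is the kind of cross-check the text recommends.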
In versions 2.0 and 2.1 data can be either a list or a data.frame, whereas in version 1.0 it could only be a list. Also, in versions 2.0 and 2.1 the response variable in formula is a vector if a data.frame is input, or a list if a list is input. In version 1.0 it was required to use the names mean.obs and variance.obs for the response variables in formula; in versions 2.0 and 2.1 the response variables mean.obs and variance.obs are constructed within CountsEPPM prior to being used to fit models. The structure of the model fitting within CountsEPPM is unchanged from version 1.0. The R package Formula of Zeileis and Croissant (2010) is used to extract model information from the formula input to CountsEPPM.

Table 1: Arguments of CountsEPPM, with defaults where applicable.

Argument             Description                                           Default
data                 data.frame, or list with the response variable a
                     list of frequency distributions
model.type           "mean only" or "mean and variance"                    "mean and variance"
model                for model.type = "mean only" (only a in Equation 2
                     modeled): "Poisson", "negative binomial",
                     "negative binomial fixed b", "Faddy distribution"
                     (Equation 2), "Faddy distribution fixed b";
                     for model.type = "mean and variance" (both
                     modeled): "general" (Equations 3 and 4),
                     "general fixed b", "limiting" (Equations 9 and 10)   "general"
offset               one- or two-element list of offset vectors            vectors of 0's
initial              vector of initial values for the parameters, means    Poisson glm output,
                     first followed by the variances, and/or the           augmented by 0's for
                     parameters c and log(b)                               the other parameters
ltvalue              lower truncation value (excluded)                     NA
utvalue              upper truncation value (excluded)                     NA
optimization.method  "optim" or "nlm"                                      "optim"
control              list of control parameters, see text for more         defaults of "optim"
                     detail                                                or "nlm"
scale.factor.model   "yes" if the scale factor is modeled; "mean and       "no"
                     variance" models only
fixed.b              value b is fixed at                                   NA

In addition, the object returned by CountsEPPM has been assigned a class of "CountsEPPM", and the base package stats is imported, which enables use of generic functions such as AIC.
As iteration is involved in the model fitting, initial estimates of the parameters are needed. These can optionally be provided in the vector initial. Within CountsEPPM, if initial is unset, a Poisson model is fitted using glm and the estimates from that fit are used as initial estimates for the parameters of the mean linear predictor. If the variance is also being modeled, the initial estimates of the parameters of the variance linear predictor are set to the same values as those for the mean model, recognising that for the Poisson distribution the variance equals the mean. The initial value of log(b) is set to zero. The matrix exponential function used for calculating the probabilities of Equation 1 is that of the package expm of Goulet, Dutang, Maechler, Firth, Shapira, and Stadelmann (2015), which depends on the package Matrix of Bates and Maechler (2016). Details of the arguments of the main analysis function, together with defaults if any, are given in Table 1; the components of the object it returns are listed in Table 2.

Table 2: Components of the object returned by CountsEPPM.

Component                   Description
model.type                  "mean only" or "mean and variance"
model                       "Poisson", "negative binomial", "negative binomial
                            fixed b", "Faddy distribution" (Equation 2),
                            "Faddy distribution fixed b", "general"
                            (Equations 3 and 4), "general fixed b",
                            "limiting" (Equations 9 and 10)
covariates.matrix.mean      matrix of covariates for the mean
covariates.matrix.variance  matrix of covariates for the variance
offset.mean                 offset vector for the mean
offset.variance             offset vector for the variance
ltvalue                     lower truncation value (excluded)
utvalue                     upper truncation value (excluded)
scale.factor.model          "yes" if the scale factor rather than the variance
                            is modeled; "mean and variance" models only
fixed.b                     value b is fixed at
estses                      a data frame with columns (name, estimates, se)
vnmax                       a vector of maximum counts in each of the grouped
                            data vectors
loglikelihood               the loglikelihood value
mean.obs                    a vector of the mean values of the grouped data
                            vectors
variance.obs                a vector of the variance values of the grouped data
                            vectors
CountsEPPM prints out statements as to the type of data input; the optimization method used, with its return code and associated message, together with the number of function calls (optim) or iterations (nlm); and warning messages. Further details of optim and nlm, their return codes and associated messages are available in R Core Team (2016). CountsEPPM also returns an object of class "CountsEPPM" summarizing the model fit, the components of which are given in Table 2.
Two generic (S3) methods and a function are available. The summary method prints out a summary of model information, estimates of parameters with their standard errors, log(likelihood), etc., in GLM-like form. The logLik method simply extracts the log(likelihood) object from the object output by CountsEPPM, enabling Akaike's information criterion (AIC) and other log(likelihood)-based statistics to be easily used, i.e., by functions such as AIC. The function distribution.CountsEPPM produces an object consisting of the fitted means, variances and, optionally, total probabilities and/or parameters of the Faddy distributions. Details of the input arguments of distribution.CountsEPPM are given in Table 3, together with defaults. Table 4 gives details of the components of the object returned by distribution.CountsEPPM.

Table 3: Arguments of distribution.CountsEPPM, with defaults.

Argument             Description                                           Default
output.fn            output object from CountsEPPM
print.probabilities  "yes" if desired to print out probabilities           "no"
FDparameters         "yes" if desired to print out the parameters of the   "no"
                     Faddy distribution (Equation 2)
As the vectors of frequency distributions are only required to be of length one more than the maximum observed count, this is how they are set up. However, the fitted models can have probability mass at counts greater than these maximum counts. A component of the output object from CountsEPPM, output.fn$vnmax, is a vector of the maximum observed counts, and the object returned by distribution.CountsEPPM includes the total probabilities for the fitted models. If the total probabilities over these ranges indicate inadequate coverage, i.e., totals appreciably less than 1, which impacts on the exact means and variances calculated, the values in output.fn$vnmax can be increased and distribution.CountsEPPM run again to improve the calculated means, variances and coverage.
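The effect of inadequate coverage can be illustrated with the Poisson special case (c = 0), for which the probabilities are available in closed form; the cut-off values below are arbitrary.

```r
# Poisson (c = 0) probabilities with mean 5, truncated at two different
# maximum counts.
p.short <- dpois(0:7, lambda = 5)    # truncated at a maximum count of 7
p.long  <- dpois(0:30, lambda = 5)   # truncation increased

sum(p.short)   # appreciably less than 1: inadequate coverage
sum(p.long)    # effectively 1

# A mean computed from the truncated probabilities is biased downwards;
# increasing the truncation recovers the true mean of 5.
m.short <- sum(0:7 * p.short)
m.long  <- sum(0:30 * p.long)
```

This is the same mechanism behind increasing the values in output.fn$vnmax: extending the range over which probabilities are computed until their total is effectively 1.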

Examples
The examples illustrate various ways in which CountsEPPM can be used to produce informative analyses. For the first three examples data is a list where the dependent variable of counts is a list of vectors of frequency distributions, whereas for the last two data is a data frame. Significant time savings can be made by using the list form of input where applicable. The fitting of models and estimation of their parameters can be sensitive to the initial estimates and optimization method chosen, with flat likelihood surfaces possible, particularly with respect to the parameter b. It is recommended that analyses be run more than once, using different initial estimates and optimization methods.

Number of young at varying effluent concentrations data
These Ceriodaphnia dubia data were used as an example by Faddy and Smith (2011). Ceriodaphnia dubia are water fleas used to test the impact of effluents on water quality. The data, originally from Bailer and Oris (1997), are counts of young at varying effluent concentrations, and are in list of frequencies and variables form. The defaults for model.type of "mean and variance", and model of "general", are used. The code given below is for three runs using optim followed by one using nlm, with this last run being primarily to obtain the derivatives (gradients) at the final estimates. Although during these runs the estimate of log(b) changed sign, its standard error is relatively large, and the estimates of the other parameters had relatively small changes in value (maximum change of 0.4 in the intercept of log(variance)).

Lüning et al. data
These data are from Lüning, Sheridan, Ytterborn, and Gullberg (1966) and are in the form of a list of frequencies and variables. The numbers of trials are stated in Lüning et al. (1966) to be both lower and upper truncated, at 4 and 11 respectively, so the data are counts of 5 to 10. Default initial values are used for fitting the default general model of Equations 3 and 4. The scale factor model was used because it fitted better than the variance model.

Number of attempts at feeding of herons
These data are originally from Zhu, Eickhoff, and Kaiser (2003) and are in the form of a list of frequencies and variables. Faddy and Smith (2005) described an alternative modeling approach to that of Zhu et al. (2003) constructing a bivariate EPPM for both count and binary data. Here a univariate EPPM for the numbers of trials (attempts at foraging) of 20 adult and 20 immature green-backed herons is considered. The first model fitted was a negative binomial using the default initial values.

Log-likelihood: -120.2042 on 4 Df
As can be seen, the Faddy distribution fit is nearly identical to that of the negative binomial. The function distribution.CountsEPPM can be used to evaluate how close to 1 the total probabilities are for the Faddy distribution (negative binomial).

Titanic survivors
To illustrate the inclusion of offsets, data on passenger survival from the 1912 Titanic shipping disaster are used. The data are in data frame form as given in Table 9.37 of Hilbe (2011), i.e., the numbers surviving out of the number of cases (passengers) within different age, sex, and class categories. The individual data for all 1316 passengers are available from package msme (Hilbe and Robinson 2014). Hilbe (2011, p. 265-268) analyzes the numbers surviving as count data with an offset of the log of the number of cases for the mean, and fits a negative binomial with variance function v = m + αm^2, which equates to the variance function of Equation 4 with α = 1/b. Both mean and scale factor need to be offset by the log of the number of cases. A series of models were fitted, i.e., a negative binomial with b fixed at the value from Hilbe (2011) of b = 9.615385; a negative binomial; a Faddy distribution; a general mean and variance model with only an intercept for the variance; and a general mean and scale factor model with only an intercept for the scale factor. The general mean and scale factor model with only an intercept for the scale factor was found to have the largest log(likelihood); details of it and its fitting follow. The fit of the general mean and scale factor model is better than that of Hilbe (2011, p. 268), the log(likelihood) values being −39.400 and −43.719 respectively, although the former has one additional parameter.
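The role of the offset can be seen with a much simpler model: an intercept-only Poisson GLM in base R, with made-up numbers rather than the Titanic data. Including log(cases) as an offset turns the count model into a model for the per-passenger survival rate.

```r
cases    <- c(100, 200, 300)   # numbers of passengers per category
survived <- c(10, 20, 30)      # hypothetical counts: a constant 10% rate

# With an offset of log(cases), the fitted intercept is the log of the
# common survival rate, here log(0.1).
fit <- glm(survived ~ 1 + offset(log(cases)), family = poisson)
rate <- exp(as.numeric(coef(fit))[1])
```

The same idea carries over to CountsEPPM's offset argument, where, for the scale factor model used here, both the mean and the scale factor linear predictors are offset by the log of the number of cases.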

Take over bids
These data, originally from Cameron and Johansson (1997), are used as example data in Cameron and Trivedi (2013) as well as in Sáez-Castillo and Conde-Sánchez (2013). The takeover.bids.case data came from the website associated with Cameron and Trivedi (2013). The dependent variable NUMBIDS is the number of bids received by the firm targeted for takeover after the initial bid. Prior to analysis the binary variables were recast as factors and the continuous variables scaled to have zero mean and unit standard deviation; this scaling was found to improve the model fitting. The preparation starts by dropping unused variables from the data set and then taking out the count variable.
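The recasting and scaling steps can be sketched as follows, on a hypothetical miniature data frame; WHTKNGHT and SIZE stand in for the binary and continuous variables, with made-up values.

```r
dat <- data.frame(
  NUMBIDS  = c(0, 1, 2, 1, 0, 3),               # count response
  WHTKNGHT = c(0, 1, 0, 1, 1, 0),               # binary covariate
  SIZE     = c(1.2, 0.8, 3.5, 0.4, 2.1, 1.7)    # continuous covariate
)

# Recast the binary variable as a factor.
dat$WHTKNGHT <- factor(dat$WHTKNGHT)

# Scale the continuous variable to zero mean and unit standard deviation.
dat$SIZE <- as.numeric(scale(dat$SIZE))
```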
Convergence of the fitting algorithm was not achieved for this model. The variables that Sáez-Castillo and Conde-Sánchez (2013) do not include in their dispersion model, as they are not significant, are LEGLREST, WHTKNGHT, INSTHOLD, SIZE, and SIZESQ. There is reasonable agreement between the results reported above and those of Sáez-Castillo and Conde-Sánchez (2013) about the important variables in the mean model; in both, SIZE and SIZESQ have very large t statistics. However, in the dispersion model the results reported above show that SIZE and SIZESQ also have very large t statistics, whereas in the Sáez-Castillo and Conde-Sánchez (2013) model they are not included, having been found not significant.
Examination of the estimated λ_i sequences of Equation 2 shows that the λ_0 values are generally very different from the λ_i values for i ≥ 1, as a consequence of the very small estimate of the parameter b. A more appropriate model for these data might be one that treats the zero counts differently from the non-zero counts; this makes some sense, as a zero count would correspond to the initial bid being accepted by the targeted firm, and different circumstances (in the form of different covariate dependence) might then be operating. Such a model is not readily constructed from those considered here and is therefore beyond the scope of the functions developed; such modeling will be the subject of future research.

Concluding remarks
This article has described the use of the R package CountsEPPM to fit EPPMs to count data that exhibit under- or overdispersion relative to the Poisson distribution. A variety of covariate dependencies and data structures are covered in examples that illustrate the range of ways in which the package can be used in the analysis of count data.