Mean and Variance Modeling of Under-Dispersed and Over-Dispersed Grouped Binary Data

This article describes the R package BinaryEPPM and its use in determining maximum likelihood estimates of the parameters of extended Poisson process models for grouped binary data. These provide a Poisson process family of ﬂexible models that can handle unlimited under-dispersion but limited over-dispersion in such data, with the binomial distribution being a special case. Within BinaryEPPM , models with the mean and variance related to covariates are constructed to match a generalized linear model formulation. Combining such under-dispersed models with standard over-dispersed models such as the beta binomial distribution provides a very general form of residual distribution for modeling grouped binary data. Use of the package is illustrated by application to several data-sets.


Introduction
Modeling using extended Poisson process models (EPPMs) was originally developed in Faddy (1997) where the construction of discrete probability distributions having very general dispersion properties was described. Smith and Faddy (2016) was concerned with generalizations of the Poisson distribution to deal with over-and under-dispersion. This article is about similar generalizations of the binomial distribution which is another special case of the modeling described in Faddy (1997). Covariate dependence can be incorporated via a re-parameterization using approximate forms of the mean and variance.
The supplementary material for Faddy and Smith (2012) contained R (R Core Team 2019) code illustrating the fitting of these models. This code has been extended and generalized to have inputs and outputs akin to those of the generalized linear model function glm from the packages stats and betareg (Cribari-Neto and Zeileis 2010; Grün, Kosmidis, and Zeileis 2012). The resulting package BinaryEPPM (Smith and Faddy 2019), whose use is described here, is available as a contributed package from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=BinaryEPPM.
There exists a number of generalized binomial models that can deal with over-dispersion relative to the binomial: for example, mixed models (Williams 1996) and correlated models (Kupper and Haseman 1978). Although the resulting probability distributions can admit some under-dispersion, where the residual variance is less than that corresponding to a binomial distribution, this may be rather too limited for them to be considered general models for under-dispersed data. BinaryEPPM complements these models by modeling under-dispersion (and limited over-dispersion). Both the mean and variance can be formulated in terms of associated covariates. Observed data can then be modeled using these generalized binomial distributions, leading to better fitting models and model checking diagnostics, and more appropriate assessment of the precision of any estimated quantities.

Extended Poisson process models (EPPMs)
The models described in Faddy (1997) can be summarized as describing probability distributions on 0, 1, 2, . . . , n in terms of the vector of probabilities p = (1 0 · · · 0) exp(Q), where Q is an (n + 1) × (n + 1) bi-diagonal matrix consisting of (Poisson process) rate parameters λ i (> 0) for i = 0, 1, . . . , n − 1 on the upper diagonal; and −λ i for i = 0, 1, . . . , n (with λ n = 0) on the diagonal. A function of linearly decreasing λ i 's λ i = a(n − i), for i = 0, 1, 2, . . . , n with a > 0 gives rise to the binomial distribution with probability p = 1 − exp(−a). If covariates, x say, influence the response then having log(a) = x T β (the usual linear predictor) in this binomial special case corresponds to generalized linear modeling with a complementary log-log link function (Dobson and Barnett 2008, Chapter 7). Other link functions (such as logistic) could be used, but the complementary log-log link function does arise quite naturally from this extended Poisson process modeling. Faddy and Smith (2008) considered a generalization of Equation 2 resulting in distributions analogous to those from correlated binomial modeling (Kupper and Haseman 1978) with concave sequences of λ i 's (0 < b < 1) corresponding to positive correlations and over-dispersion, and convex sequences (b > 1) to negative correlations and under-dispersion. Here approximations for the mean and variance of these distributions from Faddy (1997) are used to re-parameterize them in terms of the probability of a success p s in a single Bernoulli trial and scale-factor f s for the variance of the number of successes in n trials as in Equations 4 and 5.
with substantial under-dispersion possible for large b (f s → 0 as b → ∞) while over-dispersion is limited by f s < 1 1−ps (the value for b → 0). Since the complementary probability distribution of the number of failures will have (approximately) f s < 1 ps over-dispersion is effectively limited by f s < max 1 1−ps , 1 ps with this modeling. Although technically the scale-factor cannot exceed n, this is unlikely to be a practical limitation, so a simple log link can be used for covariate dependence; i.e., log(f s ) = x γ.
Given f s and p s Equation 5 can be solved for b using the R root finding function uniroot, then Equation 4 can be solved for a leading to from Equation 3. This parameterization based on approximate forms for the mean and variance results in the exact mean and scale-factor not being described perfectly by their respective link functions of the linear predictors but by some perturbations of these. However, for the examples discussed in the next section the effect of this on moment-based estimates is quite modest. The covariate coefficients β and γ describing the mean and scale-factor can be estimated by maximum likelihood from data y 1 , y 2 , . . . , y k using the likelihood p y 1 , p y 2 , . . . , p y k from the probabilities in Equation 1. Alternatively, the parameter b in Equation 6 can be estimated as a nuisance parameter if there is no interest in modeling the variance. Exact calculation of the mean and variance can also be done using the probabilities in Equation 1.

Models other than EPPMs
There are other distributional models available for over-dispersed binary data such as the correlated binomial and beta binomial distributions. These distributions differ in their interpretation with the former allowing the outcomes of successive trials to be correlated, and the latter being a mixed binomial distribution where the success probability p s is not fixed over the sequence of trials but varies according to a beta distribution. The mean and scale-factor of a simple correlated binomial with correlation ρ between the outcomes of any two trials are np s and 1 + ρ(n − 1), with probability mass function as in Kupper and Haseman (1978) The beta binomial distribution has probability mass function as in Smith (1983) (1 + rθ) , with mean µ and scale-factor 1 + θ (1+θ) (n − 1) (Hughes and Madden 1995). Both these distributions do admit some modest levels of under-dispersion; bounds on the scale-factor can be determined from those given for ρ in Kupper and Haseman (1978) for the correlated binomial, and for θ in Prentice (1986) for the beta binomial.
The EPPM generalization of the binomial distribution complements the beta binomial distribution as it allows quite general levels of under-dispersion but only modest levels of overdispersion. Therefore a distribution formed by a combination of beta binomial for f s > 1 and EPPM generalized binomial for f s ≤ 1 will allow for the full range of under-and overdispersion in observed data. With the mean and scale-factor being dependent on covariates as discussed in the previous sub-section, continuity is assured by both the EPPM generalized binomial and the beta binomial reducing to the simple binomial distribution for f s = 1. Standard likelihood methods would apply as f s = 1 is not on the boundary of the parameter spaces of either of the components forming the residual distribution.

Description of the functions
Models with two covariate dependencies linked to p s and f s are developed using Equations 1 and 3. The link function between p s and the linear predictor of covariates is either logit, probit, complementary log-log, cauchit, log, loglog, double exponential, double reciprocal, power logit, or negative complementary log. The last four of these link functions are not available in glm or betareg. References to them are Ford, Torsney, and Wu (1992), Gaudard, Karson, Linder, and Tse (1993), and Tibshirani and Ciampi (1983). Only a log link function is used for the scale-factor f s . Fitting to data is done using maximum likelihood, the optimization method used being one of two of the options available in the R function optim, i.e., the simplex method of Nelder and Mead (1967) ("Nelder-Mead"), or the "BFGS" method which uses first derivatives. The first derivatives used in the latter method, and in calculating the hessian matrix, are numerical ones obtained using the gradient function of the R package numderiv of Gilbert and Varadhan (2019).
The R package Formula of Zeileis and Croissant (2010) is used to extract model information from the formula input to BinaryEPPM. Offsets are included in the formulae specifications. To avoid repeated extractions within subsidiary functions, extraction of model information such as covariates.matrix.mean is only done once. As iteration is involved in the model fitting, initial estimates of the parameters are needed. These can be provided in the vector initial with a default, if unset, of initial estimates being produced within BinaryEPPM by fitting a binomial model using glm. The matrix exponential function used for calculating the probabilities of Equation 1 is from the package expm of Goulet, Dutang, Maechler, Firth, Shapira, and Stadelmann (2019) which depends on the package Matrix of Bates and Maechler (2019). Three pseudo R-squared are available, the first, is the square of the correlation between the observed and predicted GLM linear predictor values; the other two are commonly used in logistic regression, relevant references being Cox and Snell (1989) and Nagelkerke (1991).
The arguments of BinaryEPPM are BinaryEPPM(formula, data, subset = NULL, na.action = NULL, weights = NULL, model.type = "p and scale-factor", model.name = "generalized binomial", link = "cloglog", initial = NULL, method = "Nelder-Mead", pseudo.r.squared = "square of correlation", control = NULL) vector if data is a data.frame vector of ones a list if data is a list list of lists of ones attributes normalization, norm.to.n both NULL model.type "p only" "p and scale-factor" (only p s in Equation 4 modeled) "p and scale-factor" (p s and f s modeled) model.name "binomial" ("p only") "generalized binomial" "beta binomial" "correlated binomial" "generalized binomial" link the GLM link function for p s "cloglog" "logit" "probit" "cloglog" "cauchit" "log" "loglog" "doubexp" "doubrecip" "negcomplog" "powerlogit" attribute "power" "power" = 1 initial parameter initial values vector glm fit of binomial method "Nelder-Mead" "Nelder-Mead" "BFGS" attribute "grad.method" attribute "simple" which is "simple" or "Richardson" pseudo.r.squared "square of correlation" "square of correlation" "R squared" "max-rescaled R squared" control list of control parameters see text for more detail with details given in Table 1 together with defaults if any. The dependent variable is either a column, or columns, where data is a data.frame; or a list within data where it is a list. For the latter, the response variable list is one of frequency distributions. Several of the example data sets are available in both forms to illustrate how to deploy them. Table 2 gives details of the fitted model object of class 'BinaryEPPM' returned. It is a list similar to those of objects with classes 'glm' and 'betareg' returned by calls to glm and betareg. Table 3 gives details of a set of S3 generic extractor functions for objects of class 'BinaryEPPM'. The set is similar to that of Table 1   which work because of the information supplied by the functions of the first four blocks. Package lmtest (Zeileis and Hothorn 2002) needs to be loaded to use coeftest and lrtest. Function AIC comes from stats which is a default package loaded when R is started. In Table 2 both n and nobs are included, so that functions from both packages lmtest and stats can use the object returned. The limits on the values of θ (beta binomial) or ρ (correlated binomial) can be obtained from the S3 extractor function predict with argument type = "distribution.parameters". For given values of n and p s tables of limits can be constructed using the subsidiary function Model.BCBinProb of BinaryEPPM. The supplementary file of examples has code for calculating the table of limits for ρ as given in Kupper and Haseman (1978).

Examples
To run the examples as shown the package lmtest needs to be installed and loaded.

Data of number of rope spores in a dilution series of potato flour
These dilution series data originate from Finney (1971), where a number of samples (n = 5) at each of a series of dilutions of a suspension of potato flour were examined for rope spores. The data are given in Faddy and Smith (2008), Faddy and Smith (2012). Both forms of the data are available with data("ropespores.grouped", package = "BinaryEPPM") and data("ropespores.case", package = "BinaryEPPM") representing list and data.frame respectively. All models fitted have the (approximate) p s modeled according to the series of dilutions using a cloglog link function The preliminary analysis of these data in Faddy and Smith (2008) was based on a binomial distribution from Equation 2 with log(a) = β 0 −log(dilution), corresponding to the parameter a being proportional to the reciprocal of the dilution, and log(dilution) an offset. Here, 1 − exp(−a) is the probability of a single sample being fertile for rope spores and exp(−a) the probability of a single sample being sterile. Fitting a binomial followed by generalized binomial Equation 3 with constant b using the data.frame form of input R> data("ropespores.case", package = "BinaryEPPM") R> output.fn <-BinaryEPPM(number.spores / number.tested+ 1 + offset(logdilution), data = ropespores.case, + model.name = "binomial") R> output.fn.one <-update(output.fn, model.type = "p only", + model.name = "generalized binomial") R> summary(output.fn.one) Dependent variable a vector of numerator / denominator. The generalized binomial model with constant b is superior to the binomial with significant under-dispersion apparent according to the likelihood ratio test, although not according to the Wald test due to considerable asymmetry in the profile log-likelihood as a function of this parameter. The estimates of the other parameter β 0 are rather different due to the formulation of the generalized binomial model in terms of the approximate mean (Equation 4), but this has only a small effect on the actual means of the fitted model.
The complementary log-log link function is asymmetric about the 50% (ED50) as compared to the symmetric logit link function. To assess how a more general asymmetric link function might perform, the profile likelihood can be optimized for a power logit link function.

Frequency of sex combinations in litters of pigs
The title of Brooks, James, and Gray (1991) suggests that the data they consider show underdispersion relative to the binomial distribution. Of the three data sets mentioned, only those for the Yorkshire breed will be used here. The fitting of a binomial distribution to these data with litter size treated as a factor with 9 levels suggests that such a model might be a satisfactory fit.
R> print(data.frame(size = Yorkshires.litters$vsize, + mean = predict(output.fn.two, type = "mean"), + variance = predict(output.fn.two, type = "variance"), + p = predict(output.fn.two, type = "p"), + scale.factor = predict(output.fn.two, type = "scale.factor"), + lower = predict(output.fn.two,  The calculation of these exact summary statistics is done using the probabilities in Equation 1 for the generalized binomial, rather than the approximate formulae in Equations 4 and 5. The following code uses predict to compare these approximate forms with the above predicted values of p s and f s . Figure 1 shows plots of these where the lines represent the approximate (linear) values and the symbols the exact (non linear) values.

Chromosome aberrations
The two data sets are of chromosome aberrations amongst survivors of the atomic bombs exploded over Japan in 1945. The response variable is the number of cells that show chromosome aberrations out of one hundred tested. Although nominally the same data there are differences between the two data sets. The Prentice (1986) set Hiroshima.grouped consists of four frequency distributions, i.e., one for a zero dose and three others where the doses are of ranges, and it is assumed that every survivor has had one hundred cells tested. The Morel and Neerchal (2012) set Hiroshima.case is for individual survivors and not all survivors had one hundred cells tested. The doses of Hiroshima.grouped have been transformed to a standard normal gz to match those of Hiroshima.case which are represented by z, with zz and gzz representing dose 2 . Morel and Neerchal (2012, Section 5.4) fit a beta binomial model similar to that of "p and scale-factor" but to p s and the over-dispersion parameter θ of the beta binomial, both having a logit link function. The following sequence of commands replicates this model fit, but to a "p and scale-factor" model with log link for the scale-factor, using Hiroshima.grouped to provide initial estimates for fitting the model to Hiroshima.case.
Using an alternative generalized binomial model with complementary log-log link and log link functions for p s and f s respectively, resulted in a log-likelihood of −1755.506 for the Hiroshima.case data set, showing a much poorer fit than the beta binomial which would generally be preferred for over-dispersed data.

Food stamps
These data on food stamps are used as an example in Künsch, Stefanski, and Carroll (1989) available from package robustbase Todorov and Filzmoser 2009). Here they are used to illustrate how to use weights. The methodology used in BinaryEPPM is maximum weighted likelihood estimation, which is associated with robust estimation. The weights used come from use of glmrob from robustbase although their use here does not reproduce the analysis of glmrob. It reproduces the analysis of glm using the same weights. The weights are those of Example 5.2 of Künsch et al. (1989).
R> output.fn <-BinaryEPPM(participation / n~tenancy + suppl.income + + income, data = foodstamp.case, weights = foodstamp.case$weights1, + model.type = "p only", model.name = "binomial", link = "logit", + pseudo.r.squared.type = "max-rescaled R squared") R> summary(output.fn) Dependent variable a vector of numerator / denominator. Call: BinaryEPPM(formula = participation/n~tenancy + suppl.income + income, data = foodstamp.case, weights = foodstamp.case$weights1, model.type = "p only", model.name = "binomial", link = "logit", pseudo.r.squared.type = "max-rescaled R squared") These data are also available in frequency distribution form with the dependent variable now l.participation and the weights variable l.weights1 defined as list. Use of a list for data expects there to be a list within the list data which is named the dependent variable in Formula. The two normalization attributes of weights which is also expected to be a list, have been set to illustrate how normalization can be done. The code below is for analyzing data where the dependent variable and the weights are list. The output is essentially the same as that above and hence not shown.

Other data sets
Testing of BinaryEPPM used other data sets which have been included in BinaryEPPM but not reported here. Code for use with these other data sets is available in a supplementary file. These data sets are from Kupper and Haseman (1978), Williams (1996), Hilbe (2011) (Titanic data of Table 9.37), and Prater (1956) (gasoline yield). The last of these has gasoline yield as a binomial variable with n = 1000. The code reproduces the analyses of Cribari-Neto and Zeileis (2010) on these data, where the response is (gasoline yield)/n and is treated as a continuous beta distributed variable between 0 and 1. The analyses as a beta binomial variable were done to compare how closely the two analyses agree. Models with an n of such size stress BinaryEPPM due to the sizes of the matrices involved, and the time taken to run them, so they are not recommended. However, the close agreement of the results does illustrate the similarity of a beta binomial analysis of a discrete variable with a beta distribution analysis of the analogous continuous variable.

Concluding remarks
This article has described the use of package BinaryEPPM to fit EPPMs and other distributional models to grouped binary data exhibiting under-and/or over-dispersion relative to the binomial distribution. A variety of covariate dependencies and data structures are covered in examples that provide illustrations of the ways in which BinaryEPPM can be used in the analysis of grouped binary data. It complements the similar modeling in Smith and Faddy (2016) of count data using EPPMs. Package CountsEPPM (Smith and Faddy 2016) is available on the Comprehensive R Archive Network (CRAN) as a contributed package at https://CRAN.R-project.org/package=CountsEPPM.