Half-Normal Plots and Overdispersed Models in R : The hnp Package

Count and proportion data may present overdispersion, i.e., greater variability than expected under the Poisson and binomial models, respectively. Different extended generalized linear models that allow for overdispersion may be used to analyze this type of data, such as models that use a generalized variance function, random-effects models, zero-inflated models and compound distribution models. Assessing goodness-of-fit and verifying the assumptions of these models is not an easy task, and the use of half-normal plots with a simulated envelope is a possible solution for this problem. These plots are a useful indicator of goodness-of-fit that may be used with any generalized linear model and extensions. Functions that generate these plots were widely used by GLIM users; however, in the open-source software R, such functions were not yet available on the Comprehensive R Archive Network (CRAN). We describe a new R package, hnp, that may be used to generate half-normal plots with a simulated envelope for residuals from different types of models. The function hnp() can be used together with a range of different model fitting packages in R that extend the basic generalized linear model fitting in glm(), and is written so that it is relatively easy to extend it to new model classes and different diagnostics. We illustrate its use on a range of examples, including continuous and discrete responses, and show how it can be used to inform model selection and diagnose overdispersion.


Introduction
An important step in statistical modeling of any sort is to perform diagnostic analyses to assess goodness-of-fit. Several problems arise when model assumptions are not met, such as misleading estimates, standard errors and p values, and/or wrong conclusions about the process being studied. Among the reasons for a poorly fitted model, the most common are:
• measurement error in the observed and/or explanatory variables (including typos);
• an incomplete or inadequate linear predictor to describe the systematic structure of the data;
• incorrect specification of the error distribution or link function;
• unmodelled overdispersion;
• a combination of one or more of the above.
When fitting linear models under the normality assumption, goodness-of-fit can be checked using formal tests, such as the Shapiro-Wilk test for residual normality (Shapiro and Wilk 1965) or the Bartlett test for variance homogeneity (Bartlett 1937). However, these tests may fail under many circumstances, such as small sample sizes, and graphical techniques usually provide a better assessment of model goodness-of-fit. These techniques include plotting different types of residuals or influence measures (e.g., leverage and Cook's distance). Several types of residuals may be used (e.g., standardized, studentized, deletion, Pearson and deviance residuals). However, for other types of model, diagnostic checking may be problematic. Useful diagnostic-checking plots include:
• residuals vs. explanatory variables – indicates whether higher-order terms should be included in the linear predictor, or the need for transformation of the response and/or explanatory variables (for quantitative explanatory variables);
• residuals vs. explanatory variables not included in the model – a systematic relationship indicates that these explanatory variables should be included in the linear predictor;
• added-variable plot – detects the relationship of the response variable with an explanatory variable not included in the model, allowing for the effects of the other variables;
• residuals vs. fitted values – may reveal variance heterogeneity and/or outliers;
• (half-)normal plot of residuals – detects outliers and indicates whether the error distribution was specified appropriately.
Under a normally distributed error assumption, when a model fits the data well, it is expected that studentized residuals follow a t distribution on the residual degrees of freedom, which for large samples converges to the standard normal distribution. In this case, the half-normal plot of these residuals should show a straight 45° line. However, it is hard to judge whether points are sufficiently aligned or whether the inevitable irregularities are caused by something other than random fluctuation. Another difficulty posed by these plots is that the ordering of the observations may induce dependence. For generalized linear models, this plot may take several forms depending on the variance and link functions and the response variable distribution. Atkinson (1985) proposed the addition of a simulated envelope so that interpretation is more straightforward. For a well-fitted model the envelope is such that the model diagnostics are likely to fall within it. The purpose is not to provide a region for acceptance or rejection of observations, but to serve as a guide to what to expect under a well-fitted model. These plots are also useful for detecting possible outliers and overdispersion, and for checking whether the link function and/or error distribution were properly specified (Demétrio, Hinde, and Moral 2014).
When fitting generalized linear models, or different types of extended models (e.g., zero-inflated models and mixed models), half-normal plots with simulated envelopes are useful to assess goodness-of-fit, especially when analyzing overdispersed data. Demétrio and Hinde (1997) wrote GLIM4 macros (Francis, Green, and Payne 1993) to produce these plots for different overdispersion models. We developed the R (R Core Team 2017) package hnp (Moral, Hinde, and Demétrio 2017) that provides functions for generating half-normal plots with a simulated envelope for a range of generalized linear models and extensions. The scope of the hnp() function can be easily extended to include different diagnostics and models through user-specified simulation, model fitting, and diagnostic extraction code. Package hnp is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=hnp.

Generalized linear models and overdispersion
The generalized linear models (GLMs) framework of statistical modeling, formulated by Nelder and Wedderburn (1972), provides a unified theory for the application of normal, binomial, Poisson, gamma and inverse Gaussian regression models, bringing together methods for the analysis of count, proportion, continuous measurement and time-to-event data that until then were studied separately. As well as building on the standard normal regression model, this theory also generalizes the ideas of analysis of variance through the analysis of deviance.
For random variables Y_1, ..., Y_n, with probability function (pf) (Y_i discrete) or probability density function (pdf) (Y_i continuous) f(y_i; θ_i), where θ_i is associated with explanatory variables x_1, ..., x_p, statistical modeling will typically be based on a random sample of n observations (y_i; x_i), with x_i = (x_{i1}, ..., x_{ip})'. A GLM is defined in terms of three basic components. The first component, called the random component, is represented by the random variables Y_1, ..., Y_n following a distribution that belongs to the exponential family of distributions and differs only in terms of a parameter θ_i. Writing the exponential family in canonical form, the pf or pdf is given by

f(y_i; θ_i, φ) = exp{ [y_i θ_i − b(θ_i)]/φ + c(y_i, φ) },   (1)

where b(·) and c(·) are known functions, φ > 0 is a known dispersion parameter and θ_i is called the canonical parameter. The normal, Poisson, binomial, gamma, inverse Gaussian and negative binomial distributions (with suitable definitions of dispersion parameters) all belong to the exponential family and can be expressed in the canonical form (1). For this family, E(Y_i) = µ_i = b'(θ_i) and VAR(Y_i) = φ b''(θ_i) = φ V(µ_i), where V(µ_i) is the variance function. Writing X = (x_1, ..., x_n)' as the design (model) matrix and β = (β_1, ..., β_p)' as the vector of unknown parameters, the second component is a linear predictor η = Xβ, called the systematic component. Finally, the third component, the link function g(·), which is a monotonic and differentiable function, links the random and systematic components by relating the mean to the linear predictor through η_i = g(µ_i). Nelder and Wedderburn (1972) proposed the scaled deviance as an overall discrepancy quantity to measure the fit of a GLM. This is given by

S_p = D_p/φ = 2(l̂_n − l̂_p),

where l̂_p and l̂_n are the maxima of the log-likelihood functions for the current and saturated models, respectively, and the saturated model is one whose fit reproduces the observed data and typically has n parameters.
These maximized likelihoods may be written as

l̂_p = Σ_i { [y_i θ̂_i − b(θ̂_i)]/φ + c(y_i, φ) }  and  l̂_n = Σ_i { [y_i θ̃_i − b(θ̃_i)]/φ + c(y_i, φ) },

where θ̂_i and θ̃_i are the maximum likelihood estimates of the canonical parameter for the current and saturated models, respectively; hence we may also write the scaled deviance as

S_p = (2/φ) Σ_i { y_i(θ̃_i − θ̂_i) − b(θ̃_i) + b(θ̂_i) }.

Another important overall discrepancy measure of a fitted model is the generalized Pearson statistic X²_P, given by

X²_P = Σ_i (y_i − µ̂_i)² / V(µ̂_i).

For the true model and a known dispersion parameter φ, both the scaled deviance and the scaled generalized Pearson statistic X²_P/φ follow, asymptotically, a χ² distribution with n − p degrees of freedom, although Jørgensen (2002) recommends the use of the scaled X²_P statistic, as convergence to the reference distribution is achieved more rapidly as the sample size n increases. For further details on GLM theory, see McCullagh and Nelder (1989).

When φ is unknown, there is no formal goodness-of-fit test available based on S_p or scaled X²_P, and in this case X²_P is often used to estimate φ (Jørgensen 2002). To analyze count and proportion data, the Poisson and binomial models are, respectively, reasonable first choices. For these distributions the dispersion parameter is fixed and equal to 1. For a well-fitted model it is expected that the residual deviance and the generalized Pearson statistic should be approximately equal to the residual degrees of freedom, the expected value of the χ² reference distribution. When this does not happen, the model may not fit the data well (for various reasons, such as a misspecified linear predictor or link function, outliers, etc.), or, simply, the variation in the data may be larger than is predicted by these models, a phenomenon often referred to as overdispersion, see Hinde and Demétrio (1998). There are different causes of overdispersion, and failure to take it into account may result in misleading inferences. For a more thorough discussion, see Demétrio et al. (2014).
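The informal check of comparing the residual deviance and the generalized Pearson statistic with the residual degrees of freedom is easy to carry out in R. The sketch below is our own illustration on simulated data (the variable names are arbitrary), fitting a Poisson GLM to counts that were deliberately generated with extra-Poisson variation:

```r
# Simulate overdispersed counts (negative binomial) and fit a Poisson GLM
set.seed(1)
x <- rep(1:4, each = 10)
y <- rnbinom(40, mu = exp(1 + 0.2 * x), size = 2)  # extra-Poisson variation
fit <- glm(y ~ factor(x), family = poisson)

# Residual deviance and generalized Pearson statistic vs. residual df
deviance(fit)                       # residual deviance
sum(residuals(fit, "pearson")^2)    # generalized Pearson X^2
df.residual(fit)                    # both should be close to this under a good fit
```

Here both statistics come out well above the 36 residual degrees of freedom, the pattern that signals overdispersion in the discussion above.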
These basic models can be extended to incorporate overdispersion in several ways. A relatively simple extension involves the specification of a more general variance function. For example, to analyze count data, suppose that Y_i ∼ P(µ_i); the Poisson model assumes that E(Y_i) = µ_i and that VAR(Y_i) = µ_i. A simple extension is to take VAR(Y_i) = φµ_i and use a quasi-likelihood approach to estimate φ > 1, as the variance is greater than expected under the Poisson model. This specification of the variance function is called constant overdispersion. This may be generalized further by taking

VAR(Y_i) = µ_i(1 + φµ_i^δ).   (5)

For φ = 0, this is simply the variance function of the standard Poisson model; for δ = 0 it gives the constant overdispersion case; while if δ = 1 we get the same type of quadratic variance function as for the negative binomial type-II model.
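The three special cases of this extended variance function can be checked directly. The small base-R sketch below encodes it (the function name is ours, for illustration only):

```r
# Extended Poisson variance function: VAR(Y) = mu * (1 + phi * mu^delta)
var_ext <- function(mu, phi, delta) mu * (1 + phi * mu^delta)

var_ext(mu = 5, phi = 0,   delta = 1)  # 5:    standard Poisson, VAR = mu
var_ext(mu = 5, phi = 0.5, delta = 0)  # 7.5:  constant overdispersion, VAR = (1 + phi) * mu
var_ext(mu = 5, phi = 0.5, delta = 1)  # 17.5: quadratic, negative binomial type-II form
```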
A similar approach may also be adopted for proportion data. Suppose that Y_i ∼ B(m_i, π_i); the standard binomial model assumes that E(Y_i) = m_iπ_i and VAR(Y_i) = m_iπ_i(1 − π_i). If we now take the extended variance function

VAR(Y_i) = m_iπ_i(1 − π_i)[1 + φ(m_i − 1)^{δ_1}(π_i(1 − π_i))^{δ_2}],

then for φ = 0 the expression is the same as for the standard binomial model; for δ_1 = δ_2 = 0, we obtain the constant overdispersion variance function; while if δ_1 = 1 and δ_2 = 0, this is the same form of variance function as the beta-binomial model.

Another way to model overdispersion consists in adding an observation-level random effect to the linear predictor, i.e., η = Xβ + σz, where z = (z_1, z_2, ..., z_n)' is the vector of random effects, and typically we assume that Z_i ∼ N(0, 1). For count data, this leads to the Poisson-normal model. For proportion data, when the logit link is used, this results in the binomial-logit-normal model. This approach is closely related to the ideas in two-stage models, where the basic parameter of interest is assumed to be an individual-level random variable. This leads to negative binomial models for count data (by assuming the Poisson mean to have a gamma distribution) and the beta-binomial model for counted proportions (by assuming the binomial probability to have a beta distribution). Further details on overdispersion models for count and proportion data may be found in Hinde and Demétrio (1998) and Demétrio et al. (2014).
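To make the observation-level random-effect construction concrete, here is a minimal base-R simulation of Poisson-normal data (our own illustration, not code from the package; parameter values are arbitrary):

```r
# Poisson-normal: a N(0, 1) random effect, scaled by sigma, per observation
set.seed(42)
n     <- 1000
beta0 <- 1.5
sigma <- 0.5
z  <- rnorm(n)                  # observation-level random effects, Z_i ~ N(0, 1)
mu <- exp(beta0 + sigma * z)    # conditional means on the log scale
y  <- rpois(n, mu)              # conditional Poisson counts

# Marginally the random effect induces overdispersion: VAR(Y) > E(Y)
mean(y)
var(y)
```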

Half-normal plots with simulated envelopes
This relatively easy technique consists in plotting the ordered absolute values of a model diagnostic versus the expected order statistics of a half-normal distribution, which can be approximated as

Φ⁻¹((i + n − 1/8) / (2n + 1/2)),

where i is the ith order statistic, 1 ≤ i ≤ n, and n is the sample size, as in McCullagh and Nelder (1989, p. 407), following the results of Blom (1958) and Royston (1982). For a normal plot we use the ordered values versus the expected order statistics of a normal distribution, approximated as

Φ⁻¹((i − 3/8) / (n + 1/4)).

These order statistics are easily obtained using the R function qnorm, e.g., for n = 7 the expected order statistics of the half-normal distribution are

R> i <- 1:7
R> n <- 7
R> qnorm((i + n - 1/8) / (2 * n + 1/2))

The simulated envelope is obtained by:

1. fitting a model to the data;
2. extracting the model diagnostics and sorting their absolute values;
3. simulating 99 (or more) response variables using the same model matrix, error distribution and parameter estimates;
4. fitting the same model to each simulated response variable and extracting the same model diagnostics, and again sorting the absolute values;
5. computing the desired percentiles (e.g., 2.5 and 97.5) of the simulated diagnostic values at each value of the expected order statistic and using these to form the envelope.
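The envelope-building steps above can be sketched in a few lines of base R. This is a simplified illustration of the idea for a Poisson GLM on simulated data (our own minimal version, not the internals of hnp()):

```r
# Minimal half-normal plot with simulated envelope for a Poisson GLM
set.seed(123)
x   <- runif(50)
y   <- rpois(50, exp(1 + x))
fit <- glm(y ~ x, family = poisson)

n  <- length(y)
hn <- qnorm((1:n + n - 1/8) / (2 * n + 1/2))        # expected half-normal order statistics
d  <- sort(abs(residuals(fit, type = "deviance")))  # sorted absolute diagnostics

# Refit the model to 99 simulated responses, collecting sorted absolute diagnostics
sims <- replicate(99, {
  y.s <- rpois(n, fitted(fit))
  sort(abs(residuals(glm(y.s ~ x, family = poisson), type = "deviance")))
})
lower <- apply(sims, 1, quantile, 0.025)
upper <- apply(sims, 1, quantile, 0.975)

plot(hn, d, xlab = "Half-normal scores", ylab = "|Deviance residuals|")
lines(hn, lower)
lines(hn, upper)
```

For a well-fitted model, most points should lie between the two envelope lines.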

Implemented model classes
Function hnp() handles many different model classes, and more will be implemented over time. So far, a range of generalized linear models and extensions are included:
• models for continuous data – Gaussian (lm, aov and glm functions), gamma (glm function), and inverse Gaussian (glm function) models;
• models for count data –
• models for zero-inflated data – zero-inflated Poisson and negative binomial (zeroinfl function in package pscl), zero-inflated binomial (vglm function in package VGAM, gamlss function in package gamlss, glmmadmb function in package glmmADMB), and zero-inflated beta-binomial (gamlss function in package gamlss, glmmadmb function in package glmmADMB) models;
• mixed models – Gaussian (lmer function in package lme4; Bates, Mächler, Bolker, and Walker 2015; Doran, Bates, Bliese, and Dowling 2007), Poisson-normal and binomial-normal (glmer function in package lme4) models.

Simulation procedures
For most of the implemented model classes, the simulation procedures are already available in the base R packages or in the packages used to fit the models, such as the rbinom, rpois, rmultinom, rgamma and rnorm functions for binomial, Poisson, multinomial, gamma and normal models. Package mgcv's (Wood 2006) function rig was used for inverse Gaussian models; package gamlss's (Rigby and Stasinopoulos 2005; Stasinopoulos and Rigby 2007) functions rBB, rZIBI and rZIBB were used for beta-binomial, zero-inflated binomial and zero-inflated beta-binomial models fitted using gamlss; package VGAM's (Yee 2010) functions rbetabinom and rzibinom were used for beta-binomial and zero-inflated binomial models fitted using vglm; package lme4's (Bates et al. 2015; Doran et al. 2007) function simulate was used for generalized linear mixed models fitted using lmer and glmer; and package MASS's (Venables and Ripley 2002) function rnegbin was used for negative binomial models.
For the quasi-binomial model, random samples were simulated using rbinom and then multiplied by φ (summary(model)$dispersion), and the residuals were then scaled by 1/√φ. For the quasi-Poisson model, the function rnbinom was adapted with the size argument set to µ/(φ − 1). New functions were written for zero-inflated binomial and zero-inflated beta-binomial models fitted with glmmADMB (Skaug et al. 2014) and for zero-inflated and hurdle Poisson and negative binomial models fitted with pscl (Zeileis et al. 2008). The simulation procedures are all accessible in the code of package hnp.
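The rationale for the size = µ/(φ − 1) choice in the quasi-Poisson case is that a negative binomial with mean µ and size k has variance µ + µ²/k, so setting k = µ/(φ − 1) gives variance µ + (φ − 1)µ = φµ, the quasi-Poisson variance. A quick sanity check (our own, with arbitrary parameter values):

```r
# Negative binomial with size = mu / (phi - 1) reproduces VAR(Y) = phi * mu
set.seed(7)
mu  <- 10
phi <- 2.5
y   <- rnbinom(1e5, mu = mu, size = mu / (phi - 1))

mean(y)  # close to mu = 10
var(y)   # close to phi * mu = 25
```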

New class implementation
To produce the half-normal plot with a simulated envelope, three procedures are required: (i) one to extract a diagnostic measure from the fitted model, (ii) one to simulate response variables using information from the model (error distribution, model matrix and fitted parameters), and finally (iii) one to refit the same model to the simulated data. The hnp function first recognizes the model class of the fitted object. If this class is not yet implemented, it returns an error. However, users may opt to supply their own diagnostic extraction, simulation and model fitting functions so that the half-normal plot is produced; see Section 4 for a practical guide.

Examples
Package hnp provides a range of examples drawn from Demétrio et al. (2014), and two of them are discussed below for overdispersed proportion and count data. An example on orange tree embryogenesis (Tomaz, Mendes, Filho, Demétrio, Jansakul, and Rodriguez 1997) is used to discuss the implementation of a new class, and an example on leukemia recurrence times (Miller 1997) is used to discuss the implementation of survival analysis models.

Overdispersed proportion data
A major pest of stored maize in Brazil is Sitophilus zeamais. In an experiment to assess the insecticidal action of organic extracts of Annona mucosa (Annonaceae), Petri dishes containing 10 g of corn were treated with extracts prepared from different parts of the plant (seeds, leaves and branches) at a concentration of 1500 mg/kg, or just water (control), using a completely randomized design with 10 replicates. Then 20 Sitophilus zeamais adults were placed in each Petri dish and, after 60 days, the numbers of damaged and undamaged corn grains were counted, see Ribeiro et al. (2013).
We begin by fitting a standard binomial model, i.e., Y_ij ∼ B(m_ij, π_ij), to the data using the logit link with the following linear predictor:

logit(π_ij) = β_0 + e_i,

where β_0 is the intercept and e_i is the effect of the ith extract, i = 1, ..., 4. We can fit this model in R and produce a simple analysis of deviance table using glm() and anova():

R> library("hnp")
R> data("corn", package = "hnp")
R> fit1_b <- glm(cbind(y, m - y) ~ extract, family = binomial, data = corn)
R> anova(fit1_b, test = "Chisq")

The residual deviance is much larger than the number of residual degrees of freedom, indicating that the model does not fit the data well. We can easily check this by producing the half-normal plot with a simulated envelope, see Figure 1(a); it is clear that this is not a good model fit.
The three overdispersion models seem to fit the data equally well, which is not a surprise because the binomial sample sizes do not vary much (from 32 to 39) and when they are equal the beta-binomial and binomial-logit-normal variance functions both reduce to the constant overdispersion form, which may be a reasonable choice for this case, see Ribeiro et al. (2013).

Overdispersed count data
For the same experiment described in the previous section, we now focus on the number of emerged insects (progeny) after 60 days, see Ribeiro et al. (2013). We begin by fitting a standard Poisson model, i.e., Y_ij ∼ P(µ_ij), using a log link with the following linear predictor:

log(µ_ij) = β_0 + e_i,

where β_0 is the intercept and e_i is the effect of the ith extract, i = 1, ..., 4. We can fit this model in R and produce a simple analysis of deviance as before, using glm() and anova():

R> data("progeny", package = "hnp")
R> fit1_p <- glm(y ~ extract, family = poisson, data = progeny)
R> anova(fit1_p, test = "Chisq")

The residual deviance is again much larger than the number of residual degrees of freedom, indicating that the model does not fit the data well. This is confirmed by the half-normal plot with a simulated envelope, see Figure 2(a):

R> hnp(fit1_p, xlab = "Half-normal scores", ylab = "Deviance residuals",
+      main = "(a) Poisson model", pch = 4)

We now fit overdispersion models, and begin with a constant overdispersion quasi-Poisson model:

R> fit2_p <- glm(y ~ extract, family = quasipoisson, data = progeny)
R> summary(fit2_p)$dispersion

[1] 2.365385

R> hnp(fit2_p, xlab = "Half-normal scores", ylab = "Deviance residuals",
+      main = "(b) Quasi-Poisson model", pch = 4)

We observe that the estimated dispersion parameter is φ̂ = 2.37, and the half-normal plot indicates that this model fit is satisfactory, with most of the deviance residuals lying within the simulated envelope, see Figure 2(b). We now make use of package MASS to fit the negative binomial type-II model using the same linear predictor (11) and lme4 to fit the Poisson-normal model with a random effect in the linear predictor:

log(µ_ij) = β_0 + e_i + z_ij,

where Z_ij ∼ N(0, σ²).
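For readers who want to reproduce these two fits, a sketch of the natural model calls follows. Because the progeny data ship with hnp, we illustrate on simulated counts with the same structure (variable names and parameter values are ours); with the real data the formula would be y ~ extract as above. The lme4 call is shown as a comment, since that package is not part of base R:

```r
# Negative binomial type-II fit via MASS (shipped with R)
library(MASS)
set.seed(10)
extract <- factor(rep(1:4, each = 10))
y <- rnbinom(40, mu = exp(2 + 0.3 * as.numeric(extract)), size = 3)
fit3_p <- glm.nb(y ~ extract)
summary(fit3_p)$theta  # estimated negative binomial dispersion (size) parameter

# Poisson-normal via lme4 (observation-level random effect), if installed:
# library(lme4)
# obs <- factor(seq_along(y))
# fit4_p <- glmer(y ~ extract + (1 | obs), family = poisson)
```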

Negative binomial type-I model using package gamlss
We now turn to a data set on a tissue-culture experiment using the orange variety Caipira.
To study the effect of six sugars (maltose, glucose, galactose, lactose, sucrose and glycerol) on the stimulation of somatic embryos from callus cultures, the number of embryos after approximately four weeks was observed. The experiment was set up in a completely randomized block design with five blocks and the six sugars at dose levels of 18, 37, 75, 110 and 150 µM for the first five and 6, 12, 24, 36 and 50 µM for glycerol, see Tomaz et al. (1997). The main interest was in the dose-response relationship and the data shows high variability. In their analysis, Tomaz et al. (1997) used a quasi-Poisson model. An alternative is the negative binomial type-I model (Jansakul and Hinde 2004), which has the same variance function as the quasi-Poisson, that is, it is also a constant overdispersion model (see (5), with δ = 0). However, as the negative binomial type-I is a fully specified probability model (albeit not in the exponential family) it is possible to obtain standard maximum likelihood parameter estimates.
For the sugars lactose and galactose there seems to be a quadratic relationship when we look at the scatter plot, see Figure 4, which justifies the use of the following linear predictor:

log(µ_ijk) = β_0 + b_j + β_{1k} d_i + β_{2k} d_i²,

where β_0 is the intercept, b_j is the effect of the jth block, j = 1, ..., 5, d_i is the ith dose, i = 1, ..., 5, and β_{1k} and β_{2k} are the linear and quadratic dose effects for the kth sugar, k = 1, ..., 6. Package gamlss allows for a simple implementation of the negative binomial type-I model using family = NBII():

R> data("orange", package = "hnp")
R> library("gamlss")
R> fit_nbI <- gamlss(embryos ~ block + poly(dose, 2) * sugar,
+      family = NBII(), data = orange)

The hnp function does not handle 'gamlss' objects with the negative binomial type-I family.

Attempting to use it yields an error. In order to use it, we must pass three helper functions to hnp. The first function, passed to argument diagfun, must extract the desired model diagnostics. In this case, we use the z-scores (see Rigby and Stasinopoulos 2005, for further details), which are the default standardized residuals computed by the resid function on a 'gamlss' object:

R> d.fun <- function(obj) resid(obj)
The second function, passed to argument simfun, is used to simulate random samples from the same distribution and model matrix used when fitting the model to the original data:

R> s.fun <- function(n, obj) rNBII(n, obj$mu.fv, obj$sigma.fv)

A final function, passed to argument fitfun, is used to refit the model to the simulated data:

R> f.fun <- function(y.)
+    gamlss(y. ~ block + poly(dose, 2) * sugar, family = NBII(),
+    data = orange)

By setting newclass = TRUE we can now pass the three helper functions and the data set to the hnp function to produce the half-normal plot with a simulated envelope:

R> hnp(fit_nbI, newclass = TRUE, diagfun = d.fun, simfun = s.fun,
+      fitfun = f.fun, xlab = "Half-normal scores", ylab = "z-scores",
+      main = "", pch = 4, cex = 1, cex.lab = .8, cex.axis = .8)

As expected, this model also fits the data well, see Figure 3.

Exponential and Weibull models using package survival
In a clinical trial to assess the efficacy of maintenance chemotherapy for acute myelogenous leukemia (AML), after reaching a state of remission through treatment with chemotherapy, 23 patients were randomized into two groups. The first group continued receiving maintenance chemotherapy and the second did not. The observed outcome was the time (in weeks) until relapse, and it was right-censored (Miller 1997). The AML data set is available as object leukemia in package survival (Therneau 2017):

R> library("survival")
R> data("leukemia", package = "survival")

Two widely used models in survival analysis are the exponential and the Weibull models.
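Using package survival (shipped with R), both models can be obtained with survreg(). The sketch below shows the natural calls (our own, illustrative; the object names are arbitrary); the exponential model is simply the Weibull model with the scale parameter fixed at 1:

```r
library(survival)
data("leukemia", package = "survival")

# Exponential model: Weibull with the scale parameter fixed at 1
fit_exp <- survreg(Surv(time, status) ~ x, data = leukemia,
                   dist = "exponential")

# Weibull model: scale parameter estimated from the data
fit_wei <- survreg(Surv(time, status) ~ x, data = leukemia,
                   dist = "weibull")

summary(fit_exp)
summary(fit_wei)
```

Comparing the two fits (e.g., via their half-normal plots or a likelihood ratio test on the scale parameter) indicates whether the extra Weibull flexibility is needed.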

Discussion
Half-normal plots with simulation envelopes are a useful tool to assess goodness-of-fit for a range of different models, such as classical linear models, generalized linear models, models for overdispersed and zero-inflated data, survival analysis models, as well as mixed models.
The key point lies in the simulation procedures which sometimes may be problematic due to model complexity. The hnp package allows for implementation of any model for which it is possible to write diagnostic extraction, simulation and fitting codes. This may also be useful for didactic purposes and in simulation studies.
There are other packages that provide a (half-)normal plot with simulated envelope for a few model classes. For example, package ssym's function envelope produces a normal plot with simulated envelope for semi-parametric log-symmetric regression models (Vanegas and Paula 2015). Package pgam also includes an envelope function to produce normal plots with simulated envelopes for Poisson-gamma additive models fitted using the roughness penalty approach (Junger and Leon 2012). Package betareg includes a plot method (argument which = 5) that produces a half-normal plot with simulated envelope for residuals of a beta regression model (Cribari-Neto and Zeileis 2010). Also, package car provides a function, qqPlot, whose method for 'lm' objects generates a normal plot with simulated envelope for studentized residuals of a linear regression (Fox and Weisberg 2011).
In the hnp function, by default the half-normal distribution is used to plot the diagnostic quantities against their expected order statistics (the normal distribution may be used instead by setting halfnormal = FALSE). However, there is no strong reason to prefer one over the other beyond the fact that several types of residuals are known to follow a normal distribution, so we would expect them to form a straight line in the plot. Again, the aim in producing a simulation envelope is to reduce subjective bias in judging the departure of points from a straight line (when the diagnostic quantity is expected to be normally distributed) and to provide the expected shape of the plot when it is not. Depending upon model complexity, the simulation and especially the fitting procedures may be time consuming, so producing a simulation envelope with the default 99 simulations may take considerable time when a single fitting procedure is already slow. In such cases it may be wiser to use just 19 simulations (sim = 19) and form the envelope from the minimum and maximum values obtained in the simulations (conf = 1), as proposed by Atkinson (1985); there is then a chance of approximately 1 in 20 that the observed value of the diagnostic quantity is the most extreme and lies outside the envelope.
Throughout this paper we have used different types of residuals in the examples, but for generalized linear models mainly deviance residuals, which have been shown to have important asymptotic properties and to be suitable for diagnostic analyses (McCullagh and Nelder 1989).
For mixed models, the question of which type of residuals should be used for goodness-of-fit assessment is still an active area of research (Nobre and da Motta Singer 2007). However, any type of model diagnostic may be used, because the observed values are simply compared with what we might expect if the model were true, as given by the repeated sequence of "data simulation – model refitting – diagnostic extraction". It is important to point out that every time this plot is produced the envelope bands change slightly; hence it may sometimes be useful to produce several half-normal plots and observe how many points lie outside the bands, as well as their positions. Of course, not only are we not ruling out the use of other goodness-of-fit assessment techniques, but we also encourage the use of different tools to ensure that the model fits the data well, so that no misleading inference is made. The hnp package merely provides a, hopefully, useful and flexible tool to help in this assessment.