simplexreg: an R package for regression analysis of proportional data using the simplex distribution

Outcomes of continuous proportions arise in many applied areas. Such data are typi-cally measured as percentages, rates or proportions conﬁned in the unitary interval. In this paper, the R package simplexreg which provides dispersion model ﬁtting of the simplex distribution is introduced to model such proportional outcomes. The maximum likelihood method and generalized estimating equations techniques are available for parameter estimation in cross-sectional and longitudinal studies, respectively. This paper presents methods and algorithms implemented in the package, including parameter estimation, model checking as well as density, cumulative distribution, quantile and random number generating functions of the simplex distribution. The package is applied to real data sets for illustration. viable CD34+ cells assessed by post-cryopreservation recovery rates of the number of post-thaw viable CD34+ cells and that of pre-freeze viable CD34+ cells. It scientiﬁc investigate the mechanism by which the post-cryopreservation recovery rates are infected. A study enrolled 242 patients who consented to autologous PBSC transplant after myeloablative doses of chemotherapy between the years 2003 and 2008 at the Edmonton Hematopoietic Stem Cell Lab in the Cross Cancer Institute – Alberta Health Services. Age, gender and clin-ical characteristics, such as cancer type and chemotherapy type, were recorded. The patients are 18 to 71 years old and most of them are male (170), diagnosed with multiple myeloma, non-Hodgkin’s lymphoma, acute leukaemia, solid tumors, amyloidosis or others. Patients received primary chemotherapy, with 1 day protocol, 3 day protocol, G-CSF only or other types, for mobilizing CD34+ cells. The PBSC collection was initiated when the circulating CD34+ count in the peripheral blood reached or exceeded 15 cells/ µ L. PBSC products were cryopreserved and stored in a liquid nitrogen vapor until reinfusion. PBSC samples were assessed on the day of collection (pre-freeze) and post cryopreservation (post-thaw) for absolute viable CD34+ cells. Post-cryopreservation viability was calculated as the percentage of the absolute number of viable cells or colonies over the number of pre-freeze cells.


Introduction
The theory of generalized linear models (GLMs, McCullagh and Nelder 1989) attests that regression analysis requires an appropriate recognition about the type of response variable. While the normal distribution is popular in practice, Jørgensen (1997) pointed out that the normal distribution is an exception, rather than the rule, except for data with small dispersions. Fisher (1953) reminded us of the importance of describing data in their natural habitat. Analysis of non-normal data should therefore take into account the actual type of data if such knowledge is available. Nelder and Wedderburn (1972) were the first to show, by introducing the class of GLMs, that a large variety of non-normal data may be analyzed by a united technique. The GLMs were originally developed for exponential families of distributions, but the main ideas were later extended to a wider class of models, the so called dispersion models (Jørgensen 1997). The major contribution behind dispersion models is that the notions of location and scale may be generalized to position and dispersion, respectively, so that a comprehensive range of non-normal distributions is covered and more data types such as positive data, positive data with zero, count data, binomial data and directional data can be dealt with by dispersion models in their natural habitat. An important subclass of dispersion models are the exponential dispersion models, which are the now familiar class of generalized linear models. In the meanwhile, the residual sum of squares from the analysis of variance is generalized to the notion of deviance, making the analysis of deviance available as a general inference tool in model fitting and model selection.
In applied fields, outcomes of proportions often arise. Few models are suitable for fitting such data. The beta distribution is often used in Bayesian statistics as the conjugate prior distribution for binomial proportions. With suitable parameter transformations of the beta distribution, Ferrari and Cribari-Neto (2004) proposed a beta regression model for rates or proportions and implemented the related statistical estimation and inference methods in the R (R Core Team 2016) package betareg (Cribari-Neto and Zeileis 2010). Another R package gamlss (Rigby and Stasinopoulos 2005;Stasinopoulos and Rigby 2007) provides semi-parametric regression type models for proportional data. However, with an additional dispersion parameter, the simplex distribution (Barndorff-Nielsen and Jørgensen 1991) based GLM is more robust for analysis of continuous proportional data (Zhang and Qiu 2014). The simplex distribution includes a large class of distributions whose domains are confined in (0, 1). As shown by Jørgensen (1997), such a distribution is actually a dispersion model and shares many common analytic properties with the exponential dispersion models. Therefore, the GLM for continuous proportional data can be developed on the lines of the classical GLMs (Song and Tan 2000). In the literature, the simplex marginal model (Song and Tan 2000;Song, Qiu, and Tan 2004), and the simplex mixed-effects model (Qiu, Song, and Tan 2008) have been extensively studied. In this paper, we will briefly present some properties of the simplex distribution, which provides a foundation for the inference and computation in the package simplexreg (Zhang, Qiu, and Shi 2016). The package is implemented in the R system and available from the Comprehensive R Archive Network (CRAN) at https: //CRAN.R-project.org/package=simplexreg. There are three major lines of functions: 1. Calculation of the density, cumulative distribution, quantile and random number generating functions for the simplex distribution.
2. Statistical inference in the simplex generalized linear model (SGLM, Zhang and Qiu 2014) via maximum likelihood (ML) for cross-sectional proportional data set.
3. Analysis of longitudinal proportional data set via an extended version of generalized estimating equations (GEE) using the simplex distribution.
The paper is organized as follows. Section 2 presents the methods to calculate density, distribution and quantile functions as well as the function generating random variables of the simplex distribution. Section 3 illustrates the methodology of modeling proportional data using simplex generalized regression. The generalized estimating equations for longitudinal proportional outcomes are given in Section 4. Then we address model diagnostics in Section 5. Section 6 presents the details of the simplexreg package. Section 7 further conducts analyses based on the simplex distribution in R with real data sets. Finally, plans for extending the package are described in Section 8.
With mean µ ∈ (0, 1) and dispersion parameter σ 2 > 0, the simplex distribution has a density function taking a similar expression to a normal density, where the unit deviance function is Then a random variable Y which follows a simplex distribution with mean µ and dispersion parameter σ 2 is denoted by Y ∼ S − (µ, σ 2 ). Jørgensen (1997) gave the variance of Y, τ 2 , as Another important property about the density of the simplex distribution worth to mention is that when the dispersion parameter σ 2 → 0, the small-dispersion asymptotic theory (Jørgensen 1997) It has been proved that the simplex distribution has a uni-mode if σ ≤ 4/ √ 3; otherwise, it yields multi-modes.
Given values of parameters, µ and σ, the calculation of the simplex density function is straightforward (see (1)). As for the distribution and quantile functions, we calculate the normalized cumulative distribution and quantile functions instead of simplifying the computation if the dispersion parameter is small. When the dispersion parameter is large, however, the distribution function is calculated through a numerical integration while the quantile function is obtained by solving nonlinear equations.
The interest in developing algorithms to generate random variables from the simplex distribution lies in many aspects, one of which is the need in simulation studies. Also, the effective algorithm to generate random variables from the simplex distribution is the base of Markov chain Monte Carlo methods in Bayesian inference. Simplex random variable generation can be developed based on a certain transformation which is motivated by the fact that the simplex distribution is transformed from the inverse Gaussian mixture distribution (Qiu 2001).
The inverse Gaussian mixture distribution (Jørgensen 1991), denoted by X ∼ M -IG(ξ, ǫ 2 , p), is the mixture of the inverse Gaussian distribution (with probability 1 − p) and its complementary reciprocal (with probability p), with probability density: , noting that the Jacobian is (1 − y) −2 in this transformation. Therefore, to generate random variables from the simplex distribution S − (µ, σ 2 ), we can first produce random variables from the inverse Gaussian mixture distribution, M -IG(ξ, ǫ 2 , p). Jørgensen (1991) presented the method to generate the inverse Gaussian mixture random variables from the inverse Gaussian distribution, denoted by IG(ξ, ǫ 2 ), and χ 2 1 distribution. To generate inverse Gaussian random variables, we adopt the method proposed by Michael, Schucany, and Hass (1976), which is based on the property that the kernel of the inverse Gaussian density has a χ 2 -distribution. The proposed transformation method for a simplex random number generation is built only upon the χ 2 -generator and uniform-generator, with the process listed as follows: 1. Set p = µ, ξ = µ 1−µ and ǫ 2 = σ 2 (1 − µ) 2 .
• Let X equals to X 1 + X 2 with probability p and X 1 with probability 1 − p.
The following properties of the simplex distribution will be used in the inference and computation of simplex models: The proof can be found in Song and Tan (2000), Qiu (2001), Song et al. (2004) and Zhang and Qiu (2014).

Simplex generalized linear models
Consider cross-sectional proportional responses The goal is to model the means of the responses and the dispersion as functions of these covariates. Assuming where the function g : (0, 1) → (−∞, ∞) is the link function, and β is the vector of regression parameters of interest, and the dispersion σ 2 i in the simplex heterogeneous model may be modeled with some covariates: where the function h : (0, ∞) → (−∞, ∞) is the link function, and γ is the vector of regression coefficients associated with the dispersion σ 2 i . The homogeneous simplex model is obtained by removing the varying dispersion effect in (5). This is an extension of the generalized linear models mentioned in Jørgensen (1997), noting that (i) given (ii) the parameters µ i may vary from subject to subject; (iii) the link function can be any monotonic and differentiable function. However, the simplex density function is not a member of the exponential family with the general form and the variance of Y i does not vary with the x i 's through µ i alone, which, in fact, depends on the dispersion parameter.
To estimate the parameters β and γ for the SGLM with varying dispersion, we follow the classical theory of GLMs and use the iteratively re-weighted least squares algorithm for the maximum likelihood estimation of β and γ. Clearly, the log-likelihood takes the form subject to a constant term.
Define the surrogate or "working" vector s = (s 1 , s 2 , . . . , s m ) ⊤ as follows: The ith component is given by where u(y i , µ i ) is the score function defined as and weights are given by The expression in (7) of the surrogate response is a natural extension to dispersion models. For exponential family models, this surrogate response reduces to a common form seen in the classical GLMs. By the deviations of (6), we may estimate β and γ by the standard iteratively re-weighted least squares method that iteratively solves the score equations of β and γ through updating − 1 2 , with the logarithmic link function h. In addition, Proposition 1(i) leads to a moment method, an alternative approach we employed to estimate the constant dispersion parameter σ 2 in the simplex homogeneous model. The moment estimator coincides with its MLE. By bias correction, the estimation of σ 2 has the closed form: in the invariant dispersion simplex GLMs. Details of the discussion on estimation and modeling of dispersion parameters can be found in Zhang and Qiu (2014).

Simplex marginal models for longitudinal data analysis
Let y ij , j = 1, . . . , n i be the sequence of observed repeated measurements on the ith of m subjects, and t ij , j = 1, . . . , n i , be the sequence of corresponding times on which the measurements are taken on each subject, and x ijk , k = 1, . . . , p, be p explanatory variables where x ij1 may be set to 1 corresponding to an intercept. We assume that y ij are realizations of are the mean parameters and σ 2 ij > 0 are the dispersion parameters, and both may be specified as functions of covariates.
A heterogeneous marginal simplex model consists of three components. The first component is a model to describe the population-averaged effects, where the mean parameter µ ij depends on the time-varying covariates x ij via a generalized linear model of the form where g is a known link function and β = (β 0 , . . . , β p−1 ) ⊤ are the regression coefficients to be estimated. The second component is a model to describe the pattern of dispersion parameters σ 2 ij as a function of covariates z ij (maybe a subset of x ij , which can be omitted in the homogeneous dispersion model), given by where h is a known link function and γ = (γ 0 , . . . , γ r−1 ) ⊤ with γ 0 corresponding to the intercept term. The third component is for modeling the correlation structure. The correlation between Y ij and Y ik is a function of the location parameters and perhaps of additional where ρ(·) is a known function. Various types of correlation structures may be used for the ρ function. Amongst others, three commonly used in the analysis of longitudinal data are the exchangeable, auto-regressive (AR) and m-dependent correlations.
Denote the mean vector of subject i by µ i = (µ i1 , . . . , µ in i ) ⊤ . Let the score vector for subject i be u i = (u i1 , . . . , u in i ) ⊤ , with u ij evaluated by (8), s i = diag{µ 3 ij (1 − µ ij ) 3 }u i be the working vector, and R(α) be an n i × n i working correlation matrix with a q × 1 vector of correlation parameters α. The working covariance matrix for s i is where VAR(u ij ) in VAR(s ij ) is calculated by Proposition 1(vii). The GEE1 (Liang and Zeger 1986) for the simplex margin corresponding to the estimating equation for β is given by where The estimating equation for the dispersion component is given as follows, where d i = (d(y i1 ; µ i1 ), . . . , d(y in i ; µ in i )) ⊤ , Σ i is a working covariance matrix, and Following Prentice and Zhao (1991), the additional set of estimating equations for the correlation parameters based on the standardized score residuals, is defined by It is easy to see that such score residuals satisfy moment properties of E(r ij ) = 0, VAR(r ij ) = 1 and E(r ij r ij ′ ) = COR(u ij , u ij ′ ) = COR(s ij , s ij ′ ). The estimating equation for the correlation parameter α then takes the form where Details of the sensitivity and variability matrices for the GEEs are referred in Song and Tan (2000) and Song et al. (2004). Using the Newton-scoring algorithm, the solution of the joint Equations 14, 15 and 17 can be obtained numerically by iteratively updating the values of the parameters.

Model diagnostics
Fitting data with a certain model means choosing appropriate forms for the predictor, the link function and the distribution function. In general, Pearson's χ 2 and the deviance perform the important roles as general goodness-of-fit statistics.
The Pearson residual takes the form whereτ i has no analytical form expression as it involves the incomplete gamma function in (2). For over-dispersed data, the dispersion parameter σ 2 is large and thus the variance of the response approaches to µ(1 − µ). This leads to an approximate Pearson residual: Replacing parameters by their corresponding estimates, the simplex distribution assumption can be checked by the plot ofr i P orr i a againstμ i , which aims to examine the mean-variance relation.
An informal check for the link function assumption could be done by McCullagh and Nelder (1989)'s plot of the adjusted dependent variable a i against the linear predictorη i . In our setting, define where u(y i ; µ i ) and V (µ i ) = µ 3 i (1 − µ i ) 3 are the score function (8) and variance function of the simplex model, respectively. It follows from Proposition 1 that E(a i ) = g(µ i ) since E(u i ) = 0, and VAR(a i ) = E {a i − g(µ i )} 2 = 1.
In longitudinal analysis, the scatter plots of the standardized score residuals defined by (16) at different lags can informally be used to check the assumption of the working correlation structure.
As an extension of GLMs, the measure of the discrepancy or the goodness-of-fit of the simplex GLM can be formed by deviance. Noting that for the unit deviance function, d(y; y) = d(µ; µ) = 0, setting the perfect fitμ i = y i gives the log-likelihood of the saturated model. Hence, by (6), the deviance function is which follows a χ 2 m−p . It becomes difficult in determining the degree m − p of the χ 2 distribution for the goodnessof-fit test in the presence of within-subject dependence for longitudinal data. Qiu (2001) proposed the partial deviance where T denotes a collection of all distinct times on which observations are made. Cross-sectionally, y ij 's are independent and hence D p j follows approximately χ 2 m j −p , with m j being the total number of y ij 's observed cross-sectionally at time t j . A series of the goodness-of-fit χ 2 tests can be performed along these time occasions. Both observed partial deviance D p j statistics and the corresponding critical values determined by χ 2 m j −p can be depicted and compared at each time point. The plot displays a detailed scenario of testing for the goodness-of-fit over the spectrum of time, and hence is more informative than an overall test based on a single summary statistics.

The simplexreg package
The simplexreg package carries out generalized linear model regression as well as generalized estimation equations based on the simplex distribution and provides related functions of the simplex distribution. Some routines, including updating parameters for both the homogeneous and heterogeneous simplex marginal models via the Newton-Raphson method, are written in C with the GNU Scientific Library (GSL; Galassi et al. 2009) to get support for vector and matrix operation tasks and facilitate the computation. All the C programs are written in double precision.

Functions of the simplex distribution
In the simplexreg package, the function dsimplex gives the density function, psimplex provides the distribution function, qsimplex calculates the quantile function and rsimplex gives random numbers generated from the simplex distribution. They thus possess the same forms as other distribution functions in R: dsimplex(x, mu, sig) psimplex(q, mu, sig) qsimplex(p, mu, sig) rsimplex(n, mu, sig) where x and q are vectors of quantiles, p is the vector of probability, n is the user-specified number of samples, arguments mu and sig are the mean parameter µ and the square root from left plots to right plots; σ 2 = (4 2 , 2 2 , 1) from top plots to bottom plots. Solid lines are corresponding to simplex densities. of the dispersion parameter σ 2 of the simplex distribution, respectively. Numerical integration is used in calculating distribution and quantile functions. However, to speed up, the corresponding functions of the normal distributions are computed when the approximation according to the small dispersion asymptotic theory is close enough, namely, |error| < 10 −6 .
To illustrate rsimplex and dsimplex, histograms of random numbers obtained from the simplex random number generator are compared to the densities of the corresponding simplex distributions (Figure 1).

Regression analysis of the simplex model
The function simplexreg fits proportional data with simplex regression models. The arguments are similar to packages implementing regression models in R: simplexreg(formula, data, subset, na.action, link = c("logit", "probit", "cloglog", "neglog"), corr = "Ind", id = NULL, control = simplexreg.control(...), model = TRUE, y = TRUE, x = FALSE, ...) The regression model and data can be specified via formula and data. Methods, such as for the generic functions print, summary and coef, are available for the returned S3 class object 'simplexreg'. The function simplexreg.fit gives an alternative approach to specify the model: simplexreg.fit(y, x, z = NULL, t = NULL, link = "logit", corr = "Ind", id = NULL, control = simplexreg.control()) The corr argument specifies the correlation structure in the marginal model, providing three options: independent, exchangeable and auto-regressive of order 1, given by "Ind", "Exc" and "AR1" respectively. The default value is "Ind", reducing the marginal model to a simplex GLM. To specify both mean and dispersion equations (see (4) and (5)) in the simplex model, we employ the form y~x1 + x2 | z1 + z2 (Zeileis and Croissant 2010), where y~x1 + x2 specifies the mean model and covariates z1 and z2 are associated with the dispersion parameter σ 2 . Without the latter part, a homogeneous dispersion model will be fitted.
To obtain initial values for regression coefficients β, the package fits the data with a linear model for logit transformed responses. The initial values for γ are from a log-linear model treating d(y i ; µ i ) as the response which has a Gamma distribution.
For longitudinal proportional data, the marginal simplex models consist of three components, the population-average effects, the pattern of dispersion and the correlation structure. A formula of homogeneous dispersion takes the form y~x1 + x2 | 1 | t and a formula in the form y~x1 + x2 | z1 + z2 | t is used for heterogeneous simplex marginal models (corr = "Exc" or "AR1"). The parameter t in the formula as well as in the function simplexreg.fit corresponds to the time covariate. A factor identifying clusters of the observations should also be specified by the argument id. And variables y, x, z, t are required to be sorted in accordance with the clusters.
The function provides four options for the link function of the mean, i.e., the function g in (4) and (11): link = "logit", "probit", "cloglog", "neglog", corresponding to the logit, probit, complementary log-log and negative-log functions, respectively. However, when it comes to the link of dispersion, only the logarithmic function is supported. And function simplexreg.control controls the fitting process of simplex models.
For the returned object of class 'simplexreg', the summary method lists a standard output, including Wald statistics as well as the p values for the regression coefficients. And based on the fitted values, a χ 2 test is performed for the simplex GLM model and the result is also reported. Argument type in the summary function specifies the type of residuals included in the output. The coef and vcov functions extract coefficients and their covariance matrix, respectively. Akaike's information criterion (AIC) defined as AIC = 2ℓ(β, γ)−2p and Bayesian information criterion (BIC) BIC = 2ℓ(β, γ) − p log n where p is the number of parameters, are calculated via the AIC and BIC methods. For simplex marginal models, these functions are not supported are not supported since the model is non-likelihood-based. Function predict provides predicted values of the mean or the dispersion for the responses or new observations. The plot method draws graphs for visually examining the correlation structure and the model assumption. Its arguments include: plot(x, type = c("residuals", "corr", "GOF"), res = "adjvar", lag = 1, ...) where x is the S3 class object returned from simplexreg fitting, type specifies the types of graphs. Residuals analysis is given by type = "residuals" with one of the four types chosen for res: stdPerr the exact standard Pearson residual r P i given in (18), appstdPerr the approximated Pearson residual r a i given in (19), stdscor the standardized score residuals r ij detailed in (16), and the adjvar adjusted dependent variable a i in (20). The first three can be plotted against the mean µ to examine the mean-variance relation as well as detect model assumption violation. The plot of adjusted dependent variable against the linear predictor η (see (4) and (11)) can be used to check the link function. All these residuals could be obtained the residuals method. Model diagnosis and residual analysis using this function is further demonstrated with examples in Section 7.
When type = "corr", a graph is drawn to explore the correlation structure. Standardized score residuals are used to examine the auto-correlation at lag. Partial deviances against the time covariates are plotted when type = "GOF" (leveraging the plotrix package, Lemon 2006).

Data examples
Two examples are used to illustrate the capacities of simplexreg. The first models the recovery rate of CD34+ cells after peripheral blood stem cell (PBSC) transplants and the second is the longitudinal study of decay of intraocular gas (C 3 F 8 ) presented in Song and Tan (2000), Song et al. (2004) and Qiu et al. (2008). These two proportional data sets are modeled via the simplex GLM technique and the GEE method, respectively. These analyses are done using R version 2.15.3.

PBSC study
Autologous peripheral blood stem cell (PBSC) transplants have been widely used for rapid hematologic recovery following myeloablative therapy for various malignant hematological disorders. Hematopoietic reconstitution largely depends on the reinfusion of sufficient numbers of stem cells to engraft in the bone marrow micro-environment, as indicated in Allan et al. (2002). The dose of viable CD34+ cells is considered an important marker of adequacy of PBSC harvest, as well as a predictor of hematopoietic engraftment. Studies have shown that the process of freezing, cryopreservation and thawing prior to reinfusion could inevitably damage PBSCs and remarkably decrease the number of viable CD34+ cells, as demonstrated by Yang, Acker, Cabuhat, Letcher, Larratt, and McGann (2005). The loss of viable CD34+ cells is usually assessed by post-cryopreservation recovery rates of the number of post-thaw viable CD34+ cells and that of pre-freeze viable CD34+ cells. It is of scientific interest to investigate the mechanism by which the post-cryopreservation recovery rates are infected.
A study enrolled 242 patients who consented to autologous PBSC transplant after myeloablative doses of chemotherapy between the years 2003 and 2008 at the Edmonton Hematopoietic Stem Cell Lab in the Cross Cancer Institute -Alberta Health Services. Age, gender and clinical characteristics, such as cancer type and chemotherapy type, were recorded. The patients are 18 to 71 years old and most of them are male (170), diagnosed with multiple myeloma, non-Hodgkin's lymphoma, acute leukaemia, solid tumors, amyloidosis or others. Patients received primary chemotherapy, with 1 day protocol, 3 day protocol, G-CSF only or other types, for mobilizing CD34+ cells. The PBSC collection was initiated when the circulating CD34+ count in the peripheral blood reached or exceeded 15 cells/µL. PBSC products were cryopreserved and stored in a liquid nitrogen vapor until reinfusion. PBSC samples were assessed on the day of collection (pre-freeze) and post cryopreservation (post-thaw) for absolute viable CD34+ cells. Post-cryopreservation viability was calculated as the percentage of the absolute number of viable cells or colonies over the number of pre-freeze cells.
The data set object sdac is included in the simplexreg package with the first five rows of the data frame shown as follows, R> library("simplexreg") R> data("sdac", package = "simplexreg") R> head(sdac, n = 5) where rcd denotes the recovery rate of CD34+ cells. We factorize the treatment with a dummy variable chemo indicating if a patient receives a chemotherapy on a one-day protocol (0) or on a 3-day protocol (1). For simplicity, two chemo categories, G-CSF and other, are combined into either 1 day protocol or 3 day protocol, by number of days they took. To reflect the age structure in the patient population, we adjusted the age by setting age< 40 as the baseline age and subtracting other ages by 40, ending up in a new covariate ageadj.
The range of CD34+ cells recovery rates is 40%-100%. The two extreme values, 100%, are replaced by 99% in the regression analysis to avoid the boundary issue for proportional data.
The data are fitted with simplex regression models using both the homogeneous and heterogeneous structures of dispersion. The mean response model is given by logit(µ i ) = β 0 + β 1 ageadj + β 2 chemo.
The dispersion parameter in the heterogeneous simplex model is regressed on age, with log(σ 2 i ) = γ 0 + γ 1 age.
The summary output of both models is given below: The p values of the χ 2 tests for both models are 0.488 and 0.433, respectively, implying no lack-of-fit. The chemotherapy type, adjusted by ageadj, is shown to be significant in the model. In addition, the result also indicates that the dispersions in the post-cryopreservation recovery rates are associated with patients' ages. Comparing the mean coefficients of the two models, it is clear that the dispersion assumption does not have a significant impact on the coefficients in the mean model.
For the purpose of model selection, information criteria are returned by the function AIC.
R> AIC(sim.glm1, sim.glm2) df AIC sim.glm1 4 -202.5467 sim.glm2 5 -189.0300 The AIC values show that the heterogeneous model has a better fit than the homogeneous one. Apart from the AIC criterion, the significance of the age coefficient in the dispersion model also indicates that the homogeneous dispersion assumption may be violated.
The graph of the adjusted dependent variableâ ij against the linear predictorη ij is shown in the left panel of Figure 4. Overall 97% points fall into the 95% confidence band and those points show an increasing linear trend. This implies that the logit link function is a reasonable choice for the data.

Conclusion
In this paper, we describe the capabilities of the simplexreg package for conducting the simplex regression analysis in R. Statistical inferences and residual analyses of the simplex regression models via maximum likelihood and generalized estimation equations methods in R were presented, together with properties of the simplex distributions.
The density, cumulative distribution, quantile and the random generating functions for the simplex distribution were implemented in the package. The simplex random number generator is shown efficient and accurate, providing a powerful tool for a simulation based inference approach, such as the Markov chain Monte Carlo method, to fit hierarchical simplex generalized linear models for multilevel proportional data using a Bayesia approach.
The multi-dimensional simplex model is an important extension to the simplex GLM discussed in this paper. Jørgensen and Lauritzen (2000) proposed a class of multivariate dispersion models for multivariate non-normal responses. When over-dispersion appears in the compositional data, the multivariate simplex distribution (Jørgensen and Lauritzen 2000) can be considered as the underlying distribution for multivariate regression modeling. Developing regression analysis of the multivariate simplex distribution will be our future work through further investigation on this study.

Acknowledgments
The authors gratefully acknowledge NSFC 107104-N11405 for support.