An R Package for a General Class of Inverse Gaussian Distributions

The inverse Gaussian distribution is a positively skewed probability model that has received great attention in the last 20 years. Recently, a family that generalizes this model called inverse Gaussian type distributions has been developed. The new R package named ig has been designed to analyze data from inverse Gaussian type distributions. This package contains basic probabilistic functions, lifetime indicators and a random number generator from this model. Also, parameter estimates and diagnostics analysis can be obtained using likelihood methods by means of this package. In addition, goodness-of-ﬁt methods are implemented in order to detect the suitability of the model to the data. The capabilities and features of the ig package are illustrated using simulated and real data sets. Furthermore, some new results related to the inverse Gaussian type distribution are also obtained. Moreover, a simulation study is conducted for evaluating the estimation method implemented in the ig package.


Introduction
There are several fields of statistical application where the normal distributional assumption does not hold. This is because some random variables (RV) considered in these fields follow asymmetrical models. When researchers encounter skewed data, they generally try to eliminate the asymmetry by transforming such observations so that the normal model can be used. Problems of interpretation may arise, however, when analyzing data based on such transformations. Thus, if a distribution is appropriate for modeling skewness data and besides mathematically treatable and available for users, then this one should be employed. For this reason, it is necessary to implement computational packages that take the utilization of asymmetrical models into consideration.
A positively skewed probability model that has received great attention in the last 20 years is the inverse Gaussian distribution (IGD). The IGD is a probability model also known as the first passage time distribution of Brownian motion with positive drift, which was developed by Schrödinger (1915). Later, in 1941, Tweedie proposed the name inverse Gaussian for this distribution since its cumulant generating function has an inverse relationship with that of the normal distribution; see Tweedie (1957). The interest for the IGD is a result of its attractive statistical and probabilistic properties. For example, the IGD belongs to the exponential family, it has the reproductive property and it possesses similar inferential properties to that of the normal model; see Mudholkar and Natarajan (2002). Thus, the IGD is a natural alternative candidate to the normal distribution for modeling non-negative data with positive skewness. For more details about the IGD see Chhikara and Folks (1989), Seshadri (1993Seshadri ( , 1999 and Johnson, Kotz, and Balakrishnan (1994, pp. 259-297).
Maximum likelihood estimates (MLE) of the parameters of the normal model can be sensitive to outlying observations. Lange, Little, and Taylor (1989) proposed the Student-t distribution (which we will call simply t distribution or t model) as an alternative to the normal case, since it has greater kurtosis than the normal model. Thus, cases which might be considered as outlying under normality, might not under the t model, producing stable parameter estimates. Recently, based on the relationship between the inverse Gaussian and normal distributions, Sanhueza, Leiva, and Balakrishnan (2008) generalized the IGD using a class of models known as elliptically contoured distributions. The elliptical univariate distributions contain all the symmetrical models in R, the real line, and possess different degrees of kurtosis related to the normal model. Therefore, this generalization, called the inverse Gaussian type distribution (IGTD), can produce stable parameter estimates in the presence of outliers. In general, the IGTD is highly flexible because it allows for different degrees of kurtosis and asymmetry, other than modality and bimodality.
The aims of this paper are: (i) to present the R (R Development Core Team 2008) package named ig, available from the Comprehensive R Archive Network at http://CRAN.R-project. org/package=ig, which can be used to compute probabilistic characters, estimate parameters, carry out diagnostics and produce goodness of fit for the IGTD; and (ii) to develop two new aspects of the IGTD: lifetime analysis and generation of random numbers.
The rest of this article is organized as follows: in the second section, we give a background about probabilistic properties of the IGTD and provide new results for lifetime analysis based on this distribution; in the third section, we give a background about statistical results of the IGTD and develop a procedure for obtaining random numbers from this model, which can be useful for simulation studies; in the fourth section, the main functions of the ig package and some illustrative examples are presented. Finally, some conclusions are drawn.

Probabilistic aspects of the IGTD
In this section, we present a summary of the most important properties of the IGTD obtained by Sanhueza et al. (2008) and also provide some new results for lifetime analysis based on this distribution. Specifically, in the first part of this section we discuss aspects related to symmetrical distributions in the real line, also known as univariate elliptically contoured (EC) distributions, and their implementation in the R software. Later, we summarize in seven properties the most important results of the IGTD obtained by Sanhueza et al. (2008). In the third part of this section, we develop a lifetime analysis for the IGTD mainly based on the hazard function (hf) including its change point. Finally, a shape analysis for this model is provided, which shows some graphical tools implemented in the ig package.

Symmetrical distributions in the real line
If a RV has a univariate EC distribution with location µ = 0 and scale σ = 1 parameters, then the notation Z ∼ EC 1 (0, 1; g) is used, where g is the kernel of the probability density function (PDF) of Z expressed as f Z (z) = c g(z 2 ), with z ∈ R and c being a normalization constant that allows f Z (z) to be a PDF. The cumulative distribution function (CDF) of Z is denoted by F Z (z). Models of this type are known as standard symmetrical distributions in the real line or symmetrical distributions about zero and can be denoted simply by Z ∼ S(g). For details about symmetrical distributions see Fang, Kotz, and Ng (1990) In the R software, the functions dlogis() and plogis(), dnorm() and pnorm() and dt() and pt() are already implemented, which allow the standard symmetrical PDF and CDF of the logistic, normal and Student-t models to be obtained, respectively. In addition, R also has the normalp package available, which can be used for the Laplace distribution; see Mineo (2003).

The inverse Gaussian type distribution
If a RV T follows the IGTD with mean µ, scale parameter λ and kernel g, then the notation T ∼ IGT(µ, λ; g) is used and the following results hold: where a t = a t (µ, λ) = λ/µ[ t/µ − µ/t] and f Z (·) denotes the PDF of Z ∼ S(g) (A2) The mode(s) of T denoted by t m is (are) obtained by the solution(s) of where a t is given in (A1) and ω g (u) = g (u)/g(u), with u > 0 and g (·) being the derivative of g(·) also given in (A1).
(A5) The quantile function (qf) of T must be numerically obtained by t(q) = F −1 T (q), with F −1 T (·) being the inverse function of F T (·).
where θ = λ/µ and υ r = E[U r ] < ∞, with U ∼ Gχ 2 (g). From the expression for E T r+1 given below, the variance, standard deviation (SD), coefficient of variation (CV), coefficient of skewness (CS) and coefficient of kurtosis (CK) of T can be obtained.
The IGTD may be generated from the Laplace, logistic, normal and t with ν degrees of freedom (denoted by t ν ) kernels, which will be called IGT-lap, IGT-logis, IG and IGT-St distributions, respectively. The Cauchy law is a t distribution with ν = 1 and the associated IGT will be called IGT-Cauchy. Table 1 presents Gχ 2 D associated with the Laplace, logistic, normal and t models. Specific expressions for the variance, CV, CS and CK for the IGT-lap, IGT-logis, IG and IGT-St distributions can be found in Sanhueza et al. (2008,

Lifetime analysis for the IGTD
For a nonnegative RV T with PDF f T (t) (and CDF F T (t)), lifetime distributions are equivalently characterized in terms of the survival function (sf) S T (t) = 1 − F T (t) and hf (or failure rate) h T (t) = f T (t)/S T (t), with t > 0 and 0 < S T (t) < 1. For example, if h T (t) is nondecreasing (or nonincreasing) in t, then F T (·) belongs to the class of increasing (or decreasing) failure rate (IFR) (or DRF) distributions; if h T (t) = λ > 0, for all t, we have S T (t) = exp(−λt), with t > 0, so that F T (t) is an exponential distribution.
If T ∼ IGT(µ, λ; g), then from (A1) and (A4), we have In some cases, instead of having a monotone hf, we have a change point, say t c , such that F T (t) is IFR for t ≤ t c and DFR for t ≥ t c . The IGTD belong to this switching monotone FR patterns and its change point t c is the solution of the equation

Shape analysis for IGTD
Here, we perform a graphical analysis for some IGTDs. Models presenting different degrees of kurtosis, asymmetry as well as modality and bimodality and the absence of moments are considered. In the following, we present the densities of these IGTDs.
Example 1. Let T ∼ IGT(µ, λ; g). Then, for the indicated IGTD, the PDF of T is The IGTD can also be generated from the Kotz type (KT) and Pearson VII (PVII) kernels; for more details about the IGTD based on the KT and PVII kernels, see Sanhueza et al. (2008). By using the ig package, we can create Figure 1, which shows the behavior of densities of IGTDs for several kernels. Figure 1 also shows a zoom of the right tails for different IGTDs in order to highlight their kurtosis. From these graphical plots, we note that the IGT-lap, IGT-logis and IGT-St models present greater kurtosis than the classical IGD. Thus, any of these models should produce stable parameter estimates. However, the kurtosis of the IGT-St model is flexible depending on the parameter ν, which allows one to consider the IGT-St model as a good candidate in order to produce stable parameter estimates in the presence of outliers.
From Section 2.3 and (A2), expressions for the lifetime indicators and the mode(s) of the IGTDs given in Example 1 can be specified. In order to do that, we consider the values for ω g (u) of the Laplace, logistic, normal and t distributions presented in Sanhueza et al. (2008 ,  Table 3), which have been implemented in the ig package.

Statistical aspects for the IGTD
In this section, we present a summary of inference and diagnostics tools for the IGTD presented by Sanhueza et al. (2008), which have been implemented in the ig package. Also, we provide new results useful for simulation studies from this distribution that have been implemented in this package as well. Specifically, in the first part of this section, we summarize some aspects related to estimation and inference for the parameters of the IGTD. Later, we present tools associated with influence diagnostics mainly based on local influence for this model. Finally, in the third part of this section, we derive a method for generating random numbers from the IGTD that has not been developed thus far, which allows one to conduct simulation studies.

Estimation and inference
The MLE of the parameter θ = (µ, λ) for the IGTD, which is implemented in the ig package, can be computed from where given in (A2). The expressions presented in Eq. (1) are obtained from the log-likelihood function associated with the model given in (A1), which is and The ig package incorporates the analytical form of the score vector,L = (L µ ,L λ ) . Numerical values for the MLEs of µ and λ must be computed by using iterative procedures which need initial values µ (0) and λ (0) . The MLEs of µ and λ for the classical IGD (see Eq. (1) with υ i = 1) are considered as starting values for the iterative procedure implemented in the ig package. For the classical IGD, we have υ i = 1 giving thus equal weight for each observation located at the tails or at the center of the distribution. In the IGT-St model, υ i gives less weight to the extreme cases, which can be viewed in Figure 2.
If the IGTD is obtained from a kernel different to the normal one, then E(L µλ ) = 0. Thus, an approximate (1 − α)100% confidence region for θ based onθ∼ N 2 (θ, Σθ ) and implemented in the ig package is given by where χ 2 1−α (2) denotes the (1 − α)th quantile of the χ 2 distribution with two degrees of freedom. Here, Σθ is the covariance matrix ofθ, which we approximate by −L −1 evaluated atθ, with −L being the observed information matrix obtained from the Hessian matrix. This matrix is incorporated in an analytical form in the ig package and given bÿ .
For details about the score vector, Hessian matrix and asymptotic inference for the IGTD, see Sanhueza et al. (2008).

Influence diagnostics
Diagnostics techniques are relevant due to the importance for evaluating the effect of possible atypical observations on the estimates. Also, these techniques allow one to verify the stability of the estimation procedure in the presence of those atypical data. Next, we summarize a diagnostics technique known as local influence, which is implemented in the ig package.
(2), we have The most influential cases can be identified by their large components of the direction vector, l max , which produces the greatest local change inθ and corresponds to the eigenvector associated with the largest eigenvalue of B = ∆ L −1 ∆. In this case, the total local influence of the ith case is given by C i = 2|b ii |, where b ii is the ith diagonal element of B. Those cases that satisfy the condition C i > 2 n n i=1 C i can be considered as potentially influential; see Verbeke and Molenberghs (2000). We have implemented C i and its cut point as local influence diagnostics tool in the ig package. For more details about diagnostics techniques in the IGTD see Sanhueza et al. (2008).

Simulation
In general, statistical inference tools may not exist in closed form for the IGTD. Hence, simulation and numerical studies are needed, which require a random number (RN) generator. Next, we present a RN generator for the IGTD following a procedure similar to the one given in Chhikara and Folks (1989, pp. 52-53). This generator has been implemented in the ig package. Thus, if T ∼ IGT(µ, λ; g), the algorithm steps are: (B1) Generate a RN u from U ∼ Gχ 2 (g) by using an appropriate generator.

The IG functions
In the ig package, the logistic, Laplace, normal and t kernels for the IGTD are implemented. This package provides mainly two groups of functions in R code for the IGTD. These groups are related to: (i) probabilistic characters, lifetime indicators and a random number generator, and (ii) estimation, inference, diagnostics and goodness-of-fit methods. In addition, some tools useful for exploratory data analysis (EDA) are also included. The ig package is independent of other R packages already created.

IG statistical functions
Another group of functions related to graphical analysis, estimation, diagnostics, simulation and goodness of fit for the IGTD is also available. In order to estimate the mean (µ) and scale (λ) parameters of the IGTD, we implemented the method summarized in Section 3.1. Four functions for estimating the parameters of the IG, IGT-lap, IGT-logis and IGT-St distributions have been developed. These functions are: mleig(), mleigt(), mleigStNuFixed() and mleigSt(). The parameters of the IGD based on a data set call lifetime, for example, may be estimated from mleig(lifetime) or mleigt(lifetime, kernel = "normal"). The parameters of the IGT-lap and IGT-logis models based on lifetime may be estimated from mleig(lifetime, kernel = "laplace") and mleig(lifetime, kernel = "logistic"), respectively. The parameters of the IGT-St model based on lifetime may be estimated from mleigStNuFixed(lifetime, nu = 1.0), if the parameter ν is fixed, which for example is useful for conducting simulation studies. However, if the purpose is to achieve the optimal value for ν, then the function mleigt(lifetime) must be used.
Next, two examples related to the use of these and other commands are presented. The first example corresponds to a simulation study carried out by using the functions rigt(), mleig() and mleigStNuFixed(). The second example is from a real data set available in the literature and implemented in the ig package. In this example, we illustrate that the estimation procedure presented for the IGT-St model is stable in the presence of atypical observations. Also, we present the functions implemented for EDA in the ig package.

Example 2. Simulation study
In this example, we study the behavior of the MLEs of the parameters of the IGTD by considering the Monte Carlo simulation. We simulate samples under different scenarios considering small, moderate and large sample sizes (n = 10, 25 and 100, respectively) and normal and t ν kernels, with ν = 2, 8 and 50, where the values ν = 2 corresponds to high kurtosis, ν = 8 to moderate kurtosis and ν = 50 to low kurtosis. The scale parameter was fixed at λ = 1.0, without a loss of generality, and the mean was fixed at µ = 0.5.
We use the bias and mean square error (MSE) of the MLE in order to study the quality of the estimation method. The samples were generated from an IGTD with a specific kernel (normal or t ν ) that we called the "true kernel" and the estimation of parameters computed using samples obtained from the same or other kernels, called the "assumed kernel". The empirical bias and MSE values are the average of simulated samples for each combination of sample size and kernel. The results of the simulations are presented in Tables 2 and 3 for the estimates of µ and λ, respectively.
We have noted that the bias of the estimators for µ and λ gets smaller when the sample size tends to be greater for those cases where the true and assumed kernels are the same. In addition, we find greater bias when we estimate λ instead of µ. On the other hand, We have also noted that the MSE gets smaller when the sample size gets bigger and when the kurtosis increases as well. Furthermore, we can see that there are greater MSEs when we estimate λ instead of µ.
Example 3. Application to practical data Birnbaum and Saunders (1969) reported a data set corresponding to fatigue life (T ) measured in cycles (×10 −3 ) of n = 101 aluminum coupons (specimens) of type 6061-T6. These specimens cut parallel to the direction of rolling and oscillating at 18 cycles per second. The coupons were exposed to a pressure with maximum stress of 31,000 psi (pounds per square inch). All specimens were tested until failure. The ig package incorporates these lifetime data so that if we input the command data("psi31") it will be ready to use the data.
Exploratory data analysis: In order to carry out an EDA for psi31, we implement the function descriptiveSummary() and some graphical tools. Firstly, descriptive statistics of the data can be obtained from the function descriptiveSummary(), which uses the command searchMode() to find the empirical mode of the data. This descriptive statistics is summarized in Table 4 and obtained by the instruction
The EDA based on Table 4 and Figure 3 indicates a slightly positively skewed distribution (CS = 0.33) with moderate kurtosis (CK = 0.97) and some atypical values (see box-plot). The IGTD considers the degrees of skewness and kurtosis presented in the data. The potential influence of the atypical cases will be analyzed next by the IGT-St model. This IGTD should produce stable parameter estimates to the potentially influential cases.
presented for the IGT-St model with ν = 7 is stable in the presence of atypical data. This figure shows potentially influential observations only in the classical IGD. These observations are the cases: 1, 2, 100 and 101.
In order to compute the magnitude of the changes produced on the MLE of µ and λ when the potentially influential observations are eliminated, we implement the function rcigt(). This function computes the relative change (RC), in percentage, of each estimated parameter, defined by RC θ j = |(θ j −θ j(I) )/θ j | × 100%, whereθ j(I) denotes the MLE of θ j after the set I of cases has been removed. Table 5 shows these RCs, which were obtained by using the commands rcigt(psi31, casesRemoved = 1, kernel = "t") rcigt(psi31, casesRemoved = 1, kernel = "normal") Analogously, the instructions casesRemoved = 2 and casesRemoved = c(1, 2, 100, 101) allow one to eliminate the case 2 and the cases {1, 2, 100, 101}, respectively. From Table 5, we note that the RCs are greater in the classical IGD than in the IGT-St model with ν = 7, as we expected.
Goodness of fit: In order to select the best model, we implement three goodness-of-fit tools for the IGTD: (i) probability-probability (PP) and quantile-quantile (QQ) plots; (ii) the Kolmogorov-Smirnov (KS) test; and (iii) a criteria of model selection. Next, the goodnessof-fit functions implemented in the ig package will be described. Firstly, the commands ppigt() and qqigt() enables to detect the fitting of the IGTD to the data by means of the visualization of the PP and QQ plots. The commands ppigt() and qqigt() allow also incorporate a straight line that sketches a line passing through the first and third quartile of the theoretical IGTD in order to check graphically if a set of observations follows a particular IGTD. Also, the coefficients of determination (R-square in %) of these plots are calculated and shown inside these graphs. The following instructions give QQ plots for psi31: R> qqigt(psi31, kernel = "t", line = TRUE) R> title(main = "IGT-St distribution", cex.main = 1.5) while Figure 6 displays these graphs. Once again, from this figure, we note that the estimation procedure presented for the IGT-St model with ν = 7 is stable on the tails of this distribution. This is because the IGT-St model behaves better than the classical IGD on the tails and then produces stable parameter estimates to outliers. The R 2 of the PP plot for psi31 by considering the IGT-St distribution with ν = 7 is 99.46% (see Table 6), which demonstrates the model's excellent fit to the data (for the QQ plot, the R 2 is 98.85% upon this same model).
Secondly, we also implement a command for a goodness-of-fit test useful for the IGTD. This function is ksigt(), which calculates the KS test and also provides a comparative graph of the theoretical and empirical distribution functions. This graph is shown in Figure 8 (on the right) and is produced when the instruction graph = TRUE is added in ksigt(). Finally, we implemented the Schwartz information criterion (SIC) by the function sicigt(), which is a well known criterion of model selection that can also be used for goodness of fit.
Next, the results of the fit for psi31 are presented in Table 6, which show that the IGT-St distribution with ν = 7 presents a better fit to the data.
In addition, by using the invariance property of the MLEs, the estimated PDF may be sketched on the histogram adding the instruction densityLine = TRUE into the function histigt(). Figure 8 (on the left) shows the histogram with the estimated PDF from which we observe  Remark: Seven other data sets called psi21, psi26, precipitations, repairtimes, shelflife, fracture and runoff used frequently in the literature of this topic, have also been incorporated into the ig package.

Concluding remarks
In order to analyze data from the IGTD, we have developed the ig package, which can execute several useful commands. The created functions are related to probability and lifetime indicators, estimation, diagnostics, goodness-of-fit, simulation and graphical tools. In addition, we have provided new theoretical results of the IGTD related to lifetime analysis and generation of random numbers. Furthermore, we have conducted a brief simulation study for evaluating the estimation method used here. By using numerical examples, we illustrate the different capabilities and features of the ig package. We believe that, in the future, the ig package can be improved by incorporating other functions related to the estimation for censored data and regression methods.