icenReg : Regression Models for Interval Censored Data in R

The non-parametric maximum likelihood estimator and semi-parametric regression models are fundamental estimators for interval censored data, along with standard fully-parametric regression models. The R package icenReg is introduced which contains fast, reliable algorithms for ﬁtting these models. In addition, the package contains functions for imputation of the censored response variables and diagnostics of both regression eﬀects and baseline distribution


Introduction
In the setting of survival analysis, interval censored data occurs when an event time is known only up to an interval. Two common forms of interval censored data are current status data (Hoel and Walburg Jr 1972) and mixed case censoring (Schick and Yu 2000). Current status data occurs when each subject is observed at a single time and all that is recorded is whether the event of interest has occurred or not. This results in all subjects being either left or right censored. A classic current status dataset includes mice that are sacrificed at random times and inspected for lung tumors. If tumors were detected, the mice were recorded to be left censored at time of sacrifice. If no tumors were found, they were recorded as right censored. The more general type of interval censoring, called mixed case censoring, can include left censored, right censored, uncensored and observations that are censored but neither right nor left censored. The last type of censoring can occur if a subject is regularly inspected and all that is known is that the event of interest occurred between inspections. A classic mixed case interval censored dataset includes semi-regular dentist visits by children, with the event of interest being emergence of permanent teeth (Vanobbergen, Lesaffre, and Declerck 2000). By selecting the last visit without permanent teeth and the first with permanent teeth, the researchers know the event time up to an interval. The standard assumption is that this observation time is independent of the event of interest, although the observation time may be random or fixed by design.
Standard parametric models can be used and are fairly straight forward to implement using standard algorithms. Implementations of general location-scale transformed models (the most well known being the accelerated failure time model) for interval censored data can be found in the R package (R Core Team 2017) survival (Therneau 2017;Therneau and Grambsch 2000), fit with the survreg function. In addition, the R package flexsurv (Jackson 2016;Jackson, Sharples, and Thompson 2010) can be used to fit accelerated failure time, proportional hazards and proportional odds models. These models must be used with some caution in regards to interval censored data; they are heavily influenced by the choice of parametric model, for which the model inspection can be extremely difficult.
Because of this, non-parametric models are often preferred, if at least for diagnostics. For univariate data, the non-parametric maximum likelihood estimator (NPMLE; Turnbull 1976) is often preferred, a generalization of the Kaplan Meier curves (Kaplan and Meier 1958) for interval censored data. This is also referred to as the Turnbull estimator in the literature. This can be fit by the function EMICM in the package Icens (Gentleman and Vandal 2011;Wellner and Zhan 1997). Alternatively, this can be fit by the function computeMLE in the package MLEcens (Maathuis 2013;Groeneboom, Jongbloed, and Wellner 2008). In terms of statistical properties the NPMLE is notoriously inefficient; for current status data, the convergence rate has been shown to be n 1/3 (Groeneboom and Wellner 1992) instead of the more standard n 1/2 , while for mixed case interval censored data, the convergence rate has been conjectured to be n 1/2 − n 1/3 (Groeneboom 1996;Huang 1999), depending on the severity of the censoring.
For non-parametric comparison of different strata, a log-rank test can be used (Fay 1999). This is complicated by the fact that the NPMLE is characterized by a large number of parameters, many of which may be on the boundary. Alternatively, permutation tests may be used to compare separate groups. R implementations of both these tests can be found in the R package interval (Fay and Shaw 2010) and called by the function ictest. A generalized log-rank test can be found in the R package glrt (Zhao, Zhao, Sun, and Kim 2008;Zhao and Sun 2015).
For semi-parametric regression modeling of interval censored data, a Cox proportional hazards model (Cox 1972) can be used. Separating the estimation of the regression parameters from the estimation of the baseline parameters is not as simple as in the right censored case. One proposed method is to use the likelihood over the sum of all possible rankings in the dataset (Satten 1996). An MCEM approach can also be used (Goggins, Finkelstein, Schoenfeld, and Zaslavksy 1998). Both these methods are very computationally intensive.
The model can also be kept semi-parametric by using the NPMLE as the baseline survival distribution rather than separating the baseline and regression parameters (Finkelstein 1986). Even though the rate of convergence can be as low as n 1/3 for survival estimates based on the NPMLE, it has been shown that the regression coefficients converge at the standard n 1/2 rate and are asymptotically normal (Huang 1995), allowing for efficient comparisons with the semi-parametric proportional hazards model. Inference on the regression parameters can be done using bootstrap standard errors (Efron 1979). A semi-parametric proportional odds model can also be used (Rossini and Tsiatis 1996;Rabinowitz, Betensky, and Tsiatis 2000).
In this manuscript, these models will be referred to as the semi-parametric Turnbull (SPT) models.
It has been shown that the SPT model can be fit with an ICM algorithm (Argaon and Eberly 1992;Pan 1999;Huang and Wellner 1997). This algorithm has been implemented in R in the package intcox (Henschel and Mansmann 2013). Caution must be used with this package, as it was found that the default tolerance is set too slack, often converging far from the MLE (shown later in the manuscript). In addition, it has been shown that while the ICM algorithm can preform well for purely interval censored data, it behaves very poorly in regards to required number of iterations when the data also contains a mixture of censored and uncensored response variables (Anderson-Bergman 2016).
To the best of this author's knowledge, there is no R package available for fitting an SPT proportional odds model for interval censored data as defined in this manuscript, other than the package being presented.
Outside the SPT model, there exist several R packages that can be used for alternative semi-parametric models. The package coxinterval (Boruvka and Cook 2015b) implements a Cox-Aalen model (Boruvka and Cook 2015a). The ICsurv (McMahan and Wang 2014) package provides semi-parametric models that use splines for the baseline distribution (Wang, McMahan, Hudgens, and Qureshi 2016), as does the flexsurv package. For these methods, knot selection is still an open question. The MIICD (Delord 2017) package provides inference through a multiple imputation approach (Delord and Génin 2016).
Outside of the R environment, there are a few options for interval censored regression models. In Stata (StataCorp. 2015), the intreg routine fits a parametric AFT model (Cameron and Trivedi 2010) that allows for interval censored data. In SAS (SAS Institute Inc. 2013), the LIFEREG procedure fits parametric AFT models as well (SAS Institute Inc 2013) and the ICPHREG procedure fits a semi-parametric model with a spline-based estimate of the baseline distribution.
In icenReg, fast, reliable implementations of the fundamental tools for interval censored data are provided, intended for analysis of real data in R. This includes fitting the NPMLE through the function ic_np ("interval censored non-parametric"), the SPT proportional odds and hazards model through ic_sp ("interval censored semi-parametric") and the fully parametric accelerated failure time, proportional odds and proportional hazards model with a variety of choices for baseline distribution through ic_par ("interval censored parametric"). The functions diag_baseline and diag_covar are provided for visual diagnostics for parametric assumptions and covariate effects which rely on either the SPT or non-parametric models. The function imputeCens can be used to impute random samples of the interval censored data, conditional on the fitted model. In Section 2, the different models fit by icenReg are described, along with a brief description of the algorithm used. The algorithms are tested on simulated data against competing packages, when available. In Section 3, the diagnostic tools for model fitting are presented. In Section 4, the imputation method used by icenReg is presented. In Section 5, the various tools of icenReg are applied to a real dataset. In Section 6, future plans for the package are discussed.

Models
Some notation and the formal definition of the generating process for interval censoring is established first. For subject i, let t i represent the true event time of interest. The value t i is generally not known exactly, but rather to be contained within an "observation interval", for which the left and right side are denoted by l i and r i . This allows for left censoring (l i = 0), right censoring (r i = ∞), uncensored observations (l i = r i ) and other interval censoring (0 < l i < r i < ∞). Whether these intervals are open, closed or partially open does not affect estimation for fully parametric models, but can have an effect on the non-parametric and SPT models (Ng 2002). In icenReg, the default behavior is to treat intervals is left-open, right-closed (i.e., (l i , r i ]) as recommended by Ng (2002), but this can be controlled by the argument B in ic_np and ic_sp. Regardless of the choice of B, if l i = r i , the observation is treated as though it were uncensored, even though this would technically be undefined unless the interval were closed on both sides.
The generating process assumed for the censored intervals is the mixed censoring case, as defined in Schick and Yu (2000). To rephrase this definition, let K be a vector of random positive integers and let C be a random set of inspection times such that For each subject i, the interval such that t i ∈ (C i,j , C i,j+1 ] is observed and so l i = C i,j , r i = C i,j+1 . By allowing K i = ∞, all the intervals for subject i can be arbitrarily small, allowing for uncensored observations. Current status data is the special case where K i = 1 for all i. An important assumption is that (K, C) are independent of t i , i.e., the event time does not affect the inspection mechanism.

Non-parametric maximum likelihood estimator
For the NPMLE, the log-likelihood is defined as such that S is a non-increasing function that maps R → [0, 1]. More precisely, this is the log-likelihood with closed observation intervals; the likelihood with open or partially open intervals is of the same form but requires "clipping" the open ends of the interval for censored observations (i.e., replacing l i− with l i and r i with r i− if l i < r i ). The NPMLE is any proper survival function S that maximizes the log-likelihood function. It has been shown that the NPMLE only assigns positive probability mass to disjoint Turnbull intervals (Turnbull 1976), but how the probability mass is distributed within a Turnbull interval does not affect the likelihood. As shown in Figure 1, this means the NPMLE is not necessarily unique, but rather can be defined as any function that lies between a upper and lower step function.
Computation of the NPMLE is a high dimensional constrained optimization problem, as the number of Turnbull intervals, k, can grow linearly with n. An established algorithm for fitting the NPMLE is the EMICM algorithm (Wellner and Zhan 1997). Original implementations of this algorithm calculated each iteration in O(nk) time. Later algorithms considered optimization over an active set of parameters (Groeneboom et al. 2008;Wang 2008), where the active set is defined as the parameters with positive probability mass. This reduced computation of each iteration to O(nk a ), where k a is the size of the active set. In the case of heavily censored data, this was a massive improvement, but with lightly censored data, k a is often close to k and in such cases these algorithms were significantly slower than the EMICM algorithm. In icenReg, an efficient implementation of the EMICM algorithm is used. By taking advantage of the linear form of the data, each iteration can be calculated in O(n) time, rather than O(nk) (Anderson-Bergman 2017). This provides a massive speed up compared with both the original EMICM and the active set algorithms.
The performance of ic_np was compared against the EMICM algorithm in Icens (called by EMICM) and the support reduction algorithm in MLEcens (called by computeMLE). Data was simulated using icenReg's simIC_weib function as such: R> simdata <-simIC_weib(n = 100, b1 = 0, b2 = 0) This simulates a dataset with 100 observations. The arguments b1 and b2 define the regression effects; by setting them to 0, this simulates data from a Weibull(2, 2) distribution. Further description of the simulation methods can be found in ?simIC_weib. Sample sizes of n = 10 2 , 10 3 , 10 4 , 10 5 and 10 6 were considered. Each scenario was tested 100 times. For each algorithm, the average time in seconds and average relative error were reported, where average relative error was defined as the difference in log-likelihood compared with the maximum likelihood achieved across all fits.
The results can be seen in Table 1. All three algorithms consistently achieved a tolerable amount of error. With regards to speed, ic_np dominates the competing algorithms, except in the n = 100 case where it is virtually tied with MLEcens::computeMLE.
All simulations were run on a 2015 Macbook Air with a 2.2 GHz i7 processor.

Semi-parametric models
In the case of the SPT proportional hazards model, the log-likelihood can be written as where S o is the baseline survival function, the row vector X i contains individual's covariates without an intercept and the column vector β is a vector of coefficients. Like the definition for the NPMLE in Section 2.1, this the definition of the log-likelihood with closed observation intervals. Similarly, the log-likelihood function in the proportional odds model can be written as Unfortunately, there is no established method for fitting an AFT model with the NPMLE as the baseline distribution. While the necessary baseline parameters for the proportional hazards and proportional odds model can be found using the same methods as with the NPMLE (Anderson-Bergman 2016), this cannot be generalized to the AFT model, as the log-likelihood cannot be fully characterized by evaluation of S o at a finite number of points (which can be done with the proportional hazards and odds models).
Traditional methods for computing the SPT model includes using an ICM algorithm to update the baseline parameters and conditional Newton-Raphson on the regression parameters (Pan 1999;Huang and Wellner 1997). While it was found that this algorithm works well with heavily censored data, it behaves very poorly with lightly censored data (Anderson-Bergman 2016). This is very similar to the results for the NPMLE; the ICM algorithm behaves very poorly with lightly censored data, so it is paired with the EM algorithm which behaves well in that case resulting in the EMICM algorithm. In icenReg, a novel algorithm is used which augments the ICM algorithm with a constrained gradient ascent step, similar to augmenting the ICM with the EM algorithm in the NPMLE case. Each iteration of the algorithm now includes three steps: a conditional Newton-Raphson that updates the regression parameters, an ICM step that updates the baseline parameters on the log cumulative hazard scale and a constrained gradient ascent step that updates the baseline parameters on the probability mass scale. In the ICM step, a pool-adjacent violators algorithm (Van Eeden 1958) is used to  optimize the baseline hazard function while still respecting the monotonic constraint of the cumulative hazard. The algorithm is described in more detail in Anderson-Bergman (2016).
To demonstrate that the algorithm finds the correct solution, results were compared with intcox's intcox function on simulated data. The review begins with a more detailed examination of a single dataset. A simulated dataset with 500 interval censored observations from a proportional hazards model was creating using the simIC_weib function with the default settings, except for setting the sample size to 500. This implies the simulated true event times came from a proportional hazards model with a baseline Weibull distribution with shape and scale parameters equal to 2 and regression coefficients 0.5 and −0.5.
To test the reliability and speed of the algorithm, the above procedure was repeated 100 times for n = 100, 500, 1,000 and 5,000 (although the intcox algorithm was excluded from the n = 5, 000 case due to speed). The computation time and relative error was recorded. Results are presented in Table 2.
The results show that ic_sp consistently found the highest log-likelihood. Given a strict enough tolerance, it would appear that intcox always converged to the same solution. This provided assurance that the ic_sp algorithm was finding the correct solution. In addition, the new algorithm appears to be around 1,000 times faster than intcox given the same level of tolerance. It should be noted that these simulations are that of heavily censored data, for which the standard ICM algorithm (without the constrained gradient ascent step) does well; for lightly censored data, the intcox algorithm does significantly worse. For a more thorough review of performance, see Anderson-Bergman (2016).
This author is not aware of any packages that fit an SPT proportional odds model which could be used to compare the results with. Because of this, it is instead demonstrated that the algorithm produces estimates that behave as expected. To investigate, 1,000 proportional odds datasets were simulated using simIC_weib for n = 500 and n = 2, 000 and examined the mean and variance of the estimated regression coefficients. The true regression coefficients were β 1 = 0.5, β 2 = −0.5. A single dataset with n = 500 can be simulated according to the following code: R> simdata <-simIC_weib(n = 500, model = "po") Based on 400 samples the following estimates are obtained: for n = 500,

Fully parametric models
When parametric models are considered, the log-likelihood function must treat uncensored observations in a distinct manner. Because of the continuous nature of the baseline distributions considered, the distinction between open and closed intervals is no longer necessary.
In the case of the proportional hazards model, the log-likelihood can be written as where α contains the parameters associated with the baseline distribution, the column vector β contains the regression parameters, the row X i contains subject i's covariates, f o and S o are the baseline density and survival functions, with the first n 1 subjects being the uncensored subjects, and the remaining n 2 subjects are interval censored.
To maximize this likelihood function, a two step algorithm is used. A simple conditional Newton-Raphson step is used to update the regression parameters, as the function will be concave under standard regression conditions (non-singular design matrix). The log-likelihood function is not necessarily concave as a function of the baseline parameters, and it was occasionally found to be non-locally concave for poor starting choices of α. To handle this, the algorithm would first check if the Hessian was negative definite. If so, a conditional Newton-Raphson step was used. If not, a gradient ascent step was used until the log-likelihood function is locally concave.
To compare this algorithm with an established implementation, icenReg's ic_par was compared with the results of survival's survreg function. The default model from survreg is a Weibull AFT model. This can be directly compared to two ic_par models: the Weibull AFT model and the Weibull proportional hazards. This is because for the Weibull distribution, the AFT and proportional hazards models are identical up to a linear transformation of variables (see Appendix A for derivation). As such, both models must have the same maximum likelihood. Datasets of n = 10,000 and 100,000 were simulated using simIC_weib, with 100 simulated datasets for each scenario. Mean time and relative error for each algorithm is presented in Table 3. All three algorithms were sufficiently precise, having mean relative error on the order of 1.0e-10 to 1.0e-9. In addition, the maximum absolute difference in regression parameters across all models (after rescaling the proportional hazards model to be on the same scale as the AFT parameters) was less than 1.29e-07 across all simulations. A speed advantage was held by survreg, being on average 6-8 times faster on the simulated data than either model fit with ic_par.
Even though survreg holds a speed advantage, ic_par has many model choices not available with survreg. As such, a proportional hazards or proportional odds model may be preferred if they fit the data better. In addition, the ic_par objects interact more seamlessly with the other tools provided by icenReg, so a user may choose to use ic_par with model = "aft", despite the loss to speed, in order to use the other utilities provided by icenReg.
Currently, six parametric families are supported: exponential, gamma, Weibull, log-normal, log-logistic and generalized gamma (Stacy 1962). It is worth noting that for several of these distributions, the parametric family is not preserved given the link function. For example, having proportional hazards to a log-normal distribution does not imply log-normality. In these cases only the baseline distribution will actually belong to the given parametric family. As such, it is very important to note that ic_par centers the covariates before fitting the model for numeric stability; the baseline distribution refers to subjects with mean covariate values, rather than 0. Some users may find it displeasing that the link does not preserve the parametric family. However, standard probability functions (i.e., pdf, cdf, inverse cdf) are easily computed despite this. The package includes a function getFitEsts that allows easy extraction of the estimated cdf or inverse cdf from a fitted model.
The software is written in an object-oriented manner such that it requires minimal effort to add new parametric distributions; all that is needed is a C++ implementation of both the pdf and cdf function. From there, optimization is handled generically.

Diagnostic tools
When fitting parametric regression models, the researcher makes an assumption about the effect of the covariates and the baseline parametric model. When fitting an SPT model, the second assumption is dropped, but the covariate assumption is still required for valid inference. In either case, it is important to assess the validity of the assumptions. With interval censored data, this can be fairly difficult. The icenReg package includes easy to use routines for examining both sets of assumptions using the SPT model. Unfortunately, these methods only apply to proportional odds and hazards models, as there are no methods to fit the SPT AFT model.
To examine the parametric baseline assumption, diag_baseline fits and plots the baseline survival distribution of a variety of parametric choices. It also plots the SPT estimated baseline distribution. This can help an investigator assess if there appears to be a systematic deviation from the assumed baseline distribution.
To examine the functional form of the covariates, diag_covar uses the fact that for both models, there is a transformation of the survival function such that differences in covariate effects will result in constant differences. For the proportional hazards model, note that log(− log(S(t|X, β))) = log(− log(S o (t) e Xβ )) = Xβ + log(− log(S o (t))).
Likewise, for the proportional hazards model, note that To investigate whether the functional form is appropriate for a given covariate, diag_covar first stratifies the dataset on different levels of that covariate. It then fits an SPT model for each strata and plots the given transformation of the baseline survival functions. If the functional form of the covariate is correct, the difference between the two strata's transformed baselines should be approximately constant. To help visualize the difference, the average of all the strata is subtracted off of each strata by default. By subtracting off the average of all the curves, each curve should be a flat line (with stochastic noise) under the correct regression model. When the regression model is incorrect, often the different curves will converge together or cross. Users also have the option to examine the transformed survival curves without the mean subtracted (may be necessary if there is little overlap between strata) and the raw baseline survival function estimates.
Because there is no established method for calculating the AFT SPT model, neither of these methods can be applied to the AFT model. The author is currently working on alternative diagnostic tools for the AFT model.

Imputation
In some cases, the analyst may wish to impute the missing data (i.e., exact event time). For example, this could be a step in a multiple imputation analysis (Rubin 1987). This functionality is provided in imputeCens.
Three imputation strategies are allowed by imputeCens. The simplest is median imputation (imputeType = "median"), in which event times are imputed with the median value, conditional on being inside the given observational intervals and the parameter values at the MLE. The next strategy (imputeType = "fixedParSample") takes a random draw of the event time, conditional on being within the given observation interval and the parameter values at the MLE. Finally, the last strategy (imputeType = "fullSample") takes a random sample of the parameters. Then, conditional on those parameter values, it takes a random sample of the event times, conditional on being within the observational interval and the randomly drawn parameter values. How the parameters are sampled depends on the model. For the fully parametric model, the asymptotic normality of the estimators motivates taking a random draw from a multivariate normal with mean and covariance provided by the point estimates and negative inverse Hessian matrix. However, this method cannot readily be applied to the NPMLE and SPT models, as the baseline parameters do not follow an asymptotically normal distribution. As such, for the SPT model the "fullSample" option fixes the baseline parameters and takes a random sample of the bootstrapped regression coefficients before sampling the conditional event times. For the NPMLE, "fullSample" is equivalent to "fixedParSample".
In addition, the analyst must keep in mind that the NPMLE and SPT model only apply probability mass onto Turnbull intervals. Because of this, on most fits there will be several "gaps" [a j , b j ) 1 on [0, ∞) for which the SPT model estimates P(T ∈ (a j , b j )|X) = 0 for all values of the covariates X. In fact, consider r o = min{r i : r i ≥ l j , j ∈ 1, . . . , n}. There will be 0 probability mass assigned to (r o , ∞), as by definition r o will be the right side of the maximum Turnbull interval. Likewise, if l o = max{l i : l i ≤ l j , j ∈ 1, . . . , n}, then zero probability mass is assigned to [0, l o   is completely contained within one of the gaps. Note that this cannot happen with data that was used to fit the SPT model, as each observation must contain at least one Turnbull interval with positive probability mass to have a finite log-likelihood, but it could happen if a user attempted to impute data for an observation interval that was not used to fit the model. Given these complications, it is advised that the analyst uses the most appropriate parametric model for imputation unless doing so leads to clear bias in imputation. Table 4 provides a very quick summary of the public functions provided in icenReg. To help illustrate the use of the package, a sample analysis is presented.

Using icenReg
Thee IR_diabetes dataset (Borch-Johnsens, Andersen, and Decker 1985) from icenReg is used, which was initially imported from glrt (Zhao and Sun 2015). In this dataset, 731 patients (454 males and 277 females) are followed, with time from onset of diabetes to onset of diabetic nerphronpathy being the response time of interest. For many of the patients (595), the event time was known exactly but for others (136) the exact time was known only up to an interval due to limited follow up. The dataset contains three variables: left, right and gender. The variables left and right represent the observational interval. In this example, the effect of gender will be examined.  First, the NPMLE is fit to each group. The syntax for this is very similar to fitting the Kaplan Meier curves with survival::survfit, but the response must either be a 'Surv' object with type = "interval2" or of the form cbind(l, r), where l, r are the left and right side of the observation interval for each subject. This syntax is also used for ic_sp and ic_par.
R> npmleFit <-ic_np(cbind(left, right)~gender, data = IR_diabetes) Plots of the NPMLE for each group can be created using the plot function. This can be seen in Figure 2.
R> plot(npmleFit, main = "NPMLE by gender", col = c("blue", "orange")) While the two NPMLE fits give a full picture of comparing the two groups, an investigator may want to use a regression model to more succinctly describe the difference between the two groups. One can begin with visually assessing which regression model appears more appropriate.
R> diag_covar(cbind(left, right)~gender, data = IR_diabetes, + model = "ph", col = c("blue", "orange")) R> diag_covar(cbind(left, right)~gender, data = IR_diabetes, + model = "po", col = c("blue", "orange")) Examining Figure 3, some deviation from the regression assumptions can be seen; the transformed difference between the groups is greater early on but becomes less as time goes on. However, for the proportional odds model, the deviation is less and acceptable for inference. Because of this, it was chosen to model the data with a proportional odds effect of gender.
Given that the bootstrap is required for estimating the standard errors, it is simple to use multiple cores to speed up computation. For parallel computing of bootstrap samples, icen-Reg works seamlessly with R's doParallel package (Revolution Analytics and Weston 2015), although it is left to the user to set up the cluster (as to not interfere with other processes that may be running). This is demonstrated below. The fitting of the model and an additional 1,000 bootstrap samples took just under 30 seconds utilizing 2 cores. R> library("doParallel") R> myClust <-makeCluster(2) R> registerDoParallel(myClust) R> sp_fit <-ic_sp(cbind(left, right)~gender, model = "po", + data = IR_diabetes, bs_samples = 1000, useMCores = TRUE) R> stopCluster(myClust) The summary function can be used to review the results.  From the summary, it can be seen that there is a statistically significant difference in the odds of not having experienced diabetic nerphronpathy at a given time after diabetes between men and women in the study. It is estimated that the odds of survival at any given time will be 1.49 times higher for men than for women (95% CI = [1.13, 1.97], p = 0.0044) under the assumption of proportional odds.
Using the plot method, the user can plot estimated survival curves between the two groups. This is done by including a new dataset with the covariates for each survival curve to be plotted. If no newdata argument is included, the baseline group is plotted. The plotted curves can be found in Figure 4.
R> newdata <-data.frame(gender = c("female", "male")) R> rownames(newdata) <-c("Female", "Male") R> plot(sp_fit, newdata, main = "Semi-parametric Fits by Gender", + col = c("blue", "orange")) In some cases, a parametric model may be preferred. For example, even though bootstrapping can be used for inference on the regression parameters, it cannot be used for inference on the baseline survival distribution (and thus the conditional survival probabilities as well). If a parametric model is chosen, it is important to check that the chosen baseline distribution is appropriate. Using diag_baseline with a proportional odds model, the user can view how the different parametric models compare with the SPT fit. If the argument dist is left blank, default behavior is to plot all available parametric distributions against the SPT model. This is plotted in Figure 5. It was decided that the log-logistic distribution was the most appropriate, given that there appears no systematic deviation from the SPT fit.
The function getFitEsts can be used to get estimated event time or probabilities from a fitted model. For example, a user could use the following code to extract the median event time for males and the survival probability at t = 10 for females.

Discussion
The author's vision for the icenReg package is to provide analysts with a reliable, organized set of tools for the analysis of interval censored data. As such, implementing established methods over novel methods is preferred, although there is interest in fast new algorithms, as implemented for the non-parametric and SPT model.
With this general guideline, there are several improvements to the package planned in the near future. At this time, there is work on developing diagnostic tools for the AFT model. In addition, there are plans for including residuals, such as those found in Farrington (2000). Time varying coefficients, as found in Sparling, Younes, and Lachin (2006) would be another useful addition.