p3state.msm: Analyzing Survival Data from an Illness-Death Model

In longitudinal studies of disease, patients can experience several events across a followup period. Analysis of such studies can be successfully performed by multi-state models. In the multi-state framework, issues of interest include the study of the relationship between covariates and disease evolution, estimation of transition probabilities, and survival rates. This paper introduces p3state.msm, a software application for R which performs inference in an illness-death model. It describes the capabilities of the program for estimating semi-parametric regression models and for implementing nonparametric estimators for several quantities. The main feature of the package is its ability for obtaining nonMarkov estimates for the transition probabilities. Moreover, the methods can also be used in progressive three-state models. In such a model, estimators for other quantities, such as the bivariate distribution function (for sequentially ordered events), are also given. The software is illustrated using data from the Stanford Heart Transplant Study.


Introduction
In many medical studies, patients may experience several events. Analysis in such studies is often performed using multi-state models (Andersen, Borgan, Gill, and Keiding 1993;Meira-Machado, Cadarso-Suárez, de Uña-Álvarez, and Andersen 2009). These models are very useful for describing event-history data, affording a better understanding of the disease process, and leading to a better knowledge of the evolution of the disease over time. Issues of interest include the estimation of transition probabilities, survival rates or assessing the effects of individual risk factors. Although the mortality model for survival analysis can be considered the simplest multistate model, the scope of multi-state models provides a rich framework for handling complex p3state.msm: Analyzing Survival Data from an Illness-Death Model known as the disability model ( Figure 2). These models can be used to study the effect of binary time-dependent covariates, such as the appearance of "recurrence" in a breast cancer study (Cadarso et al., 2009), "bleeding episodes" in patients with liver cirrhosis (Andersen et al., 2000), or "transplantation" in heart diseases (Meira-Machado et al., 2008  More examples of multi-state models can be found in books by Andersen et al. (1993) and Hougaard (2000), or in papers by Putter et al. (2007) and Andersen and Perme (2008).
Despite its potential, multi-state modeling is not used by practitioners as frequently as other survival analysis techniques. It is our belief that lack of knowledge of available software and non-implementation of the new methodologies in userfriendly software are probably responsible for this neglect. One important contribution to this issue was given by the R/S-PLUS survival package. Thanks to this package, survival analysis is no longer limited to Kaplan-Meier curves and simple Cox models.
Indeed, this package enables users to implement the methods introduced by Therneau and Grambsch (2000) for modeling multi-state survival data. In R (R Development Core Team 2008), multi-state regression can also be performed using the msm package  situations that involve more than two states and a number of possible transitions among them. The most common models in the literature include: the progressive three-state model depicted in Figure 1; and the illness-death model, also known as the disability model ( Figure 2). These models can be used to study the effect of binary time-dependent covariates, such as the appearance of "recurrence" in a breast cancer study (Cadarso-Suárez, Meira-Machado, Kneib, and Gude 2010), "bleeding episodes" in patients with liver cirrhosis (Andersen, Esbjerg, and Sørensen 2000), or "transplantation" in heart diseases (Meira-Machado et al. 2009).
More examples of multi-state models can be found in books by Andersen et al. (1993) and Hougaard (2000), or in papers by Hougaard (1999) and Putter, Fiocco, and Geskus (2007).
Despite its potential, multi-state modeling is not used by practitioners as frequently as other survival analysis techniques. It is our belief that lack of knowledge of available software and non-implementation of the new methodologies in user-friendly software are probably responsible for this neglect. One important contribution to this issue was given by the R/S-PLUS survival package (Therneau and Lumley 2010). Thanks to this package, survival analysis is no longer limited to Kaplan-Meier curves and simple Cox models. Indeed, this package enables users to implement the methods introduced by Therneau and Grambsch (2000) for modeling multi-state survival data. In R (R Development Core Team 2010), multi-state regression can also be performed using the msm package by Christopher Jackson (continuous-time Markov and hidden Markov multi-state models; Jackson (2011)) and mstate Putter 2010, 2011). The changeLOS package Wangler, Beyersmann, and Schumacher (2006) implements the Aalen-Johansen estimator (Aalen and Johansen 1978) for general multi-state models, and the etm package Allignol, Schumacher, and Beyersmann (2011)  In addition, the current version of p3state.msm can also be used to draw inferences in the progressive three-state model depicted in Figure 1. This software can be used to fit the time-dependent Cox regression model (TDCM) as well as semi-parametric Cox proportional hazard regression models (Cox 1972) to all permitted transitions, by decoupling the whole process into various survival models (Cox Markov Model, CMM, and Cox semi-Markov Model, CSMM, Andersen et al. 2000). Numerical and graphical output for all methods can be easily obtained.
Regression (Cox-like) models can be fitted using, e.g., the tdc.msm software, and estimation of the Aalen-Johansen (Markov) estimator can be computed using the etm package. However, in the absence of the Markov property, without our software, appropriate methods for the computation of the transition probabilities would still be lacking. Moreover, in the progressive three-state model the package p3state.msm also provides other summary measures that greatly helps to understand the disease process.
The following section provides a brief introduction to the methodological background. Notation is introduced and nonparametric estimators for bivariate distribution function and transition probabilities are presented. Regression methods based on semi-parametric Cox regression models are also discussed. An overview of the use of p3state.msm is given in Section 3 and an example of its application in Section 4. A discussion is in Section 5.

Methodological background
Multi-state processes are characterized through transition probabilities between states h and j, which are expressed for s ≤ t as where H s − (σ − algebra) denotes the history of the process, which is generated and consists of the observation of the process over the interval [0, s); or through transition intensities, which are expressed as α hj (t) = lim ∆t→0 p hj (t, t + ∆t)/∆t representing the instantaneous hazard of progression to state j conditionally on occupying state h, and which are assumed to exist.
A number of possible models for the transition rates have been studied. These include: timehomogeneity; the Markov assumption; and the semi-Markov assumption.
This section will focus on the estimation of transition probabilities for the illness-death model (Meira-Machado, de Uña-Álvarez, and Cadarso-Suárez 2006). These estimators also apply to the case of the progressive three-state model, since this model can be viewed as a particular case of the illness-death model where no transitions are observed on disease-free mortality transition (1 → 3). For the progressive three-state model, estimators are also derived for the bivariate distribution function of the pair of gap times and for the distribution of the second gap time (de Uña-Álvarez and Meira-Machado 2008). The idea behind the proposed estimators is using the Kaplan-Meier estimator pertaining to the distribution of the total time to weight the data. The estimators for the transition probabilities can be regarded as an alternative to Aalen-Johansen estimators since they do not rely on the Markov assumption.

The illness-death model
Consider the illness-death model depicted in Figure 2. Let {X(t), t ≥ 0, X(0) = 1} denote the underlying stochastic process, where X(t) denotes the state being occupied at time t, for which all individuals are in state 1 at time zero. The stochastic behavior of the process is represented by a random vector (T 12 , T 13 , T 23 ) , where T hj is the potential transition from state h to state j, 1 ≤ h < j ≤ 3, in which T 23 is the sojourn time in state 2. The survival time of the process is given by This random vector may be subjected to a random right-censoring variable, denoted as C and assumed to be independent of (T 12 , T 13 , T 23 ) . Owing to censoring, only the following are observed: sojourn time in state 1, U = min(T 12 , T 13 , C) ; sojourn time in state 2, V = min(T 23 , C − T 12 ) ; observed total time Y = U + δV = min(T, C) (δ = I (T 12 ≤ min(T 13 , C))); and indicator statuses ∆ 1 = I (min(T 12 , T 13 ) ≤ C) and ∆ 2 = I(T ≤ C).
Traditionally, the transition probabilities are estimated via the non-parametric model (Aalen-Johansen estimator, Aalen and Johansen 1978). The performance of Aalen-Johansen estimator of stage occupancy probabilities was investigated by Datta and Satten (2001) when the process is not Markovian. These authors concluded that the Aalen-Johansen method provides consistent estimators in this case. However, no similar result is available when the target is the transition (rather than the occupancy) probability. Recently, Meira-Machado et al. (2006) verified that, in non-Markov situations, the use of these estimators for empirical estimation of the transition probabilities, p hj (s, t), may be inappropriate. Within the scope of the illnessdeath model, these authors propose alternative estimators for the transition probabilities, which do not rely on the Markov assumption. The quantities are determined by the joint distribution of (T 12 , T 13 , T 23 ). Specifically, knowledge of the distribution H of min(T 12 , T 13 ) will suffice for recovery of p 11 (s, t) while expectations of type S(φ) = E[φ(U, Y )] arise when handling p 12 (s, t) and p 22 (s, t) . The estimators are expressed aŝ where W i are the Kaplan-Meier weights attached to Y (i) ,Ĥ is the Kaplan-Meier estimator based on the pairs ( Note that for the illness-death model, the transition probabilities to be estimated reduce to p 11 (s, t), p 12 (s, t), and p 22 (s, t), since p 13 (s, t) = 1 − p 11 (s, t) − p 12 (s, t) and p 23 (s, t) = 1 − p 22 (s, t).

The progressive three-state model
Consider the progressive three-state model depicted in Figure 1. The stochastic behavior of the process is represented by a random vector (T 12 , T 23 ), deemed to be a pair of gap times of successive events which may be subjected to random right-censoring. Let C be the rightcensoring variable, assumed to be independent of (T 12 , T 23 ), and let T = T 12 + T 23 be the survival time of the process and Y = min(T, C) the observed total time.
The methods shown above (for the illness-death model, 1-3) apply to the progressive threestate model. Estimation of the marginal distribution for the second gap time, F 2 (y) = P (T 23 ≤ y), and estimation of the bivariate distribution function, F 12 (x, y) = P (T 12 ≤ x, T 23 ≤ y), constitute two additional topics of interest within the scope of the model depicted in Figure 1. Indeed, as T 23 and C 2 = (C − T 12 ) I (T 12 ≤ C) will in general be interdependent, estimation of the marginal distribution for the second gap time is not a simple issue. The same applies to the bivariate distribution function. A simple estimator was proposed by de Uña-Álvarez and Meira-Machado (2008), with the Kaplan-Meier estimator pertaining to the distribution of the total time being used to weight the data. The estimators are expressed asF (4) one can obtain an estimator for the marginal distribution of the second gap time, namely,

Regression models
One important goal in multi-state modeling is to study the relationships between the different predictors and the outcome. To relate the individual characteristics to the intensity rates through a possibly time-dependent covariate vector, Z, several models have been used in literature. A common simplifying strategy is to decouple the whole process into various survival models, by fitting separate intensities to all permitted transitions using semi-parametric Cox proportional hazard regression models, while making appropriate adjustments to the risk set.
For the illness-death model of Figure 2, the transition intensities, α hj (t; Z) , 1 ≤ h < j ≤ 3, may be modeled using Cox-like models of the form α hj (t; Z) = α hj0 (t) exp β T hj Z assuming the process to be Markovian. These models are known as Cox Markov models (CMM). The Markov assumption states that the future depends on the individual's past solely by means of his current state. However, by ignoring disease history behavior, Markov models may have severe limitations, thus rendering them inappropriate. One alternative approach is to use a Cox semi-Markov model (CSMM) in which the future of the process does not depend on the current time but rather on the duration in the current state. These models are also called "clock reset" models, because each time the patient enters a new state time is reset to 0. Assuming an illness-death model, the only difference between CMM and CSMM (Andersen et al. 2000) resides in transition 2 → 3 , in which intensity α 23 is modeled in a different way. Specifically, the corresponding intensity α 23 in the CSMM is given by α 23 (t − T 12 ; Z) = α 230 (t − T 12 ) exp β T 23 Z where T 12 is the entry time into state 2. These

p3state.msm: Analyzing Survival Data from an Illness-Death Model
Cox-like models (Markovian and semi-Markovian) can be fitted by means of most of the statistical packages (R, S-PLUS, SAS, etc.), provided that a counting process notation is used, with each patient being represented by several observations (Meira-Machado et al. 2009).
Though we assume different baseline hazards for the transition intensities, we note that, in some cases, one may impose some restriction on the baseline hazards. For example, for the illness-death model, one approach that is often considered is to assume the baseline hazards for transition 1 → 3 and for the 2 → 3 transition to be proportional. In such cases, the model for these transitions is given by α 13 (t; Z) = α 130 (t) exp β T 13 Z and α 23 (t; Z) = α 130 (t) exp β T 23 Z + δ .

Package description
The p3state.msm software contains nonparametric statistical methods for estimating quantities of interest such as transition probabilities, the bivariate distribution function for censored gap times, etc. This software is intended to be used with the R statistical program (R Development Core Team 2010). In the R language, programming is based on objects, and computations are basically specialized functions designed to perform specific calculations.
Our package is composed of 6 functions that enable users to fit the proposed models and methods. Table 1 provides a summary of the objects in this package.
Users can fit the proposed models and methods discussed in the previous section by means of the three functions, namely, p3state, summary and plot. Table 2 provides a summary of the arguments in the three functions.
It should be noted that only data is a required argument. Records in the data file must contain the following variables: times1, delta, times2, time, status, covariate1, covariate2, and so on. The structure of the data input is as follows: each individual is represented by one line Function Description p3state Main function for fitting regression models and obtaining multistate estimates (transition probabilities, bivariate distribution function, etc.). plot A function that provides the plots for transition probabilities, bivariate distribution function, and marginal distribution of the second time (the last two available solely for the progressive threestate model). summary Summary method for objects of class p3state. data.creation.reg Provides the adequate dataset for implementing regression models (TDCM, CMM, and CSMM). Same input data as for p3state. pLIDA Provides estimates for the transition probabilities using the methods in paper by Meira-Machado et al. (2006).

Biv
Provides estimates for the bivariate distribution function, using the paper by de Uña-Álvarez and Meira-Machado (2008). Available solely for the progressive three-state model. Starting value for computing the transition probabilities. NULL is equivalent to 0. col.biv A logical variable indicating whether you want color to be used in the filled.contour plot. By default col.biv = FALSE. of data. The variable times1 represents the observed time in state 1, and delta the indicator of transition to state 2 (taking a value of 1 if a transition to state 2 is observed, and a value of 0 otherwise). The variable times2 represents the observed time in state 2. If no transition into state 2 (delta = 0) is observed then times2 = 0. The variable time is just the observed total time (times1 + times2) whereas status is the final status of the individual (1 if the event of interest, representing state 3, is observed and 0 otherwise). The following variables are the covariates to be studied in the regression models. Note that possible courses for the individual include: 1 → 1 (the individual remains in state 1 until the end of the study; if delta = 0 and status = 0); 1 → 3 (a direct transition from state 1 into state 3 is observed; if delta = 0 and status = 1); 1 → 2 → 2 (if delta = 1 and status = 0); and 1 → 2 → 3 (if delta = 1 and status = 1).
Although only data is a required argument in p3state, note that for implementing regression models (alone) the argument formula is also necessary. The p3state function returns an object of class p3state with the following components: descriptives: vector with transition between states.
datafr: data.frame to be used for obtaining the nonparametric estimates and plotting purposes.
tdcm: a coxph object with the fit of the Cox regression model with time-dependent covariates.
msm12: a coxph object with the fit of the Cox model for transition from state 1 to state 2.
msm13: a coxph object with the fit of the Cox model for transition from state 1 to state 3 (solely for the illness-death model).
cmm23: a coxph object with the fit of the Cox Markov model for transition from state 2 to state 3.
csmm23: a coxph object with the fit of the Cox semi-Markov model for transition from state 2 to state 3.
tma: a coxph object with the fit of a Cox model for testing the Markov assumption.
tma2: the same as tma but with all the covariates in the model.
The object obtained when using the p3state function is the only argument required for summary. However, the arguments, regression, time1 and time2 are also required if the results from the regression model and the estimates for the other nonparametric methods are sought. This function prints several numerical results on the screen, i.e., parameter estimates with standard errors for the covariates for TDCM, CMM and CSMM models, transition probabilities estimates, and estimates for the bivariate distribution function and the marginal distribution of the second time (only in the case of the progressive three-state model).
The plot function provides the following graphical output: transition probabilities estimates; bivariate distribution function; and marginal distribution of the second time.
The dataset included in p3state.msm package is the well-known Stanford Heart Transplant data in a different format. Details about this dataset are given below.

Example of application: Stanford Heart Transplant data
An example of application is provided using the Stanford Heart Transplant data. A copy of the data may be obtained from statlib or in the R survival package. This data set is also available in the book by Kalbfleisch and Prentice (1980, Appendix I, pp. 230-232) or in the paper by Crowley and Hu (1977). The data set covers the period until 1974-04-01. In this period some patients died before an appropriate heart could be found. Of the 103 patients, 69 received heart transplants; the number of deaths was 75; the remaining 28 patients contributed with censored survival times. For each individual, an indicator of final vital status (censored or not), survival times (time since acceptance into the transplantation program to transplant and to death) from patient entry into the study (in days), and a vector of covariates including age at acceptance (age), year of acceptance (year), previous surgery (surgery: coded as 1 = yes; 0 = no), and transplant (coded as 1 = yes; 0 = no) were recorded. The covariate "transplant" is the only time-dependent covariate, while the other covariates included are fixed. These time-dependent covariates can be re-expressed as a multi-state model, with states based on the values of the covariate. If all subjects observe the intermediate event, then the time-dependent covariate renders it possible for the progressive three-state model to be used ( Figure 1); otherwise, it is feasible for the illness-death model, depicted in Figure 2, to be used. For the transplant heart data, the time-dependent covariate can be expressed as an intermediate event that can be modeled using an illness-death model with states, "alive without transplant", "alive with transplant", and "dead". This relationship will be used below to compare the Cox model with time-dependent covariates against common multi-state regression approaches (CMM and CSMM). Other targets include the estimation of transition probabilities. This will be done using p3state.msm.
In the following, we will demonstrate the package capabilities using data from the Stanford Heart Transplant Study. Bellow is an excerpt of the data.frame with one row per individual R> library("p3state.msm") R> data("heart2") R> head(heart2) Individuals represented in lines 1, 2, 5 and 6 experienced a direct transition from state 1 to state 3 (1 → 3); individuals represented in lines 3 and 4 had a heart transplant at 1 and 36 days, respectively, after enrolment, and died at 16 and 39 days (1 → 2 → 3), respectively. We note that delta = 1 and status = 0 corresponds to individuals with a transition from state 1 to state 2 and, afterwards, he/she exhibits a censored sojourn time in state 2 (1 → 2 → 2; individuals that receive a new heart and remain alive until de end of study); finally, delta = 0 and status = 0 corresponds to a censored sojourn time in state 1 (1 → 1; remained alive without a heart transplant).
Two central questions that arise in multi-state survival data are: what is the relationship between the different covariates and disease evolution; what is the rate (hazard) at which persons in state h move to state j. Both questions can be answered using our software. This will be shown below.
The p3state.msm software enables several semi-parametric Cox models to be fitted. The timedependent Cox model or multi-state Cox-like models (CMM and CSMM) can be constructed with the following input command: here treat denotes the time-dependent covariate associated with the occurrence of the intermediate state (in our application, transplantation, which is a binary time-dependent covariate); and in which the likelihood ratio test is of the model with all covariates versus a model with intercept only.
The output for such model can be obtained using the command obj1.p3state$tma2 (results not shown).
The patients course over time may also be studied through transition probabilities. To obtain these estimates (for a model with no covariates), the following input command must be typed: The results obtained with the last two input commands can be obtained with a single input command (results not shown), namely: R> summary(obj1.p3state, model = "CMM", time1 = 20, time2 = 200) The package also provides plots for several functions. Estimates of the the transition probabilities (for a model with no covariates) are displayed in Figure 3. These plots can be obtained with: R> plot(obj1.p3state, plot.trans = "all", time1 = 20) Just for the purposes of illustration, we created a new data set in which the individuals who experienced a direct transition from state 1 to state 3 (1 → 3) are taken as censored on death time. The aim of this subset is to illustrate the program in a progressive three-state model. This can be done with the following three lines of command: Estimates for the transition probabilities and estimates of regression effects can be obtained in the same way as for the illness-death model.
The outputs for the bivariate distribution function and for the marginal distribution of the second gap time (time since transplantation) are useful displays that greatly helps to understand the patients' course over time. Estimates and plots for these quantities can easily be obtained. The following three input commands provide the corresponding numerical and graphical output (Figures 4, 5 and 6):

Conclusion
This paper discusses implementation in R of some newly developed methods in multi-state models. The p3state.msm package uses methods proposed by Meira-Machado et al. (2006) (transition probabilities) and de Uña-Álvarez and Meira-Machado (2008) (bivariate distribution function for the censored gap times in the progressive three-state model). The main novelty of these estimators is that they do not rely on the Markov assumption, typically assumed to hold in a multi-state model. The software also enables the user to easily ob- tain estimates of regression parameters, assuming that each transition may be specified by a Cox-type model. Numerical results as well as graphics are easily obtained.
As in other Cox analyses some care is necessary when performing regression modeling (assumptions, baseline hazards, etc.). We note that the software returns an object of class p3state with components that allow users to perform a detailed analysis of these models. We mention three important topics that we shall consider in future versions of the package. First, covariates have not been included in our nonparametric methods, e.g. transition probabilities. Another topic of much practical interest is that of providing pointwise confidence bands for the transition probabilities. To this end, we note that the (1 − α)100% limits for the confidence interval of p hj (s, t) can be obtained using pointwise confidence bands based on the bootstrap. Though this can be achieved using the current version of our paper it will be quite demanding. Finally, it is our belief that it may be valuable to include an option where the baseline intensities (for Cox-like regression models) are estimated using methods described in paper by Meira-Machado et al. (2006).
We plan to constantly update p3state.msm to cope with other multi-state models such as the progressive k-state model and the bivariate model (for bivariate failure times).