SemiMarkov: An R Package for Parametric Estimation in Multi-State Semi-Markov Models

Multi-state models provide a relevant tool for studying the observations of a continuous-time process at arbitrary times. Markov models are often considered even if semi-Markov are better adapted in various situations. Such models are still not frequently applied mainly due to lack of available software. We have developed the R package SemiMarkov to ﬁt homogeneous semi-Markov models to longitudinal data. The package performs maximum likelihood estimation in a parametric framework where the distributions of the sojourn times can be chosen between exponential, Weibull or exponentiated Weibull one. The package computes and displays the hazard rates of sojourn times and the hazard rates of the semi-Markov process. The eﬀects of covariates can be studied with a Cox proportional hazards model for the sojourn times distributions. The number of covariates and the distribution of sojourn times can be speciﬁed for each possible transition providing a great ﬂexibility in a model’s deﬁnition. This article presents parametric semi-Markov models and gives a detailed description of the package together with an application to asthma control.


Introduction
In multi-state models of longitudinal data usually a process is assumed to be Markovian that is the conditional probability distribution of future states depends only on the present state, not on the whole sequence of past events.In various applications, the intensities between states are supposed to be constant in time (homogeneity assumption) or piecewise constant (Huszti et al. 2012;Saint-Pierre et al. 2003;Aguirre-Hernandez and Farewell 2002).A few R packages have been developed to simplify the usage of multi-state Markov models.The 2SemiMarkov: An R Package for Parametric Estimation in Multi-State Semi-Markov Models msm package (Jackson 2012) allows to fit homogeneous Markov or hidden Markov model in continuous-time and discrete-time.Non and semi parametric estimation of non homogenous Markov models or competing risks models are possible using mstate package (Putter et al. 2011).The etm package (Allignol 2012) computes the Aalen-Johansen empirical transition matrix whereas p3state (Meira-Machado and Roca-Pardinas 2012) focuses on the illness-death model.
A non homogeneous Markov model is well adapted when the process evolution depends on calendar time, age or time since the beginning of the study.However, the memoryless property implies that the waiting times distributions in a Markov model is exponential.In cases when this assumption is too restrictive, semi-Markov models can be considered since they involve distributions of sojourn times as parameters.From a theoretical point of view, several results are given in Limnios and Oprisan (2001).Non homogeneous semi-Markov models are very complex and are rarely used in practical situations (Monteiro et al. 2006).In the homogeneous case, a non parametric estimation of the semi-Markov process hazard rate can be found in Gill (1980) or in Ouhbi and Limnios (1999).The parametric maximum likelihood estimation is based on a parametric definition of the sojourn times distributions (Pérez-Ocón et al. 1999).Indeed, the Weibull or the exponentiated (generalized) Weibull distributions are efficient and flexible to fit the ∩ or ∪ shape (of the hazard rates) common in biology (Foucher et al. 2006), life science and reliability.Moreover, the parametric model allows to incorporate covariates in the distribution of sojourn times using a proportional-hazards regression model (Cox 1972).
Few R (R Development Core Team 2010) packages have been developed to handle semi-Markov or hidden semi-Markov model.The mhsmm package (O'Connell and Hojsgaard 2012) performs estimation and prediction for multiple observation sequences in hidden semi-Markov models.The msSurv package (Ferguson et al. 2012) provides non parametric estimation in semi-Markov models but covariates are not considered.However, it seems that the parametric approach is not yet implemented in statistical software.We have developed an R package named SemiMarkov (Listwon and Saint-Pierre 2013) which performs parametric estimation in a homogeneous semi-Markov model.The waiting times distributions can be chosen to be the exponential, the Weibull, or the exponentiated Weibull distribution.Maximum likelihood estimations of both, hazard rates of the semi-Markov process and hazard rates of sojourn times can be deduced.Moreover, the effects of covariates on the process evolution can be studied using a semi-parametric Cox model for the distributions of sojourn times.The number of states, the possible transitions between them and the number of covariates affecting each transition can be chosen in order to fit sparse model adapted to a specific application.
The rest of this paper is organized as follows.Section 2 describes the multi-state semi-Markov models and the parametric maximum likelihood estimation used in the SemiMarkov package.Section 3 describes the SemiMarkov package whereas the section 4 illustrates the different functions included in the package through an example on severe asthma.Conclusions and possible future extensions of this R package are discussed in Section 5.

Homogeneous semi-Markov process
Let us consider a Markov renewal process (J n , T n ) n∈N where 0 = T 0 < T 1 < . . .< T n < ∞ are the successive times of entry to states J 0 , J 1 , . . ., J n where J n = J n+1 for all n ∈ N. The sequence (J n ) n∈N is an embedded homogeneous Markov chain taking values in a discrete finite state space E with transition probabilities p hj = P (1) Let N (t) = sup{n ∈ N : T n ≤ t , t ∈ R + } the counting process which counts the total number of observed transitions during the time interval [0, t].The process J N (t) , which represents the state of the process at time t, defines a homogeneous semi-Markov process.
The probability distribution function of sojourn times is related to the semi-Markov kernel through the transition probabilities of the embedded Markov chain, Let us suppose that the survival function G hj (.), the density function f hj (.) and the hazard rate α hj (.) associated to this probability distribution can be defined.The survival function of sojourn time in state h is defined by The hazard rate of the semi-Markov process is defined by

Sojourn times distribution
Let us assume that distributions of sojourn times belong to a parametric family.The simplest model is obtained using the exponential distribution E(σ hj ), for which the hazard rate is constant over time (corresponding to the Markov case) and is related to a single positive parameter σ, The Weibull distribution (Weibull 1951), which generalizes the exponential one, is often used in practical applications.Indeed, the Weibull distribution with two parameters W(σ hj , ν hj ) is well adapted to deal with various shapes of monotone hazards, where σ hj > 0 is a scale parameter and ν hj > 0 is a shape parameter.The exponentiated Weibull distribution EW(σ hj , ν hj , θ hj ) (Mudholkar and Srivastava 1993) with an additional 4SemiMarkov: An R Package for Parametric Estimation in Multi-State Semi-Markov Models shape parameter θ hj > 0 is very useful to fit ∪ and ∩ shapes of hazard rates These three distributions which allow to fit various shapes of the hazard ratio are nested: a EW(σ hj , 1, 1) is equivalent to a W(σ hj , 1) which is equivalent to a E(σ hj ).The Wald test can be used to test each parameter and evaluate the relevance of a given distribution.

Parametric maximum likelihood estimation
In a parametric framework, distributions of sojourn times are supposed to belong to a class of parametric functions.For each transition, the distribution (which depends on a finite number of parameters) can be specified using either the hazard rate α hj (•), the density f hj (•) or the cumulative distribution function F hj (•).
The likelihood function associated to a single semi-Markov process can be written as where N is the total number of observed transitions between two different states, U denotes the duration between the time of the last observation and T N time of the end of the study.The indicator δ is equal to 1 if the last sojourn time U is right-censored by the end of the study.Indeed, the last duration and the last arrival state are unknown unless the process entered an absorbing state (δ = 0).When an observation is right-censored (δ = 1), the survival function of the sojourn times is taken into account.The first part of Equation 7 involves density function and probabilities of Markov chain, it corresponds to the contribution of the observed transitions.

Consider that each individual
Equation 7 can be used to compute each individual contribution to the likelihood L i .The full likelihood L is the product of all individual likelihood contributions L i .

Cox proportional model
The influence of covariates on the sojourn times distributions can be studied using a Cox proportional regression model (Cox 1972).Let Z hj be a vector of explanatory variables and β hj a vector of regression parameters associated to the transition from state h to state j.The hazard rate is defined by where α 0 (d) denotes the baseline hazard defined in section 2.2.
In this model, the regression coefficients can be interpreted in terms of relative risk.As in the Cox model, time-dependent covariates can be considered assuming that the value of the covariate is constant between two consecutive events.Let us mention that the previous notation allows to consider different sets of covariates for each transition.It is then possible to consider sparse models with only significant regression parameters.
3. The SemiMarkov R package

Package description
The SemiMarkov package was developed to analyze longitudinal data using multi-state semi-Markov models.The main function semiMarkov of the package computes the parametric maximum likelihood estimation in the homogeneous semi-Markov model introduced in section 2.

Format of data
A data set asthma is included in the SemiMarkov package.This cohort study (longitudinal data) of severe asthmatic patients can be analyzed using multi-state semi-Markov models.
The data frame to be used in the function semiMarkov must be similar to the asthma data : a The rows must be grouped by individuals and ordered chronologically within groups.By definition of a semi-Markov model the waiting times must be known.Therefore, the transitions between the same states are not possible.If such transitions are observed, the row must be combined with the next transition to obtain a transition from state h to state j with h = j.
The last sojourn time of a semi-Markov process is observed only when the process enters an absorbing state.In other cases, the final state and the last sojourn time is unknown due to right-censoring process.In such case, it is only known that the censored sojourn time is greater than the last observed sojourn time (in practice, the last observed sojourn time is deduced from the date of the end of the study).A censored transition can be specified by a transition from h to h (so that such transitions are distinct from the rest of transitions).One can also identify the unknown arrival state using the argument cens.The dataset may also include additional explanatory variables (for instance, some individual's characteristics).The values of these covariates must be given for each individual and for each transition in order to take fixed or time-dependent covariates into account (one value for each row of the data frame data).

Functions description
Following is a brief description of the package functions.

Sojourn times distribution
The parametric estimation in homogeneous semi-Markov models is based on the specification of the sojourn times distribution.The following distributions are available in the package SemiMarkov : exponential ("E", "Exp" or "Exponential"), Weibull ("W" or "Weibull") and exponentiated Weibull ("EW", "EWeibull" or "Exponentiated Weibull").If the logical value TRUE is given then the default is the Weibull distribution.These distributions are nested when the appropriate parameters are equal to 1 (see Section 2).The estimations of the distribution parameters are given with standard deviations and p-values of the Wald test (H 0 : Θ hj = 1).One can then evaluate, for instance, the relevance of the exponentiated Weibull distribution in comparison to the Weibull or the exponential distribution.

Multi-state model definition
The multi-state approach requires to define the states of the process and to specify the structure of the model (the number of states and the possible transitions between them).The function table.statereturns a matrix which gives the number of observed transitions in the data set.This function can help to define the argument mtrans required in the semiMarkov and the hazard functions.The square matrix mtrans includes informations on possible transitions and on the distributions of waiting times.The element hj of the matrix mtrans is either a logical value FALSE (when the transition from h to j is not possible) or a character representing the sojourn time distribution.According to semi-Markov models, the diagonal elements of mtrans are all equal to FALSE.Note that only the transitions specified in mtrans will be considered in the analysis.In case of the three-state model described in Figure 1 where the sojourn times associated to each transition are Weibull distributed, the matrix mtrans will be defined as follows R> mtrans The argument states is a character vector used to define the names of states, possible values are those included in the data's columns state.hand state.j.

Covariates
The effect of covariates on the process evolution can be investigated considering a Cox proportional hazard model for the hazard rates of waiting times.A set of covariates can be specified using the argument cov.The argument cov_tra is used to indicate which covariates affect which transitions : cov_tra is a list of vectors where the k-th vector provides the transitions affected by the k-th covariate.The elements of these vectors can only consist of transitions specified in the argument mtrans.For example, considering the model defined in Figure 1 where a covariate "A" affects all transitions leaving state 1 whereas the covariate "B" affects the transitions leaving state 2: the arguments will be respectively cov=as.data.frame(cbind(A,B))and cov_tra=list(c("12","13"),c("21","23")).The interpretation of the regression coefficients in terms of relative risks (as in the Cox model) can help to quantify the effect of covariates and to understand the process evolution.For each estimation of regression coefficients, standard deviation and p-value of the Wald test (H 0 : β = 0) are given.

Initial values
The optimization procedure used in the maximum likelihood estimation requires definition of initial values of the parameters: the distribution parameters, the transition probabilities and the regression coefficients associated to the covariates.Default values are equal to 1 for the distribution parameters, and 0 for the regression coefficients.The initial transition probabilities are calculated by simple proportions: the number of observed transitions from state h to state j divided by the total number of observed transitions from state h.The function param.initcan be used to define specific initial values of the parameters.An object of class param.initcan then be given as argument in the semiMarkov and hazard functions.The total number of parameters depends on: the number of states, the possible transitions, the chosen distributions and the covariates.

Parametric maximum likelihood estimation
The semiMarkov function The main function semiMarkov estimates the parameters of a multi-state homogeneous semi-Markov model using the parametric maximum likelihood estimation.Several R packages are needed to run the function.The package numDeriv (Gilbert and Varadhan 2012) that allows to approximate the hessian matrix of second derivatives for the estimated parameters.The package MASS (Ripley et al. 2012) is used to obtain the inverse of the hessian matrix.The maximization of the likelihood is performed using the Augmented Lagrangian Adaptive Barrier Minimization Algorithm implemented in the function constrOptim.nlfrom the R package alabama (Varadhan 2012).Indeed, we have to face an optimization with constraints since sums of the probabilities associated to transitions leaving the same states are all equal to 1 (the sums in rows of transitions matrix).
The following arguments are used in the function semiMarkov: arguments related to the data (data, cov), arguments related to the model (states, mtrans, cov_tra, cens) and initial values (dist_init, proba_init, coef_init).Default values are defined for the distributions of waiting times and for the initial values.The function semiMarkov returns an object of the class semiMarkov which recalls the chosen model, gives informations on the optimization method and provides the parameters estimation together with their standard deviations.For each regression coefficient β, the p-value of the Wald test when testing the absence of effect (H 0 : β = 0) is also provided whereas for each distribution parameter σ (or ν or θ) the p-value of the Wald test when testing H 0 : σ = 1 is given.The Wald test for the transition probabilities is less useful and is not performed.

The hazard function
The hazard rate of sojourn time and the hazard rate of the semi-Markov process can be deduced from the parameters and the distributions of sojourn times using Equation 8and Equation 3, respectively.The function hazard computes vectors of hazard rates values using either the estimations included in an object of class semiMarkov or the specific values defined by an object of class param.init.The argument type is used to choose the type of hazard rate: alpha for the hazard rates of waiting times and lambda for the hazard rates of the semi-Markov process.
The hazard function returns the values of the hazard rates associated to a vector of times.By default, the hazard rates are calculated for a vector of ordered times of length 1000 where the starting value is equal to 0 and the ending value is determined by the longest sojourn times.The length of a vector, its starting and ending values can be specified by the user.One can also enter a whole vector of times, for instance, the different values of the sojourn times observed in the data.If covariates are used in the model, the hazard rates can be obtained for given values of the covariates using the argument cov: for time-fixed covariate a single value is needed whereas a vector of values is required for time-dependent covariates.By default, all covariates values are set equal to 1.Note that the function hazard does not require to specify the model or the distributions.Indeed, these informations are already included in the objects semiMarkov or param.init.

Showing results
An object of the class semiMarkov contains data description, the considered model and the results of the MLE that may be displayed using summary.semiMarkovand print.semiMarkov.The functions summary.hazardand print.hazardprovide the type of hazard rates, the vector of times and the associated values of hazard rates.An object of the class hazard can be plotted using plot.hazard.For each transition, the function generates a plot representing one or more (up to ten) hazard rates.

Application to asthma control data
A follow-up study of severe asthmatic patients was conducted in France between 1997 and 2001 by ARIA (Association pour la Recherche en Intelligence Artificielle).Adult asthmatics were prospectively enrolled over a 4-year period by a number of French chest physicians.The data reflects the real follow-up of patients consulting at varied times according to their perceived needs.At each visit, several covariates were recorded and asthma was evaluated

State 2
Sub-optimal control

State 1
Optimal control

State 3
Unacceptable control Figure 1: The three states model used for asthma control evolution.
using the concept of control scores (Saint-Pierre et al. 2006).The control scores can be used to define the subject's state at each consultation.The considered model to study the evolution of asthma consists of three transient states (Figure 1): the optimal control (State 1), the sub-optimal control (State 2) and the unacceptable control (State 3).A random selection of 371 patients with at least two visits (data asthma) is included in the package SemiMarkov.
A total of 557 transitions between states are observed and no deaths are reported.In a primary analysis, the data are stratified according to the values of the covariates.The effect of covariates and the proportional hazard assumption can be evaluated by representing the hazard rates in each stratum.In a second step, a proportional model can be considered to study the effect of covariates.For instance, one can consider a model with BMI as covariate and the Weibull distribution for the waiting times.

R>
R> plot(hazard(fit,cov=0), hazard(fit,cov=1),transitions=c("31")) R> plot(hazard(fit,cov=0, type="lambda"), hazard(fit,cov=1,type="lambda"), transitions = c("31")) Finally, multivariate models can be considered.However, the number of parameters can quickly be too important comparing to the size of data set (due to the complex form of the waiting times distributions and to the number of covariates under study).The optimization method can then fail to reach convergence.It is therefore important to consider sparse models using adapted distributions and keeping only the regression coefficients significantly different from 1.
R> SEV <-as.data.frame(asthma$Severity)R> fit2 <-semiMarkov(data=asthma, cov=as.data.frame(cbind(BMI,SEV)),states= states, mtrans=mtrans, cov_tra=list(c("13","31"),c("23"))) Semi-Markov multi-state models are proven to be very useful in various applications.They are extensions of Markov models in which the evolution of a process is independent from time spent in a state between two consecutive events.Such assumption is too stringent in some applications.In such case, semi-Markov models are of great interest for modeling the sojourn (waiting) times distributions.However, the implementation of such approach is complex and there are barely any packages or macros to adjust such models.The SemiMarkov package allows to fit parametric homogeneous semi-Markov models by maximizing the likelihood.The choice of waiting times distributions, in particular the exponentiated Weibull distribution, allows to fit various shapes of hazard rates functions.An advantage of the parametric approach is the possibility to study the effects of covariates via a proportional hazard model.In order to obtain sparse models adapted to the process of interest, the user can choose the number of covariates and the distributions of waiting times for each transition.Some extensions of the SemiMarkov package could be of interest.For instance, the package could be updated to deal with more waiting times distributions.The methodology can be adapted to include random effects in order to deal with the correlation between subjects.Interval censored data could also be analyzed using a penalized likelihood approach (Foucher et al. 2010) or using an estimation method with piecewise constant hazard rates (Kapetanakis et al. 2012).The optimization step is a crucial point and need to be investigated.Indeed, the multi-state approach is often limited by the number of parameters with several covariates.The adaptation of methods dealing with high dimensional data to the multi-state model framework is of high interest as well.

6SemiMarkov:
An R Package for Parametric Estimation in Multi-State Semi-Markov Models

12SemiMarkov:
Figure 2: The hazard rate of waiting time (left) and the hazard rate of the semi-Markov process (right) for the transition 31 for BMI=0 (black line) and for BMI=1 (red line).
table.state:Computes a frequency table counting the number of observed transitions in the data set.param.init:Defines default or specified initial values of the parameters.For any object of classes semiMarkov and param.init, the function computes the values of the hazard rate of sojourn times or the values of the hazard rate of the semi-Markov process for a given vector of times.summary.semiMarkov,summary.hazard,print.semiMarkov,print.hazard:Summary and printing methods for objects of classes semiMarkov and hazard.
hazard:plot.hazard:Plot method for objects of class hazard.
The data frame asthma is a table with one row per transition.The rows corresponding to the same subject are grouped and ordered chronologically.The columns of asthma data are : the patient identification number (id), the state left by the process (state.h), the arrival state (state.j), the sojourn time in state state.h(time)andbinarycovariates (Severity, BMI, Sex).Note that the variable BMI is a time-dependent covariate.There are no absorbing states in the considered model (Figure1).The last sojourn time is then right-censored.Its value is the time between the last visit and the date of the end of the 10SemiMarkov: An R Package for Parametric Estimation in Multi-State Semi-Markov Models study.A censored observation is identified by a transition into the same state.In such case, the value of state.h is equal to the value of state.jand the value of time is the censored sojourn time.
fit$table.distTheregression coefficients associated with BMI can be analyzed using the Wald test statistics (H 0 : β hj = 0).For instance, the estimation of the coefficient associated to the transition from state 3 to state 1 is significantly different from 0 (β = −0.447,p = 0.028).It means that a BMI≥25 decreases the risk of leaving the unacceptable state to enter the optimal control state.