The R Package threg to Implement Threshold Regression Models

This paper introduces the R package threg, which implements the estimation procedure of a threshold regression model, which is based on the first-hitting-time of a boundary by the sample path of a Wiener diffusion process. The threshold regression methodology is well suited to applications involving survival and time-to-event data, and serves as an important alternative to the Cox proportional hazards model.This new package includes four functions: threg, and the methods hr, predict and plot for ‘threg’ objects returned by threg. The threg function is the model-fitting function which is used to calculate regression coefficient estimates, asymptotic standard errors and p values. The hr method for ‘threg’ objects is the hazard-ratio calculation function which provides the estimates of hazard ratios at selected time points for specified scenarios (based on given categories or value settings of covariates). The predict method for ‘threg objects is used for prediction. And the plot method for ‘threg’ objects provides plots for curves of estimated hazard functions, survival functions and probability density functions of the first-hitting-time; function curves corresponding to different scenarios can be overlaid in the same plot for comparison to give additional research insights.


Introduction
Threshold regression is a statistical methodology to analyze time-to-event or survival data, taking covariates into account. Before we discuss threshold regression, we briefly introduce the characteristics of survival data. Survival data record the time until the occurrence of a key event such as death of a patient, occurrence of tumor, discharge from hospital, or cessation of breast feeding. Survival data often contain censored observations for which we are not able to observe the exact event time. For example, in a study of risk factors for lung cancer, the time of diagnosis of lung cancer is to be recorded for each person recruited. At the end of study follow-up, however, some people may not have been diagnosed with lung cancer. Those people might or might not develop lung cancer later, but that is unknown at the time when the data need to be analyzed. The incompletely observed event times encountered in survival data make their analysis different from other classical methods used for completely observed data, such as ordinary linear regression. Survival and time-to-event data arise in many applied fields in addition to medicine and public health, including engineering, sociology, demography, and economics to name a few.
Among the many statistical models used for survival data, the most widely employed is the Cox proportional hazards model (Cox 1972) which superimposes a regression structure for covariates on top of a baseline hazard function that has an arbitrary form. As its name implies, the Cox proportional hazards model assumes that covariates alter the baseline hazard function in a proportional manner. Hence it is crucial to check if the proportional hazards assumption is satisfied before using the Cox model. When the proportional hazards assumption is violated, the model should not be used or at least used with caution.
Threshold regression methodology is a powerful tool to analyze survival data. Threshold regression does not assume proportional hazards and can serve in place of Cox proportional hazards regression, when the proportional hazards assumption does not hold. The defining feature of threshold regression is that the event time is defined as the first time an underlying stochastic process hits a boundary threshold. In a medical context, for example, the event of interest might be death and the time of death is the moment when the patient's latent health status first reaches a boundary at zero.
The R (R Core Team 2015) package threg (Xiao 2015), which is presented here, encompasses the most important threshold regression model developed to date and the most widely applied, namely, the Wiener threshold regression model. This model applies to settings where the underlying health process follows a Wiener diffusion process and the failure event is triggered when the process hits a fixed threshold for the first time. The first hitting time in this situation has an inverse Gaussian distribution, which has a very tractable mathematical form. Threshold regression in its general formulation covers a broad collection of models that extend well beyond first hitting times for Wiener diffusion processes. Different families of threshold regression models are created by assuming different types of stochastic processes and different boundaries or thresholds. For example, one can have gamma processes, curvilinear boundaries, Markov chains with absorbing states, and many more. Lee and Whitmore (2006) present a wide-ranging review of the general methodology and types of applications.
The connection between threshold regression and Cox proportional hazards regression is studied in depth in Lee and Whitmore (2010). They show that selected threshold regression models can be constructed that have the proportional hazards property. They also remark, however, that the proportional hazards property is not valid for many real applications even though it is frequently assumed for analytical convenience. The Wiener diffusion model covered by our threg package does not require the proportional hazards property. The hazard function of its first hitting time distribution can take a variety of shapes that are commonly encountered in practical work. Lee and Whitmore (2010) discuss a range of advantages that threshold regres-sion has as a modeling approach. These advantages include more realism in describing real world phenomena because it takes explicit account of the underlying health process and has a realistic mechanism for the time-to-event, namely, that of a first hitting time of a threshold or boundary. The richer model structure of threshold regression also tends to offer deeper insights into the data being analyzed. For example, the Wiener threshold regression model can have separate regression functions for a patient's initial health level (baseline health) and for the mean parameter of a patient's health trajectory (the health trend), which together give the model a capacity to capture the multifaceted impact of covariates on a patient's health experience.
Currently there exists the coxph function in package survival (Therneau and Grambsch 2000;Therneau 2012) in R for implementing the Cox proportional hazards regression, but there is no R package yet to implement the threshold regression, which, as stated above, serves as an important alternative for the Cox proportional hazards regression. This paper presents the R package threg (version: 1.0.3) for implementing Wiener threshold regression. This package was created in version 3.0.3 of R, and depends on the following two packages: survival (version: 2.36.14) and Formula (version: 1.1.0). It is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=threg.
The paper is organized as follows. In Section 2 we introduce an example dataset from a leukemia study, for which the proportional hazards assumption is not reasonable. We briefly compare the performance of Wiener threshold regression and Cox proportional hazards regression for this dataset. In Section 3, we present a theoretical overview of threshold regression in the Wiener diffusion case. In Section 4 we describe features of the threg function and the hr, predict and plot methods for 'threg' objects returned by function threg included in the threg package that implements this threshold regression model and also give corresponding examples using a bone marrow transplantation dataset. Conclusions are presented in Section 5.

An example with non-proportional hazards
In this section we give a quick comparison between the threshold regression and the Cox proportional hazards regression with lkr, a leukemia remission study dataset (Garrett 1997). There are 42 patients in the dataset, who were monitored for whether they relapsed (relapse: 1 = yes, 0 = no) and for how long (in weeks) they remained in remission (weeks). Two different treatments were given to these patients. For the first treatment, 21 patients received a new experimental drug (drug A), and the other 21 received a standard drug (treatment1: 1 = drug A, 0 = standard). For the second treatment, a different drug (drug B) was given to 20 patients, while the remaining 22 patients received a standard drug (treatment2: 1 = drug B, 0 = standard). The dataset also records white blood cell count, which is a strong indicator of the presence of leukemia and is categorized into three levels (wbc3cat: 1 = normal, 2 = moderate, 3 = high).
The variable treatment2 violates the proportional hazards assumption. Next we focus on this variable to demonstrate the advantage of the threshold regression model over the Cox model when the proportional hazards assumption is violated.
Firstly Kaplan-Meier non-parametric estimates of the survival curves for the two levels of treatment2 are given in Figure 1.   We can clearly see that the two survival curves in Figure 1 cross each other. Before a time point around week 12, the estimated survival probability (probability of remaining in remission) of the drug B group is lower than that of the standard drug group. After that point, however, the estimated survival probability of the drug B group is higher than the standard drug group. The crossing survival curves suggest that the proportional hazards assumption, which is the key assumption for the Cox model, does not hold.
The threshold regression model based on the Wiener process, however, does not assume proportional hazards. We can fit the threshold regression model on the treatment2 variable in the lkr dataset contained in the threg package by using the threg function: R> library("threg") R> data("lkr", package = "threg") R> lkr$f.treatment2 <-factor(lkr$treatment2) R> fit <-threg(Surv(weeks, relapse)~f.treatment2 | f.treatment2, + data = lkr) R> fit Then we can plot the predicted survival curves for the two levels of the treatment2 variable by using the plot method for 'threg' objects returned by threg: R> plot(fit, var = f.treatment2, graph = "sv", nolegend = 1, nocolor = 1) R> legend(20, 1, c("Standard", "Drug B"), lty = 1:2) The predicted survival curves generated by the codes above are given in Figure 2. Here we notice that the crossing of the predicted survival curves of the two levels of treatment2 in the Kaplan-Meier plot ( Figure 1) is well captured by the threshold regression model, as illustrated in Figure 2. However, if the Cox model is erroneously used, the predicted survival curves of the two groups shown in Figure 3 fail to capture the crossing pattern.
Using the hr method for 'threg' objects to calculate the hazard ratios at different time points illustrates the advantages of the threshold regression model over the Cox model when the proportional hazards assumption is violated.
The following use of the hr method for 'threg' objects calculates the hazard ratio of the drug B group vs. the standard drug group at week 5. This hazard ratio is calculated as 2.08.
R> hr(fit, var = f.treatment2, timevalue = 5) Similarly we can use the hr method for 'threg' objects again to calculate the corresponding hazard ratio at week 20. This hazard ratio is calculated as 0.12.
It is clear that the hazard ratios calculated at week 5 and week 20 are quite different: at week 5, the hazard ratios of the drug B group versus the standard drug group is about 2.08; while at week 20, this hazard ratio reverses and sharply decreases to about 0.12. Note that the standard drug group is the reference group. Obviously the change of hazard ratio over time is huge and the change has been well captured by using the threg function. The graph of the estimated hazard functions of the two treatment groups are also generated by the threg function above by using the graph = "hz" argument. This graph is given in Figure 4, from which we can see that the hazard function curves of the two treatment groups cross over time, and that is why the estimated hazard ratio of the drug B group versus the standard drug group changes from a number greater than 1 (2.08) to a number less than 1 (0.12). On the other hand, we can only estimate a constant hazard ratio from the Cox model (i.e., 0.73), across the whole time span for the drug B group versus the standard drug group, due to the proportional hazards assumption. Clearly, this outcome is misleading. This constant hazard ratio by the Cox model can be obtained by the coxph function of the survival package in R as follows: R> library("survival") R> summary(coxph(Surv(weeks, relapse)~treatment2, data = lkr))

Theory of threshold regression
In this section we will introduce the genesis of the threshold regression methodology briefly. We assume that the latent stochastic process is a Wiener process Y (t) starting at y 0 with drift µ and variance σ 2 , and has the following properties: 1. Y (t) has independent increments; for any non-overlapping time intervals (t 1 , t 2 ), (t 3 , t 4 ), When we regard this Wiener process as the latent health status process, we can let Y (0) = y 0 > 0 be the initial health status, and define T as the first time a sample path of the health status process reaches 0 level, i.e., T = inf{t : Y (t) = 0}. T is considered as the event time in survival data analysis by the threshold regression. Figure 5 is an illustration of a sample path of such a latent health status process.
By using either the backward or forward diffusion equations subject to the initial condition and the boundary condition for the absorbing barrier (see Cox and Miller 1965), it can be derived that T follows the inverse Gaussian distribution with the following probability density function (p.d.f.). where where Φ(·) is the cumulative distribution function of the standard normal distribution. Note that if µ > 0, the Wiener process may never hit the boundary at zero and hence there is a probability that the FHT is ∞, with P (FHT = ∞) = 1 − exp(−2y 0 µ/σ 2 ) (Cox and Miller 1965).
By carefully examining Equation 1 and Equation 2, it can be seen that both f (t|µ, σ 2 , y 0 ) and F (t|µ, σ 2 , y 0 ) actually depend on y 0 /σ and µ/σ only. Hence we need to fix one of three parameters (µ, y 0 , σ) to avoid over-parameterization. Because the degradation process is latent with undefined measurement scale, we choose to set the variance parameter σ 2 = 1 in the threg package without loss of generality. Then we can regress the other two process parameters, y 0 and µ, on the covariate data.
Suppose that the covariate vector is X = (1, X 1 , . . . , X k ), where X 1 , . . . , X k are covariates and the leading 1 in X allows for a constant term in the regression relationship. We assume that µ and ln(y 0 ) are linear in regression coefficients. Then ln(y 0 ) and µ can be linked to the covariates with the following regression forms: where γ = (γ 0 , . . . , γ k ) and β = (β 0 , . . . , β k ) are regression coefficient vectors. If some covariates are regarded as unimportant in predicting ln(y 0 ) or µ, then these covariates can be removed from the regression model by setting the corresponding elements in γ or β to zero.
For example, if covariate X 1 in X is not important to predict ln(y 0 ), we can set γ 1 to zero in Equation 3.

The uses of functions provided by the threg package
In this section, we detail how to apply the threshold regression in survival data analysis by using the functions provided by the threg package. Examples are given at the end of each subsection, and for all the examples we use the bone marrow transplantation dataset (Klein and Moeschberger 2003). This dataset contains 137 acute leukemia patients treated with bone marrow transplants which are a standard treatment for acute leukemia. Recovery following bone marrow transplantation may depend on many risk factors known at the time of transplantation. In our examples below, we simply choose three risk factors, which are previously reported as important, to illustrate how to use the four functions provided by the threg package. These three risk factors are: 1. Patient age (the recipient_age variable).
2. Risk categories based on their status at the time of transplantation (the group variable). These categories are: acute lymphoblastic leukemia (coded 1 for group), low-risk acute myelocytic leukemia (coded 2 for group), and high-risk acute myelocytic leukemia (coded 3 for group).
3. French-American-British (FAB) classification based on standard morphological criteria (the fab variable): coded 1 if the patient has acute lymphoblastic leukemia with an FAB classification of M4 or M5, and 0 otherwise. The former patients were considered to have a possible elevated risk of relapse or treatment-related death.
The study time (time to relapse, death or end of study) variable is time (in days) and the censoring indicator variable is indicator (coded 1 if dead or relapse, and 0 otherwise). The maximum follow-up for these 137 patients was 7 years.
Next we detail how to use the threg function and the hr, plot and predict methods for 'threg' objects with their arguments.

Regression coefficient estimation: threg
Maximum likelihood estimation is used to estimate the regression coefficients in Equation 3 and Equation 4. A subject i in the sample dataset who has an exact death time observed provides the information on the probability that the event is occurring at this time, and therefore contributes the FHT probability density f (t (i) |µ (i) , y (i) 0 ) to the sample likelihood function, where t (i) is the observed time of death. A subject j in the sample dataset who lives to the end of the study provides a right censored observation, and all we know about this subject is that the event time is larger than the on study time. Therefore the information contributed by a surviving subject in the sample likelihood function is the survival function evaluated at the corresponding on study time t (j) for this subject: 1−F (t (j) |µ (j) , y (j) 0 ). Among the n subjects in the sample, subjects with observed death times are indexed from 1 to n 1 and subjects with right censored observations are indexed from n 1 + 1 to n. Then, the log-likelihood function is The threg function is used to estimate the regression coefficients of the threshold regression model. Two arguments of the threg function are used to specify the inputs of the sample log-likelihood function in Equation 5: formula: A 'Formula' object (see Zeileis and Croissant 2010), with the response on the left of a~operator, and the independent variables on the right. The response must be a 'Surv' object as returned by the Surv function from package survival. On the right of the~operator, a | operator must be used: on the left of the | operator, users specify independent variables that will be used in the linear regression function for ln(y 0 ) in the threshold regression model; on the right of the | operator, users specify independent variables that will be used in the linear regression function for µ in the threshold regression model. If users just want to use a constant ln(y 0 ) or µ, they can put 0 or 1 as a placeholder on the left or right of the | operator, instead of listing the independent variables for ln(y 0 ) or µ.
data: Specifies input dataset. Such a dataset must be a survival dataset including at least the survival time variable and censoring variable. For the censoring variable, 1 should be used to indicate the subjects with failure observed, and 0 should be used to indicate the subjects that are right censored. The dataset can also include other independent variables that will be used in the threshold regression model.
The optimization function nlm from the stats package of the base distribution of R is incorporated in the threg command to find the minimum of the minus log-likelihood function in Equation 5 and obtain the maximum likelihood estimates of the regression coefficients in vectors γ and β in the threshold regression model. Note that nlm is based on the recurrence relation of the Newton-Raphson method: where the matrix of the second derivatives is called the Hessian matrix and the vector of the first derivatives is called the gradient. The convergence speed of nlm is fairly fast for loglikelihood function of the threshold regression model in Equation 5. Note that some warning messages passed from the nlm function may appear along with the outputs of the threg function when the Hessian matrix is not invertible at the values of some starting points of the Newton-Raphson procedure, but these warning messages can be ignored.
Below we use the threg function introduced above to fit a threshold regression model on the bone marrow transplantation dataset. We use recipient_age and fab as the predictors for ln(y 0 ) and group and fab as the predictors for µ. Note that group and fab are transformed to factor variables, f.group and f.fab, since they are categorical variables. From the outputs we can see that all of these three predictors are significantly important.

Hazard ratio calculation: hr method
Suppose that we have used the threg function with predictor variables {X 1 , . . . , X k−1 , G} to predict ln(y 0 ) and µ in the threshold regression model, where G is a categorical variable, then we can use hr to estimate the hazard ratio of level G = g over level G = 0 for a scenario in which the values of the other predictors and the time are given: X 1 = X 1 , . . . , X k−1 = X k−1 , and t = t 0 . Such a calculation is sorted out as follows.
Set (z g ) = (1, X 1 , . . . , X k−1 , g), then by using the threg function, y 0 and µ can be estimated for the given (z g ) as:ŷ g 0 = exp{(x g ) γ} andμ g = (x g ) β , whereγ andβ are the vectors of the estimated values for the regression coefficients (see Section 4.1). Now we can calculate the estimates of the density and survival functions at time t 0 as: And further, the estimate of the hazard function at time t 0 is calculated as: where f (·) and F (·) are given in Equation 1 and Equation 2, with σ 2 set to 1, and with y 0 and µ replaced by their estimates in level g. Similarly, if we change the non-reference level g to the reference level 0, we can obtain f (t 0 |μ 0 ,ŷ 0 0 ), S(t 0 |μ 0 ,ŷ 0 0 ) and h(t 0 |μ 0 ,ŷ 0 0 ). The hazard ratio of level G = g over level G = 0 at X 1 = X 1 , . . . , X k−1 = X k−1 and time t = t 0 is therefore: Using the formulas above, the hr method for 'threg' objects estimates the hazard ratios for a categorical variable with three arguments: var, scenario and timevalue. The uses of these three arguments are given as follows: object: An object of class 'threg', returned by the threg function.
var: Specifies the name of the categorical variable G for which the hazard ratios are to be calculated. Note that the categorical variable G specified for the var argument needs to be a factor variable in R. If G is not already a factor variable yet, you need to use the factor function in R to transform G into a factor variable first and then include this factor variable in threg when fitting the threshold regression model. Then you can specify this factor variable in the var argument of the hr function to calculate the hazard ratios for G.
timevalue: Specifies the desired time value at which the hazard ratios are to be calculated. A vector is allowed for this argument.
scenario: Specifies the values of all predictors except G. A setting of these values is referred as a scenario. The calculated hazard ratios are with reference to the specified scenario. We do not need to specify a level value for the categorical G in the scenario argument since all non-reference levels g of G are enumerated in calculating hazard ratios relative to the reference level. The reference level is chosen as the lowest level of the factor variable specified in the var argument, and all other levels of this factor variable are non-reference levels. Note that in the scenario argument, the order of presentation of the predictors does not matter, and the terms in this argument are separated by + signs. If G is the only predictor in the model, then the scenario argument is not needed.
Below we use the hr method for 'threg' objects to calculate the hazard ratios for the factorized group variable, f.group, for the specified scenario that "the patient age is 18 years old and the FAB classification is 0" at time "500 days". Note that the factor variable f.group has already been included in the threshold regression model earlier by threg which returned the object fit.
R> hr(fit, var = f.group, timevalue = 500, + scenario = recipient_age(18) + f.fab1 (0) The plot method for 'threg' objects can be used to display the graphs of the estimated survival, hazard or density functions at different levels of a factor predictor variable which has been included in the threshold regression by threg.
From Equations 7-9, the estimates of the density, survival and hazard functions in level g at a given time t 0 can be calculated. If we replace t 0 with t which varies over the range of all the observed time points in the data, Equations 7-9 become functions of t in level g and those functions curves can be plotted. When we overlay the curves of different levels of G in one plot, we can compare the corresponding function estimates at different levels of G. These kinds of plots can give additional research insights. The plot method for 'threg' objects generates these plots with the following arguments: x: An object of class 'threg', returned by the threg function.
var: Specifies the name of the categorical variable G, for each level of which the plots would be generated at given scenario specified by the scenario argument. The use of the var argument is the same as that in the hr method.
scenario: Specifies the values of all predictors except G. The use of the scenario argument is the same as that in the hr method.
graph: Specifies the type of curves to be generated. The "hz" option is to plot hazard function curves, the "sv" option is to plot survival function curves, and the "ds" option is to plot density function curves.
nolegend: The nolegend argument needs to be set to 1 if users do not want the threg package to generate legends for the plot. Note that even if nolegend is set to 1, users can still add legends by themselves after the graph is generated, by using the legend function in R.
nocolor: The nocolor argument needs to be set to 1 if users want to depict all curves in black.
Below are examples of using the plot method of 'threg' objects to overlay curves of survival, hazard and probability density functions corresponding to different levels of f.group. The specified scenario is still that "the patient age is 18 years old and the FAB classification is 0".

Predictions: predict method
The predict method for 'threg' objects can be used to predict the initial health status value y 0 , the drift value of the health process µ, the probability density function of the survival time f (t | µ, y 0 ), the survival function S(t | µ, y 0 ) and the hazard function h(t | µ, y 0 ) for a specified scenario and time value. The scenario specified here is similar to those in the hr and plot methods. The only difference is that we need to provide the scenario values for the dummy variables expanded from the factor variable G when using the predict method while we do not need to provide these values when using hr and plot, since the program will automatically enumerate all levels of G for hr and plot. The predict method has three arguments: object: An object of class 'threg', returned by the threg function.
timevalue: Specifies the desired time value at which the predicted values are to be calculated. A vector is allowed for this argument. If this argument is omitted, then the predicted values for the study time of all subjects would be calculated.
scenario: Specifies the values of all predictors, including the dummy variables expanded for the factor variable G. If this argument is omitted, then the predicted values at a specified time value for all subjects would be calculated, and in this case the covariate values for each subject are used as their corresponding scenario values.
Below we use the predict method for 'threg' objects to calculate the predicted values for the specified scenario that "the patient age is 18 years old, the FAB classification is 0 and the risk category is 3" at time "2000 days".

Conclusion
This paper introduces the use of the threg package to implement threshold regression in R.
As a newly developed methodology in the area of survival data analysis, threshold regression brings useful research insights by modeling the underlying health process. Four functions are provided by the threg package: the threg function can be used to estimate regression coefficients, the hr method for 'threg' objects can be used to calculate hazard ratios, the predict method for 'threg' objects can be used to calculated predicted values and the plot method for 'threg' objects can be used to draw predicted plots for the threshold regression model. Application of threshold regression on a wide varieties of survival data can be carried out easily with the help of the threg package. For those readers who are interested in implementing threshold regression in Stata (StataCorp. 2013), we refer them to Xiao, Whitmore, He, and Lee (2012).