Constructing Multivariate Survival Trees: The MST Package for R

Multivariate survival trees require few statistical assumptions, are easy to interpret, and provide meaningful diagnosis and prediction rules. Trees can handle a large number of predictors with mixed types and do not require predictor variable transformation or selection. These are useful features in many application ﬁelds and are often required in the current era of big data. The aim of this article is to introduce the R package MST . This package constructs multivariate survival trees using marginal model and frailty model based approaches. It allows the user to control and see how the trees are constructed. The package can also simulate high-dimensional, multivariate survival data from marginal and frailty models.


Introduction
Decision trees have gained popularity in many application fields because they can handle a variety of data structures, require few statistical assumptions, and yield classification and prediction rules that are easy to interpret. Generalizations of classification and regression trees (CARTs) applied to survival analysis can provide meaningful prognosis rules in medical research. Many authors have proposed tree-based methods to handle univariate (or uncorrelated) survival data (see, e.g., Gordon and Olshen 1985;Davis and Anderson 1989;Therneau, Grambsch, and Fleming 1990;LeBlanc and Crowley 1992). We have extended the research to handle multivariate survival data (Su and Fan 2004;Fan, Su, Levine, Nunn, and LeBlanc 2006;Fan, Nunn, and Su 2009). The goal of this paper is to discuss constructing multivariate survival trees in R (R Core Team 2017) using the MST package (Su, Calhoun, and Fan 2018) available at the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=MST. The paper is organized as follows. In Section 2 we introduce multivariate survival trees and their construction. This will help users decide which model is most appropriate for their dataset. We discuss the MST package including its features in Section 3 and analyze a simulated and a real multivariate survival dataset in Section 4. We conclude with a discussion on possible future developments.

Background
Multivariate failure time arises when individuals or objects under study are naturally clustered or when individuals experience multiple events of the same type (namely, recurrent times). Examples include disease incidence times of family members, component failure times in medical devices, and tooth loss times in dental studies. When not all subjects experience the event of interest and failure times are not all independent, multivariate survival methods must be used to account for the censoring and dependence. The details to analyze correlated failure times and construct multivariate survival trees are discussed in our papers Su and Fan (2004) and Fan et al. (2006Fan et al. ( , 2009; key aspects are summarized herein. The construction of multivariate survival trees adopts a modified CART procedure to model the correlated failure times. The procedure consists of three steps: (1) growing a large initial tree, (2) pruning it back to obtain a sequence of subtrees, and (3) selecting the best tree size. For the tree growing and pruning purposes, one needs a splitting statistic that handles the dependence of failure times and a split-complexity measure to evaluate the performance of the tree. There are two main approaches to analyzing correlated failure times and developing the splitting statistic. One is the marginal approach where the correlation is modeled implicitly using generalized estimating equations on the marginal distribution formulated by a Cox (1972) proportional hazards model. The other approach is the frailty model approach where the correlation is modeled explicitly by a multiplicative random effect called frailty. The frailty term corresponds to some common unobserved characteristics shared by all correlated failure times in a cluster. We have developed two frailty model based methods: one based on the semiparametric gamma frailty model (Su and Fan 2004) and the other based on the parametric exponential frailty model (Fan et al. 2009) to improve the computation speed.
The marginal method uses a robust log-rank statistic given in Appendix A in Equation S1. Su and Fan (2004) used an integrated log-likelihood shown in Equation S2 in Appendix A for the gamma frailty model; however, the MST package uses a Wald test statistic to improve stability and computation time. The exponential frailty model uses a score test statistic given in Equation S3 in Appendix A. The splitting statistic is calculated for each possible split and the largest value corresponds to the best split.
The MST package uses the survival package (Therneau 2017) to fit marginal and frailty models and the MASS package (Venables and Ripley 2002) to simulate from a multivariate normal distribution. In R, we can calculate the splitting statistic with the following logic where x is a numeric predictor variable and 0.5 is a candidate split: The robust log-rank statistic is calculated by fitting the model R> coxph(Surv(time, status)~I(x > 0.5) + cluster(id))$rscore The Wald test statistic is calculated by fitting the model R> coxph(Surv(time, status)~I(x > 0.5) + frailty.gamma(id, method = "em"), + control = coxph.control(eps = 1e-04, iter.max = 10, + toler.inf = sqrt(1e-09), outer.max = 5))$wald.test The gamma frailty model can be computationally intensive, so the coxph.control function is utilized as shown above to speed up the convergence. The Wald test statistic should yield similar splits to the integrated log-likelihood studied by Su and Fan (2004).
The score test statistic for the exponential frailty model is calculated by utilizing matrix multiplication, simulated annealing with the optim(. . . , method = "SANN") function, and a quasi-Newton method to estimate the parameters in the model. The default values are used when optimizing parameters in the score test statistic, except that the maximum number of iterations is increased from 100 to 800 and the initial value estimates are set to 1.
The best choice for the splitting statistic depends on many factors including the true hazard function, the types of covariates in the model, the sample size, the cluster size, and the censoring rates. While the true hazard function is typically unknown, prior simulation studies and data analyses (Nunn, Fan, Su, Levine, Lee, and McGuire 2012) showed that the frailty model based approaches often performed better than the marginal model based approach in dental applications. However, the marginal and exponential frailty models are much less computationally intensive and often select the correct cutoff points even under model misspecification.
All three splitting statistics outperformed the naive log-rank statistic that ignores the dependence. The MST package also allows users to grow a tree using a stratified or naive log-rank statistic, but we do not suggest using these splitting statistics except in rare circumstances. Interested readers are referred to our simulation studies (Su and Fan 2004;Fan et al. 2009).
Growing the initial tree is done by splitting nodes iteratively until either a small number of observations or uncensored times remain at the terminal node. The final tree model can be, theoretically, any subtree of the initial tree, and the number of subtrees can become massive. Following the CART approach, the initial tree is pruned to reduce the number of subtrees considered. Let T denote a binary tree, T i denote the set of internal nodes, and |T | denote the number of nodes of a tree. Defining G(h) to represent the maximum splitting statistic on a particular node h, the split-complexity measure is: where G(T ) = h∈T G(h) measures the overall goodness of splits of T , |T i | measures the complexity of the tree, and α ≥ 0 acts as a penalty for each additional split. The goodness of split measured at each split h is the robust log-rank statistic, the squared Wald test statistic, and the score test statistic for the marginal, gamma frailty, and exponential frailty models, respectively. Note that for a given split, each of these measures has an approximate χ 2 (1) distribution when the sample size at h is large.
An efficient pruning algorithm is determined by comparing each branch T h to a possible trim at node h and solving for α. Since the split complexity measure is 0 when T h is trimmed and h becomes a terminal node, this equates to: which is equivalent to Thus, for any internal node h of the initial tree, the value G(T h )/|T i h | is calculated where T h denotes the branch with h as its root. The weakest link of the tree is the node such that G(T h )/|T i h | is minimal. The subtree that prunes off the branch with the weakest link is taken and this process is repeated until we have pruned back to the root node. This procedure reduces the number of possible subtrees (which can be very massive) to a manageable value.
After pruning back the tree, we have to select the best-sized subtree. The best-sized subtree is taken as the tree with the largest splitting complexity measure in Equation 1 with a prespecified α. LeBlanc and Crowley (1993) suggested α be fixed within the range of 2 ≤ α ≤ 4. Our previous simulation study (Fan et al. 2009) showed α = 4 typically yielded the better performance compared with α = 2 and α = 3 for most models. The number of uncensored events, α = log(n 0 ), can also be used to select the best-sized subtree. Since we have already used the sample to grow and prune the tree, we need to use a validation method for tree size selection. Two validation methods are implemented: the test sample approach and the bootstrap method. When the sample size is large, one may use a subset of the data to grow and prune the initial tree and the other subset to select the best-sized tree. When the sample size is small or moderate, one may use bootstrapping techniques as described previously in our papers.
There are alternative approaches to constructing survival trees, the most similar approach to the one implemented in package MST is available in package rpart (Therneau, Atkinson, and Ripley 2017). The two main differences are that rpart implements the CART procedure assuming independence and prunes the tree using cross-validation. It should also be noted that the rpart package uses an exponential scaling statistic to grow the tree, which differs from the naive log-rank statistic provided by MST. The MST package is slightly more automated where the user simply inputs the data and candidate variables and the MST function outputs the final tree; whereas, the rpart package allows a bit more control when pruning the tree. Both packages allow users to control how the tree is grown and automatically sort trees based on survival rate.
Other methods for constructing survival trees include the ctree function in the partykit package which constructs conditional inference trees with stopping rules (Hothorn, Hornik, and Zeileis 2006) and the DStree package which builds a tree for discrete-time survival data (Mayer, Larocque, and Schmid 2016). Nice features with the alternative approaches are the faster computation time compared with package MST. Splitting statistics that handle the dependence are typically more computationally intensive. Simulation studies specifically comparing trees constructed using the rpart, ctree, and MST functions are warranted.

Simulating multivariate survival data
While the main purpose of the MST package is to construct multivariate survival trees, the package also includes the rmultime function to generate multivariate survival data. The details of generating survival data are described in Su, Fan, Wang, and Johnson (2006); the possible models that can be simulated are given in Table 1. The user specifies the coefficients (β 0 and β 1 ), the cutoff values (c), the censoring rate, and the model with the respective parameters. The arguments of the function are described in the R package documentation. gamma.frailty: log.normal.frailty: marginal.multivariate.exponential: marginal.nonabsolutely.continuous: An example where multivariate failure times are simulated using this functionality and a tree is constructed is presented in Section 4.
There are several packages that can simulate survival data; those include packages genSurv (includes one time-dependent covariate; Araújo, Meira-Machado, and Faria 2015), PermAlgo (generates independent survival data conditional on covariates; Sylvestre, Evans, MacKenzie, and Abrahamowicz 2015), simMSM (simulates multistate models with possibly nonlinear baseline hazards; Reulen 2015), survsim (simulates survival data with recurrent events or competing risks; Moriña and Navarro 2014), and many others. However, to our knowledge, the rmultime function is the only function available in R that can simulate high-dimensional, multivariate survival data from marginal or frailty models.

Constructing multivariate survival trees
The MST function in the R package constructs the multivariate survival tree. The wrapper function grows, prunes, and selects the best tree. The main arguments are described below: • formula: A linear survival model with the survival response, the predictors, and the cluster (or id) variable (e.g., Surv(time, status)~x1 + x2 | id).
• data: Data to grow and prune the tree.
• selection.method: Indicates the method of selecting the best-sized tree. Options are "test.sample" (for large datasets) or "bootstrap" (for small/moderate datasets).
• test: Test sample to select the best-sized tree if selection.method = "test.sample".
• B: Number of bootstrap samples if selection.method = "bootstrap". A rough guideline for the number of bootstrap samples is 25 ≤ B ≤ 100 following LeBlanc and Crowley (1993).
The MST function has some additional features to improve computational efficiency or allow users to see more details about the construction of the multivariate survival tree for a given dataset; these features are briefly described below: • minsplit: Indicates minimum number of observations to continue splitting the branch.
• minevents: Indicates minimum number of uncensored events to continue splitting the branch.
• minbucket: Indicates minimum number of observations in any terminal node.
• maxdepth: Indicates maximum depth of the tree if user wants to stop growing the tree at a certain depth.
• mtry: Indicates number of variables randomly considered at each split.
• distinct: Indicates if all distinct cutpoints considered. If distinct = FALSE, then only cutpoints from delta to 1 -delta with nCutPoints are considered.
• LeBlanc: Indicates if the entire sample or the out-of-bag sample is used while bootstrapping. The interested reader can find more details in Fan et al. (2006).
• sortTrees: Indicates if trees should be sorted such that splits to the left have lower risk of failure.
• plot.Ga: Indicates if a plot of the goodness of split vs. tree size is produced.
• details: Indicates if detailed information should be given about how the final tree was constructed.
The plot.Ga and details parameters provide additional information regarding the construction of the tree for the given dataset, while the other additional features above can improve computational efficiency. The default sortTrees = TRUE will sort the trees such that splits to the left have higher rates of survival. However, it is possible that a terminal node on the left has lower rate of survival than a terminal node on the right if they do not share the same parent node.
The output of the MST function returns the initial tree, the trees pruned and considered in the best tree selection, and the best-sized tree. The final multivariate survival tree depends on the penalty (α) used, so four possible final trees are returned corresponding to α = 2, 3, 4, and log(n 0 ). All trees are objects of class 'constparty' with methods defined in package partykit (Hothorn and Zeileis 2015). A 'constparty' tree requires observations that fall in the same terminal node to have the same prediction (or constant fit). While this requirement may not be met if two observations come from two different clusters, there are many benefits with treating the tree as a 'constparty' object. One is the added functionality for print/plot/predict methods. The plot method for 'party' objects produces Kaplan-Meier curves in each terminal node, which gives the survival rate for an observation with an average cluster effect. The predict method for 'party' objects returns the median survival time, but can easily be modified to return any prediction and handle the dependence (see examples in Section 4). Additionally, the 'constparty' object allows more compatibility with other R packages and provides additional functions to extract elements from the tree.

Simulation example
We illustrate the ability of the MST package to correctly construct multivariate survival trees with the following simulation. Suppose we have 200 clusters each with 4 patients per cluster. Suppose the failure times (e.g., time until illness) follow a marginal multivariate exponential distribution described below: Suppose failure times have a 0.65 correlation within the cluster and approximately half the failure times are censored. Each patient has four covariates (x 1 , x 2 , x 3 , x 4 ) with the first two covariates having cutpoints at 0.5 and 0.3, respectively, with the same β coefficients at 0.8, while the last two covariates have no effect on failure. The failure times can be simulated using the following commands: The dataset can be analyzed with the following commands: R> fit <-MST(formula = Surv(time, status)~x1 + x2 + x3 + x4 | id, + data, test, method = "marginal", minsplit = 100, minevents = 20, + selection.method = "test.sample") The output of the final tree is a 'constparty' object: R> (tree_final <-getTree(fit, "4"))  Figure 1 illustrates the construction of the multivariate survival tree for the simulated data. Panel A gives the large initial tree, Panel B gives the splitting statistic, G α (T ), based on the number of terminal nodes, and Panel C gives the final tree. In this case, the four split penalties considered yield the same final tree. The final tree correctly identifies that patients with x 1 ≤ 0.5 and x 2 ≤ 0.3 are at a higher risk of failure and it does not include x 3 or x 4 which have no effect on failure. The marginal model is best in this situation because the simulated data was generated without a frailty term; however, this is typically unknown. The gamma and exponential frailty models often yield similar final trees. Panels A and C were produced using the plot function and panel B was created with the plot.Ga = TRUE command in the MST function with slight modifications to increase resolution and readability. Users can go further by fitting a model using the terminal nodes of the tree as the predictors. For example, the code below can fit a marginal model using the final tree. Each observation is sent down the tree and assigned to a terminal node; the marginal model is then fit with the terminal node as the predictor. Users can use a frailty term in coxph or use the frailtypack R package (Rondeau, Mazroui, and Gonzalez 2016) to fit a frailty model. Note that, in practice, it is highly recommended that a new dataset be used since the data was already used to grow the tree and data reuse is known to result in inferences that are overoptimistic.

Computational speed
A major limitation with constructing multivariate survival trees is the computational intensity. There are several factors affecting computation speed: the model chosen to handle the dependence, the number of categorical and continuous predictors, the number of cutpoints considered, and the minimum node size to continue splitting the branch. To assess computation time, we simulated survival data with a marginal dependence structure setting ρ = 0.65 and varying the number of observations and predictors. Half of the predictors were clusterspecific factors (constant within cluster) with either 75% or 25% being continuous. Each   continuous variable had around 100 distinct cutpoints and each categorical variable had 8 unordered levels. We set one continuous and one nominal failure-specific factor (which may vary within cluster) to be informative (β = 1). We used the default setting to grow the multivariate survival tree. Table 2 gives the number of minutes it takes to grow and select the best-sized tree using an Intel Core i7-4510U processor. While the MST package does not utilize more efficient software programs such as C++ or Fortran, most of the computation time was devoted to calculating the splitting statistic. Fitting thousands of Cox proportional hazards models, while adjusting for dependence, can be very time consuming.
The exponential frailty model was the fastest method, the marginal model was the second fastest, and the gamma frailty model was the slowest. Even for moderate sample sizes, the gamma frailty model can be very time consuming. The exponential frailty model was developed to improve computation time, while assuming a constant baseline hazard function.
The computational speed gained from using the marginal and exponential frailty models is essential for analyzing large datasets or using many bootstrap samples. Increasing the number of observations or predictors increases the computation time. Nominal categorical variables are converted to ordered by replacing levels with the estimated beta coefficients from a Cox proportional hazards model adjusting for the dependence. Thus, growing trees from datasets with more categorical predictors were faster due to having less distinct cutpoints. Using percentiles can greatly improve computation time. Users are advised to carefully consider the model dependence and input parameter settings to maximize accuracy and reduce computation time.  tooth (tooth-level factor) and the average severity score over the entire mouth for each subject (patient-level factor) were included. Tables 3 and 4 give the fifty-one factors assessed that could potentially affect tooth loss. Factors were analyzed separately for molar and non-molar teeth.

Analysis of tooth loss
The data consisted of 5,336 patients with periodontal disease with 25,694 molar teeth and 40,196 non-molar teeth. The average age was 56 years, with 49% men, 9% had Diabetes Mellitus, and 23% were smokers. For molars, 1,870 (7.3%) teeth were lost with the median tooth loss time of 0.53 years and the maximum tooth loss time of 5.58 years. For non-molars, 2,723 (6.8%) teeth were lost with the median tooth loss time of 0.54 years and the maximum tooth loss time of 5.59 years.
Due to the large sample size, a test sample was used to select the best-sized tree with one third of patients randomly assigned to the test sample. For the molar tooth analysis, data is partitioned with the following commands: R> data("Teeth", package = "MST") R> molar <-subset(Teeth, molar == 1) R> set.seed(482046) R> id.train <-sample(x = unique(molar$id), size = floor(nPatients * 2/3), + replace = FALSE)  To improve computational speed, only 50 equally spread out cutoff points from the 5th to the 95th percentiles were used as candidates for continuous variables with over 50 distinct cutoff points; this was specified via the option distinct = FALSE, together with delta = 0.05 and nCutPoints = 50. A large number of observations was required in each terminal node, minbucket = 100, to ensure more accurate survival estimates. The exponential frailty model was used to handle the correlation and α = 4 was used as the split penalty when selecting the best-sized tree. The multivariate survival trees are constructed with the following command:  R> form <-as.formula(paste0("Surv(time, event)~", + paste0("x", 1:51, collapse = " + "), " | id")) R> fit_molar <-MST(form, data = molar_training, test = molar_test, + method = "exp.frailty", selection.method = "test.sample", + distinct = FALSE, delta = 0.05, nCutPoints = 50, minevents = 10, + minbucket = 100) Figures 2 and 3 give the multivariate survival trees for molar and non-molar teeth, respectively. Molar teeth are categorized into thirty possible terminal nodes. The risk of molar tooth loss is dependent on 20 factors (8 tooth-level factors and 12 subject-level factors). The tree for non-molar teeth is constructed with the same commands, except for using molar == 0 and removing x22 from the formula. Non-molar teeth are categorized into forty-three possible terminal nodes. Twenty-five factors (7 tooth-level factors and 18 subject-level factors) impacted the risk of non-molar tooth loss. Factors listed in the tree do not necessarily represent variable importance. It is possible that informative variables are not included in the final tree due to a masking effect. A detailed tooth analysis using multivariate survival random forests to assess variable importance will be studied in a subsequent paper.
The current dental prognosis is often based on clinical opinion where the clinician weighs several factors. However, an accurate prognosis is crucial to the development of an appropriate treatment plan (Mordohai, Reshad, Jivraj, and Chee 2007). There are many ways multivariate survival trees can be used to help clinicians make better decisions and improve patient care. One approach is to estimate the survival curves for each possible terminal node by fitting a marginal model that handles the correlation within subjects. The multivariate survival trees on tooth loss give the estimated survival rate for each terminal node. Each split to the left has higher rates of survival. The probability of survival at 3 years ranged from 0.999 to 0.211 for molar teeth and 1 to 0.358 for non-molar teeth. A combination of the survival curves and clinical experience can help prescribe better treatment.

Conclusions
To our knowledge, package MST is the only available package in R to construct multivariate survival trees. The MST package is easy-to-use and allows users to see and control how the survival trees are constructed. Tree based analysis automatically groups observations with differing outcomes and hence renders it an excellent research tool in medical prognosis or diagnosis. In addition, trees can handle data with a large number of predictors with mixed types (continuous, ordinal, and categorical with two or more levels) without the need for variable transformation, dummy variable creation, or variable selection. These off-the-shelf features of a data analysis tool are often required in the current era of big data. Over the years, we have received many requests from researchers in various fields to share our R codes. We hope that our newly developed R package, MST, will help facilitate the utilization of multivariate survival trees in many application fields.
Many factors must be considered when determining the appropriate model. The gamma frailty model performs well even when the underlying model is misspecified. However, the gamma frailty model takes much longer to fit than the marginal and exponential frailty models. The computational speed gained from using the marginal and exponential frailty models is essential for analyzing large datasets or using many bootstrap samples. Using percentiles as the candidate splits, instead of all distinct cutpoints, can improve computation speed at the expense of slightly less accuracy.
Future work will be devoted to improving computational efficiency and providing additional features and flexibility. Possible additions include allowing positive stable frailty models, providing more control when pruning a tree, and handling missing observations. The MST function also provides the framework to construct multivariate survival random forests. However, random forests are very computationally intensive, so methods and parameter settings that improve speed as assessed in Section 4.2 should be considered. Other faster ensemble methods, such as extremely randomized trees (Geurts, Ernst, and Wehenkel 2006), could also be implemented. While the main components to construct multivariate survival trees are available, we hope to continue improving computational speed and functionality in future versions of the MST package.

A. Splitting statistics formulas
Suppose that there are n units with K i correlated failure times in the i-th unit, i = 1, . . . , n.
Let F ik and C ik be the failure time and censoring time corresponding to the k-th failure in the i-th unit, respectively. The observed data consist of {(Y ik , ∆ ik , X ik ) : i = 1, . . . , n; k = 1, . . . , K i }, where Y ik = min(F ik , C ik ) is the observed failure time; ∆ ik = I(F ik ≤ C ik ) is the failure indicator, which is 1 if F ik ≤ C ik and 0 otherwise; and X ik = (x ik1 , . . . , x ikp ) denotes the pq-dimensional covariate vector associated with the k-th observation of the i-th unit. We consider only time-independent covariates. To ensure identifiability, it is assumed that the failure time vector F i = (F i1 , . . . , F iK i ) is independent of the censoring time vector C i = (C i1 , . . . , C iK i ) conditional on the covariate vector X i = (X i1 , . . . , X iK i ) , i = 1, . . . , n.
We will use the following notations to write the splitting statistic. Let λ 0 (t) represent an unspecified baseline hazard function, β is an unknown regression parameter, and I(·) is the indicator function. For a constant change point c, x ikj induces a binary partition of the data according to a continuous covariate X j . If X j is discrete, then the form x ikj ∈ A is considered where A can be any subset of its categories.
The marginal model develops the splitting statistic by formulating the hazard function of the k-th failure in the i-th unit as follows: λ ik (t) = λ 0 (t) exp(β · I(x ikj ≤ c)). Fan et al. (2006) showed that the splitting statistic, called the robust log-rank statistic, is: where While Equation S1 looks formidable, the calculation of G M (c) is computationally inexpensive because it is available in closed form and depends primarily on the evaluation of indicator functions.
The gamma frailty model develops the splitting statistic by formulating the hazard function as follows: λ ik (t) = λ 0 (t) exp(β · I(x ikj ≤ c))w i , where w i denotes the frailty term for the i-th unit. The frailty term w i is assumed to follow some known positive distribution; the MST package uses the common choice that w i ∼ Γ(1/ν, 1/ν) where ν represents an unknown variance. Su and Fan (2004) showed that the integrated log likelihood splitting statistic is: