Bayesian, and non-Bayesian, Cause-Speciﬁc Competing-Risk Analysis for Parametric and Non-Parametric Survival Functions: The R Package CFC

The R package CFC performs cause-speciﬁc, competing-risk survival analysis by computing cumulative incidence functions from unadjusted, cause-speciﬁc survival functions. A high-level API in CFC enables end-to-end survival and competing-risk analysis, using a single-line function call, based on the parametric survival regression models in survival package. A low-level API allows users to achieve more ﬂexibility by supplying their custom survival functions, perhaps in a Bayesian setting. Utility methods for summarizing and plotting the output allow population-average cumulative incidence functions to be calculated, visualized and compared to unadjusted survival curves. Numerical and computational optimization strategies are employed for eﬃcient and reliable computation of the coupled integrals involved. To address potential integrable singularities caused by in-ﬁnite cause-speciﬁc hazards, particularly near time-from-index of zero, integrals are transformed to remove their dependency on hazard functions, making them solely functions of cause-speciﬁc, unadjusted survival functions. This implicit variable transformation also provides for easier extensibility of CFC to handle custom survival models since it only requires the users to implement a maximum of one function per cause. The transformed integrals are numerically calculated using a generalization of Simpson’s rule to handle the implicit change of variable from time to survival, while a generalized trapezoidal rule is used as reference for error calculation. An OpenMP -parallelized, eﬃcient C++ implementation – using Rcpp and RcppArmadillo packages – makes the application of CFC in Bayesian settings practical, where a potentially large number of samples represent the posterior distribution of cause-speciﬁc survival functions.


Introduction
Motivation: Consistent propagation and calculation of uncertainty using predictive posterior distributions is a key advantage of Bayesian frameworks (Gelman and Hill 2006), particularly in survival analysis, where predicted entities such as survival probability can be highly-nonlinear, time-dependent functions of estimated model parameters.In the absence of high-performance software for Bayesian prediction, premature point-estimation of model parameters can only produce approximate -sometimes grossly wrong -mean values for pre-dicted entities (Figure 1, left panel).The R package CFC seeks to address this void for Bayesian cause-specific competing-risk analysis.
Existing methods and tools for competing-risk procedure: In survival analysis with multiple, mutually-exclusive, end-points, competing-risk techniques must be used to properly account for interaction among causes while estimating the expected percentage of population likely to experience events of a particular cause.Several techniques exist for competing-risk analysis, some of which have been implemented as open-source R packages.Among the more established technique are the cause-specific framework (Prentice, Kalbfleisch, Peterson Jr, Flournoy, Farewell, and Breslow 1978), sub-distribution hazard (Fine and Gray 1999) (available in cmprsk (Gray 2014)), mixture models (Larson and Dinse 1985) (available in NPM-LEcmprsk (Chen, Chang, and Hsiung 2015)), vertical modeling (Nicolaie, van Houwelingen, and Putter 2010), and the method of pseudo-observations (Andersen, Klein, and Rosthøj 2003) (available in pseudo (Maja Pohar Perme and Gerster 2012)).For an in-depth review and comparison of competing-risk methods, see Haller, Schmidt, and Ulm (2013).More recently, machine learning techniques such as random forests (Breiman 2001) and gradient boosting machines (Friedman 2001) have been extended to survival models (available via ran-domForestSRC (Ishwaran and Kogalur 2015) and gbm (Ridgeway 2015), respectively).The random forest survival implementation in randomForestSRC includes competing-risk analysis (Ishwaran, Gerds, Kogalur, Moore, Gange, and Lau 2014).
Cause-specific competing-risk analysis: The cause-specific framework for competing-risk analysis (CFC) is a two-step process.First, independent survival models are constructed for each cause.In each cause-specific model, events due to alternative causes are treated as censoring.To arrive at cause-specific cumulative incidence functions, however, the population depletion due to competing risks must be taken into account, in order to avoid over-estimating the cause-specific incidence probabilities (Figure 1, right panel).This leads to a second step, where a set of coupled, first-order differential equations must be solved.A key advantage of CFC is that, by separating the two steps, it allows for full flexibility in using different survival models as input into the second step, including a combination of parametric and non-parametric model.The only requirement for each cause-specific survival model is the implementation of a function that returns the unadjusted survival probability at any given time from index.While the first step is straightforward, applying the second step -calculation of cumulative incidence functions from cause-specific survival models -is non-trivial due to numerical and computational challenges.These challenges are especially pronounced in Bayesian settings involving a large number of MCMC samples representing the posterior distribution of time-dependent functions for each subject.While a non-parametric version of the cause-specific framework is available in the survival package (Therneau 2015) via function survfit, no reliable and modular open-source software enables the application of CFC to arbitrary parametric models, Bayesian or non-Bayesian, despite the popularity and intuitive appeal of this approach to competing-risk analysis.
Our contribution: The R package CFC provides -to our knowledge -the first open-source, general-purpose software for numerical calculation of cumulative incidence functions from cause-specific (unadjusted) survival functions (hereafter referred to as survival functions).Through a combination of algorithmic innovations, performance optimization techniques, and software design choices, CFC provides both an easy-to-use API for performing competingrisk analysis of standard parametric survival regression models (via integration with survival package) as well as the machinery for efficient and reliable application of CFC to arbitrary q q q q q q q q q q q q q q q q q q q q q q q q q q 0.50 0.  (Mahani and Sharabiani 2015a).The x axis corresponds to the 'incorrect' (non-Bayesian) method of calculating the average of model coefficients, followed by calculation of survival probability using the coefficient averages.The y axis corresponds to the mean+/se values of the same survival probabilities, this time using the 'correct' (Bayesian) method of calculating the probabilities once for each sample, and then averaging the probabilities.All survival probabilities are significantly over-estimated by the non-Bayesian method.The underlying MCMC run consisted of 5000 iterations, the first 2500 of which were discarded as burn-in.Right panel: Comparison of cumulative incidence probability, with and without competing-risk correction, for the 'pcm' event in 'mgus1' data set, taken from R package survival (Therneau 2015).Compared to a naive Kaplan-Meyer estimate, competing-risk analysis removes cases lost to 'death', thus reducing the pool of available subjects for 'pcm', and therefore leading to a smaller estimate for cumulative incidence probability for 'pcm'.
survival models, using both R and C++ interfaces.The C++ implementation and interface is based on the convenient framework of Rcpp (Eddelbuettel, François, Allaire, Chambers, Bates, and Ushey 2011) and RcppArmadillo (Eddelbuettel and Sanderson 2014) packages.
As such, CFC should appeal to practitioners as well as package developers.
Paper outline: The rest of this paper is organized as follows.In Section 2 we discuss the challenges of -and solutions for -the numerical quadrature problem of Bayesian CFC.In Section 3, we review the CFC package in some detail, including a description of its three usage modes.Section 4 presents several examples illustrating these usage modes and other features of CFC.Section 5 concludes with a summary of our work and pointers to pontential future research and development.

Bayesian CFC quadrature
This section provides the mathematical framework for CFC, discusses the shortcomings of existing quadrature techniques for the numerical integration involved in Bayesian CFC, and presents our alternative quadrature algorithm, the generalized Newton-Cotes framework.

Cause-specific competing-risk survival analysis
The following set of K coupled, first-order differential equations describe the time evolution of cause-specific cumulative incidence functions, F k (t): where λ k (t) (≥ 0, ∀t > 0) refers to the k'th (non-negative) cause-specific hazard function.These equations can be decoupled and transformed into integrals.We do so by summing the two sides of Equation 1 over all k: where we have defined the event-free probability function, E(t), as: Solving for E(t) in Equation 2b, we obtain: where S k (t) stands for the unadjusted survival function for cause k.They are defined as: Since hazard functions are non-negative, we conclude that and also that S k (0) = 1.These functions correspond to the naive, Kaplan-Meyer approach depicted in Figure 1 (left panel).Substituting back into Equation 1 and integrating the two sides, we arrive at the following K, one-dimensional integral equations: The integrals of Equation 7do not generally have closed-form solutions.For example, in a Weibull survival model, we have which leads to the following expression for cumulative incidence functions with two competing risks (K = 2): and similarly for F 2 (t).In the absence of an exact solution, we must resolve to numerical integration.
In most real-world problems, each of the K integrals of Equation 7 must be evaluated more than once.For example, in regression settings each observation will have a distinct set of cause-specific hazard, survival, and cumulative incidence functions that are dependent on the value of the feature vector for that observation.Furthermore, in Bayesian settings where posteriors are approximated by MCMC samples, each combination of observation and cause will have as many integrals as samples.For example, in a competing-risk analysis with 3 causes, 1000 observations, and 5000 posterior samples, 3 × 1000 × 1000 = 3 million cumulative incidence functions must be calculated.As a result, computational efficiency of the quadrature algorithm is of prime importance in Bayesian CFC.

Shortcomings of current quadrature techniques for Bayesian CFC
Most modern techniques for one-dimensional, numerical integration are based on function interpolation.Perhaps the most common set of techniques are Gaussian quadratures (Stroud and Secrest 1966), which approximate an integral by a weighted sum of function values evaluated on a pre-specified, irregular grid, often yielding exact results for polynomials.A particular flavor called Gauss-Kronrod quadrature (Laurie 1997) allows function evaluations to be re-used in successive, adaptive iterations.The QUADPACK library (Piessens, de Doncker-Kapenga, Überhuber, and Kahaner 2012), ported to R via the integrate function in stats package, is based on Gauss-Kronrod, and augmented by Wynn's epsilon algorithm (Wynn 1966) to accelerate convergence for end-point singularities.
A direct application of this software to the Bayesian CFC quadrature problem, however, is inefficient due to several reasons.Firstly, and most importantly, users often expect a dense output for cumulative incidence functions, i.e., F k (t)'s in Equation 7 must be evaluated at multiple values of t.QUADPACK, on the other hand, is not designed to produce dense output, and therefore would require as many calls as the number of outputs desired.This can lead to an excessive number of function evaluations, which is particularly troubling in time-consuming, Bayesian problems.Secondly, there is a significant opportunity to share computational work across the K integrals in CFC.This is due to the fact that the integrands in Equation 7 are various multiplicative permutations of cause-specific hazard and survival functions.Taking advantage of this opportunity, however, requires custom code that is designed for the particular structure of the CFC problem.Finally, direct calculation of each intergal in Equation 7 is suboptimal from a usability pespective, since it requires the user to supply two functions per cause: the hazard function, λ k (t), and the survival function, S k (t).
While the two functions are related by Equation 5, yet their conversion requires integration or differentiation.Also, in non-parametric survival models, the survival function is usually not differentiable (e.g., piece-wise step function) and hence the hazard function is unavailable.In the best case, requiring both functions adds a burden to the user and increases the possibility of mistakes.Numeric differentiation, e.g., using numDeriv (Gilbert and Varadhan 2012), is an option, but it is computationally expensive.
A second class of integration algorithms are quadrature by variable transformation (Press 2007, Chapter 4), better known as double-exponential (DE) or Tanh-Sinh methods (Takahasi and Mori 1974).They combine a hyperbolic change-of-variable with trapezoidal rule to induce exponential convergence of the integral near end-point singularities.As such, they are better suited to handle such singularities compared to Gauss-Kronrod techniques.DE quadrature has recently been implemented in R via the package deformula (Okamura 2015).For an empirical comparison of quadrature techniques, see Bailey, Jeyabalan, and Li (2005).Of the three problems with QUADPACK discussed above, the last two are equally applicable to DE methods.(The trapezoidal rule used in DE methods can produce cumulative integrals.) We therefore develop a custom quadrature algorithm for Bayesian CFC that addresses the shortcomings of existing techniques.

Implicit variable transformation quadrature using generalized Newton-Cotes
We transform Equation 7to an equivalent form that removes the hazard function, thereby freeing it from potential singularities.To do so, we use the definition of S k (t) in Equation 5to get Solving for λ k (t), and inserting back into Equation 7, we obtain: Thanks to the conditions described in 6, the integrand is now bounded.
The transformation from Equation 7to Equation 11is closely related to the DE method, with two notable differences: First, while in DE we use a double-exponential change of variable such as t = tanh( π 2 sinh(u)), here we use a custom transformation, t = S −1 k (u).Secondly, rather than applying the transformation explicitly -which would require the user to supply the inverse survival functions -we leave it implicit.This allows us to handle cases where explicit derivation of inverse survival functions is impossible; it also makes the software user-friendly.
Similar to the DE method, we apply Newton-Cotes techniques to the integrals of Equation 11.However, in exchange for removing the singularity, the new form -with variable transformation left implicit -imposes limits such as inability to cheaply find interval midpoints on the scale of transformed variable, thus rendering the classical Newton-Cotes expressions invalid (with the exception of trapezoidal rule).We have thus developed a generalized Simpson's rule that applies to integrals with an implicit change of variable, such as in Equation 11.Recall the standard Simpson's rule which approximates the integral of f (t) in the interval [a, b] via a quadratic approximation of the function: In generalized Simpson, we have: Proof is given in Appendix A, which relies on a quadratic expansion of f in terms of g over the interval [a, b].Note that if g(t) is linear in t, then r = 0 and the generalized equation in 13a reduces to the standard form in Equation 12. Also, in the corner cases where g(a) = g((a + b)/2) or g((a + b)/2)) = g(b) (leading to r 2 = 1), the above equation must be overridden in favor of a single trapezoidal step over the other half of the interval [a, b].
The implementation in CFC contains this protective measure, which is particularly useful for handling non-parametric survival functions.
While it is possible to use the inequalities of 6 to derive firm upper bounds for integration error, such upper bounds are too pessimistic since they assume piece-wise constant survival curves.Instead, we opt for a common approach in adaptive quadrature literature, i.e., using the difference between our method, and the output of a less accurate technique as a proxy for integration error.In our case, we use the trapezoidal rule as reference, which is easily generalized to the implicit variable transformation method: The advantage of using a generalized Newton-Cotes framework is that 1) it permits an adaptive subdivision scheme which reuses all previous integrand evaluations (Press 2007, Chapter 4), and 2) in the process, we obtain dense output for the integral, i.e., at all subdivision boundaries.To provide dense output at arbitrary points requested by the user, we use interpolation.Pseudo-code for the CFC quadrature algorithm is listed below.(For simplicity, we focus on a single integration task, which is wrapped in a for loop, corresponding to observations and/or Bayesian samples.) Algorithm 1 Input: {S k (.)}, k = 1, . . ., K (cause-specific survival functions) Input: t out (dense output time vector) Input: N max (maximum number of adaptive subdivisions) Input: (relative error tolerance for integration) Output: {I k }, k = 1, . . ., K (cause-specific CI functions, each one evaluated at t out ) 1. t max ←max(t out ) 2. t ← 0 t max (integration time grid) 3. Apply generalized Simpson and trapezoidal to t to initialize CI estimates and errors.4. e ← Maximum integration error across all causes at t max 5. N ← 1 6. while (e > ) and (N < N max ) 7.
Identify time interval with maximum contribution towards integration error.8.
Split the the identified interval in half.9.
Update generalized Simpson and trapezoidal CI estimates and errors.10.
e ← Maximum integration error across all causes at t max 11.
N ←N + 1 12. Interpolate generalized Simpson CI estimates over t to produce {I k } for t out .13. return {I k } In implementing the above quadrature algorithm in CFC, we have used several performance optimization strategies, which are discussed next.

Performance optimization strategies in CFC
Since a key target application of CFC is Bayesian survival models, performance is an important consideration.We have adopted three performance optimization strategies in CFC for executing the quadrature algorithm described in Section 2.3: 1) C++ implementation, 2) work sharing, and 3) parallel execution.Below we discuss each strategy.
C++ implementation: This includes two interconnected but distinct concepts, 1) implementation of the CFC quadrature algorithm in C++, and 2) API definition for user-supplied survival functions in C++.While it is possible to accept and call an R implementation of survival functions inside the C++-based quadrature algorithm, the data marshalling overhead of repeated calls from C++ to R can more than nullify the benefits of porting the quadrature algorithm to C++ (Eddelbuettel 2013).In other words, if the survival function must be executed in R, it is best for the quadrature algorithm to also run in R. We rely on the framework provided by the Rcpp package (Eddelbuettel et al. 2011) for our C++ implementation, which facilitates the development and maintenance of CFC and also encourages more efficient C++ implementations of survival functions by users and package developers.Details regarding the C++ components of CFC are provided in Section 3.

Work sharing:
In most cases, evaluating the integrand is where a significant fraction, if not the majority, of time is spent in a quadrature algorithm.Therefore, minimizing the number of such function evaluations can be a rewarding performance optimization technique.In generalized Simpson (Equation 13a) and trapezoidal (Equation 14) steps applied to the CFC problem, we need to evaluate two types of entities: 1) individual survival functions (S k 's), and 2) their leave-one-out products ( k =k S k ).We use two types of optimizations to minimize unnecessary calculations: 1) calculation of S k 's is shared across all causes, and 2) the full product k S k is calculated once, and divided by individual S k terms to arrive at leave-one-out terms.
Parallelization: Calculation of cumulative incidence functions for different observations and/or samples is clearly an independent set of tasks, which can be executed in parallel.We use OpenMP in C++, and doParallel (Analytics and Weston 2015a) and foreach (Analytics and Weston 2015b) packages in R, for this task parallelization.As long as the span of the iterator, which is typically the product of observation count and number of samples per observation (in Bayesian models), is sufficiently large, this parallelization scales well with the number of threads used (up to the physical/logical core count on the system).For an illustration of the impact of parallelization on performance, see Section 4.3.

CFC implementation and features
This section introduces the CFC software and its components, including the three usage modes for cause-specific competing-risk analysis.Examples illustrating each usage mode will be provided in Section 4. As usual, package documentation is the ultimate reference for all details.

Overview of CFC
CFC functionalities can be categorized into three groups: 1) core functionality, 2) utilities, and 3) legacy code.The core functionality in CFC is numerical calculation of cause-specific, competing-risk differential equations in Eq. 1, using the implicit variable transformation in Eq.11, and based on the generalized Newton-Cotes method outlined in Eqs.13a and 14.There are two implementations of Algorithm 1 in CFC: the R-based function cscr.samples.R, and the C++-based function cscr_samples_Cpp.Both these functions are private, and exposed through a common interface, cfc, which dispatches the right method by inspecting the first argument.The cfc API provides the users and package developers with the capability to perform cause-specific, competing-risk analysis using their custom-built survival functions, as long as they conform to a pre-specified but flexible interface, which is described in Section 3.2.These functions can be Bayesian or non-Bayesian, parametric or non-parametric.
The utility methods in CFC, on the other hand, expose a convenient API for users to perform end-to-end survival and competing-risk analysis with minimal effort.The tradeoff is that the set of survival functions available through this API is limited to (non-Bayesian) parametric survival regression models of package survival.This still represents a core set of popular models, and should address the needs of many practitioners.The functions in this API are cfc.survreg,summary.cfc.survreg,plot.summary.cfc.survreg,cfc.survreg.survproband cfc.prepdata.The last two functions are described in Section 3.2, and all functions are illustrated via examples in Section 4.
Legacy functions in CFC (cfc.tbasis and cfc.pbasis and their associated S3 methods) apply a composite trapezoidal rule to user-supplied survival curves, i.e., evaluated at predetermined time points.Here the choice of time intervals is not optimized, and no error analysis is performed.The use of legacy CFC is deprecated in favor of the new machinery described in this paper.Interested readers can refer to package documentation for further details on how to use the legacy code.

CFC usage modes
In order to achieve the dual objectives of user-friendliness and flexibility, CFC offers three usage modes: 1) end-to-end survival and competing-risk analysis, using cfc.survreg, 2) competing-risk analysis of user-supplied survival functions, written in R, and 3) competingrisk analysis of user-supplied functions, written in C++.The last two usage modes are exposed through a common interface, i.e., the public function cfc.We review each usage mode below, with examples to follow in Section 4.
Parametric survival regression and competing-risk analysis: This is the most convenient usage mode, in which a one-line call to cfc.survreg performs both cause-specific survival regression and competing-risk analysis.More specifically, cfc.survreg executes three steps: i) parsing the survival formula to create cause-specific status columns and formulas; ii) calling the survreg function from survival package (Therneau 2013) for each cause; iii) calling the internal CFC function, cscr.samples.R, to perform competing-risk analysis.To perform the first step, we use the (public) utility function cfc.prepdata.To perform the last step, the (unadjusted) survival function associated with the survreg models is needed, which we have implemented in cfc.survreg.survprob.This survival function is made public, so as to allow users to easily combine survreg models with other types of survival models.It contains a custom implementation of the survival functions needed by cscr.samples.R, using the "dist" field of the returned object from survreg, along with the definition of distributions made available via survreg.distributions.The implementation is a verbatim translation of the definition of location-scale family (Datta 2005).This usage mode trades off flexibility for user-friendliness, as it is restricted to the survival regression models covered in the survival package, as well as user-defined survival distributions following the conventions of that package.An example is provided in Section 4.1.
CFC for user-supplied survival functions implemented in R: If the f.list argument passed to cfc is a list of functions -implemented in R -then cfc dispatches cscr.samples.R.This is more flexible than using cfc.survregsince the user can supply any arbitrary survival function, as long as it conforms to the prototype expected by cscr.samples.R.However, it requires the user to implement one or more survival functions.These functions must follow this prototype: where t is a vector of time-from-index (non-negative) values, args is a list of argument needed by the function, and n is an iterator that spans the observations and/or samples.It is the responsibility of function implementation to consistently interpret n, and map it from one dimension to multiple dimensions, e.g., to obtain observation and sample indices.Of course, we could have chosen to hide n inside args, but decided to make it explicit to draw attention to its special meaning.In Section 4 we will see several example implementations of survival functions conforming to the above prototype.
CFC for custom models -C++ mode: This usage mode offers the same functionality as the previous one, but requires the user to supply the survival functions in C++.This often leads to significant speedup; however, implementation is also more involved as it requires at least three functions per distinct cause-specific model: initializer, survival function, and resource de-allocator.An example is provided in Section 4.3, illustrating the impact of transition from C++ to R implementation (of survival functions) on performance.To facilitate both package development and maintenance as well as survival-function implementation by users, we have adopted the framework of Rcpp (for data exchange with R) and RcppArmadillo (for linear algebra):

Using CFC
As discussed in Section 3.2, CFC can be used in three modes, which progressively become more flexible but also require more significant programming effort.Examples 1-3 illustrate how each mode can be used.The final example illustrates a key advantage of the CFC framework, namely the logical separation of cause-specific survival analysis from the competing-risk analysis, which in turns allows for combining survival models of different kind in the same competing-risk analysis.

Example 1: end-to-end competing-risk analysis using Weibull regression
In our first example, we illustrate the easiest usage mode in CFC, i.e., the cfc.survreg function.It first creates parametric survival regression models using the survreg function of the survival package, and passes these models to cfc for competing-risk analysis.
It is instructive to visualize the impact of competing-risk adjustment on cumulative incidence rates: R> old.par <-par(mfrow=c(1,2)); plot(summ, which = 2); par(old.par) We can see from Figure 3 that the competing-risk adjustment, using the CFC framework, has a very significant corrective impact on the cumulative incidence probability for both causes.
In Bayesian CFC, the cumulative incidence and survival arrays returned have three dimensions: 1) time, 2) cause, 3) n iterator (described above).In summarizing the arrays, we should not reduce/collapse the first two dimensions.The last dimension is often a flattened version of two dimensions: observations and MCMC samples.How this 2D space is mapped to the 1D iterator is left to the users in their specification of the survival function.We have similarly aimed for flexibility in designing the summary.cfcfunction, where the f.reduce argument is required from the user in order to determine how each sub-array of the 3D arrays returned by cfc must be processed/reduced, before being passed to the quantile function to determine the median and credible bands for each combination of time and cause.For the example provided in this section, a suitable reduction function can be defined as: R> my.f.reduce <-function(x, nobs, nsmp) { + return (colMeans(array(x, dim = c(nobs, nsmp)))) + } This function calculates the population average (for each time, cause and MCMC sample).As such, when supplied to summary.cfc, it produces the credible bands for population-average values for cause-specific cumulative incidence and survival probabilities.The output of the summary function can be passed to the plot function to produce Figure 4.
R> my.cfc.summ<-summary( + out.cfc.R, f.reduce = my.f.reduce + , nobs = nobs.pred,nsmp = nsmp) R> oldpar <-par(mfrow = c(2, 2)) R> plot(my.cfc.summ)R> par(oldpar) Parallelization: We saw that running CFC in R takes nearly 25 seconds on our test server.(See Section B for session information.)This is for a small data set (nobs.pred=123), a handful of samples (nsmp=10) and a lenient error tolerance (rel.tol=1e-4).By extrapolation, to perform CFC for a data set of size 1000, with 1000 samples, a somewhat more realistic scenario, we would need nearly 4.5 hours.(Note that execution time is nearly independent of the length of tout, since the latter only affects the last -interpolation -step, which is computationally cheap.)An easy way to improve performance is by using multi-threaded parallelization on a multicore machine.This can be done via the ncore parameter: R> ncores <-2 R> tout <-seq(from = 0.0, to = tmax, length.out= 10) R> t.R.par <-proc.time()[3]R> out.cfc.R.par <-cfc(f.list,arg.list, nobs.pred* nsmp, tout, + rel.tol = rel.tol,ncores = ncores) R> t.R.par <-proc.time()[3]-t.R.par R> cat("t.R.par:", t.R.par, "sec\n") The speedup is close to linear, which is expected given the low amount of coordination needed among threads: R> cat("parallelization speedup -R:", t.R / t.R.par, "\n") A more powerful to improve performance is by using the C++ interface of CFC, which is illustrated next.The initializer function is responsible for converting the incoming List from R to the weib structure: By implementing the data structure in terms of Armadillo classes vec and mat, we isolate the dependence on R data structures to the initializer, which allows for easier porting of the survival model to other environments.We must also create an external pointer to the initializer, which will be supplied to cfc.This can be accomplished by calling the following function from R, as we will demonstrate later: // [[Rcpp::export]] XPtr<initfunc> weib_getPtr_init() { XPtr<initfunc> p(new initfunc(&weib_init), true); return p; } Recall that the initfunc function pointer has been typedef'ed before.Since the initializer creates a new weib data structure on the heap, it is best practice to release this memory once we are finished.We do so by implementing a freefunc, as well as a companion function for creating an external pointer to it: t.Cpp.par-1000 samples: 19.605 sec The speedup for nsmp=1000 remains quite acceptable: R> cat("parallelization speedup -C++:", t.Cpp.1000/ t.Cpp.1000.par,"\n") parallelization speedup -C++: 1.88809

Example 4: Combining parametric and non-parametric survival models in CFC
The logical separation of survival models and competing-risk analysis in CFC offers the flexibility to use entirely different type of models for different causes.We saw an example of this in Section 4.1, where we used Weibull and exponential distributions in the cfc.survreg function.It is even possible to combine parametric and non-parametric survival models in CFC, as we illustrate next.
The random forest survival model in randomForestSRC package produces discretized survival curves.When survival curves for all causes are discrete, combining them does not require integration, and the survfit function in survival package offers this functionality.However, if at least one cause has a continuous survival function, we can use CFC to produce continuous output.The key step is to write a wrapper function around the discrete survival functions that uses interpolation to create a continuous interface.

Discussion
Summary: Bayesian techniques offer many, well-recognized methodological advantagesparticularly in survival analysis -including a general estimation framework, consistent treatment and propagation of uncertainty, validity for small (and large) samples, ability to incorporate prior information, ease of model comparison and validation, and natural handling of missing data (Ibrahim, Chen, and Sinha 2005).Translating this broad appeal into widespread adoption of Bayesian techniques is critically dependent on the availability of software for their easy and efficient estimation and prediction.Most effort in developing high-performance Bayesian software, however, has been focused on the estimation side, with research covering areas such as efficient (Girolami and Calderhead 2011;Mahani, Hasan, Jiang, and Sharabiani 2015), self-tuning (Homan and Gelman 2014), and parallel (Mahani and Sharabiani 2015b;Gonzalez, Low, Gretton, and Guestrin 2011) MCMC sampling, among others.In contrast, relatively little attention has been paid to providing techniques and tools for full Bayesian prediction, leaving many practitioners with no choice but to use premature, point summaries of model parameters to produce approximate, mean values for predicted entities.(See, however, ?.) We presented the R package CFC for Bayesian, and non-Bayesian, cause-specific competingrisk analysis of parametric and non-parametric surviva models with an arbitrary number of causes.Three usage modes available in CFC offer a combination of ease-of-use and extensibility: While a single-line call to cfc.survreg performs parametric survival regression followed by competing-risk analysis, the core function cfc allows users to include other survival models, including non-parametric ones, in cause-specific competing-risk analysis.The R interface can be used for small data sets and/or non-Bayesian models, where the computational workload is modest.It can also serve as a reference for implementing the C++ version of the survival functions in order to significantly improve peformance for computationally-demanding problems.The quadrature algorithm used in CFC can be considered an implicit variable transformation method that circumvents potential end-point singularities, and also enhances usability by removing the need to supply the cause-specific hazard functions.In addition to the C++ API, other performance optimization techniques in CFC such as cross-cause worksharing and OpenMP parallelization have combined to put a full Bayesian approach to survival and competing-risk analysis within the reach of practitioners.
Potential future work: According to Equations 3 and 4, we must have: where ∆ refers to the change in a quantity during an integration time step.However, the generalized Simpson step (Equations 13a and 13b), when applied to all causes, does not mathematically satisfy this condition.In other words, the sum of event-free probability and all cumulative incidence functions does not mathematically add up to 1 after discrete time evolutions.Similarly, the generalized trapezoidal rule of Equation 14, when applied to each cause in isolation, does not satisfy this property in general (but it does for K = 2).It is possible to extend the trapezoidal step to satisfy this property, but without an equivalent extension of the Simpson rule, we would need to develop an alternative approach to error analysis.This is because, absent the Simpson rule as the main method, the trapezoidal step would change role from reference to main method to become the return value of the integral.Developing a coherent framework that satisfies Equation 15 and includes proper error analysis is an interesting potential area of research.
In terms of software development, current implementation of cfc.survreg is R based.This is partially justified since the underlying models, from survival package, are non-Bayesian.Therefore, as long as data sizes are small, computational workloads in cfc.survreg remain manageable without porting to C++.However, for large data sets this will be inadequate, and therefore a high-performance implementation is warranted to cover the emerging, big-data use-cases.
OpenMP parallelization of cfc provides an immediate and significant performance gain, but there are other, more advanced opportunities for performance optimization.For example, Single-Instruction, Multiple-Data (SIMD) parallelization has recently been successfully applied to Bayesian problems (Mahani and Sharabiani 2015b).Given the increasing width of vector registers in modern CPUs (Jeffers and Reinders 2013), taking advantage of SIMD parallelization offers an opportunity for meaningful performance improvements.A second area of investigation, especially for large data sets with memory-bound performance ceilings, is reducing data movement throughout the memory hierarchy.Techniques such as improving data layout to permit unit-stride access, and NUMA-aware memory allocation to minimize cross-socket data transfer over slower bus interconnects (Mahani and Sharabiani 2015b) can help minimize data movement and improve cache and memory bandwidth utilization.

A. Proof of generalized Simpson rule
Our objective is to derive an approximation for b a f (t) dg(t), using a second-order Taylorseries expansion of f (t) in terms of g(t) over the interval [a, b]: where f a ≡ f (t = a) and similarly for f b , g a and g b .To find α and β, we require that this quadratic function passes through (f a , g a ), (f b , g b ), and (f m , g m ), where f m ≡ f (m = (a + b)/2), and similarly for g m .Th first of three conditions, at t = a, is already satisfied in Equation 16.The next two conditions lead to Solving for α and β leads to: Substituting α and β from Equations 18a and 18b into Equation 19, while defining g 1 ≡ g m −g a and g 2 ≡ g b − g m , and some algebraic manipulation, leads to: A second change of variable, using h ≡ g 1 +g 2 = g b −g a and δ ≡ g 1 −g 2 = 2g m −(g a +g b ) allows us to re-express the above in the following form, after some further algebraic manipulations: A final symbol definition, r ≡ h/δ, readily leads to Equation 13a.

B. Setup
Below is the R session information used in producing R output in Section 4.

Figure 1 :
Figure1: Motivating Bayesian cause-specific competing-risk analysis.Left panel: Estimated survival probability, based on a Bayesian Weibull regression, at 614 days from index, for 'ovarian' data set, available in R package BSGW(Mahani and Sharabiani 2015a).The x axis corresponds to the 'incorrect' (non-Bayesian) method of calculating the average of model coefficients, followed by calculation of survival probability using the coefficient averages.The y axis corresponds to the mean+/se values of the same survival probabilities, this time using the 'correct' (Bayesian) method of calculating the probabilities once for each sample, and then averaging the probabilities.All survival probabilities are significantly over-estimated by the non-Bayesian method.The underlying MCMC run consisted of 5000 iterations, the first 2500 of which were discarded as burn-in.Right panel: Comparison of cumulative incidence probability, with and without competing-risk correction, for the 'pcm' event in 'mgus1' data set, taken from R package survival(Therneau 2015).Compared to a naive Kaplan-Meyer estimate, competing-risk analysis removes cases lost to 'death', thus reducing the pool of available subjects for 'pcm', and therefore leading to a smaller estimate for cumulative incidence probability for 'pcm'.
typedef arma::vec (*func)(arma::vec x, void* arg, int n); typedef void* (*initfunc)(Rcpp::List arg); typedef void (*freefunc)(void *arg); Note the use of void* pointer in function prototypes.This allows for a uniform API across all survival functions, leaving the casting and interpretation of this pointer to each implementation.See Example 3 in Section 4.3.

Figure 2 :
Figure 2: Cause-specific cumulative incidence functions for the Weibull survival regression models built for bmt data set.

Figure 4 :
Figure 4: Median and 95% credible bands for population-averaged cause-specific cumulativeincidence and survival probabilities, corresponding to Example 2.
We can even use different distributions for each cause, by passing a vector as the dist argument:R> out.mix<-cfc.survreg(Surv(time,cause)~platelet+age + tcell, + bmt[idx.train,],bmt[idx.pred,],+dist= c("weibull", "exponential"), rel.tol = rel.tol)4.2.Example 2: Bayesian CFC in RUtilizing the function cfc requires that cause-specific, unadjusted survival functions be available or implemented, in R or C++.This is in contrast to cfc.survreg which has encapsulated the survival functions corresponding to the class of survival models in survival package.First, we use the utility function cfc.prepdata to prepare the bmt data set and set up formulas for cause-specific survival analysis: