stgenreg : A Stata Package for General Parametric Survival Analysis

In this paper we present the Stata package stgenreg for the parametric analysis of survival data. Any user-deﬁned hazard function can be speciﬁed, with the model estimated using maximum likelihood utilising numerical quadrature. Models that can be ﬁtted range from the Weibull proportional hazards model to the generalized gamma model, mixture models, cure rate models, accelerated failure time models and relative survival models. We illustrate the features of stgenreg through application to a cohort of women diagnosed with breast cancer with outcome all-cause death.


Introduction
Parametric models remain a standard tool for the analysis of survival data.Through a fully parametric approach, we can not only obtain relative effects, such as hazard ratios in a proportional hazards model, but also clinically relevant absolute measures of risk, such as differences in survival proportions (Lambert, Dickman, Nelson, and Royston 2010).Parametric models are also useful where extrapolation is required, such as in the economic decision modelling framework (Weinstein et al. 2003).
The most popular tool for analysing survival data remains the Cox proportional hazards model (Cox 1972), which avoids making any assumptions for the shape of the baseline hazard function.One of the reasons the Cox model remains the prefered choice over parametric models is that standard parametric models available in standard software are often not flexible enough to capture the underlying shape of the hazard function seen in real data.
The traditional approach to estimation of parametric models is through maximum likelihood.This is relatively simply when using a known probability distribution function, such as the stgenreg: General Parametric Survival Analysis in Stata Weibull or Gompertz.Many commonly used parametric survival models are implemented in a variety of software packages, such as the streg package in Stata (StataCorp. 2011), survreg (Therneau 2012) in R (R Core Team 2013) and LIFEREG in SAS (SAS Institute Inc. 2008).However, every parametric model has underlying assumptions, for example, the widely used Weibull proportional hazards model assumes a monotonically increasing or decreasing baseline hazard rate.Such assumptions can be considered restrictive, leading to the development of other more flexible parametric approaches (Royston and Parmar 2002;Royston and Lambert 2011).
In this paper we present the Stata command stgenreg which enables the user to fit general parametric models through specifying any baseline hazard function which can be written in a standard analytical form.This is implemented through numerical integration of the userdefined hazard function.This allows complex extensions to standard parametric models, for example, modelling the log baseline hazard function using splines or fractional polynomials, as well as complex time-dependent effects; methods that are unavailable in standard software.Time-varying covariates can also be incorporated through using multiple records per subject.We do not consider frailty (unobserved heterogeneity) in this article.
One of the key advantages of such a general framework for survival analysis is in the development of new models, for example in one line of code a parametric survival model can be fitted rather than having to directly program the likelihood evaluator.

Parametric survival analysis
Let T * i be the true event time of patient i = 1, . . ., n, and T i = min(T * i , C i ) the observed survival time, with C i the censoring time.Define an event indicator d i , which takes the value of 1 if T * i ≤ C i and 0 otherwise.We define the probability density function of T * i as where f (t) is the unconditional probability of an event occuring in the interval (t, t + δ).We define the hazard and survival functions as such that h(t) is the instantaneous failure rate at time t, and S(t) is the probability of 'surviving' longer than time t.This leads to We can further write where H(t) is the cumulative hazard function.When the integral in Equation 2 is analytically intractible, we can use numerical integration techniques to derive the cumulative hazard and thus still calculate the survival function.

Maximum likelihood estimation
The log-likelihood contribution of the i-th patient, allowing for right censoring and delayed entry (left truncation), using Equation 1 can be written as where t 0i and t i are the observed entry and survival/censoring times for the i-th patient.
If delayed entry is not present then the third term in Equation 3 can be dropped.Using Equation 3 we can directly maximize the log-likelihood if using known probability density and survival functions.Alternatively, using Equation 1 we can write and substituting Equation 2 this becomes We note from Equation 4 that the likelihood can also be maximized if only the hazard function is known.Of course, in standard parametric models, all 3 functions are known; however, given that often the hazard function is of most interest, specifying a complex hazard function can be advantageous.The maximization of such a specified hazard model relies on being able to evaluate the integral in Equation 4. If we propose to use such functions as fractional polynomials or splines to model a complex baseline hazard function, or incorporating complex time-dependent effects, then we have a situation where this integral cannot always be evaluated analytically, motivating alternative approaches.

Numerical integration
We propose to use numerical quadrature to evaluate the cumulative hazard, and hence maximize the likelihood in Equation 4, allowing the user to estimate a parametric survival model, specifying any function for the baseline hazard, satisfying h(t) > 0 for all t > 0.
Gaussian quadrature allows us to evaluate an analytically intractible integral through a weighted sum of a function evaluated at a set of pre-defined points, known as nodes (Stoer and Burlirsch 2002).We have where W (x) is a known weighting function and g(x) can be approximated by a polynomial function.The integral over [t 0i , t i ] in Equation 4 must be changed to an integral over [−1, 1] stgenreg: General Parametric Survival Analysis in Stata using the following rule This transformation allows the incorporation of delayed entry quite simply.The form of Gaussian quadrature depends on the choice of weighting function.The default within stgenreg is Gauss-Legendre quadrature, with weighting function, W (x) = 1.
The accuracy of the numerical integral depends on the number of quadrature nodes, m, with node locations dependent on the type of quadrature chosen.As with all methods which use numerical integration, the stability of maximum likelihood estimates should be established by using an increasing number of quadrature nodes.

Time-dependent effects and time-varying covariates
The presence of non-proportional hazards, i.e., time-dependent effects, is common in the analysis of time to event data (Jatoi, Anderson, Jeong, and Redmond 2011).This is frequently observed in registry data sources where follow-up time is often over many years (Lambert et al. 2011).Similarly in clinical trials, time-dependent treament effects are also observed (Mok et al. 2009).Time-dependent effects are incorporated seemlessly into our modelling framework, by allowing the user to interact any covariates with a specified function of time.We illustrate this in Section 4.2.1.
Time-varying covariates are a further often observed scenario in the analysis of survival data, where the value of a covariate for individual patients can change at various points in follow-up.For example in oncology clinical trials, patients will often switch treatment group when their condition progresses (Morden, Lambert, Latimer, Abrams, and Wailoo 2011), or biomarkers may be measured repeatedly over time, resulting in multiple records per subject (?).For this form of analysis the data is often set up into start and stop times, and since delayed entry (left truncation) is allowed, this again is incorporated into the described modelling framework.We illustrate through example in Section 4.4.

The Stata package stgenreg
The Stata package stgenreg is implemented as three Stata ado files.The primary shell program, stgenreg.ado,handles the syntax options for the package, which then calls the likelihood evaluator program stgenreg_d0.ado,described in Section 3.1.Finally, a variety of predictions can be obtained following estimation of a model using Stata's predict command, which calls the program stgenreg_pred.ado, described in Section 3.2.

Program implementation and syntax
The log-likelihood shown in Equation 4 is maximized using the Newton-Raphson algorithm, with first and second derivatives estimated numerically, as implemented in the ml command in Stata (Gould, Pitblado, and Poi 2010).As described in Section 2.1, the integral in Equation 4is evaluated using m-point Gaussian quadrature.
The evaluator program has been optimized using Stata's matrix programming language, Mata.This provides computational benefits and use of the wide array of mathematical functions available for the user to specify in the hazard function.In addition, we have implemented specific functions which allow the incorporation of restricted cubic splines or fractional polynomials into the hazard or log hazard function (Durrleman and Simon 1989;Royston and Altman 1994).
When using stgenreg one of the options loghazard() or hazard() must be defined.These specify a user-defined log hazard or hazard function.The function must be defined in Mata code, with parameters specified in square brackets, for example [ln_lambda].The use of Mata means that mathematical operations require a colon (:) prefix, for example :+ instead of +.Time must be coded as #t.The user can specify covariates or functions of time within the linear predictor of any parameter, providing a highly flexible framework.
For example, we can specify a Weibull distribution using either the log hazard or hazard function.Each parameter is parameterized to contain the entire real number line, i.e., both λ and γ are restricted to be positive by modelling on the log scale.
. stgenreg, loghazard ([ln_lambda] : A linear predictor can be defined for any of the parameters, with the name of the option defined as the name of the parameter specified in the loghazard() or hazard() option.For example a proportional hazards Weibull model can be fitted with covariates treatment, age and sex by adding the option ln_lambda(treatment age sex).
One of the key advantages of stgenreg is that we can incorporate a variety of functions (including functions of time) into the linear predictor of any parameter.For example, parameter [ln_lambda] has an available option ln_lambda(comp1 | comp2 | ...| compn), which can contain a variety of component functions to increase complexity.Each compj can contain a variety of functions described in Table 1.
Additionally, excess mortality (relative survival) models (Nelson, Lambert, Squire, and Jones 2007) can be fitted by use of the bhazard(varname) option.In these models a known expected mortality rate, h * (t), is included in the model as follows, Here the loghazard() and hazard() options now refer to the modelling of λ(t).Note that it is the expected mortality rate at the event time that needs to be supplied to the bhazard() option.
Finally, all standard options of the ml suite in Stata can be used when fitting a stgenreg model, such as constraints() which allow the user to constrain the value of any coefficient to be a particular constant.

Predictions
A variety of predictions can be obtained following the estimation of a model.These include the hazard, survival and cumulative hazard functions.

Component Description varlist [, nocons]
The user may specify a standard variable list within a component section, with an optional nocons option.

g(#t)
Where g() is any user defined function of #t written in Mata code, for example #t:^2.

#rcs(options )
Creates restricted cubic splines of either log time or time.Options include df(int), the number of degrees of freedom, noorthog which turns off the default orthogonalisation, time, which creates splines using time rather than log time, the default, and offset(varname) to include an offset when calculating the splines.See rcsgen in Stata for more details.
#fp(numlist [,options ]) Creates fractional polynomials of time with powers defined in numlist.If 0 is specified, log time is generated.The only current option is offset() which is consistent with that described in #rcs() above.
Table 1: Description of each component that can be included in the linear predictor of a parameter.
The standard Stata syntax to obatin predictions following a model fit is as follows . predict newvarname, statistic So for example, to obtain the fitted survival, hazard and cumulative hazard functions . predict surv1, survival .predict haz1, hazard .predict cumhaz1, cumhazard Extended prediction options unavilable in standard software include: zeros -obtains baseline predictions, at() -obtains predictions at specified covariate patterns, timevar() -obtains predictions at specified times.These options can be combined with standard choices of hazard, cumhazard and survival.Finally, the ci option can be used to obtain confidence intervals.

Analysis of example datasets using stgenreg
We illustrate stgenreg through use of a dataset comprising of 9721 women aged under 50 and diagnosed with breast cancer in England and Wales between 1986 and 1990.The event of interest is death from any cause, with follow-up restricted to 5 years.Deprivation was categorized into 5 levels; however, we have restricted the analyzes to comparing the most affluent and most deprived groups, for illustrative purposes.We therefore only consider a binary covariate, dep5, with 0 for the most affluent and 1 for the most deprived group.
We further illustrate how to incorporate a time-varying covariate through use of a dataset of 488 patients with liver cirrhosis (Anderson, Borgan, Gill, and Keiding 1993).A total of 251 patients were randomized to receive prednisone, with 237 randomized to receive a placebo.Prothrombin index was measured repeatedly, with between 1 and 17 measurements per subject, resulting in 2968 observations.Outcome was all-cause death.

Weibull proportional hazards model
We begin by fitting a Weibull proportional hazards model to the breast cancer dataset, investigating the effect of deprivation status.Given that Weibull models are available in all standard statistical software, we first illustrate the concept showing that the estimates agree with estimates derived using analytically tractible definitions of the hazard and survival functions.
The baseline hazard and log hazard functions have the following form where X is a vector of covariates, with corresponding regression coefficients β.In this case it is convenient to use the loghazard() option of stgenreg.We can investigate covariate effects by including deprivation status in the linear predictor of log(λ), using the option ln_lambda.
When fitting models which rely on numerical integration, it is important to establish the stability of maximum likelihood estimates by using an increasing number of quadrature nodes.In the case of a Weibull proportional hazards model, we can both compare with the optimized model using streg in Stata, and compare with an increasing number of quadrature nodes.
Here we present results from fitting the streg model and stgenreg models with 15, 30, 50 and 100 nodes.

Restricted cubic spline proportional hazards model
We now introduce a much more flexible proportional hazards survival model, modelling the baseline log hazard function using restricted cubic splines of log(time).We formulate the baseline log hazard function where s(log(t)) is a restricted cubic spline function of log(t).This can be implemented by using the #rcs component option.We use the default knot locations, based on the centiles of the distribution of uncensored survival times.
This draws parallels with the flexible parametric model of Royston and Parmar (2002) An advantage of modelling on the log hazard scale is that when there are multiple time dependent effects, the interpretation of the time-dependent hazard ratios is simplified as they do not depend on values of other covariates, which is the case when modelling on the cumulative hazard scale (Royston and Lambert 2011).

Generalized gamma proportional hazards model
The generalized gamma (GG) is a 3-parameter parametric model implemented in a variety of statistical packages (Cox, Chu, Schneider, and Munoz 2007).However, it is parameterized as an accelerated failure time model in Stata.We can write the survival and density functions as and where γ = |κ| −2 , z = sign{log(t)−µ}, µ = γ exp(|κ|z), Φ(z) is the standard normal cumulative distribution, and I(a, x) is the incomplete gamma function.
Therefore using Equation 1, we can write down our baseline hazard function as the ratio of the probability distribution function to the survival function.

Time-varying covariates
We now illustrate the data setup required for survival analysis incorporating a time-varying covariate.We use the liver cirrhosis dataset described above.Here we use the enter() and id() options of stset in Stata, to declare the data as multiple record per subject.
Alternatively stgenreg can be used in conjunction with Stata's stsplit command, to create at risk time intervals.

Discussion
We have presented the stgenreg command in Stata, for the general parametric analysis of survival data.Through specification of a user-defined hazard function, we have illustrated how to implement standard proportional hazards models, novel restricted cubic spline survival models and a generalized gamma model with proportional hazards.In essence, stgenreg may be used to implement a parametric survival model defined by anything from a very simple one parameter proportional hazards model, to models which contain highly flexible functions of time, for both the baseline and time-dependent effects.Any parameter defined in the hazard function can be dependent on complex functions of time, including fractional polynomials or restricted cubic splines.
The choice of the number of quadrature nodes is left to the user.An increasing number of quadrature nodes should be used to establish consistent parameter estimates.
As it is a general framework, it may not be the most computationally efficient; however, it is a useful tool for the development of novel models.For example, it may be useful to develop ideas and test new models, but then spend time developing more computationally efficient methods for specific cases.
In future developments we aim to allow for interval censoring, the extension to incorporate frailty and a post-estimation command to calculate the cumulative incidence function for competing risks.The package is available from the Statistical Software Components archive (Crowther and Lambert 2013) and can be installed from Stata by typing ssc install stgenreg.

Figure 1 :
Figure 1: Predicted hazard function for the most affluent group with 95% confidence interval.

Figure 2 :Figure 3 :
Figure 2: Kaplan-Meier estimates for the most affluent and most deprived groups, with predicted survival overlaid.The figure on the left shows predicted survival with a proportional effect of deprivation status, with the figure on the right allowing for non-proportional hazard sin the effect of deprivatin status.
. stgenreg, loghazard([xb]) xb(pro trt | #rcs(df(3))) nolog Variables _eq1_cp2_rcs1 to _eq1_cp2_rcs3 were created Log likelihood As this is a complex function, we can use Stata's local macros to build up the function.