Event History Regression with Pseudo-Observations: Computational Approaches and an Implementation in R

Due to tradition and ease of estimation, the vast majority of clinical and epidemiological papers with time-to-event data report hazard ratios from Cox proportional hazards regression models. Although hazard ratios are well known, they can be difficult to interpret, particularly as causal contrasts, in many settings. Nonparametric or fully parametric estimators allow for the direct estimation of more easily causally interpretable estimands such as the cumulative incidence and restricted mean survival. However, modeling these quantities as functions of covariates is limited to a few categorical covariates with nonparametric estimators, and often requires simulation or numeric integration with parametric estimators. Combining pseudo-observations based on non-parametric estimands with parametric regression on the pseudo-observations allows for the best of these two approaches and has many nice properties. In this paper, we develop a user friendly, easy to understand way of doing event history regression for the cumulative incidence and the restricted mean survival, using the pseudo-observation framework for estimation. The interface uses the well known formulation of a generalized linear model and allows for features including plotting of residuals, the use of sampling weights, and correct variance estimation.


Introduction
Approaches to event history modeling with covariates can be designated into three categories: nonparametric, semi-parametric, and fully parametric modeling. Under these three paradigms, the flexibility with which one can incorporate covariate information, as well as the estimands of interest, increases from nonparametric to fully parametric, but so do the assumptions that are required. Semiparametric Cox regression occupies the middle ground of modeling assumptions, having an unspecified baseline hazard but still allowing for multiple and continuous covariates (Cox 1972).
The vast majority of clinical and epidemiological papers with time-to-event data use hazard ratios as their primary estimand. We believe this is due to two things, tradition, and how easy Cox models are to estimate using standard statistical software. Fully parametric survival models require some understanding of the parameterization to interpret the results. The results of a Cox model however are familiar even to first year medical students, yet hazard ratios are commonly misinterpreted as relative risks (Sutradhar and Austin 2018). Furthermore, as was pointed out by Aalen, Cook, and Røysland (2015); Martinussen, Vansteelandt, and Andersen (2020), hazard ratios are difficult to interpret as causal contrasts in many settings. Nonparametric or fully parametric estimators allow for the direct estimation of cumulative estimands that do not condition on past survival, which are therefore more easily causally interpretable. However, incorporation of covariate information is limited to a few categorical covariates in the former, and interpretation of the coefficients directly is challenging in the latter.
With the term "cumulative estimand" we are referring to quantities that can be expressed as expectations of functionals of random variables that represent times to some event of interest that do not condition on past survival. This is in contrast to the hazard function, which is defined in terms of the probability of failing at a particular time conditional on surviving up to just before that time. Cumulative estimands that are commonly of interest are the probability of surviving beyond a particular time (the survivor function or survival), the probability of failing before a particular time (the cumulative incidence function or risk), and the restricted mean survival. What we call cumulative estimands are sometimes referred to as "marginal" to distinguish them from the hazard which is conditional on past survival. However, since we are interested in regression modeling conditional on covariates, we will use the term "cumulative" for clarity.
In many settings, the outcome of interest may be the time to failure due to one cause (for example, death due to cancer), while the remaining causes (death not due to cancer) would be considered competing causes or competing risks. In the presence of competing risks, cumulative estimands include the probability of failing due to a particular cause before a fixed time (the cause-specific cumulative incidence, also called the subdistribution) and the restricted mean time lost (Zhao et al. 2018). In Cox regression, competing risks are often treated as censoring events, but these cumulative estimands are related to the cause-specific hazards of all of the causes, and hazards based on the subdistribution functions are even more difficult to interpret.
Contrasts of cumulative estimands, such as the difference in survival probabilities, have easier causal interpretations, and although there have been several methods suggested in the literature to model the effect of covariates on them, each has its limitations. An overview of fully parametric models is provided by Royston and Parmar (2002), yet these models have similar drawbacks as the Cox model and often require computationally complex post-estimation and standardization to describe covariate effects on cumulative survival quantities. The Fine-Gray model (Fine and Gray 1999) is touted as a model for the cause specific cumulative incidence function, yet the main output from that model are ratios of the hazards defined by the subdistribution functions which lack a useful interpretation in terms of an effect on the cumulative incidence (Austin and Fine 2017). Scheike, Zhang, and Gerds (2008) and Tian, Zhao, and Wei (2014) developed inverse probability of censoring weighted estimating equation methods for using covariates to predict the cumulative incidence probability and restricted mean event time, respectively. These methods, however, can be statistically inefficient since they omit the censored observations from the estimating equations, and can be difficult to use and model dependent because they require modeling of the censoring distribution.
Pseudo-observations, as introduced by Andersen, Klein, and Rosthøj (2003), can be used to fill this gap. Pseudo-observations are calculated for each individual in a sample based on jackknife values calculated using nonparametric estimators of cumulative estimands. It has been shown that using these pseudo-observations as the outcome (instead of the time and event indicator pair) of regression models provides asymptotically unbiased estimates of the associations of the covariates included in the model on the survival outcome of interest (Graw, Gerds, and Schumacher 2009;Jacobsen and Martinussen 2016;Overgaard, Parner, and Pedersen 2017). The key advantage is that they allow direct parametrization of covariate associations with cumulative survival quantities of interest at a single or multiple times points simultaneously.
Our goal is to provide an user friendly, easy to understand way of doing event history regression for cumulative survival estimands, using the pseudo-observation framework. We do this in our R package eventglm (Sachs and Gabriel 2022), using the well known formulation of a generalized linear model (GLM) or generalized estimating equations (GEE) that builds on and leverages the existing infrastructure for such models, specifically in the stats package (R Core Team 2021) and the geepack package (Halekoh, Højsgaard, and Yan 2006). In this paper we describe our implementation of pseudo-observation based approaches to event history regression in the R package eventglm, with the primary functions cumincglm and rmeanglm, highlighting the interpretation and useful properties of this approach. We use simulated data to evaluate and compare the performance of the different methods to handle covariate dependent censoring, and the different variance estimators in the single time point setting, thus informing the default choices for the methods. Example data analyses are illustrated on two datasets that are included in the package, showing how to use the package and interpret the output. We show how the package is set up so that it can be extended to accommodate new or different methods for pseudo-observation calculation. Finally, we compare the computational performance of our implementation to the existing approaches for calculating pseudo-observations. The package eventglm is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=eventglm.

Related work
A variety of regression models for time-to-event outcomes can be specified and estimated using different R packages, but our main focus is on regression models for the cumulative incidence and restricted mean survival. The primary infrastructure for the analysis of timeto-event outcomes in R is in the survival package (Therneau and Grambsch 2000;Therneau 2022). Regression models for the cumulative incidence function are available in the timereg (Scheike and Zhang 2011;Scheike and Martinussen 2006) and riskRegression (Gerds and Kattan 2021) packages, which use the direct binomial approach for estimation in addition to the Fine-Gray model for competing risks. The Cprob package (Allignol 2018;Allignol, Latouche, Yan, and Fine 2011) implements cumulative incidence regression using temporal process regression or the pseudo-observation approach. For computation of pseudo-observations, the pseudo (Pohar-Perme and Gerster 2017) and fastpseudo (Batten 2015) packages are designed specifically for that task, while the prodlim (Gerds 2019) package provides functions to do that for the cumulative incidence function. A Stata (StataCorp 2019) package (Overgaard, Andersen, and Parner 2015) and a SAS (SAS Institute Inc. 2013) macro (Klein, Gerster, Andersen, Tarima, and Perme 2008) exist for computation of pseudo-observations. With all of these packages that only compute pseudo-observations, it is left to the user to specify regression models, estimate them, and perform inference. This provides a great deal of flexibility, but also is a barrier to entry for less experienced users of statistical methods.

Notation and estimands
Let T i denote the time to event, δ i ∈ {1, . . . , d} denote the indicator of the cause of the event for d competing causes, and X i a vector of covariates for subject i = 1, . . . , n. We will use V i to denote transformations of (T i , δ i ) whose expectations represent summary statistics of interest. Specifically, we consider the following, where 1{·} denotes the indicator function that is 1 if the event in brackets is true, and 0 otherwise, • The cause specific cumulative incidence of cause k at time t * : • In the case where d = 1, the cumulative incidence (one minus survival) at time t * : • The expected lifetime lost due to cause k up to time t * : as was shown in Andersen (2013). • In the case where d = 1, the restricted mean survival up to time t * : Our main interest is in estimating the parameters of a generalized linear regression model for V i conditional on covariates X i : for some specified link function g.
The interpretation of the coefficients will depend on estimand of interest and the link function, which is specified using the link argument of cumincglm and rmeanglm. For a model of the cumulative incidence using the identity link (the default) g(x) = x, and for a single binary covariate, we have , called the risk difference. This is often of interest in medical studies to summarize the effect of a treatment or exposure. With the log link (link="log"), we obtain exp(β 1 ) = P(T i < t * | X i = 1)/P(T i < t * | X i = 0), the relative risk or risk ratio. If instead our outcome is the restricted mean, for the identity link the β 1 is interpreted as the difference in restricted means comparing X i = 1 to X i = 0. With the log link, we obtain the ratio of restricted means.
If odds ratios are of interest, then the link = "logit" option can be used for the cumulative incidence. Another interesting option is the link = "cloglog": g(x) = log{− log(1−x)}, the complementary log log link for the cumulative incidence implies proportional hazards. Thus models using the complementary log log link applied at various time points can be used to estimate hazard ratios (subdistribution hazard ratios in the competing risks case (Austin and Fine 2017)) and to assess the proportional hazards assumption .
Other options for link functions are probit, inverse, µ −2 , square root, and users can define custom link functions. It is not immediately clear what the interpretation of the regression coefficients would be in these cases, but they are possible because the specification is based on the quasi family. See the stats::family help file for more details on the possible link functions.
An important property of the effect measures derived from these models is collapsibility. To see this, consider another binary covariate Z i and the model If g is the identity or log and if Z i is independent of X i , then β 1 = β * 1 ; this is not generally true for other link functions (Neuhaus and Jewell 1993). Hazard ratios are not generally collapsible, which is related to the fact that the hazard conditions on past survival (Sjölander, Dahlqwist, and Zetterqvist 2016). Collapsibility is an important property in causal inference. This comes up when adjusting for covariates in randomized controlled trials, and when adjusting for measured confounders in observational studies. For more elaboration on this topic, see Andersen, Syriopoulou, and Parner (2017) and Daniel, Zhang, and Farewell (2021). Andersen et al. (2003) described a multivariate version of the model in Equation 1 for the cumulative incidence by considering a finite set of time points t 1 , . . . , t k , and the models of the form

Multiple time points
In this model, the interpretation of the covariate effect β 1 is similar to the models above, e.g., a risk difference or risk ratio, but assuming that the effect is the same at each of the time points included in the model. In the model in Equation 2, the intercept depends on time, but the covariate effect is assumed to be time-constant. The latter assumption can be relaxed to allow the covariate effect to depend on time as well. In that case, there will be one coefficient for each time point, each one representing the covariate effect on the outcome for that time. Models can also include a mix of time varying and time constant covariate effects. All of these types of models can be estimated using eventglm.

Estimation
We do not observe T i and δ i directly, but rather Y i = min{C i , T i } where C i is the censoring time, and ∆ i ∈ {0, 1, . . . , d} where where 0 indicates censoring occurred before any of the events. The collection of observations will be denoted Z 1 , . . . , Z n where Z i = (Y i , ∆ i , X i ), and are assumed to be independent and identically distributed.
If there were no censoring before time t * , then the V i are all observed for i = 1, . . . , n and the parameters could be estimated using standard methods. When that is not the case, the model can be estimated using pseudo-observations (Andersen et al. 2003), the computational methods for which we will describe in the next subsection. Let P i denote the pseudo-observation for subject i which will remain abstract for the moment. When the pseudo-observations are computed in a way that in large samples, this motivates the idea of estimating β in Equation 1 by solving the estimating equations for some specified variance parameter A i which corresponds to the variance function of stats::family or the working covariance matrix in the case of GEE. In our implementation, the estimating equations are solved using the glm.fit function with the quasi family, with variance = "constant", i.e., A i = σ. It was suggested in Andersen et al. (2003) that when the estimand is the cumulative incidence, efficiency gains could potentially be made by specifying the variance function as mu(1-mu). This, however causes a great deal more numerical instability so it is an area of future consideration. In the multiple time point models, GEE is used with the glm.fit estimates used as starting values and working independence covariance. The theoretical justification and precise conditions under which the solution to Equation 4 is consistent and asymptotically normal have been studied in (Graw et al. 2009;Jacobsen and Martinussen 2016;Overgaard et al. 2017;Overgaard, Parner, and Pedersen 2019). Andersen et al. (2003) developed the original approach using the leave one observation out jackknife. Let θ = E(V i ) denote the cumulative summary statistic of interest but marginal with respect to the covariates (i.e., ignoring the covariates) andθ an estimate of that quantity using all of the observations. The estimator is generally nonparametric, e.g., the Aalen-Johansen estimator (Aalen and Johansen 1978) of the cumulative incidence curve, or the Kaplan-Meier estimator (Kaplan and Meier 1958) of the survivor curve, though recently parametric estimators of the marginal quantities have been suggested (Nygård Johansen, Lundbye-Christensen, and Thorlund Parner 2020; Sabathé, Andersen, Helmer, Gerds, Jacqmin-Gadda, and Joly 2020). Letθ −i denote the jackknife estimate obtained by leaving the ith observation out of the sample and recomputing the estimate. Then the ith jackknife pseudo-observation is

Pseudo-observations calculation under independent censoring
When θ is the cumulative incidence and the estimate is based on the Aalen-Johansen estimator, then there are some computational tricks so that the estimator does not need to be rerun n times. This approach is implemented in the prodlim package (Gerds 2019) that we made some slight modifications to be more memory saving when there is a large dataset. In the case of the restricted mean, no such tricks are readily implemented and the Aalen-Johansen estimator is computed n times and integrated each time.
When P i is computed in this way based on a nonparametric estimator, a key condition required for Equation 3 to hold is that (T i , X i , ∆ i ) ⊥ C i . This says that censoring is independent of the event times and of all covariates in the model, called completely independent censoring. In that case, the solution to the estimating equations in Equation 4 yields consistent and asymptotically normal estimates of β.

Pseudo-observations under covariate dependent censoring
If we instead assume for some subset of covariatesX i that (T i , X i , ∆ i ) ⊥ C i |X i then we can use different approaches to computing the pseudo-observations that will satisfy Equation 3. WhenX i only contains categorical covariates with a finite set of combinations, Andersen and Pohar Perme (2010) suggested computing the jackknife P i separately for each combination of values inX i . This is implemented in our package and can be obtained using the model.censoring = "stratified" option.
IfX i contains continuous covariates, then we can model the censoring mechanism conditional on those covariates and use an inverse probability of censoring weighted (IPCW) marginal estimator. Modeling the censoring process conditional on covariates and using inverse probability of censoring weighted estimators was first explored in Binder, Gerds, and Andersen (2014). This was further developed in Overgaard et al. (2019) who showed that the property in Equation 3 holds for IPCW estimators of the cumulative quantity E(V i ). Specifically, let If the censoring mechanism G can be estimated consistently byĜ, then the property in Equation 3 holds for jackknife pseudo-observations based on the marginal estimator In practice, G is estimated using a regression model for the outcome of the time to censoring, that is, using (T i , B i ) as the outcome where B i = 1 if ∆ i = 0 and B i = 0 otherwise. In our package, we have implemented estimation of the probability of remaining uncensored at the observed time using a Cox model (Cox 1972) combined with the Breslow estimator of the cumulative hazard (model.censoring = "coxph") or using Aalen's additive hazards model (model.censoring = "aareg") (Aalen 1989).
This weighting approach (ipcw.method = "binder") is the default when model.censoring is "coxph" or "aareg". An alternative formulation is to use the estimator which is what would be implied as the solution to the first degree method of moments equations (Hájek 1971). This estimator is available using the option ipcw.method = "hajek". In our simulation study, we find similar performance with the (Binder et al. 2014) approach.
The covariatesX i are specified as a one-sided formula in the formula.censoring argument. If this argument is NULL, the default, then the same covariates as specified on the right side of the main formula are used. The covariates are required to be categorical if the "stratified" option is used. Otherwise, the censoring formula is just as flexible as in glm, allowing for interactions, transformations, splines, and more.

Variance estimation
Given calculations of the P i using one of the methods described above, and a specification of the regression model, including the link function g and function A i in Equation 4, one can then solve the estimating equations to obtainβ an estimate of β. If one were to make the usual assumptions of a generalized linear model, namely independent and identically distributed observations and correct specification of the mean and variance models, an estimate of the asymptotic variance ofβ would be the standard model-based variance estimator. This is available in the package by specifying the option type = "naive" in vcov, but it is not recommended. Since the P i are only approximately independent and identically distributed, Andersen et al. (2003) suggested instead using the robust sandwich variance estimator. The sandwich variance is the default that we use in the package (type = "robust" in vcov), by using the implementation available in vcovHC function of the sandwich package (Zeileis 2004(Zeileis , 2006. In our si mulation study, it was the clear superior approach. The option type = "cluster" uses the cluster-robust variance estimator of vcovCL, also in sandwich. With multiple time points, the robust variance estimator as implemented in geepack (Halekoh et al. 2006) is returned with the option type = "robust". (2016) argue that the remainder term in the Taylor series expansion used to justify the sandwich variance estimate does not converge to 0 quickly enough, and Overgaard et al. (2017) derived a variance estimator that accounts for this remainder for the cumulative incidence. Simulation studies therein showed that the Huber-White variance tends to be conservative and that small gains can be made by using Overgaard's corrected variance estimator for the cumulative incidence outcome. Overgaard et al. (2019) further developed similar theory for the inverse probability of censoring weighted estimators. The variance expressions are quite complex and will not be reproduced here, but they are implemented in the package (using type = "corrected" in vcov). This option is available for the cumulative incidence and survival for a single time point only.

Jacobsen and Martinussen
The nonparametric bootstrap can also be used to estimate the variance ofβ in which the pseudo-observations are recalculated for each bootstrap subsample (Efron 1992). We do not directly implement a bootstrap method for variance estimation, as existing tools in R can be used for that purpose, which we demonstrate in the example.

Main package functions and properties
The primary user-facing functions of the eventglm package are cumincglm and rmeanglm.
These are designed to be analogous to the stats::glm function, but for the cumulative incidence outcome or restricted mean outcome, respectively. A minimal call to either cumincglm or rmeanglm requires three arguments: formula, time, and data. The left hand side of the formula must be a call to Surv, which specifies a possibly right censored time-to-event outcome, with or without competing risks. Currently only right censoring is supported (not interval censoring nor left truncation), which means there can be only a single time variable provided to Surv. The Surv function is imported from survival and re-exported by eventglm for convenience's sake (so users do not have to use library("survival") or survival::Surv). Without competing risks, the event indicator in Surv will normally be 0 for censored and 1 for the event. With competing risks, the event indicator should be a factor whose first level indicates censoring, and the other levels indicating the possible event types. In the competing risks case the cause argument is also required, which specifies the failure type of interest as the factor level either as an integer or character value. In the absence of competing risks, the cumincglm has the option to specify survival = TRUE, which provides a model for the survival (one minus the cumulative incidence).
The time argument may be a vector of times for cumincglm and must be a single numeric value for rmeanglm, which specifies the time(s) t * at which it is of interest to model the cumulative incidence, survival, restricted mean, or expected lifetime lost. The times must be less than or equal to the largest observed event time in the sample. By default, when time is a vector, the model allows time varying intercepts but it is assumed that all covariate effects are not time varying. We provide the special term tve() that can be used in the right side of the formula to indicate that the covariate wrapped inside the term is assumed to be time varying. This is illustrated in the example below.
The data argument should be a data frame in which the variables specified in the formula can be found. The link argument determines the link function g in our notation, which is identity by default, and any value that is supported by the stats::quasi family can be used here.
Covariate dependent censoring can be handled using the argument model.censoring, which is "independent" by default, assuming completely independent censoring. Alternatives are "stratified", "coxph", or "aareg", and each of these three options require a specification of the relationship between censoring and covariates in the formula.censoring argument. If formula.censoring is left unspecified, the right hand side of the main formula is used, otherwise a one sided formula can be specified with the implicit outcome of the censoring time. Only categorical covariates may be specified with the "stratified" option.
Since the modeling framework is based on glm, all modeling features such as splines, quadratic terms, interactions, and contrasts that can be used in glm can be used in the eventglm versions by specifying them as usual on the right side of the relevant formulas. This is true for both the main formula and the formula for censoring. The remaining arguments are passed on to glm.fit, and are used in the same way here. A noteworthy argument is weights, which can be used to specify prior weights for the observations. These can be used to specify inverse probability of missingness weights, propensity scores weights for causal inference, or sampling weights. We illustrate the use of sampling weights for case-cohort sampling in the data analysis example.
In addition to the standard methods print and summary that detail the model fit, we provide many post-estimation features in correspondence with 'glm' objects. For example vcov, confint are used for inference with the argument type that determines the type of variance calculation ("robust" by default). Furthermore, predict, and residuals can be used for prediction of individual values and model checking of various kinds. The residuals in the cumulative incidence model are scaled by default according to the recommendations of Perme and Andersen (2008) The objects returned by cumincglm and rmeanglm inherit the classes 'pseudoglm', 'glm', and 'lm', so in addition to the methods we define, many more are available using existing infrastructure. The y element of the objects of class 'pseudoglm' returned by these functions contains the pseudo-observations and these can be used for other purposes without recalculating them again, such as estimating relative survival (Pavlič and Pohar Perme 2019).

Data analysis examples
R> library("survival") R> library("eventglm") The eventglm package includes two example datasets: • colon: Data from a clinical trial of adjuvant chemotherapy for colon cancer. Levamisole is a low-toxicity compound and 5-FU is a moderately toxic chemotherapy agent. There are only one record per patient that includes the time to death (or censoring). This is redistributed from the survival package, with a small modification to include only the death outcome.
• mgus2: Observational data from 1341 patients with monoclonal gammopathy of undetermined significance (MGUS). The outcome of interest is the time to plasma cell malignancy (PCM), with death as a competing risk, and censoring at the last month of contact. This dataset is redistributed from the survival package with an added competing risks event indicator.
To illustrate the basic concepts, Figure 1 shows

Overall survival in colon cancer
We can now do inference on the cumulative incidence of death and the restricted mean survival in the colon dataset using the eventglm package and the main functions that do the model fitting: cumincglm and rmeanglm. These functions resemble the glm function, with two key differences: the outcome is a call to Surv, and there is an argument time that specifies the fixed time point at which the cumulative incidence or restricted mean is of interest. First, we fit a regression model for the cumulative incidence, or one minus survival: Unlike survfit, it is trivial to perform inference using the summary or confint methods that we provide for objects of class 'pseudoglm'. We can fit another model using the log link to obtain estimates of the relative risks comparing the active treatment arms to the observation arm: R> colon.rr <-cumincglm(Surv(time, status)~rx, time = 2500, + data = colon, link = "log") R> br.ci <-coefficients(colon.rr) R> confr.ci <-confint(colon.rr) R> round(exp(cbind(br.ci, confr.ci)), 2) br.ci 2.5 % 97.5 % A key advantage of the regression approach is that it gives us the ability to adjust or model other covariates. In this example, since it is a randomized trial, all baseline covariates should be independent of treatment assignment. However, several of these variables are associated with time to death, so they can be used as precision variables. We would expect that adjusting for age, or the number of positive lymph nodes (more than 4) in the above models would reduce the standard error estimates of the treatment effects, without changing the coefficient estimates. Let us find out: In the above output, in addition to the intercept term plus the main effects of each of the times on the intercept, for each covariate wrapped in the special term tve, there is the interaction between each of the time points and that covariate. Thus, for example, the coefficient labelled factor(pseudo.time)500:rxLev is interpreted as the risk difference comparing the Levamisole along group to the Observation group at 500 days, adjusting for the other covariates in a time constant manner. In this model, we observe that the effect on survival, on the risk difference scale, of Levamisole plus 5-FU gets larger in magnitude over time, similar to what we can see in the Kaplan-Meier curves.

Modeling censoring
By default, we assume that time to censoring is independent of the time to the event, and of all covariates in the model. This is more restrictive than parametric survival models, or Cox regression, which only assumes that censoring time is conditionally independent of event time given the covariates in the model. We provide several options to relax that assumption using the model.censoring and formula.censoring options. The first is to compute pseudoobservations stratified on a set of categorical covariates, which assumes that the censoring is independent given a set of categorical covariates: R> colon.ci.cen1 <-cumincglm(Surv(time, status)~rx + age + node4, + time = 2500, data = colon, model.censoring = "stratified", + formula.censoring =~rx) Next, we can assume that the time to censoring follows a Cox model given a set of covariates. By default, the same covariate formula (right hand side) as the main model is used, but any formula can be specified. We can also use Aalen's additive hazards model instead of a Cox model for the censoring distribution. Then IPCW pseudo-observations are used (Overgaard et al. 2019). The two different weighting options ("binder", the default or "hajek") can be specified with the ipcw.method option.

Competing risks in plasma cell malignancy
The package works very similarly when there are competing risks. The key differences are that the event indicator in Surv is a factor with more than 2 levels and that the cause option is used to specify the cause of interest. The MGUS dataset has a number of covariates, and the time until progression to PCM, or death. Here the event PCM is of primary interest, with death being a competing event. We can get similar estimates to the marginal Aalen-Johansen estimates for the cumulative incidence of PCM at 10 years and the expected lifetime lost due to PCM up to 10 years with similar commands as above.
R> cumincglm(Surv(etime, event)~sex, + cause = "pcm", time = 120, data = mgus2) Call: cumincglm(formula = Surv(etime, event)~sex, time = 120, cause = "pcm", data = mgus2) Model for the identity cumulative incidence of cause pcm at time 120 Including other covariates in the model is done using the standard formula interface. Inspection of the diagnostic plots in Figure 2 reveals that a more complex model may by appropriate.
Predicted restricted means give a possible method to predict individual event times, while the predicted cumulative incidence should be probabilities. Note that with the identity and log links, the predicted cumulative incidences are not guaranteed to be between 0 and 1.
R> hist(predict(mgfitrmean, newdata = mgus2), + xlab = "Predicted lifetime lost due to PCM up to 120 days", + main = "") Parner, Andersen, and Overgaard (2020) describe how to fit regression models with pseudoobservations that account for case-cohort sampling. The basic idea is weighted estimating equations, which we can implement easily with the weights argument that gets passed to glm.fit. First let us create a case-cohort sample of the MGUS dataset by sampling the malignancy events with probability 0.9, and a random subcohort with probability 0.2.

Simulation study of statistical properties
We conducted a simulation study with the goal of determining which methods should be used as the defaults in our package. The key criteria are validity, as measured by type I error rates, bias, and confidence interval coverage, robustness to misspecification of the censoring mechanism, and statistical efficiency. Detailed descriptions of the simulation setup and results are in the Appendix, and code available in the replication script.
According to our simulation study, the stratified option works quite well even when the censoring model is misspecified, and the Aalen additive model tends to work better than the Cox model. Even when the censoring models are misspecified, either by omitting covariates or incorrectly assuming proportional hazards, some form of adjustment for covariate dependent censoring is an improvement over assuming completely independent censoring. There were no clear differences in terms of bias comparing the Binder versus Hajek weighting approaches. The robust variance estimator (the sandwich variance as implemented in the sandwich package) is the clearly superior approach for inference, with minimal bias and approximately correct confidence interval coverage coverage in all cases. Overgaard's corrected variance estimator has only marginal benefits over the robust estimator in a few cases.

Speed and memory comparison
Pseudo-observations can also be computed using the packages pseudo, prodlim (survival and cumulative incidence only), and fastpseudo (restricted mean only). The pseudo package does the job, but is not optimzed for speed or memory usage. The prodlim approach is optimized in C code, but cannot handle large datasets because it stores the jackknife values for every observed event time. We have optimized this code further in eventglm so that it only stores the jackknife values for the time of interest, thus it can be used for much larger datasets. We compare the speed of computing the pseudo-observations for the survival curve at a fixed time in Figure 4. These timing calculations were done on a laptop with an 9th generation Intel core i7 processor, and 8gb of RAM. Neither pseudo nor prodlim are able to handle a dataset with 100,000 observations, while eventglm can go at least an order of magnitude larger and in a reasonable amount of time.
The fastpseudo package uses only base R to efficiently compute pseudo-observations for the restricted mean survival, but does not handle competing risks. Upon inspection of the code during testing, it is clear that it also can only handle integer observation times, which is something that is not clearly documented in the package. Since the restricted mean does require computing the survival curve at all times less than the time of interest, the default method in eventglm has the same limitations as prodlim. However, the IPCW method only requires fitting a regression model for the time to censoring once, and then simply computing means, and thus can be applied to much larger datasets. The stratified option can also be used to improve computational efficiency as computing several sets of pseudo-observations on partitions of the data can be faster and a better use of memory than doing it once for a large dataset.

Extending eventglm
As of version 1.1.0, the argument model.censoring of cumincglm and rmeanglm refers to a function. This function is the workhorse that does the computation of the pseudo-observations that are later used in the generalized linear model. A number of computation methods are built in as "modules" that are contained in the source file called "pseudo-modules.R". As an example, consider the independent module: R> eventglm::pseudo_independent function(formula, time, cause = 1, data, type = c("cuminc", "survival", "rmean"), formula.censoring = NULL, ipcw.method = NULL) { margformula <-update.formula(formula, .~1) mr <-model.response(model.frame(margformula, data = data)) stopifnot(attr(mr, "type") %in% c("right", "mright")) marginal.estimate <- This function, and any pseudo-observation module, must take the same named arguments (though they do not all have to be used), and return a vector of pseudo-observations. Users can specify their own functions directly, or by name. Our built in modules all have the prefix pseudo_, and so if a name is given rather than a function, we search for functions with this prefix first, and if not found, without the prefix.

Example: Parametric pseudo-observations
Let us see how to define a custom function for computation of pseudo-observations. In this first example, we will fit a parametric Weibull survival model with survreg marginally and do jackknife leave-one-out estimates. This may be useful if there is interval censoring, for example.

Example 2: Infinitesimal jackknife
When the survival package version 3.0 was released, it became possible to get the influence function values returned from survfit estimation functions. These efficient influence functions are used in the variance calculations, and they are related to pseudo-observations.
More information is available in the "Pseudo-values" vignette of survival, which is under development at the time of writing. We can use this feature to create a custom function for infinitesimal jackknife pseudo-observations: R> pseudo_infjack <-function(formula, time, cause = 1, data, + type = c("cuminc", "survival", "rmean"), + formula .

Conclusion
Using the pseudo-observation approach, in comparison to Cox regression or fully parametric regression, can directly parametrize associations of interest between covariates and cumulative summaries of survival. This provides valid inference under similar assumptions as the Cox model, but easier interpretation of resulting coefficients, particularly when one is interested in causal effects. Given the advantages of the pseudo-observation approach, it is not surprising that there has been a great deal of development of statistical methods surrounding the estimation and inference based on them. However, we believe the barrier to this approach becoming as common as Cox regression is the lack of easy implementation. Our package enables the use of these methods with a user-friendly interface that will be familiar to even a beginning R user but, by leveraging existing infrastructure, allows for the flexibility and options advanced R users are expecting. For example, the objects returned by cumincglm and rmeanglm inherit from 'pseudoglm', 'glm', and 'lm', so in addition to the methods we define, many more are available using existing infrastructure available in such packages as stats, broom, splines, and many more (R Core Team 2021; Robinson, Hayes, and Couch 2022). Our hope is that pseudo-observation based survival regression will become as common as Cox models.

Future work
A GEE approach that allows for borrowing information across multiple time points was actually suggested initially (Andersen et al. 2003), although it was found that the robust standard errors used in GEE are not exactly correct Overgaard et al. (2017). Our future goal is to implement the standard error calculations that correctly account for both the correlated data and the pseudo-observation calculation when there are multiple time points. The key challenge is to design the interface with the appropriate balance of usability, understanding and flexibility.
In addition to new features such as goodness of fit statistics based on cumulative residuals as described by Pavlič, Martinussen, and Andersen (2019), we also plan to extend to additional estimands like unrestricted lifetime and the probability of being in state in more general multistate model settings, allowing for left-trunction or delayed entry. To this end, the modular approach we described in Section 7 allows the user to specify their own model.censoring function that takes as input the design matrix and outputs the vector of pseudo-observations that are used in the subsequent models. This allows further development and implementation of new methods in this area, such as the use of the infinitesimal jackknife and the use of flexible parametric models for interval censoring. This opens up a lot of possibilities for future extensions of eventglm, and will also make it easier to maintain.

A.1. Data generation
We generated datasets with competing risks according to Beyersmann, Latouche, Buchholz, and Schumacher (2009) as follows: We first generated a binary covariate Z as Bernoulli with probability 0.3, a normal random variable with mean 4 and standard deviation 4, and a log normal random variable with parameters 0 and 1: X 1 , X 2 . Then Q = (1, Z, X 1 , X 2 ). We used a proportional hazards Weibull distribution to generate the time data for k = 1, 2, with a hazard of: h k (t | Q) = γ k * (1/e (Q ⊤ ζ k ) ) γ k * t γ k −1 and a cumulative hazard given by: where Q is the vector of all covariates of interest in this order (1, Z, X1, X2), which then correspond to the cause specific vector of coefficients ζ k = (ζ 0 , ζ z , ζ x1 , ζ x2 ). The overall survivor function of the times to any of the events is then given by: We create overall survival times (times to any event) by inverting the CDF, one less the survivor, using the probability integral transform to obtain overall survival times, T ov . We then determine which of the event types a time belongs to by randomly generating from a Bernoulli with probability h m (T ov | Q)/(h m (T ov | Q)+h m ′ (T ov | Q)) and assigning event type 1 if 1 and 2 if 0. We then generate Weibull censoring times using the rweibull parameterization with shape parameter equal to e Q ⊤ α and scale parameter γ c . The intercept (i.e., first element) of α determines the amount of censoring, and whether the remaining coefficients are nonzero determines whether the censoring depends on covariates. When γ c = 1, the censoring times follow a proportional hazards model, and thus the Cox model for the censoring times is correctly specified.
We consider a binary covariate of main interest (with probability 0.3), and two continuous covariates, one normally distributed with mean 4 and variance 1, and the other log normally distributed with parameters 0 and 1. Intercept and shape parameters were determined so that the proportion of observations having the event of interest was approximately equally probable as the competing event before the time of interest and for varying amounts of censoring. We consider 3 different effects of covariates on the outcome of interest. In the null scenario, there is no association of any covariates with the event times. We additionally consider moderate and large effect sizes in combination will small effects of the continuous covariates. We allow for any (or none) of the covariates to be associated with censoring. Specific parameter values are given in the supplementary materials and as a companion R package (sachsmc/pseudoglm on GitHub) for running the simulation studies. Within each scenario, we consider different sample sizes, censoring rates, and strength of covariate effects on the censoring time.
For each scenario and simulation replicate, we fit regression models with the cause of interest at a fixed time modeled as a function of the binary covariate of interest, adjusted for the two continuous covariates. We did this for the cumulative incidence and the restricted mean at a fixed time for the identity and log link functions and compared the estimated coefficient for the binary covariate to the true coefficient. All of the available model estimation options were run and compared in the simulation study. We report a subset of the findings that are representative of the main conclusions, using a sample size of 500 observations, with 1000 simulation replicates.
The true values of the coefficients were determined by generated a very large sample of covariates Q, then calculating the corresponding true values of the cumulative incidence or restricted mean life time lost, and finally regressing those true values against the covariates using the link function. Samples large enough to achieve a precision of 1e-4 on the coefficient values were used. Code for reproducing the simulation study is available in the reproducibility materials as an R script.

A.2. Results
Under the null setting, where none of the covariates are associated with the outcome, we find that all of the methods are approximately unbiased and preserve the nominal type I error rate (data not shown). This holds regardless of whether or how strongly associated the covariates are with censoring (similar to what was found in the simulations study of Binder et al. (2014)). The more interesting results are where we find when and how the standard pseudo-observation method and the stratified method break down due to dependent censoring. In what follows, we present settings with samples of size 500 and the identity link. The patterns of relative performance of the methods for other sample sizes and link functions are similar.
In Table 1, we show the bias of the coefficient estimates for the cumulative incidence as a proportion of the true coefficient value and empirical standard deviation over the 1000 simulation replicates in a subset of the scenarios and with a subset of the methods. The beta.cens column shows the values of the three coefficients (binary, continuous 1, continuous 2) representing the strengths of the associations between the covariates and censoring. When there is a large amount of censoring (80%), the independent approach shows a large amount of bias. When the censoring depends only on the binary covariate (0.1, 0, 0), the stratified approach effectively removes that bias. When the censoring depends on all three covariates, the stratified approach is misspecified and thus biased, but the weighting methods (ipcw.aalen and ipcw.coxph) effectively decrease that bias. The true censoring model does not follow the proportional hazards model, and thus the ipcw.coxph approach is misspecified and is apparently less efficient but more effective at reducing bias as compared to the ipcw.aalen approach. Similar trends were observed with the restricted mean models.
Drilling down into the scenario with a large amount of covariate dependent censoring, we compare the different inverse probability of censoring weighting approaches in Table 2. The second column shows whether the censoring model follows proportional hazards or not, and the weighting column shows the weighting method used. All weighting methods exhibit similarly small amounts of bias, with no clear patterns emerging regarding degrees of bias or relative efficiency. It seems that the Binder approach to weighting combined with either Aalen's additive hazards model or the Cox proportional hazards model would work well in many plausible scenarios.
Turning now to the variance estimation, Table 3 shows the bias of the standard deviation estimation relative to the empirical standard deviation over the replicates, along with the 95% confidence interval coverage using the different variance estimates. The robust variance estimator (the sandwich variance as implemented in the sandwich package) is the clear winner here, with minimal bias and approximately correct coverage in all cases. The corrected variance estimator has marginal benefits over the robust variance estimator in some settings, that is, smaller variance but still correct coverage.    Table 3: Proportional bias of the estimated standard error relative to the empirical standard deviation and 95% confidence interval coverage.