FamEvent : An R Package for Generating and Modeling Time-to-Event Data in Family Designs

FamEvent is a comprehensive R package for simulating and modeling age-at-disease onset in families carrying a rare gene mutation. The package can simulate complex family data for variable time-to-event outcomes under three common family study designs (pop-ulation, high-risk clinic and multi-stage) with various levels of missing genetic information among family members. Residual familial correlation can be induced through the inclu-sion of a frailty term or a second gene. Disease-gene carrier probabilities are evaluated assuming Mendelian transmission or empirically from the data. When genetic information on the disease gene is missing, an expectation-maximization algorithm is employed to calculate the carrier probabilities. Penetrance model functions with ascertainment correction adapted to the sampling design provide age-speciﬁc cumulative disease risks by sex, mutation status, and other covariates for simulated data as well as real data analysis. Robust standard errors and 95% conﬁdence intervals are available for these estimates. Plots of pedigrees and penetrance functions based on the ﬁtted model provide graphical displays to evaluate and summarize the models.


Introduction
Family-based studies are efficient study designs, commonly used for linkage (gene mapping) and association (gene discovery) studies of both Mendelian and complex traits. Family-based designs, unlike population-based designs of unrelated individuals, are robust to population admixture and stratification that can distort disease-gene associations. Family-based design is also a very valuable approach to identify and characterize new pathogenic variants involved in complex human diseases through next generation sequencing technologies. Assessing such designs by simulation studies is an important aspect when planning a family study. We provide here a user-friendly R (R Core Team 2021) package for the simulation and estimation of time-to-event data under various family designs.
Nearly all of the currently available statistical software for time-to-event or age-at-onset data is only suitable for data collected under a random sampling scheme for independent individuals. The "Survival Analysis" task view (Allignol and Latouche 2021) available on the Comprehensive R Archive Network (CRAN) lists 255 CRAN packages related to survival data, including packages for estimating survival and hazard functions, fitting regression models, fitting of more complex models such as multistate models, and simulation. One exception is the coxme package (Therneau 2020) that can incorporate frailty terms to model family time-to-event data as well as to estimate the effects of other fixed covariates within a semiparametric proportional hazards (PH) framework. However, it has functional limitations compared with FamEvent (Choi, Kopciuk, He, and Briollais 2021): it does not simulate family data for age-at-onset outcomes, it adopts only a Gaussian distribution for the random effects, it does not provide age-dependent penetrance functions or various baseline hazard functions that retain the PH assumption. It also does not address important sources of bias including correction for non-random sampling and inferring missing genotypes and carrier probabilities. Drawbacks shared by both packages include limitations to right-censored data and the PH assumption. A specific shortcoming of FamEvent compared with the coxme package is its focus on fixed effects in the penetrance function estimation, so that mixed effects models are not yet provided. Although both packages can model family age-at-onset data, FamEvent provides substantially more functionality than the coxme package and corrects for two important sources of bias -sampling of families and missing data.
No other R packages that model family data, such as gap (Zhao 2007(Zhao , 2020 or pbatR (Hoffmann 2018), found in the Statistical Genetics task view (Montana 2021) provide functionalities for age-at-onset outcomes. On the other hand, some simulation programs have been proposed to simulate family or pedigree data, e.g., SimPed (Leal, Yan, and Müller-Myhsok 2005), SIMLA (Schmidt, Hauser, Martin, and Schmidt 2005), PBAT (Lange and Laird 2002), SIMLINK (Boehnke 1986) -a comprehensive list is available at https://github.com/gaow/ genetic-analysis-software/ -but none of them can handle time-to-event outcomes. Our R package FamEvent is therefore an original contribution that fills a gap in simulating complex time-to-event data in the context of family designs and genetic studies. The simulated data can mimic real data obtained in these types of family studies. In addition, the estimation methods in FamEvent address important features for age-at-onset data from several common family-based designs. Thus, FamEvent provides considerably more advantages with few drawbacks all within one R package.
In the R package FamEvent, we provide methods to generate and model age-at-onset outcomes for families that harbor a genetic mutation. We implement three common family-based designs -population, high risk clinic and multi-stage designs -along with ascertainment correction for the estimation of age-dependent penetrance functions, specifically adapted to the sampling scheme, using a prospective likelihood. We also handle missing genotype data by providing mutation carrier probabilities for family members with missing genotypes and estimating age-dependent penetrance functions via an expectation-maximization (EM) algorithm. Plot methods are available for simulated family data and for fitted penetrance models, respectively. To construct pedigree plots, we implemented a pedigree function and its plot method built on kinship2 (Sinnwell and Therneau 2020) into the plot.simfam function to graphically display the pedigree structures of specified families with indication of the proband and affection and mutation carrier statuses of all family members. When mutation carrier status is missing, carrier probabilities can be displayed instead. Following penetrance model estimation, the plot.penmodel function presents both parametric and non-parametric estimates and their confidence intervals for the penetrance functions specific to gender and mutation status groups; parametric age-dependent penetrance curves are estimated from the specified parametric penetrance model by using penmodel or penmodelEM functions and nonparametric Kaplan-Meier estimates of the penetrance curves are obtained by implementing the survfit function built on the survival package (Therneau 2021).
Our comprehensive R package that simulates and models family data will enable development of methods to identify additional risk factors, adjust for interventions and produce unbiased disease risk estimates. In Section 2 we describe the family-based study designs implemented in FamEvent followed by details on the penetrance function and its estimation in Section 3.
Methodological details on ascertainment-corrected likelihoods, an EM algorithm, robust variance estimation and disease gene carrier probabilities are given in Section 4. Details on the key functions from the FamEvent package are provided in Section 5 and four motivating examples, including a real data analysis, are given in Section 6. Concluding remarks are given in Section 7.

Family-based study designs
Family-based designs are popular for studying heritable genetic diseases because high risk disease genes are rare in the general population. Often multiple family members are carriers of and affected by a disease gene, and can be identified from disease registries or high risk disease clinics. The study designs for sampling family data considered in FamEvent include population-based, clinic-based and two-stage sampling designs, as described in Table 1. In population-based studies, an affected family member (proband) leads to selection of the family into the study; the probands can be randomly sampled from the disease population regardless of their mutation status (POP design) or from the diseased and mutation carrier population (POP+ design). In clinic-based studies, the families are selected from high risk disease clinics; selection of families is not only based on a single proband but involves other

Design
Description and ascertainment criteria POP Population-based design with affected probands whose mutation status can be either carrier or non-carrier. POP+ Population-based design with affected and mutation carrier probands. CLI Clinic-based design that includes affected probands with at least one parent and one sibling affected. CLI+ Clinic-based design that includes affected and mutation carrier probands with at least one parent and one sibling affected. Two-stage Two-stage sampling design that includes random sampling of families in the first stage and oversampling of high risk families in the second stage. affected family members (CLI design). The CLI design samples families with an affected proband and at least two affected family members whereas the CLI+ design samples families with an affected mutation carrier proband and at least two affected members. The two-stage sampling is a popular sampling design method for oversampling high risks families, where the high risk families are defined as the families with at least two affected members. In this design, families are sampled in two stages: the first sampling stage is based on the population-based study design and the second stage involves oversampling of high risk families.
For designing efficient studies, a two-stage family design can be used. In the first stage, case patients (i.e., probands) are selected and asked about their family disease history and then are stratified into different categories, e.g., high-, intermediate-and low-risks. In the second stage, case patients and their relatives are subsampled with different sampling probabilities that could depend on their risk category. For a fixed sample size, we can estimate the sampling probability for each stratum that minimizes the variance of the estimate of the parameter of interest. We illustrate a sample size determination for an optimal two-stage design in Section 6.3.
Families identified based on any of these study designs are not representative of the general population, since they tend to have higher disease risks from both genetic and non-genetic factors. Selection of families via each study design can lead to biased disease risk estimates, so adjustment for ascertainment is necessary. The ascertainment correction in the penetrance estimation is provided in Section 4.1.

Penetrance models
The risks of diseases arising from identified single or multiple genes often vary in the age at onset and are associated with individuals' gender and mutation status, where the agedependent disease risk is referred to as the penetrance. Penetrance in our R package is estimated using the cumulative distribution function given the age and gender of the individual for the disease or phenotypes associated with the gene of interest. A number of factors can impact penetrance such as mutation type, epigenetic factors, gender, modifier genes, etc. (Shawky 2014).
Parametric hazard regression models are implemented in penetrance studies as they can relate covariates, including genetic factors and gender, to the age-at-onset outcome. We first describe proportional hazard models, the most popular model for time-to-event data, which assumes no additional familial variations given the inherited disease gene. Additional variations to induce familial correlation due to unobserved genetic or environmental risk factors are modeled using two approaches: shared frailty model and two-gene model.

Proportional hazard models
Flexible functional forms are adopted for the baseline hazard function that retains the PH assumption and that also makes ascertainment correction easier.
The PH regression model with a baseline hazard h 0 (t) that relates an individual's mutation status G and other measured covariates X to the age-at-disease onset can be expressed as follows: where β X is the vector of regression coefficients for measured covariates X and β G is the regression coefficient for the genetic variable G. This PH regression model is used to estimate the probabilities of disease onset at age t via its cumulative distribution function (CDF) where S(t; G, X) = exp − t 0 h(v; G, X)dv is the survival function at age at disease onset t. Penetrance functions for variable age at disease onset are based on this CDF, which conditions on measured covariates including mutation status and gender (Wijsman 2005).
The baseline hazard h 0 (t) is usually unspecified in PH models for evaluating covariate effects. For the purpose of estimating survival probabilities, a parametric assumption of the h 0 (t) is made. The possible choices for the parametric baseline hazard distribution are Weibull, log-logistic, log-normal, Gompertz, gamma, and more flexible distributions, including the log-Burr and a piecewise constant baseline. The generalized log-Burr distribution allows a flexible baseline that includes the Weibull model (η → ∞) or the log-logistic model (η = 1) as special cases (Lawless 2003;Kopciuk et al. 2009). The Weibull model is quite flexible but does have a monotonic functional form of the hazard whereas the log-logistic specification does not. The hazard and cumulative hazard functions are summarized in Table 2.

Shared frailty models
The shared frailty model is used in conjunction with the PH model, where the frailty term acts multiplicatively on the baseline hazard function to describe the unknown common risks shared within family.
Let T f i denote the age at disease onset for individual i in family f and Z f > 0 be the frailty shared within family f . The shared frailty models can be expressed as: Distribution Laplace transform L(s) where h 0 (t f i ) is the baseline hazard function and X f i is a vector of covariates for individual i in family f and G f i is a genetic covariate indicating carrier status of a mutated gene.
As the frailty is an unknown quantity, the penetrance function is obtained by integrating over the frailty distribution, G(z), where the indexes i and f are dropped for simplicity of notation, where L(s) is the Laplace transform of the frailty distribution, H 0 (t) = t 0 h 0 (v)dv is the cumulative baseline hazard function and X and G are their covariates and mutation status of a gene, respectively.
The penetrance function is determined by the choice of baseline hazard and frailty distributions with given covariate values and regression coefficients. The possible choices of the baseline functions and the frailty distributions and their Laplace transforms are listed in Tables 2 and 3, respectively. For example, if Weibull baseline and gamma frailty are assumed, the penetrance function can be obtained as

Two-gene models
In the two-gene model, we suppose that in addition to a major gene, G 1 , families share a second gene, G 2 , that induces familial correlation. G 2 is considered as a covariate, that acts multiplicatively on the baseline hazard, but is completely unobserved. The two-gene model can be written, dropping indexes i and f , as: where G 1 and G 2 , respectively, indicate carrier (= 1) or non-carrier (= 0) status of the major and second genes.
Similarly, the penetrance function for the two-gene model is obtained depending on the choice of hazard function and the status of the second gene, However, as the second gene G 2 is unobserved, the penetrance functions can be obtained as a weighted sum over the two possible values of the second gene status.
where p(G 2 ) is the probability of the second gene status, which is determined by the assumed allele frequency, and G 2 takes values of 1 or 0, representing carrier or non-carrier, respectively.

Ascertainment correction
Assuming, without loss of generality, that the affected family member (proband) who led to selection of the family into the study is a disease gene carrier, then ascertainment correction for the selection process takes one of two forms. If the proband is randomly sampled from the population (POP or POP+ design), say through a disease registry, then ascertainment correction depends only on this individual. If the proband is selected from a high risk disease clinic (CLI or CLI+ design), then ascertainment correction involves other, possibly affected family members. In the prospective likelihood method, the ascertainment correction is based solely on the probability of individuals being affected before their age at examination (Choi, Kopciuk, and Briollais 2008).
. , X fn f ) as the vector forms that represent their phenotypes (disease outcomes), genotypes and covariates, respectively. The contribution of family f to the ascertainment-corrected prospective likelihood is and 0 otherwise. The numerator, regardless of family study design, assumes conditional independence of family members' phenotypes given their genotypes, and is specified as Ascertainment correction of family f from the population-based designs (POP or POP+) depends on the proband (p) in family f being affected before his or her current age at examination (a fp ), and hence, the denominator P (A f |G f , X f ) can be written as where G fp and X fp represents the proband's genotype and observed covariates in family f . For the clinic-based designs (CLI or CLI+), the ascertainment correction is determined by three additional family members -another affected sibling and at least one affected parent. By the conditional independence assumption of disease status given genotype information, the denominator for the clinic-based designs is given by where indices f p , f s , f f , f m represent the proband, proband's sibling, father and mother in family f , respectively.
In two-stage sampling, the ascertainment correction is based on the sampling weights derived from an inverse probability of sampling families, which are implemented into the composite likelihood as a weighted product of ascertainment-corrected likelihoods corresponding to each family (Choi and Briollais 2011;Lawless, Kalbfleisch, and Wild 1999). The likelihood contribution of n families sampled from two-stage sampling is written as where L f is the ascertainment-corrected likelihood for family f by a population-based design at the first stage and w f represents the sampling weight for family f at the second stage, which is obtained by the inverse probability of sampling family from the two stages of sampling.

EM algorithm for missing genotype data
In addition to ascertainment or selection correction, family members with phenotype (disease outcome) but no genotype information can be included via an EM algorithm. Given their family's observed genotypes and phenotypes, the probabilities of an individual's possible genotypes are computed in the expectation or E-step. In the maximization or M-step, the model parameters are estimated by maximizing a weighted log-likelihood. The conditional genotype probabilities found in the E-step form these weights.
The vector of genetic covariates in family f , G f , is partitioned into observed genotypes G o f and missing genotypes G m f . Both the measured covariates X f and phenotypes D f are fully observed, with the phenotypes subject to right censoring. Then the single-iteration EM is implemented as follows: E-step. Compute the possible genotype probabilities for each individual i with missing genotype in family f as , where g f i can take the value 1 or 0 to represent a carrier or non-carrier of the mutated gene, respectively, and their probabilities P ( can be obtained empirically from the family data or analytically from the assumed penetrance model. The empirical carrier probabilities (non-carrier probabilities just as the complementary probability) can be obtained with observed genotype and phenotype information for each subset of the data defined by D f , X f , G o f after excluding the probands. Based on the penetrance model, these carrier probabilities are obtained as the posterior distribution with the assumed or estimated allele frequency (as shown in Section 4.4). For individuals with known carrier status, their weights are one. Then, the conditional expectation of the log-likelihood function of the complete data given the observed data (D, X, G o ) can be written as a weighted log-likelihood which has the form where G f i is the set of all possible genotypes for individual i in family f .

M-step.
Maximize the weighted log likelihood to obtain the parameter estimates in the model.
No iteration between the E-and M-steps is necessary when the empirical carrier probability is used as the possible genotype probabilities only need to be calculated once in the E-step. Otherwise, the EM steps iterate until convergence.

Robust variance estimation
To account for familial correlation, as our penetrance model assumes conditional independence of the individuals in the family, the robust variance estimator of the parameter estimatesθ is provided in a 'sandwich' form (White 1982) as where I 0 (θ) is the observed information matrix obtained from The robust variance-covariance matrix then can be consistently estimated by evaluating the VAR(θ) at the maximum likelihood estimates.

Disease gene carrier probabilities
Mutation carrier probabilities for relatives with missing genotype information can be estimated using observed genotypes within the families or alternatively, with the addition of phenotype information. The carrier probability can be calculated based on only observed genotypes using Mendelian transmission probabilities or using data-driven probabilities empirically calculated from the aggregated data for each subgroup based on relation, proband's mutation status, mode of inheritance, disease status and disease-allele frequency in the population. It can be also obtained based on both observed genotype and phenotype information using the penetrance model fit.
The carrier probability for individual i conditional on the observed phenotype and carrier status of his or her family members is calculated by where G i indicates the carrier status of individual i and G o represents the observed carrier status in his or her family members, D i represents the observed phenotype (t i ; δ i ) of individual i in terms of age at disease onset t i and disease status indicator δ i (1 for affected individuals and 0 for unaffected individuals).

Package description
The R package FamEvent is available for download from CRAN. This package will appeal to users who want to simulate complex pedigree data for age-at-onset phenotypes for families Functions Description carrierprob Computes the carrier probability from observed genotype or phenotype data or from the penetrance model fit. fampower Computes the power of detecting genetic effect in the penetrance model based on a family-based simulation study. penetrance Estimates the cumulative disease risks (penetrances) and confidence intervals at given age(s) based on the fitted penetrance model. penmodel Fits penetrance models for complete family data. penmodelEM Fits penetrance models for family data with missing genetic information. penplot Plots the penetrance functions given the values of baseline parameters and regression coefficients and choices of baseline and frailty distributions. simfam Simulates family data. who carry a major gene and/or users who want to estimate disease gene penetrance functions using their own family data with correction for selection bias and missing genotype information. Plotting functions permit visual examination of individual pedigrees, the true penetrance functions and the estimated penetrance functions based on the fitted model. Mutation carrier probabilities for individuals with missing genotype information are estimated using information on family members genotypes and possibly phenotypes. The main functions used in FamEvent are summarized in Table 4 and their usage in practice is described in this section.
The package in R is installed and loaded in the usual way: R> install.packages("FamEvent") R> library("FamEvent")

Penetrance curves
The function penplot enables researchers to see the shape of the penetrance function and to choose appropriate penetrance functions for checking the performance of a penetrance model or planing a simulation study.
The shape of penetrance functions is determined by choosing the baseline hazard distribution (base.dist) with their parameter values (base.parms), the regression coefficient values for gender and major gene (vbeta) and the source of residual familial correlation (variation). Familial correlation can be induced by either a shared frailty (variation = "frailty") or a second gene shared within the family (variation = "secondgene"). The default is "none" implying that event times are independent given major genotypes. When variation = "frailty", the choice of the frailty distribution (frailty.dist) and the variance of the frailty distribution (depend) should be specified.
For example, the following function call will display the penetrance functions and return penetrance estimates by age 70 specific to gender and mutation-status, based on the Weibull baseline distribution with scale parameter, λ, set to 0.01, shape parameter, ρ, set to 3 and familial correlation induced by a shared frailty within each family which follows a gamma distribution with mean 1 and variance 1 that was specified by the argument depend = 1. We can also specify the minimum age of disease onset agemin for the penetrance function to start.

Family data generation
The simfam function generates data for all family members, including their age, gender, family relation, disease gene mutation status, and times to an event, based on the penetrance model associated with mutated genes and gender as we described in Section 3. The principles of generating family data were described in (Choi et al. 2008). Each family consists of three generations -two parents and their offspring whose number ranges in size from 2-5, one of whom is the proband. Each offspring has a spouse and their children whose number ranges in size from 2-5. The age difference between the second and third generations is assumed to be 20 years on average. Given the study design, the proband's mutation status is generated first and their age at onset generated conditional on the mutation status. Other family members' mutation statuses are determined based on the proband's status and then their ages at onset are generated. Finally, their affection status is determined if their ages at onset are before their current age. This procedure is repeated until the ascertainment criteria specified by the study design is satisfied.
Simulating families under the clinic-based ("cli" or "cli+") or the two-stage designs can be slower since the ascertainment criteria for the high risk families are difficult to meet in such settings. In particular, the "cli" design could be slower than the "cli+" design since the proband's mutation status is randomly selected from a disease population in the "cli" design, so his or her family members are less likely to be mutation carriers and to be affected, whereas when the probands are all mutation carriers ("cli+"), their family members have higher chance to be carriers and affected by disease. Therefore, the "cli" design requires more iterations to sample high risk families than the "cli+" design. All simulations that include variation = "frailty" could be slower in order to generate families with specific familial correlations induced by the chosen frailty distribution.

Argument Description N.fam
Number of families to generate. design Family-based study design. Possible choices are "pop", "pop+", "cli", "cli+", "twostage". variation Source of familial correlation. Possible choices are "frailty" for frailty shared within families, "secondgene" for second gene variation, "none" for no familial correlation given major genotypes. interaction Logical; if TRUE, allows the interaction between gender and major gene. depend Variance of the frailty distribution. Dependence within families increases with depend value. base.dist Choice of baseline hazard distribution. Possible choices are "Weibull", "loglogistic", "Gompertz", "lognormal", "gamma", "logBurr". base.parms Vector of baseline parameter values. vbeta Vector of regression coefficients for gender, major gene, interaction between gender and major gene (if interaction = TRUE), and second gene (if variation = "secondgene"). frailty.dist Choice of frailty distribution. Possible choices are "gamma", "lognormal" or NULL. mrate Proportion of missing genotypes; value between 0 and 1. hr Proportion of high risk families, which include at least two affected members, to be sampled from the two stage sampling (design = "twostage"); value should lie between 0 and 1. probandage Vector of mean and standard deviation of the proband's age. agemin Minimum age of disease onset. agemax Maximum age of disease onset. Popular hazard function distributions -such as Weibull, loglogistic, Gompertz, lognormal, gamma, or logBurr -are available to generate the baseline hazard distribution. Residual familial correlation can be created by incorporating a frailty term (variation = "frailty") with a choice of lognormal or gamma distribution or via a two-gene model (variation = "secondgene"). For the major and possibly second gene, users can specify if the genetic model is dominant or recessive (dominant.m for the major gene and dominant.s for the second gene) and their population allele frequencies (allelefreq) . Additional parameter option values can fix the proportion of missing genotypes (mrate) and the minimum (agemin) and maximum (agemax) age of disease onset.
Details of selected arguments for the simfam function are described in Table 5.
The following example shows the use of simfam function to generate 200 families using set.seed(4321) from the study design "pop+", where families are sampled based on affected and mutation carrier probands. The ages to disease onset are assumed to follow a Weibull baseline hazard distribution with the effects of gender and mutation status set at β s = −1.13, β g = 2.35, respectively. The familial correlation is due to a shared frailty following a gamma distribution with mean 1 and variance 1. The allele frequency of the major gene is assumed to be 0.02.
R> head(fam[fam$famID == 1, ], n = 7) famID indID gender motherID fatherID proband generation majorgene The data frame includes columns famID, indID, motherID, fatherID for family, individual, mother and father IDs, respectively; generation which takes values 1, 2, 3 or 0, where 0 indicates spouses; and gender indicating males (= 1) and females (= 0); majorgene indicates the genotype status of a major gene of interest, denoting 1 for AA, 2 for Aa and 3 for aa, where A is a disease-causing allele; ageonset is the age of disease onset generated by the penetrance model shown in Equation 1. However, we do not observe ageonset beyond the current age (currentage), so time takes the minimum value of ageonset and currentage and status indicates disease status at current age, i.e., 1 if the disease is observed by current age and 0 otherwise; mgene records the mutated gene carrier status derived from the major gene genotype, indicating 1 if carrier of disease gene, 0 otherwise; relation represents the family members' relationship with the proband as described in Table 6.

Relation Description 1 Proband (self) 2
Brother or sister 3 Son or daughter 4 Parent 5 Nephew or niece 6 Spouse 7 Brother or sister in law Table 6: Family relation code used in simulated data.
In addition, data include the family size fsize, the number of affected family members naff and sampling weight weight for each family. For example, the family with famID = 1 has 18 members and includes only one affected individual in addition to the proband. The individual with indID = 3 is the proband whose current age is 47 years old, he was affected (status = 1) at age 47 and is a mutation carrier (mgene = 1) with genotype Aa (majorgene = 2). Figure 1 also graphically displays the pedigree of this family, where the red colour indicates the proband, right shading indicates mutation carriers, non-shading for non-carriers, left filled symbol indicates affected by the disease and non-filled symbol for unaffected by the disease.
The output of simfam function is an object of class 'simfam'. The 'simfam' class has its own summary and plot methods: summary function prints the summary of generated data and plot function provides the pedigree plots of specified families with indication of family members' affection status and mutation carrier status. Examining several individual pedigrees is helpful for assessing the genetic transmission model, number of carriers and affected individuals within the simulated families.

Penetrance model estimation
The penetrance model is estimated with either the penmodel function for complete genetic data or the penmodelEM function in presence of missing genetic data using the EM algorithm. These functions provide the model parameter estimates with their conventional standard errors based on the Hessian matrix or robust standard errors based on the sandwich variance formula if robust = TRUE. The output of penmodel or penmodelEM is an object of class 'penmodel', which has a list with elements of model parameter estimates, their covariance matrix, standard errors, (or robust covariance matrix, robust standard errors if robust = TRUE is specified).
The corresponding program codes for fitting a proportional hazard model for complete or missing genetic data are: penmodel(formula, cluster = "famID", gvar = "mgene", parms, cuts = NULL, data, design = "pop", base.dist = "Weibull", agemin = NULL, robust = FALSE) penmodelEM(formula, cluster = "famID", gvar = "mgene", parms, cuts = NULL, data, design = "pop", base.dist = "Weibull", agemin = NULL, robust = FALSE, method = "data", mode = "dominant", q = 0.02) Both functions take the formula expression as used in other regression models with time-toevent data using the Surv function and covariates, name of the cluster variable (cluster), name of the genetic variable (gvar), initial parameter values including baseline parameters and regression coefficients (parms), family data set (data), as well as specification of the family study design (design), baseline hazard distribution (base.dist) options, allowing for the same or different choice of baseline hazard from the simulated data. In addition to the options for base.dist listed in Table 5, base.dist = "piecewise" fits a piecewise constant baseline hazard function with specified cuts that define the intervals where the hazard function is constant. A prospective likelihood that corrects for ascertainment is used with the type of correction depending on the family study design specified in the design argument.
When imputation of missing genotype is needed, the penmodelEM function implements the EM algorithm to estimate the disease gene carrier probabilities for family members who are missing this key variable. Two methods are available: if sufficient genotype information is available within every family, then carrier probabilities can be empirically calculated from the aggregated data for each subgroup based on generation and proband's mutation status (method = "data") otherwise they can be calculated based on Mendelian transmission probabilities by selecting method = "mendelian" with mode of inheritance (mode) specifying as either "dominant" or "recessive" and the allele frequency (q) as a value between 0 and 1.
The output of penetrance model fit includes the parameter estimates, their variance-covariance matrix, the corresponding standard errors (SEs), the log-likelihood and Akaike information criterion (AIC) values at its maximum value. In addition, robust SEs and 'sandwich' variancecovariance matrix are provided if robust = TRUE. Robust SEs could be smaller than the conventional SEs when the sample sizes are small or a parameter is estimated from effectively few non-zero covariate values (Fay and Graubard 2001). When sample sizes are small, the robust SE can be biased downward and some particular techniques can be used to correct this bias (see for example, Imbens and Kolesár (2016)). Conventional SEs could be considered instead although we do not expect a large difference from the robust SE.
The 'penmodel' class has its own print, summary and plot methods. The summary method returns and displays the model parameter estimates of transformed baseline parameters and regression coefficients with their SEs (or robust SEs if robust = TRUE), t statistics and corresponding two-sided p values. The plot method produces a graph of estimated penetrance functions for four subgroups by gender and mutation status in different colours along with Kaplan-Meier curves from family data used for fitting the model excluding probands, in order to naïvely correct for ascertainment. Thus, this plot displays both parametric and nonparametric estimates of penetrances from the minimum to the maximum age at onset, along with their 95% confidence intervals if conf.int = TRUE.
Continuing our example from Section 5.2, we fit the data set consisting of 200 families (data = fam) to a penetrance model for the right censored time-to-event outcome, Surv(time, status), with two covariates, gender and mgene, assumed a Weibull baseline hazard distribution and accounted for the family study design (design = "pop+") used to sample the families. We specified the name of the cluster variable as cluster = "famID", the name of genetic variable as gvar = "mgene" and the initial values of the baseline parameters, λ and ρ, as well as the gender and mutation effects on the baseline hazard function as parms = c(0.01, 3, -1.13, 2.35). The summary of the model fit is given below and the plot function generates the penetrance curves shown in Figure 2.

Age-specific penetrance estimation with confidence intervals
At given age(s) and fixed covariate values, the penetrance function provides penetrance estimates with 95% confidence intervals (CIs) and SEs of the penetrance estimates through Monte Carlo (MC) simulations of the estimated penetrance model. Provided a model fit from penmodel or penmodelEM, parameter estimates of both the transformed baseline parameters and regression coefficients along with their variance-covariance matrix are used as the inputs to a multivariate normal distribution. Based on these inputs, MC = n sets of parameters are generated for given age(s) and fixed covariates and their corresponding penetrance estimates calculated. For baseline parameter estimation, a log transformation is applied to both scale and shape parameters (λ, ρ) for the Weibull, loglogistic, Gompertz, gamma baseline distributions, to (λ, ρ, η) for the log-Burr distribution and to the piecewise constant parameters for a piecewise baseline hazard. But for the lognormal baseline distribution, the log transformation is applied only to ρ, not to λ, which represents the location parameter for the normally distributed logarithm.
Empirical estimates of the the 95% CIs at given age(s) are based on the 2.5th and 97.5th percentiles of the penetrance functions estimated from the n simulated data values. The SE of the penetrance estimate, also for given age(s), is calculated via the standard deviation of the n simulated penetrance functions. Both the 95% CIs and SEs are obtained for fixed covariates at the given age(s). For example, using the penetrance model fitted to the 200 families in Section 5.3, the penetrance estimates by age 50, 60, and 70 for male mutation carriers fixed = c(1, 1) can be obtained using 100 MC simulations (MC = 100) as follows:

Carrier probability estimation
The carrierprob function estimates mutation carrier probabilities for relatives with missing genotype information using observed genotypes within the families condition = "geno" or alternatively, with the addition of phenotype information, using condition = "geno+pheno". When condition = "geno", inputs for carrierprob include two methods of estimation: method = "mendelian" uses Mendelian transmission probabilities or method = "data" uses data-driven probabilities calculated from the aggregated data for each subgroup based on relation, proband's mutation status, mode of inheritance, disease status and disease-allele frequency in the population. When condition = "geno+pheno", method = "model" should be used to calculate the carrier probabilities based on both observed genotype and phenotype information, where the penetrance model should be specified by fit obtained from either the penmodel or penmodelEM functions.
The carrierprob function returns a data frame that includes the estimated carrier probabilities, named carrp.geno or carrp.pheno, appended after the last column in the family data set, indicating carrier probabilities based on observed genotypes only and those based on both observed genotypes and phenotype, respectively.

Penetrance estimation in presence of missing genotype data
The presence of missing genotypes is a common issue in genetic studies. The R package FamEvent can be used to generate and analyze family data with missing genotypes for the major gene. For example, we can generate families with 30% of missing genotypes, assuming a Weibull baseline function with scale and shape parameters of 0.01 and 3, β sex = 0.5 and β gene = 2. Given the parameter values in the Weibull model, the function penplot displays penetrance functions (not shown) and also returns the true penetrance values by age 70. In our situation, those are 0.868, 0.708, 0.240, and 0.153 in male carriers, female carriers, male non-carriers, and female non-carriers, respectively. For this example, set.seed(4321) was used.

Sample size and power calculation
Sample size calculation is a critical task when designing a new family study. In penetrance estimation studies, the goal is to collect enough families to detect a genetic relative risk associated for a known mutation or genetic variant with a specified statistical power and/or to estimate the penetrance function at a certain age in gene carriers with a certain precision.
We can use our R package FamEvent to perform the sample size calculation in these two situations.
For the first situation, we can construct the Wald test statistic as N (0, 1) under the null hypothesis, whereβ is the estimated log hazard ratio (HR) associated with a given mutation, β 0 its value under the null hypothesis (e.g., β 0 = 0) and SE(β) is a standard error estimate ofβ. For the one-sided test H 0 : β ≤ β 0 vs. H 1 : β > β 0 , the power of the Wald test is defined as: P (W > Z 1−α ) under the alternative hypothesis, where Z 1−α is the (1 − α) quantile of the standard normal distribution. The power function for β > β 0 can then be obtained from the asymptotic normality of the maximum likelihood estimator as where Φ(.) the cumulative normal distribution.
We can then perform some simulations with FamEvent package to determine the number of families needed to achieve a certain power. Since β and β 0 are fixed, simulating different family sizes will affect the standard error ofβ and hence the power of the test. For example, we simulated 50 families under the POP+ design, using a Weibull baseline function with shape and scale parameters of 0.01 and 3, a gender effect with β gender = 1 (i.e., HR = 2.72), a dominant model for the major gene with allele frequency of 0.02 for the minor allele and β gene = 1 (HR = 2.72).
We obtained a robust standard error forβ gene of 0.34. If we assume β 0 = 0 and a one-sided test with α = 0.05, the power of the Wald statistic to test for β gene > 0 is 89.8%.

Optimal designs
Designing efficient studies is an important aspect of family studies. We illustrate this problem by considering a two-stage family design. Typically case patients (i.e., probands) are selected in the first stage and asked about the history of disease in their family and stratified into To construct an optimal design we therefore determine some optimal weights for each stratum and then decide the optimal sample sizes accordingly (in our case, the number of families to include into the study). The optimal weighting problem was discussed by Lindsay (1988) who obtained optimal weights in a way that maximizes the information over a class of estimating functions. Let w be the vector of weights, S the vector of component scores and U the score function based on the full likelihood. Then, the optimal weights are obtained by minimizing with respect to w, and are given by with E(U S) = E(S 2 ) where S 2 denotes the vector whose elements are the squared elements of S and the variance VAR(S) is a block matrix where the size of each block depends on the size of the stratum.
Consider the problems of determining the optimal weights for estimating: 1) the log hazard ratio measuring the effect of a mutation on a time-to-event outcome, and, 2) the penetrance of the mutation by age 70. We simulated family data corresponding to the three risk categories (High, Med, Low) assuming that High corresponds to the CLI+ design, Med to POP+ and Low to POP. We simulated 150 families from each design. To simulate the family data, we used a Weibull baseline function with shape and scale parameters of 0.01 and 3, respectively, a gender effect with β sex = 1 (HR = 2.72), a dominant model for the major gene with allele frequency of 0.02 for the minor allele, a mean and standard deviation for the proband's age of 45 and 2.5 years, respectively, and a minimum age at onset for the disease of 20 years. We assumed β gene = 1.5, 0.8, and 0.5 for the High, Med and Low designs, respectively, and the associated penetrance as given in Table 8. The command lines to generate and fit family data are similar to the one described in Section 6.1.
We are first interested in an optimal design for the log HR estimates. The variances for the log HR estimates in high-risk, intermediate-risk and low-risk families are 0.13 2 , 0.21 2 and 0.52 2 , respectively (

Real data analysis
To analyze real data using penmodel or penmodelEM, the data should be prepared following the same data frame that the simfam function provides. The data frame should include column names famID, indID, motherID, fatherID, proband (coded 1 for proband, 0 for nonproband), gender (coded 1 for male, 0 for female), currentage, time, status (coded 1 for affected, 0 for unaffected), mgene (coded 1 for carrier, 0 for noncarrier, or NA for missing). When data includes a sampling weight, it should be named as weight in the data frame. Without this weight variable, all families will be equally weighted. In addition, agemin has to be specified by attr(data, "agemin") <-18, for example.
The package includes data named LSfam from 32 Lynch Syndrome (LS) families identified through the Ontario Familial Colorectal Cancer Registry (OFCCR) (Cotterchio et al. 2000). Lynch Syndrome is an autosomal dominant condition caused by several DNA mismatch repair (MMR) genes, predominantly MLH1 and MSH2, that predisposes carriers to colorectal cancers.
The OFCCR used the population-based Ontario Cancer Registry to identify incident colorectal cancer (CRC) cases (probands), aged 20-74, diagnosed from July 1997 to July 2000. Probands were screened for any MMR gene mutations. For each proband found to carry an MMR mutation, all first-and second-degree relatives of the proband's family were considered to be eligible for the study. The data set includes a total of 765 individuals. Excluding individuals without information on age at diagnosis or examination or disease status, n = 503 individuals are used for analysis including 32 probands and 471 relatives. The probands are all mutation carriers and of the 471 relatives, 60 are known mutation carriers, 62 are known non-carriers, and the mutation statuses of the rest are unknown. After loading the data, data("LSfam", package = "FamEvent"), summary.simfam(LSfam) and plot.simfam(LSfam) can be applied to provide a summary of the data and a graphical display of the pedigree structure for the first family, respectively. Using the LS family data, we aimed to estimate the effects of gender and mutation status on CRC risk and to provide age-specific penetrance estimates specific to gender and mutation status by taking missing genetic information into account. To begin, we specified the minimum age of onset as 18 years and used those whose age at onset or current age is greater than the minimum age into the analysis. What follows are two penetrance models fitted using baselines hazard distributions -Weibull and piecewise constant with cut points at c(30,40,50,60).

Conclusions
FamEvent is a comprehensive R package for simulating and modeling time-to-event data from family-based study designs. Family-based designs continue to be powerful approaches to study complex diseases with a genetic basis, even with increasingly low costs for whole genome sequencing. For example, a recent consortium for affective and psychotic disorders has been developed to identify genetic factors for mental illness (Glahn et al. 2019). Genetic factors are being identified from family studies in vastly different diseases, disorders and behaviors including the intergenerational transmission of divorce (Salvatore et al. 2018), Alzheimer's Disease (Beecham et al. 2017), human longevity (Yashin et al. 2018), and the impact of the rearing environment on children's behavior (Liu and Neiderhiser 2017).
Common issues encountered in family-based designs, regardless of the research domain, include missing genotype information on family members, selection of the families and residual correlation after conditioning on the major gene. Substantial bias in penetrance parameter estimates as well as underestimation of variability can occur without addressing these issues in the analysis of data. No existing R packages, such as gap, pbatR, or coxme, address the missing genotype information or selection bias in their methods. FamEvent is a versatile and user-friendly R package for simulating and fitting time-to-event data in complex pedigrees under various sampling designs. It assumes the segregation of a major gene with or without the presence of residual correlation due to a second gene or shared frailty. The simulated data can mimic real data obtained in many types of family studies. Mutation carrier probabilities for individuals with missing genotypes can also be estimated using information on the relatives' genotypes and possibly phenotypes using the carrierprob function in FamEvent.
Plotting functions permit visual examination of individual pedigrees, as well as the true and estimated penetrance functions while several summary and print methods provide the key parameter and penetrance estimates from fitted models or details of the simulated family data set. The power of detecting a genetic effect in the penetrance model based on a familybased simulation study is available in the fampower function. The FamEvent R package also includes data from 32 Lynch Syndrome families segregating MMR mutations selected from the Ontario Familial Colorectal Cancer Registry, including 765 relatives; these data will permit other users to evaluate their models or methods. This package addresses important features of age-at-onset data from common family-based designs, generates data that mimics real family data, and provides important tools for investigators planning family-based studies or analyzing their corresponding data.
Future extensions will include more sophisticated functions for penetrance estimation as well as the simulation of time-to-event data in the context of familial sequencing studies. For instance, we have started to use FamEvent in combination with our other R package sim1000G (Dimitromanolakis, Xu, Król, and Briollais 2019), which simulates genetic variants according to the 1000 Genomes data. The penetrance function can then depend on a multi-allelic genetic marker, where the marker can be composed of rare or common genetic variants or a combination of both. This is particularly useful to simulate a pattern of familial aggregation of age-at-onset outcomes given a complex genetic architecture. For example, we recently used FamEvent to simulate a familial aggregation of age at colorectal cancer onset in Familial Colorectal Cancer Type X families to investigate the type of genetic architecture that could explain this familial aggregation (Choi et al. 2019). FamEvent was also used in combination with sim1000G to assess the power of rare variant association tests in family designs for time-to-event data (Dimitromanolakis et al. 2019) and to simulate sister pair data under early age at onset ascertainment, where the genetic model reflects a scientific hypothesis of locus heterogeneity (Romanescu, Green, Andrulis, and Bull 2020). This demonstrates that FamEvent can have a broad range of applications in complex genomic studies and we anticipate more to come. This will motivate our future extensions of FamEvent.