Fitting Additive Binomial Regression Models with the R Package blm

The R package blm provides functions for fitting a family of additive regression models to binary data. The included models are the binomial linear model, in which all covariates have additive effects, and the linear-expit (lexpit) model, which allows some covariates to have additive effects and other covariates to have logisitc effects. Additive binomial regression is a model of event probability, and the coefficients of linear terms estimate covariate-adjusted risk differences. Thus, in contrast to logistic regression, additive binomial regression puts focus on absolute risk and risk differences. In this paper, we give an overview of the methodology we have developed to fit the binomial linear and lexpit models to binary outcomes from cohort and population-based case-control studies. We illustrate the blm package’s methods for additive model estimation, diagnostics, and inference with risk association analyses of a bladder cancer nested case-control study in the NIH-AARP Diet and Health Study.


Introduction
Logistic regression is the default approach for studying how explanatory factors are associated with a binary outcome (Hosmer and Lemeshow 2000). In the logistic model, the log-odds are expressed as a linear function of the regression coefficients, and the model coefficients estimate adjusted odds ratios. In an additive regression model of binary data, the effects of covariates are linearly related to risk, and the model coefficients estimate adjusted risk differences. The binomial linear model (BLM) -the generalized linear model for the binomial family with an identity link -is one example (Cox 1970;Wacholder 1986). Despite the relevance of absolute risks and risk differences to epidemiology, finance, and other fields, few methods or software for absolute risk and risk difference estimation exist. As with survival data (Aalen 1989;Scheike and Zhang 2003), convenient tools for additive modeling of binary data have lagged behind tools for log-linear models because reliable estimation of additive models is technically more challenging (Austin 2010;Spiegelman and Hertzmark 2005;Newcombe 2006;Greenland 1987).
In this paper, we introduce the R (R Core Team 2013) package blm (Kovalchik 2013), available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=blm. The package provides methods to fit two types of additive regression models for binary data: BLM, a strictly additive model, and lexpit, a more flexible model that consists of additive and multiplicative effects, where multiplicative effects are modeled through an inverse-logit (expit) term (Kovalchik, Varadhan, Fetterman, Poitras, Wacholder, and Katki 2013). Sections 2.1.1 and 2.1.2 detail each model and their interpretation. Section 2.2 describes the data sets to which the models can be applied. The methods for estimation and inference are presented in Section 2.3 and Section 2.4. We overview the blm package in Section 3. In Section 4, we demonstrate the main uses of the package with risk association analyses of an NIH-AARP bladder cancer case-control study.

Binomial linear model (BLM)
Let Y τ be a Bernoulli random variable taking the value 1 if the event occurs within the time interval τ and 0 otherwise. Under the binomial linear model, the probability of an event is a linear function of a set of p time-independent covariates x, Under the BLM, each coefficient is the risk difference associated with a unit increase in the corresponding covariate, adjusted for all other covariates of the model. As a specific example, consider a model with a single covariate, x 1 , that is a zero-one indicator of exposure, P(Y τ = 1|x) = β 0 + β 1 x 1 . In this case, β 0 is the expected risk of an event in the unexposed, β 0 + β 1 is the expected risk for the exposed, and β 1 the excess risk due to exposure.

Linear-expit model (lexpit)
The lexpit model is a generalization of BLM, which incorporates a multiplicative component that is a function of q covariates z, In (2), expit(x) = exp(x)/(1+exp(x)) is the inverse-logit function. When there are no additive terms, P(Y τ = 1|x, z) = expit{z γ} becomes a conventional logistic model, which shows that this model is also a special case of the lexpit model.
In the lexpit, the intercept is included in the expit term so that the background risk of the model -the risk when all remaining covariates of z and x are zero -is expit{γ 0 }. Like the BLM, the additive coefficients of the lexpit estimate the adjusted risk difference measures of association for the corresponding covariate of x. The parameter exp(γ j ) is the odds ratio association between the residual risk P(Y τ = 1|x, z) − x β and the jth covariate z j . As with logistic regression, the exponentiated regression coefficient is the odds ratio association between two individuals with different z j exposure, fixing all other factors of the model.

Data
BLM and lexpit can be fit to binary data collected from a cohort study or from a populationbased case-control study with sufficient sampling information. In what follows, we assume that the binary variable of interest is based on an underlying time-to-event variable and represents the occurrence of event within a specified time interval τ , Y = I(T ∈ τ ). For each study type, the covariates (x 1 , x 2 , . . .) and (z 1 , z 2 , . . .) are the observed values at the start of the interval τ .

Cohort study
Given a cohort study of n observations, the outcomes for the additive binomial model are the y 1 , . . . , y n indicators of an event occurring during the time interval τ . The binary outcomes can be defined in terms of the corresponding time-to-event variables t 1 , . . . , t n and event indicators δ 1 , . . . , δ n , as y i = δ i I(t i ∈ τ ).

Population-based case-control study
A population-based case-control study identifies all cases of an event occurring in a welldefined population during a specified period of time τ . The population is divided into J strata, each consisting of N j individuals, and m j controls are sampled from each stratum with simple random sampling. In addition to case status y ij , each observation has a sampling weight, which is the inverse inclusion probability, w ij = N j /m j for controls and w ij = 1 for cases, assuming m j << N j for all j. In additive risk modeling, sampling information is needed to weigh-back to the underlying cohort and, thereby, obtain estimates for the absolute risk in the source population.

Estimation
Estimates for the parameters of the BLM and lexpit model are obtained by constrained maximization of a pseudo log-likelihood using a block relaxation algorithm (de Leeuw 1994). We describe the estimation methodology for the lexpit model. Fitting for the BLM is essentially equivalent to fitting a lexpit model with a constant expit term.
The estimates for the regression parameters Θ = (β, γ) are the solutions to the maximization problem,Θ with the constraints 1. Initialization. Set the intercept term and all other parameters to zero.
4. Iterate between Steps 2 and 3 until convergence. is the weighted sum of the log-likelihood components for binomial data, with each probability following the lexpit model, The solution to (3) would be a standard maximum likelihood problem if it were not for the constraint that all estimated probabilities of the model be within the (0, 1) range. The space F is termed the feasible region because it ensures the feasibility of all the fitted values of the model. Although any covariate patterns could conceivably be specified in F, our practice is to use an empirically-based region that is defined by the observed covariate patterns in the study sample.

Optimization algorithm
The constrained maximization procedure uses a two-stage block relaxation approach (de Leeuw 1994), which is summarized in Table 1. In the first stage, expit terms are considered fixed and the maximizing values forβ are determined with an adaptive barrier algorithm (Lange 1994(Lange , 2010 that, in the blm package, is implemented with the constrOptim function of the stats package. In the second stage, the linear terms are treated as fixed, using an offset term, and an iterative reweighted least squares algorithm with risk offset is used to updateγ. The block relaxation procedure is monotonic so convergence to a stationary point is guaranteed. Optimization for the BLM does not require Step 3, and there is no offset term (q ij = 0) in the updating of theβ. In this case, the intercept term is incorporated into the linear part and is initialized toβ 0 = i j w ij y ij / i j w ij .

Inference
Variances forΘ are estimated using an influence-based method. Several authors have previously described influence methods for variance estimation of complex survey statistics (Demnati and Rao 2010; Graubard and Fears 2005), and the influence operator is well-known for its use in the study of robustness (Hampel 1974). Further details of the influence function and its use with variance estimation are given by Deville (1999).
When the influence operator, ∆{.}, is applied to an estimator, it yields an estimate of the Gâteaux derivative and each component of this derivative is an analytic jackknife deviatethe estimated deviation in the estimator when one observation is omitted. The variation in the deviates generated by the influence operator can therefore estimate a statistic's variance in the same way as the deviates generated from jackknife resampling. In the case of the lexpit model, using the index k = (0, 1) to denote case status, the influences forβ are and where H(θ) is the second derivative of the objective function given in Equation 3 under the constraints F. Letting ∆ ijk {Θ} = (∆ ijk {β}, ∆ ijk {γ}), be the combined influences of the ijth observation on the parametersΘ, the variance estimate forΘ is with n jk the number of k types in the jth stratum and∆ .jk {Θ} the average influence over the n jk observations. The approximate large-sample distribution for (Θ−Θ) is M V N (0, Var(Θ)), and this result is the basis for the package's Wald tests and confidence interval construction.
When some fitted values are at the boundary of the feasible region (either 0 or 1), largesample normality may not hold (Self and Liang 1987;Andrews 2000). Since the boundary cases in lexpit affect individual fitted values, we believe standard inference should apply when the number of constrained observations is few. However, because standard inference is not guaranteed, active constraints should be closely monitored (as we describe in Section 4.1) and caution taken with the interpretation of the fitted model when active constraints are present.

Package description
3.1. Overview The blm package (Version 2013.2.4.4) consists of two model classes, blm and lexpit, supporting class methods, and additional functions to help diagnose the fitted model. Table 2 summarizes the main features of the package.

Model classes
The blm and lexpit are S4 class objects whose constructers and methods have been designed to emulate the lm class. The basic syntax for fitting a blm model with cohort data is where y~x is a formula and data is a data.frame. The syntax for the lexpit model has separate formulae for the linear and expit terms of the model lexpit(formula.linear = y~x, formula.expit = y~z, data) but its usage is otherwise the same as blm. The slots of the modeling objects, which can be accessed with the @ operator or the named method, contain a similar set of attributes as the lm class.  to the blm/lexpit classes. In addition to the class methods listed in Table 2, the slots include the initialization parameters of the algorithm (par.init), the log-likelihood for the fitted (loglik) and null model (loglik.null), and the barrier.value for the constrained optimization algorithm (barrier.value).
When the models are fit to population-based case-control data, the function call should also include a vector of weights containing the sampling weights for each observation in the data set and a factor for the strata argument, if the control sampling used stratification.

Application: Bladder cancer in the NIH-AARP Study
The NIH-AARP Diet and Health Study is the largest study of diet and health ever conducted (Schatzkin, Subar, Thompson, Harlan, Tangrea, Hollenbeck, Hurwitz, Coyle, Schussler, Michaud, Freedman, Brown, Midthune, and Kipnis 2001). Between 1995 and 1996, over half a million members of the American Association of Retired Persons (AARP) responded to a detailed questionnaire about their dietary habits and other health behaviors and all participants were followed for cancer incidence and mortality outcomes. Instructions for researchers interested in submitting a proposal to study the NIH-AARP Diet and Health Study cohort are available at http://dietandhealth.cancer.gov/resource.
The present analysis was based on a nested case-control study of bladder cancer within the NIH-AARP cohort. Cases were 292 study participants over the age of 60 years at the time of the baseline questionnaire who were diagnosed with bladder cancer (ICD-O3 C67.0-67.9) by age 70 years. Thus, the time interval of the analysis is τ = (60, 70]. 292 controls were randomly sampled from all individuals between ages 60 and 70 years at the time of the baseline questionnaire who at age 70 years had never been diagnosed with bladder cancer.

Gender, smoking, and bladder cancer
Relative risk analyses have previously suggested that gender and smoking are associated with the risk of developing bladder cancer (Freedman, Silverman, Hollenbeck, Schatzkin, and Abnet 2011). The first model fit examines the absolute risk differences for each gender and smoking-status subgroup.
R> library("blm") R> data("aarp") R> fit <-blm(bladder70~female * smoke_status, data = aarp, + weights = aarp$w) Here we fit a BLM with main effects for gender, smoking status categories, and each interaction using the pre-loaded data set aarp. The variable smoke_status is a factor with levels for Never, Curent, Former, and Unknown smoking statuses. The outcome variable bladder70 is a zero-one indicator of bladder cancer case status by age 70 years. The weights aarp$w are the sampling fractions for each observation, which are needed to weigh the risk estimates back to the underlying AARP cohort. Stratification was not used in this case-control study so strata is left to take its default NULL value.
The object fit is of the blm class. One of the methods for this class is coef, which can be used to extract the baseline risk and risk differences associated with each parameter. This shows, for example, that the baseline absolute risk of bladder cancer by age 70, the risk in the reference group of male never smokers, is 0.2 per 1,000 persons. The excess risk for male current smokers is 1.7 per 1,000, corresponding to an overall absolute risk for male current smokers is 0.2 + 1.7 = 1.9 per 1,000.
Both summary and confint can be used to assess the significance of the estimated effects.

R> summary(fit)
Estimate Std.Err t value Pr(>|t|) (Intercept) 1.9468e-04 3.4955e-05 5.5694 3.925e-08 *** female 6.2155e-04 1.7731e-04 3.5054 0.0004915 *** smoke_statusFormer 3. The significance levels of summary are based on a Wald test. The confidence intervals for confint are at the 95% level and are constructed with a large-sample approximation based on Student's t distribution. Both methods of inference suggest that the main effects of gender, and former and current smoking status are significant risk factors for bladder cancer.
To obtain the fitted values for each covariate of interest, we can use the predict method. When predict is supplied with the fitted blm, it returns the fitted absolute risk for each observation of the data frame used in the model's estimation. One can also provide a data frame with the newdata argument to compute fitted values for any covariate pattern of interest. The inclusion of standard errors is specified by the logical argument se. In the following code, we create a data frame containing the eight possible covariate types for the gender and smoking model and obtain fitted values and standard errors for these risk types.

R> all.vars(model.formula(fit))
[1] "bladder70" "female" "smoke_status" R> risk.types <-unique(subset(aarp, select = all.vars(model.formula(fit)))) R> risk.types <-subset ( Three functions for assessing the fit of the model are which.at.boundary, logLik, and Rsquared. The method which.at.boundary provides a matrix of covariate patterns whose predicted risks are at the boundary of the feasible region (0 or 1) according to a specified criterion. The default criterion is a risk within 1e-6 of the lower or upper bounds of this region. Although not a direct assessment of fit, the evaluation of the number and types of boundary cases can be indicative of a poorly specified model and each of these observations should be treated like potential points of influence.
The logLik method returns an object of the class logLik and is registered with the stats4 package. Thus, the returned value can be used with applicable methods, such as AIC. However, when the blm or lexpit object is fit with weights, it is important to keep in mind that the returned value is a pseudo-log-likelihood. Although χ 2 testing does not necessarily apply to pseudo-log-likelihoods, the measures can still be useful for informal comparisons of improvement in fit between nested models, and the AIC for informal comparisons between nested and non-nested models, for example, between a blm and lexpit model fit to the same binary outcome.
The Rsquared method returns McFadden's pseudo unadjusted and adjusted R 2 statistics (McFadden 1974). Binomial regression models do not have equivalent measure for explained variation as the R 2 of logistic regression based on ordinary least squares (OLS) . Still, these measures that attempt to mimic the R 2 of OLS can be useful for comparing the fit between models that have been applied to the same data set, with better-fitting models having an R 2 value closer to 1.

R> which.at.boundary(fit)
No boundary constraints using the given criterion.

R> AIC(fit)
[1] 5493.509 There are no concerns regarding cases at the boundary. We have used the logLik method to obtain the pseudo-AIC of the model, which we can compare to any later extensions we consider. The low R 2 measures for the current model suggest that we have not greatly improved the fit of the model over a null model and an expanded model should be considered.

Mode of effects
We next consider some simple strategies for assessing the possible functional relationship between a continuous covariate and absolute risk. A graphical method provided by the blm package is the risk.exposure.plot. The risk.exposure.plot is a loess scatter plot of the unadjusted risk in subgroups defined by the covariate. The function crude.risk creates the data frame with the estimates of the crude risk in ordered bins defined by the covariate, which consists of overlapping groups of 20% of the supplied data set and a sliding window of 1% of the sample size. When the output of crude.risk is plotted with risk.exposure.plot, it provides a visual representation of the continuous relationship between absolute risk and the continuous covariate that is not influenced by any model assumptions.
In the code below, we use crude.risk to obtain a data frame of the unadjusted risk estimates for bladder cancer by age 70 by dietary fiber. This returns a data frame with the populationbased risk estimates, risk, and the mean covariate value in each overlapping subgroup, x.
R> risk.exposure.plot(object = risk, scale = 1000, las = 1, + col = "royalblue", pch = 19, ylab = "Crude risk (per 1,000)", + xlab = "Avg. Fiber Consumption (Centered)") Figure 1 shows the results of the plot of the crude risks. Because this gives a sense of the functional relationship between risk and the continuous covariate, it can be useful for guiding the choice of representation of the covariate in the blm or lexpit model. For dietary fiber, we see a general decline in risk with greater fiber consumption, but there is an increase in risk the intermediate range of consumed fiber. This suggests that a higher-order polynomial for fiber on the multiplicative scale may be more appropriate than a simple linear effect for fiber.
There appears to be a strong relationship between bladder cancer and fiber but of a nonlinear nature. We therefore expand the absolute risk model using lexpit regression. The linear term of the model will include the same gender and smoking effects as we specified with the BLM. The expit term will have a main effect for the continuous variable redmeat, while fiber consumption will have a linear and quadratic term, centering fiber consumed on the median value of the sixth category of the factor (fiber.centered). The following script fits the described lexpit model.
R> formula.linear <-bladder70~female * smoke_status R> formula.expit <-bladder70~redmeat + fiber.centered + + I(fiber.centered^2) R> fit <-lexpit(formula.linear, formula.expit, data = aarp, weight = aarp$w) The results of summary indicate that redmeat and fiber.centered are both significantly associated with bladder cancer but suggest that the quadratic term for fiber.centered might not be necessary. The risk.exposure.plot provided a means of looking at a continuous covariates possible functional relationship to the crude (unadjusted) risk. If we wanted to consider the functional relationship after adjustment for other covariates, we could use testing approach. A test for the inclusion of a factor in the linear or expit term can be done directly when more than one additional covariate is included in the expit term. When this is the case, the lexpit regression can include a linear and multiplicative term for the covariate of interest. Testing the significance of each term provides a comparative assessment of the strength of the information of each mode of effect. Fitting both linear and multiplicative terms is possible because the expit transformation removes collinearity between each term. The code below shows how to use this procedure for the variable fiber.centered. Both the linear and quadratic additive terms for fiber.centered are not significant. We therefore conclude that the simpler model with only multiplicative effects for fiber.centered may adequately describe the risk association for this dietary variable and bladder cancer.

R>
The overall fit of the simpler model fit can be assessed with Rsquared, EO, and the gof functions. We have already described Rsquared. The EO function computes the ratio of expected and observed counts and its 95% confidence interval within subgroups of a specified categorical factor. Ratios that are not significantly different from one indicate that the model has good internal (within the training data) calibration, while ratios significantly below (above) suggest that the model is under-predicting (over-predicting) for those subgroups. In the script below, we look at the internal calibration in groups defined by education level. In comparison to the BLM, the lexpit model has improved the pseudo R 2 and AIC measures of fit, and the model is well calibrated for all educational categories.

R> Rsquared(fit)
The function gof assesses the overall fit of the model. This function performs the Hosmer-Lemeshow goodness-of-fit test across deciles of risk. For cohort data, this statistic is compared to a χ 2 distribution, with large values suggesting a lack of fit. For case-control data, the function employs the adjustment proposed by Archer, Lemeshow, and Hosmer (2007) for use with weighted estimators.

R> gof(fit)
$ Smoking had the largest effect of all categorical risk factors. Among male members of the AARP over 60 years old, current smokers had a 1.5 per 1,000 greater risk (95% CI 0.06 to 3.05 per 1,000) of bladder cancer by age 70 than never smokers. Among women members, the excess risk increased by 2.7 per 1,000 as compared to male smokers, but this was not a statistically significant difference (95% CI -1.42 to 6.75 per 1,000). Gender was also associated with a greater risk of bladder cancer in never smokers. Female gender was associated with a significant excess risk of 0.4 per 1,000 risk (95% CI 0.08 to 0.78 per 1,000) of bladder cancer among never smokers. Terms from the 'Intercept' down of the confint output are variables in the expit term. The 'Intercept' is the logit of the background risk. The reference group for the fitted model was male never smokers, with no consumption of red meat, who 18 grams of fiber intake per day (the centering value). The risk of bladder cancer by age 70 for this subpopulation was 0.9 per 10,000 persons (95% CI 0.6 to 1.3 per 10,000). The remaining expit terms represent log-odd ratios conditional on all other factors in the model. Thus, for two individuals of the same gender, smoking status, and fiber intake, the person who consumed an additional one gram per day of red meat had a 2% greater odds (95% 1.4 to 2.6) of bladder cancer.

Summary
The R package blm provides easy-to-use tools to fit additive regression models for binary data from observational studies. The blm and lexpit models directly estimate absolute risks and adjusted risk differences for cohort and some case-control studies, making them an important addition to the statistician's toolbox. By complementing conventional multiplicative modeling, the tools of the blm package can help clarify how covariates affect a binary outcome.