logbin : An R Package for Relative Risk Regression Using the Log-Binomial Model

Relative risk regression using a log-link binomial generalized linear model (GLM) is an important tool for the analysis of binary outcomes. However, Fisher scoring, which is the standard method for fitting GLMs in statistical software, may have difficulties in converging to the maximum likelihood estimate due to implicit parameter constraints. logbin is an R package that implements several algorithms for fitting relative risk regression models, allowing stable maximum likelihood estimation while ensuring the required parameter constraints are obeyed. We describe the logbin package and examine its stability and speed for different computational algorithms. We also describe how the package may be used to include flexible semi-parametric terms in relative risk regression models.


Introduction
Logistic regression is commonly used for the analysis of binary outcome data, and has good computational properties deriving from the fact that it uses the canonical link function for a binomial generalized linear model (GLM). The resulting effect measure is the adjusted odds ratio, which has attracted criticism because it is difficult to interpret and communicate accurately (Holcomb Jr, Chaiworapongsa, Luke, and Burgdorf 2001). Studies by Forrow, Taylor, and Arnold (1992) and Lacy et al. (2001), in the context of medical statistics, have shown that clinical decision-making can be heavily influenced by the choice of effect measure.
The odds ratio is the only estimable measure of effect in case-control designs, and approximates the relative risk (or prevalence ratio) if the event probability is low. However, the magnitude of the odds ratio always exceeds that of the relative risk, and misinterpretation can result in the exaggeration of effect sizes in prospective or cross-sectional studies of common events (Davies, Crombie, and Tavakoli 1998).
For these reasons, many authors have advocated the use of relative risk regression instead of logistic regression (e.g., Sackett, Deeks, and Altman 1996; Grimes and Schulz 2008). But computation of the maximum likelihood estimate (MLE) using standard software can be difficult, leading to the development of several methods that produce alternative estimates of adjusted relative risks (e.g., Schouten et al. 1993; Deddens, Petersen, and Lei 2003; Zou 2004). Although these methods generally have good properties, in a comparative study Marschner (2015) concluded that the MLE has some efficiency advantages, provided that it can be computed.
Over recent years there has been discussion and development of various approaches for maximum likelihood estimation which avoid the need to use approximate methods when the standard methods fail (Lumley, Kronmal, and Ma 2006; de Andrade and Carabin 2011; Marschner and Gillett 2012). The R (R Core Team 2018) package logbin (Donoghoe 2018) presented in this paper implements a number of different methods for computation of the MLE and is designed to address the calls of Lumley et al. (2006) and Marschner (2015) that such methods be made readily available in standard software.
In Section 2, we introduce the log-binomial model that is used for relative risk regression. In Section 3, we describe the logbin package and briefly outline the algorithms that underlie its main function, logbin. In Sections 4 and 5 we discuss the stability and speed of each approach, demonstrating each with simulated data. In Section 6 we highlight the potential effect of the choice of parameter space that is considered by different methods, and in Section 7 we describe and demonstrate an extension that allows the inclusion of semi-parametric components in the regression function through the logbin.smooth function.

Maximum likelihood estimation for the log-binomial model presents two computational problems. The first is that the log link is not the canonical link function for the binomial GLM, meaning that the maximum likelihood estimate may be a repelling fixed point of the iterative algorithm. This issue can be addressed by a simple line search modification to the Fisher scoring algorithm, which has been implemented in R by the glm2 package (Marschner 2011), and is used by both Stata's and SPSS's default GLM-fitting algorithms (StataCorp. 2015; IBM Corporation 2016).
The second problem arises because parameter constraints must be imposed so that the fitted event probabilities do not exceed 1. This does not have a simple solution in the Fisher scoring framework: although step-halving can be used to prevent intermediate estimates from moving outside the parameter space, Lumley et al. (2006) point out that the MLE may not be reached if the boundary of the parameter space is almost perpendicular to the gradient of the log-likelihood.
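For concreteness, the model and the constraints at issue can be written as follows. This is a sketch in standard GLM notation, consistent with the parameter space defined later in the paper; it is not a reproduction of the article's original displayed equations.

```latex
% Log-binomial model: log-link binomial GLM for outcome Y_i with
% covariate vector x_i (including an intercept term)
\log p(x_i; \theta) = \theta^\top x_i, \qquad
Y_i \sim \mathrm{Binomial}\bigl(m_i,\ p(x_i; \theta)\bigr),
% Implicit parameter constraints that keep fitted probabilities valid:
\qquad \Theta = \{\theta : \theta^\top x \le 0 \text{ for all relevant covariate vectors } x\}.
```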

The logbin package
The logbin package provides an R interface to perform maximum likelihood estimation for log-binomial regression models, using different algorithms. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=logbin. In Section 3.1 we describe the syntax for the main function logbin, giving an example for a particular model. In Section 3.2 we outline the various computational methods that are available in the package, and in Section 3.3 describe the object returned by the logbin function.

Input arguments and example
The main function in the logbin package is logbin, which has the following usage:

logbin(formula, mono = NULL, data, subset, na.action, start = NULL,
  offset, control = list(...), model = TRUE,
  method = c("cem", "em", "glm", "glm2", "ab"),
  accelerate = c("em", "squarem", "pem", "qn"),
  control.method = list(), warn = TRUE, ...)

The arguments are mostly identical to those used by the glm function, except that family does not need to be specified by the user. As with glm for a binomial model, logbin allows the response part of the formula to be either a binary vector or a two-column matrix with the columns giving the number of events and non-events.
The method argument selects the computational approach that will be used to find the maximum likelihood estimate. These are each described in Section 3.2. Argument mono allows the user to specify covariates that should be constrained to have non-negative coefficients, a feature that is restricted to the "cem" and "em" methods. Argument accelerate is also relevant only for "cem" and "em", and potentially provides additional speed by using features of the turboEM package. More details are provided in Section 5.1.

Example data
The ASSENT-2 study (ASSENT-2 Investigators 1999) was a randomized clinical trial conducted in 16,949 patients who had experienced a recent heart attack. The primary outcome of the study was 30-day mortality, and the comparison between the randomized treatments (tenecteplase and alteplase) showed a difference that was within the prespecified criteria for equivalence on both absolute and relative scales.
Four baseline characteristics of each patient were recorded, each in three categories: age (<60, 60-75, >75 years), heart attack severity (mild, moderate, severe), time from heart attack to treatment (<2, 2-4, >4 hours) and geographical region (Western countries, Latin America, Eastern Europe). The cross-tabulation of this data with the outcome is included in the glm2 package as dataset heart.
We will demonstrate the use of the logbin function by estimating the adjusted relative risk of death associated with the four baseline characteristics. The general code we will use to fit the model is

R> model.heart <- logbin(cbind(Deaths, Patients - Deaths) ~ factor(AgeGroup) +
+    factor(Severity) + factor(Delay) + factor(Region),
+    data = heart, start = rep(-0.2, 9))

Note that because we did not specify method in this call to logbin, the combinatorial EM algorithm (method = "cem") will be employed by default.

Computational methods
The main purpose of the logbin package is to provide an implementation of the combinatorial EM (CEM) algorithm described by Marschner and Gillett (2012) for stable maximum likelihood estimation in log-binomial regression. However, the package also provides an interface to use modified Fisher scoring (via the function glm, or package glm2) and adaptive barrier (via constrOptim) algorithms to fit the same model. Full details of these algorithms are provided elsewhere, but we give a brief overview of each approach below.

EM-type algorithms
The default method, selected by specifying method = "cem" in the call to logbin, implements the combinatorial EM algorithm described by Marschner and Gillett (2012). This approach is based on the fact that, because of the product probability structure of the log-binomial model, each observed Y_i can be viewed as being derived from a collection of unobserved latent binary outcomes. An EM algorithm (Dempster, Laird, and Rubin 1977) based on this latent variable model can be used to maximize the log-likelihood L(θ).
However, due to constraints imposed by the latent variable model, the maximization is performed only over a subspace of the parameter space Θ. This restriction can be overcome by considering a family τ of reparameterizations, each t ∈ τ of which can be used to define an EM algorithm that maximizes over a subspace Θ(t) ⊆ Θ, such that ∪_{t∈τ} Θ(t) = Θ.

Each of these EM algorithms converges to the constrained MLE over its subspace, θ̂(t) = arg max_{θ∈Θ(t)} L(θ). Due to the covering property of the subspaces Θ(t), one of these must be the global MLE. So the combinatorial EM algorithm (Marschner 2014) consists of applying the family of EM algorithms to obtain the set of constrained MLEs T = {θ̂(t) : t ∈ τ}, followed by a discrete optimization step to choose the constrained MLE that gives the greatest likelihood: θ̂ = arg max_{θ∈T} L(θ).
The number of parameter subspaces, and hence the number of EM algorithms that must be applied, is determined by the number of parameters in the model. Specifically, if there are A categorical covariates, where the ath has c_a levels, and B continuous covariates, the number of parameter subspaces defined by the CEM algorithm is c_1 × · · · × c_A × 2^B. Because the number of parameter subspaces grows exponentially with the number of parameters, an exhaustive search of all of the parameter subspaces may be prohibitive for large models. Marschner (2014) and Donoghoe and Marschner (2016) have described some strategies that may be used to improve the computational efficiency of the CEM algorithm.
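To make the growth of the search space concrete, the count above (our reading of Marschner and Gillett 2012: one subspace per choice of "maximal" level for each categorical covariate, and two per continuous covariate) can be sketched in a few lines. Python is used here purely as neutral pseudocode, since the package itself is written in R.

```python
from math import prod

def n_subspaces(cat_levels, n_continuous):
    """Number of parameter subspaces searched by the CEM algorithm:
    the product of the numbers of levels of the categorical covariates,
    times 2 for each continuous covariate."""
    return prod(cat_levels) * 2 ** n_continuous

# The heart data model: four categorical covariates with 3 levels each.
print(n_subspaces([3, 3, 3, 3], 0))   # 81

# The speed simulations later in the paper: J - 1 binary covariates,
# giving the 2^(J-1) subspaces quoted there, e.g., J - 1 = 15:
print(n_subspaces([], 15))            # 32768
```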
One strategy makes use of the fact that the log-likelihood of the log-binomial model is concave in θ, so that any stationary point must be a global maximum. logbin with method = "cem" implements this approach by stopping the search if the constrained maximum θ̂(t) is in the interior of its parameter subspace Θ(t). The interior is defined such that none of the parameters are within δ of the boundary, where δ is the argument bound.tol supplied to logbin.control. Care must be taken not to make δ so large that true interior points are thought to be on the boundary, or so small that true boundary points are numerically considered to be in the interior. The start argument can be used to specify the starting values of the parameter vector, which determines the parameter subspace that will be examined first.
Another strategy, implemented using method = "em", is that of parameter expansion. This was outlined by Donoghoe and Marschner (2016), and amounts to adding an auxiliary parameter that allows each of the component latent variable models that make up the CEM algorithm to be nested within a single overparameterized latent variable model. In contrast to the CEM algorithm, this only requires a single application of an EM algorithm, which maximizes the observed-data log-likelihood over the expanded parameter space. Appropriate identifiability constraints can then be applied to recover the MLE in the original parameter space.
Because of the natural constraints imposed by the underlying EM algorithm, it is straightforward to impose non-negativity constraints on any parameters using these EM-based approaches. The mono argument to logbin can be used to identify these parameters by either name or index. For categorical covariates, mono will ensure that the level-specific parameters are non-decreasing in the order determined by the factor levels.

Modified Fisher scoring
Fisher scoring, accessed by using method = "glm", is a general algorithm for maximum likelihood estimation. It is a Newton-type approach that uses the expected information matrix in approximating the gradient of the score function. Used for fitting GLMs, it can be implemented with an iteratively reweighted least squares (IRLS) algorithm (McCullagh and Nelder 1989), which is the approach used by R's glm.fit function, SPSS's GENLIN command and SAS's GENMOD procedure (SAS Institute Inc. 2013). The default option in Stata's glm command implements the Fisher scoring algorithm directly, but an IRLS algorithm can be selected by specifying the irls option.
When the canonical link function is used in a GLM, Fisher scoring coincides exactly with Newton's algorithm, which has guaranteed quadratic local convergence (Lange 2013, p. 293).
With non-canonical links, as in the log-binomial model, the fact that the expected information is always positive definite provides some stability by ensuring that the algorithm will always head in an ascent direction, but local convergence is not guaranteed without step-size modification (Lange 2013, p. 296).
The default GLM-fitting algorithms in SAS, SPSS and Stata each allow the option to perform Fisher scoring for a chosen number of iterations before switching to Newton's algorithm until convergence. Nevertheless, either algorithm can overshoot the MLE, and SPSS and Stata also implement step-halving if the likelihood decreases between iterations. This is the strategy used by the glm.fit2 method provided by the glm2 package in R (Marschner 2011, 2018), which can be accessed for the log-binomial model by specifying method = "glm2" in the call to logbin. The log-binomial model also requires that fitted event probabilities do not exceed 1, but Newton-type algorithms do not naturally impose any parameter constraints. In R, the glm and glm2 functions both employ step-halving to bring estimates back into the parameter space if the deviance becomes infinite or any fitted values are outside (0, 1) at any iteration. Although undocumented, the GENMOD procedure in SAS addresses the issue in the same way.
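The interplay between IRLS and step-halving can be sketched as follows. This is an illustrative Python implementation (not the package's R code) of Fisher scoring for a two-parameter log-link binomial model with grouped data, which halves the step whenever it would produce fitted probabilities of 1 or more, or would increase the deviance; the data values and function name are made up for the example.

```python
import math

def fit_logbinomial(x, y, m, beta=(-1.0, 0.0), max_iter=100, tol=1e-10):
    """Fisher scoring (IRLS) for log p_i = b0 + b1*x_i with grouped
    binomial data (y_i events out of m_i), plus step-halving."""
    def probs(b):
        return [math.exp(b[0] + b[1] * xi) for xi in x]

    def deviance(b):
        p = probs(b)
        if max(p) >= 1.0:
            return float("inf")   # outside the parameter space
        d = 0.0
        for yi, mi, pi in zip(y, m, p):
            if yi > 0:
                d += yi * math.log(yi / (mi * pi))
            if yi < mi:
                d += (mi - yi) * math.log((mi - yi) / (mi * (1 - pi)))
        return 2 * d

    dev = deviance(beta)
    for _ in range(max_iter):
        p = probs(beta)
        # Working weights and response for the log link:
        # W = m*p/(1-p), z = eta + (y/m - p)/p.
        A00 = A01 = A11 = b0 = b1 = 0.0
        for xi, yi, mi, pi in zip(x, y, m, p):
            w = mi * pi / (1 - pi)
            z = math.log(pi) + (yi / mi - pi) / pi
            A00 += w; A01 += w * xi; A11 += w * xi * xi
            b0 += w * z; b1 += w * xi * z
        det = A00 * A11 - A01 * A01
        new = ((A11 * b0 - A01 * b1) / det, (A00 * b1 - A01 * b0) / det)
        # Step-halving: retreat towards the current estimate while the
        # candidate is invalid or the deviance has increased.
        while deviance(new) > dev + 1e-12:
            new = ((beta[0] + new[0]) / 2, (beta[1] + new[1]) / 2)
        if abs(dev - deviance(new)) < tol:
            return new
        beta, dev = new, deviance(new)
    return beta
```

For well-behaved interior problems the step-halving loop is rarely entered; it matters near the boundary, where a full Fisher scoring step can produce fitted probabilities above 1.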
The glm command in Stata does not perform any check of the validity of parameter estimates, and it is possible that it returns estimates that are outside the parameter space (Williamson, Eliasziw, and Fick 2013). By contrast, the binreg command implements the method of Wacholder (1986): If fitted values go outside (0, 1) at any iteration, they are replaced by a value inside the range, rather than changing the parameter estimates (Hardin and Cleves 1999). Thus this approach can also return parameter estimates that produce invalid fitted probabilities.

Adaptive barrier
The adaptive barrier approach (Lange 1994), chosen by specifying method = "ab", is a general method for convex optimization with constraints. The constrained problem is converted into an unconstrained problem by adding a logarithmic barrier term, which is infinite when the constraints are active, and is updated at each iteration. Within each iteration, the optimization can be performed using any method that allows infinite values for the objective function.
This approach is well-suited to maximum likelihood estimation of the log-binomial model because the negative log-likelihood is convex, as are the required parameter constraints, which are affine in θ. Neither SPSS nor Stata currently includes an inbuilt optimization routine that allows inequality constraints. The NLP procedure in SAS provides a wide range of constrained optimization methods, and has been used by Yu and Wang (2008) to fit a log-binomial regression model. In R, logbin calls upon the constrOptim function, an adaptive barrier algorithm that allows affine inequality constraints.
Because the gradient of the log-likelihood is known, the default optim method used for the inner iterations is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton algorithm. An alternative algorithm can be chosen, or any of the other optional arguments to constrOptim (which are passed to optim) can be set, by using the control.method argument of logbin.
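To make the barrier idea concrete, here is a minimal Python sketch (an illustration, not constrOptim itself) for a one-dimensional analogue: maximize the concave objective f(θ) = −(θ − 1)², subject to θ ≤ 0, whose constrained maximum sits on the boundary θ = 0. The logarithmic barrier term μ log(−θ) is infinite at the boundary; its weight μ is reduced at each outer iteration, and damped Newton steps solve each inner problem while keeping the iterates strictly feasible.

```python
def barrier_maximize(mu0=1.0, shrink=0.1, outer=8, inner=50):
    """Maximize f(theta) = -(theta - 1)^2 subject to theta <= 0 by
    maximizing f(theta) + mu * log(-theta) for decreasing mu."""
    theta = -1.0  # strictly feasible starting value
    mu = mu0
    for _ in range(outer):
        for _ in range(inner):
            grad = -2.0 * (theta - 1.0) + mu / theta
            if abs(grad) < 1e-12:
                break
            hess = -2.0 - mu / theta**2   # always negative: concave
            step = -grad / hess           # Newton step
            cand = theta + step
            while cand >= 0.0:            # damp to stay strictly feasible
                step /= 2.0
                cand = theta + step
            theta = cand
        mu *= shrink
    return theta
```

As μ shrinks, the inner maximizer (1 − sqrt(1 + 2μ))/2 approaches the boundary point θ = 0 from the feasible side, mirroring how the adaptive barrier lets the log-binomial MLE approach the constraint boundary without crossing it.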

Output and reporting tools
Having already used the default CEM approach at the end of Section 3.1, we can fit the model to the heart attack data with each of the other computational approaches using

R> model.heart.em <- update(model.heart, method = "em")
R> model.heart.glm <- update(model.heart, method = "glm")
R> model.heart.glm2 <- update(model.heart, method = "glm2")
R> model.heart.ab <- update(model.heart, method = "ab")

The MLE for this model is in the interior of the parameter space, and Table 1 shows the parameter estimates, deviance and convergence status of the fitted model using each approach. The two EM-type algorithms, the modified Fisher scoring of glm2 and the adaptive barrier approach all converged successfully in this case. As noted by Marschner and Gillett (2012), without the implementation of step-halving to ensure that the sequence of estimates has non-decreasing likelihood, the Fisher scoring algorithm used by glm enters a cycle of period 8 and never converges.
The object returned by logbin has class c("logbin", "glm", "lm"). The logbin package includes S3 methods so that the usual methods for 'glm' objects are also available for those returned by logbin. In particular, summary and print.summary return the usual output as well as reporting the small-sample corrected AICc (Burnham and Anderson 2002) and, for method = "cem", the number of EM iterations performed in the parameter subspace that contained the MLE.
When the MLE is on the boundary of the parameter space, the assumptions that underlie the usual method of estimating the covariance matrix by the inverse of the information matrix may not be valid. Unlike the summary method for 'glm' objects, in such a situation the summary method for 'logbin' objects returns a matrix of NAs along with an appropriate warning. This also affects the results from confint and vcov applied to a 'logbin' object.
The anova, confint, predict and vcov functions have all been modified so that they work well with 'logbin' objects. Other useful methods for 'glm' objects, such as extractAIC, residuals and simulate, will also perform as expected when used with a 'logbin' object.

Stability
In the previous section we saw an example of convergence problems that may occur when Fisher scoring is used for relative risk regression. In this section we consider this issue in more detail. The issue of numerical instability in relative risk regression has been discussed in detail by Marschner (2015). Many of the potential problems associated with Fisher scoring for non-canonical links can be avoided by using step-halving to force the deviance to decrease at each iteration, as implemented by glm2. However, this is not a guaranteed solution to the problem, particularly if the MLE is on or near the boundary of the parameter space. We now consider a numerical illustration of the problem within the Fisher scoring framework, and then investigate how other methods, available within logbin, can be used to overcome it.

Numerical illustration
As illustrated in the previous section, one type of convergence problem is a non-convergent iterative sequence. Another type of problem, which we illustrate in this section, is convergence to a suboptimal value. We illustrate the problem using a simple simulated dataset with 100 binary observations Y_i and a single continuous covariate x_i ∈ [0, 1], i = 1, . . . , 100, fitting a two-parameter log-binomial model log p(x_i; θ) = θ_1 + θ_2 x_i. The MLE in this data is θ̂ = (−0.00, −1.31), which is on the boundary of the parameter space since p(0; θ̂) = 1.

Figure 1(a) shows the path taken by glm from a starting estimate of (−1, −1). The lighter lines with red crosses show the first few occasions on which the Fisher scoring algorithm produced estimates outside the parameter space, and step-halving was invoked to find a valid estimate along the direction of the step. This occurred at every iteration, and although the Fisher scoring step approaches a direction that is parallel to the boundary, the steps along the boundary eventually become so small that convergence is declared at a suboptimal estimate. In this case, the failure to reach the MLE is not solved by glm2, which follows the same path; it also occurs from a large number of different starting estimates, and even if the convergence criterion is made stricter.

Figures 1(b) and (c) show the paths taken by the EM-type algorithms from the same starting estimate. Both methods converge stably to the MLE. They appear to have similar local convergence, but the parameter-expanded version converged faster overall because it followed a more direct route in the first few iterations. The main advantage of the parameter-expanded version is not evident here: Had we used a starting estimate with positive θ_2, the combinatorial EM algorithm would have converged to the constrained MLE with θ̂_2 ≥ 0, and we would need to apply the algorithm again in the other parameter subspace to find the global MLE.
The parameter-expanded version, on the other hand, requires just a single implementation from any starting estimate.
Finally, Figure 1(d) shows the trace of the adaptive barrier algorithm using the default options provided by the logbin function. This algorithm follows a similar path as Fisher scoring in approaching the boundary, but instead of seeking to move outside the parameter space, we can see the effect of the barrier term in forcing the estimates towards the interior, and the MLE is successfully found.

Simulations
In order to further explore the stability of the computational methods, we used each method to fit the model on 1000 simulated samples generated by taking parametric bootstrap replications from the simulated data discussed in the previous section.
The combinatorial EM, parameter-expanded EM and adaptive barrier algorithms all converged for every sample, although the EM-type algorithms required a large number of iterations to satisfy the convergence criterion in some cases. In one sample, the sequence of estimates from glm showed chaotic behavior and did not converge, but glm2 avoided this problem. Both methods reported convergence to the same estimate in every other sample.
Despite the modified Fisher scoring and adaptive barrier approaches reporting convergence, the resulting estimate was sometimes suboptimal, similar to what was observed in the original sample. The estimate returned by the EM-type algorithms had a greater likelihood than the estimates from the other approaches in every sample. In particular, the likelihood ratio of the EM estimate compared to that obtained using glm2 exceeded 1.05 in more than 10% of cases, leading to an average error of −0.019 in the estimate of the θ_2 parameter from glm2. The adaptive barrier approach generally found numerically the same estimate as the EM algorithms, although there were four cases in which its estimate of the θ_2 parameter was more than 0.01 away from the MLE, with a correspondingly smaller likelihood.

Speed
In unconstrained problems, Newton's algorithm has quadratic local convergence (Lange 2013, p. 294). In practice, Fisher scoring will usually perform similarly, particularly for large sample sizes when the expected information approximates the observed information matrix well. This is reflected in the fact that the default maximum number of iterations used by glm.fit is 25.
The algorithms underlying the adaptive barrier approach also generally have fast convergence in the unconstrained case. For example, quasi-Newton methods such as BFGS have superlinear local convergence (Broyden, Dennis Jr, and Moré 1973).
In contrast, the EM algorithm typically takes a large number of iterations to converge. Its local convergence is linear, at a rate determined by the proportion of missing information. Because the EM-type algorithms are the main novel methods included in the logbin package, we discuss an approach for improving their speed without compromising their stability.

turboEM algorithms
The turboEM package (Bobb and Varadhan 2018) provides a general interface to several algorithms that aim to accelerate the convergence of EM algorithms. In logbin, the accelerate argument of the logbin function can be used to specify one of these algorithms ("squarem", "pem" or "qn") to speed up either the individual components of the combinatorial EM algorithm or the single parameter-expanded EM algorithm described in Section 3.2.
We have observed that the default parameters generally give good behavior when using each of these acceleration schemes, but they can be altered by the user via the control.method argument to the logbin function; we refer the reader to the turboEM package documentation for specific details. The sections below briefly outline the acceleration algorithms, given an EM mapping θ̂_{c+1} = F(θ̂_c).

SQUAREM
The SQUAREM algorithm (Varadhan and Roland 2008) is a globally convergent iteration scheme for accelerating an EM algorithm. Its basis is a vector generalization of Steffensen's method for scalar fixed-point problems, which Varadhan and Roland call the STEM scheme: θ̂_{c+1} = θ̂_c − α_c r(θ̂_c), where r(θ) = F(θ) − θ is the residual function and α_c is a step length that minimizes some measure of discrepancy between the zeros resulting from two different linear approximations of r.
Analogous to the way in which the Cauchy-Barzilai-Borwein method (Raydan and Svaiter 2002) improves upon the Cauchy method for a linear problem, the SQUAREM algorithm is defined by forcing its recursive error relation to have the same (squared) form when compared to the STEM algorithm. This gives θ̂_{c+1} = θ̂_c − 2α_c r(θ̂_c) + α_c² v(θ̂_c), where v(θ) = F(F(θ)) − 2F(θ) + θ. Given the current estimate θ̂_c, the update requires two applications of the EM step. The SQUAREM scheme can be made globally convergent by using a "back-tracking" strategy to modify the step length α_c so that each iteration is guaranteed to increase the observed-data log-likelihood.
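The scheme is easy to demonstrate on a small fixed-point problem. Below is a Python sketch of one SQUAREM variant (step length α_c = −‖r‖/‖v‖ and no back-tracking, so a simplification of the globally convergent version, and not the turboEM implementation) applied to the classic multinomial EM iteration of Dempster, Laird, and Rubin (1977), whose MLE is θ ≈ 0.6268.

```python
def em_step(theta):
    """EM map for the Dempster, Laird and Rubin (1977) multinomial
    example with counts (125, 18, 20, 34); fixed point ~ 0.6268215."""
    x12 = 125 * theta / (2 + theta)
    return (x12 + 34) / (x12 + 18 + 20 + 34)

def squarem(theta, tol=1e-10, max_cycles=20):
    for _ in range(max_cycles):
        t1 = em_step(theta)
        t2 = em_step(t1)
        r = t1 - theta             # residual r(theta) = F(theta) - theta
        v = t2 - 2 * t1 + theta    # second-difference (curvature) term
        if abs(r) < tol:
            return theta
        alpha = -abs(r) / abs(v)   # "SqS3"-style step length
        theta_sq = theta - 2 * alpha * r + alpha**2 * v
        theta = em_step(theta_sq)  # stabilizing EM step
    return theta
```

On this example a single SQUAREM cycle (two EM steps plus the extrapolation) gets closer to the fixed point than several plain EM iterations.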

Parabolic EM
Parabolic extrapolation of the EM algorithm (Berlinet and Roland 2009) was designed to avoid the problem of "stagnation", in which the parameter estimates move quickly to a neighborhood of the maximum, but final convergence to a stationary point is very slow. Given three past values θ̂_{c−2}, θ̂_{c−1}, θ̂_c, we consider a Bézier curve M(t) controlled by these points.
A grid search along this curve is performed for different t ∈ R, as long as L(M(t)) is increasing. We retain the point M(t) that maximizes the likelihood, and use this to choose a new set of three points with which to define the next Bézier curve. This approach requires two applications of the EM step at each iteration to choose the next control points, as well as potentially multiple evaluations of the likelihood in performing the grid search. It preserves the property of increasing the likelihood at each iteration.

Zhou, Alexander, and Lange (2011) described an acceleration scheme based on Newton's method for finding a root of the residual function r(θ) = F(θ) − θ: θ̂_{c+1} = θ̂_c − (dF(θ̂_c) − I)^{−1} r(θ̂_c), where dF denotes the Jacobian of the EM mapping F.

Quasi-Newton
The quasi-Newton approach replaces dF in the above equation with a low-rank matrix M , which is calculated using q secant approximations derived from previous iterations. Thus the method requires q initial EM updates, at which point quasi-Newton updating can commence based on the q − 1 secant pairs.
Although this algorithm can violate the likelihood ascent property of the EM algorithm, if the quasi-Newton update fails to increase the likelihood at any iteration we can revert to the usual EM update from our current estimate.
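For a scalar parameter this idea reduces to a secant iteration on the residual, with a guard that falls back to a plain EM step whenever the accelerated candidate leaves the parameter space or fails to increase the likelihood. A Python sketch on the classic multinomial EM example of Dempster, Laird, and Rubin (1977) (an illustration only, not the turboEM implementation):

```python
import math

def em_step(theta):
    # EM map for the Dempster et al. (1977) multinomial example,
    # counts (125, 18, 20, 34); fixed point ~ 0.6268215.
    x12 = 125 * theta / (2 + theta)
    return (x12 + 34) / (x12 + 18 + 20 + 34)

def loglik(theta):
    # Observed-data log-likelihood (up to an additive constant).
    return (125 * math.log(2 + theta) + 38 * math.log(1 - theta)
            + 34 * math.log(theta))

def qn_accelerate(theta, tol=1e-9, max_iter=50):
    prev = theta
    theta = em_step(theta)   # one initial EM update to form a secant pair
    for _ in range(max_iter):
        r_prev = em_step(prev) - prev
        r = em_step(theta) - theta
        if abs(r) < tol:
            break
        slope = (r - r_prev) / (theta - prev)   # secant approximation to r'
        cand = theta - r / slope                # Newton step on the residual
        # Guard: revert to the plain EM update if the accelerated step
        # leaves (0, 1) or decreases the likelihood.
        if not (0 < cand < 1) or loglik(cand) < loglik(theta):
            cand = em_step(theta)
        prev, theta = theta, cand
    return theta
```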

Simulations
We compared the speed of the different computational approaches using a simulation study similar to that described by Marschner (2014). In short, we considered a simple log-binomial model as in Equation 1 with an intercept and J − 1 = 5, 10 or 15 binary covariates. The baseline risk was fixed at exp(θ_1) = 0.6 and the relative risk for each covariate was exp(θ_j) = 0.8 or 1, j = 2, . . . , J.
In each scenario, we produced 1000 datasets with sample size n = 500, and used the logbin function to find the MLE via modified Fisher scoring (method = "glm2"), adaptive barrier (method = "ab") and various methods based on the EM algorithm. Although in general the combinatorial EM algorithm (method = "cem") converged to the MLE in a reasonable amount of time for a single dataset, the total time across all simulations became prohibitive, particularly as the number of parameter subspaces to be searched (2^{J−1}) grew. For this reason, we examined only the parameter-expanded version using the method = "em" option. We also applied each of the turboEM acceleration schemes described in Section 5.1 to the parameter-expanded EM algorithm.

Table 2 shows the average time taken for each algorithm to converge to the MLE for each scenario, using a computer with a 3.40 GHz processor. While Fisher scoring was consistently very fast, even as the number of covariates increased, the adaptive barrier algorithm was only marginally slower in all scenarios. Unsurprisingly, the EM approach was substantially slower, with the deficit increasing with the number of covariates. However, the acceleration schemes provided a considerable improvement, with the SQUAREM algorithm proving to be the best.

Parameter space
The parameter space over which the log-binomial likelihood must be maximized depends on the covariate space for which we require the fitted probabilities to lie in [0, 1]. That is, the parameter space is Θ_X = {θ : 0 ≤ p(x; θ) ≤ 1 for all x ∈ X} for some X. As discussed by McCullagh (2005) in the context of the proportional odds model, at a minimum this X must include the observed covariate vectors, but it could be much larger.
In the logbin package, the Fisher scoring and adaptive barrier methods both ensure risks are valid for X_1 = {x_1, . . . , x_n}, that is, the set of observed covariate vectors. This constraint is applied by R's glm and glm2 functions, as well as the GENMOD procedure in SAS and binreg in Stata.
In fact, because Θ_X corresponds to non-positivity constraints on linear combinations of x ∈ X, it can be shown that Θ_X = Θ_{X*}, where X* = Conv(X), the convex hull of X. So methods that maximize over Θ_{X_1} will ensure that fitted risks are valid for covariate vectors that lie within the convex hull of the set of observed covariate vectors, but could produce probabilities greater than 1 for a covariate vector x outside X_1*, even if the individual components of x are each within their observed ranges.

Figure 2 shows two simple scenarios in which this could have an impact. In (a), we consider a model with an intercept and two continuous covariates (x_1, x_2) ∈ [0, 1]^2: The black circles mark 10 observed covariate vectors, and the shaded area shows the convex hull X_1*. Methods that maximize the likelihood over Θ_{X_1} will ensure that the fitted probability will be within [0, 1] for every point in the shaded area. One such valid parameter vector is θ* = (−0.25, −0.5, 0.5), which gives p(x*; θ*) ≤ exp(−0.05) = 0.951 for all x* ∈ X_1*, taking its maximum value along the line x_2* − x_1* = 0.4. However, the point x′ = (0.2, 0.8), marked by a red square, gives a fitted risk of p(x′; θ*) = exp(0.05) = 1.051.
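The numbers in this example are easy to verify directly. A few lines of Python (the helper name is ours; the model is the intercept-plus-two-covariates form described above):

```python
import math

def fitted_risk(theta, x1, x2):
    """p(x; theta) = exp(theta0 + theta1*x1 + theta2*x2)."""
    return math.exp(theta[0] + theta[1] * x1 + theta[2] * x2)

theta_star = (-0.25, -0.5, 0.5)

# Along the line x2 - x1 = 0.4 the linear predictor is constant at
# -0.25 + 0.5 * 0.4 = -0.05, so the risk stays at exp(-0.05) ~ 0.951.
assert abs(fitted_risk(theta_star, 0.1, 0.5) - math.exp(-0.05)) < 1e-12

# The unobserved point (0.2, 0.8) lies outside the convex hull of the
# observed covariate vectors, and its fitted "probability" exceeds 1:
print(fitted_risk(theta_star, 0.2, 0.8))   # exp(0.05) ~ 1.0513
```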
In Figure 2(b), we consider a model with one categorical covariate and one continuous covariate, log p(x; θ) = θ_1 I(x_1 = 1) + θ_2 I(x_1 = 2) + θ_3 x_2, where x_1 ∈ {1, 2}, and the black circles denote observed values. The parameter space Θ_{X_1} ensures that fitted probabilities will be valid for x_2 values along the grey lines, where the range depends on the level of x_1. In such a situation, the interpretation of the exponentiated categorical parameters as relative risks is only relevant at certain levels of x_2. For example, the parameter vector θ* = (−0.5, −0.3, 0.45) is within Θ_{X_1} and we would usually interpret exp(θ_2* − θ_1*) = 1.221 as the relative risk associated with a change from x_1 = 1 to x_1 = 2 while keeping x_2 constant. However, the fitted risk at (1, 0.8) is exp(−0.14) = 0.869, and applying a relative risk greater than 1/0.869 = 1.15 will give a risk greater than 1.
An alternative choice for X is the Cartesian product of the observed ranges of each covariate. That is, X_2 = [min_i x_{i1}, max_i x_{i1}] × ⋯ × [min_i x_{ip}, max_i x_{ip}], where x_{ij} denotes the jth component of the ith observed covariate vector. The corresponding parameter space Θ_{X_2} is maximized over by the EM-type algorithms provided by the logbin package with method = "em" or "cem".
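As a sketch of how the choice of parameter space is selected in practice (the data frame `dat` and covariates `y`, `x1`, `x2` are hypothetical; the method names follow the package interface described in this paper):

```r
library(logbin)

# EM-type fits, which maximize over the Cartesian-product space Theta_{X_2}
fit.em  <- logbin(y ~ x1 + x2, data = dat, method = "em")
fit.cem <- logbin(y ~ x1 + x2, data = dat, method = "cem")

# Standard modified Fisher scoring (via glm2), which works with Theta_{X_1}
fit.glm <- logbin(y ~ x1 + x2, data = dat, method = "glm2")
```

In boundary cases the two families of methods may therefore report different MLEs, because they are maximizing over different parameter spaces.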
In Figure 2, X_2 is surrounded by a black dashed line, and we can see that the unobserved covariate vectors discussed previously (marked by red squares) are inside this covariate space. Thus the parameter vector that produced invalid fitted probabilities for these covariate vectors will not be a member of Θ_{X_2}.
Since X*_1 ⊆ X_2, we have Θ_{X_2} ⊆ Θ_{X*_1}. This means that the maximum of the log-likelihood over Θ_{X_2} cannot exceed its maximum over Θ_{X*_1}; that is, the maximum likelihood estimate θ̂(2) in Θ_{X_2} cannot have a higher likelihood than the maximum likelihood estimate θ̂(1) in Θ_{X*_1}. Clearly, if Θ_{X*_1} = Θ_{X_2} then the MLEs coincide, that is θ̂(1) = θ̂(2). For models that only contain categorical covariates, this will occur if we have at least one observation in each cell of the categorical cross-tabulation.
In general, due to the concavity of the log-likelihood function, we know that if θ̂(2) is in the interior of Θ_{X_2}, then θ̂(1) = θ̂(2). If θ̂(2) is on the boundary of its parameter space Θ_{X_2}, it is possible that θ̂(1) ≠ θ̂(2), in which case the fitted risks under a model with θ = θ̂(1) will be invalid for some x ∈ X_2 \ X*_1. Either parameter space could be more appropriate in particular scenarios. In situations with small sample sizes where the covariates are uncorrelated, it may be more reasonable to consider the parameter space based on X_2, in order to avoid estimates that give invalid risks for covariate patterns that were not observed simply by chance. On the other hand, if the covariates are strongly related to each other, it would not make sense to impose parameter constraints that ensure valid fitted probabilities for implausible covariate combinations.
An extreme example of this occurs if the covariates are structurally dependent on one another, such as in the case where they are separate indicator variables for the levels of a single categorical covariate. If entered this way into the model (rather than as a single factor), the EM-type algorithms will consider them to be separate continuous covariates each with a range of [0, 1], and X 2 will include a covariate vector in which more than one of these covariates is non-zero, overly constraining the parameter space.
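The distinction can be seen in how the design matrix is built. A base-R sketch (the covariate names are hypothetical):

```r
# A single categorical covariate with three levels
g <- factor(c("a", "b", "c", "a", "b"))

# Entered as a factor: one column per non-reference level, and the
# levels are treated as mutually exclusive categories
model.matrix(~ g)

# Entered as separate indicator variables: the EM-type algorithms would
# treat gb and gc as continuous covariates on [0, 1], so X_2 includes
# impossible combinations such as (gb, gc) = (1, 1)
gb <- as.numeric(g == "b")
gc <- as.numeric(g == "c")
model.matrix(~ gb + gc)
```

Entering the covariate as a single factor therefore avoids the over-constrained parameter space described above.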
Another such example occurs when we wish to model a quadratic relationship between a covariate and the log-probability of an event. In a model with a single covariate x_1, the typical way to achieve this is to define x_2 = x_1^2 and include both in the linear predictor: log p(x_1; θ) = θ_0 + θ_1 x_1 + θ_2 x_1^2. However, neither choice of parameter space is ideal in this case. Figure 3(a) shows each covariate space for a hypothetical set of covariate vectors. X_2, surrounded by a dashed black line, is most clearly inadequate here: it includes covariate pairs such as (1, 0) that do not lie on the quadratic curve. The resulting parameter space Θ_{X_2} would overly constrain our parameter vector to ensure that fitted risks are valid for these impossible combinations.
The parameter space Θ_{X*_1} is a vast improvement, but may still be problematic. Because the curve x_2 = x_1^2 is itself convex, the convex hull X*_1 of the observed covariate vectors, shown by the shaded area, is bounded by straight lines that lie wholly within the curve. This means that the covariate vector for any unobserved x_1 value lies outside X*_1, and so Θ_{X*_1} will contain parameter vectors that produce fitted risks within [0, 1] at the observed x_1 values but can have invalid fitted risks at intermediate values. An example of such a curve, with θ* = (−0.1, 3.25, −17.5), is shown in Figure 3(b): this parameter vector is in Θ_{X*_1} since p(x_1; θ*) ≤ 1 for all observed x_1, but gives p(x_1; θ*) > 1 for x_1 ∈ (0.04, 0.15).
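The interval over which this curve exceeds 1 can be checked numerically; a base-R sketch for the quadratic model log p(x_1; θ) = θ_0 + θ_1 x_1 + θ_2 x_1^2:

```r
theta <- c(-0.1, 3.25, -17.5)
p <- function(x) exp(theta[1] + theta[2] * x + theta[3] * x^2)

# p(x) > 1 exactly where the quadratic exponent is positive;
# polyroot() takes coefficients in increasing order of power
roots <- sort(Re(polyroot(theta)))
round(roots, 2)  # endpoints of the invalid interval: 0.04 and 0.15

p(0.10)  # inside (0.04, 0.15): exp(0.05), greater than 1
p(0.30)  # outside the interval: exp(-0.7), a valid probability
```

This confirms the invalid region (0.04, 0.15) quoted above, even though the curve is a valid probability at points further from its maximum.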
The purpose of this discussion is to highlight the point made by McCullagh (2005) that alternative parameter spaces may be of interest, beyond the space Θ_{X_1} implemented in standard software. Through its "em" and "cem" methods, logbin provides the facility to impose alternative parameter space restrictions, while also allowing standard parameter space assumptions via its other methods. This flexibility is a useful feature of the package, and to our knowledge is not available in any other package.

Semi-parametric extensions
As described by Donoghoe and Marschner (2015), it is straightforward to extend the EM-type algorithms for log-binomial regression to include J flexible semi-parametric components via B-splines. Specifically, the model in Equation 1 becomes

log p(x, z; θ) = θ_0 + θ_1 x_1 + ⋯ + θ_p x_p + f_1(z_1) + ⋯ + f_J(z_J),

and each unknown f_j is parameterized by constraining it to belong to the space of B-splines:

f_j(z_j) = Σ_k β_{jk} B_{jk}(z_j),

where the B_{jk} are the B-spline basis functions defined by a set of knots that determine the continuity of the curve at chosen turning points (Ramsay 1988). These can be calculated recursively (De Boor 1978), and are implemented in R via the function splines::splineDesign.
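For illustration, the B-spline basis underlying such terms can be generated directly with the splines package (the knot placement here is arbitrary):

```r
library(splines)

# Cubic B-spline basis (order 4) with one internal knot at 0.5;
# the boundary knots are repeated to full multiplicity to anchor
# the basis on [0, 1]
knots <- c(rep(0, 4), 0.5, rep(1, 4))
x <- seq(0, 1, by = 0.1)
basis <- splineDesign(knots, x, ord = 4)

dim(basis)      # 11 evaluation points by 5 basis functions
rowSums(basis)  # each row sums to 1 (the basis is a partition of unity)
```

Each column of `basis` corresponds to one coefficient β_{jk} in the parameterization of f_j.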
Semi-parametric log-binomial models can be fitted in logbin by using the logbin.smooth function. We illustrate its use with an augmented version of the heart attack data described in Section 3.1. In their supplementary material, Marschner and Gillett (2012) provide this data, which has patient age in years rather than in broad categories.
B-spline terms can be included in the logbin.smooth formula by using B(), which has two additional arguments. knot.range is used to specify the number of internal turning points in the curve, placed at evenly spaced quantiles of the observed covariate values. Figure 4 shows the fitted risk of death by age for models of increasing complexity, from a linear age term up to a semi-parametric model with 10 internal turning points.
An information criterion can be used to select the "best" model, and following the suggestion of Donoghoe and Marschner (2015) we choose the model with the lowest small-sample corrected Akaike information criterion, AIC_c (Burnham and Anderson 2002). In this case, the smooth model with one internal turning point has the lowest AIC_c. This entire process is automated by logbin.smooth if knot.range is provided as a vector, with the parameter-expanded EM algorithm and SQUAREM acceleration scheme used to speed up convergence:
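The small-sample correction is a simple adjustment to the usual AIC; a base-R sketch of the Burnham and Anderson (2002) formula, where k is the number of parameters and n the sample size:

```r
# AICc = AIC + 2k(k + 1)/(n - k - 1), with AIC = -2*loglik + 2k
aicc <- function(loglik, k, n) {
  aic <- -2 * loglik + 2 * k
  aic + 2 * k * (k + 1) / (n - k - 1)
}

# The correction term vanishes as n grows relative to k
aicc(loglik = -100, k = 5, n = 20)    # noticeably above the plain AIC of 210
aicc(loglik = -100, k = 5, n = 1000)  # very close to the plain AIC of 210
```

Because the correction penalizes extra parameters more heavily at small n, it discourages over-fitting when comparing smooth models with different numbers of knots.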

Figure 4: Fitted risk of death by age for individuals from Western countries with <2 hour treatment delay after mild (light blue), moderate (blue) and severe (dark blue) heart attacks, based on models with increasing complexity for the semi-parametric effect of age. The vertical grey lines show the location of the turning points for the B-splines.
R> m.bestsmooth <- logbin.smooth(cbind(deaths, patients - deaths) ~
+    factor(severity) + factor(onset) + factor(region) +
+    B(age, knot.range = 0:10), data = heart2, method = "em",
+    accelerate = "squarem")

Alternatively, the user can specify the knots argument to the B function, providing a vector that contains the locations of the internal turning points. A plot method for class 'logbin.smooth' has been defined so that fitted semi-parametric models can be easily visualized. The at argument must provide a data.frame, in which each row determines the levels of the other covariates in the model at which the fitted risks will be calculated.

Figure 5: Fitted risk of death by age for individuals from Western countries with <2 hour treatment delay after mild (light blue), moderate (blue) and severe (dark blue) heart attacks, based on a model with an isotonic relationship between age and risk of death.
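A call along the following lines would plot the fitted age curve at chosen levels of the other covariates. This is an illustrative sketch only: the particular factor levels in the data.frame are assumptions, not values taken from the dataset.

```r
# Hypothetical use of the plot method for class 'logbin.smooth':
# one curve per severity level, holding onset and region fixed
plot(m.bestsmooth,
     at = data.frame(severity = 1:3, onset = 1, region = 1))
```

Each row of the data.frame passed to `at` produces one fitted curve over the range of the smooth covariate.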
The mono argument can be used in a call to logbin.smooth in the same way as it is for logbin. Constraining the B-spline parameters to be monotonically non-decreasing is equivalent to using I-spline basis functions with non-positivity constraints on each coefficient, which will produce a monotonically non-decreasing smooth curve (Tutz and Leitenstorfer 2007; Donoghoe and Marschner 2015). Alternatively, non-smooth monotonic relationships can be included by specifying an Iso() term in the formula for logbin.smooth. In this case, the unknown f_j is parameterized as a step function that may increase at each unique observed covariate value. The result from fitting such a model to our example is shown in Figure 5.
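A sketch of such a call, following the Iso() syntax just described and reusing the formula from the earlier example:

```r
# Hypothetical fit: non-smooth monotonic (isotonic) effect of age,
# replacing the B-spline term with an Iso() term
m.iso <- logbin.smooth(cbind(deaths, patients - deaths) ~
    factor(severity) + factor(onset) + factor(region) + Iso(age),
    data = heart2, method = "em")
```

The resulting step function is monotonically non-decreasing in age, with possible jumps at each unique observed age.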

Discussion
Relative risk regression is an important alternative to logistic regression for binomial outcomes, as it provides an effect measure that is arguably easier to interpret than the odds ratio. However, standard methods for maximum likelihood estimation can encounter difficulties with the log-binomial model, signalling the need for specialized statistical software to reliably fit such models.
The logbin package presented in this paper provides an interface in R to algorithms that perform maximum likelihood estimation for log-binomial models. Specifically, EM-type algorithms described by Marschner and Gillett (2012) and Donoghoe and Marschner (2016) and an adaptive barrier approach based on the method of Lange (1994) and suggested by Lumley et al. (2006) and de Andrade and Carabin (2011) all provide stable convergence to the MLE, as demonstrated in Section 4.2. The usual modified Fisher scoring algorithm, as implemented in the glm function or glm2 package (Marschner 2018) can also be accessed via logbin.
The speed of the adaptive barrier approach is comparable to that of the Fisher scoring algorithm. The combinatorial EM algorithm can be considerably slower in large models, but this can be substantially improved by applying the underlying EM algorithm to an overparameterized model (Donoghoe and Marschner 2016). Either can be further accelerated by using one of the algorithms implemented in the turboEM package (Bobb and Varadhan 2018). As shown by simulations in Section 5.2, a combination of these ideas can make the loss in speed negligible.
Some care must be taken to identify the parameter space that is being considered by the maximization routine, as this can cause differences to occur in the MLE found by competing approaches in boundary cases. Finally, the EM-type approaches can be used to include semi-parametric terms, providing additional flexibility that would usually require the use of additional packages such as gam (Hastie 2018).