dglars: An R Package to Estimate Sparse Generalized Linear Models

dglars is a publicly available R package that implements the method proposed in Augugliaro, Mineo, and Wit (2013), developed to study the sparse structure of a generalized linear model. This method, called dgLARS, is based on a differential geometrical extension of the least angle regression method proposed in Efron, Hastie, Johnstone, and Tibshirani (2004). The core of the dglars package consists of two algorithms implemented in Fortran 90 to efficiently compute the solution curve: a predictor-corrector algorithm, proposed in Augugliaro et al. (2013), and a cyclic coordinate descent algorithm, proposed in Augugliaro, Mineo, and Wit (2012). The latter algorithm, as shown here, is significantly faster than the predictor-corrector algorithm. For comparison purposes, we have implemented both algorithms.


Introduction
Nowadays, high-dimensional datasets, in which the number of predictors, say p, is larger than the sample size N , are becoming more and more common. Routine monitoring, e.g., of web traffic or satellite surveyance, and high-throughput screening methods in biology and chemistry, raise the tantalizing prospect of being able to explain and predict a wide range of phenomena. What makes high-dimensional statistical inference possible is the assumption that the regression function lies in a low-dimensional manifold (Fan and Lv 2010). In a regression setting, this means that the coefficient parameter vector has a sparse structure. Modern statistical methods developed to cope with this problem are usually based on the idea of using a penalty function to estimate a sparse solution curve embedded in the parameter space and then to find the point that represents the best compromise between sparsity and fit of the model. Recent statistical literature has a great number of contributions devoted to this problem: some important examples are the L 1 -penalty method, originally proposed in Tibshirani (1996), the SCAD method (Fan and Li 2001), the Dantzig selector (Candes and Tao 2007), which was extended to generalized linear models (GLMs) in James and Radchenko (2009), and the MC+ penalty function introduced in Zhang (2010), among others.
Differently from the methods cited above, Augugliaro et al. (2013) propose a new approach based on the differential geometrical representation of a GLM. The method does not require an explicit penalty function, because it introduces sparsity more directly: it defines the sparsest continuous solution path with a pointwise maximum likelihood estimate (MLE). It has been called differential geometric LARS (dgLARS) because it generalizes the geometrical ideas underlying least angle regression (LARS) proposed in Efron et al. (2004). As emphasized in Augugliaro et al. (2013), LARS is not only "an important contribution to statistical computing", as suggested in Madigan and Ridgeway (2004), but is a proper likelihood method in its own right: it can be generalized to any model and its success does not depend on the arbitrary match of the constraint and the objective function, as is the case in penalized inference methods. In particular, using the differential geometric characterization of the classical signed Rao's score test statistic, dgLARS gains important theoretical properties that are not shared by other methods.
From a computational point of view, the dgLARS method consists essentially in computing the implicitly defined solution curve. In Augugliaro et al. (2013) this problem is satisfactorily solved by using a predictor-corrector (PC) algorithm, that however has the drawback of becoming intractable when working with thousands of predictors, since in the predictor step of this algorithm the number of arithmetic operations scales as the cube of the number of predictors. To overcome this problem, Augugliaro et al. (2012) propose a much more efficient cyclic coordinate descend (CCD) algorithm, which connects the original dgLARS problem with an iterative reweighted least squares (IRLS) algorithm.
In this paper we present the R (R Core Team 2014) dglars package (Augugliaro 2014), that implements both the algorithms to compute the solution curve implicitly defined by dgLARS. In this version of the package only the Poisson and binomial family are implemented, other families will be added in future versions of the package. The implemented functions return an object inheriting from an S3 class for which suitable methods have been implemented. The package dglars is available under the general public license (GPL ≥ 2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=dglars. The paper is organized as follows. In Section 2 we give some methodological background of the proposed method. More specifically, in Section 2.1 we introduce the dgLARS method by giving some essential clues to the theory underlying a GLM from a differential geometric point of view. In Section 2.2 we give a description of the implemented algorithms and in Section 2.3 we describe the notion of generalized degrees of freedom. Section 3 is devoted to the description of the main functions implemented in the dglars package. In Section 4 we compare the behavior of the proposed method with the L 1 -penalized MLE by simulation studies and in Section 5 we use the functions implemented in our package to study two real data sets. Finally, in Section 6 we draw some conclusions.

Methodological background
In this section we give a rough overview of the method. The reader interested in more of the differential geometric details of this method is referred to Augugliaro et al. (2013). The dgLARS method defines a continuous solution path for a GLM, with on one extreme of the path the MLE and on the other extreme the intercept-only estimate. The aim of the method is to define the most efficient model -in likelihood terms -that uses the fewest variables. In order to describe the method, we introduce the GLM and its dgLARS solution path in Section 2.1. The computational aspects are given in Section 2.2.

Description of the dgLARS method
Let Y be a scalar random variable with probability density function belonging to the exponential family, i.e., p(y; θ i , φ) = exp{(yθ − b(θ))/a(φ) + c(y, φ)}, where θ ∈ Θ ⊆ R is the canonical parameter, φ ∈ Φ ⊆ R + the dispersion parameter and a(·), b(·) and c(·, ·) are specific given functions. The expected value of Y is related to the canonical parameter by the mean value mapping, namely E(Y) = µ = ∂b(θ)/∂θ, similarly, the variance of Y is related to its expected value by the identity VAR(Y) = a(φ)V (µ), where V (µ) is called variance function. In the following we shall assume that the dispersion parameter is fixed, in this case, without loss of generality we can set φ = 1. Let X be the p-dimensional vector of random predictors. A GLM is based on the assumption that the conditional expected value of Y given X = x is specified by the link function h(·), namely h(E(Y|X = x)) = β 0 + p m=1 x m β m .
This means that the probability density function of Y is defined with respect to the mean value µ(β) = h −1 (x β), where x 0 = 1. Under the assumption that we are working with N independent copies of the pair (Y, X ) the conditional probability density function can be written as p(y; µ(β)) = N i=1 p(y i ; µ i (β)). In the following of this paper we shall use (β; y) = log p(y; µ(β)) as notation for the conditional log-likelihood function.
The mth score function, measuring the likelihood increase in the mth direction is given as To measure the scale of this change, relative to the size of the mth predictor, we scale the score function by the square root of the conditional Fisher information, i.e., to obtain the conditional Rao's score statistic, .
At the MLE β =β, the Rao's score statistics are equal to zero, since the derivatives of the likelihood are zero. On the other hand, for the minimal model with only the MLE of the interceptβ 0 = (β 0 , 0, . . . , 0), we define γ max to be the largest absolute value of the Rao's score statistic atβ 0 , i.e., γ max = max m=1,...,p |r u m (β 0 )|.
This value points out the best instantaneous "normalized" contribution to the likelihood of a single variable. This variable would make an excellent candidate for being included in the model. With these definitions in hand, we can now define the dgLARS estimator.
Here we have defined the dgLARS estimator in terms of the Rao's score statistic. Augugliaro et al. (2013) show that Definition 1 follows naturally from a differential geometric interpretation of a GLM that allows us to generalize the LARS method introduced in Efron et al. (2004). The dgLARS method generalizes the geometric description of LARS and is based on the following simple differential geometric identity where ρ m (β) is the angle between ∂ m (β; Y) and the tangent residual vector r β (Y) while r β (Y) p(µ(β)) is the length of this vector -which, crucially, does not depend on variable m. Using Figure 1 the dgLARS method can be described in the following way. First the method selects the predictor, say X a 1 , whose basis vector ∂ a 1 (β(γ max ); Y) has the smallest angle with the tangent residual vector, and includes it in the active set A(γ (1) ) = {a 0 , a 1 }, where a 0 stands for the intercept and γ (1) = γ max . The solution curveβ(γ) = (β a 0 (γ),β a 1 (γ), 0, . . . , 0) is chosen in such a way that the tangent residual vector is always orthogonal to the basis ∂ a 0 (β(γ); Y), while the direction of the curveβ(γ) is defined by the projection of the tangent residual vector onto the basis vector ∂ a 1 (β(γ); Y). The curveβ(γ) continues as defined above until γ = γ (2) , for which there exists a new predictor, say X a 2 , that satisfies the equiangularity condition, namely ρ a 1 (β(γ (2) )) = ρ a 2 (β(γ (2) )).
At this point X a 2 is included in A(γ (2) ) and the curveβ(γ) = (β a 0 (γ), β a 1 (γ), β a 2 (γ), 0, . . . , 0) continues, such that the tangent residual vector is always orthogonal to the basis vector ∂ a 0 (β(γ); Y) and with direction defined by the tangent vector that bisects the angle between ∂ a 1 (β(γ); Y) and ∂ a 2 (β(γ); Y), as shown on the right side of Figure 1. Efron et al. (2004) show that the LASSO solution curve can be obtained by a simple modification of the LARS method. Letβ(γ) be the solution of a GLM penalized using the L 1 -penalty function, then it is easy to show that the sign of any non-zero coefficient has to agree with the sign of the score function, namely sign(∂ m (β(γ); y)) = sign(β m (γ)). Figure 1: Differential geometrical description of the dgLARS method for a GLM with two covariates; in the left side the first predictor X a 1 is found and included in the active set; in the right side the generalized equiangularity condition (Equation 3) is satisfied for X a 1 and X a 2 .
When this condition is violated the corresponding predictor is removed from the active set. Like the original LARS method, also the dgLARS method can be easily modified to compute a differential geometric extension of the LASSO solution curve, called the dgLASSO solution curve.
Definition 2 Let γ ∈ [0, γ max ] be a fixed value. The dgLASSO estimator of a GLM is given by Condition 1 and Condition 2, such that also the following restrictions on the signs of the non-zero coefficients are satisfied sign(r u m (β(γ))) = sign(β m (γ))), ∀ m ∈ A(γ).
The dgLASSO method computes the dgLASSO solution curve in the same way as the dgLARS method, but it removes a predictor from the active set when the sign of the corresponding estimate is not in agreement with the sign of the Rao's score test statistic. Like for the LASSO and SCAD estimator also for dgLARS and dgLASSO estimator the problem of how to estimate the dispersion parameter φ is still an open question and theoretical results are not available. More details about the dgLASSO method will be given in the following section.

Computational aspects
From a computational point of view, the problem of how to estimate the dgLARS solution curve can be formalized in this way: we want to find a decreasing sequence of p values of the tuning parameter γ, say such that for any γ ∈ (γ (k+1) ; γ (k) ] the dgLARS estimator satisfies the following conditions: where A c (γ) is the complement of the active set. The new predictor is included in the active set at γ = γ (k+1) where the following condition is satisfied When we want to estimate the dgLASSO solution curve it is also necessary to add the following condition ∃ a i ∈ A(γ (k+1) ) :β a i (γ (k+1) ) = 0.
Since for any a i ∈ A(γ) the sign of the r u a i (β(γ)) does not change, it is easy to see that Equation 5 identifies a value of the tuning parameter such that for any γ < γ (k+1) we have that sign(r u a i (β(γ))) = sign(β a i (γ)) which means that the predictor a i must be removed from the active set. Augugliaro et al. (2013) propose to use a predictor-corrector (PC) method to compute the dgLARS/dgLASSO solution curve. From a computational point of view, the main problem of using the PC algorithm is related to the number of arithmetic operations needed to compute the Euler predictor that is involved in the computation and that requires the inversion of an adequate Jacobian matrix. In the actual implementation of the PC algorithm, the prediction step has complexity O(|A(γ)| 3 ) but it can be improved to O(|A(γ)| 2.376 ) using the Coppersmith-Winograd algorithm. In any case, these results make such an algorithm cumbersome in a high-dimensional setting. To overcome this problem, Augugliaro et al. (2012) propose a cyclic coordinate descent (CCD) method, which can be defined relating the original dgLARS problem to a sequence of penalized weighted least squares problems. In a high-dimensional setting, this method is computationally more efficient than the predictorcorrector algorithm to compute the dgLARS solution curve. In the following we describe both algorithms that we have implemented in the dglars package.

Predictor-corrector (PC) algorithm
In order to make this paper self-contained, we briefly review the PC algorithm originally proposed to compute the dgLARS/dgLASSO solution curve. For more details the interested reader is referred to Augugliaro et al. (2013). The basic idea underlying the PC algorithm (Allgower and Georg 2003) is to trace a curve implicitly defined by a system of non-linear equations. The curve is obtained by generating a sequence of points satisfying a chosen tolerance criterion.
Let γ ∈ (γ (k+1) ; γ (k) ] be a fixed value of the tuning parameter, such that k predictors are included in the active set, i.e., A(γ) = {a 1 , a 2 , . . . , a k }. The corresponding point of the solution curve will be denoted byβ A (γ) = (β a 0 (γ),β a 1 (γ), . . . ,β a k (γ)) . Using Equation 4, the solution path satisfies the relationship |r u a 1 (β A (γ))| = |r u a 2 (β A (γ))| = . . . = |r u a k (β A (γ))|, which means thatβ A (γ) solves the following system of k + 1 non-linear equations where v a i = sign(r u a i (β A (γ (k) ))). In order to simplify our notation, we defineφ , . . . , r u a k (β A (γ))) and v A = (0, v a 1 , . . . , v a k ) . By differentiatingφ A (γ) with respect to γ, we can locally approximate the solution curve at γ − ∆γ by the following expression, where ∆γ ∈ [0; γ − γ (k+1) ] and ∂ϕ A (γ)/∂β A (γ) is the Jacobian matrix of the vector function ϕ A (γ) evaluated atβ A (γ). To approximate the point that lies on the dgLARS solution curve, in Augugliaro et al. (2013) the step size ∆γ is approximated by the following expression where "min + " indicates that the minimum is taken over only positive components within each choice of a c h . Approximation 8 generalizes the step size proposed for the LARS method ) and approximates the step size corresponding to the inclusion of a new predictor in the active set. Equation 7 with step size given by Equation 8 is used for the predictor step of the PC algorithm. In the corrector step,β A (γ − ∆γ in ) is used as starting point for the Newton-Raphson algorithm that is used to solve System 6.
When we want to estimate the dgLASSO solution curve it is necessary to adjust the step size given by Equation 8 in order to consider Equation 5. Using Equation 7, it is easy to see that the first sign change will, approximately, occur at where d(γ (k) ) = (∂ϕ A (γ (k) )/∂β A (γ (k) )) −1 v A . The predictor step of the PC algorithm developed to estimate the dgLASSO solution curve is based on Equation 7 with step size ∆γ = min{∆γ, ∆γ out }. A schematic overview of the PC algorithm is given in Table 1.
Since the optimal step size is based on a local approximation, we also include an exclusion step for removing incorrectly included variables in the model. When an incorrect variable is included in the model after the corrector step, we have that there exists a non-active variable such that the absolute value of the corresponding Rao's score test statistic is greater than γ.
Checking this is trivial. To overcome this drawback, the "optimal" step size from the previous step is reduced using a contractor factor cf .

Cyclic coordinate descent (CCD) algorithm
In order to simplify our notation, in this subsection we drop the dependence of the estimate on the tuning parameter γ. Assume thatβ is the point of the solution curve defined at γ ∈ (γ (k+1) ; γ (k) ] by the dgLARS method. Considering System 6, it is easy to see thatβ can be equivalently defined as solution of the following system of non-linear equations where v a i = sign(β a i ). When we consider a predictor that is not included in the active set, Suppose that we want to compute the value of the solution curve at γ = γ − ∆γ, where ∆γ > 0. If ∆γ is small enough, we can approximate the gradient of the log-likelihood function as follows where . Approximation 10 is the basic step to relate the Fisher scoring algorithm with the iterative reweighted least squares (IRLS) algorithm. On the other hand, if ∆γ is small enough, the Fisher information matrix can be considered approximately constant. Then, as a consequence of the Karush-Kuhn-Tucker conditions, the value of the solution curve at γ can be approximated by the solution of the following constrained optimization problem, The previous expression suggests that, at the step t 1 ,β(γ ) can be updated by the following iterative rulê Problem 11 can be efficiently solved using the same approach as proposed in Friedman, Hastie, and Tibshirani (2010). Suppose that we have estimated β n , with n = m; in this case a coordinate descent step for solving Problem 11 is obtained by the solution of the following problem is the partial residual for fitting β m . It is easy to see that, at the step t 2 , the coordinate-wise updating rule has the following form where S(x; λ) = sign(x)(|x| − λ) + is the soft-thresholding operator. The pseudo-code of the proposed CCD algorithm is reported in Table 2. In order to reduce the computational burden, in our algorithm we initially cycle only through the predictors that are included in the current active set A and then, when the convergence is met, we check if the active set is changed.

Generalized degrees of freedom
The behavior of the dgLARS method is closely related to the way of selecting the optimal value of the tuning parameter γ. For the LASSO estimator, Zou, Hastie, and Tibshirani (2007) developed an adaptive model selection criterion to select the regularization parameter based on a rigorous definition of the degrees of freedom.
In order to define an adaptive model selection criterion, Augugliaro et al. (2013) propose the notion of generalized degrees of freedom (GDF) for the dgLARS method. This notion comes from a differential geometric approach to the covariance penalty theory (Efron 2004), which defines the degrees of freedom for a general modeling procedure. When it is possible to compute the MLE of the considered GLM, the authors also propose a possible estimator of the GDF that is implemented in the package by the function gdf() discussed in Section 3.2. When we work with a logistic regression model, it has been shown by a simulation study that the proposed estimator is better than the number of non-zero estimated coefficients. On the other hand, simulation studies show that for a Poisson regression model the number of nonzero estimated coefficients can be considered a good estimator of the GDF. See Augugliaro Step Algorithm

The dglars package
The dglars package is an R package containing a collection of tools related to the dgLARS method. In the following of this section we describe the functions available with this package.

Description of the dglars() and dglars.fit() functions
The main functions of this package are dglars() and dglars.fit(). The first one dglars(formula, family = c("binomial", "poisson"), data, subset, contrast = NULL, control = list()) is a wrapper function implemented to handle the formula interface usually used in R to create the N × p-dimensional design matrix X and the N -dimensional response vector y. These objects, together with the arguments family and control, are passed to the function dglars.fit() dglars.fit(X, y, family = c("binomial", "poisson"), control = list()) which is the R function used to compute the dgLARS/dgLASSO solution curve. Although in R the formula interface is the more familiar way to specify the linear predictor in a GLM, in a high dimensional setting this management of the involved model variables can be inefficient from a computational point of view. For this reason we recommend using the function dglars.fit() directly for simulation studies and real applications in the cases where p is very large.
The argument control is a named list of control parameters passed to the two algorithms described in Section 2.2. In order to reduce the computational time needed to compute the dgLARS/dgLASSO solution curve, the two algorithms are written in Fortran 90 and handled by two specific R wrapper functions called dglars_pc() and dglars_ccd(). These two functions are not user-intended and for this reason they are not described in this paper. By default, control argument is defined as follows control = list(algorithm = "pc", method = "dgLASSO", np = NULL, g0 = NULL, eps = 1.0e-05, nv = NULL, dg_max = 0, nNR = 50, NReps = 1.0e-06, ncrct = 50, cf = 0.5, nccd = 1.0e+05) Using the control parameter algorithm it is possible to select the algorithm used to fit the dgLARS solution curve, i.e., setting algorithm = "pc" the default PC algorithm is used, whereas the CCD algorithm is used when algorithm = "ccd" is selected. The group of control parameters method, np, g0 and eps, is composed of those elements that are shared by the two algorithms. The argument method is used to choose between the dgLASSO solution curve (method = "dgLASSO") and the dgLARS solution curve (method = "dgLARS"), while np is used to define the maximum number of points on the solution curve. As seen in Section 2.2, since the PC algorithm can compute the step size by a local approximation, the number of effective points of the solution curve can be significantly smaller than np. In contrast, the CCD algorithm fits the dgLARS solution curve using a multiplicative grid of np values of the tuning parameter. The g0 control parameter is used to define the smallest value of the tuning parameter. By default this parameter is set to 1.0e-04 when p > N and to 0.05 otherwise. Finally, eps is used to assess the convergence of the two algorithms. When the PC algorithm is used, eps is also used to identify a predictor that will be included in the active set, namely when the absolute value of the corresponding Rao's score test statistic belongs to [γ−eps; γ+eps].
The group composed by nv, dg_max, nNR, NReps, ncrct and cf contains the control parameters specific to the PC algorithm. nv is used to define the maximum number of predictors included in the model, while dg_max is used to fix the step size. When setting dg_max = 0 (default) the PC algorithm uses the local approximation to compute the γ value to evaluate the inclusion or exclusion of a predictor from the active set. The control parameters nNR and NReps are used to set the number of steps and to define the convergence of the Newton-Raphson algorithm used in the corrector step. When the Newton-Raphson algorithm does not converge or when there exists a predictor such that the absolute value of the corresponding Rao's score test statistic is greater than γ+eps, the step size is reduced by the contractor factor cf, i.e., ∆γ = ∆γ·cf, and then the corrector step is repeated. The control parameter ncrct sets the maximum number of attempts for the corrector step.
Finally, the control parameter nccd is used to define the maximum number of steps of the CCD algorithm.

An example of a simulated logistic regression model
To gain more insight on how to use the dglars() function we consider a simulated data set. First we load the dglars package in the R session by the code

R> class(out_dglasso_pc)
[1] "dglars" which is a list containing a matrix named beta used to store the estimated points of the solution curve, the vector dev of deviances, the vector g containing the sequence of the used values of the tuning parameter and the vector df containing the number of non-zero estimated coefficients including the intercept.
By default, dglars() computes the dgLASSO solution curve; the dgLARS solution curve can be computed using the control parameter method, i.e., R> out_dglars_pc <-dglars(y~., family = "binomial", data = dataset, + control = list(method = "dgLARS")) The print() method for the 'dglars' object can be used to print the basic information, i.e., the call that produced the 'dglars' object with a five-column table showing the names of predictors included or excluded from the active set, the sequence of γ values used to compute the dgLARS solution curve and the corresponding deviance and fraction of explained deviance, respectively. The number of non-zero estimated coefficients is also reported.
By printing the 'dglars' object out_dglasso_pc for our simulated data set, we can see that the dgLASSO method first finds the true predictors and then includes the other false predictors.

R> out_dglasso_pc
Call: dglars(formula = y~., family = "binomial", data = dataset) To be more specific, at γ (1) = 3.6372, i.e., the starting value of the Rao's score test statistic, the predictor X.2 has the smallest angle with the tangent residual vector and is then included in the active set. The predictor X.1 is included in the active set at γ (2) = 3.2187, this means that for any γ ∈ [γ (2) ; γ (1) ) only the intercept term and the regression coefficientβ 2 (γ) are different from zero and the number of non-zero estimates is equal to 2. The third predictor is included at γ (3) = 0.9319, which means that for any γ ∈ [γ (3) ; γ (2) ) the number of nonzero estimates is equal to 3, i.e the intercept term and the regression coefficientsβ 1 (γ) and β 2 (γ). This process continues until γ is equal to the control argument g0. The estimated coefficient path can be extracted using the coef() method available for the 'dglars' object. For example, with the following R code we can see the sequence of the first ten values of the tuning parameter γ with the corresponding estimated points.  More information about the estimated sequence of models can be obtained using the summary() method for the 'dglars' object, i.e., summary.dglars(object, k = c("BIC", "AIC"), complexity = c("df", "gdf"), digits = max(3, getOption("digits") -3), ...) where object is a fitted 'dglars' object. To choose the best solution point, the summary() method computes the measure of goodness of fit residual deviance + k × complexity, where k is a non-negative value used to weight the complexity of the fitted model. Using the argument complexity the user can choose between two different definitions of complexity of a fitted model, i.e., the well known number of estimated non-zero coefficients (complexity = "df") and the notion of generalized degrees of freedom (complexity = "gdf"). More details about the notion of generalized degrees of freedom are given in Sections 2.3 and 3.2. Setting k = "BIC" and complexity = "df", which are the default values, Definition 12 shows that the summary method for 'dglars' objects reports the Bayesian information criterion (BIC; Schwarz 1978). The Akaike information criterion (AIC; Akaike 1973) can be easily computed setting k = "AIC" and complexity = "df". The user can also define own measures of goodness of fit setting k as any non-negative value.
The following R code shows that the output printed by the summary() method is divided in two different sections.
R> summary(out_dglasso_pc, k = "BIC", complexity = "df") Call: dglars(formula = y~., family = "binomial", data = dataset) The first section completes the basic information printed out by the default print() method showing the BIC. The ranking of the estimated models obtained by this measure of goodness of fit is also shown and the corresponding best model is identified by an arrow on the right. The second section shows the formula of the identified best model and the corresponding estimated coefficients. From the previous output we can see that the best model identified by the BIC criterion is that one with only the predictors X.1 and X.2, which are the two true predictors that influence the response variable in our simulated model.
The following R code shows the results given by the summary() method when the AIC is used as measure of goodness of fit.
R> summary(out_dglasso_pc, k = "AIC", complexity = "df") Call: dglars(formula = y~., family = "binomial", data = dataset) In this case we can see that the best model contains all the considered predictors, even if the second best model is the same model identified by using the BIC criterion.
The user can plot the output from the dglars() function using the plot() method for the 'dglars' object, i.e., plot(x, k = c("BIC", "AIC"), complexity = c("df", "gdf"), g.gof = NULL, ...) where x is a fitted 'dglars' object while the arguments k and complexity are equal to the arguments of the summary method for 'dglars' objects. With the following R code R> out_dglasso_pc2 <-dglars(y~., family = "binomial", data = dataset, + control = list(dg_max = 0.1)) R> par(mfrow = c(2, 3)) R> plot(out_dglasso_pc2, k = "BIC", complexity = "df") R> plot(out_dglasso_pc2, k = "AIC", complexity = "df") we first reduce the step size setting the control parameter dg_max = 0.1 and then we plot the output from the dglars() function. As we have done for the summary() method, we first use the BIC criterion to select the best model and then we use the AIC criterion. As shown in Figure 2, when we fit the dgLARS solution curve using the PC algorithm, the plot() method produces three different plots, namely the plots showing the sequence of the BIC (first row) or AIC (second row) as function of γ and the plots showing the paths of the coefficients and of the Rao's score test statistics. The last plot is not available when the dgLARS solution curve is fitted using the CCD algorithm. The values of the tuning parameter corresponding to a change in the active set are identified by vertical dashed gray lines, while the optimal value of the tuning parameter γ, according to the BIC or AIC, is identified by a red dashed line. Using the argument g.gof the user can specify a value of the tuning parameter to identify different best models. Finally the argument ... can be used to specify additional graphical parameters.

Comparison between PC and CCD algorithm
As we have seen in the previous section, the dglars package implements two different algorithms to compute the dgLARS solution curve., i.e., a PC method and a CCD algorithm. Although the two algorithms compute the same solution curve the results can look different. To gain more insight we consider the following R code R> set.seed(321) R> n <-100 R> mu <-binomial()$linkinv(eta) R> y <-rbinom(n, 1, mu) R> dataset <-data.frame(y = y, X = X) where we simulate a logistic regression model with sample size equal to 100 and 5 predictors; only the first two predictors affect the response variable. By the following code R> out_dglasso_pc <-dglars(y~., family = "binomial", data = dataset) R> out_dglasso_ccd <-dglars(y~., family = "binomial", data = dataset, + control = list(algorithm = "ccd")) we estimate the dgLASSO solution curve using the PC and the CCD algorithm, respectively. The left panel of Figure 3, obtained by the following code, shows the path of the BIC values computed using the PC algorithm (black dashed line) and the CCD algorithm (red dashed line).
In order to better understand the effects of the number of predictors on the run times of the two algorithms, we use a simple simulation study based on a logistic regression model with sample size equal to N = (100, 200, 300) and p = (25,50,100,200,400,600,800,1000). The design matrix is obtained as follows (a) X 1 , X 2 , . . . , X p sampled from a N (0; Σ), where the diagonal and off-diagonal elements of Σ are 1 and 0, respectively; (b) the same of (a) but with COR(X i ; X j ) = ρ |i−j| with ρ = 0.9.
To simulate the binary response vector we use a model with intercept and with only three predictors; the corresponding non-zero coefficients were chosen equal to 2. In Table 3 we report the average CPU times in seconds coming from 100 simulation runs. All timings reported were carried out on a personal computer with AMD Athlon II X2 250 dual-core processor. We clearly see that the proposed CCD algorithm is always significantly faster than the PC algorithm. The difference between the two algorithms is greater when we consider the scenario with high correlation among the predictors. In this case the PC algorithm dramatically increases its computational time, while the CCD algorithm slows down just a little in this scenario. In Figure 4 we show the average CPU times of the considered algorithms for N = 300 for both scenarios (a) and (b).

Description of the gdf() function
As we have discussed in Section 2.3, the behavior of the dgLARS method, as a variable selection method, is closely related to the way we define the complexity of a model. When we have a Poisson regression model, the number of non-zero estimated coefficients can be considered a good approximation of the GDF, but when we have a logistic regression model, Augugliaro et al. (2013) propose another possible estimator of the GDF. In the same paper the authors show how the proposed GDF has a better behavior than using the number of non-zero estimated coefficients.
The proposed estimator is implemented by the function gdf() with argument
As shown in Augugliaro et al. (2013), the proposed estimator of the GDF requires the dgLARS estimates and the MLE of the GLM specified using all the predictors which, in some cases, cannot be computed. For this reason we prefer to use the number of non-zero estimated coefficients as default value for the argument complexity.

An example of use for a simulated logistic regression model: Continued
To gain more insight about the main differences between the proposed GDF estimator and the number of non-zero estimated coefficients, we consider the simulated logistic regression model used in Section 3.1. The following R code is used to show the first ten estimates given by the proposed estimator (gdf) and by the number of non-zero estimated coefficients (df).  Table 3: Average CPU times in seconds to compute the solution curve using the CCD algorithm and the PC algorithm. Standard deviations are reported in parentheses.
R> out_dglasso_pc <-dglars(y~., family = "binomial", data = dataset, + control = list(g0 = 0)) R> df <-out_dglasso_pc$df R> bh <-coef(out_dglasso_pc) R> gdfh <-gdf(out_dglasso_pc) R> complexity <-rbind(df, gdfh) R> rownames(complexity) <-c("df", "gdf") We can see that the main difference is that the estimates of the gdf are always smaller than df. This behavior finds a theoretical justification by the results given in Efron (1986), in other words, the number of non-zero estimated coefficients is equal to the gdf only when we are evaluating the complexity of a model on a point that is estimated by using the maximum likelihood method. When we set the argument complexity = "gdf", the summary() method uses the proposed estimator of the GDF to define the BIC or AIC values.
Coming back to our simulated logistic regression model, the following R code R> summary(out_dglasso_pc, k = "AIC", complexity = "gdf") Call: dglars(formula = y~., family = "binomial", data = dataset, control = list(g0 = 0)) shows that the AIC, defined using the estimates of the GDF, selects the logistic regression model with the true predictors. This result is in contrast to what we have previously obtained, since in the first application of the summary() method, the AIC identified a logistic regression model with four predictors. The true logistic regression model is also identified if we use the BIC based on the GDF. This result is not reported for sake of brevity. However, we point out that BIC defined using the estimator of GDF tends usually to select sparser models than AIC defined also with the estimator of GDF.

Description of the cvdglars() and cvdglars.fit() functions
In the previous section we have seen that we can use the summary method for 'dglars' objects to select the optimal value of the tuning parameter γ by using an information criterion, such as the BIC or AIC. When the user has enough data, another possible way to estimate the solution point of the dgLARS solution curve is by k-fold cross-validation (Hastie, Tibshirani, and Friedman 2009).
In the dglars package, the k-fold cross-validation is implemented by the functions cvdglars() and cvdglars.fit(). The first one can be used with arguments cvdglars(formula, family = c("binomial", "poisson"), data, contrast = NULL, subset, control = list()) As we have done for dglars(), this function is a wrapper function of the more efficient function cvdglars.fit(). It is implemented only to handle the classical R formula interface but it can be computational inefficient in a high-dimensional setting. For this reason, in this setting we suggest to use the function cvdglars.fit() with arguments cvdglars.fit(X, y, family = c("binomial", "poisson"), control = list()) where X is the design matrix of dimension N × p, y is the N -dimensional response vector and family is the error distribution used in the model. Like for the dglars.fit() function, control is a named list of control parameters with the following elements control = list(algorithm = "pc", method = "dgLASSO", nfold = 10, foldid = NULL, ng = 100, np = NULL, g0 = NULL, eps = 1.0e-05, nv = NULL, dg_max = 0, nNR = 50, NReps = 1.0e-06, ncrct = 50, cf = 0.5, nccd = 1.0e+05) The control parameter nfold is used to set the number of folds used for the cross-validation. By default cvdglars.fit() uses a ten-fold cross-validation and the prediction error is measured by the deviance. When nfold is equal to the sample size, the solution point is estimated by the leave-one-out cross-validation. The optional control parameter foldid can be used to pass to the algorithm a N -dimensional vector of non-negative integers used to identify the folds.
From a computational point of view, cvdglars.fit() splits the data into nfold parts and estimates the dgLARS solution curve using nfold−1 parts of the data. To estimate the prediction behavior of the fitted model, the algorithm uses the remaining part of the data and a sequence of ng equispaced points of the estimated solution curve to compute the deviance of the model. These steps are repeated nfold times and then the mean deviance is computed for each of the ng points. The optimal value of the tuning parameter, denoted byγ, is defined as minimum of the mean cross-validation deviance. Finally, using the whole data set, cvdglars.fit() runs the adequate subroutine written in Fortran 90 to compute the dgLARS solution curve with control parameter g0 fixed toγ.
The remaining control parameters of the cvdglars.fit() function have the same meaning as those seen in Section 3.1 for the dglars.fit() function.

An example of use for a simulated Poisson regression model
To gain more insight about the use of the cvdglars.fit() function, we have simulated a data set from a Poisson regression model. Like for the previous example, we assume that only the first two predictors influence the response variable. The corresponding R code is R> set.seed(321) R> n <-100 R> p <-100 R> s <-2 R> X <-matrix(rnorm(n * p), n, p) R> bs <-rep(0.5, s) R> Xs <-X[, 1:s] R> eta <-drop(0.5 + drop(Xs %*% bs)) R> mu <-poisson()$linkinv(eta) R> y <-rpois(n, mu) R> dataset <-data.frame(y = y, X = X) R> out_cvdglasso_pc <-cvdglars(y~., family = "poisson", data = dataset, + control = list(g0 = 0.1, eps = 1.0e-03)) Also in this case the values of the control parameters g0 and eps are chosen in order to remove the instability coming from the link function in a high-dimensional setting. The function cvdglars.fit() returns an object of class 'cvdglars' R> class(out_cvdglasso_pc) [1] "cvdglars" Figure 5: Plot of the 'cvdglars' object for the simulated data set.
namely a list containing, among the other, the p + 1 dimensional vector of the coefficients estimated by k-fold cross-validation, called beta, the ng dimensional vector of the mean deviance, called dev_m, and the estimate of the tuning parameter γ scaled on the interval [0; 1], called g_hat.
As for the 'dglars' object, specific method functions are developed to simplify the study of the output coming from cvdglars.fit(). For example, the print() and coef() methods can be used to print out the basic information stored in the 'cvdglars' object and to extract the estimated coefficients. Finally, with the plot() method for the 'cvdglars' object it is possible to plot the mean cross-validation deviance as function of the tuning parameter. The optimal value of the tuning parameter is visualized by a vertical red dashed line and the number of estimated non-zero coefficients is also reported. The confidence band for the mean deviance is computed. Figure 5 shows the plot for the results of the model based on our simulated data set. We see that the 10-fold cross-validation method allows us to select a Poisson regression model with only five predictors.

Simulation studies
In this section we compare by simulation studies the behavior of the dgLARS method with the L 1 -penalized GLM. To compute the LASSO solution curve we use the R package glmnet (Friedman, Hastie, and Tibshirani 2013), which implements the cyclical coordinate descent algorithm proposed in Friedman et al. (2010) and the R package glmpath (Park and Hastie 2013) which uses the PC algorithm developed by Park and Hastie (2007) to estimate the solution curve.
Our simulation studies are based on a logistic regression model with four scenarios corresponding to four different configurations of the design matrix. The details of the considered scenarios are the following: (a) X 1 , X 2 , . . . , X p sampled from a N (0; Σ), where the diagonal and off-diagonal elements of Σ are 1 and 0, respectively; (b) the same as (a) but with COR(X i ; X j ) = ρ |i−j| ; in our study we use ρ = 0.9; (c) the same as (a) with Σ a block diagonal matrix, namely, Σ = diag(Σ 1 , Σ 2 , . . . , Σ k ) where Σ i is a 10 × 10 matrix with diagonal elements equal to 1 and off-diagonal elements equal to 0.5; (d) in this scenario we use as design matrix the gene expression matrix provided by Alon et al. (1999). The sample size is equal to 66 and the number of genes p is equal to 2000.
For scenarios (a), (b) and (c) the sample size is equal to one hundred while three different values of p are used, i.e., p ∈ (100, 500, 1000). In these scenarios only the first five predictors are used to simulate the binary response variable while in scenario (d) the response variable is simulated using five genes randomly selected. The intercept term is equal to one and the non-zero coefficients are equal to two. For each scenario we simulated 500 data sets and a ten-fold cross-validation deviance is used to select the tuning parameter for the L 1 -penalized MLE and the parameter γ of the dgLARS method. To make the results comparable, for each data set the same folds were used to determine the value of the tuning parameters of dgLARS/dgLASSO, glmnet and glmpath, respectively.
In Tables 4 and 5 we report the summary measures used to evaluate the behavior of the considered methods i.e., the total number of variables included in the final model, the false discovery rate, the false positive rate, the false negative rate and the deviance, which was computed using an independent test set.
These results show that the behavior of the dgLARS method is closely related with the covariance structure among the predictors. When we consider scenario (a), dgLARS and dgLASSO exhibit a similar behavior to the L 1 -penalized MLE. When the relevant predictors are highly correlated, scenario (b), or when there exists an unknown group structure of the predictors, scenario (c), the results show that the dgLASSO method selects sparser models  Table 4: Results from the simulation studies (a) and (b); for each scenario we report the mean number of variables included in the final model (Size), the mean false discovery rate (FDR), the mean false positive rate (FPR), the mean false negative rate (FNR), and the mean residual deviance (Dev). Standard errors are in parentheses. than the L 1 -penalized GLM. Also when a design matrix based on real data is used to simulate the response variable (scenario (d)) the dgLASSO method tends to select sparser models.
To end this section, we recall that in general dgLARS is based on a theory completely different from the L 1 -penalized MLE implemented in the glmnet and glmpath packages since dgLARS does not use explicitly a penalized function (see Augugliaro et al. 2013, for more details).  Table 5: Results from the simulation studies (c) and (d); for each scenario we report the mean number of variables included in the final model (Size), the mean false discovery rate (FDR), the mean false positive rate (FPR), the mean false negative rate (FNR), and the mean residual deviance (Dev). Standard errors are in parentheses.

Breast cancer data set
In this section we use the functions available in the dglars package to study the sparse structure of a logistic regression model applied to a subset of the breast cancer gene deletion/amplification data set obtained by John Bartlett at the Royal Infirmary, Glasgow (Wit and McClure 2004). The aim of the study is to identify which genes play a crucial role in the severity of the disease, defined as whether or not the patient dies as a result of breast cancer. The data set contains 52 samples, 29 of which are labeled as deceased due to breast cancer. For each sample, 287 gene deletion/amplification measurements are available. A few missing values are imputed using the method proposed by Troyanskaya et al. (2001).
To study the considered data set, we first load the data in the R session R> data("breast", package = "dglars") In this example, we estimate the optimal value of the tuning parameter by the 10-fold crossvalidation method by using the cvdglars() function, i.e., Figure 6: Plot of the 10-fold cross-validation deviance computed for the breast cancer data set. Vertical red dashed line shows that the dgLASSO method finds a logistic regression model with only four non-zero estimated coefficients.

R> plot(breast_dglasso)
The print() output shows that the dgLASSO method selects a logistic regression model with a high level of sparsity since only three genes are found to influence the response variable. Even if the model selected by the 10-fold cross-validation deviance has few variables, we try to see if it is possible to further reduce the number of genes included in the active set. Then, we first use the function dglars(), with design matrix defined using only the genes previously identified, and then we analyze the obtained sequence of logistic regression models with the summary() method. Since the MLE of the logistic regression model specified by the subset of identified genes can be computed, we set the argument complexity = "gdf" to use the estimator of the generalized degrees of freedom previously described. Using the formula of the model estimated by 10-fold cross-validation deviance, i.e.,

Duke breast cancer data set
This data set contains 46 of the original 49 patients with primary breast cancer that compose the Duke breast cancer data set studied in West et al. (2001). The considered samples can be classified in two groups of the same size, namely the group with estrogen receptor-positive (ER+) and the group with estrogen receptor-negative (ER−). For each sample the gene expression measure of 7129 genes is measured by human HuGeneFL microarray. Aim of this study is to identify a small subset of genes that can be used as prognostic or predictive markers.
As done for the analysis of the previous data set, we first load the data in the R session and then we estimate the optimal value of the tuning parameter by means of the 10-fold cross validation, i.e., R> data("duke", package = "dglars") R> duke_cvdglasso <-cvdglars(y~., family = "binomial", data = duke, + control = list(g0 = 0.01)) The value of the control argument g0 was chosen after a preliminary study. By the R code

R> plot(duke_dglasso)
we can see the plot of the 10-fold cross-validation deviance (Figure 7) which shows that the dgLASSO method estimates a logistic regression model with ten parameters, i.e., the intercept term plus the regression coefficients corresponding to a subset of nine genes. To map each manufacturer ID with the corresponding gene name we used the R package hu6800.db (Carlson 2013). However, for three genes this mapping is not available. Table 6 shows the basic information about the remaining six identified genes.
The identified set contains several genes that play a central role in breast caner. In a recent study Kristiansen et al. (2003) show that CD24 is a new prognostic marker in breast cancer. Manufacturer ID Symbol Name J00116 s at COL2A1 Collagen, type II, alpha 1 L22524 s at MMP-7 Matrix metallopeptidase 7 (matrilysin, uterine) L33930 s at CD24 CD24 molecule V00572 at PGK1 Phosphoglycerate kinase 1 X03635 at ESR1 Estrogen receptor 1 X17059 s at NAT1 N-acetyltransferase 1 (arylamine N-acetyltransferase) Table 6: Subset of genes identified by the 10-fold cross-validation deviance.
The authors confirm the results previously obtained in Fogel et al. (1999) who were the first to describe CD24 expression in breast cancer in an immunohistochemistry study. Previous works have focused on the role of MMP-7 in breast cancer (Bertram and Hass 2008;Bucan, Reimers, Choi, Eddy, and Vogt 2010), while the over-expression of PGK1 has been observed in breast and pancreatic carcinoma and multi-drug resistant ovarian cancer (Duan et al. 2002;Zhang et al. 2005). Several microarray and proteomic studies indicate that NAT1 is highly expressed in estrogen receptor positive breast cancer (Wakefield et al. 2008). Human NAT1 is also a target of cisplatin, one of the most important chemotherapeutic compounds used in the management of various human malignancies (Ragunathan et al. 2009). Finally, in recent years ESR1 is shown to be frequently expressed in breast cancer (Holst et al. 2007). Several studies showed statistically significant associations between ESR1 polymorphisms and breast cancer, see for example Gold et al. (2004) and Wang et al. (2007).
As a consequence of the complete data separation, the MLE of the logistic regression model specified using only the subset of genes previously identified cannot be computed (Albert and Anderson 1984), then we use the BIC, with complexity defined by the number of non-zero estimated coefficients, to try to see if it is possible to identify a smaller subset. For this reason, we first estimate the dgLASSO solution curve by the code R> duke_dglasso <-dglars(duke_cvdglasso$formula_cv, family = "binomial", + data = duke) and then, from the summary() output

Summary
In this paper we have described the R package dglars. This package implements the differential geometric extension of the LARS method proposed in Augugliaro et al. (2013) and called dgLARS. The core of the package are two functions implementing a predictor-corrector algorithm and a cyclic coordinate descent algorithm to compute the dgLARS solution curve. In order to implement these two algorithms in an efficient way, the main code is written in Fortran 90 and handled in R by specific wrapper functions. This package also provides a function to estimate the generalized degrees of freedom proposed in Augugliaro et al. (2013). When we have the MLEs of the GLM specified by all the considered predictors, it is possible to use a more accurate measure of goodness of fit than the number of non-zero estimated coefficients to select the optimal points of the solution curve. The use of these functions is shown by means of simulated data sets and application to real data sets. The output of the functions are presented in a way that is easy to interpret for people familiar with standard lm(), glm() or gam() output. The package dglars is available under the general public license (GPL ≥ 2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=dglars. In future versions of the package we are going to implement other distribution families such as gamma and inverse Gaussian distributions.