penalized: A MATLAB Toolbox for Fitting Generalized Linear Models with Penalties

penalized is a flexible, extensible, and efficient MATLAB toolbox for penalized maximum likelihood. penalized allows you to fit a generalized linear model (Gaussian, logistic, Poisson, or multinomial) using any of ten provided penalties, or none. The toolbox can be extended by creating new maximum likelihood models or new penalties. The toolbox also includes routines for cross-validation and plotting.


Introduction
Consider a linear regression model y = Xβ + e, where y is a vector of n observations, X is a matrix of covariates, β is a vector of p coefficients, and e is a vector of n random errors. Fitting this model involves two tasks: model selection, where a subset of the coefficients from β are included in the model and the rest are excluded (that is, set to zero); and estimation, where the values of the included coefficients are determined. Both of these tasks can be accomplished in one step by adding a penalty term to the regression. For example, Mallows' Cp (Mallows 1973), the AIC (Akaike 1973), and the BIC (Schwarz 1978) can all be written as a penalized least-squares problem

    (1/2n) ‖y − Xβ‖₂² + λ‖β‖₀,    (1)

where ‖β‖₀ is the number of nonzero coefficients (the L0 norm) of β and λ is a value which may depend on the dispersion in the model and the number of observations. Model selection and estimation are accomplished by finding the β which minimizes Equation 1.
Unfortunately, the L0 norm is non-convex, so it is quite hard to find the minimum of Equation 1. Partly for this reason, it has become popular to use an L1-penalized regression (the LASSO; Tibshirani 1996) for model selection and estimation:

    P_λ(β) = (1/2n) ‖y − Xβ‖₂² + λ‖β‖₁,    (2)

where ‖β‖₁ = Σᵢ |βᵢ| is the L1 norm of β. The value of the penalized regression P_λ(β) is not meaningful in the same way that Mallows' Cp or the AIC is, so the ideal penalty weight λ is usually determined by minimizing some other criterion, such as the cross-validation error.
The L1 penalty can also be added to maximum likelihood estimation; in this case we maximize

    P_λ(β) = (1/n) log ℓ(y; β) − λ‖β‖₁.    (3)
(For linear regression, this corresponds to log ℓ(y; β) = −(1/2)‖y − Xβ‖₂².) While the LASSO has plenty of desirable characteristics, it has some potentially undesirable ones too. The LASSO selects coefficients by shrinking all coefficients towards zero, so the model with the correct signs and zeros for the coefficients will tend to underfit the data. Any attempt to mitigate the shrinkage by reducing the penalty weight λ will lead the LASSO to add in extra irrelevant coefficients. These problems have led to the proposal of a number of alternative penalties (for example, SCAD (Fan and Li 2001), MC+ (Zhang 2010; Mazumder, Friedman, and Hastie 2011), FLASH (Radchenko and James 2011), and Relaxo (Meinshausen 2007)), together with their own algorithms and software implementations. Some of these implementations only work with linear regression, or are difficult for the end-user to obtain and use.
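The shrink-and-select behaviour of the L1 penalty is easiest to see in one dimension, where the penalized least-squares solution is the soft-thresholding operator. The following is a minimal Python sketch, illustrative only and not part of the toolbox:

```python
def soft_threshold(z, t):
    # One-dimensional LASSO: argmin_b 0.5*(z - b)**2 + t*abs(b)
    # is sign(z) * max(abs(z) - t, 0), the soft-thresholding operator.
    if abs(z) <= t:
        return 0.0
    return z - t if z > 0 else z + t

# Small inputs are set exactly to zero (selection); large inputs are
# shrunk towards zero by t (the source of the LASSO's underfitting).
print([soft_threshold(z, 1.0) for z in [-3.0, -0.5, 0.2, 2.5]])
# -> [-2.0, 0.0, 0.0, 1.5]
```

Raising t zeros more coefficients but also shrinks the survivors further, which is exactly the trade-off described above.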
It would be useful to be able to carry out penalized model fitting using any penalty, applied to any likelihood, as the problem demands rather than as software availability permits. The penalized toolbox is a set of MATLAB (The MathWorks Inc. 2007) functions which allows you to do this. The toolbox contains functions for penalized maximum likelihood, objects which represent common generalized linear models (least-squares, logistic, multinomial, and poisson), a wide selection of penalty functions, a cross-validation routine, and some plotting functions. Any penalty can be combined with any generalized linear model, and new models and penalties can be added to the toolbox. This paper describes the toolbox: how to use it, how it works, and how the likelihood models and penalty functions are designed. After a tutorial walkthrough of the toolbox, which shows the sorts of analyses that can be carried out, I outline the maximization algorithm used in the toolbox. Next, I describe the likelihood models and penalty functions, especially how they interface to the core maximization algorithm. Finally, the toolbox is compared with the R (R Core Team 2016) package glmnet (Friedman, Hastie, and Tibshirani 2010).

A tutorial
The penalized toolbox is loosely modelled on glmnet (Friedman et al. 2010), so some of this tutorial may appear familiar to users of that R package. To start, change to the directory containing the toolbox and type

    install_penalized

This will add the necessary paths to your MATLAB path. penalized consists of pure MATLAB code and uses no MEX files. (You can uninstall_penalized later if you want.) This tutorial can be run interactively by typing the command

    echodemo jsstutorial

To begin, assume we have a set of 0-1 observations y together with a covariate matrix X which can be modelled by a logistic regression. To create a logistic model, type

    model = glm_logistic(y, X, "nointercept")

(Note that all models in the toolbox are prefixed with glm_. The argument "nointercept" in the third position specifies that glm_logistic should not add an intercept to the design matrix; otherwise an intercept is added. The intercept is never penalized.) To perform a LASSO (L1-penalized) fit of this model, type

    fit = penalized(model, @p_lasso)

The first argument to the function penalized is the model being fitted. The second argument is a penalty function handle. (The @function syntax is like a C function pointer.) Here the penalty function is p_lasso, one of many penalty functions available in the toolbox. All provided penalty functions are prefixed with p_. The function penalized then fits an L1-penalized logistic regression over a range of penalty weights λ.
The flexibility of the toolbox comes from the modularization implied in this function call: any model (which conforms to the calling conventions) and any penalty function (likewise) can be used.
The returned value, fit, is a structure. The field fit.lambda contains the sequence of λ values used for penalization. These are automatically selected by the function, or manually controlled by the options lambdamax, lambdaminratio, and nlambda, which may be passed to penalized. (For more information on how penalized automatically selects the values of λ, and how this selection process can be influenced or overridden, type help options.) The field fit.beta contains the coefficients: fit.beta(:, i) gives the fitted coefficients β for the penalty weight fit.lambda(i). If the model has an intercept, it is stored in fit.beta(1, i). The fitted coefficients can be plotted against λ by typing

    plot_penalized(fit)

yielding the graph shown in Figure 1A. If the model has an intercept, it is omitted from the plot.
One way of selecting the best value of λ is to pick that which minimizes the Akaike information criterion (AIC; Akaike 1973). The AIC can be computed and plotted against λ by typing

    AIC = goodness_of_fit("aic", fit);
    semilogx(fit.lambda, AIC)

The function goodness_of_fit supplied in the package computes the goodness of fit from the fit structure. Instead of "aic", you can also pass "bic" (i.e., the Bayesian information criterion), "deviance", or "log-likelihood". However, the calculation of the AIC and BIC assumes that the degrees of freedom are equal to the number of nonzero parameters, which is only known to be true for the LASSO penalty (Zou, Hastie, and Tibshirani 2007).
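Under the convention just stated (degrees of freedom equal to the number of nonzero coefficients), the AIC is a one-liner. A Python sketch of the calculation, for illustration only:

```python
def aic(loglik, beta):
    # AIC = -2*loglik + 2*df, taking df as the number of nonzero
    # coefficients (exact for the LASSO; Zou, Hastie, Tibshirani 2007).
    df = sum(1 for b in beta if b != 0)
    return -2.0 * loglik + 2.0 * df
```

For example, a fit with log-likelihood -10 and two nonzero coefficients has AIC = 24.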
Alternatively, the best value of λ may be selected by cross-validation. A 5-fold cross-validation of the penalized logistic model can be carried out by typing

    cv = cv_penalized(model, @p_lasso, "folds", 5)

The option "folds" gives the number of folds; otherwise the arguments to cv_penalized are the same as those to penalized. (Note that 5 is the default number of folds, so the option was not necessary here.) The return structure cv contains the results of the cross-validation. The cross-validation error can be plotted against λ by typing

plot_cv_penalized(cv)
The results are shown in Figure 1B. The minimum cross-validation error is obtained at λ = 0.0142, which is recorded in cv.minlambda, and plotted as a vertical dashed line.
Consider now a linear regression model with observations y, covariate matrix X, and an intercept. We create this by typing

    model2 = glm_gaussian(y, X);

Since the "nointercept" option is omitted, an intercept is added to the model. Instead of the LASSO penalty, we might try the clipped LASSO (Antoniadis and Fan 2001), as implemented in the toolbox function p_clipso. The clipped LASSO penalty function is defined as clipso(x) = min(|x|, α). To use the clipped LASSO with a value of α equal to 0.3, we type

    fit = penalized(model2, @p_clipso, "alpha", 0.3, "standardize", true)

Additional parameters are added as name-value options in the call to penalized. Here there are two. The "alpha" option specifies the α value for the clipso penalty. The "standardize" option effectively scales the columns of X to have equal norms. This is useful when the penalty function is not scale invariant (which is unfortunately the case for many penalty functions). Standardization is reversed when the fitted coefficients are calculated. The results of this fit can again be plotted using plot_penalized(fit), yielding the graph in Figure 2A. The intercept is omitted from the plot.
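For reference, the clipped LASSO penalty value just defined is tiny in code. An illustrative Python sketch (the toolbox's p_clipso also supplies derivatives and subdifferentials, as described later):

```python
def clipso(b, alpha):
    # Clipped LASSO penalty: grows like |b| near zero, then is clipped
    # at the constant alpha, so large coefficients are not shrunk further.
    return min(abs(b), alpha)
```

Coefficients below α in magnitude are penalized like the LASSO; beyond α the penalty is flat, which removes the LASSO's shrinkage of large coefficients.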
Instead of a single value for α, a range of values can be efficiently fitted in a single function call by setting "alpha" to an array:

    fit = penalized(model2, @p_clipso, "alpha", [inf, 1, 0.5, 0.3, 0.05])

(An infinite value for α makes p_clipso behave the same as p_lasso.) The returned fit structure holds the coefficients for all values of the penalty weight λ and all the values of α specified in the function call. That is, fit.beta(:, i, j) holds the fitted coefficients β for the penalty weight fit.lambda(i) and the penalty parameter fit.alpha(j), where in this case fit.alpha will be equal to [inf, 1, 0.5, 0.3, 0.05].
We can plot all these coefficients using plot_penalized(fit). This interactively displays the fitted coefficients against λ for each value of α, pausing after each plot. A specific value or values of α can be picked out for plotting by typing

    plot_penalized(fit, "slice", 5)

This plots the fitted coefficients β against λ for the 5th α value (the slice), which is equal to 0.05. The result is shown in Figure 2B. The intercept is omitted from this plot.

We can pick the best values of λ and α simultaneously by cross-validation. Typing

    cv = cv_penalized(model2, @p_clipso, "alpha", [inf, 1, 0.5, 0.3, 0.05], ...
        "folds", 3)

will do a three-fold cross-validation of the model over a range of λ and the specified values of α. The results can be plotted with plot_cv_penalized(cv), as shown in Figure 3A. In this case, the plot superimposes the cross-validation error versus λ curve for each value of α on the same axes. Error bars are omitted from this plot. The minimum cross-validation error is attained for α = 0.05 and λ = 0.1895 (which are stored in cv.minalpha and cv.minlambda).
Cross-validation errors for specific values of α (with error bars) can also be plotted. For example,

    plot_cv_penalized(cv, "slice", [1, cv.minalpha], "errorbars", "on")

will plot the cross-validation error for the 1st value of α (∞) and for the best value of α, saved in cv.minalpha. This is shown in Figure 3B. In this case the best cross-validation error for α = 0.05 is not distinguishable from the best one produced by the LASSO (α = ∞).
If you have a custom penalty function mypenalty that requires a parameter γ, it can be used by typing

    fit = penalized(model2, @mypenalty, "gamma", 2)

and the results can be plotted as before. The function mypenalty must follow the calling conventions outlined in Section 5 below. Cross-validation works just as before:

    cv = cv_penalized(model2, @mypenalty, "gamma", 1:5, "folds", 3)

Here the cross-validation is done for a range of γ values from 1 to 5. In this case, the field cv.mingamma says which value of γ yielded the smallest cross-validation error. Similarly, if you have developed a custom statistical model mymodel, it can be fitted with any penalty by typing

    fit = penalized(mymodel, @mypenalty, "gamma", 2)

Cross-validation and plotting work as before. The custom model mymodel must be a MATLAB object with the methods outlined in Section 4.1 below.

The maximization algorithm
The algorithm used in the penalized toolbox is Fisher scoring over an active set with orthant projection (Schmidt, Fung, and Rosales 2009; Park and Hastie 2007). It is described here simply to indicate how the likelihood model and the penalty function interact with it, as there is nothing original in this implementation.
We consider only penalties π(β) that are a weighted sum of coordinate penalty functions,

    π(β) = Σᵢ₌₁ᵖ wᵢ πᵢ(βᵢ),    (4)

where πᵢ is the penalty function for the ith coefficient βᵢ, and wᵢ are the penalty weights. We wish to maximize the penalized likelihood

    P_λ(β) = (1/n) log ℓ(y; β) − λπ(β).    (5)

Define the score s(β) as the gradient of the likelihood, s(β) = d log ℓ(y; β)/dβ. Define π′(β) = dπ(β)/dβ as the gradient of the penalty function (with elements wᵢ dπᵢ(βᵢ)/dβᵢ). Define H(β) as the Hessian of the likelihood log ℓ(y; β) with respect to β, and Π as the Hessian of the penalty π(β); Π is simply a diagonal matrix with entries wᵢ d²πᵢ(βᵢ)/dβᵢ². The Newton iteration

    β^(t+1) = β^t − [(1/n)H(β^t) − λΠ(β^t)]⁻¹ [(1/n)s(β^t) − λπ′(β^t)]    (6)

is used to find the next estimate β^(t+1) from the existing estimate β^t. In Fisher scoring, the Hessian H is replaced by the negative of the Fisher information matrix, −F, to give

    β^(t+1) = β^t + [(1/n)F(β^t) + λΠ(β^t)]⁻¹ [(1/n)s(β^t) − λπ′(β^t)].

This iteration is rapidly convergent when β^t is close to the maximum, but might be poor far from it. Problematic convergence can be improved using a Levenberg-Marquardt adjustment, adding in the term ω diag((1/n)F), to yield

    β^(t+1) = β^t + [(1/n)F + λΠ + ω diag((1/n)F)]⁻¹ [(1/n)s − λπ′].    (7)

The Levenberg-Marquardt weight ω is iteratively adjusted using a trust-region algorithm.
If there is some improvement in the penalized log-likelihood P λ (β t+1 ) − P λ (β t ), then ω is decreased or kept the same, depending on the size of the improvement. However, if there is no improvement in the penalized log likelihood, then ω is repeatedly increased until some improvement occurs. If, after many attempts with increasing ω, there is still no improvement in the penalized log-likelihood, then the iteration halts.
However, (7) assumes that the penalty functions are differentiable. Most interesting penalties are not, so the algorithm must be changed to deal with this. An active set algorithm is used, as follows.
The vector β is divided into two parts: the "active" vector β_A, where differentiability of the penalty holds, and the "inactive" vector β_∼A, where it does not. The algorithm assumes that the only singularity of the coordinate penalty functions πᵢ is at zero, so the active vector is simply all the non-zero entries of β, and the inactive vector is all the zero entries.
Each modified iteration step proceeds in two parts:

(A) Gradient ascent on the inactive vector: Any elements βᵢ in β_∼A for which

    (1/n) s(βᵢ) ∉ λwᵢ [πᵢ⁻(0), πᵢ⁺(0)]

(where s(βᵢ) is the ith element of the score and [πᵢ⁻(0), πᵢ⁺(0)] is the subdifferential of the penalty at zero) do not satisfy the first-order optimality conditions for the maximum, and so are candidates for addition to the active vector. Theoretically, all elements βᵢ in β_∼A for which the above holds could be added to the active vector, but in practice the algorithm is more stable if, at every iteration, only the few elements which violate the optimality conditions the most are added (Perkins, Lacker, and Theiler 2003). For those few elements, we set βᵢ = βᵢ + ε s(βᵢ), for some small ε. These elements then enter the active vector.
(B) Fisher update on the active vector: Holding the inactive vector fixed, use the update (7) restricted to the active vector, where F_A and Π_A are the information and penalty matrices restricted to elements of the active vector. This step may only be valid when β_A^t and β_A^(t+1) are in the same orthant, where twice-differentiability of the penalty does hold. Thus, β_A^(t+1) is projected onto the closest point in the orthant containing β_A^t (Andrew and Gao 2007), giving us the update rule

    β_A^(t+1) = Proj_A( β_A^t + [(1/n)F_A + λΠ_A + ω diag((1/n)F_A)]⁻¹ [(1/n)s_A − λπ′_A] ),

where Proj_A is the orthant projection operator, which simply zeros all elements of β_A^(t+1) which have a different sign from β_A^t. Elements of the active set which are zeroed then join the inactive set. If the projected step does not improve the penalized likelihood P_λ(β), then the trust-region procedure shrinks the step until it does.
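The orthant projection operator Proj_A is simple to state in code. A minimal Python sketch, assuming NumPy and comparing signs against the current iterate:

```python
import numpy as np

def orthant_project(beta_new, beta_old):
    # Zero every element of the proposed step whose sign differs from
    # the current iterate, keeping the update inside the orthant where
    # the penalty is twice differentiable.
    out = beta_new.copy()
    out[np.sign(out) != np.sign(beta_old)] = 0.0
    return out
```

An element that crosses zero during the Fisher step is set exactly to zero, which is how coefficients leave the active set.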

Warm starts
Because we are interested in the value of the coefficients β over a range of penalty weights λ, a continuation process is used to estimate them efficiently: the best fitted value β*_k for a given penalty weight λ_k is used as the initial value of the iterations with the next penalty weight λ_{k+1}.
A continuation process is also used when there is a range of penalty parameters to fit (e.g., α in clipso). Writing β*_{k,j} for the best fitted coefficients for penalty weight λ_k and the jth value of the penalty parameter, the initial value β⁰_{k,j} for the kth value of λ and the jth value of the penalty parameter can be either:

1. The best fit from the previous value of the penalty parameter, β*_{k,j−1}. This option is selected by typing penalized(..., "warmstart", "relax"). You can think of this as fixing a particular weight λ, then "relaxing" the penalty parameter from the first value specified to the last. This is the default option.
2. The best fit from the previous value of the penalty weight λ, β*_{k−1,j}. This option is selected by typing penalized(..., "warmstart", "lambda"). You can think of this as fixing a particular penalty parameter at j, then fitting successive values of the weight λ.
3. Use both warm starts above and pick the best. This option is selected by typing penalized(..., "warmstart", "both"). Obviously this takes twice as long.
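The continuation schemes above share one skeleton: each fit is seeded with a previously fitted coefficient vector. A schematic Python sketch, where fit_one is a hypothetical stand-in for a single penalized fit:

```python
def fit_path(fit_one, lambdas, beta0):
    # Warm-started path: the solution at lambda_k becomes the initial
    # value for the solver at lambda_{k+1}.
    path, beta = [], beta0
    for lam in lambdas:
        beta = fit_one(lam, beta)
        path.append(beta)
    return path
```

Because successive solutions along the λ path are close to each other, each warm-started fit typically needs only a few iterations.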

The likelihood
The maximization algorithm needs the score vector s(β) and Fisher information matrix F from the likelihood model. For some likelihoods, such as generalized linear models (Nelder and Wedderburn 1972; McCullagh and Nelder 1989), these are relatively simple to compute. In this section, we briefly review the construction of generalized linear models and specify how they interface with the maximization algorithm.
With linear predictor η = Xβ and mean µ = E(y), the score can be written in vector notation as s(β) = XᵀDm, where m is a vector of derivatives with elements mᵢ = d log ℓᵢ(yᵢ)/dµᵢ, and D is a diagonal matrix with elements Dᵢ,ᵢ = dµᵢ/dηᵢ.
The Fisher information matrix F is given by F = XᵀD E(mmᵀ) DX. Because the observations are independent, E(mmᵀ) is a diagonal matrix with elements E(mᵢ²). Calling this matrix V, the information matrix can be written as F = XᵀDVDX. The information matrix over the active set is just F_A = X_AᵀDVDX_A, where X_A is the covariate matrix restricted to those columns which are in the active set.

Interface to the maximization algorithm
The score and information matrix could be supplied by a function, which is typically what occurs in maximization routines (see, for example, the MATLAB function fminunc in the Optimization Toolbox). However, because the likelihood depends on a substantial amount of data (the observations y and covariates X), and needs additional book-keeping functions, it is best implemented as an object with methods that can be called by the maximization algorithm as needed. The most straightforward interface would be for the likelihood object to supply the log-likelihood log ℓ(y; β), the score s(β), and the information matrix F(β) as requested. However, the full information matrix could be extremely large, even when the active set is small. A more efficient interface is for the likelihood object to supply the components needed to compute the information matrix, and allow the maximization algorithm to assemble them in an efficient way. Thus, if model is a likelihood object representing a generalized linear model, the log-likelihood at β is computed by a call to the method logl. The components m, D, V, and X as a function of β are returned by a call to the method scoring:

    [L, m, D, V, X] = scoring(model, beta)

Here L is the log-likelihood again; it is efficient to return it at the same time as the other quantities {m, D, V, X}. The diagonal matrices D and V are returned as column vectors; when all diagonal entries in D or V are the same, these can be returned as scalars.
Though the meaning of the elements in the tuple {m, D, V, X} is specified for generalized linear models, the maximization algorithm does not care; it requires only that the information matrix F_A is equal to X_AᵀDVDX_A and the score s(β) is equal to XᵀDm. For example, the likelihood object could return the tuple {Dm, 1, DVD, X} or {m, 1, V, DX} instead of {m, D, V, X}, as they yield exactly the same score and information matrix.
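As a concrete instance, the components for a logistic GLM with canonical link can be computed as follows. This is an illustrative Python sketch of the interface just described (score = XᵀDm, information = XᵀDVDX), not the toolbox's glm_logistic:

```python
import numpy as np

def logistic_scoring(beta, X, y):
    # Logistic GLM, canonical link: eta = X beta, mu = sigmoid(eta).
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))
    D = mu * (1.0 - mu)   # d(mu)/d(eta), diagonal of D
    m = (y - mu) / D      # derivative of the log-likelihood w.r.t. mu
    V = 1.0 / D           # E[m_i^2], diagonal of V
    return m, D, V

# The algorithm then assembles:
#   score       s = X.T @ (D * m)        (which reduces to X.T @ (y - mu))
#   information F = (X.T * (D * V * D)) @ X
```

Note that here D·m simplifies to y − µ and D·V·D to D, recovering the familiar logistic-regression score and information.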
As well as the above two methods logl and scoring, the likelihood object must also have the following methods:

obj = constructor(...) creates the likelihood object. All of the provided constructors will add an intercept unless a "nointercept" option has been given. However, this is just a convention.
p = property(model) returns a structure containing a number of model properties. The structure should contain fields n and p for the number of rows and columns in X, respectively; nobs for the number of observations, which may be different from n (e.g., in glm_logistic); intercept, which is the empty matrix if no intercept is fitted, and 1 if an intercept has been inserted as column 1; and colscale, which gives the L2 norms of the columns of X (used for standardization).
p = property(model, "name") returns a specific property of the model indicated by the name.
beta = initial(model) returns a suitable initial value for β, usually just zeros.
s = sample(model, index) returns a subset s of the model having observations in the index. This is used in cross-validation.

beta = project(model, beta) projects β onto the allowable domain of the model, when the domain of β is restricted. Otherwise, it returns β unchanged.
Further details of these methods can be found by typing help models. The toolbox provides 'glm_gaussian', 'glm_logistic', 'glm_poisson', and 'glm_multinomial' class constructors. These all inherit from a base class 'glm_base', which provides many of the above methods.

Penalties
The third component of the penalized toolbox is the penalty function π(β) = Σᵢ₌₁ᵖ wᵢπᵢ(βᵢ). If the weights wᵢ in the penalty are different from 1, they can be specified in a call to penalized by including the penaltywt option. For example,

    penalized(model, @penalty, "penaltywt", w)

uses the elements of the vector w as the penalty weights. A penalty function is useful in a model selection role by inducing sparseness in the coefficient vector β. The ability of a penalty function to induce sparseness arises from the discontinuous derivative of the penalty at zero. Coefficients βᵢ = 0 satisfy the first-order optimality conditions when (1/n)s(βᵢ) lies within λwᵢ [πᵢ⁻(0), πᵢ⁺(0)], the subdifferential of the penalty at zero scaled by λwᵢ. What this implies is that coefficients βᵢ are "trapped" at the singularity (Fan and Li 2001) until their score (1/n)s(βᵢ) exceeds the maximum subderivative multiplied by λwᵢ. Thus penalties which have a singularity at zero are sparsity inducing, because they trap coefficients at zero.
Because the singularity at zero is the only useful one for inducing sparsity, the maximization algorithm assumes that it is the only singularity. However, some penalties also have discontinuous derivatives away from zero (such as p_clipso). These discontinuities do not induce sparsity, and are ignored by the maximization algorithm. This does not cause a problem with p_clipso, but other penalties with non-differentiable points away from zero might possibly fail. The subderivative at zero also needs to be finite; otherwise coefficients will be permanently trapped there.
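The trapping condition can be written as a small predicate. A Python sketch, assuming a subdifferential [sub_lo, sub_hi] at zero (for the LASSO this is [-1, 1]):

```python
def stays_at_zero(score_i, n, lam, w_i, sub_lo=-1.0, sub_hi=1.0):
    # beta_i = 0 satisfies the first-order optimality conditions while
    # the scaled score lies inside lam * w_i * [sub_lo, sub_hi].
    g = score_i / n
    return lam * w_i * sub_lo <= g <= lam * w_i * sub_hi
```

A coefficient whose scaled score escapes this interval violates optimality and becomes a candidate for the active set, as described in the maximization algorithm above.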

Supplied penalty functions
The penalized toolbox supplies the following penalty functions:

Adaptive (Zou 2006): The adaptive LASSO is πᵢ(βᵢ) = |βᵢ|/|β̂ᵢ|^γ, where β̂ᵢ is a consistent estimate of βᵢ, such as the ordinary least-squares estimate. The adaptive LASSO is discussed further in Section 6.2.
FLASH (Radchenko and James 2011): The FLASH algorithm is not defined as a penalized optimization, but it has an implied penalty. The FLASH penalty, with parameter δ, is equal to |βᵢ| when βᵢ = 0, and is equal to (1 − δ)|βᵢ| when βᵢ ≠ 0. The equivalence between the FLASH algorithm and this penalty is shown in Appendix A. To use FLASH as described in Radchenko and James (2011), you must call

    penalized(model, @p_flash, ..., "warmstart", "lambda")

to get the correct warm starts.
Lq: The Lq penalty is just the Lq norm, πᵢ(βᵢ) = |βᵢ|^q. When q is less than 1, this penalty will trap all elements of β in the inactive set, because the subderivatives at zero are infinite. However, it can be used in penalized when a range of different q values is provided. For example, the call

    penalized(model, @p_Lq, "q", [1 0.8 0.6 0.4 0.2 0])

will work, because the first value of q is just the LASSO, and subsequent values of q use the LASSO solution as a warm start.

MC+ (Zhang 2010; Mazumder et al. 2011): The MC+ penalty is most easily defined by its derivative: π′ᵢ(βᵢ) = sign(βᵢ)(1 − |βᵢ|/(αλ))₊ when βᵢ ≠ 0, with subdifferential [−1, 1] when βᵢ = 0.

None: This penalty function does not penalize, for cases where unpenalized maximum likelihood fitting is needed; πᵢ(βᵢ) = 0. When using this penalty, the convergence criteria for penalized may need to be tightened. See help options.
SCAD (Fan and Li 2001): The smoothly clipped absolute deviation penalty is most easily defined by its derivative:

    π′ᵢ(βᵢ) = sign(βᵢ)                          when |βᵢ| < λ,
    π′ᵢ(βᵢ) = sign(βᵢ)(aλ − |βᵢ|)/((a − 1)λ)    when λ ≤ |βᵢ| < aλ,
    π′ᵢ(βᵢ) = 0                                 when |βᵢ| ≥ aλ.

SCAD behaves like the LASSO when |βᵢ| < λ, does not penalize when |βᵢ| ≥ aλ, and smoothly transitions between these two behaviours when λ ≤ |βᵢ| < aλ. The parameter a must be greater than 2.
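The SCAD derivative translates directly into code. An illustrative Python sketch of π′ᵢ, valid on the active set (βᵢ ≠ 0), in the convention where the total penalty is λ times π:

```python
def scad_deriv(b, lam, a=3.7):
    # LASSO-like slope of 1 below lam, linear decay to zero at a*lam,
    # and no penalization beyond a*lam. a = 3.7 is the default
    # suggested by Fan and Li (2001).
    t = abs(b)
    if t >= a * lam:
        return 0.0
    slope = min(1.0, (a * lam - t) / ((a - 1.0) * lam))
    return slope if b >= 0 else -slope
```

The min(...) form makes the continuity at |b| = λ explicit: the decaying branch equals 1 exactly at that point.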
Other penalty functions can be defined and used as long as they adhere to the calling conventions given next.

Interface to the maximization algorithm
In the maximization algorithm, the penalty function must supply, at different times, the individual penalties πᵢ(βᵢ), the derivatives π′ᵢ(βᵢ), the subdifferentials [πᵢ⁻(βᵢ), πᵢ⁺(βᵢ)], and the second derivatives π″ᵢ(βᵢ), when requested. Switching between these is accomplished with a mode parameter. Any additional parameters needed by the penalty function are passed in as fields in an options structure. For example, the call

    penalized(model, @p_clipso, "alpha", 0.3)

will create an options structure which contains a field options.alpha equal to 0.3. This options structure is then passed to the penalty function.
The penalty function takes four arguments: a mode string, a coefficient vector, the current value of λ, and an options structure options, with any extra parameters as fields. The following call patterns are expected:

p = penalty("", beta, lambda, options) An empty mode string "" asks the penalty function to return the penalty values for a coefficient vector beta; p(i) is the penalty πᵢ(βᵢ) for coefficient beta(i). The total penalty is lambda * sum(w.*p), where w is a vector of weights, usually all 1. The vector of weights can be specified by the option penalized(..., "penaltywt", w).

d = penalty("deriv", beta, lambda, options) This returns a vector d of derivatives, where d(i) = dπᵢ(βᵢ)/dβᵢ. Elements of d in the inactive set are ignored, so any value can be returned for them. For elements of d in the active set where the derivative is discontinuous, return either endpoint of the subdifferential or the average of the endpoints.
[lo, hi] = penalty("subdiff", beta, lambda, options) This returns the subdifferential intervals for the elements beta(i). The return values should be lo(i) = πᵢ⁻(βᵢ) and hi(i) = πᵢ⁺(βᵢ). Elements of lo and hi that are in the active set are ignored, so any value can be used there. If all subdifferential intervals in the inactive set are the same (which is usually the case), then lo and hi can be scalars rather than vectors.

p2 = penalty("2ndderiv", beta, lambda, options) This returns the vector of second derivatives of the penalty for the parameter beta. If all second derivatives are the same, p2 can be a scalar. Elements of p2 that are not in the active set are ignored.

tf = penalty("project", beta, lambda, options) This returns true if the orthant projection Proj_A is required for the coefficient vector β. This is true for most penalties; exceptions are ridge and none.
For example, suppose we want to create a new penalty π(β) = log(1 + α|β|), which we will call abslog. The penalty will be called with a specific parameter α = 1 by typing

    penalized(model, @abslog, "alpha", 1)

The function penalized will put α into an options structure, which will be passed to our penalty function as the last parameter.
Thus the first two lines of our new penalty can be

    function [x, y] = abslog(mode, beta, lambda, options)
    alpha = options.alpha;

When the mode parameter is "", we return the penalty values in the x variable:

    x = log(1 + alpha * abs(beta));

When the mode is "deriv", we return the derivative:

    x = alpha * sign(beta) ./ (1 + alpha * abs(beta));

When the mode is "subdiff", we return the endpoints of the subdifferential in x and y:

    x = -alpha; y = alpha;

When the mode is "2ndderiv", we return the second derivative:

    x = -alpha^2 ./ (1 + alpha * abs(beta)).^2;

Finally, when the mode is "project", we return x = true, because this penalty requires orthant projection.

Related algorithms
The penalized toolbox can also be used to implement some other penalized likelihood algorithms.

The relaxed LASSO
The relaxed LASSO (Meinshausen 2007) is a way of successively reducing the shrinkage over the active set of parameters. For a given penalty weight λ, the relaxed LASSO applies the reduced penalty weight φλ to the coefficients in the active set, where φ ∈ (0, 1] is the relaxation parameter. Initially, φ = 1, but after the coefficients for a given λ have been determined, φ is relaxed towards zero over the non-zero coefficients, while holding the other coefficients at zero. This is simply the FLASH penalty with δ = 1 − φ, so we can use the FLASH penalty to implement the relaxed LASSO (relaxo) with the following call:

    fit = penalized(model, @p_flash, "delta", 0:0.1:1, "warmstart", "relax")

(Note that "warmstart", "relax" is the default and can be omitted.) In this case, the FLASH parameter δ = 1 − φ is relaxed in increments of 0.1. The penalized function fits a full sequence of λ for δ = 0 (i.e., a LASSO fit), then for each fit relaxes δ from 0 through to 1.

The adaptive LASSO
The penalized function allows you to set individual penalty weights for each coefficient by adding a "penaltywt" option. The weighted penalized likelihood can be run with the call

    fit = penalized(model, @p_lasso, "penaltywt", w)

where w is a vector of the penalty weights wᵢ (the penalty weight for any intercept is always forced to be zero).
The adaptive LASSO uses a particular set of penalty weights which ensure an oracle property and near-minimax estimation (Zou 2006). The adaptive LASSO could be implemented in penalized by setting wᵢ = 1/|β̂ᵢ|^γ, where β̂ᵢ is a consistent estimate of βᵢ, such as the ordinary least-squares estimate. Unfortunately, this approach does not permit us to use cross-validation to find the best value of the power γ, since γ needs to be a parameter for this to work.
Thus, we pass the adaptive LASSO weights as a separate option, called "adaptivewt", in the call to the penalized routine. The adaptive LASSO can be called in two ways. If only one value of gamma is used, say 0.5, then the adaptive LASSO is invoked as

    fit = penalized(model, @p_adaptive, "gamma", 0.5, "adaptivewt", {beta_ols})

where beta_ols is the vector of adaptive LASSO weights. The adaptive weight vector is enclosed in a cell because when penalized reads the parameters, it assumes that any nonstandard option (that is, one which is not described in help options) whose value is an array must be a penalty parameter which should then be "relaxed" over. As the adaptive weights are not a relaxation parameter, they are enclosed in a cell to avoid this misinterpretation. If multiple values of gamma are used, then the call is, for example,

    fit = penalized(model, @p_adaptive, "gamma", 1:-0.2:0.01, "adaptivewt", ...
        {beta_ols})

Again the adaptive weights must be enclosed in a cell so that penalized interprets them correctly.
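The weights passed via beta_ols come down to the formula wᵢ = 1/|β̂ᵢ|^γ. An illustrative Python sketch of the weight computation:

```python
def adaptive_weights(beta_hat, gamma):
    # Adaptive LASSO weights (Zou 2006): coefficients with large
    # consistent estimates are penalized lightly, small ones heavily.
    return [1.0 / abs(b) ** gamma for b in beta_hat]
```

For example, with γ = 1 an OLS estimate of 2.0 gets weight 0.5 while an estimate of 0.5 gets weight 2.0, so the penalty presses hardest on coefficients that already look small.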

Performance
Flexibility and extensibility were the overriding design concerns for the penalized toolbox. Performance was not completely ignored, however, and while penalized is slower than glmnet, it completes in a reasonable time. Table 1 gives some representative running times for penalized compared to glmnet. The timings for penalized were obtained using MATLAB 2007b (32-bit) running on a Samsung EP300E5C laptop (Core i3 2.4GHz, 6GB memory, using mains power) under Windows 7 (64-bit). The timings for glmnet were obtained on the same machine using R (64-bit). Each timing is an average of 15 runs. Each run fitted a sequence of 100 lambdas on the same data set.
The design matrices X used in the timings were randomly generated with uncorrelated columns; introducing correlations did not change the comparisons very much, so those results are not shown. The true β coefficients oscillated between positive and negative values, with an exponential decay in their magnitude. The rate of decay was such that the 7th coefficient was half the size of the first. The set of λ values was selected by glmnet and then used by penalized, to ensure that both packages fitted the same sequence of models.
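Coefficients with the decay pattern described above can be generated as follows (a sketch; the exact values used in the timings are not reproduced here). A per-step decay factor of 2^(-1/6) makes the 7th coefficient half the size of the first:

```matlab
% Oscillating coefficients with exponential magnitude decay:
% the 7th entry has half the magnitude of the 1st.
p = 100;
k = (0:p-1)';
beta = (-1).^k .* 2.^(-k / 6);   % beta(1) = 1, beta(7) = 0.5
```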
penalized is rarely more than a few seconds slower than glmnet, and most of the time the difference is barely noticeable. If a parameterized penalty (e.g., p_clipso, p_Lq) is used, the time taken increases in proportion to the number of parameter values: a call supplying five values of alpha will take about 5 times longer than penalized(model2, @p_clipso, "alpha", 0.3), which supplies one. Likewise, cross-validation time increases linearly with the number of folds.
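The scaling with the number of parameter values can be checked directly with tic/toc. In this sketch, model2 stands for a previously constructed model object (as in the text above):

```matlab
% Sketch: cost of one vs. five penalty-parameter values.
% model2 is assumed to be an existing model object.
tic; penalized(model2, @p_clipso, "alpha", 0.3);         t1 = toc;
tic; penalized(model2, @p_clipso, "alpha", 0.1:0.1:0.5); t5 = toc;
fprintf("one alpha: %.2fs, five alphas: %.2fs\n", t1, t5);
```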

Accuracy
The speed of penalized depends on the options controlling convergence at each level of λ. The default convergence criteria were chosen to maximize speed while keeping the coefficient estimates close to those provided by glmnet. Generally, the difference between the estimates provided by glmnet and penalized was less than 0.5% of the norm of the coefficient vector. Figure 4(A) compares the estimated coefficients from penalized (solid lines) and glmnet (dots) for one run (n = 1000, p = 100) using a logistic model. Other models are similar. The estimates from the two packages follow each other closely. The differences in the coefficients are plotted against λ in Figure 4(B). The jags in this plot are due to different convergence criteria: in glmnet the algorithm converges when every coefficient has a small enough change, while in penalized it converges when the vector of coefficients as a whole has a small enough change. This means that variables which enter the active set in penalized may not move very much until the next lower λ is used.

Conclusion
penalized is a flexible and efficient MATLAB toolbox for using and exploring penalized regression with generalized linear models. It allows the user to combine any penalty with any generalized linear model. The toolbox can be extended with log-likelihood models and penalties beyond those provided, making it simple to explore the performance of any model or penalty that can be coded to the toolbox API. The toolbox also allows the user to select the underlying maximization algorithm, so future versions may include a faster maximization algorithm for the gaussian model.

A. The FLASH penalty
The forward LASSO adaptive shrinkage (FLASH) algorithm was described in Radchenko and James (2011). They did not, however, specify the objective function that the algorithm maximizes. Here it is shown that the FLASH algorithm is a form of penalized likelihood. We only consider the least-squares case.
At iteration t, the FLASH algorithm computes two updates over an active set A. The first update is the ordinary least-squares update:

$$\beta^{t+1,\mathrm{OLS}}_A = \arg\min_\beta \frac{1}{2n}\|y - X_A\beta\|_2^2,$$

and the second is the LASSO update:

$$\beta^{t+1,\mathrm{LASSO}}_A = \arg\min_\beta \frac{1}{2n}\|y - X_A\beta\|_2^2 + \lambda\|\beta\|_1.$$

In either case, the step is β^{t+1}_A − β^t_A. Thus FLASH takes a step which is a weighted sum of the least-squares step and the LASSO step. That is, the FLASH update is

$$\beta^{t+1}_A = \beta^t_A + \delta\left(\beta^{t+1,\mathrm{OLS}}_A - \beta^t_A\right) + (1-\delta)\left(\beta^{t+1,\mathrm{LASSO}}_A - \beta^t_A\right).$$

This implies that the penalty is π(β) = (1 − δ)|β| for coefficients in the active set, which is the FLASH penalty defined here.
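For intuition, with a single standardized covariate both updates have closed forms (the OLS update is the unpenalized estimate; the LASSO update is its soft-thresholded version), and the FLASH step blends them. A small numeric sketch, not toolbox code:

```matlab
% One-covariate illustration of the FLASH step.
n = 100;
x = randn(n, 1); x = x / norm(x) * sqrt(n);   % standardize so x'x/n = 1
y = 2 * x + randn(n, 1);
lambda = 0.5; delta = 0.3; b = 0;             % current coefficient value

z = x' * y / n;                               % OLS update
s = sign(z) * max(abs(z) - lambda, 0);        % LASSO (soft-threshold) update
b_new = b + delta * (z - b) + (1 - delta) * (s - b);   % FLASH step
```

At δ = 0 this reduces to a pure LASSO step, and at δ = 1 to a pure least-squares step, matching the relaxation described above.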
B. The multinomial model

A set of multinomial observations is a matrix Y whose (i, j)th element y_{i,j} is the number of times category j occurred in observation i. Each element y_{i,j} has a certain probability p_{i,j} of occurring, and the log-likelihood of the observations is

$$\ell(Y) = \sum_{i,j} y_{i,j} \log p_{i,j} + \text{constant}.$$

The matrix of observations Y can be viewed as a concatenated set of column vectors Y = [y_1, y_2, . . . , y_q], one column for each category. Each column vector y_j has an associated probability vector p_j.
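The log-likelihood sum over i and j is a one-liner on the count and probability matrices, as this small sketch shows:

```matlab
% Sketch: multinomial log-likelihood (up to an additive constant).
% Rows of Y are category counts per observation; rows of P are probabilities.
Y = [3 1 0;
     0 2 2];                      % 2 observations, 3 categories
P = [0.6 0.3 0.1;
     0.2 0.4 0.4];                % each row sums to 1
loglik = sum(sum(Y .* log(P)));   % sum over i, j of y_ij * log p_ij
```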

B.1. Interface to the maximization algorithm
The scoring routine requires a tuple {m, D, V, J}. The vector m is just m = y/p, where division is element-by-element. The vector V is the expected value of m², namely

$$V = E\left[(y/p)^2\right] = y^2 - y + y/p = y^2 - y + m.$$