Fast Estimation of Multinomial Logit Models: R Package mnlogit

We present R package mnlogit for training multinomial logistic regression models, particularly those involving a large number of classes and features. Compared to existing software, mnlogit offers speedups of 10x-50x for modestly sized problems and more than 100x for larger problems. Running mnlogit in parallel mode on a multicore machine gives an additional 2x-4x speedup on up to 8 processor cores. Computational efficiency is achieved by drastically speeding up calculation of the log-likelihood function's Hessian matrix by exploiting structure in matrices that arise in intermediate calculations.


Introduction
Multinomial logit regression models, the multiclass extension of binary logistic regression, have long been used in econometrics in the context of modeling discrete choice (McFadden 1974;Bhat 1995;Train 2003) and in machine learning as a linear classification technique (Hastie, Tibshirani, and Friedman 2009) for tasks such as text classification (Nigam, Lafferty, and McCallum 1999). Training these models presents the computational challenge of having to compute a large number of coefficients which increases linearly with the number of classes and the number of features. Despite the potential for multinomial logit models to become computationally expensive to estimate, they have an intrinsic structure which can be exploited to dramatically speedup estimation. Our objective in this paper is twofold: first we describe how to exploit this structure to optimize computational efficiency, and second, to present an implementation of our ideas in our R (R Core Team 2013) package mnlogit which is available from CRAN at: http://cran.r-project.org/web/packages/mnlogit/index.html.
An older method of dealing with the computational issues involved in estimating large scale multinomial logistic regressions has been to approximate it as a series of binary logistic regressions (Begg and Gray 1984). In fact the R package mlogitBMA (Sevcikova and Raftery 2013) implements this idea as the first step in applying Bayesian model averaging to multinomial logit data. Large scale logistic regressions can, in turn, be tackled by a number of advanced optimization algorithms (Komarek and Moore 2005;Lin, Weng, and Keerthi 2008). A number of recent R packages have focussed on slightly different aspects of estimating regularized multinomial logistic regressions. For example: package glmnet (Friedman, Hastie, and Tibshirani 2010) is optimized for obtaining the entire L1-regularized paths and uses the coordinate descent algorithm with 'warm starts', package maxent (Jurka 2012) is intended for large text classification problems which typically have very sparse data and the package pmlr (Colby, Lee, Lewinger, and Bull 2010) which penalizes the likelihood function with the Jeffreys prior to reduce first order bias and works well for small to medium sized datasets. There are also R packages which estimate plain (unregularized) multinomial regression models. Some examples are: the VGAM package (Yee 2010), the multinom function in package nnet (Venables and Ripley 2002) and package the mlogit (Croissant 2012).
Of all the R packages previously described, mlogit is the most versatile in the sense that it handles many data types and extensions of multinomial logit models (such as nested logit, heteroskedastic logit, etc.). These are especially important in econometric applications, which are motivated by the utility maximization principle (McFadden 1974), where one encounters data which depends upon both the observation instance and the choice class. Our package mnlogit provides the ability of handling these general data types while adding the advantage of very quick computations. This work is motivated by our own practical experience of the impossibility of being able to train large scale multinomial logit models using existing software.
In mnlogit we perform maximumum likelihood estimation (MLE) using the Newton-Raphson (NR) method. We speed up the NR method by exploiting structure and sparsity in intermediate data matrices to achieve very fast computations of the Hessian of the log-likelihood function. This overcomes the NR method's well known weakness of incurring very high per-iteration cost, compared to algorithms from the quasi-Newton family (Nocedal 1992(Nocedal , 1990. Indeed classical NR estimations of multinomial logit models (usually of the Iteratively Reweighted Least Square family) have been slow for this very reason. On a single processor our methods have allowed us to achieve speedups of 10x-50x compared to mlogit on modestsized problems while performing identical computations. In parallel mode 1 , mnlogit affords the user an additional speedup of 2x-4x while using up to 8 processor cores.
We provide a simple formula-based interface for specifiying a varied menu of models to mnlogit. Section 2 illustrates aspects of the formula interface, the expected data format and the precise interpretations of variables in mnlogit. To make the fullest use of mnlogit we suggest that the user understand the simple R example worked out over the course of this section. Section 3 and Appendix A contain the details of our estimation procedure, emphasizing the ideas that underlie the computational efficiency we achieve in mnlogit. In Section 4 we present the results of our numerical experiments in benchmarking and comparing mnlogit's performance with other packages while Appendix C has a synopsis of our timing methods. Finally Section 5 concludes with a short discussion and a promising idea for future work.

On using mnlogit
The data for multinomial logit models may vary with both the choice makers ('individuals') and the choices themselves. Besides, the modeler may prefer model coefficients that may (or may not) depend on choices. In mnlogit we try to keep the user interface as minimal as possible without sacrificing flexibility. We follow the interface of the mlogit function in package mlogit. This section describes the mnlogit user interface, emphasizing data preparation requirements and model specification via an enhanced formula interface. To start, we load the package mnlogit in an R session: R> library("mnlogit") 2.1. Data preparation mnlogit accepts data in the 'long' format which requires that if there are K choices, then there be K rows of data for each individual (see also Section 1.1 of the mlogit vignette). Here is a snapshot from data in the 'long' format on choice of recreational fishing mode made by 1182 individuals: Notice that the snapshot shows this data for two individuals.
The actual choice made by an individual, the 'response' variable, is shown in the column 'mode'. mnlogit requires that the data contain a column with exactly two categories whose levels can be coerced to integers by as.numeric(). The greater of these integers is automatically taken to mean TRUE.
The only other column strictly mandated by mnlogit is one listing the names of choices (like column 'alt' in Fish data). However if the data frame is an mlogit.data class object, then this column maybe omitted. In such cases mnlogit can query the index attribute of an mlogit.data object to figure out the information contained in the 'alt' column.

Model parametrization
Multinomial logit models have a solid basis in the theory of discrete choice models. The central idea in these discrete models lies in the 'utility maximization principle' which states that individuals choose the alternative, from a finite, discrete set, which maximizes a scalar value called 'utility'. Discrete choice models presume that the utility is completely deterministic for the individual, however modelers can only model a part of the utility (the 'observed' part). Stochasticity entirely arises from the unobserved part of the utility. Different assumptions about the probability distribution of the unobserved utility give rise to various choice models like multinomial logit, nested logit, multinomial probit, GEV (Generalized Extreme Value), mixed logit etc. Multinomial logit models, in particular, assume that unobserved utility is i.i.d. and follows a Gumbel distribution. 2 We consider that the observed part of the utility for the i th individual choosing the k th alternative is given by: Here Latin letters (X, Y , Z) stand for data while Greek letters (ξ, α, β, γ) stand for parameters. The parameter ξ k is called the intercept. For many practical applications data in multinomial logit models can be naturally grouped into two types: • Individual specific variables X i which does not vary between choices (e.g., income of individuals in the 'Fish' data of Section 2).
• Alternative specific variables Y ij and Z ij which vary with alternative and may also differ, for the same alternative, between individuals (e.g., the amount of fish caught in the 'Fish' data: column 'catch').
In mnlogit we model these two data types with three types of coefficients: 1. Individual specific data with alternative specific coefficients X i · β j 2. Alternative specific data with generic coefficients Z ik · α.
3. Alternative specific data with alternative specific coefficients Y ik · γ k .
The vector notation serves to remind that more than one variable of each type maybe used to build a model. For example in the fish data we may choose both the 'price' and 'catch' with either generic coefficients (the α) or with alternative specific coefficients (the γ k ).
Due to the principle of utility maximization, only differences between utility are meaningful. This implies that the multinomial logit model can not determine absolute utility. We must specify the utility for any individual with respect to an arbitrary base value 3 which we choose to be 0. For convenience in notation, we fix the choice indexed by k = 0 as the base, thus normalized utility is given by: Notice that the above expression implies that V i0 = 0 ∀i. To simplify notation we re-write the normalized utility as: This equation retains the same meaning as the previous, notice the restriction: k = 0, since we need V i0 = 0. The most significant difference is that Z ik in Equation 2 stands for: Z ik − Z i0 (in terms of the original data).
The utility maximization principle, together with the assumtion on the error distribution, implies that for multinomial logit models (Train 2003) the probability of individual i choosing alternative k, P ik , is given by: Here V ij is the normalized utility given in Equation 2 and k = 0 is the base alternative with respect to which we normalize utilities. The number of available alternatives is taken as K which is a positive integer greater than one. From the condition that every individual makes a choice, we have that: k=K−1 k=0 P ik = 1,. This gives us the probability of individual i picking the base alternative: Note that K = 2 is the familiar binary logistic regression model.
Equation 2 has implications about which model parameters maybe identified. In particular for alternative-specific coefficients of individual-specific data we may only estimate the difference β k − β 0 . Similarly for the intercept only the difference ξ k − ξ 0 , and not ξ k and ξ 0 separately maybe estimated. For a model with K alternative we estimate K − 1 sets of parameters β k − β 0 and K − 1 intercepts ξ k − ξ 0 .

Formula interface
To specify multinomial logit models in R we need an enhanced version of the standard formula interface -one which is able to handle multi-part formulas. In mnlogit we built the formula interface using tools from the R package Formula (Zeileis and Croissant 2010). Our formula interface closely confirms to that of the mlogit package. We illustrate it with examples motivated by the 'Fish' dataset (introduced in Section 2). Consider a multinomial logit model where 'price' has a generic coefficient, 'income' data being individual-specific has an alternative-specific coefficient and the 'catch' also has an alternative-specific coefficient. That is, we want to fit a model that has the 3 types of coefficients described in Section 2.2. Such a model can be specified in mnlogit with a 3-part formula: By default, the intercept is included, it can be omitted by inserting a '-1' or '0' anywhere in the formula. The following formulas specify the same model with omitted intercept: We can omit any group of variables from the model by placing a 1 as a placeholder: When the meaning is unambiguous, an omitted group of variables need not have a placeholder.

Using package mnlogit
In an R session with mnlogit loaded, the man page can be accessed in the standard way:
In mnlogit we use the R built-in function qr, with its argument 'tol' set to 'linDepTol', to check for linear dependencies . If collinear columns are detected in the data then some are removed so that the remaining columns are linearly independent.
We now illustrate the practical usage of mnlogit and some of its methods by a simple example. Consider the model specified by the formula: This model has: • One variable of type α: 'price'.
• Two variable of type β k : 'income' and the intercept.
In the 'Fish' data the number of alternatives K = 4, so the number of coefficients in the above model is: • 1 coefficient for data that may vary with individuals and alternatives, corresponding to α.
• 2 × (K − 1) = 6, alternative specific coefficients for individual specific data (note: that we have subtract 1 from the number of alternative because after normalization the base choice coefficient can't be identified), corresponding to β k .
• 1 × K = 4 alternative specific coefficients for data which may vary with individuals and alternatives, corresponding to γ k .
Thus the total number of coefficients in this model is 1 + 6 + 4 = 11.
We call the function mnlogit to fit the model using the 'Fish' dataset on 2 processor cores.
R> fit <-mnlogit(fm, Fish, ncores=2) R> class(fit) [1] "mnlogit" For mnlogit class objects we have the usual methods associated with R objects: coef, print, summary and predict methods. In addition, the returned 'fit' object can be queried for details of the estimation process by: The estimation process terminates when first one of the 3 conditions 'maxiter', 'ftol' or 'gtol' are met. In case one runs into numerical singularity problems during the Newton iterations, we recommend relaxing 'ftol' or 'gtol' to obtain a suitable estimate. The plain Newton method has a tendency to overshoot extrema. In mnlogit we have inlcuded a 'line search' 4 which avoids this problem and ensures convergence (Nocedal and Wright 2000).
As a convenience, we provide the following method so that an mnlogit object maybe queried for the number and type of model coefficients. Finally there is provision for hypothesis testing. We provide the function hmftest to perform the Hausman-McFadden test for IIA (Independence of Irrelevant Alternatives), which is the central hypothesis underlying multinomial logit models (Train 2003, Chap. 3). Three functions to test for hypotheses, applicable to any model estimated by the maximum likelihood method, are also provided: • Function lrtest to perform the likelihood ratio test.
• Function waldtest to perform the Wald test.
• Function scoretest to perform the Rao score test.
These intent of these tests is succinctly described in Section 6 'Tests' of the mlogit package vignette and we shall not repeat it here. We encourage the interested user to consult the help page for any of these functions in the usual way, for example the lrtest help maybe accessed by: R> library("mnlogit") R> ?lrtest Functions hmftest and scoretest are adapted from code in the mlogit package, while lrtest and waldtest are built using tools in the CRAN R package lmtest (Zeileis and Hothorn 2002).

Estimation algorithm
In mnlogit we employ maximum likelihood estimation (MLE) to compute model coefficients and use the Newton-Raphson method to solve the optimization problem. The Newton-Raphson method is well established for maximizing the logistic family loglikelihoods (Hastie et al. 2009;Train 2003). However direct approaches of computing the Hessian of multinomial logit model's log-likelihood function have extremely deleterious effects on the computer time and memory required. We present an alternate approach which exploits structure of the the intermediate data matrices that arise in Hessian calculation to achieve the same computation much faster while using drastically less memory. Our approach also allows us to optimally parallelize Hessian computation and maximize the use of BLAS (Basic Linear Algebra Subprograms) Level 3 functions, providing an additional factor of speedup.

Maximizing the likelihood
Before going into details we specify our notation. Throughout we assume that there are K ≥ 3 alternatives. The letter i labels individuals (the 'choice-makers') while the letter k labels alternatives (the 'choices'). We also assume that we have data for N individuals available to fit the model (N is assumed to be much greater than the number of model parameters). We use symbols in bold face to denote matrices, for example H stands for the Hessian matrix.
To simplify housekeeping in our calculations we organize model coefficients into a vector θ. If the intercept is to be estimated then it simply considered another individual specific variable with an alternative specific coefficient but with the special provision that the 'data' corresponding to this variable is unity for all alternatives. The likelihood function is defined by L( θ) = i P y i | θ , where each y i labels the alternative observed to chosen by individual i. Now we have: Here I(y i = k) is the indicator function which unity if its argument is true and zero otherwise. The likelihood function is given by: It is more convenient to work with the log-likelihood function which is given by l( θ) = logL( θ). A little manipulation gives: In the above we make use of the identity k I(y i = k) = 1 and the definition of P i0 in Equation 4. McFadden (1974) has shown that the log-likelihood function given above is globally concave. A quick argument to demostrate global concavity of l( θ) is that it's the sum of affine functions V ik and the negation of the composition of the log-sum-exp function with a set of affine functions. 5 We solve the optimization problem by the Newton-Raphson (NR) method which requires finding a stationary point of the gradient of the log-likelihood. Note that MLE by the Newton-Raphson method is the same as the Fisher scoring algorithm (Hastie et al. 2009;Li 2013). For our log-likelihood function 5, this point (which we nameθ) is unique (because of global concavity) and is given by the solution of the equations: ∂l( θ) ∂ θ = 0. The NR method is iterative and starting at an initial guess obtains an improved estimate ofθ by the equation: Here the Hessian matrix, H = ∂ 2 l ∂ θ∂ θ and the gradient ∂l is called the full Newton step. In each iteration we attempt to update θ old by this amount. However if the log-likelihood value at the resulting θ new is smaller, then we instead try an update of δθ/2. This linesearch procedure is repeated with half the previous step until the new log-likelihood value is not lower than the value at θ old . Using such a linesearch procedure guarantees convergence of the Newton-Raphson iterations (Nocedal and Wright 2000).

Gradient and Hessian calculation
5 The log-sum-exp function's convexity and its closedness under affine composition are well known, see for example Chapter 3 of Boyd and Vandenberghe (2004).
Each Newton-Raphson iteration requires computation of the Hessian and gradient of the loglikelihood function. The expressions for the gradient and Hessian are quite well known 6 and in there usual form are given by: For a model where where only individual specific variables are used (that is only the matrix X contributes to the utility in Equation 2), the matricesX andW are given by (Li 2013;Bohning 1992 here X is a matrix of order N × p (p is the number of variables or features) and, Here the sub-matrices W k,t are diagonal matrices of order N × N , where diag(W k,t ) i = P ik (δ kt − P it ) and δ kt is the Kronecker delta which equals 1 if k = t and 0 otherwise. Using this notation the gradient can be written as (Li 2013): Where we take vectors y and P as vectors of length N × (K − 1), formed by vertically concatenating the N probabilities P ik and responses I(y i = k), for each k ∈ [1, K − 1]. The Newton-Raphson iterations of Equation 6 take the form: . Although in this section we have shown expressions for models with only individual specific variables, a general formulation ofX andW including the two other types of variables appearing in Equation 2 exists 7 . This is presented in Appendix B but their specific form is tangential to the larger point we make (our ideas extend to the general case in a simple way).
An immediate advantage of using the above formulation, is that Newton-Raphson iterations can be carried out using the framework of IRLS (Iteratively Re-weighted Least Squares) (Hastie et al. 2009, Section 4.4.1). IRLS is essentially a sequence of weighted least squares regresions which offers superior numerical stability compared to explicitly forming H and directly solving Equation 6 (Trefethen and Bau 1997, Lecture 19). However this method, besides being easy to implement, is computationally very inefficient. The matricesX and W are huge, of orders (K − 1)N × (K − 1)p and N (K − 1) × N (K − 1) respectively, but are otherwise quite sparse and possess a neat structure. We now describe our approach of exploiting this structured sparsity.

Exploiting structure -fast Hessian calculation
We focus our attention on computation of the Hessian since it's the most expensive step, as we later show from empirical measurements in Table 1 of Section 4. We start by ordering the vector θ, which is a concatenation of all model coefficients as specified in Equation 2, in the following manner: Here, the subscripts index alternatives and the vector notation reminds us there maybe multiple features modeled by coefficients of type β, γ, α. In θ we group together coefficients corresponding to an alternative. This choice is deliberate and leads to a particular structure of the Hessian matrix of the log-likelihood function with a number of desirable properties.
Differentiating the log-likelihood function with respect to the coefficient vector θ, we get: Here we have partitioned the gradient vector into chunks according to θ m which is a group of coefficients of a particular type (defined in Section 2.2), either alternative specific or generic. Subscript m (and subscript k) indicates a particular alternative, for example: The vector y m is a vector of length N whose i th entry is given by I(y i = m), it tells us whether the observed choice of individual i is alternative m, or not. Similarly P m is vector of length whose i th entry is given by P im , which is the probability individual i choosing alternative m. The matrices M m and Z k contain data for choice m and k, respectively. Each of these matrices has N rows, one for each individual. Specifically: Similarly, the matrices Z k are analogues of the Y m and have N rows each (note that due to normalization Z 0 = 0).
To compute the Hessian we continue to take derivatives with respect to chunks of coefficients θ m . On doing this we can write the Hessian in a very simple and compact block format as shown below.
Here H nm is a block of the Hessian and the matrices W nm are diagonal matrix of dimension N ×N , whose i th diagonal entry is given by: P in (δ nm −P im ). 8 The details of taking derivatives in this block-wise fashion are given in Appendix A.
The first thing to observe about Equation 10 is that effectively utilizes spartsity in the matrices X andW to obtain very compact expressions for H. The block format of the Hessian matrix is particularly suited for extremely efficient numerical computations. Notice that each block can be computed independently of other blocks with two matrix multiplications. The first of these involves multiplying a diagonal matrix to a dense matrix, while the second requires multiplication of two dense matrices. We handle the first multiplication with a handwritten loop which exploits the sparsity of the diagonal matrix, while the second multiplication is handed off to a BLAS (Basic Linear Algebra Subprograms) call for optimal efficiency (Golub and Loan 2013) 9 . Computing the Hessian block-by-block is very efficient since we can use level 3 BLAS calls (specifically DGEMM) to handle the most intensive calculations. Another useful property of the Hessian blocks is that because matrices W nm are diagonal (hence symmetric), we have the symmetry property H nm = H mn , implying that we only need to compute roughly half of the blocks.
Independence of Hessian blocks leads to a very fruitul strategy for parallelizing Hessian calculations: we simply divide the work of computing blocks in the upper triangular part of the Hessian among available threads. This strategy has the great advantage that threads don't require any synchronization or communication overhead. However the cost of computing all Hessian blocks is not the same: the blocks involving generic coefficients (the α) take much longer to compute longer. In mnlogit implementation, computation of the blocks involving generic coefficients is handled separately from other blocks and is optimized for serial computation.
Hessian calculation is, by far, the most time consuming step in solving the multinomial logit MLE problem via the Newton-Raphson method. The choice we make in representing the Hessian in the block format of Equation 10 has dramatic effects on the time (and memory) it takes for model estimation. In the next section we demonstrate the impact on computation times of this choice when contrasted with earlier approaches.

Benchmarking performance
For the purpose of performance profiling mnlogit code, we use simulated data generated using a custom R function makeModel sourced from simChoiceModel.R which is available in the folder mnlogit/vignettes/. Using simulated data we can easily vary problem size to study performance of the code -which is our main intention here -and make comparisons to other packages. Our tests have been performed on a dual-socket, 64-bit Intel machine with 8 cores per socket which are clocked at 2.6 GHz 10 . R has been natively compiled on this machine using gcc with BLAS/LAPACK support from single-threaded Intel MKL v11.0.
The 3 types of model coefficients mentioned in Section 2.2 entail very different computational requirements. In particular it can be seen from Equations 9 and 10, that Hessian and gradient calculation is computationally very demanding for generic coefficients. For clear-cut comparisons we speed test the code with 4 types of problems described below. In our results we shall use K to denote the number of alternatives and n p to denote the total number of coefficients in the model.

1.
Problem 'X': A model with only individual specific data with alternative specific coefficients.
2. Problem 'Y': A model with data varying both with individuals and alternatives and alternative specific model coefficients.
3. Problem 'Z': Same type of data as problem 'Y' but with generic coefficients which are independent of alternatives.
4. Problem 'YZ': Same type of data as problem 'Y' but with a mixture of alternative specific and generic coefficients.
Although problem 'X' maybe considered a special case of problem 'Y' but we have considered it separately, because it's typically used in machine learning domains as the simplest linear multiclass classifier (Hastie et al. 2009). We shall also demonstrate that mnlogit runs much faster for this class of problems. 11 The 'YZ' class of problems serves to illustrate a common use case of multinomial logit models in econometrics where the data may vary with both individuals and alternatives while the coefficients are a mixture of alternative specific and generic types (usually only a small fraction of variables are modeled with generic coefficients).
The workings of mnlogit can be logically broken up into 3 steps: 1. Pre-processing: Where the model formula is parsed and matrices are assembled from a user supplied data.frame. We also check the data for collinear columns (and remove them) to satisfy certain necessary conditions, specified in Appendix B, for the Hessian to be non-singular.
2. Newton-Raphson Optimization: Where we maximize the log-likelihood function to estimate model coefficients. This involves solving a linear system of equations and one needs to compute the log-likelihood function's gradient vector and Hessian matrix.
3. Post-processing: All work needed to take the estimated coefficients and returning an object of class mnlogit.  Table 2: Ratio between mlogit and mnlogit total running times on a single processor for problems of various sizes and types. Each problem has 50 variables with K alternatives and N = 50 * K * 20 observations to train the model. mlogit has been run separately with two optimizers: Newton-Raphson and BFGS. In all cases the iterations terminated when the difference between log-likelihoods in successive iterations reduced below 10 −6 . Note: These numbers can vary depending on the BLAS implementation linked to R and hardware specifications.
the very high pre-processing time for problem 'Z' whereof a large portion is spent in ensuring that the data satisfies necessary conditions, mentioned in Appendix B, for the Hessian to be non-singular.

Comparing mnlogit performance
We now compare the performance of mnlogit in single-threaded mode with some other R packages. This section focuses on the comparison with the R package mlogit since it's the only one which covers the entire range of variable and data types as mnlogit. Appendix C contains a synopsis of our data generation and timing methods including a comparison of mnlogit with the R packages VGAM and nnet. Table 2 shows the ratio between mlogit and mnlogit running times for the 4 classes of problems considered in Table 1. We see that for most problems, except those of type 'Z', mnlogit outperforms mlogit by a large factor. We have not run larger problems for this comparison because mlogit running times become too long, except problem 'Z' 12 .
Besides Newton-Raphson, which is the default, we have also run mlogit with the BFGS optimizer. Typically the BFGS method, part of the quasi-Newton class of methods, takes more iterations than the Newton method but with significantly lower cost per iteration since it never directly computes the Hessian matrix. Typically for large problems the cost of computing the Hessian becomes too high and the BFGS method becomes overall faster than the Newton method (Nocedal and Wright 2000). Our approach in mnlogit attacks this weakness of the Newton method by exploiting the structure and sparsity in matrices involved in the Hessian calculation to enable it to outperform BFGS.

Parallel performance
We now now turn to benchmarking mnlogit's parallel performance. Figure 1 shows the speedups we obtain in Hessian calculation for the same problems considered in Table 1. The value of n p , the number of model parameters, is significant because it's the dimension of the Hessian matrix (the time taken to compute the Hessian scales like O(n 2 p )). We run the parallel code separately on 2, 4, 8, 16 processor cores, comparing in each case with the single core time. Figure 1 shows that it's quite profitable to parallelize problems 'X' and 'Y', but the gains for problem 'Z' are not too high. This is because of a design choice we make: Hessian calculation for type 'Z' variables is optimized for serial processing. In practical modeling, number of model coefficients associated with 'Z' types variable is not high, especially when compared to those of types 'X' and 'Y' (because the number of the coefficients of these types is also proportional to the number of choices in the model, see Section 2.4). For problems of type 'YZ' (or other combinations which involve 'Z'), parallelization can bring significant gains if the number of model coefficients of type 'Z' is low, relative to other types.
It can also be seen from figure 1 that, somewhat mysteriously, parallel performance degrades quickly as the number of processor cores is increased beyond 4. This is a consequence of the fact that our OpenMP progam runs on a machine with shared cache and main memory, so increasing the number of threads degrades performance by increasing the chance of cache misses and hence slowing memory lookups. This is an intrinsic limitation of our hardware for which there is a theoretically simple solution: run the program on a machine with a larger cache! An important factor to consider in parallel speedups of the whole program is Amdahl's Law 13 which limits the maximum speedup that maybe be achieved by any parallel program. Assuming parallelization between n threads, Amdahl's law states that the ultimate speedup is given by: S n = 1 fs+(1−fs)/n , where f s is the fraction of non-parallelized, serial code. Table 3 lists the observed speedups on 2, 4 and 8 processor coress together with f s for problems of

Discussion
Through mnlogit we seek to provide the community a package which combines quick calculation and the ability to handle data collinearity with a software interface which encompasses a wide range of multinomial logit models and data types used in econometrics and machine learning. Our main idea, exploiting matrix structure in large scale linear algebra calculations is not novel; however this work is the first, as far as we are aware, to apply it to the estimation of multinomial logit models problems in a working software package. The parallelization capability of mnlogit, which can easily add a 2x-4x factor of speedup on now ubiquitous mulitcore computers, is another angle which is underutilized in statistical software. Although mnlogit code is not parallelized to the maximum possible extent, parallelizing the most expensive parts of the calculation was an important design goal. We hope that practical users of the package benefit from this feature.
This work was initially motivated by the need to train large-scale multinomial logistic regression models. For very large-scale problems, Newton's method is usually outperformed by gradient based, quasi-Newton methods like the l-BFGS algorithm (Liu and Nocedal 1989). Hessian based methods based still hold promise for such problems. The class of inexact Newton (also called truncated Newton) methods are specifically designed for problems where the Hessian is expensive to compute but taking a Hessian-vector product (for any given vector) is much cheaper (Nash 2000). Multinomial logit models have a Hessian with a structure which permits taking cheap, implicit products with vectors. Where applicable, inexact Newton methods have the promise of being better than l-BFGS methods (Nash and Nocedal 1991) besides having low memory requirements (since they never store the Hessian) and are thus very scalable. In the future we shall apply inexact Newton methods to estimating multinomial logit models to study their convergence properties and performance.

A. Log-likelihood differentiation
In this Appendix we give the details of our computation of gradient and Hessian of the loglikelihood function in Equation 5. We make use of the notation of Section 3.3. Taking the derivative of the log-likelihood with respect to a chunk of coefficient θ m one gets: The second term in this equation is a constant term, since the utility V ik , defined in Equation 2, is a linear function of the coefficients. Indeed we have: The vectors y m and the matrices M m and Z k are specified in Section 3.3. We take the derivative of the base case probability, which is specified in Equation 4, as follows: Here the probability vector P m is of length N with entries P im . In the last line we have used the fact that, after normalization, Z 0 is 0. Using Equations 11 and 12 we get the gradient in the form shown in Equation 9.
Upon differentiating the probability vector P k (k ≥ 1) in Equation 3 with respect to θ m we get: where D( P k ) is an N × N diagonal matrix whose i th diagonal entry is P ik and, matrix W km is also an an N × N diagonal matrix whose i th diagonal entry is P ik (δ km − P im ). In matrix form this is: We write the Hessian of the log-likelihood in block form as: However it can be derived in a simpler way by differentiating the gradient with respect to θ n . Doing this and making use of Equation 13 gives us Equation 10. The first two cases of equation are fairly straightforward with the matrices W km being the same as shown in Equation 13. The third case, when ( θ n , θ m are both α), is a bit messy and we describe it here.
Here the last line follows from the definition of matrix W kt as used in Equation 13.

B. Data requirements for Hessian non-singularity
We derive necessary conditions on the data for the Hessian to be non-singular. Using notation from Section 3.2, we start by building a 'design matrix'X by concatenating data matrices X, Y k and Z k in the following format: In the above 0 stands for a matrix of zeros of appropriate dimension. Similarly we build two more matrices Q and Q 0 as shown below: Using the 2 matrices above we define a 'weight' matrixW: The full Hessian matrix, containing all the blocks of Equation 10, is given by: H =X WX . For the matrix H to be non-singular, we must have the matrixX be full-rank. This leads us to the following necessary conditions on the data for the Hessian to be non-singular: 1. All matrices in the set: {X, Y 0 , Y 1 . . . Y K−1 } must be of full rank.
In mnlogit we directly test condition 1, while the second condition is tested by checking for collinearity among the columns of the matrix 15 : Columns are arbitrarily dropped one-by-one from a collinear set until the remainder becomes linearly independent.
Another necessary condition: It can be shown with some linear algebra manipulations (omitted because they aren't illuminating) that if we have a model with has only: data for generic variables independent of alternatives and the intercept, then the resulting Hessian will always be singular. mnlogit does not attempt to check the data for this condition which is independent of the 2 necessary conditions given above.

C. Timing tests
We give the details of our simulated data generation process and how we setup runs of the R packages mlogit, VGAM and nnet to compare running times against mnlogit. We start by loading mlogit into an R session:

R> library("mlogit")
Next we generate data in the 'long format' (described in Section 2) using the makeModel function sourced from the file simChoiceModel.R which is in the mnlogit/vignettes/ folder. The data we use for the timing tests shown here is individual specific (problem 'X' of Section 4) because this is the only one that packages VGAM and nnet can run. We generate data for a model with K choices as shown below.
R> source("simChoiceModel.R") R> data <-makeModel('X', K=10) Default arguments of makeModel set the number of variables and the number of observations, which are: The next steps setup a formula object which specifies that individual specific data must be modeled with alternative specific coefficients and the intercept is excluded from the model.
R> system.time(fit.mnlogit <-mnlogit(fm, data, "choices")) user system elapsed 2.200 0.096 2.305 Likewise we measure running times for mlogit running the same problem with the Newton-Raphson (the default) and the BFGS optimizers.
R> mdat <-mlogit.data(data[order(data$indivID), ], "response", shape="long", + alt.var="choices") R> system.time(fit.mlogit <-mlogit(fm, mdat)) # Newton-Raphson Here the first step is necessary to turn the data.frame object into an mlogit.data object required by mlogit. The default stopping conditions for mnlogit and mlogit are exactly the same. The timing results shown in Table 2 were obtained in a similar way but with different formulas for each type of problem. All our tests use the function makeModel to generate data.
For comparison with nnet we must make a few modifications: first we turn the data into a format required by nnet and then change the stopping conditons from their default to (roughly) match mnlogit and mlogit. We set the stopping tolerance so that 'reltol' controls convergence and roughly corresponds at termination to 'ftol' in these packages. Note that nnet runs the BFGS optimizer.
R> library("VGAM") R> eps <-vglm.control(epsilon = 1e-6) R> system.time(fit.vglm <-vglm(fm.nnet, data=ndat, multinomial, control=eps)) user system elapsed 46.271 1.852 48.298 Note: The precise times running times reported on compiling this Sweave document depend strongly on the machine, whether other programs are also running simultaneously and the BLAS implementation linked to R. For reproducible results run on a 'quiet' machine (with no other programs running).