sparseHessianFD : An R Package for Estimating Sparse Hessian Matrices

Sparse Hessian matrices occur often in statistics, and their fast and accurate estimation can improve efficiency of numerical optimization and sampling algorithms. By exploiting the known sparsity pattern of a Hessian, methods in the sparseHessianFD package require many fewer function or gradient evaluations than would be required if the Hessian were treated as dense. The package implements established graph coloring and linear substitution algorithms that were previously unavailable to R users, and is most useful when other numerical, symbolic or algorithmic methods are impractical, inefficient or unavailable.

The Hessian matrix of a log likelihood function or log posterior density function plays an important role in statistics. From a frequentist point of view, the inverse of the negative Hessian is the asymptotic covariance of the sampling distribution of a maximum likelihood estimator. In Bayesian analysis, when evaluated at the posterior mode, it is the covariance of a Gaussian approximation to the posterior distribution. More broadly, many numerical optimization algorithms require repeated computation, estimation or approximation of the Hessian or its inverse; see Nocedal and Wright (2006).
The Hessian of an objective function with M variables has M 2 elements, of which M (M +1)/2 are unique. Thus, the storage requirements of the Hessian, and computational cost of many linear algebra operations on it, grow quadratically with the number of decision variables. For applications with hundreds of thousands of variables, computing the Hessian even once might not be practical under time, storage or processor constraints. Hierarchical models, in which each additional heterogeneous unit is associated with its own subset of variables, are particularly vulnerable to this curse of dimensionality However, for many problems, the Hessian is sparse, meaning that the proportion of Şstruc-turalŤ zeros (matrix elements that are always zero, regardless of the value at which function is estimated) is high. Consider a log posterior density in a Bayesian hierarchical model. If the outcomes across units are conditionally independent, the cross-partial derivatives of heterogeneous variables across units are zero. As the number of units increases, the size of the Hessian still grows quadratically, but the number of non-zero elements grows only linearly; the Hessian becomes increasingly sparse. The row and column indices of the non-zero elements comprise the sparsity pattern of the Hessian, and are typically known in advance, before computing the values of those elements. R packages such as trustOptim (Braun 2014), sparseMVN (Braun 2015) and ipoptr (Wächter and Biegler 2006) have the capability to accept Hessians in a compressed sparse format.
The sparseHessianFD package is a tool for estimating sparse Hessians numerically, using either Ąnite differences or complex perturbations of gradients. Section 1.1 will cover the speciĄcs, but the basic idea is as follows. Consider a real-valued function f (x), its gradient ∇f (x), and its Hessian Hf (x), for x ∈ R M . DeĄne the derivative vector as the transpose of the gradient, and a vector of partial derivatives, so Df (x) = ∇f (x) ⊤ = (D 1 , . . . , D M ). (Throughout the paper, we will try to reduce notational clutter by referring to the derivative and Hessian as D and H, respectively, without the f (x) symbol). Let e m be a vector of zeros, except with a 1 in the mth element, and let δ be a sufficiently small scalar constant. A ŞĄnite differenceŤ linear approximation to the mth column of the Hessian is H m ≈ (∇f (x + δe m ) − ∇f (x)) /δ. Estimating a dense Hessian in this way involves at least M + 1 calculations of the gradient: one for the gradient at x, and one after perturbing each of the M elements of x, one at a time. Under certain conditions, a more accurate approximation is the Şcomplex stepŤ method: H m ≈ Im (∇f (x + iδe m )) /δ, where i = √ −1 and Im returns the imaginary part of a complex number (Squire and Trapp 1998). Regardless of the approximation method used, if the Hessian is sparse, most of the elements are constrained to zero. Depending on the sparsity pattern of the Hessian, those constraints may let us recover the Hessian with fewer gradient evaluations by perturbing multiple elements of x together. For some sparsity patterns, estimating a Hessian in this way can be profoundly efficient. In fact, for the hierarchical models that we consider in this paper, the number of gradient evaluations does not increase with the number of additional heterogeneous units.
The package deĄnes the sparseHessianFD class, whose initializer requires the user to provide functions that compute an objective function, its gradient (as accurately as possible, to machine precision), and the sparsity pattern of its Hessian matrix. The sparsity pattern (e.g., location of structural zeros) must be known in advance, and cannot vary across the domain of the objective function. The only functions and methods of the class that the end user should need to use are the initializer, methods that return the Hessian in a sparse compressed format, and perhaps some utility functions that simplify the construction of the sparsity pattern. The class also deĄnes methods that partition the variables into groups that can be perturbed together in a Ąnite differencing step, and recovers the elements of the Hessian via linear substitution. Those methods perform most of the work, but should be invisible to the user.
As with any computing method or algorithm, there are boundaries around the space of applications for which sparseHessianFD is the right tool for the job. In general, numerical approximations are not ŞĄrst choiceŤ methods because the result is not exact, so sparseHes-sianFD should not be used when the application cannot tolerate any error, no matter how small. Also, we admit that some users might balk at having to provide an exact gradient, even though the Hessian will be estimated numerically. 1 However, deriving a vector of Ąrst derivatives, and writing R functions to compute them, is a lot easier than doing the same for a matrix of second derivatives, and more accurate than computing second-order approximations from the objective function. Even when we have derived the Hessian symbolically, in practice it may still be faster to estimate the Hessian using sparseHessianFD than coding it directly. These are the situations in which sparseHessianFD adds the most value to the statisticianŠs toolbox.
This article proceeds as follows. First, we present some background information about numerical differentiation, and sparse matrices in R, in Section 1. In Section 2, we explain how to use the package. Section 3 explains the underlying algorithms, and Section 4 demonstrates the scalability of those algorithms.

Background
Before describing how to use the package, we present two short background notes. The Ąrst note is an informal mathematical explanation of numerical estimation of the Hessian matrix, with an illustration of how the number of gradient estimates can be reduced by exploiting the sparsity pattern and symmetric structure. This note borrows heavily from, and uses the notation in, Magnus and Neudecker (2007, Chapter 6). The second note is a summary of some of the sparse matrix classes that are deĄned in the Matrix package (Bates and Maechler 2015), which are used extensively in sparseHessianFD.

Numerical differentiation of sparse Hessians
The partial derivative of a real scalar-valued function f (x) with respect to x j (the jth component of x ∈ R M ) is deĄned as For a sufficiently small δ, this deĄnition allows for a linear approximation to D j f (x). The derivative of f (x) is the vector of all M partial derivatives.
We deĄne the second-order partial derivative as and the Hessian as The Hessian is symmetric, so D 2 ij = D 2 ji .

Approximation using Ąnite differences
To estimate the mth column of H using Ąnite differences, we choose a sufficiently small δ, and compute For M = 2, our estimate of a general Hf (x) would be This estimate requires three evaluations of the gradient to get Df (x 1 , x 2 ), Df (x 1 + δ, x 2 ), and Df (x 1 , x 2 + δ). Now suppose that the Hessian is sparse, and that the off-diagonal elements are zero. That means that If the identity in Equation 7 holds for x 1 , it must also hold for x 1 + δ, and if Equation 8 holds for x 2 , it must also hold for x 2 + δ. Therefore, Only two gradients, Df (x 1 , x 2 ) and Df (x 1 + δ, x 2 + δ), are needed. Being able to reduce the number of gradient evaluations from 3 to 2 depends on knowing that the cross-partial derivatives are zero.

Approximation using complex steps
If f (x) is deĄned over a complex domain and is holomorphic, then we can we can approximate Df (x) and Hf (x) at real values of x using the complex step method. This method comes from a Taylor series expansion of f (x) in the imaginary direction of the complex plane (Squire and Trapp 1998). After rearranging terms, and taking the imaginary parts of both sides, Estimating a Ąrst derivative using the complex step method does not require a differencing operation, so there is no subtraction operation that might generate roundoff errors. Thus, the approximation can be made arbitrarily precise as δ → 0 (Lai and Crassidis 2008). This is not the case for second-order approximations of the Hessian (Abreu, Stich, and Morales 2013). However, when the gradient can be computed exactly, we can compute a Ąrst-order approximation to Hessian by treating it as the Jacobian of a vector-valued function (Lai and Crassidis 2008).
If this matrix were dense, we would need two evaluations of the Df (x) to estimate it. If the matrix were sparse, with the same sparsity pattern as the Hessian in Equation 9, and we assume that structural zeros remain zero for all complex x ∈ Z M , then we need only one evaluation. Suppose we were to subtract Im(Df (x 1 , x 2 )) from each column of Hf (x). When x is real, the imaginary part of the gradient is zero, so this operation has no effect on the value of the Hessian. But the sparsity constraints ensure that the following identities hold all complex x.
As with the Ąnite difference method, because Equation 13 holds for x 1 , it must also hold for x 1 + iδ, and because Equation 14 holds for x 2 , it must also hold for x 2 + iδ. Thus, for real x. Only one evaluation of the gradient is required.
Perturbing groups of variables Curtis, Powell, and Reid (1974) describe a method of estimating sparse Jacobian matrices by perturbing groups of variables together. Powell and Toint (1979) extend this idea to the general case of sparse Hessians. This method partitions the decision variables into C mutually exclusive groups so that the number of gradient evaluations is reduced. DeĄne G ∈ R M ×C where G mc = δ if variable m belongs to group c, and zero otherwise. DeĄne G c ∈ R M as the cth column of G.
Next, deĄne Y ∈ R M ×C such that each column is either a difference in gradients, or the imaginary part of a complex-valued gradient, depending on the chosen method.
If C = M , then G is a diagonal matrix with δ in each diagonal element. The matrix equation HG = Y represents the linear approximation H im δ ≈ y im , and we can solve for all elements of H just by computing Y. But if C < M , there must be at least one G c with δ in at least two rows. The corresponding column Y c is computed by perturbing multiple variables at once, so we cannot solve for any H im without further constraints.
These constraints come from the sparsity pattern and symmetry of the Hessian. Consider an example with the following values and sparsity pattern.
Suppose C = 2, and deĄne group membership of the Ąve variables through the following G matrix.
Variables 1, 2 and 5 are in group 1, while variables 3 and 4 are in group 2.
Next, compute the columns of Y using Equation 16. We now have the following system of linear equations from HG = Y.
Note that this system is overdetermined. Both h 31 = y 12 and h 53 = y 52 can be determined directly, but h 31 + h 53 = y 31 may not necessarily hold, and h 42 could be either y 41 or y 22 . Powell and Toint (1979) prove that it is sufficient to solve LG = Y instead via a substitution method, where L is the lower triangular part of H. This has the effect of removing the equations h 42 = y 22 and h 31 = y 12 from the system, but retaining h 53 = y 52 . We can then solve for h 31 = y 31 − y 52 . Thus, we have determined a 5 × 5 Hessian with only three gradient evaluations, in contrast with the six that would have been needed had H been treated as dense.
The sparseHessianFD algorithms assign variables to groups before computing the values of the Hessian. This is why the sparsity pattern needs to be provided in advance. If a non-zero element is omitted from the sparsity pattern, the resulting estimate of the Hessian will be incorrect. The only problems with erroneously including a zero element in the sparsity pattern are a possible lack of efficiency (e.g., an increase in the number of gradient evaluations), and that the estimated value might be close to, but not exactly, zero. The algorithms for assigning decision variables to groups, and for extracting nonzero Hessian elements via substitution, are described in Section 3.

Sparse matrices and the Matrix package
The sparseHessianFD package uses the sparse matrix classes that are deĄned in the Matrix package (Bates and Maechler 2015). All of these classes are subclasses of sparseMatrix. Only the row and column indices (or pointers to them), the non-zero values, and some metadata, are stored; unreferenced elements are assumed to be zero. Class names, summarized in Table 1, depend on the data type, matrix structure, and storage format. Values in numeric and logical matrices correspond to the R data types of the same names. Pattern matrices contain row and column information for the non-zero elements, but no values. The storage format refers to the internal ordering of the indices and values, and the layout deĄnes a matrix as symmetric (so duplicated values are stored only once), triangular, or general. The levels of these three factors determine the preĄx of letters in each class name. For example, a triangular sparse matrix of numeric (double precision) data, stored in column-compressed format, has a class dtCMatrix. Matrix also deĄnes some other classes of sparse and dense matrices that we will not discuss here. The Matrix package uses the as function to convert sparse matrices from one format to another, and to convert a base R matrix to one of the Matrix classes.
The distinctions among sparse matrix classes is important because sparseHessianFDŠs hessian method returns a dgCMatrix, even though the Hessian is symmetric. Depending on how the Hessian is used, it might be useful to coerce the Hessian into a dsCMatrix object. Also, the utility functions in Table 2 expect or return certain classes of matrices, so some degree of coercion of input and output might be necessary. Another useful Matrix function is tril, which extracts the lower triangle of a general or symmetric matrix.

Using the package
In this section, we demonstrate how to use the sparseHessianFD, using a hierarchical binary choice model as an example. Then, we discuss the sparsity pattern of the Hessian, and estimate the Hessian values.

Example model: hierarchical binary choice
Suppose we have a dataset of N households, each with T opportunities to purchase a particular product. Let y i be the number of times household i purchases the product, out of the T purchase opportunities, and let p i be the probability of purchase. The heterogeneous parameter p i is the same for all T opportunities, so y i is a binomial random variable.
Let β i ∈ R k be a heterogeneous coefficient vector that is speciĄc to household i, such that β i = (β i1 , . . . , β ik ). Similarly, z i ∈ R k is a vector of household-speciĄc covariates. DeĄne each p i such that the log odds of p i is a linear function of β i and z i , but does not depend directly on β j and z j for another household j ̸ = i.
The coefficient vectors β i are distributed across the population of households following a multivariate normal distribution with mean µ ∈ R k and covariance Σ ∈ R k×k . Assume that we know Σ, but not µ, so we place a multivariate normal prior on µ, with mean 0 and covariance Ω ∈ R k×k . Thus, the parameter vector x ∈ R (N +1)k consists of the N k elements in the N β i vectors, and the k elements in µ.
The log posterior density, ignoring any normalization constants, is

Sparsity patterns
Let x 1 and x 2 be two subsets of elements of x. DeĄne D 2 x 1 ,x 2 as the product set of crosspartial derivatives between all elements in x 1 and all elements in x 2 . From the log posterior density in Equation 21, we can see that D 2 β i ,β i ̸ = 0 (one element of β i could be correlated with another element of β i ), and that, for all i, D 2 β i ,µ ̸ = 0 (because µ is the prior mean of each β i ). However, since the β i and β j are independently distributed, and the y i are conditionally independent, the cross-partial derivatives D 2 β i ,β j = 0 for all i ̸ = j. When N is much greater than k, there will be many more zero cross-partial derivatives than non-zero. Each D 2 is mapped to a submatrix of H, most of which with be zero. The resulting Hessian of the log posterior density will be sparse.
The sparsity pattern depends on the indexing function; that is, on how the variables are ordered in x. One such ordering is to group all of the coefficients in the β i for each unit together. β 11 , . . . , β 1k , β 21 , . . . , β 2k , . . . , β N 1 , . . . , β N k , µ 1 , . . . , µ k (22) In this case, the Hessian has a Şblock-arrowŤ structure. For example, if N = 5 and k = 2, then there are 12 total variables, and the Hessian will have the pattern in Figure 1a.
In both cases, the number of non-zeros is the same. There are 144 elements in this symmetric matrix, but only 64 are non-zero, and only 38 values are unique. Although the reduction in RAM from using a sparse matrix structure for the Hessian may be modest, consider what would happen if N = 1, 000 instead. In that case, there are 2,002 variables in the problem, and more than 4 million elements in the Hessian. However, only 12, 004 of those elements are non-zero. If we work with only the lower triangle of the Hessian, then we need to work with only 7,003 values.
The sparsity pattern required by sparseHessianFD consists of the row and column indices of the non-zero elements in the lower triangle the Hessian, and it is the responsibility of the user to ensure that the pattern is correct. In practice, rather than trying to keep track of the row and column indices directly, it might be easier to construct a pattern matrix Ąrst, check visually that the matrix has the right pattern, and then extract the indices. The package Figure 1: Two examples of sparsity patterns for a hierarchical model.

Matrix.to.Coord
Returns a list of vectors containing row and column indices of the non-zero elements of a matrix.

Matrix.to.Pointers
Returns indices and pointers from a sparse matrix.

Coord.to.Pointers
Converts list of row and column indices (triplet format) to a list of indices and pointers (compressed format).  (Table 2) to convert between sparse matrices, and the vectors of row and column indices required by the sparseHessianFD initializer.
The Matrix.to.Coord function extracts row and column indices from a sparse matrix. The following code constructs a logical block diagonal matrix, converts it to a sparse matrix, and prints the sparsity pattern of its lower triangle. R> library("sparseHessianFD") R> bd <-kronecker(diag(3), matrix(TRUE, 2, 2)) R> Mat <-as(bd, "nMatrix") R> printSpMatrix(tril(Mat)) [ If there is uncertainty about whether an element is a structural zero or not, one should err on the side of it being non-zero, and include that element in the sparsity pattern. There might be a loss of efficiency if the element really is a structural zero, but the result will still be correct. All that would happen is that the numerical estimate for that element would be zero (within machine precision). On the other hand, excluding a non-zero element from the sparsity pattern will likely lead to an incorrect estimate of the Hessian.
x A numeric vector, with length M at which the object will be initialized and tested.
fn,gr R Functions that return the value of the objective function, and its gradient. The Ąrst argument is the numeric variable vector. Other named arguments can be passed to fn and gr as well (see the ... argument below).
rows, cols Sparsity pattern: integer vectors of the row and column indices of the nonzero elements in the lower triangle of the Hessian.
delta The perturbation amount for Ąnite differencing of the gradient to compute the Hessian (the δ in Section 1.1). Defaults to sqrt(.Machine$double.eps).
index1 If TRUE (the default), row and col use one-based indexing. If FALSE, zerobased indexing is used.
complex If TRUE, the complex step method is used. If FALSE (the default), a simple Ąnite differencing of gradients is used.
... Additional arguments to be passed to fn and gr.

The sparseHessianFD class
The function sparseHessianFD is an initializer that returns a reference to a sparseHessianFD object. The initializer determines an appropriate permutation and partitioning of the variables, and performs some additional validation tests. The arguments to the initializer are described in Table 3.
To create a sparseHessianFD object, just call sparseHessianFD. Applying the default values for the optional arguments, the usage syntax to create a sparseHessianFD object is obj <-sparseHessianFD(x, fn, gr, rows, cols, . ..) where ... represents all other arguments that are passed to fn and gr.
The fn, gr and hessian methods respectively evaluate the function, gradient and Hessian at a variable vector x. The fngr method returns the function and gradient as a list. The fngrhs method includes the Hessian as well.

An example
Now we can estimate the Hessian for the log posterior density of the model from Section 2.1. For demonstration purposes, sparseHessianFD includes functions that compute the value (binary.f), the gradient (binary.grad) and the Hessian (binary.hess) of this model. We will treat the result from binary.hess as a ŞtrueŤ value against which we will compare the numerical estimates.
To start, we load the data, set some dimension parameters, set prior values for Σ −1 and Ω −1 , and simulate a vector of variables at which to evaluate the function. The binary.f and binary.grad functions take the data and priors as lists. The data(binary) call adds the appropriate data list to the environment, but we need to construct the prior list ourselves.   1 1 1 1 1 1 1 2 2 ...
Finally, we create an instance of a sparseHessianFD object. Evaluations of the function and gradient using the fn and gr methods will always give the same results as the true values because they are computed using the same functions. The default choice of method is complex=FALSE, so the evaluation of the Hessian is a Ąnite differenced approximation, so it is very close to, but not identical to, the true value, in terms of mean relative difference.
R> hs <-obj$hessian(P) R> mean(abs(hs-true.hess))/mean(abs(hs)) [1] 2.33571e-09 If complex=TRUE in the initializer, the call to the hessian method will apply the complex step method. To use this method, the functions passed as fn and gr must both accept a complex argument, and return a complex result, even though we are differentiating a real-valued function. Although base R supports complex arguments for most basic mathematical functions, many common functions (e.g., gamma, log1p, expm1, and the probability distribution functions) do not have complex implementations. Furthermore, the complex step method is valid only if the function is holomorphic. The methods in sparseHessianFD do not check that this is the case for the function at hand. We convey the following warning from the documentation of the numDeriv package (Gilbert and Varadhan 2012), which also implements the complex step method: Şavoid this method if you do not know that your function is suitable. Your mistake may not be caught and the results will be spurious.Ť Fortunately for demonstration purposes, the log posterior density in Equation 21 is holomorphic, so we can estimate its Hessian using the complex step method, and compute the mean relative difference from the true Hessian.

Algorithms
In this section, we explain how sparseHessianFD works. The algorithms are adapted from Coleman, Garbow, and Moré (1985b), who provided Fortran implementations as Coleman, Garbow, and Moré (1985a). Earlier versions of sparseHessianFD included licensed copies of the Coleman et al. (1985a) code, on which the current version no longer depends. Although newer partitioning algorithms have been proposed (e.g., Gebremedhin, Manne, and Pothen 2005;Gebremedhin, Tarafdar, Pothen, and Walther 2009), mainly in the context of automatic differentiation, we have chosen to implement established algorithms that are known to work well, and are likely optimal for the hierarchical models that many statisticians will encounter.

Partitioning the variables
Finding consistent, efficient partitions can be characterized as a vertex coloring problem from graph theory (Coleman and Moré 1984). In this sense, each variable is a vertex in an undirected graph, and an edge connects two vertices i and j if and only if H ij f (x) ̸ = 0. The sparsity pattern of the Hessian is the adjacency matrix of the graph. By Şcolor,Ť we mean nothing more than group assignment; if a variable is in a group, then its vertex has the color associated with that group. A ŞproperŤ coloring of a graph is one in which two vertices with a common edge do not have the same color. Coleman and Moré (1984) deĄne a Ştriangular coloringŤ as a proper coloring with the additional condition that common neighbors of a vertex do not have the same color. A triangular coloring is a special case of an Şcyclic coloring,Ť in which any cycle in the graph uses at least three colors (Gebremedhin, Tarafdar, Manne, and Pothen 2007).
An Şintersection setŤ contains characteristics that are common to two vertices, and an Şintersection graphŤ connects vertices whose intersection set is not empty. In our context, the set in question is the row indices of the non-zero elements in each column of L. In the intersection graph, two vertices are connected if the corresponding columns in L have at least one non-zero element in a common row. Powell and Toint (1979) write that a partitioning is consistent with a substitution method if and only if no columns of the of lower triangle of the Hessian that are in the same group have a non-zero element in the same row. An equivalent statement is that no two adjacent vertices in the intersection graph can have the same color. Thus, we can partition the variables by creating a proper coloring of the intersection graph of L.   This intersection graph, and the number of colors needed to color it, are not invariant to permutation of the rows and columns of H. Let π represent such a permutation, and let L π be the lower triangle of πHπ ⊤ . Coleman and Moré (1984, Theorem 6.1) show that a coloring is triangular if and only if it is also a proper coloring of the intersection graph of L π . Furthermore, Coleman and Cai (1986) prove that a partitioning is consistent with a substitution method if and only if it is an acyclic coloring of the graph of the sparsity pattern of the Hessian. Therefore, Ąnding an optimal partitioning of the variables involves Ąnding an optimal combination of a permutation π, and coloring algorithm for the intersection graph of L π .
These ideas are illustrated in Figures 2 and 3. Figure 2a shows the sparsity pattern of the lower triangle of a Hessian as an adjacency matrix, and Figure 2b is the associated graph with a proper vertex coloring. Every column (and thus, every pair of columns) in Figure 2a has a non-zero element in row 7, so there are no non-empty intersection sets across the columns. All vertices are connected to each other in the intersection graph (Figure 2c), which requires seven colors for a proper coloring. Estimating a sparse Hessian with this partitioning scheme would be no more efficient than treating the Hessian as if it were dense. Now suppose we were to rearrange H so the last row and and column were moved to the front.
In Figure 3a, all columns share at least one non-zero row with the column for variable 7, but variable groups ¶2, 4, 6♢ and ¶1, 3, 5♢ have empty intersection sets. The intersection graph in Figure 3c has fewer edges than Figure 2c, and can be colored with only three colors.
The practical implication of all of this is that by permuting the rows and columns of the Hessian, we may be able to reduce the number of colors needed for a cyclic coloring of the graph of the sparsity pattern. Fewer colors means fewer partitions of the variables, and that means fewer gradient evaluations to estimate the Hessian.
The sparseHessianFD class Ąnds a permutation, and partitions the variables, when it is initialized. The problem of Ąnding a cyclic coloring of the graph of the sparsity pattern is NPcomplete (Coleman and Cai 1986), so the partitioning may not be truly optimal. Fortunately, we just need the partitioning to be reasonably good, to make the effort worth our while. A plethora of vertex coloring heuristics have been proposed, and we make no claims that any of the algorithms in sparseHessianFD are even Şbest availableŤ for all situations.
The Ąrst step is to permute the rows and columns of the Hessian. A reasonable choice is the Şsmallest-lastŤ ordering that sorts the rows and columns in decreasing order of the number of elements (Coleman and Moré 1984, Theorem 6.2). To justify this permutation, suppose non-zeros within a row are randomly distributed across columns. If the row is near the top of the matrix, there is a higher probability that any non-zero element is in the upper triangle, not in the lower. By putting sparser rows near the bottom, we do not change the number of non-zeros in the lower triangle, but we should come close to minimizing the number of non-zeros in each row. Thus, we would expect the number of columns with non-zero elements in common rows to be smaller, and the intersection graph to be sparser (Gebremedhin et al. 2007).
The adjacency matrix of the intersection graph of the permuted matrix is the Boolean crossproduct, L ⊤ π L π . Algorithm 1 is a ŞgreedyŤ vertex coloring algorithm, in which vertices are colored sequentially. The result is a cyclic coloring on the sparsity graph, which in turn is a consistent partitioning of the variables.

Computing the Hessian by substitution
The cycling coloring of the sparsity graph deĄnes the G matrix from Section 1.1. We then estimate Y using Equation 16. Let C m be the assigned color to variable m. The substitution method is deĄned in Coleman and Moré (1984, Equation 6.1).
We implement the substitution method using Algorithm 2. This algorithm completes the bottom row of the lower triangle, copies values to the corresponding column in the upper triangle, and advances upwards.

Software libraries
The coloring and substitution algorithms use the Eigen numerical library (Guennebaud, Jacob et al. 2010 Table 4: Computation times (milliseconds) for computing Hessians using the numDeriv and sparseHessianFD packages, and the Ąnite difference and complex step methods, across 500 replications. Rows are ordered by the number of variables.

Speed and scalability
As far as we know, numDeriv (Gilbert and Varadhan 2012) is the only other R package that computes numerical approximations to derivatives. Like sparseHessianFD it includes functions to compute Hessians from user-supplied gradients (through the jacobian function), and implements both the Ąnite differencing and complex step methods. Its most important distinction from sparseHessianFD is that it treats all Hessians as dense. Thus, we will use numDeriv as the baseline against which we can compare the performance of sparseHessianFD.
To prepare Table 4, we estimated Hessians of the log posterior density in Equation 21 with different numbers of heterogeneous units (N ) and within-unit parameters (k). The total number of variables is M = (N + 1)k. Table 4 shows the mean and standard deviations (across 500 replications) for the time (in milliseconds) to compute a Hessian using functions for both the Ąnite difference and complex step methods from each package. Times were generated on a compute node running ScientiĄc Linux 6 (64-bit) with an 8-core Intel Xeon X5560 processor (2.80 GHz) with 24 GB of RAM, and collected using the microbenchmark package (Mersmann 2014). Code to replicate Table 4 is available in the doc/ directory of the installed package, and the vignettes/ directory of the source package. In Table 4 we see that computation times using sparseHessianFD and considerably shorter than those using numDeriv.
To help us understand just how scalable sparseHessianFD is, we ran another set of simulations, for the same hierarchical model, for different values of N and k. We then recorded the run times for different steps in the sparse Hessian estimation, across 200 replications. The steps are summarized in Table 5. The times were generated on an Apple Mac Pro with a 12-core Intel Xeon E5-2697 processor (2.7 GHz) with 64 GB of RAM.
In the plots in Figure 4, the number of heterogeneous units (N ) is on the x-axis, and median run time, in milliseconds, is on the y-axis. Each panel shows the relationship between N and run time for a different step in the algorithm, and each curve in a panel represents a different

Measure Description
Function estimating the objective function Gradient estimating the gradient Hessian computing the Hessian (not including initialization or partitioning time) Partitioning Ąnding a consistent partitioning of the variables (the vertex coloring problem) Initialization total setup time (including the partitioning time) Table 5: Summary of timing tests (see Figure 4).
number of within-unit parameters (k).
Computation times for the function and gradient, as well as the setup and partitioning times for the sparseHessianFD object, grow linearly with the number of heterogenous units. The time for the Hessian grows linearly as well, and that might be partially surprising. We saw in Section 3.1 that adding additional heterogeneous units in a hierarchical model does not increase the number of required gradient evaluations. So we might think that the time to compute a Hessian should not increase with N at all. The reason it does is that each gradient evaluation takes longer. Nevertheless, we can conclude that the sparseHessianFD algorithms are quite efficient and scalable for hierarchical models.