Derivative Computations and Robust Standard Errors for Linear Mixed Effects Models in lme4

While robust standard errors and related facilities are available in R for many types of statistical models, the facilities are notably lacking for models estimated via lme4. This is because the necessary statistical output, including the Hessian and casewise gradient of random effect parameters, is not immediately available from lme4 and is not trivial to obtain. In this article, we supply and describe two new functions to obtain this output from Gaussian mixed models: estfun.lmerMod() and vcov.full.lmerMod(). We discuss the theoretical results implemented in the code, focusing on calculation of robust standard errors via package sandwich. We also use the Sleepstudy data to illustrate the code and compare it to a benchmark from package lavaan.


Introduction
Package lme4 (Bates, Mächler, Bolker, and Walker 2015) is widely used to estimate a variety of generalized linear mixed models.Despite its popularity, the package does not provide certain results related to derivatives of the likelihood, which makes it difficult to obtain robust standard errors and other statistical tests.This absence is partially related to the fact that lme4 does not directly estimate models via likelihood maximization, but rather employs a penalized least squares approach that leads to ML (or REML) estimates (Bates et al. 2015).While this approach eases model estimation, it also makes it more difficult to obtain derivatives (first and second) of the likelihood from a fitted model (which are required for, e.g., the Huber-White sandwich estimator).While it is possible to instead utilize the robust estimation methods from package robustlmm (Koller 2016), we are interested in directly using derivative-based methods that rely on estimation of the traditional model.Thus, the goal of this paper is to describe R package merDeriv, which contains functions that compute these derivatives for objects of class lmerMod.We also briefly discuss derivatives associated with models of class glmerMod, though we do not currently have code for these models (the computations are more difficult due to the need for numerical integration).
The paper proceeds as follows.We first describe general notation for the linear mixed model.Next, we derive expressions for the linear mixed models' casewise (observation level) and clusterwise (cluster/group level) first derivatives, along with the Hessian and Fisher information arXiv:1612.04911v2[stat.ME] 2 Nov 2017 matrix (including both fixed effect parameters and variances/covariances of random effects.Next, we illustrate the derivatives' application via the sleep study data (Belenky, Wesensten, Thorne, Thomas, Sing, Redmond, Russo, and Balkin 2003) included with lme4, comparing our results to a benchmark from lavaan (Rosseel 2012).This illustration includes computation of the Huber-White sandwich estimator (Eicker 1967;White 1980;Huber 1967) for linear mixed models with independent clusters/groups.Finally, we discuss further use and extension of our package's functionality.

Linear mixed model
Following Bates et al. (2015), the linear mixed model can be written as where y is the observed data vector of length n; X is an n × p matrix of fixed covariates; β is the fixed effect vector of length p; Z is an n × q design matrix of random effects; and b is the random effect vector of length q.
The vector b is assumed to follow a normal distribution with mean 0 and covariance matrix G, where G is a block diagonal matrix composed of varaiance/covariance for random effect parameters.The residual covariance matrix, R, is the product of the residual variance σ 2 r and an identity matrix of dimension n.We further define σ 2 to be a vector of length K, containing all variance/covariance parameters (including those of the random effects and the residual).Thus, the matrix G has (K − 1) unique elements.For example, in a model with two random effects that are allowed to covary, σ 2 is a vector of length 4 (i.e., K = 4).The first three elements correspond to the unique entries of G, which are commonly expressed as σ 2 0 , σ 0 σ 1 , and σ 2 1 .The last component is then the residual variance σ 2 r .Based on Equations 1, 2, and 3, the marginal distribution of the LMM is where Therefore, the marginal likelihood can be expressed as

Derivative computations for the linear mixed model
In this section, we first discuss analytic results involving the linear mixed model's first and second derivatives.We then illustrate how these derivatives can be obtained from an object of class lmerMod.

Scores
Based on the objective function from Equation 6, we derive the score function s i () for each observation w.r.t. the parameter vector ξ = (σ 2 , β) .We focus separately on σ 2 and on β below.

Scores for σ 2
The gradient with respect to the k th entry of σ 2 (k = 1, 2, 3, . . ., K) is (Stroup 2012, p. 136-137): where V is defined in Equation 5.This gradient sums over i, whereas the scores are defined for each observation i.Thus, to obtain the scores, we can remove the sums from the above equation.This is accomplished by replacing a trace operator with a diag operator, as well as replacing a matrix product with a Hadamard product (also known as elementwise/entrywise multiplication): In this way, the gradient of parameter σ 2 k (a scalar) becomes a n × 1 score vector.

Scores for β
For the fixed effect parameter β, the gradient is: The score vector s(β; y) can again be obtained by replacing the matrix multiplication by the Hadamard product: The full set of scores can then be expressed as a matrix whose columns consist of the results from Equations 8 and 10.
These equations provide scores for each observation i, and we can construct the clusterwise scores by summing scores within each cluster.In situations with one grouping (clustering) variable, the clusterwise scores can be obtained from our estfun.lmerMod()function via the default argument level = 2.The casewise scores, on the other hand, can be retrieved for all models via the argument level = 1.

Hessian/observed information matrix
The Hessian is the second derivative of the log-likelihood, noted as A in this paper.The negative of the Hessian is often called the observed information matrix or observed Fisher information.It is a sample-based version of the Fisher information.Because package lme4 does not provide a Hessian that includes both the fixed and variance/covariance of random effect (including residual variance) parameters, the derivation of this matrix requires special attention.
To obtain the Hessian, we can divide the matrix A into the following four blocks: , where β contains all fixed parameters and σ 2 contains all variance-covariance parameters (in variance-covariance scale) in the linear mixed model.To facilitate the analytic derivations, we index the above four blocks as: .
Block 1 is straightforward, which can be obtained by taking the derivative of Equation 9 w.r.t.β, which can be expressed as: Derivation of Block 4 is described in Stroup (2012) and can be written as where k 1 ∈ 1, . . ., K and k 2 ∈ 1, . . ., K.
Finally, Block 3 (which is the transpose of Block 2 ) can be seen as the derivative of Equation 7 w.r.t.β.Using the identity from Petersen and Pedersen (2012, p. 11, Eq. (86)), this allows us to derive Block 3 as The results obtained for A are similar to the derivation of Fisher information matrix, as described below.

Fisher/expected information matrix
The Fisher information matrix (or expected information matrix) is the expectation of the negative second derivative of the log likelihood, noted as A throughout the paper.It can often be obtained in R with the help of the vcov() function, but package lme4 only provides results for fixed effect parameters.Thus, we obtain the Fisher information w.r.t.all model parameters by taking the expectation of the negative of the Hessian matrix A .
Specifically, we can express the matrix A in the following four blocks as before.The only difference is the negative expectation operator.
, Following the same strategy, we index the above four blocks as: Because X and X T are considered constants, Block 1 is simply the negative of Block 1 shown as below: This analytic result is mathematically equivalent to the result provided by solve(vcov()) in lme4 (which only contains fixed effect parameters).
Derivation of Block 4 is also based on the result from Block 4 .In particular, following the expectation identity for Gaussian distributions from Petersen and Pedersen (2012, p. 43, Eq. (380)), the second term of Equation 12 can be transformed to tr Thus Block 4 is reduced to the form shown as below, which is also described in Stroup (2012).
Finally, Block 3 is the negative of the expectation of Block 3 .Using the expectation identity from Petersen and Pedersen (2012, p. 35, Eq. (312)), this allows us to derive Block 3 as ∂σ 2 ∂β = 0, which reflects asymptotic independence of β and σ 2 .Thus, we have expressed the necessary derivatives as functions of model matrices and derivatives of the marginal variance V .We can summarize the Fisher information matrix result for the LMM as: .
These results are equivalent to Equations 6.69 to 6.74 of McCulloch and Neuhaus (2001).
We can then invert the information matrix to obtain the variance-covariance matrix.In the vcov.lmerMod()function from merDeriv, we use the default argument full = TRUE to get the variance-covariance matrix w.r.t.all parameters in the model.If full = FALSE, the variance-covariance matrix w.r.t.only fixed parameters is returned.To switch between the observed and expected information matrix, we can supply the argument information = "observed" or information = "expected".The default option is "expected" due to its wider usage.

Relation to lmerMod objects
In this section, we describe how the quantities needed to compute the scores, Hessian, and Fisher information matrix can be obtained from an lmerMod object.The data and model matrices y, X, β, and Z can be obtained directly from lme4 via getME().The only remaining components, then, are V and ∂V /∂σ 2 .In the following, we focus on how to indirectly obtain these components.
In the lme4 framework, the random effects covariance matrix G is decomposed via (Bates et al. 2015): where Λ θ is a q × q lower diagonal matrix, called the relative covariance factor.It can be seen as a Cholesky decomposition of G/σ 2 r .The dimension of Λ θ is the same as that of G. Additionally, the position of σ 2 k in G is the same as the position of θ k in Λ θ .
Inserting Equation 18into Equation 5, we can express V as Equation 19 is mathematically equivalent to Equation 5, but it has computational advantages when, e.g., a random effect variance is close to 0.
Using Equation 5, the term ∂V /∂σ 2 k can usually be expressed as so long as σ 2 k is not the residual variance.The partial derivative ∂G ∂σ 2 k is then a matrix of the same dimension as G, with an entry of 1 corresponding to the location of σ 2 k and 0 elsewhere.

Because the location of σ 2
k within G matches its location within Λ θ , we can use Λ θ to facilitate computation of ∂V /∂σ 2 k .The only trick is that G is symmetric, whereas Λ θ is lower diagonal.
The code below illustrates implementation of this strategy, where object is a fitted model of class lmerMod.We use forceSymmetric() to convert the lower diagonal information from Λ θ into the symmetric G.
The above results are sufficient for obtaining the derivatives necessary for computing the Huber-White sandwich estimator and for carrying out additional statistical tests (see the Discussion section).In the following sections, we will describe the Huber-White sandwich estimator for linear mixed models with independent clusters, then provide an application.

Huber-White sandwich estimator
Let y c j contain the observations within cluster c j .If observations in different clusters are independent (as is the case in many linear mixed models), then we can write where J is the total number of clusters and () is defined in Equation 6.The first and second partial derivatives of w.r.t.ξ = (σ 2 β) can then be written as where represents the first derivative within cluster c j , which can be expressed as the sum of the casewise score s i () belonging to c j .The function s i () has also been studied in other contexts (e.g., Wang, Merkle, and Zeileis 2014;Zeileis and Hornik 2007).
Inference about ξ relies on a central limit theorem: where d − → denotes convergence in distribution.The traditional estimate of V (ξ) relies on Equation 23, whereas the Huber-White sandwich estimator of V (ξ) is defined as (e.g., Freedman 2006;White 1980;Zeileis 2006): where A = −E( ( ξ); y) and B = Cov( ( ξ; y)).The square roots of the diagonal elements of V are the "robust standard errors." When the model is correctly specified, the Huber-White sandwich estimator corresponds to the Fisher information matrix.However, the estimator is often used in non-i.i.d.samples to "correct" the information matrix for misspecification (e.g., Freedman 2006).While mixed models explicitly handle lack of independence via random effects, the Huber-White estimators can still be applied to these models to address remaining model misspecifications such as outliers in random effects or deviations from normality (Koller 2016(Koller , 2013)).
To construct the Huber-White sandwich estimator, A can be obtained from Equation 23, whose analytic expression for the linear mixed model is expressed in Section 3.2.The matrix B can then be constructed via (e.g., Freedman 2006): Thus, we require the derivations presented in the previous section: the "score" terms s i (ξ; y i ) (i = 1, . . ., n) and the information matrix using the marginal likelihood from Equation 6.

Application
In this section, we illustrate how the package can be used to obtain clusterwise robust standard errors for the sleepstudy data (Belenky et al. 2003) included in lme4.This dataset includes 18 subjects participating in a sleep deprivation study, where each subject's reaction time was monitored for 10 consecutive days.The reaction times are nested by subject and continuous in measurement, hence the linear mixed model.
We first load package lme4, along with the merDeriv package that is the focus of this paper.
R> library("lme4") R> library("merDeriv") Next, we fit a model with Days as the covariate, including random intercept and slope effects that are allowed to covary.There are six free model parameters: the fixed intercept and slope β 0 and β 1 , the random variance and covariances σ 2 0 , σ 2 1 , and σ 01 , and the residual variance σ 2 r .

Scores
The analytic casewise and clusterwise scores are obtained via estfun.lmerMod(),using the arguments level = 1 and level = 2, respectively.The sum of scores (either casewise or clusterwise) equals the gradient, which is close to zero at the ML estimates.The clusterwise scores are also provided by estfun.lavaan() in lavaan.Figure 1 presents a comparison between the clusterwise scores obtained from estfun.lmerMod() and estfun.lavaan(),showing they are nearly identical.The absolute difference between the scores obtained from these two packages is within 1.5 × 10 −7 .The sum of squared differences is within 2.2 × 10 −14 .

Variance covariance matrices
We also compare the variance covariance matrix calculated via our lme4 second derivatives to the vcov() output of lavaan.The results are displayed in Table 1.The maximum of Figure 1: Comparison of scores obtained via estfun.lavaanand estfun.lmerMod.In the left panel, the y-axis represents analytic, clusterwise scores obtained from estfun.lmerMod, and the x-axis represents clusterwise scores obtained from estfun.lavaan.The dashed line serves as a reference line as y = x; In the right panel, the y-axis represents the difference between the scores obtained via estfun.lavaanand estfun.lmerMod,and the x-axis represents the clusterwise scores obtained from estfun.lavaan.The dashed line serves as a reference line as y = 0; q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q −0.3 −0.1 0.1 0.3 −0.3 −0.1 0.1 0.2 0.3 estfun.lavaanestfun.lmerModq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q −0.3 −0.1 0.1 0.3 −1.0e−07 0.0e+00 1.0e−07 estfun.lavaandifference the absolute difference for all components in the variance covariance matrix is 0.07.This minor difference is due to the fact that lavaan applies the delta method to compute the Fisher information matrix for defined parameters (Rosseel 2012;Oberski et al. 2014).In contrast, merDeriv utilizes analytic expressions.This difference is ignorable due to the small relative difference (within 10 −6 ). for the sleepstudy data.The first two columns describe the specific matrix entry being compared, the third and fourth columns show the estimates, the fifth and sixth column shows the absolute and relative difference.
Finally, the clusterwise Huber-White sandwich estimator is shown in Table 2, which is comparable to the one provided by lavaan.The maximum of the absolute difference for all components in the variance covariance matrix is 0.05.The minor difference is again caused by the aforementioned reasons.Table 2: Comparison of the sleepstudy sandwich estimator obtained from our merDeriv code with the analogous estimator obtained from lavaan.The first two columns describe the specific matrix entry being compared, the third and fourth columns show the estimates, the fifth and sixth column shows the absolute and relative difference.

Discussion
In this paper, we illustrated how to obtain the Huber-White sandwich estimator of estimated parameters arising from objects of class lmerMod with independent clusters.This required us to derive observational (and clusterwise) scores for fixed and random parameters (leading to the "meat") as well as a Fisher information matrix that included random effect variances and covariances (leading to the "bread").In the discussion below, we address extensions to related statistical metrics and models.

Restricted maximum likelihood (REML)
While we focused on linear mixed models estimated via maximum likelihood (ML), extension to restricted maximum likelihood (REML) is straightforward.The central idea of REML is to maximize the likelihood function of variance parameters after accounting for the fixed effects.By using this approach, the downward bias of ML for variance estimates can be eliminated (similarly to division of n versus n − 1 in simple variance calculations), so REML is used often in LMM applications (Stroup 2012).
Referring to the sleepstudy example, package merDeriv can provide scores (estfun) and variance covariance matrix (vcov) based on the REML likelihood function and corresponding estimates.The fixed effects parameters are equivalent for ML and REML, whereas the corresponding vcov components are larger based on REML.For example, the ML variance for the estimated fixed intercept is 43.99 whereas the REML variance for the estimated fixed intercept is 46.57.

Statistical tests
The scores derived in this paper can potentially be used to carry out a variety of score-based statistical tests.For example, the "fluctuation test" framework discussed by Zeileis and Hornik (2007), Merkle and Zeileis (2013), and others generalizes the traditional score (Lagrange multiplier) test and is used to detect parameter instability across orderings of observations.The tests have been critical for the development of model-based recursive partitioning procedures available via packages such as partykit (Hothorn and Zeileis 2015).
The code that we present here facilitates application of score-based tests to linear mixed models, because the tests described in the previous paragraph are available via object-oriented R packages.That is to say the aforementioned packages can be applied to linear mixed models estimated via lme4, because we have supplied the generic function estfun for models of class lmerMod.A challenge involves the fact that much of the above theory requires observations to be independent.For the linear mixed models with independent clusters, tests can often be applied immediately.However, while we can test parameter instability across independent clusters, it is more difficult to test for instability across correlated observations within a cluster.A related issue, further described below, arises when we attempt to apply sandwich estimators to models with crossed random effects.

Models with multiple random effects terms
The "independence" challenges described in the previous section translate to the setting of models with multiple random effects terms, such as (partially) crossed random effects or models with multilevel nested designs (e.g., three-level models; Bates 2010, Ch.2).These correspond to situations where there are at least two unique variables defining clusters (for example, clusters defined by primary school attended and by secondary school attended).In this case, we cannot simply sum scores within a cluster to obtain independent, clusterwise scores.This is because observations in different clusters on the first grouping variable may be in the same cluster on the second grouping variable.Thus, it is unclear how the statistical machinery developed for independent observations (e.g., robust standard errors, instability tests) can transfer to models with partially crossed random effects.While our estfun.lmerMod()code can return casewise scores and vcov.lmerMod()can return the full variance covariance matrix of all lmerMod objects, it is unclear how to further use these results.
The main difficulty involves construction of the "meat", which is the variance of the first derivatives based on the grouping variable.One possible solution is to create separate "meats" based on different grouping variables, accounting for covariances between the meats.This approach is described in Rasbash and Goldstein (1994) and Cameron, Gelbach, and Miller (2011) to decompose parameter variances when there are multiple grouping variables.It may be possible to apply the same idea to our problem, and we plan to study this in the future.

GLMM
Finally, the procedures described here for scores, Hessians, Fisher information and sandwich estimators can be extended to generalized linear mixed models estimated via glmer().The technical difficulty involved with this extension is the observational scores.In the linear mixed model, we can derive the analytic scores for each observation because we know that the marginal distribution is normal.In the GLMM, the marginal distribution is typically unknown, and we require integral approximation methods (e.g., quadrature or the Laplace approximation) to obtain the scores and second derivatives.Combination of these integral approximation methods with the lme4 penalized least squares approach presents a challenge that we have not yet overcome.We plan to do so in the future.

Acknowledgments
This research was partially supported by NSF grant 1460719.