HLMdiag : A Suite of Diagnostics for Hierarchical Linear Models in R

Over the last twenty years there have been numerous developments in diagnostic procedures for hierarchical linear models; however, these procedures are not widely implemented in statistical software packages, and those packages that do contain a complete framework for model assessment are not open source. The lack of availability of diagnostic procedures for hierarchical linear models has limited their adoption in statistical practice. The R package HLMdiag provides diagnostic tools targeting all aspects and levels of continuous response hierarchical linear models with strictly nested dependence structures ﬁt using the lmer() function in the lme4 package. In this paper we discuss the tools implemented in HLMdiag for both residual and inﬂuence analysis.


Introduction
Nested data structures-observations organized in non-overlapping groups-arise naturally from numerous data collection schemes. These structures occur when individuals are observed over time (longitudinal repeated measures data); when a field is subdivided into smaller plots on which a treatment is applied (split plots); or when a stratified sampling scheme is used, such as when sampling students within schools within districts (multilevel data). When data are organized in this manner it is clear that the observations are no longer independent, so any statistical model used must allow for a more general dependence structure where observations belonging to the same group can be correlated. Hierarchical linear models (HLMs)-also referred to as mutlilevel models, mixed effects models, random coefficients models, and random effects models-allow for such a dependence structure. HLMs incorporate parameters associated with the global trend-the fixed effects-and parameters associated with the individual observations-the random effects-that govern the variance-covariance structure of Residual HLM MLwiN SuperMix PROC MIXED xtmixed gllamm nlme lme4 HLMdiag Marginal Level-1 Table 1: Overview of readily available residuals for commonly used statistical software. Note that HLM and HLMdiag can calculate both least squares and empirical Bayes residuals (we denote this by * in the above table). Also, HLMdiag cannot calculate least squares residuals for cross-classified models, but can calculate empirical Bayes residuals (we denote this by c in the above table). the model. Compared to the linear model, additional complexities are introduced in the process of both model fitting and model checking due to the dependence structure and the incorporation of explanatory variables from each level of the data hierarchy. For example, in the analysis of exam scores, observations may have been collected on both the student (the individual or level-1 unit) and the school level (the group or level-2 unit).
For the linear model fit by ordinary least squares, residual analysis and influence analysis are well-established staples both in practice and in the literature (Belsley, Kuh, and Welsch 1980;Cook and Weisberg 1982). In the last twenty years there have been numerous developments in diagnostic procedures for HLMs, which have primarily focused on the formulation of deletion diagnostics (e.g., Cook's distance), leverage, and outlier detection at each level of these models. We refer the reader to Loy and Hofmann (2013) for a recent review of available diagnostics for HLMs. While these developments greatly improve an analyst's ability to check a fitted model, the incorporation of diagnostics into statistical software has lagged behind.
As noted by West and Galecki (2011), there are many software programs and packages capable of fitting HLMs: some are specialized programs dedicated only to this class of model while others are add-ons to general statistical software packages. Examples of specialized programs include HLM, MLwiN, and SuperMix (Raudenbush, Bryk, Cheong, Condon, and du Toit 2011;Rasbash, Steele, Browne, and Goldstein 2012;Hedeker, Gibbons, Toit, and Patterson 2008), and examples of package add-ons include PROC MIXED in SAS (SAS Institute Inc. 2008), xtmixed and gllamm in Stata (StataCorp 2007;Rabe-Hesketh, Skrondal, and Pickles 2004), and nlme and lme4 in R (Pinheiro, Bates, DebRoy, Sarkar, and R Core Team 2013;Bates, Maechler, and Bolker 2013a;R Core Team 2013). Residual analysis is well developed for all of the above programs and packages (Table 1), however, influence analysis is strikingly underdeveloped (Table 2). Currently, SAS is the only program to provide some tools to diagnose each aspect of the model.
In R, there are packages that work toward an exhaustive diagnosis of a fitted model, but none are complete. The LMERConvenienceFunctions package (Tremblay and Ransijn 2013) provides model criticism plots based on the level-1 residuals through the function mcp.fnc, and the influence.ME package (Nieuwenhuis, Pelzer, and te Grotenhuis 2013;Nieuwenhuis, te Grotenhuis, and Pelzer 2012) provides access to influence measures for the fixed effects parameters for models fit using the lme4 package. As seen in Tables 1 and 2, HLMdiag fills the need for accessible diagnostics for HLMs in R, implementing a unified and complete framework to access influence diagnostics and residuals. The package requires that models have strictly nested dependence structure (for full functionality) and are fit using lmer() in Table 2: Overview of readily available tools for influence analysis for commonly used statistical software. FE denotes diagnostics for the fixed effects and VC denotes diagnostics for the variance components. Note that a ' * ' indicates that the specified diagnostics are available for higher-level units in gllamm, and a 'c' indicates that the specified diagnostics are also available for cross-classified models in HLMdiag.
the package lme4.
Next, we introduce the notation for HLMs and the data example that is used throughout the paper.

Hierarchical linear models
The discussion throughout this paper focuses on two-level HLMs for ease of explanation, however, it should be noted that the tools provided by HLMdiag can be used with higherlevel models. The two-level HLM can be formulated through two equations specifying the within-group (level-1) and between-group (level-2) models In the above equations i = 1, . . . , m denotes the group, y i is an n i × 1 vector of outcomes, Z i is an n i × q design matrix of level-1 explanatory variables, β i is a q × 1 vector of unknown fixed parameters, W i is a q × p design matrix of level-2 explanatory variables, γ is a p × 1 vector of fixed effects, and b i is a q × 1 vector of random effects. Additionally, we will assume that errors are independent normal between groups and different levels, that is, Combining the within-and between-group models we obtain a form of the linear mixed model (cf., e.g., Pinheiro and Bates 2000): where X i = Z i W i and β = γ from the two-level formulation. The HLM is extended to more levels by incorporating additional random effects associated with the higher-level units.
For general references on HLMs we refer the reader to Kreft and de Leeuw (1998), Raudenbush and Bryk (2002), Goldstein (2011), Hox (2010, and Snijders and Bosker (2012) who present these models in a social science framework. A more general treatment of these models can be found in Pinheiro and Bates (2000), McCulloch and Searle (2001), and Demidenko (2004).

Exam data
For illustrative purposes we make use of data on exam scores of 4,059 students in 65 inner-London schools. This data set is distributed as part of the R package mlmRev (Bates, Maechler, and Bolker 2013b), which makes well known multilevel modeling data sets available in R, and is analyzed in detail by Goldstein, Rasbash, Yang, Woodhouse, Pan, Nuttall, and Thomas (1993) and more recently by Leckie and Charlton (2013).
R> data("Exam", package = "mlmRev") R> head ( For each student, the data consist of their gender (sex) and two standardized exam scoresan intake score on the London Reading Test (LRT) at age 11 (standLRT) and a score on the General Certificate of Secondary Education (GCSE) examination at age 16 (normexam). Additionally, the students' LRT scores were used to segment students into three categories (bottom 25%, middle 50%, and top 25%) based on their verbal reasoning subscore (vr) and overall score (intake). At the school level, the data contain the average intake score for the school (schavg) and type based on school gender (schgend, coded as mixed, boys, or girls).
Throughout Sections 2 and 3 we explore the relationship between a student's intake score and their achievement on the GCSE examination. In Section 2 we focus on the use of residuals for model selection and validation, and in Section 3 we search for influential students and schools.

Residual analysis
The presence of multiple sources of variability in HLMs results in numerous quantities defining residuals. For this paper we will follow the classification by Hilden-Minton (1995) and define three types of residuals (for a more general discussion of the types of residuals for linear models we refer the reader to Haslett and Haslett 2007): Note that these residuals are by definition confounded as they are interrelated. This confounding of the residuals can lead to complications in the diagnosis of model deficiencies, since a violation in one type of residual may manifest itself as an alleged violation in a different residual, so an analyst must be cautious. To cope with these confounded residuals Hilden-Minton (1995) recommends an upward residual analysis, as it is possible to examine level-1 residuals that are unconfounded by other residuals (details below) while this is impossible in a downward residual analysis. This is the approach that we will follow in this section, starting with a discussion of level-1 residuals, followed by a discussion of level-2 residuals.

Level-1 residuals
The definition of the level-1 residuals leads to different residuals depending on how β and b i are estimated:

Least squares (LS):
Fitting separate linear models to each group and using LS to estimate β and b i leads us to a first set of residuals. The benefit of this estimation procedure is that residuals depend only on the lowest level of the hierarchy (level 1); thus the LS residual is unconfounded by other residuals (Hilden-Minton 1995). While LS residuals are unconfounded by the level-2 residuals, it is important to remember that for small within-group sample sizes the LS residuals will be unreliable. In such cases empirical Bayes residuals should be used.
The LS level-1 residuals are calculated by fitting separate LS regression models for each group and obtaining the residuals. In HLMdiag, LS models are fit using lmList() in lme4 if there are no categorical explanatory variables that take on constant values within the grouping factor. When a categorical explanatory variable does take on a constant value within the grouping factor, separate LS models can still be fit, where the intercept simply absorbs the coefficient of the constant explanatory variable. HLMdiag automates this process with the function adjust_lmList().

Empirical Bayes (EB):
The EB (shrinkage) residuals are defined as the conditional modes of the b i s given the data and the estimated parameter values (which can be found either by maximum likelihood or restricted maximum likelihood). The EB residuals at each level are interrelated, which makes us prefer LS residuals over the EB residuals at level-1, unless small within-group sample sizes prevent the use of LS residuals.
For higher levels of the models our preference will be reversed: once the assumptions at the lower level are checked, and the issue of confounding is taken care of, we suggest the use of EB residuals over LS residuals. This is discussed in more detail in Section 2.2.
For 'merMod' objects, resid() returns the raw residuals from the model, that is, y i − y i , where y i = X i β − Z i b i are the predicted conditional mean responses. The estimate b i calculated by lmer() is an EB estimate; thus, resid() is an object specific function to extract the EB residuals from an 'merMod' object.
We will highlight some of the functionality in the HLMdiag package for model building and validation using the exam data previously introduced. We fit an initial two-level HLM using a student's standardized London Reading Test intake score (standLRT) to explain their GCSE exam score allowing for a random intercept for each school: To do this we set level = 1 and type = "LS". The standardized level-1 residuals are given by where ∆ i is a diagonal matrix with elements equal to the diagonal of VAR( ε i ). Specifying standardize = TRUE indicates that the standardized residuals should also be returned. Alternatively, we can specify standardize = "semi", which requests that the semi-standardized residuals (explanation below) be returned. For LS level-1 residuals a data frame is returned consisting of the model frame, LS residuals, fitted values, and, if requested, standardized residuals. R> qplot(x = standLRT, y = LS.resid, data = resid1_fm1, + geom = c("point", "smooth")) + ylab("LS level-1 residuals") suggests that standardized LRT scores may not be linearly related to GCSE exam scores. Likelihood ratio tests (not shown) confirm that quadratic and cubic terms for standLRT contribute significantly in describing GCSE exam scores, so we incorporate these terms in the updated model, fm2.
R> fm2 <-lmer(normexam~standLRT + I(standLRT^2) + I(standLRT^3) + + (1 | school), Exam, REML = FALSE) To check for homoscedasticity of the level-1 residuals, one strategy is to plot residuals against explanatory variables or any other potentially meaningful order of the points. For each group, i, the LS level-1 residuals, from the LS model fit. In order to target the assumption of homoscedastic level-1 residuals we make use of the semi-standardized residuals (Snijders and Berkhof 2008) The semi-standardized level-1 residuals are calculated from model fm2 below: R> qplot(x =`I(standLRT^2)`, y = semi.std.resid, data = resid1_fm2) + + geom_smooth(method = "lm") + ylab("semi-standardized residuals") + + xlab("standLRT2") Figure 3: These twenty plots display the semi-standardized residuals from a hierarchical model against one of the predictor variables. The plot of the real data is randomly embedded among nineteen simulated plots. Which is the real plot?
and indicates a potential problem with heteroscedasticity. To further investigate this issue, we use visual inference through the use of the lineup protocol proposed by Buja, Cook, Hofmann, Lawrence, Lee, Swayne, and Wickham (2009) (see Section A in the appendix for additional details and code). Figure 3 displays this lineup. We find the plot of the original data to be indistinguishable (refer to the appendix for the position of the data plot) from the plots generated from the simulated data, indicating the perceived structure in the residual plot is an artifact of the sparsity of data for large values of standLRT; thus, we may proceed with the analysis without the need for remedial measures. An alternative way to check homoscedasticity of level-1 residuals is to use boxplots of the level-1 residuals by group to assess within-group homoscedasticity. If the assumption of within-group homoscedasticity is plausible, then the boxplots should exhibit roughly constant interquartile ranges (IQRs). We omit this approach for brevity. and shows that the semi-standardized residuals appear normal except for the very low values where the values are larger in absolute value than expected; however, this discrepancy is quite small and does not offer much evidence against the assumption of normality.
In an exploratory setting we would go through this cycle of residual analysis, identification of explanatory variables, and model fitting multiple times until a satisfactory level-1 model is found (Tukey 1977). Since LS level-1 residuals come from LS fits, we can pair the evaluation of the goodness-of-fit of a model with the regular tools we have in this situation; such as comparisons of nested models (through F tests or ANOVA) or stepwise regression diagnostics.
Carrying out this iterative process for the Exam data (not shown) results in the inclusion of student gender in the model. Additionally, including a random slope for standardized LRT, allowing for the strength of relationship between the two exams to vary across schools, was found to significantly improve the model.

Level-2 residuals
The level-2, or random effects, residuals are defined as Z i b i or, more commonly, b i . Obviously, the method of estimation impacts on the value of this residual. Again, LS and EB are the two methods of estimation. We will discuss both briefly: 1. Least squares (LS): Rearranging Equation 2, we see that this estimate is of the form where β i is a vector of estimates from separate LS models and W i γ is the estimated global trend. HLMresid() calculates these residuals using adjust_lmList() to fit the separate LS models whose coefficients are then compared to the global trend.

Empirical Bayes (EB):
b i is estimated using the conditional mode given the data and the estimated parameter values. EB estimates of b i can be obtained directly from an 'merMod' object using ranef(), or by using HLMresid().
As stated above, if an upward residual analysis is followed, then we prefer the use of EB residuals at level 2. Our preference stems from the fact that the LS residuals are far more variable than the corresponding residuals, so exploratory plots of omitted variables will exhibit weaker associations than the corresponding plots of EB residuals, and that for small sample sizes LS residuals are untrustworthy or unavailable.
The level-2 residuals are used to identify additional explanatory variables that contribute significantly to the model, check linearity of the level-2 explanatory variables, and investigate whether the level-2 residuals follow a normal distribution.
To obtain the level-2 EB residuals from model fm3, we use the following code: R> resid2_fm3 <-HLMresid(object = fm3, level = "school") R> head(resid2_fm3) where the level refers to the name of the grouping factor contained in the data set. Notice that we did not need to specify type = "EB" as it is the default setting. This returns a data frame with columns corresponding to the EB estimates of the random intercept and slope.
Boxplots of the EB residuals for the intercept grouped by school gender (schgend, the left side of Figure 5) show schgend is useful in explaining some of the between-school variability  Figure 5: Plots of the level-2 EB residuals for the intercept plotted against the omitted explanatory variables schgend and schavg. We construct boxplots for schgend to appropriately display a categorical variable and a scatterplot with a smoother to display a continuous variable. The boxplots for schgend are not all centered around zero, indicating that the variable contains information useful in describing the between-school variation in exam scores. Similarly, we observe a positive association in the scatterplot for schavg.
and should be incorporated into the model. Similarly, a scatterplot of the EB residuals for the intercept against the average intake score (schavg, on the right side of Figure 5) exhibits positive association, indicating that schavg should be incorporated into the model. Below, we add the two variables as fixed effects.
The assumption of normality of residuals is assessed by normal quantile plots for each of the level-2 residual vectors. Figure 6 shows normal quantile plots of the level-2 EB residuals for both the intercept (left side) and slope (right side) terms, neither of which shows evidence of a deviation from normality. One outlier is seen in the plot for the random slope and is determined to be school 53. An inspection of the records for this school did not yield any immediately apparent anomalies, but we will further investigate this school in the discussion on influential data points.

Marginal residuals
Marginal residuals are obtained by plugging in the estimate of β, β, into the definition and are calculated from an 'merMod' object using HLMresid() specifying level = "marginal". These residuals can be used for diagnostics as they would be in single-level linear models; however, as these residuals are the sum of the level-1 and level-2 residuals, any problems exhibited must be accompanied by analysis of the other types of residuals to pinpoint the source of the problem. One situation in which the marginal residuals are uniquely valuable is in assessing the marginal covariance structure, such as in repeated measures and longitudinal data, as the marginal residuals, ζ i , and observed values, y i , have the same covariance structure.

Addressing residual deficiencies
In the above example, we did not observe significant model deficiencies, but if any had been observed remedial measures would have been necessary. In this section we briefly discuss such measures available for the HLM.
To correct for nonlinearity, heteroscedasticity, or nonnormality, transformations of either the response variable or explanatory variables may prove helpful (Snijders and Bosker 2012, Chapter 10 Gurka, Edwards, Muller, and Kupper (2006) and Goldstein, Carpenter, Kenward, and Levin (2009), who discuss how to use the Box-Cox transformation to correct for nonnormal distributions.
While transformations present a rather straight forward approach to addressing model deficiencies it may be preferrable to reformulate the model with weaker distributional assumptions. This approach has the advantage that the data will be represented in the original scale of the problem, retaining greater interpretability. If heteroscedasticity in the residuals is discovered, the model assumptions can be weakened to allow the residual variance to depend upon some explanatory variable (Snijders and Bosker 2012, Chapter 8). Currently, it is not possible to model the residual variance as a function of covariates using lmer(); however, this is possible using lme() in the R package nlme (we refer the reader to Pinheiro and Bates 2000, Section 5.2 for details). If nonnormal residual distributions are discovered, then the distributional assumptions can be reformulated to more adequately represent the data, though model estimation may become more challenging. Alternatively, strong distributional asssumptions on the random effects can be avoided through the use of semiparametric or nonparametric methods (Shen and Louis 1999;Zhang and Davidian 2001;Ghidey, Lesaffre, and Eilers 2004), or mixtures of normal distributions (Verbeke and Lesaffre 1996). We refer the reader to Ghidey, Lesaffre, and Verbeke (2010) for a recent review of these methods.
Although model reformulation leads to a more accurate representation of the data generating mechanism, for analyses focused on estimation rather than prediction it may not be necessary. An alternative approach useful when normality is violated is the use of robust 'sandwich' estimators of the standard errors (Verbeke and Lesaffre 1997;Yuan and Bentler 2002), assuming sample sizes are large enough in the highest-level of the model.

Influence analysis
Influence analysis consists of systematically investigating whether some observation, or group of observations, is given disproportionate importance in model estimation, and, consequently, on the conclusions made from the analysis. Such observations are deemed influential, and the analyst must understand what impact these influential points have on the fitted model.
The most straightforward way to assess this influence is through the use of deletion diagnostics. Deletion diagnostics are statistics that quantify the change in a parameter estimate when some subset of the data is deleted. These diagnostics are well documented in the literature for regression models (Belsley et al. 1980;Cook and Weisberg 1982) and are widely available in statistical software; however, they are less established for HLMs (see Table 2).
Compared to the classical regression model, an additional complication is introduced by the hierarchical structure of the data for this class of models. Hierarchical data contain natural clusters of observations, whereas the linear regression model assumes that observations are independent, resulting in the need for multiple deletions to assess the influence of both individual observations and clusters of observations. Specifically, in the case of a two-level model, we refer to the deletion of an individual observation as a level-1 deletion and the deletion of an entire level-2 group as a level-2 deletion. Note that these definitions extend naturally upward for models with additional levels in the hierarchy.
In this section we describe the implementation of diagnostics to assess changes in the estimation of the variance components using relative variance change, the estimation of the fixed effects using Cook's distance and a multivariate version of DFFITS (Belsley et al. 1980), the precision of the fixed effects estimates using the covariance ratio and trace, and the fitted values using leverage. These quantities are used to assess the influence in both level-1 and level-2 units. First, we will consider deletion diagnostics for the fixed effects of a HLM, but note that in an analysis we would begin with diagnostics for the variance components as the diagnostics for fixed effects require a specified covariance matrix. We reverse the order here for ease of explanation.

Changes in parameter values
Two statistics that are commonly used for measuring changes in fixed effects are Cook's distance and MDFFITS, a multivariate version of the DFFITS statistic. Both statistics measure the distance between the fixed effects estimates obtained from the full data and those obtained from the reduced data, and are generalized for the HLM as follows (see Christensen, Pearson, andJohnson 1992, andSchabenberger 2004): where p is the rank of X and β (i) is the estimate of β when the ith unit is deleted. Note that these definitions are general enough to allow for deletion at any level-for example, for a two-level model, in the case of level-1 deletion, i refers to an individual, whereas for level-2 deletion i refers to a group.
The difference between the two statistics is that Cook's distance scales the change in the parameter estimates by the estimated covariance matrix of the original parameter estimates, while MDFFITS is scaled by the estimated covariance matrix of the deletion estimates. This means that computation of Cook's distance only requires the covariance from the original fitted model while computation of MDFFITS requires the covariance structure to be reestimated in the absence of the ith unit and the inverse to be recalculated.
Regardless of the statistic used, larger values indicate higher levels of influence. In the case of unknown covariance structure we do not have an exact reference distribution for these statistics, so instead of relying on an approximation based on a large sample asymptotic result, we adhere to the use of measures of relative standing to determine which units should be considered for further investigation. For example, we might consider units associated with statistics that are more than 1.5×IQR or 3×IQR above the third quartile (Q 3 ) as extreme for diagnostics where large values are indicative of a problem. In practice we have found the more conservative cutoff, Q 3 + 3 × IQR, to be a more useful starting point for such an investigation, especially for large sample sizes, as it identifies "extreme" values of the empirical distribution of these diagnostics rather than all outlying values. An alternative strategy is to plot the statistic and visually identify unusual values based on gaps in the empirical distribution.
HLMdiag implements both Cook's distance and MDFFITS for 'merMod' objects. For our example we compute the level-2 (school-level) deletion statistics of model fm4 using the below code.
To evaluate diagnostic values, we use dotplots-or a modified version of them. The dotplot is modified by grouping all "non-influential" units-as identified by the values of the diagnosticinto one group and displaying the influential groups as single cases. For the modified version  Cook's distance school q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q of the dotplot, HLMdiag provides two types of modification for displaying the non-influential units: a dotplot (the right panel of Figure 7) or a boxplot (Figure 8). This type of plot allows us to see the overall distribution of the diagnostic while focusing on the influential points. Since this should be a commonly used plot, we provide the function dotplot_diag() using the plotting tools of ggplot2 (Wickham 2009).
R> dotplot_diag(x = cooksd_fm4, cutoff = "internal", + name = "cooks.distance") + ylab("Cook's distance") + xlab("school") R> dotplot_diag(x = cooksd_fm4, cutoff = "internal", name = "cooks.distance", + modify = "dotplot") + ylab("Cook's distance") + xlab("school") Figure 7 displays a dotplot (left) and modified dotplot (right) of Cook's distance for level-2 deletion. To obtain the modified dotplot the argument modify = "dotplot" is added to the above call and a cutoff must be specified. The boxplot modification of the dotplot can be obtained by specifying modify = "boxplot" instead. Here, we use the cutoff previously discussed by setting cutoff = "internal", but other numeric cutoffs can be manually specified using this argument. When cutoff = "internal" is specified a name is required, which should be one of the following: "cooks.distance", "mdffits", "covratio", "covtrace", "rvc", or "leverage". Using measures of internal standing, school 25 is flagged as influential, warranting further investigation into the change in the parameter vector when school 25 has been deleted. Below, we show how to access the change in the parameter vector associated with the deletion of school 25.

Precision of fixed parameters
The covariance matrix of β gives insight into the precision of the parameter estimates. Both the covariance trace (COVTRACE, Christensen et al. 1992) and the covariance ratio (COV-RATIO) are measures of how precision is effected by the deletion of unit i. Again, we make use of a general definition that allows us to examine level-specific dependencies at a later point: Both statistics compare the covariance matrices of β where β is estimated with and without unit i. Taking the covariance matrix of β with unit i as the baseline, COVTRACE compares the ratio of the two matrices to the p × p identity matrix, which has a trace of p. COVRATIO directly compares the volume of the matrices via their determinants. In the case that unit i is not influential, the covariance trace will be close to zero, while the covariance ratio is close to one. As with Cook's distance and MDFFITS, we recommend the use of measures of relative standing or the visual identification of gaps in the empirical distribution to evaluate how far the statistics must deviate from zero or one to be considered influential.
We calculate COVTRACE and COVRATIO for model fm4 by R> covratio_fm4 <-covratio(fm4, group = "school") R> covtrace_fm4 <-covtrace(fm4, group = "school") An investigation of the results reveals that no schools are influential with respect to the precision of the fixed effects estimates.

Diagnostics for variance components
Let θ denote the vector of variance components, that is, the vector containing the residual variance, σ 2 , and the unique entries of D. The deletion diagnostics presented for fixed effects can be adapted to the variance components by substituting θ and θ (i) in place of β and β (i) . Equations 10 through 13 display the analogs of the previously discussed diagnostics for variance components.
Note that the formulas for Cook's distance and MDFFITS (Equations 10 and 11) no longer contain division by the rank of X, and in Equation 12 q denotes the rank of the covariance matrix of VAR( θ) (see Christensen et al. 1992 andSchabenberger 2004 for a discussion).
Computationally, these diagnostics are more challenging as they are based on an estimate of the covariance matrix-such as the inverse Hessian matrix-for the variance components, which is not readily available for 'merMod' objects. While it would be possible to get an estimate of the covariance matrix for variance components using a parametric bootstrap, this would significantly increase the computational complexity. Instead, we focus on diagnostics that do not require an estimate of the covariance matrix, allowing the direct use of output from lmer().
One diagnostic measure that meets this requirement is the relative variance change (RVC) (Dillane 2005), which measures the change in estimates of the th variance component, θ , with and without unit i. RVC is defined as where θ (i) is the estimate of the variance component when the ith unit is deleted, and θ is the estimate of the variance component obtained for the full data. RVC is close to zero when unit i does not have a large influence on the variance component.
For model fm4, RVC is calculated for each school as R> rvc_fm4 <-rvc(fm4, group = "school") R> head(rvc_fm4) The command rvc returns a matrix with named columns for each variance component, where sigma2 is the residual variance, σ 2 , and D** denotes the unique entries of D where the trailing digits denote the position in the matrix. In this example, D11 is the variance associated with the random intercept for schools, D22 is the variance associated with the random slope for standardized LRT score, and D21 is the covariance associated with the random slope and random intercept. Through the use of internal scaling, schools 7 and 53 are identified as influential and warrant further investigation: school 53 appears to be a school with very good students (top verbal reasoning intake scores and above median exam and standLRT scores), and school 7 appears to "pull students up" (mediocre verbal reasoning intake scores and median standLRT below overall median but exam scores higher than the overall median).

Diagnostics for fitted values
In addition to exploring how subsets of observations directly impact the model parameters, it is also of interest to explore whether these observations are unusual with regard to the fitted values and explanatory variables. This is done by exploring the leverage of subsets of interest. As with linear regression, leverage can be defined as the rate of change in the predicted response with respect to the observed response (Demidenko and Stukel 2005;Nobre and Singer 2011). Formally, assuming that VAR(y i ) = σ 2 V i is fixed, the leverage at level i can be defined as In the above definition, leverage is described in two parts, which we refer to as the leverage associated with the fixed effects, H 1i , and the leverage associated with the random effects, H 2i . The leverage associated with the random effects depends on the leverage associated with the fixed effects; thus, using H 2i we are unable to separate the two components. Alternatively, the random effects leverage can be defined as which is unconfounded by the fixed effects (Nobre and Singer 2011).
Using Equations 15 and 16, we define the leverage of observation j in group i to be equal to the jth diagonal element of the leverage matrix of interest, and the leverage of group i to be the mean of the diagonal elements of the leverage matrix of interest. To reflect the plurality of statistics that can be defined as "leverage" in a hierarchical model, leverage() in HLMdiag returns numerous quantities: the overall leverage (overall, H), the fixed effects leverage (fixef, H 1 ), the random effects leverage (ranef, H 2 ), and the unconfounded random effects leverage (ranef.uc, H * 2 ).
R> leverage_fm4 <-leverage(fm4, level = "school") R> head(leverage_fm4) From an investigation of the resulting leverage for the fixed effects (fixef) we find that schools 37, 48, and 54 have high leverage. Interestingly, no schools are flagged as having high leverage on the random effects when using the unconfounded version (ranef.uc), while schools 48 and 54 are flagged when using the confounded version (ranef)-this supports our preference for investigation of the unconfounded version of leverage. With a more thorough investigation of the schools, we determine the schools are not influential. The flagged schools are near the extremes of the average intake scores, explaining why they were flagged, but none of the schools deviate much from the overall trend.

Addressing influential and outlying units
Having identified potential influential and outlying units, we consider modeling approaches to appropriately represent these units. First, when an influential or outlying unit is identified it is important to carefully explore the values of the response and explanatory variables for data entry errors and other peculiarities. If the identified units appear to be different with respect to some observed explanatory variable one approach is to include a dummy variable in the model explaining the apparent difference (for an example of this approach we refer the reader to Langford and Lewis 1998). This can also be used to adjust the intercept when no such explanatory variable is found. Occassionally, a unit may be detected that is from a population other than that of interest, in which case deletion of this unit from the data set is a viable option.
An alternative approach to address the issue of outlying and influential units is the use of robust methods protecting against the influence of such units. The use of heavy-tailed distributions for the residuals, such as the t distribution, to protect against the impacts of outlying units have been proposed by Pinheiro, Liu, and Wu (2001) and Staudenmayer, Lake, and Wand (2009). Additionally, Demidenko (2004, Section 4.4) discusses alternative approaches to robust modeling.

Package description
In this section we provide additional description of the functions provided by HLMdiag. Tables 3 and 4 outline the main functions for residual and influence analysis included in the package accompanied by brief descriptions.
For residual analysis, HLMresid() provides a unified framework to calculate either LS or EB residuals at any level of the model. While EB residuals were previously available from 'merMod' objects using resid() and ranef(), LS residuals required manual implementation by the user. Additionally, HLMresid() provides the necessary framework to conduct an upward residual analysis without the need for programming on the part of the user.
HLMdiag also provides the most complete suite of tools for influence analysis available in R, comparable to those available in SAS PROC MIXED. Among the functions outlined in Table 4 it is important to note that HLMdiag provides two implementations of cooks.distance, mdffits, covratio, and covtrace: one based on the full model refit, and the other based on a one-step approximation. The example discussed in the above sections illustrated the "fast" implementation based on one-step approximations for the fixed effects and associated covariance matrices (for further details we refer the reader to Christensen et al. 1992;Shi and Chen 2008;Zewotir 2008). The implementation of these approximations utilizes smaller, dense submatrices resulting in more efficient computation. Additional computational speed has been achieved by combining the Armadillo C++ linear algebra library (Sanderson 2010) with R via the RcppArmadillo package (Eddelbuettel and Sanderson 2014).
While this one-step approximation is faster, it is less accurate than diagnostics based on the full refit. To obtain diagnostics from a full refit of the model for each deletion case_delete() must first be run to extract all the necessary information from the models, after which the same influence functions can be called on the result. While the results are accurate, the time required to compute the full refit makes them prohibitive, as it greatly interrupts the user's workflow. A one-step approximation is not currently implemented for the variance components, but is an area of future development.

Discussion
In this paper, we have demonstrated how to obtain different types of residuals and influence diagnostics for HLMs fit using lmer() based on the functionality of the R package HLMdiag.
We have implemented functions in such a manner that once the desired residuals or influence diagnostics have been obtained, analysis proceeds largely as it would for an ordinary linear model. Additionally, we have demonstrated an approach to model building that utilizes the residuals at each level of the model.
While we have greatly increased the diagnostic tools available for HLMs in R, we did not implement all possible types of residuals and influence diagnostics. For example, we did not include the calculation of deletion residuals (cf., Haslett and Haslett 2007), but the plotting functions, such as dotplot_diag(), work regardless of specific residual estimation and can be usefully employed for user-defined diagnostics. Additionally, we do not discuss generalizations of residuals to models with crossed random effects (i.e., cross-classified models) due to decreased interpretability based on the imposed covariance structure, but EB residuals can be extracted from such models using HLMresid() (see Table 1). HLMdiag is only fully functional with strictly nested dependence structures, however, only the LS residuals (type = "LS" when using HLMresid()) and leverage() are unavailable for cross-classified models. For future versions of HLMdiag, we plan to extend functionality, in particular, to allow user-defined residual functions to be incorporated into HLMresid() and to provide full compatibility with cross-classified models.
HLMdiag is available from the Comprehensive R Archive Network at http://CRAN.R-project. org/package=HLMdiag and compatible with the current version of lme4. Future development of HLMdiag will include the implementation of methods for HLMs fit by nlme.
Up to now, a complete suite of tools for diagnosing HLMs was not available for an open source statistical software package, resulting in reduced awareness and use of developed diagnostics.
HLMdiag provides a complete, open source suite of tools for the assessment of HLMs, which are comparable to those of SAS PROC MIXED (see Table 2). We believe the availability of these tools will lead to increased utilization and better modeling practices.

A. Graphical inference for model diagnostics
In this paper we use the idea of the lineup protocol introduced by Buja et al. (2009) to assess model assumptions for the HLM. More specifically, we use this protocol to judge whether the assumption of constant error variance of the level-1 residuals across standLRT 2 is valid.
To conduct this visual test, we generate 19 null plots against which the true data will be compared by simulating from the model and obtaining the residuals.