MICE: Multivariate Imputation by Chained Equations in R

The R package mice imputes incomplete multivariate data by chained equations. The software mice 1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. mice 1.0 introduced predictor selection, passive imputation and automatic pooling. This article documents mice, which extends the functionality of mice 1.0 in several ways. In mice, the analysis of imputed data is made completely general, and the range of models under which pooling works is substantially extended. mice adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling routines, model selection tools, and diagnostic graphs. Imputation of categorical data is improved in order to bypass problems caused by perfect prediction. Special attention is paid to transformations, sum scores, indices and interactions using passive imputation, and to the proper setup of the predictor matrix. mice can be downloaded from the Comprehensive R Archive Network. This article provides a hands-on, stepwise approach to solving applied incomplete data problems.


Introduction
Multiple imputation (Rubin 1987, 1996) is the method of choice for complex incomplete data problems. Missing data that occur in more than one variable present a special challenge. Two general approaches for imputing multivariate data have emerged: joint modeling (JM) and fully conditional specification (FCS), also known as multivariate imputation by chained equations (MICE). Schafer (1997) developed various JM techniques for imputation under the multivariate normal, the log-linear, and the general location model. JM involves specifying a multivariate distribution for the missing data, and drawing imputations from their conditional distributions by Markov chain Monte Carlo (MCMC) techniques. This methodology is attractive if the multivariate distribution is a reasonable description of the data. FCS specifies the multivariate imputation model on a variable-by-variable basis by a set of conditional densities, one for each incomplete variable. Starting from an initial imputation, FCS draws imputations by iterating over the conditional densities. A low number of iterations (say 10-20) is often sufficient. FCS is attractive as an alternative to JM in cases where no suitable multivariate distribution can be found. The basic idea of FCS is already quite old, and has been proposed under a variety of names: stochastic relaxation (Kennickell 1991), variable-by-variable imputation (Brand 1999), regression switching, sequential regressions, ordered pseudo-Gibbs sampler (Heckerman et al. 2001), partially incompatible MCMC (Rubin 2003), iterated univariate imputation (Gelman 2004), MICE (van Buuren and Oudshoorn 2000; Groothuis-Oudshoorn 2011) and FCS (van Buuren 2007).

Software implementations
Several authors have implemented fully conditionally specified models for imputation. mice 1.0 (van Buuren and Oudshoorn 2000) was released as an S-PLUS library in 2000, and was converted by several users into R (R Development Core Team 2011). IVEware is a SAS-based procedure that was independently developed by Raghunathan and colleagues. The function aregImpute() in R and S-PLUS is part of the Hmisc package (Harrell 2001). The ice software (Royston 2004, 2005; Royston and White 2011) is a widely used implementation in Stata. SOLAS 3.0 (Statistical Solutions 2001) is also based on conditional specification, but does not iterate. WinMICE (Jacobusse 2005) is a Windows stand-alone program for generating imputations under the hierarchical linear model. A recent addition is the R package mi (Su et al. 2011). Furthermore, FCS is now widely available through the multiple imputation procedure in the SPSS 17 Missing Values Analysis add-on module. See http://www.multiple-imputation.com/ for an overview.

General framework
To the uninitiated, multiple imputation is a bewildering technique that differs substantially from conventional statistical approaches. As a result, the first-time user may get lost in a labyrinth of imputation models, missing data mechanisms, multiple versions of the data, pooling, and so on. This section describes a modular approach to multiple imputation that forms the basis of the architecture of mice. The philosophy behind the MICE methodology is that multiple imputation is best done as a sequence of small steps, each of which may require diagnostic checking. Our hope is that the framework will aid the user to map out the steps needed in practical applications.

Notation
Let $Y_j$ with $j = 1, \ldots, p$ be one of $p$ incomplete variables, where $Y = (Y_1, \ldots, Y_p)$. The observed and missing parts of $Y_j$ are denoted by $Y_j^{\mathrm{obs}}$ and $Y_j^{\mathrm{mis}}$, respectively, so $Y^{\mathrm{obs}} = (Y_1^{\mathrm{obs}}, \ldots, Y_p^{\mathrm{obs}})$ and $Y^{\mathrm{mis}} = (Y_1^{\mathrm{mis}}, \ldots, Y_p^{\mathrm{mis}})$ stand for the observed and missing data in $Y$. The number of imputations is equal to $m \geq 1$. The $h$th imputed data set is denoted as $Y^{(h)}$, where $h = 1, \ldots, m$. Let $Y_{-j} = (Y_1, \ldots, Y_{j-1}, Y_{j+1}, \ldots, Y_p)$ denote the collection of the $p - 1$ variables in $Y$ except $Y_j$. Let $Q$ denote the quantity of scientific interest (e.g., a regression coefficient). In practice, $Q$ is often a multivariate vector. More generally, $Q$ encompasses any model of scientific interest. Figure 1 illustrates the three main steps in multiple imputation: imputation, analysis and pooling. The software stores the results of each step in a specific class: mids, mira and mipo. We now explain each of these in more detail.

Modular approach to multiple imputation
The leftmost side of the picture indicates that the analysis starts with an observed, incomplete data set $Y^{\mathrm{obs}}$. In general, the problem is that we cannot estimate $Q$ from $Y^{\mathrm{obs}}$ without making unrealistic assumptions about the unobserved data. Multiple imputation is a general framework that creates several imputed versions of the data by replacing the missing values by plausible data values. These plausible values are drawn from a distribution specifically modeled for each missing entry. In mice this task is done by the function mice(). Figure 1 portrays $m = 3$ imputed data sets $Y^{(1)}, \ldots, Y^{(3)}$. The three imputed sets are identical for the non-missing data entries, but differ in the imputed values. The magnitude of these differences reflects our uncertainty about what value to impute. The package has a special class for storing the imputed data: a multiply imputed data set of class mids.
The second step is to estimate $Q$ on each imputed data set, typically by the method we would have used if the data had been complete. This is easy since all data are now complete. The model applied to $Y^{(1)}, \ldots, Y^{(m)}$ is generally identical. mice 2.9 contains a function with.mids() that performs this analysis. This function supersedes the functions lm.mids() and glm.mids(). The estimates $\hat{Q}^{(1)}, \ldots, \hat{Q}^{(m)}$ will differ from each other because their input data differ. It is important to realize that these differences are caused by our uncertainty about what value to impute. In mice the analysis results are collectively stored as a multiply imputed repeated analysis within an R object of class mira.
The last step is to pool the $m$ estimates $\hat{Q}^{(1)}, \ldots, \hat{Q}^{(m)}$ into one estimate $\bar{Q}$ and estimate its variance. For quantities $Q$ that are approximately normally distributed, we can calculate the mean over $\hat{Q}^{(1)}, \ldots, \hat{Q}^{(m)}$ and sum the within- and between-imputation variances according to the method outlined in Rubin (1987, pp. 76-77). The function pool() contains methods for pooling quantities by Rubin's rules. The result of the function is stored as a multiply imputed pooled outcomes object of class mipo.
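A minimal sketch of the three steps and the classes that store their results, using the built-in nhanes data and an illustrative complete-data model:

R> library("mice")
R> imp <- mice(nhanes, print = FALSE)  # imputation step: object of class 'mids'
R> fit <- with(imp, lm(chl ~ bmi))     # analysis step: object of class 'mira'
R> est <- pool(fit)                    # pooling step: object of class 'mipo'
R> summary(est)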

MICE algorithm
The imputation model should

- account for the process that created the missing data,
- preserve the relations in the data, and
- preserve the uncertainty about these relations.
The hope is that adherence to these principles will yield imputations that are statistically correct as in Rubin (1987, Chapter 4) for a wide range of $Q$. Typical problems that may surface while imputing multivariate missing data are:

- For a given $Y_j$, the predictors $Y_{-j}$ used in the imputation model may themselves be incomplete.
- Circular dependence can occur, where $Y_1$ depends on $Y_2$ and $Y_2$ depends on $Y_1$, because in general $Y_1$ and $Y_2$ are correlated, even given other variables.
- Especially with large $p$ and small $n$, collinearity and empty cells may occur.
- Rows or columns can be ordered, e.g., as with longitudinal data.
- Variables can be of different types (e.g., binary, unordered, ordered, continuous), thereby making the application of theoretically convenient models, such as the multivariate normal, theoretically inappropriate.
- The relation between $Y_j$ and $Y_{-j}$ could be complex, e.g., nonlinear, or subject to censoring processes.
- Imputation can create impossible combinations (e.g., pregnant fathers), or destroy deterministic relations in the data (e.g., sum scores).
- Imputations can be nonsensical (e.g., body temperature of the dead).
- Models for $Q$ that will be applied to the imputed data may not (yet) be known.
This list is by no means exhaustive, and other complexities may appear for particular data.
In order to address the issues posed by the real-life complexities of the data, it is convenient to specify the imputation model separately for each column in the data. This has led to the development of the technique of chained equations. Specification occurs at a level that is well understood by the user, i.e., at the variable level. Moreover, techniques for creating univariate imputations are well developed.
Let the hypothetically complete data $Y$ be a partially observed random sample from the $p$-variate multivariate distribution $P(Y|\theta)$. We assume that the multivariate distribution of $Y$ is completely specified by $\theta$, a vector of unknown parameters. The problem is how to get the multivariate distribution of $\theta$, either explicitly or implicitly. The MICE algorithm obtains the posterior distribution of $\theta$ by sampling iteratively from conditional distributions of the form

$$P(Y_1 \mid Y_{-1}, \theta_1)$$
$$\vdots$$
$$P(Y_p \mid Y_{-p}, \theta_p).$$

The parameters $\theta_1, \ldots, \theta_p$ are specific to the respective conditional densities and are not necessarily the product of a factorization of the 'true' joint distribution $P(Y|\theta)$. Starting from a simple draw from observed marginal distributions, the $t$th iteration of chained equations is a Gibbs sampler that successively draws

$$\theta_1^{*(t)} \sim P(\theta_1 \mid Y_1^{\mathrm{obs}}, Y_2^{(t-1)}, \ldots, Y_p^{(t-1)})$$
$$Y_1^{*(t)} \sim P(Y_1 \mid Y_1^{\mathrm{obs}}, Y_2^{(t-1)}, \ldots, Y_p^{(t-1)}, \theta_1^{*(t)})$$
$$\vdots$$
$$\theta_p^{*(t)} \sim P(\theta_p \mid Y_p^{\mathrm{obs}}, Y_1^{(t)}, \ldots, Y_{p-1}^{(t)})$$
$$Y_p^{*(t)} \sim P(Y_p \mid Y_p^{\mathrm{obs}}, Y_1^{(t)}, \ldots, Y_{p-1}^{(t)}, \theta_p^{*(t)}),$$

where $Y_j^{(t)} = (Y_j^{\mathrm{obs}}, Y_j^{*(t)})$ is the $j$th imputed variable at iteration $t$. Observe that previous imputations $Y_j^{*(t-1)}$ enter $Y_j^{*(t)}$ only through their relation with other variables, and not directly. Convergence can therefore be quite fast, unlike many other MCMC methods. It is important to monitor convergence, but in our experience the number of iterations can often be a small number, say 10-20. The name chained equations refers to the fact that the MICE algorithm can be easily implemented as a concatenation of univariate procedures to fill out the missing data. The mice() function executes $m$ streams in parallel, each of which generates one imputed data set.
The MICE algorithm possesses a touch of magic. The method has been found to work well in a variety of simulation studies (Brand 1999; Horton and Lipsitz 2001; Moons et al. 2006; van Buuren et al. 2006b; Horton and Kleinman 2007; Yu et al. 2007; Schunk 2008; Drechsler and Rassler 2008; Giorgi et al. 2008). Note that it is possible to specify models for which no known joint distribution exists. Two linear regressions specify a joint multivariate normal given specific regularity conditions (Arnold and Press 1989). However, the joint distribution of one linear and, say, one proportional odds regression model is unknown, yet very easy to specify within the MICE framework. The conditionally specified model may be incompatible in the sense that the joint distribution cannot exist. It is not yet clear what the consequences of incompatibility are on the quality of the imputations. The little simulation work that is available suggests that the problem is probably not serious in practice (van Buuren et al. 2006b; Drechsler and Rassler 2008). Compatible multivariate imputation models (Schafer 1997) have been found to work in a large variety of cases, but may lack flexibility to address specific features of the data. Gelman and Raghunathan (2001) remark that "separate regressions often make more sense than joint models". In order to bypass the limitations of joint models, Gelman (2004, p. 541) concludes: "Thus we are suggesting the use of a new class of models - inconsistent conditional distributions - that were initially motivated by computational and analytical convenience." As a safeguard to evade potential problems from incompatibility, we suggest that the order in which variables are imputed should be sensible. This ordering can be specified in mice (cf. Section 3.6). Existence and uniqueness theorems for conditionally specified models have been derived (Arnold and Press 1989; Arnold et al. 1999; Ip and Wang 2009). More work along these lines would be useful in order to identify the boundaries at which the MICE algorithm breaks down. Barring this, the method seems to work well in many examples, is of great importance in practice, and is easily applied.

Simple example
This section presents a simple example incorporating all three steps. After installing the R package mice from the Comprehensive R Archive Network (CRAN), load the package:

R> library("mice")
This paper uses the features of mice 2.9. The data frame nhanes contains data from Schafer (1997, p. 237). There are 13 (out of 25) rows that are complete. There is one row for which only bmi is missing, and there are seven rows for which only age is known. The total number of missing values is equal to (7 × 3) + (1 × 2) + (3 × 1) + (1 × 1) = 27. Most missing values (10) occur in chl. The number of observed and missing entries per variable pair can also be counted. Thus, for the pair (bmi,chl) there are 13 completely observed pairs, 3 pairs for which bmi is observed but chl is not, 2 pairs for which bmi is missing but chl is observed, and 7 pairs with both bmi and chl missing. Note that these numbers add up to the total sample size.
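A minimal sketch of how these counts can be obtained with mice's own helper functions (printed output omitted):

R> md.pattern(nhanes)    # tabulates the missing data patterns
R> p <- md.pairs(nhanes) # pairwise counts of observed/missing entries
R> p$rr["bmi", "chl"]    # pairs with both bmi and chl observed: 13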
The R package VIM (Templ et al. 2011) contains functions for plotting incomplete data. The margin plot of the pair (bmi,chl) can be plotted by

R> library("VIM")
R> marginplot(nhanes[, c("chl", "bmi")], col = mdc(1:2), cex = 1.2,
+    cex.lab = 1.2, cex.numbers = 1.3, pch = 19)

Figure 2 displays the result. The data area holds 13 blue points for which both bmi and chl were observed. The plot in Figure 2 requires a graphic device that supports transparent colors, e.g., pdf(). To create the plot in other devices, change the col = mdc(1:2) argument to col = mdc(1:2, trans = FALSE). The three red dots in the left margin correspond to the records for which bmi is observed and chl is missing. The points are drawn at the known values of bmi at 24.9, 25.5 and 29.6. Likewise, the bottom margin contains two red points with observed chl and missing bmi. The red dot at the intersection of the bottom and left margins indicates that there are records for which both bmi and chl are missing. The three numbers in the lower left corner indicate the number of incomplete records for various combinations.
There are 9 records in which bmi is missing, 10 records in which chl is missing, and 7 records in which both are missing. Furthermore, the left margin contains two box plots, a blue and a red one. The blue box plot in the left margin summarizes the marginal distribution of bmi of the 13 blue points. The red box plot summarizes the distribution of the three bmi values with missing chl. Under MCAR, these distributions are expected to be identical. Likewise, the two colored box plots in the bottom margin summarize the respective distributions for chl.

Creating imputations
Creating imputations can be done with a call to mice() as follows:

R> imp <- mice(nhanes, seed = 23109)

Imputations are generated according to the default method, which is, for numerical data, predictive mean matching (pmm). The entries imp$visitSequence and imp$predictorMatrix are algorithmic options that will be discussed later. The default number of multiple imputations is equal to m = 5.

Diagnostic checking
An important step in multiple imputation is to assess whether imputations are plausible. Imputations should be values that could have been obtained had they not been missing. Imputations should be close to the data. Data values that are clearly impossible (e.g., negative counts, pregnant fathers) should not occur in the imputed data. Imputations should respect relations between variables, and reflect the appropriate amount of uncertainty about their 'true' values. Diagnostic checks on the imputed data provide a way to check the plausibility of the imputations. The complete() function extracts completed data sets from the imp object; with the "long" action it stacks the five imputed data sets into a long (row-stacked) matrix with 125 records. In the first completed data set, the missing entries in nhanes have been filled by the values from the first (of five) imputation. The second completed data set can be obtained by complete(imp, 2). For the observed data it is identical to the first completed data set, but it may differ in the imputed data.
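A minimal sketch of extracting completed data sets:

R> com1 <- complete(imp)          # first completed data set
R> com2 <- complete(imp, 2)       # second completed data set
R> long <- complete(imp, "long")  # all five stacked: 5 x 25 = 125 rows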
It is often useful to inspect the distributions of the original and the imputed data. One way of doing this is to use the function stripplot() in mice 2.9, an adapted version of the same function in the package lattice (Sarkar 2008). The stripplot in Figure 3 is created as

R> stripplot(imp, pch = 20, cex = 1.2)

The figure shows the distributions of the four variables as individual points. Blue points are observed, red points are imputed. The panel for age contains blue points only because age is complete. Furthermore, note that the red points follow the blue points reasonably well, including the gaps in the distribution, e.g., for chl.
The scatterplot of chl and bmi for each imputed data set in Figure 4 is created by

R> xyplot(imp, bmi ~ chl | .imp, pch = 20, cex = 1.4)

The figure redraws Figure 2, but now for the observed and imputed data. Imputations are plotted in red. The blue points are the same across different panels, but the red points vary. The red points have more or less the same shape as the blue data, which indicates that they could have been plausible measurements if they had not been missing. The differences between the red points represent our uncertainty about the true (but unknown) values.
Under MCAR, univariate distributions of the observed and imputed data are expected to be identical. Under MAR, they can be different, both in location and spread, but their multivariate distribution is assumed to be identical. There are many other ways to look at the completed data, but we defer a discussion of those to Section 4.5.

Analysis of imputed data
Suppose that the complete-data analysis of interest is a linear regression of chl on age and bmi. For this purpose, we can use the function with.mids(), a wrapper function that applies the complete-data model to each of the imputed data sets:
R> fit <- with(imp, lm(chl ~ age + bmi))

The fit object has class mira and contains the results of five complete-data analyses. These can be pooled as follows:

R> print(pool(fit))

After multiple imputation, we find a significant effect for bmi. The column fmi contains the fraction of missing information as defined in Rubin (1987), and the column lambda is the proportion of the total variance that is attributable to the missing data ($\lambda = (B + B/m)/T$).
The pooled results are subject to simulation error and therefore depend on the seed argument of the mice() function. In order to minimize simulation error, we can use a higher number of imputations, for example m = 50. It is easy to do this as

R> imp50 <- mice(nhanes, m = 50, seed = 23109)
R> fit <- with(imp50, lm(chl ~ age + bmi))
R> round(summary(pool(fit)), 2)

We now find that both age and bmi are significant effects. This is the result that can be reported.

Seven choices
The specification of the imputation model is the most challenging step in multiple imputation.
What are the choices that we need to make, and in what order? There are seven main choices:

1. First, we should decide whether the missing at random (MAR) assumption (Rubin 1976) is plausible. The MAR assumption is a suitable starting point in many practical cases, but there are also cases where the assumption is suspect. Schafer (1997, pp. 20-23) provides a good set of practical examples. MICE can handle both MAR and missing not at random (MNAR). Multiple imputation under MNAR requires additional modeling assumptions that influence the generated imputations. There are many ways to do this. We refer to Section 6.2 for an example of how that could be realized.
2. The second choice refers to the form of the imputation model. The form encompasses both the structural part and the assumed error distribution. Within MICE the form needs to be specified for each incomplete column in the data. The choice will be steered by the scale of the dependent variable (i.e., the variable to be imputed), and preferably incorporates knowledge about the relation between the variables. Section 3.2 describes the possibilities within mice 2.9.
3. Our third choice concerns the set of variables to include as predictors into the imputation model. The general advice is to include as many relevant variables as possible including their interactions (Collins et al. 2001). This may however lead to unwieldy model specifications that could easily get out of hand. Section 3.3 describes the facilities within mice 2.9 for selecting the predictor set.
4. The fourth choice is whether we should impute variables that are functions of other (incomplete) variables. Many data sets contain transformed variables, sum scores, interaction variables, ratios, and so on. It can be useful to incorporate the transformed variables into the multiple imputation algorithm. Section 3.4 describes how mice 2.9 deals with this situation using passive imputation.
5. The fifth choice concerns the order in which variables should be imputed. Several strategies are possible, each with their respective pros and cons. Section 3.6 shows how the visitation scheme of the MICE algorithm within mice 2.9 is under control of the user.

6. The sixth choice concerns the setup of the starting imputations and the number of iterations. The convergence of the MICE algorithm can be monitored in many ways. Section 4.3 outlines some techniques in mice 2.9 that assist in this task.
7. The seventh choice is m, the number of multiply imputed data sets. Setting m too low may result in large simulation error, especially if the fraction of missing information is high.
Please realize that these choices are always needed. The analysis in Section 2.4 imputed the nhanes data using just a minimum of specifications and relied on mice defaults. However, these default choices are not necessarily the best for your data. There is no magical setting that produces appropriate imputations in every problem. Real problems need tailoring. It is our hope that the software will invite you to go beyond the default settings.

Univariate imputation methods
In MICE one specifies a univariate imputation model for each incomplete variable. Both the structural part of the imputation model and the error distribution need to be specified. The choice will depend on, amongst others, the scale of the variable to be imputed. The univariate imputation method takes a set of (at that moment) complete predictors, and returns a single imputation for each missing entry in the incomplete target column. The mice 2.9 package supplies a number of built-in univariate imputation models. These all have names mice.impute.name, where name identifies the univariate imputation method. Table 1 contains the list of built-in imputation functions. The default methods are indicated. The method argument of mice() specifies the imputation method per column and overrides the default. If method is specified as one string, then all visited data columns (cf. Section 3.6) will be imputed by the univariate function indicated by this string. So

R> imp <- mice(nhanes, method = "norm")

specifies that the function mice.impute.norm() is called for all columns. Alternatively, method can be a vector of strings of length ncol(data) specifying the function that is applied to each column. Columns that need not be imputed have method "". For example,

R> imp <- mice(nhanes, meth = c("", "norm", "pmm", "mean"))

applies different methods for different columns. The nhanes2 data frame contains one polytomous, one binary and two numeric variables.

Empty imputation method
The mice() function will automatically skip imputation of variables that are complete. One of the problems in previous versions of this function was that all incomplete data needed to be imputed. In mice 2.9 it is possible to skip imputation of selected incomplete variables by specifying the empty method "". This works as long as the incomplete variable that is skipped is not being used as a predictor for imputing other variables. The mice() function will detect this case, and automatically remove the variable from the predictor list. For example, suppose that we do not want to impute bmi, but still want to retain it in the imputed data. We can run the following:

R> imp <- mice(nhanes2, meth = c("", "", "logreg", "norm"))

This statement runs because bmi is removed from the predictor list. When removal is not possible, the program aborts with an error message like

Error in check.predictorMatrix(predictorMatrix, method, varnames, nmis, :
  Variable bmi is used, has missing values, but is not imputed

Section 3.3 explains how to solve this problem.

Perfect prediction
Previous versions produced warnings like fitted probabilities numerically 0 or 1 occurred and algorithm did not converge on these data. These warnings are caused by the small sample size of 25 relative to the number of parameters. mice 2.9 implements more stable algorithms in mice.impute.logreg() and mice.impute.polyreg(), based on augmenting the rows prior to imputation (White et al. 2010).

Default imputation method
The mice package distinguishes between four types of variables: numeric, binary (factor with 2 levels), unordered (factor with more than 2 levels), and ordered (ordered factor with more than 2 levels). Each type has a default imputation method, indicated in Table 1. These defaults can be changed by the defaultMethod argument to the mice() function. For example,

R> mice(nhanes2, defaultMethod = c("norm", "logreg", "polyreg", "polr"))

applies the function mice.impute.norm() to each numeric variable in nhanes2 instead of mice.impute.pmm(). It leaves the defaults for binary and categorical data unchanged. The mice() function checks the type of the variable against the specified imputation method, and produces a warning if a type mismatch is found.

Overview of imputation methods
The function mice.impute.pmm() implements predictive mean matching (Little 1988), a general purpose semi-parametric imputation method. Its main virtues are that imputations are restricted to the observed values and that it can preserve non-linear relations even if the structural part of the imputation model is wrong. It is a good overall imputation method. The functions mice.impute.norm() and mice.impute.norm.nob() impute according to a linear imputation model, and are fast and efficient if the model residuals are close to normal. The second model ignores any sampling uncertainty of the imputation model, so it is only appropriate for very large samples. The method mice.impute.mean() simply imputes the mean of the observed data. Mean imputation is known to be a bad strategy, and the user should be aware of the implications.
The function mice.impute.2L.norm() imputes according to the heteroscedastic linear two-level model by a Gibbs sampler (note: interpret '2L' as 'two levels', not as 'twenty-one'). It is new in mice 2.9. The method considerably improves upon standard methods that ignore the clustering structure, or that model the clustering as fixed effects (van Buuren 2010). See multilevel imputation in Section 3.3 for an example.
The function mice.impute.polyreg() imputes factors with two or more levels by the multinomial model, using the multinom() function in nnet (Venables and Ripley 2002) for the hard work. The function mice.impute.polr() implements the ordered logit model, also known as the proportional odds model. It calls polr() from MASS (Venables and Ripley 2002). The function mice.impute.lda() uses the MASS function lda() for linear discriminant analysis to compute posterior probabilities for each incomplete case, and subsequently draws imputations from these posteriors. The statistical properties of this method are not as good as those of mice.impute.polyreg() (Brand 1999), but it is a bit faster and uses fewer resources. The maximum number of categories these functions can handle is set to 50. Finally, the function mice.impute.sample() just takes a random draw from the observed data, and imputes these values into the missing cells. This function does not condition on any other variable. mice() calls mice.impute.sample() for initialization.
The univariate imputation functions are designed to be called from the main function mice(), and this is by far the easiest way to invoke them. It is however possible to call them directly, assuming that the arguments are all properly specified. See the documentation for more details.

Predictor selection
One of the most useful features of the MICE algorithm is the ability to specify the set of predictors to be used for each incomplete variable. The basic specification is made through the predictorMatrix argument, which is a square matrix of size ncol(data) containing 0/1 data. Each row in predictorMatrix identifies which predictors are to be used for the variable in the row name. If diagnostics = TRUE (the default), then mice() returns a mids object containing a predictorMatrix entry, which can be inspected as sketched below. The rows correspond to incomplete target variables, in the sequence in which they appear in the data. Row and column names of the predictorMatrix are ignored on input, and overwritten by mice() on output. A value of 1 indicates that the column variable is used as a predictor to impute the target (row) variable, and a 0 means that it is not used. Thus, in the default matrix for nhanes, bmi is predicted from age, hyp and chl. Note that the diagonal is 0 since a variable cannot predict itself. Since age contains no missing data, mice() silently sets all values in its row to 0. The default setting of the predictorMatrix specifies that all variables predict all others.
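A minimal way to look at the default predictor matrix (a sketch; the printed matrix is omitted here):

R> imp <- mice(nhanes, print = FALSE)
R> imp$predictorMatrix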

Removing a predictor
The user can specify a custom predictorMatrix, thereby effectively regulating the number of predictors per variable. For example, suppose that bmi is considered irrelevant as a predictor. Setting all entries within the bmi column to zero (see the sketch following this paragraph) effectively removes bmi from the predictor set, while bmi will still be imputed. Using this new specification, we create imputations as

R> imp <- mice(nhanes, pred = pred, pri = FALSE)

This imputes the incomplete variables hyp and chl without using bmi as a predictor.
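The zeroing itself might look as follows (a minimal sketch, assuming the matrix is copied from an earlier default run):

R> pred <- imp$predictorMatrix
R> pred[, "bmi"] <- 0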

Skipping imputation
Suppose that we want to skip imputation of bmi, and leave it as it is. This can be achieved by 1) eliminating bmi from the predictor set, and 2) setting the imputation method to "" (a sketch follows this paragraph). The first statement in the sketch calls mice() with the maximum number of iterations maxit set to zero. This is a fast way to create the mids object called ini containing the default settings. It is then easy to copy and edit the predictorMatrix and method arguments of the mice() function. Now mice() will impute NA into the missing values of bmi. In effect, the original bmi (with the missing values included) is copied into the multiply imputed data set. This technique is not only useful for 'keeping all the data together', but also opens up ways to perform imputation by nested blocks of variables. For examples where this could be useful, see Shen (2000) and Rubin (2003).
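A minimal sketch of this setup (the exact commands are our reconstruction and may differ in detail from the original):

R> ini <- mice(nhanes2, maxit = 0, print = FALSE)
R> pred <- ini$pred
R> pred[, "bmi"] <- 0
R> meth <- ini$meth
R> meth["bmi"] <- ""
R> imp <- mice(nhanes2, meth = meth, pred = pred, print = FALSE)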

Multilevel imputation
Imputation of multilevel data poses special problems. Most techniques have been developed under the joint modeling perspective (Schafer and Yucel 2002; Yucel 2008; Goldstein et al. 2009). Some work within the context of FCS has been done (Jacobusse 2005), but this is still an open research area. The mice 2.9 package includes the mice.impute.2L.norm() function, which can be used to impute missing data under a linear multilevel model. The function was contributed by Roel de Jong, and implements the Gibbs sampler for the linear multilevel model where the within-class error variance is allowed to vary (Kasim and Raudenbush 1998). Heterogeneity in the variances is essential for getting good imputations in multilevel data. The method is an improvement over simpler methods like flat-file imputation or per-group imputation (van Buuren 2010).

Advice on predictor selection
The predictorMatrix argument is especially useful when dealing with data sets with a large number of variables. We now provide some advice regarding the selection of predictors for large data, especially with many incomplete data.
As a general rule, using every bit of available information yields multiple imputations that have minimal bias and maximal certainty (Meng 1995; Collins et al. 2001). This principle implies that the number of predictors should be chosen as large as possible. Including as many predictors as possible tends to make the MAR assumption more plausible, thus reducing the need to make special adjustments for MNAR mechanisms (Schafer 1997).
However, data sets often contain several hundreds of variables, all of which can potentially be used to generate imputations. It is not feasible (because of multicollinearity and computational problems) to include all these variables. It is also not necessary. In our experience, the increase in explained variance in linear regression is typically negligible after the best, say, 15 variables have been included. For imputation purposes, it is expedient to select a suitable subset of data that contains no more than 15 to 25 variables. van Buuren et al. (1999) provide the following strategy for selecting predictor variables from a large data base: 1. Include all variables that appear in the complete-data model, i.e., the model that will be applied to the data after imputation. Failure to do so may bias the complete-data analysis, especially if the complete-data model contains strong predictive relations. Note that this step is somewhat counter-intuitive, as it may seem that imputation would artificially strengthen the relations of the complete-data model, which is clearly undesirable. If done properly however, this is not the case. On the contrary, not including the complete-data model variables will tend to bias the results towards zero. Note that interactions of scientific interest also need to be included into the imputation model.
2. In addition, include the variables that are related to the nonresponse. Factors that are known to have influenced the occurrence of missing data (stratification, reasons for nonresponse) are to be included on substantive grounds. Other variables of interest are those for which the distributions differ between the response and nonresponse groups. These can be found by inspecting their correlations with the response indicator of the variable to be imputed. If the magnitude of this correlation exceeds a certain level, then the variable is included.
3. In addition, include variables that explain a considerable amount of variance. Such predictors help to reduce the uncertainty of the imputations. They are crudely identified by their correlation with the target variable.
4. Remove from the variables selected in steps 2 and 3 those variables that have too many missing values within the subgroup of incomplete cases. A simple indicator is the percentage of observed cases within this subgroup, the percentage of usable cases.
Most predictors used for imputation are incomplete themselves. In principle, one could apply the above modeling steps for each incomplete predictor in turn, but this may lead to a cascade of auxiliary imputation problems. In doing so, one runs the risk that every variable needs to be included after all. In practice, there is often a small set of key variables for which imputations are needed, which suggests that steps 1 through 4 are to be performed for key variables only. This was the approach taken in van Buuren et al. (1999), but it may miss important predictors of predictors. A safer and more efficient, though more laborious, strategy is to perform the modeling steps also for the predictors of predictors of key variables. This is done in Oudshoorn et al. (1999). We expect that it is rarely necessary to go beyond predictors of predictors. At the terminal node, we can apply a simple method like mice.impute.sample() that does not need any predictors for itself.

Quick predictor selection
Correlations for the strategy outlined above can be calculated with the standard function cor(). The mice function quickpred() automates the strategy; applied to the nhanes data (see the sketch below) it yields a predictorMatrix for a model that includes all predictors with an absolute correlation with the target or with the response indicator of at least 0.1 (the default value of the mincor argument). Observe that the predictor matrix need not always be symmetric. In particular, bmi is not a predictor of hyp, but hyp is a predictor of bmi here. This can occur because the correlation of hyp with the response indicator of bmi (0.139) exceeds the threshold.
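A minimal sketch (printed matrices omitted):

R> round(cor(nhanes, use = "pairwise.complete.obs"), 3)  # pairwise correlations
R> quickpred(nhanes)  # 0/1 predictor matrix using the default mincor = 0.1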
The quickpred() function has arguments that change the minimum correlation, that allow to select predictor based on their proportion of usable cases, and that can specify variables that should always be included or excluded. It is also possible to specify thresholds per target variable, or even per target-predictor combination. See the help files for more details.
It is easy to use the function in conjunction with mice(). For example,

R> imp <- mice(nhanes, pred = quickpred(nhanes, minpuc = 0.25,
+    include = "age"))

imputes the data from a model where the minimum proportion of usable cases is at least 0.25 and that always includes age as a predictor.
Any interactions of interest need to be appended to the data before using quickpred(). For large data, the user can experiment with the mincor, minpuc, include and exclude arguments to trim the imputation problem to a reasonable size. The application of quickpred() can substantially cut down the time needed to specify the imputation model for data with many variables.

Passive imputation
There is often a need for transformed, combined or recoded versions of the data. In the case of incomplete data, one could impute the original, and transform the completed original afterwards, or transform the incomplete original and impute the transformed version. If, however, both the original and the transform are needed within the imputation algorithm, neither of these approaches works because one cannot be sure that the transformation holds between the imputed values of the original and the transformed versions.
mice implements a special mechanism, called passive imputation, to deal with such situations. Passive imputation maintains the consistency among different transformations of the same data. The method can be used to ensure that the transform always depends on the most recently generated imputations in the original untransformed data. Passive imputation is invoked by specifying a ~ (tilde) as the first character of the imputation method. The entire string, including the ~, is interpreted as the formula argument in a call to model.frame(formula, data[!r[, j], ]). This provides a simple method for specifying a large variety of dependencies among the variables, such as transformed variables, recodes, interactions, sum scores, and so on, that may themselves be needed in other parts of the algorithm.

Preserving a transformation
As an example, suppose that previous research suggested that bmi is better imputed from log(chl) than from chl. We may thus want to add an extra column to the data with log(chl), and impute bmi from log(chl). Any missing values in chl will also be present in log(chl). The problem is to keep imputations in chl and log(chl) consistent with each other, i.e., the imputations should respect their relationship. The following code will take care of this:

R> nhanes2.ext <- cbind(nhanes2, lchl = log(nhanes2$chl))
R> ini <- mice(nhanes2.ext, max = 0, print = FALSE)
R> meth <- ini$meth
R> meth["lchl"] <- "~log(chl)"
R> pred <- ini$pred
R> pred[c("hyp", "chl"), "lchl"] <- 0
R> pred["bmi", "chl"] <- 0
R> pred

We defined the predictor matrix such that either chl or log(chl) is a predictor, but not both at the same time, primarily to avoid collinearity. Moreover, we do not want to predict chl from lchl. Doing so would immobilize the MICE algorithm at the starting imputation. It is thus important to set the entry pred["chl", "lchl"] equal to zero to avoid this. After running mice() we find imputations for both chl and lchl that respect the relation.
Note: A slightly easier way to create nhanes2.ext is to specify

R> nhanes2.ext <- cbind(nhanes2, lchl = NA)

followed by the same commands. This has the advantage that the transform needs to be specified only once. Since all values in lchl are now treated as missing, the size of imp will generally become (much) larger, however. The first method is generally more efficient, but the second is easier.

Index of two variables
The idea can be extended to two or more columns. This is useful to create derived variables that should remain synchronized. As an example, we consider imputation of body mass index (bmi), which is defined as weight divided by the square of height. It is impossible to calculate bmi if either weight or height is missing. Consider the data boys in mice. Data on weight and height are missing for 4 and 20 cases, respectively, resulting in 21 cases for which bmi could not be calculated. Using passive imputation, we can impute bmi from height and weight by means of the I() operator, as sketched below.
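A sketch of this setup, assuming heights in boys are recorded in cm (hence the division by (hgt/100)^2); zeroing the reverse entries prevents bmi from feeding back into its own components:

R> ini <- mice(boys, max = 0, print = FALSE)
R> meth <- ini$meth
R> meth["bmi"] <- "~I(wgt / (hgt / 100)^2)"
R> pred <- ini$pred
R> pred[c("wgt", "hgt"), "bmi"] <- 0
R> imp.idx <- mice(boys, meth = meth, pred = pred, print = FALSE)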

Interaction terms
In some cases scientific interest focuses on interaction terms. For example, in experimental studies we may be interested in assessing whether the rate of change differs between two treatment groups. In such cases, the primary goal is to get an unbiased estimate of the time by group interaction. In general, imputations should be conditional upon the interactions of interest. However, interaction terms will be incomplete if the variables that make up the interaction are incomplete. It is straightforward to solve this problem using passive imputation, as sketched below.
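A sketch for an interaction between two continuous variables, bmi and chl in nhanes2 (the centering constants 25 and 200 are illustrative choices):

R> nhanes2.ext <- cbind(nhanes2, bmi.chl = NA)
R> ini <- mice(nhanes2.ext, max = 0, print = FALSE)
R> meth <- ini$meth
R> meth["bmi.chl"] <- "~I((bmi - 25) * (chl - 200))"
R> pred <- ini$pred
R> pred[c("bmi", "chl"), "bmi.chl"] <- 0
R> imp <- mice(nhanes2.ext, meth = meth, pred = pred, print = FALSE)

The statement pred[c("bmi", "chl"), "bmi.chl"] <- 0 prevents bmi.chl from being used as a predictor of its own components.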
Interactions involving categorical variables need a representation using dummy variables. The mice() function internally creates dummy variables for any factor that is being used as a predictor. The data and names of these dummy variables can be accessed from imp$pad$data.

Cautious remarks
There are some specific points that need attention when using passive imputation through the ~ mechanism. Deterministic relations between columns remain synchronized only if the passively imputed variable is updated immediately after any of its predictors are imputed. So in the last example the variables age.1.bmi and age.2.bmi should be updated each time after age or bmi is imputed in order to stay synchronized. This can be done by changing the sequence in which the algorithm visits the columns. The mice() function does not automatically change the visiting sequence if passive variables are added. Section 3.6 provides techniques for setting the visiting sequence. Whether synchronization is really worthwhile will depend on the specific data at hand, but it is a healthy general strategy to pursue.
The ~ mechanism may easily lead to highly correlated variables or linear dependencies among predictors. Sometimes we want this behavior on purpose, for example if we want to impute using both $X$ and $X^2$. However, linear dependencies among predictors will produce a fatal error during imputation. In this section, our strategy has been to avoid this by requiring that either the original or the passive variable, but not both, can be a predictor. This strategy may not always be desired or feasible, however.
Another point is that passive imputation may easily lock up the algorithm when it is not done properly. Suppose that we make a copy bmi2 of bmi by passive imputation, and subsequently use bmi2 to impute missing data in bmi. Re-imputing bmi from bmi2 will fix the imputations to the starting imputations. This situation is easy to diagnose and correct (cf. Section 4.3).
The mice algorithm internally uses passive imputation to create dummy variables of factors. These dummy variables are created automatically and discarded within the main algorithm, and are always kept in sync with the original by passive imputation. The relevant data and settings are stored within the list imp$pad. Normally, the user will not have to deal with this, but in case of running problems it could be useful to be aware of this. Section 4 provides more details.

Post-processing imputations
It can be useful to post-process imputations generated by univariate methods. For example, we may require imputations to be bounded within a certain range, or we may wish to exclude implausible or impossible combinations. The mice() function has an argument post that takes a vector of strings of R commands. These commands are parsed and evaluated just after the univariate imputation function returns, and thus provide a way to post-process the imputed values. For example, a way to ensure positive imputations for chl under normal imputation is to use the squeeze() function (cf. Section 3.4):

R> nhanes2.ext <- cbind(nhanes2, lchl = NA)
R> ini <- mice(nhanes2.ext, max = 0, print = FALSE)
R> meth <- ini$meth
R> meth[c("lchl", "chl")] <- c("~log(chl)", "norm")
R> pred <- ini$pred
R> pred[c("hyp", "chl"), "lchl"] <- 0
R> pred["bmi", "chl"] <- 0
R> post <- ini$post
R> post["chl"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(100, 300))"

The expression imp[[j]][, i] in the definition of post["chl"] refers to a vector that is used to store the ith imputation (i = 1, ..., m) for the jth column in p$data, a padded version of the input data, here nhanes2.ext. Expressions are evaluated within the sampler() function. Any expressions that are valid within that context can be executed, but be careful not to introduce any NAs if the variable is to be used as a predictor for another variable. The output shows that several imputed values have been constrained to lie within the specified range.
Another example refers to Figure 5. Puberty can already start at the age of 3 years in clinical populations of American girls (Herman-Giddens et al. 1997). For our data of healthy Dutch boys we assume that puberty will not start before the age of 5 years. We thus want to restrict any imputations of gen, phb and tv to the lowest possible category for children younger than 5 years. This can be achieved by using the post argument, as sketched below; the sketch assumes that meth and pred have been set as in Section 3.4.
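A sketch of such post-processing statements (the selector p$data$age[!r[, j]] < 5 picks out the imputed cases younger than 5; the exact lines are our reconstruction):

R> post <- ini$post
R> post["gen"] <- "imp[[j]][p$data$age[!r[, j]] < 5, i] <- levels(boys$gen)[1]"
R> post["phb"] <- "imp[[j]][p$data$age[!r[, j]] < 5, i] <- levels(boys$phb)[1]"
R> post["tv"] <- "imp[[j]][p$data$age[!r[, j]] < 5, i] <- 1"
R> imp <- mice(boys, meth = meth, pred = pred, post = post, print = FALSE)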

Visiting scheme
The default MICE algorithm imputes incomplete columns in the data from left to right. Theoretically, the visiting scheme is irrelevant as long as each column is visited often enough, but some schemes are more efficient than others. In particular, for monotonically missing data, convergence is immediate if variables are ordered according to their number of missing cases. Rather than reordering the data itself, it is more convenient to change the visiting scheme of the algorithm by the visitSequence argument. In its basic form, the visitSequence argument is a vector of integers in the range 1:ncol(data) of arbitrary length, specifying the sequence of column numbers for one iteration of the algorithm. Any given column may be visited more than once within the same iteration, which can be useful to ensure proper synchronization among variables. It is mandatory that all columns with missing data that are being used as predictors are visited, or else the function will stop with an error.
As an example, rerun the code of Section 3.4 to obtain imputed data imp that allow for the interaction bmi.chl. If visitSequence is not specified, the mice() function imputes the data from left to right. In this case, bmi.chl is calculated after chl is imputed, so at that point bmi.chl is synchronized with both bmi and chl. Note however that bmi.chl is not synchronized with bmi when hyp is imputed, so bmi.chl then does not represent the current interaction effect. This could result in wrong imputations. We can correct this by including an extra visit to bmi.chl after bmi has been imputed, e.g., vis = c(2, 5, 3, 4, 5), so that bmi.chl is properly updated. By the way, a more efficient ordering of the variables is

R> imp <- mice(nhanes2.ext, meth = meth, pred = pred, vis = c(2, 4, 5, 3))

iter imp variable
  1   1  bmi  chl  bmi.chl  hyp
  1   2  bmi  chl  bmi.chl  hyp
  1   3  bmi  chl  bmi.chl  hyp
...
When the missing data pattern is close to monotone, convergence may be speeded up by visiting the columns in increasing order of the number of missing data. We can specify this order with the "monotone" keyword as

R> imp <- mice(nhanes2.ext, meth = meth, pred = pred, vis = "monotone")

iter imp variable
  1   1  hyp  bmi  chl  bmi.chl
  1   2  hyp  bmi  chl  bmi.chl
  1   3  hyp  bmi  chl  bmi.chl
...

Dry run
A dry run is a call to mice() with the maximum number of iterations maxit set to zero by

R> ini <-mice(nhanes2, maxit = 0)
A dry run is a fast way to create the mids object ini containing the default settings. The default settings of the attributes of this mids object, like ini$method, ini$predictorMatrix, ini$post, and ini$visitSequence can be used to define user-specific settings. Especially for datasets with many variables this is easier than defining these settings from scratch. This technique was already used in many examples in Section 3.
A mids object obtained with a dry run can also be used to manually include imputations obtained from other software. It is essential that external imputations are stored in the proper format of the mids object. For example, the matrix import contains two imputations for nine missing values in bmi generated by other software. It can be assigned to the mids object imp by

R> import <- matrix(c(30, 30, 30, 29, 25, 21, 25, 25, 22, 33, 27, 22, 27,
+    35, 27, 20, 27, 30), byrow = TRUE, nr = 9)
R> imp <- mice(nhanes, print = FALSE, seed = 77172)
R> imp$imp$bmi[, 1:2] <- import
R> imp$imp$bmi

It is important to realize that this technique assumes that the order of imputed values across software systems is identical. Section 7 shows some alternative ways to interact with other software.
The altered mids object can now be used as input for the MICE algorithm by calling the mice.mids function (see Section 4.2). Note that this requires all empty cells to be imputed, otherwise the sampler will get stuck on empty cells. Also, the altered mids object can be used for repeated complete data analyses by calling the with.mids function. For this use, empty cells in the imputation can be encoded by NA. The complete data method should then be able to handle the missing data correctly.

Step by step
The function mice.mids() takes a mids object as input, iterates maxit iterations and produces another mids object as output. This function enables the user to split up the computations of the MICE algorithm into smaller parts by providing a stopping point after every full iteration. There are various circumstances in which this might be useful:

- For large data, RAM memory may become exhausted if the number of iterations is large. Returning to the prompt/session level may alleviate these problems.
- The user wants to compute special convergence statistics at intermediate points, e.g., after each iteration, for monitoring convergence.
- For computing a 'few extra iterations'.
The imputation model itself is specified in the mice() function and cannot be changed with mice.mids() (well, actually you can by tweaking the mids object directly, but this is not recommended). The state of the random generator is saved with the mids object.
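A minimal sketch of splitting a run into steps (the seed is illustrative):

R> imp1 <- mice(nhanes, maxit = 1, seed = 612, print = FALSE)
R> imp2 <- mice.mids(imp1, maxit = 3, print = FALSE)

Because the state of the random generator is saved, imp2 is equivalent to a single call with maxit = 4 and the same seed.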

Assessing convergence
There is no clear-cut method for determining whether the MICE algorithm has converged. What is often done is to plot one or more parameters against the iteration number. The mice() function produces m parallel imputation streams. The mids object contains components chainMean and chainVar with the mean and variance of the imputations per stream, respectively. These can be plotted by the plot() method for mids objects. On convergence, the different streams should be freely intermingled with each other, without showing any definite trends. Convergence is diagnosed when the variance between different sequences is no larger than the variance within each individual sequence.
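A minimal sketch (the seed and number of iterations are illustrative):

R> imp <- mice(nhanes, maxit = 20, print = FALSE, seed = 1)
R> plot(imp)

Each panel plots chainMean or chainVar for one variable against the iteration number, with one line per stream.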
Inspection of the streams may reveal particular problems of the imputation model. Section 3.4 shows that passive imputation requires a carefully specified predictor matrix. The code below is a pathological example where the MICE algorithm is stuck at the initial imputation.
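A sketch in the spirit of that example: hgt, wgt and bmi are all defined passively in terms of each other, so no new information ever enters the system and the chains remain at their starting values (the exact lines are our reconstruction):

R> ini <- mice(boys, max = 0, print = FALSE)
R> meth <- ini$meth
R> meth["bmi"] <- "~I(wgt / (hgt / 100)^2)"
R> meth["wgt"] <- "~I(bmi * (hgt / 100)^2)"
R> meth["hgt"] <- "~I(100 * sqrt(wgt / bmi))"
R> imp <- mice(boys, meth = meth, maxit = 20)
R> plot(imp)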
By comparison, Figure 9 is an example of healthy convergence for the same three variables. The model fitted here is the imp.idx solution from Section 3.4, where bmi is passively imputed without feedback into hgt and wgt. There is very little trend and the streams mingle very well right from the start. Figure 10 shows the convergence plot for hc, gen and phb from the same solution. There is a strong initial trend for gen and phb. The MICE algorithm initializes the solution by taking a random draw from the observed data. For gen and phb these values are far too high. The algorithm quickly picks up the strong relation between age, gen and phb, and has more or less reached the appropriate level within about five iterations. This demonstrates that convergence can be very fast, even if the starting imputations are clearly off-target. Plotting somewhat longer iteration sequences will generally convey a good idea whether the between-imputation variability has stabilized and whether the estimates are free of trend.
Note that one iteration of the MICE algorithm involves a lot of work. For each variable and each repeated imputation a statistical model is fitted, imputations are drawn and the data are updated. Fortunately, the needed number of main iterations is typically much lower than is common in modern MCMC techniques, which often require thousands of iterations. The key to fast convergence is to achieve independence in the imputations themselves. Univariate imputation procedures create imputations that are already statistically independent for a given value of the regression parameters. The number of iterations should be large enough to stabilize the distributions of these parameters. Simulation work using moderate amounts of missing data yields satisfactory performance with just 5 or 10 iterations (Brand 1999; van Buuren et al. 2006b). In many cases, we can obtain good results with just a few iterations, but it does not hurt to calculate some extra iterations to assess convergence over longer stretches. For large amounts of missing data or for remotely connected data (e.g., file matching) convergence can be slower.
The default plot of the mids object plots the mean and variance of the imputations. This may not correspond to the parameter of most interest. To check convergence of an arbitrary parameter of interest, one could write a function that loops over mice.mids(), extracts the imputed data with the complete() function, and recomputes the parameter after each iteration using the current imputations. In particular, one could set maxit to 1, generate imputations, compute the statistic of interest on the completed data, save the result, compute the second iteration using mice.mids(), and so on.
As an example, suppose that we are interested in the association between gen and phb as measured by Kendall's τ. We can monitor convergence of the MICE algorithm with respect to Kendall's τ by the loop sketched below; Figure 11 shows the resulting traces.

Figure 11: Convergence of the MICE algorithm with respect to Kendall's τ between gen and phb.
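A sketch of such a monitoring loop (the seed and the number of iterations are illustrative; the ordered factors are converted to numeric level codes before computing τ):

R> imp <- mice(boys, maxit = 1, print = FALSE, seed = 9212)
R> tau <- matrix(NA, nrow = 20, ncol = imp$m)
R> for (i in 1:20) {
+    if (i > 1) imp <- mice.mids(imp, maxit = 1, print = FALSE)
+    for (h in 1:imp$m) {
+      com <- complete(imp, h)
+      tau[i, h] <- cor(as.numeric(com$gen), as.numeric(com$phb),
+        method = "kendall")
+    }
+  }
R> matplot(tau, type = "l", xlab = "Iteration", ylab = "Kendall's tau")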

Solving problems with the data
The MICE algorithm performs checks on the data before imputing: missing data should be present, the data should consist of at least two columns, and constant predictors are not allowed. Version 2.0 implemented more stable algorithms for imputing categorical data, which had caused a lot of warnings in previous versions.
One source of confusion is that the mice() function chooses the imputation method according to the type of the variable. Categorical variables should be coded as factors in order to invoke polyreg; otherwise pmm will be selected. Variables with many categories (> 20) are often problematic. Internally, the mice() function will create a dummy variable for each category, which can produce very large matrices, time-consuming calculations, empty cell problems and instability of the algorithm. For variables with many categories, we therefore recommend pmm.
The mice() function was programmed for flexibility, not to minimize the use of time or resources. Combined with the greedy nature of R in general and the fact that the method does not use compiled functions, this poses some limits on the size of the data sets and the type of data that can be analyzed. The execution time of mice() depends primarily on the choice of the univariate imputation methods. Methods like pmm and norm are fast, while others, like logreg and polyreg, can be slow. If there are many categorical variables, each with many categories, one can speed up the algorithm considerably by imputing some (or all) of these variables as numerical variables with pmm instead of polyreg. Since pmm imputes only values that have been observed, the original categories of the variable are preserved.
A frequent source of problems is predictors that are (almost) linearly related. For example, if one predictor can be written as a linear combination of some others, this results in messages like Error in solve.default(t(xobs) %*% xobs) : system is computationally singular: reciprocal condition number = 2.87491e-20 or Error in solve.default(t(xobs) %*% xobs) : Lapack routine dgesv: system is exactly singular. The solution to collinearity is to eliminate the duplicate information, for example by specifying a reduced set of predictors via the predictorMatrix argument. Sometimes the source of the problem is obvious, for example if there are sum scores. However, finding the sets of nearly dependent variables can prove to be difficult and laborious. One trick that we use in practice is to study the last eigenvector of the covariance matrix of the incomplete data (after listwise deletion). Variables with high loadings on that eigenvector often cause the problems. Alternatively, one could revert to collinearity diagnostics like the VIF (Kleinbaum et al. 1988) or graphical displays (Cook and Weisberg 1999). Pedhazur (1973) provides a good discussion of multicollinearity. Version 2.5 incorporates new functions for trapping multicollinearity, so these problems should be a thing of the past.
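As an illustration of the eigenvector trick, the following sketch (for a hypothetical incomplete data frame data) inspects the eigenvector belonging to the smallest eigenvalue:

R> cc <- na.omit(data)                   # listwise deletion; 'data' is hypothetical
R> ev <- eigen(cov(data.matrix(cc)))     # eigenvalues sorted in decreasing order
R> round(ev$vectors[, ncol(ev$vectors)], 2)  # large entries flag near-dependencies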
If all else fails, use the debugger. The functions in mice are written entirely in R. This allows you to trace every detail of the calculations, and eventually track down the source of the problem. This is your last resort, and it requires you to understand our code, which can be difficult at times, even for us. Use traceback() to find out what R was doing at the time the error occurred. In order to use the debugger, set options(error = dump.frames) before the error occurs, and then type debugger(). This lets you choose the local frame where the error occurred and inspect the current values of the local variables. If you have suggestions for improvement, please let us know.

Checking your imputations
Once the algorithm has converged, it is important to inspect the imputations. In general, a good imputed value is a value that could have been observed had it not been missing. In this phase, it is often useful to report results to the investigator who understands the science behind the data. Since there are so many aspects that we could look at, we have refrained from implementing specific tools. With the powerful graphical functions in lattice it is however not difficult to make diagnostic plots for checking the imputations. We give some examples below, following the ideas of Raghunathan and Bondarenko (2007).
The MAR assumption can never be tested from the observed data. One can however check whether the imputations created by the MICE algorithm are plausible. As a first step, one can plot densities of both the observed and imputed values of all variables to see whether the imputations are reasonable. Differences in the densities between the observed and imputed values may point to a problem that needs further investigation. The densityplot() function from the lattice package can be applied to mids objects to produce Figure 12:

R> densityplot(imp.kendall, scales = list(x = list(relation = "free")),
+    layout = c(5, 1))

Figure 12 shows that the imputed values can be quite different from the observed data. For example, the imputed heights center around 90 cm, which is due to the fact that some of the values of the two-year olds were missing. The same holds for testicular volume (tv), which was not measured below the age of 8 years. Conversely, the imputed values for head circumference (hc) are higher, since hc was not measured in the older boys. In general, plots like these are useful for detecting interesting differences between the observed and imputed data.
Another diagnostic tool involves comparing the distributions of observed and imputed data conditional on the propensity score (Raghunathan and Bondarenko 2007). The idea is that the conditional distributions should be similar if the assumed model for creating multiple imputations is a good fit. The following statements create the missing data indicator hc.na of hc in the global environment, and calculate the propensity scores:

R> hc.na <- is.na(boys$hc)
R> fit.hc <- with(imp.kendall, glm(hc.na ~ age + wgt + hgt + reg,
+    family = binomial))
R> ps <- rep(rowMeans(sapply(fit.hc$analyses, fitted.values)), 6)

Figure 13 plots head circumference (observed and imputed) against the propensity score, where the propensity score is taken as the average over the imputations. By definition, there are more imputed values on the right-hand side of the display than on the left-hand side. The striped pattern in the distribution of the propensity score is related to region (variable reg). More missing data occur in the city and in the western part of the country. If the imputation model fits well, we expect that, for a given propensity score, the distributions of the observed and imputed data conform well. This appears to be the case here, so this plot provides evidence that the imputations are reasonable. Realize however that the comparison is only as good as the propensity score. If important predictors are omitted from the nonresponse model, then we cannot see the potential misfit related to them.
It is also useful to study the residuals of the relation depicted in Figure 13. Under MAR we expect that the spread of the residuals will be similar (though not identical) for the observed and imputed data, so their distributions should overlap. The relation between the propensity score and head circumference is clearly not linear, so we add polynomial terms to account for the nonlinearity.
R> hc <- complete(imp.kendall, "long", TRUE)$hc
R> fit <- lm(hc ~ poly(ps, 4))

The figure displays the distributions of the residuals in the observed and imputed data. The amount of overlap is large, lending credit to the notion that the spread of the imputations is appropriate.
More methods for diagnostic checking can be found in Section 2.4. Also consult the methods for checking multiply imputed values proposed by Abayomi et al. (2008); these too are easily implemented using standard tools.

After MICE
Imputations are stored in an object of class mids. The next step is to analyze the multiply imputed data with the model of scientific interest. The function with.mids() takes the imputed data, fits the model to each of the m imputed data sets, and returns an object of class mira. Section 5.1 deals with this step. The result is m analyses instead of one, so the last task is to combine these analyses into one final result. The function pool() takes an object of class mira and produces a combined result in an object of class mipo. Section 5.3 provides the details.

Repeated data analysis
Performing the desired analysis repeatedly for each imputed copy of the data can be done with the function with.mids(). This function evaluates an expression in each multiply imputed data set and creates an object of class mira. We impute nhanes2, fit a linear regression predicting chl in each imputed data set, and pool the results as follows:

R> imp <- mice(nhanes2, print = FALSE)
R> fit <- with(imp, lm(chl ~ age + bmi))
R> summary(pool(fit))

The function with.mids() is completely general, and replaces lm.mids() and glm.mids(). For compatibility reasons, the latter two functions remain available.

Extracting imputed data
An alternative is to export the imputed data into a conventional rectangular form, followed by the analysis of the imputed data. The complete() function extracts imputed data sets from a mids object, and returns the completed data as a data frame. For example,

R> com <- complete(imp, 3)
extracts the third completed data set from the multiply imputed data in imp. Specifying

R> com <- complete(imp, "long")

produces a long matrix com in which the m completed data matrices are vertically stacked, with the imputation number stored in a column called .imp. This form is convenient for making point estimates and for exporting multiply imputed data to other software. Other options are broad and repeated, which produce completed data in formats that are convenient for investigating between-imputation patterns. One can optionally include the original incomplete data.
An alternative way of creating the above contingency tables with complete() is:

R> com <- complete(imp, "long", include = TRUE)
R> by(cbind(age = com$age, ov = cut(com$bmi, c(10, 25, 50))),
+    com$.imp, table)

Pooling

Rubin (1987) developed a set of rules for combining the separate estimates and standard errors from each of the m imputed data sets into an overall estimate with standard error, confidence intervals and p values. These rules are based on asymptotic theory for the normal distribution, and are implemented in the functions pool() and pool.scalar().
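For a scalar parameter these rules take a simple form. Writing $\hat{Q}_l$ and $U_l$ for the estimate and its variance in the $l$-th imputed data set, the pooled estimate and its total variance are

\[
\bar{Q} = \frac{1}{m}\sum_{l=1}^{m}\hat{Q}_l, \qquad
T = \bar{U} + \left(1 + \frac{1}{m}\right)B,
\]

where $\bar{U} = m^{-1}\sum_{l=1}^{m} U_l$ is the average within-imputation variance and $B = (m-1)^{-1}\sum_{l=1}^{m}(\hat{Q}_l-\bar{Q})^2$ is the between-imputation variance. Inference for $\bar{Q}$ is based on a t reference distribution.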

mira objects
The function pool() takes an object of class mira and creates an object of class mipo (multiply imputed pooled outcomes). For example:

R> fit <- with(imp, lm(chl ~ age + bmi))
R> est <- pool(fit)

In addition, the pool() function also works for objects of class lme defined in the package nlme. It is possible to pool the fixed coefficients of a linear mixed model according to Rubin's rules. By default the number of degrees of freedom is calculated using the method of Barnard and Rubin (1999).

Scalars
The function pool.scalar() pools univariate estimates of m repeated complete-data analyses according to Rubin's rules. The arguments of the function are two vectors containing the m repeated complete-data estimates and their corresponding m variances. The function returns a list containing the pooled estimate, the between-, within- and total variance, the relative increase in variance due to nonresponse, the degrees of freedom for the t reference distribution, and the fraction of missing information due to nonresponse. This function is useful when the estimated parameters are not obtained through one of the regular R modeling functions; a short example follows.
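A minimal sketch, with made-up numbers for illustration (the component names qbar and t are those returned by pool.scalar() in mice 2.x):

R> Q <- c(28.1, 27.6, 28.4, 27.9, 28.2)   # m = 5 complete-data estimates
R> U <- c(0.52, 0.49, 0.55, 0.50, 0.53)   # their complete-data variances
R> pooled <- pool.scalar(Q, U)
R> pooled$qbar   # pooled estimate
R> pooled$t      # total variance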

Explained variance R^2
For combining estimates of R^2 and the adjusted R^2, one can use the function pool.r.squared(). The method is based on the familiar Fisher z transformation for correlations. Its properties in the context of multiple imputation were studied by Harel (2009).
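For example, applied to a mira object of repeated lm() analyses (a sketch, reusing the fit object from above):

R> fit <- with(imp, lm(chl ~ age + bmi))
R> pool.r.squared(fit)                    # pooled R^2
R> pool.r.squared(fit, adjusted = TRUE)   # pooled adjusted R^2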

Model testing
The function pool.compare() compares two nested models fitted on multiply imputed data. The function implements the Wald test and the likelihood ratio test, and can be used to test whether one or more variables should be present in the complete-data model. Variable selection in the context of multiple imputation is somewhat different. Several strategies have been proposed that count the number of times that a variable appears in the model (Brand 1999; Heymans et al. 2007; Wood et al. 2008). The pool.compare() function provides an alternative that takes the between-imputation variability into account.
The Wald test can be used when the completed-data estimates and their covariance matrices are known (e.g., for estimates obtained with lm), and when the dimensionality of the estimates is not too high. The pool.compare() function with the argument method = "Wald" pools the p values for comparing the nested models using the method of Li et al. (1991).
R> imp <- mice(nhanes2, print = FALSE, m = 50, seed = 219)
R> fit0 <- with(data = imp, expr = lm(bmi ~ age + hyp))
R> fit1 <- with(data = imp, expr = lm(bmi ~ age + hyp + chl))
R> stat <- pool.compare(fit1, fit0, method = "Wald")
R> stat$p

            [,1]
[1,] 0.005015501

Meng and Rubin (1992) provide a procedure for testing nested hypotheses by likelihood ratio tests from multiply imputed data. The likelihood function needs to be fully specified in order to calculate the likelihood ratio statistic at the average over the imputations of the parameter estimates, under both the null and alternative hypotheses. The current version of pool.compare() implements the likelihood function for logistic regression, i.e., complete-data models obtained with glm(family = "binomial"). The sketch below illustrates the use of the method on the boys data. The question is whether the factor reg (region, with five levels) should be included in the logistic regression model for the onset of puberty.
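The original example is not reproduced here; the following is a sketch in which the model formula is illustrative (onset of puberty is coded as having passed the first genital stage, and imp.idx is the imputation object from Section 3.4):

R> fit0 <- with(imp.idx, glm(I(gen > levels(gen)[1]) ~ age + hgt,
+    family = binomial))
R> fit1 <- with(imp.idx, glm(I(gen > levels(gen)[1]) ~ age + hgt + reg,
+    family = binomial))
R> stat <- pool.compare(fit1, fit0, method = "likelihood", data = imp.idx)
R> stat$pvalue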
Miscellaneous topics

Adding your own imputation functions
Some organizations have made considerable investments to develop procedures for imputing key variables, like income or family size, whose values are subject to all kinds of subtle constraints. Using one of the built-in imputation methods could be a waste of this investment, and may fail to produce what is needed.
It is possible to write your own univariate imputation function, and call this function from within the MICE algorithm. The easiest way to write such a function is to copy and modify an existing mice.impute.xxx() function, for example mice.impute.norm(). Most univariate imputation functions have just three arguments: the variable to be imputed y, the response indicator ry, and the matrix of predictors x (without intercept). The function should return a vector of length(y) - sum(ry) imputations of the correct type. Consult mice.impute.norm() or mice.impute.polyreg() for inspiration. Your new function can be called from within the MICE algorithm through the method argument. For example, calling your function mice.impute.myfunc() for each column can be done by typing

R> mice(nhanes, method = "myfunc")

Using this procedure enables you to speed up the algorithm by including pre-compiled Fortran or C code, or to dump imputation models for closer inspection.
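As a minimal sketch, a user-written function could simply wrap a built-in method. The name myfunc and the rounding step are hypothetical; the argument convention (y, ry, x) follows the built-in functions described above:

R> mice.impute.myfunc <- function(y, ry, x) {
+    ## draw imputations with the built-in Bayesian linear regression
+    ## method, then round them, e.g., for a count variable
+    round(mice.impute.norm(y, ry, x))
+  }
R> imp <- mice(nhanes, method = "myfunc")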

Sensitivity analysis under MNAR
A common misunderstanding about multiple imputation is that it is restricted to MAR. While it is certainly true that imputation techniques commonly assume MAR, the theory of multiple imputation is completely general and also applies to MNAR. Under MNAR, the model fitted to the complete cases is incorrect for the missing cases, and thus cannot be used for imputation. Unless we have external data, there is no way of estimating the amount of error.
A sensible alternative is to set up a number of plausible scenarios, and investigate the consequences of each of them on the final inferences. Chapter 6 in Rubin (1987) contains a number of basic techniques. Up-to-date overviews of specialized MNAR models can be found in Little (2009b) and Albert and Follmann (2009). If the influence under these scenarios is small, then the analysis is said to be robust against the investigated violations of the MAR mechanism.
Suppose that we have reason to believe that our imputations made under the MAR assumption are too low, even after accounting for the predictive information in the available data. One simple trick is to multiply the imputations by a factor. With mice(), this is easily achieved by post-processing imputations through the post argument. The scenario involves increasing the imputations for chl by 0% (MAR), 10%, 20%, 30%, 40% and 50% (MNAR).
R> ini <- mice(nhanes2, maxit = 0, print = FALSE)
R> post <- ini$post
R> k <- seq(1, 1.5, 0.1)
R> est <- vector("list", length(k))
R> for (i in 1:length(k)) {
+    ## multiply the imputations for chl by the factor k[i]; the post
+    ## expression follows the sampler's convention: imp[[j]][, i]
+    ## holds the imputations of variable j for imputation i
+    post["chl"] <- paste("imp[[j]][, i] <-", k[i], "* imp[[j]][, i]")
+    imp <- mice(nhanes2, post = post, seed = 10, print = FALSE)
+    fit <- with(imp, lm(chl ~ age + bmi))
+    est[[i]] <- summary(pool(fit))
+  }

The parameters of interest under these six scenarios are stored in the list est. Inspection of these solutions reveals that the size of k primarily increases the intercept term. There is a slight trend that moves the model estimates towards zero as k goes up. Note that we used a fixed seed value, so all differences between the scenarios are strictly due to k.
Carefully monitor convergence if variables are highly correlated. A higher imputed value in chl will produce a higher than average imputation in bmi and hyp. If the relations are strong, these higher averages will feed back into higher imputations for chl, which are then multiplied by k, and so on. Such feedback could result in explosive behavior of the MICE algorithm, which should be diagnosed and prevented. A remedy is to remove the strongest predictors from the imputation model. Little (2009a) stresses that all MNAR models are subject to a fundamental lack of identification. Fancier models do not make this problem go away. He therefore advocates the use of simple models, like multiplying by a factor (Rubin 1987, pp. 203) or adding offsets. Both are basic forms of pattern-mixture models. Little (2009a, pp. 49) writes: "The idea of adding offsets is simple, transparent, and can be readily accomplished with existing software."

Interacting with other software

Microsoft Excel
Microsoft's Excel is the most widely used tool to store and manipulate data, and it has strong facilities for interactive data editing. We use the software in combination with mice to specify the predictor matrix for imputation problems that involve many variables. Figure 15 shows an example for an imputation problem with longitudinal data, where we used Excel's conditional formatting option for automatic coloring. It is straightforward to transfer the predictor matrix between Excel and R via simple tab-delimited files.
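A sketch of such a round trip (the file name pred.txt and the data frame data are hypothetical):

R> ini <- mice(data, maxit = 0)          # 'data' is the user's data frame
R> write.table(ini$predictorMatrix, "pred.txt", sep = "\t", col.names = NA)
R> ## edit and color pred.txt in Excel, save as tab-delimited text, then:
R> pred <- as.matrix(read.table("pred.txt", sep = "\t", header = TRUE,
+    row.names = 1))
R> imp <- mice(data, predictorMatrix = pred)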

SPSS
SPSS 17 (now called PASW Statistics 17) implements the FCS approach to multiple imputation. See the documentation on the new MULTIPLE IMPUTATION procedure (SPSS Inc. 2008b,a) for details. The procedure is largely based on FCS, and the functionality of mice and the MULTIPLE IMPUTATION procedure overlaps in many ways. A difference is that SPSS has no support for passive imputation or post-processing, though it is possible to specify bounds.
The procedure in SPSS produces a long data set that is recognized by other procedures in SPSS as a multiply imputed data set. Application of general regression models like linear, logistic, GEE analysis, mixed models, Cox regression and general linear models automatically gives pooled estimates of the coefficients, standard errors and p values. Diagnostics like R^2 and model comparisons using the likelihood ratio are not pooled. It is also possible to import a multiply imputed data set into SPSS so that SPSS recognizes it as such. The key to importing data is to create a variable named Imputation_, and use this as the split variable. The function mids2spss() in mice 2.9 converts a mids object into a format recognized by SPSS, and writes the data and the SPSS syntax files. Multiply imputed data can be exported from SPSS to R. Converting such a data set into a mids object has not yet been automated.
SPSS and R form a strong combination. SPSS has superior data handling and documentation functions, whereas R is free, extensible and flexible. Starting from SPSS 16.0, it is possible to integrate both software packages through the SPSS Integration Plug-In for R. A small program to impute data stored in SPSS with mice() looks like the sketch below (run it from the SPSS syntax window).
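The original program is not reproduced here; the following is a minimal sketch assuming the R Integration Plug-In and its spssdata/spssdictionary API, with the dictionary handling for the added columns of the stacked data left as a placeholder:

BEGIN PROGRAM R.
library("mice")
# import the dictionary and the data from the active SPSS data set
dict <- spssdictionary.GetDictionaryFromSPSS()
data <- spssdata.GetDataFromSPSS()
# impute and stack the completed data sets, original data included
imp <- mice(data)
com <- complete(imp, "long", include = TRUE)
# (extend 'dict' here with entries for the columns added to 'com')
spssdictionary.SetDictionaryToSPSS("com", dict)
spssdata.SetDataToSPSS("com", com)
spssdictionary.EndDataStep()
END PROGRAM.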
The code instructs R to import the data and the dictionary from SPSS. Subsequently, mice is loaded and the data are imputed. Finally, the stacked imputed data and their dictionary are exported back from R into SPSS.
mitools
mitools 2.0.1 (Lumley 2010) is a small R package, available from CRAN, that contains tools for analyzing and pooling multiply imputed data. The multiply imputed data should be of class imputationList. A mids object imp can be transformed into an object of class imputationList as follows:

R> library("mitools")
R> mydata <- imputationList(lapply(1:5, complete, x = imp))

Statistical analyses on these data can be done with with():

R> fit <- with(mydata, expr = lm(chl ~ age + bmi))

The pooled outcomes are obtained with summary(MIcombine(fit)). MIcombine() can be used for all fit objects for which a vcov() function exists, as can our function pool(). Since this function does not exist for objects of class lme, pooling of mixed models is not possible with mitools. The package itself has no functions for creating imputations.

Zelig
The package Zelig 3.4-6 (Imai et al. 2008) offers a framework with a simple structure and syntax that encompasses a number of existing statistical methods. One of the functions of Zelig is mi(), which makes a list of imputed data sets of class mi. Zelig does not contain functions for generating multiple imputations. When a data set of class mi is passed to the general modeling function zelig(), it is recognized as an imputed data set, and for a number of modeling functions in Zelig the outcomes are then automatically pooled. It is however not clear for which modeling functions this works (ls.mixed() does not seem to be supported). The code for estimating the propensity score model for hc (cf. Section 4.5) with zelig() is as follows:

R> library("Zelig")
R> imp <- cbind.mids(imp.idx, data.frame(r.hc = is.na(boys$hc)))
R> mydata2 <- mi(complete(imp, 1), complete(imp, 2), complete(imp, 3),
+    complete(imp, 4), complete(imp, 5))
R> fit <- zelig(r.hc ~ age + wgt + hgt + bmi + gen + phb + tv + reg,
+    model = "logit", data = mydata2)
R> summary(fit)

The summary lists the model (logit), the number of multiply imputed data sets (5) and the combined results. The result of zelig() is not as detailed as that of pool(): the confidence intervals of the estimates and the fraction of missing information are not printed.

mi
mi 0.09-14 (Su et al. 2011) is an R package for multiple imputation. The imputation method used in mi is also based on the FCS principle, and, as in Zelig, is called mi(). The package contains modeling functions that pull together the estimates (point estimates and their standard errors) from multiply imputed data sets for several models, like lm(), glm() and lmer(). The package is quite extensive, and complements our package in some respects, e.g., with other facilities for diagnostic checking. The internal structures of class mi and class mids differ, but it would certainly be useful to have conversion functions mi2mids() and mids2mi() in order to combine the strengths of both packages.

Conclusion
The principle of fully conditional specification (FCS) has now gained wide acceptance, and good software has become available. Many applications using FCS have appeared, and many more will follow. This paper documents a major update of mice. FCS has recently been adopted and implemented by SPSS, and was advertised by SPSS as the major methodological improvement of SPSS Statistics 17.
In the years to come, attention will shift from computational issues to the question of how we can apply the methodology in a responsible way. We need guidelines on how to report MI, we need a better understanding of the dangers and limitations of the technique, we need pooling methods for special distributions, and we need entry-level texts that explain the idea and demonstrate how to use the techniques in practice. Assuming that this all happens, multiple imputation using FCS will prove to be a great addition to our statistical tool chest.