Imputation with the R Package VIM

The package VIM (Templ, Alfons, Kowarik, and Prantner 2016) is developed to explore and analyze the structure of missing values in data using visualization methods, to impute these missing values with the built-in imputation methods and to verify the imputation process using visualization tools, as well as to produce high-quality graphics for publications. Thisarticlefocuses on the diﬀerent imputation techniques available in the package. Four diﬀerent imputation methods are currently implemented in VIM , namely hot-deck imputation, k -nearest neighbor imputation, regression imputation and iterative robust model-based imputation (Templ, Kowarik, and Filzmoser 2011). All of these methods are implemented in a ﬂexible manner with many options for customization. Furthermore in this article practical examples are provided to highlight the use of the implemented methods on real-world applications. In addition, the graphical user interface of VIM has been re-implemented from scratch resulting in the package VIMGUI (Schopfhauser, Templ, Alfons, Kowarik, and Prantner 2016) to enable users without extensive R skills to access these imputation and visualization methods.


Introduction
Data sets often include missing values but most statistical methods available in software packages can only be applied to complete data sets. The aim of this paper is not to provide a thorough discussion of single versus multiple imputation; a comparison of both in practice can be found, for example, in Gómez-Carracedo, Andrade, López-Mahía, Muniategui, and Prada (2014). However, some practical issues in this regard are also mentioned in this paper.
In a typical statistical production process, the aim is often to generate one complete data set for further analysis by subject matter specialists, researchers or analysts. Therefore single imputation is still of great importance. Typically, the data is handed from the data collection system to the experts on imputation to perform imputation on the data, before the data is then used by subject matter specialists and made available to researchers, analysts, and is published.
On the other hand, if the estimation of the variance of certain parameters is of interest, the variances might be underestimated using (single) imputed data sets since unique values are predicted for missing values and thereby the uncertainty in the prediction is not reflected (Rubin 1987;Little and Rubin 2014). Multiple imputation methods are designed to reflect also this uncertainty by imputing several values for each missing value, leading to a set of imputed data sets. The variability between these imputations is then taken into account as it reflects the variance of the prediction. The variances of the estimators of interest are then estimated on each imputed data set and these results are combined, e.g., using the formulas given in Rubin (1987).
In many fields the data contains variables of different distributions. For example, data from official statistics, social statistics or econometrics consists of a variety of differently scaled variables. To give an example, demographic variables like gender, age, region, economic status or branch, citizenship, household size, etc., are present in data along with variables on income and income components. For imputing missing values in one variable, all these variables of different scales might be used since they provide essential information.
However, most of these packages are not designed to deal with a mix of continuous, semicontinuous, categorical and count data. Note that for the predictive mean matching method (van Buuren and Groothuis-Oudshoorn 2011) in the package mice, treating semi-continuous variables as continuous variables may not necessarily give worse results. Regarding the mix of variables, an exception is the package mi (Su, Gelman, Hill, and Yajima 2011) which uses a two-part model to deal with semi-continuous variables and which supports a mix of semicontinuous, continuous and categorical variables, further it allows for multiple imputation.
In addition, some of the implemented popular methods cannot be found in those packages, like hot-deck imputation, robust regression imputation or k nearest neighbor (kNN) imputation using generalized distance functions. An exception is the package yaImpute (Crookston and Finley 2008) which has excellent features for kNN methods for continuous scaled variables. Another package, rrp (Iacus and Porro 2007), includes nearest neighbor hot-deck imputation using a random recursive partitioning dissimilarity matrix. Random hot-deck imputation is also available in an SPSS (IBM Corporation 2015) implementation (Myers 2011) and in SAS (SAS Institute Inc. 2013) with SUDAAN's (Research Triangle Institute 2008) SAS-callable PROC HOTDECK (Izrael and Battaglia 2013). However, these packages are not designed to deal with generalized distances, which should be used when dealing with a mix of differently scaled variables.
In addition, the data sets are often of moderate or large size and fast imputation algorithms are needed. Especially, the packages for multiple imputation based on MCMC methods may not provide feasible solutions in terms of computational speed for large data sets.
Another important issue in practice is how imputation methods deal with outliers in continuous variables. The literature on robust statistics clearly shows that methods based on non-robust estimators are highly influenced by outliers (see, e.g., Maronna, Martin, and Yohai 2006). The imputation model can be highly influenced by outliers and the imputed values may deviate from the majority of the data, and can itself be outliers. As a result, the variances of estimators of imputed data can become arbitrary large but also point estimates might be highly influenced by non-robust imputations of continuous variables (Templ et al. 2011).
The R package VIM , available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=VIM, provides solutions for mixed scaled variables, large data sets and data sets including outliers. It includes several improvements with respect to the robustness of the parameter estimation of models to impute missing values. This functionality is included in the EM-based imputation method (see Section 4).
In addition, the package VIM includes two single imputation methods, (random and sequential) hot-deck imputation and kNN imputation using a generalized distance function to consider mixed scaled variables in the data set to impute, and user-specified aggregation functions.
The only method in package VIM that supports multiple imputation is the (robust) EMbased imputation method in Section 4. It works differently from other well-known methods that draw values for imputations from their conditional distributions by Markov chain Monte Carlo (MCMC) techniques (see e.g., van Buuren and Groothuis-Oudshoorn 2011 ;Schunk 2008). Such techniques need longer to converge and are typically more demanding in terms of computer resources. The implemented EM-approach has advantages regarding the speed of "convergence" since expected values are used for imputations and only in a last iteration variability is added to the model (see Section 4 and Templ et al. 2011 for details).
Typically, multiple imputation methods such as those available in packages mice or mix follow a fully conditional specification approach, where iteratively one variable is selected as the response and all other variables serve as predictors. With package VIM also a model for each variable can be specified. Note that package mi includes this additional feature as well.
The package VIM also supports the visual exploration of the data by several visualization methods that have been adapted to work with and show the structure of missing values. Further details on these methods can be found in Templ, Alfons, and Filzmoser (2012). To visualize the structure of missing values and to learn about these structure is highly recommended before missing values are imputed. In addition, the same visualization features can be used to visualize the structure of imputed values to get an impression if the imputed values follows the main structure of the (observed) data.
All methods can be applied by using the point-and-click graphical user interface (available in the package VIMGUI; Schopfhauser et al. 2016) that has been newly implemented from scratch. In addition, objects from the package survey (Lumley 2016(Lumley , 2004 can be imputed. This is especially useful in the area of survey methodology. However, this contribution does not focus on the graphical user interface. The methods in package VIM also have some limitations. There is no direct support for longitudinal data. For such data, package AmeliaII (Honaker et al. 2011) can be mentioned that is designed to impute cross-sectional data and time series. In VIM longitudinal data can only be combined by rows (e.g., using rbind()) before imputation to consider a longitudinal aspect in data imputation. In addition the MI approach implemented in the function irmi is not based on MCMC methods and thus coverage rates might theoretically be biased.
The four implemented imputation methods are described in the following sections, where each section contains theoretical background, implementation details and an example of application for each method. Section 2 describes the hot-deck imputation function of the package VIM. Since the main advantage of this implementation is computational speed, also information on the computational time for a large data set is given. The kNN imputation method is the focus of Section 3. Attention is given to the description of the generalized distance function and an application with a user-defined aggregation function is presented. In Section 4, the usage and implementation of iterative robust model-based imputation is shown on practical applications. Here, user-defined models are specified for each variable including missing values in order to show the flexibility of the implemented functions. Section 5 shows how to impute a single variable by regression imputation using the formula interface. Section 6 concludes by giving an outline on the broad range of applications where package VIM can be used.

Hot-deck imputation
Hot-deck imputation dates back to the days when data sets were saved on punch cards, the hot-deck referring to the "hot" staple of cards (in opposite to the "cold" deck of cards from the previous period). This very heuristic approach was later also studied theoretically, e.g., by Sande (1983), Ford (1983) and Fuller and Kim (2005). Most of the time, hot-deck imputation refers to sequential hot-deck imputation, meaning that the data set is sorted and missing values are imputed sequentially running through the data set line (observation) by line (observation). Nowadays, methods with better theoretical properties are available, but the method is still quite popular due to its simplicity and speed. In a lot of cases the quality of imputation with hot-deck imputation can be similar to nearest neighbor imputation, however as seen in Figure 1 it is much faster. The speed disadvantage of the nearest neighbor approach is mainly due to computing distances between all observations with missing values and the possible donor observations to find the k nearest neighbors (see Section 3 for more details). Therefore hot-deck imputation is very well suited for imputation of big data sets as seen in Figure 1. The computation speed of the hot-deck method is much faster than kNN (described in Section 3) and the model-based methods (described in Section 4). For this small simulation, eight variables are synthetically produced, where ten percent of the values of four variables are set to missing. One variable determines the domain and three variables are used for ordering the data within the hot-deck procedure. For example, for 10,000 observations, the model-based methods need around 150 seconds while hot-deck is below 0.1 seconds (compare Figure 1). The computational time increases linearly with the number of observations in the data sets and it is still below 26 seconds for a data set with 10,000,000 observations (see Figure 1).
The function hotdeck() is the implementation of the popular sequential and random hotdeck algorithm with the option to use it within a domain. It is optimized for usage on big data sets by using the R package data.table (Dowle, Srinivasan, Short, and Lianoglou 2015). Timing on a desktop computer with 8 GB RAM and 3.4 GHz Intel Core i7-2600. The data set consists of two numerical and two categorical variables with 10 percent missing each. Three numerical and one categorical auxiliary variable were used.

How to use hotdeck
The most important arguments are: • data -a data frame or matrix, which contains the data set with missing values in some variables; • variable -a vector of variable names for which missing values should be imputed; • ord_var -a vector of variable names for sorting; • domain_var -a vector of variable names for building domains and to impute within them.
The synopsis of the function is (sensible defaults are given for the available function parameters; for details on all function parameters have a look at the help page): hotdeck(data, variable = NULL, ord_var = NULL, domain_var = NULL, makeNA = NULL, NAcond = NULL, impNA = TRUE, donorcond = NULL, imp_var = TRUE, imp_suffix = "imp")

Application of hotdeck()
As a practical example we want to impute a population database of the Austrian population aged 16 to 64. The data set eusilcP from the package simFrame (Alfons, Templ, and Filzmoser 2010) is used, it is synthetically generated from real Austrian survey data (European Union Statistics on Income and Living Conditions; EU-SILC). The data set is enlarged by sampling 5.5 million observations with replacement from the data set with 39 thousand observations. The data set is split by federal state (region) and gender by using these variables as domain variables. The numeric variables are equalized household income (eqIncome), age and the number of persons in a household (hsize) and are used for sorting the data set. Ten percent missing values are generated in the person's economic status (ecoStat), citizenship and three income components (py010n, py050n and py090n). Detailed information about the data set and the variables can be found in the help file of eusilcP in the package simFrame.

k nearest neighbor imputation
Similar to the hot-deck method, the k nearest neighbor method is based on donor observation. An aggregation of the k values of the nearest neighbors is used as imputed value. The kind of aggregation depends on the type of the variable.
The distance computation for defining the nearest neighbors is based on an extension of the Gower distance (Gower 1971), which can now handle distance variables of the type binary, categorical, ordered, continuous and semi-continuous. The distance between two observations is the weighted mean of the contributions of each variable, where the weight should represent the importance of the variable. Therefore the distance between the ith and the jth observation can be defined as where w k is the weight and δ i,j,k is the contribution of the kth variable.
For continuous variables the absolute distance divided by the total range is used where x i,k is the value of the kth variable of the ith observation and r k is the range of the kth variable. Ordinal variables are converted to integer variables and then the absolute distance divided by the range is computed. The categories are therefore treated as if they were equidistant; this can be changed by manually converting the ordinal variables to integer variables.
For nominal and binary variables a simple 0/1 distance is used Another special type of variables are semi-continuous variables, consisting of a continuously distributed part and probability mass at one point. An example for such a variable might be an income component, which is 0 for some observations and continuously distributed in the remaining observations. The contributions for semi-continuous variables are computed as a mixture of the contribution for nominal and continuous variables where s k is the special value for the kth variable, e.g., the 0 in the income variable example.
As described above, all δ i,j,k are in [0, 1], as a consequence the computed distances d i,j between two observation are also inside this interval.
The distance computation is implemented in C++ with usage of the R package Rcpp (Eddelbuettel, François, Allaire, Chambers, Bates, and Ushey 2011).
The second important part of this method is the aggregation of the k values to one imputed value. For continuous variables the default is the median, but also other statistics are possible e.g., the arithmetic mean. For categorical variables the default method is to use the category with the most occurrences in the k values, if this results in a tie, a category from the tied categories is randomly drawn (function maxCat()). A second implemented method of aggregation is to sample the category from the categories in the k nearest neighbors with probabilities equal to the occurrences in the k values (function sampleCat()).

How to use kNN()
The most important arguments are: • dist_var -a vector of variable names to be used for calculating the distances; • weights -a numeric vector containing a weight for each distance variable; • numFun -a function for aggregating the k nearest neighbors in case of a numerical variable, defaults to the median; • catFun -a function for aggregating the k nearest neighbors in case of a categorical variable, defaults to the function maxCat(); • addRandom -a Boolean variable if an additional variable containing only random numbers is added to avoid multiple selection of the same donor; • useImputedDist -a Boolean variable if an imputed value should be used for distance calculation for imputing another variable. Be aware that this results in a dependency on the ordering of the variables; • weightDist -a Boolean variable if the inverse of the distances of the k nearest neighbours should be used as weights in the aggregation step.
The full synopsis (including sensible defaults) of the function is: kNN(data, variable = colnames(data), metric = NULL, k = 5, dist_var = colnames(data), weights = NULL, numFun = median, catFun = maxCat, makeNA = NULL, NAcond = NULL, impNA = TRUE, donorcond = NULL, mixed = vector(), mixed.constant = NULL, trace = FALSE, imp_var = TRUE, imp_suffix = "imp", addRandom = FALSE, useImputedDist = TRUE, weightDist = FALSE) The method is implemented in a sophisticated manner so that not the whole distance matrix has to be calculated but only distances for observations including missing values, and variables which are chosen by the variable argument. Thus the implementation in VIM is also applicable for reasonably large data sets. However, computation of the distance is in any case more time-consuming than the hot-deck method (Figure 1).

Application of kNN
Again we use the EU-SILC data set for showcasing the imputation method. As mentioned before the function kNN() is versatile in handling different variable types in the distance function, but it is also possible to use aggregation functions for different variable types. For a semi-continuous variable an aggregation function might look like the function medianMixed(). It returns 0 with a probability equivalent to the relative frequency of 0 in the k nearest neighbors and the median of the non-0 observations otherwise.
Again missing values are introduced. The user-defined aggregation function (medianMixed()) is given in the following code snippet and kNN() is applied where all variables defined in the object samp are used for distance calculation. In this case we do not use the function argument weights to weight the variables for distance calculations. However if the importance of the variables is considered to differ substantially, this can be handled by this argument.

Iterative robust model-based imputation
This iterative regression imputation method is described in detail in Templ et al. (2011). In each step of the iteration (inner loop), one variable is used as a response variable and the remaining variables serve as the regressors. The procedure is repeated until the algorithm converges (outer loop). The data can consist of a mix of binary, categorical, count, continuous and semi-continuous variables, appropriate regression methods are selected internally by the algorithm. Robust regression using MM-estimation (Maronna et al. 2006) is used by default to get reliable results even if the data contains outliers. This method is most useful to get reliable imputations in an automated manner. However, if the users are trained in regression modeling, a model for each variable can also be specified.

How to use irmi()
The most important arguments are: • robust -a Boolean variable to enable or disable robust regression; • step -a Boolean variable to enable or disable a step-wise (stepAIC) selection of regressors in each iteration; • mixed -column index of the semi-continuous variables: • count -column index of the count variables; • modelFormulas -a named list with the name of variables for the right-hand side of the formulas. The list must contain a right-hand side formula for each variable with missing values and it should look like list(y1 = c("x1", "x2"), y2 = c("x1", "x3")) if factor variables for the mixed variables should be created for the regression models; • mi -the number of multiple imputed values.
The latter two arguments are important for selecting the correct regression method, which is done within the algorithm in an automatized manner hidden from the user. If the data contains semi-continuous or count variables, this must be specified while the algorithm detects the correct distribution for all other types (continuous, binary, categorical).

Application of irmi()
The data set ses from the package laeken (Alfons and Templ 2013) is a synthetically generated data for the Austrian structural earnings survey. It contains variables of various kinds. Ten percent of the following four variables are set to missing: • location -the NUTS3 region in Austria; • earningsMonth -the gross earnings in the reference month; • earningsOvertime -the gross earnings related to overtime; • overtimeHours -the number of paid overtime hours in the reference month.

Individual regression imputation
This method is based on using well-known regression models to impute missing values on a per variable basis. For continuous variables a linear model is internally fitted with the function lm() or robustly with the function lmrob() from the R package robustbase (Rousseeuw et al. 2016). The functions glm() and glmrob() (again from the package robustbase) are used when a family is specified. For ordinal variables the function multinom() from the R package nnet (Venables and Ripley 2002) is used.

How to use regressionImp
The most important arguments are: • formula -model formula to impute one variable; • robust -TRUE/FALSE indicating if robust regression should be used; • family -family argument passed to glm() or glmrob(). The default value "AUTO" tries to select the correct model family automatically.

Application of regressionImp
The data set of the structural earnings survey of Austria is used again. Ten percent missing values are introduced in two variables earningsHour -the hourly earnings -and NACE1 -the economic branch. These two variables are imputed using a linear regression model for the first variable and multinomial regression for the second variable.

A short note on the graphical user interface
The R package VIMGUI implements a graphical user interface for the R package VIM. The imputation methods described in this paper and the visualization techniques described in Templ et al. (2012) are accessible via an easy-to-use point and click user interface.
Some of the additionally supported features of the graphical user interface are: Import/Export: Import of .csv files with interactive selection of parameters for importing.
The import and export of SPSS, SAS (XPORT format), Stata (StataCorp. 2015) and R binary files is supported. Objects of class 'survey', from the R package survey (Lumley 2016(Lumley , 2004 for analyzing data from sample surveys, can be used as input for the imputation process through the menu entry Survey where such survey objects can be imported to VIMGUI and exported. Facilities to create a survey object are included as well.
Scaling/Transformation: Standardization and/or transformation of continuous and semicontinuous variables.

Script:
The results produced within the GUI are saved as commands in a separated file. This is useful to provide reproducibility.

Methods:
A dozen of univariate, bivariate, multiple and multivariate diagnostic plots are available to evaluate the structure of missing values (∼ missing values diagnostics) and imputed values (∼ imputed values diagnostics).
As the graphical user interface is still under development, improvement to its functionality can be expected with future versions of the package.

Conclusions
The VIM (and VIMGUI) package includes a comprehensive collection of imputation (and visualization methods). Before imputation, the structure of missing values can be explored using the built-in visualization tools. Methods for the imputation of missing values are available and special attention was given to efficient implementations of the methods.
In summary, the package VIM can be widely applied since it • can be used to impute incomplete data sets with continuous, semi-continuous, categorical, ordered-categorical, binary or count variables; • is highly customizable by providing own functions and models; • is optimized for large data sets, i.e., it includes efficient implementations of algorithms using the R packages Rcpp (Eddelbuettel et al. 2011) and data.table (Dowle et al. 2015); • is possible to apply the imputation methods either to data frames or objects from the package survey; • can be used by users with no experience in R via the package VIMGUI; • includes a lot of visualization features for analyzing the structure of missing values (here we refer to Templ et al. 2012); • even not shown in this paper, VIM include a lot of visualization features for analyzing the imputed values. Imputations are highlighted in the plots, similar to the methods in Templ et al. (2012) where missing values are highlighted.
The packages VIM and VIMGUI are currently widely used around the world. The download statistics for the R package downloads over one of many download mirrors -the RStudio server -shows that the package has been downloaded more than 150 times a week over the last two years (package updates also contribute to this number).