Abstract

Amelia II is a complete R package for multiple imputation of missing data. The package implements a new expectation-maximization with bootstrapping algorithm that works faster, with larger numbers of variables, and is far easier to use than various Markov chain Monte Carlo approaches, but gives essentially the same answers. The program also improves imputation models by allowing researchers to put Bayesian priors on individual cell values, thereby including a great deal of potentially valuable and extensive information. It also includes features to accurately impute cross-sectional datasets, individual time series, or sets of time series for different cross-sections. A full set of graphical diagnostics is also available. The program is easy to use, and the simplicity of the algorithm makes it far more robust; both a simple command line interface and an extensive graphical user interface are included.


Introduction
Missing data is a perennial problem in social science data. Respondents do not answer every question, countries do not collect statistics in every year and, most unfortunately, researchers do not always have the resources to collect every piece of available data. Most statistical analysis methods, however, assume the absence of missing data. Amelia II allows users to impute ("fill in" or rectangularize) incomplete data sets so that analyses which require complete observations can appropriately use all the information present in a dataset with missingness, and avoid the biases, inefficiencies, and incorrect uncertainty estimates that can result from dropping all partially observed observations from the analysis.
Amelia II performs multiple imputation, a general-purpose approach to data with missing values. Multiple imputation has been shown to reduce bias and increase efficiency compared to listwise deletion. Furthermore, ad hoc methods of imputation, such as mean imputation, can lead to serious biases in variances and covariances. Unfortunately, creating multiple imputations can be a burdensome process due to the technical nature of the algorithms involved. Amelia provides users with a simple way to create an imputation model, implement it, and check its fit using diagnostics.
The Amelia II program goes several significant steps beyond the capabilities of the first version of Amelia (Honaker, Joseph, King, Scheve and Singh, 1998-2002). For one, the bootstrap-based EMB algorithm included in Amelia II can impute many more variables, with many more observations, in much less time. The great simplicity and power of the EMB algorithm made it possible to write Amelia II so that it virtually never crashes (which to our knowledge makes it unique among all existing multiple imputation software) and is much faster than the alternatives too. Amelia II also has features to make valid and much more accurate imputations for cross-sectional, time-series, and time-series-cross-section data, and allows the incorporation of observation-level and data-matrix-cell-level prior information. In addition to all of this, Amelia II provides many diagnostic functions that help users check the validity of their imputation model. This software implements the ideas developed in Honaker and King (2010).

What Amelia Does
Multiple imputation involves imputing m values for each missing cell in your data matrix and creating m "completed" data sets. (Across these completed data sets, the observed values are the same, but the missing values are filled in with different imputations that reflect the uncertainty about the missing data.) After imputation with Amelia II's EMB algorithm, you can apply whatever statistical method you would have used if there had been no missing values to each of the m data sets, and use a simple procedure, described below, to combine the results. Under normal circumstances, you only need to impute once and can then analyze the m imputed data sets as many times and for as many purposes as you wish. The advantage of Amelia II is that it combines the comparative speed and ease-of-use of our algorithm with the power of multiple imputation, to let you focus on your substantive research questions rather than spending time developing complex application-specific models for nonresponse in each new data set. Unless the rate of missingness is very high, m = 5 (the program default) is probably adequate.

Assumptions
The imputation model in Amelia II assumes that the complete data (that is, both observed and unobserved) are multivariate normal. If we denote the (n × k) dataset as D (with observed part D^obs and unobserved part D^mis), then this assumption is

D ∼ N_k(µ, Σ), (1)

which states that D has a multivariate normal distribution with mean vector µ and covariance matrix Σ. The multivariate normal distribution is often a crude approximation to the true distribution of the data, yet there is evidence that this model works as well as other, more complicated models even in the face of categorical or mixed data (see Schafer, 1997; Schafer and Olsen, 1998). Furthermore, transformations of the data can often make this normality assumption more plausible (see 5.3 for more information on how to implement this in Amelia).
The essential problem of imputation is that we only observe D^obs, not all of D. In order to gain traction, we need to make the usual assumption in multiple imputation that the data are missing at random (MAR). This assumption means that the pattern of missingness depends only on the observed data D^obs, not on the unobserved data D^mis. Let M be the missingness matrix, with cells m_ij = 1 if d_ij ∈ D^mis and m_ij = 0 otherwise. Put simply, M is a matrix that indicates whether or not a cell is missing in the data. With this, we can define the MAR assumption as

p(M|D) = p(M|D^obs). (2)

Note that MAR includes the case when missing values are created randomly by, say, coin flips, but it also includes many more sophisticated missingness models. When missingness is not dependent on the data at all, we say that the data are missing completely at random (MCAR). Amelia requires both the multivariate normality and the MAR assumption (or the simpler special case of MCAR). Note that the MAR assumption can be made more plausible by including more information in the imputation model.

Algorithm
In multiple imputation, we are concerned with the complete-data parameters, θ = (µ, Σ). When writing down a model of the data, it is clear that our observed data are actually D^obs and M, the missingness matrix. Thus, the likelihood of our observed data is p(D^obs, M|θ). Using the MAR assumption, we can break this up:

p(D^obs, M|θ) = p(M|D^obs) p(D^obs|θ). (3)

As we only care about inference on the complete-data parameters, we can write the likelihood as L(θ|D^obs) ∝ p(D^obs|θ), which we can rewrite using the law of iterated expectations as

p(D^obs|θ) = ∫ p(D|θ) dD^mis. (4)
With this likelihood and a flat prior on θ, we can see that the posterior is

p(θ|D^obs) ∝ p(D^obs|θ) = ∫ p(D|θ) dD^mis. (5)
The main computational difficulty in the analysis of incomplete data is taking draws from this posterior. The EM algorithm (Dempster, Laird and Rubin, 1977) is a simple computational approach to finding the mode of the posterior. Our EMB algorithm combines the classic EM algorithm with a bootstrap approach to take draws from this posterior. For each draw, we bootstrap the data to simulate estimation uncertainty and then run the EM algorithm to find the mode of the posterior for the bootstrapped data, which gives us fundamental uncertainty too (see Honaker and King (2010) for details of the EMB algorithm).
Once we have draws of the posterior of the complete-data parameters, we make imputations by drawing values of D^mis from its distribution conditional on D^obs and the draws of θ, which is a linear regression with parameters that can be calculated directly from θ.

Analysis
In order to combine the results across m data sets, first decide on the quantity of interest to compute, such as a univariate mean, regression coefficient, predicted probability, or first difference. Then, the easiest way is to draw 1/m of the total number of simulations of q from each of the m data sets, combine them into one set of simulations, and then use the standard simulation-based methods of interpretation common for single data sets (King, Tomz and Wittenberg, 2000).
Alternatively, you can combine the estimates directly. The multiple imputation estimate of the parameter, q̄, is the average of the m separate estimates q_j (j = 1, ..., m):

q̄ = (1/m) Σ_{j=1}^{m} q_j. (6)

The variance of the point estimate is the average of the estimated variances from within each completed data set, plus the sample variance in the point estimates across the data sets (multiplied by a factor that corrects for the bias because m < ∞). Let SE(q_j)² denote the estimated variance (squared standard error) of q_j from data set j, and let S²_q = Σ_{j=1}^{m} (q_j − q̄)²/(m − 1) be the sample variance across the m point estimates. The standard error of the multiple imputation point estimate is the square root of

SE(q)² = (1/m) Σ_{j=1}^{m} SE(q_j)² + S²_q (1 + 1/m). (7)
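These combination rules are simple to carry out by hand. Below is a minimal sketch in R with purely hypothetical point estimates and standard errors from m = 5 imputed analyses; recent versions of the Amelia package also provide a mi.meld function implementing these rules.

> q <- c(1.21, 1.38, 1.30, 1.25, 1.34)   # hypothetical estimates of q
> se <- c(0.42, 0.45, 0.40, 0.44, 0.41)  # their estimated standard errors
> m <- length(q)
> q.mi <- mean(q)                        # combined point estimate
> se.mi <- sqrt(mean(se^2) + var(q) * (1 + 1/m))  # combined standard error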

Versions of Amelia
Two versions of Amelia II are available, each with its own advantages and drawbacks, but both of which use the same underlying code. First, Amelia II exists as a package for the R statistical software package. Users can utilize their knowledge of the R language to run Amelia II at the command line or to create scripts that will run Amelia II and preserve the commands for future use. Alternatively, you may prefer AmeliaView, where an interactive Graphical User Interface (GUI) allows you to set options and run Amelia without any knowledge of the R programming language. Both versions of Amelia II are available on the Windows, Mac OS X, and Linux platforms, and Amelia II for R runs in any environment that R can. All versions of Amelia require the R software, which is freely available at http://www.r-project.org/.

Installation and Updates
Before installing Amelia II, you must have installed R version 2.1.0 or higher, which is freely available at http://www.r-project.org/.
To install the Amelia package on any platform, simply type the following at the R command prompt,

> install.packages("Amelia")

and R will automatically install the package to your system from CRAN. If you wish to use the most current beta version of Amelia, feel free to install the test version,

> install.packages("Amelia", repos = "http://gking.harvard.edu")

In order to keep your copy of Amelia completely up to date, you should use the command

> update.packages()

Windows -AmeliaView
To install a standalone version of AmeliaView in the Windows environment, simply download the installer setup.exe from http://gking.harvard.edu/amelia/ and run it. The installer will ask you to choose a location to install Amelia II. If you have installed R with the default options, Amelia II will automatically find the location of R. If the installer cannot find R, it will ask you to locate the directory of the most current version of R. Make sure you choose the directory name that includes the version number of R (e.g. C:/Program Files/R/R-2.9.0) and contains a subdirectory named bin. The installer will also put shortcuts on your Desktop and Start Menu.
Even users familiar with the R language may find it useful to utilize AmeliaView to set options on variables, change arguments, or run diagnostics. From the command line, AmeliaView can be brought up with the call:

> AmeliaView()

Linux (local installation)
Installing Amelia on a Linux system is slightly more complicated due to user permissions. If you are running R with root access, you can simply run the above installation procedure. If you do not have root access, you can install Amelia to a local library. First, create a local directory to house the packages,

$ mkdir ~/myrlibrary

and then, in an R session, install the package directing R to this location:

> install.packages("Amelia", lib = "~/myrlibrary")

Once this is complete, you need to edit or create your R profile. Locate or create ~/.Rprofile in your home directory and add this line:

.libPaths("~/myrlibrary")

This will add your local library to the list of library paths that R searches when you load packages.
Linux users can use AmeliaView in the same way as Windows users of Amelia for R. From the command line, AmeliaView can be brought up with the call:

> AmeliaView()


A User's Guide

Data and Initial Results
We now demonstrate how to use Amelia using data from Milner and Kubota (2005), which studies the effect of democracy on trade policy. For the purposes of this user's guide, we will use a subset restricted to nine developing countries in Asia from 1980 to 1999. This dataset includes 10 variables: year (year), country (country), average tariff rates (tariff), Polity IV score (polity), total population (pop), gross domestic product per capita (gdp.pc), gross international reserves (intresmi), a dummy variable for whether the country had signed an IMF agreement in that year (signed), a measure of financial openness (fiveop), and a measure of US hegemony (usheg). These variables correspond to the variables used in the analysis model of Milner and Kubota (2005).

In the presence of missing data, most statistical packages use listwise deletion, which removes any row that contains a missing value from the analysis. Using the base model of Milner and Kubota (2005), we can run a regression of tariff rates under listwise deletion, as sketched below. Note that 60 of the 171 original observations are deleted due to missingness. Most of these observations, however, have some information in them, and multiple imputation will help us retrieve that information and make better inferences.
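As a sketch of these initial results (the freetrade data ship with the Amelia package, and the regression specification mirrors the base model described above; lm drops incomplete rows by default):

> library(Amelia)
> data(freetrade)
> summary(lm(tariff ~ polity + pop + gdp.pc + year + country,
+            data = freetrade))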

Multiple Imputation
When performing multiple imputation, the first step is to identify the variables to include in the model. It is crucial to include at least as much information as in the analysis model. That is, any variable that will be in the analysis model should also be in the imputation model. In fact, it is often useful to add even more information. Since imputation is predictive, any variables that would increase predictive power should be included in the model, even if including them in the analysis model would produce bias (as would, for example, post-treatment variables). In our case, we include all the variables in freetrade in the imputation model, even though our analysis model focuses on polity, pop and gdp.pc.
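Concretely, the call below (a minimal sketch consistent with the options discussed throughout this guide) creates the a.out object used in the remainder of this section, declaring the time and cross-section indices up front:

> a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country")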

Saving imputed datasets
If you need to save your imputed datasets, you can save the entire output object from amelia,

> save(a.out, file = "imputations.RData")

In addition, you can save each of the imputed datasets to its own file using the write.amelia command,

> write.amelia(obj = a.out, file.stem = "outdata")

This will create one comma-separated value file for each imputed dataset in the following manner:

outdata1.csv
outdata2.csv
outdata3.csv
outdata4.csv
outdata5.csv

The write.amelia function can also save files in tab-delimited and Stata (.dta) file formats. For instance, to save Stata files, simply change the format argument to "dta",

> write.amelia(obj = a.out, file.stem = "outdata", format = "dta")

Combining Multiple Amelia Runs
The EMB algorithm is what computer scientists call embarrassingly parallel, meaning that it is simple to separate each imputation into parallel processes. With Amelia it is simple to run subsets of the imputations on different machines and then combine them after imputation for use in the analysis model. This allows for a huge increase in the speed of the algorithm.
For instance, suppose that we wanted to add another ten imputed datasets to our first call to amelia. First, run the function to get these additional imputations,

> a.out.more <- amelia(freetrade, m = 10, ts = "year",
+                      cs = "country", p2s = 0)
> a.out.more

Amelia output with 10 imputed datasets.
Return code: 1
Message: Normal EM convergence.

These additional imputations can then be combined with our original output using the ameliabind function, as described below.
A simple way to execute a parallel processing scheme with Amelia would be to run amelia with m set to 1 on m different machines, save each output using the save function, load them all into the same R session using the load command, and then combine them using ameliabind. In order to do this, make sure to give each of the outputs a different name so that they do not overwrite each other when loaded into the same R session.
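A minimal sketch of this scheme, with hypothetical object and file names:

# On each of several machines or sessions (here, machine 1):
> imp1 <- amelia(freetrade, m = 1, ts = "year", cs = "country", p2s = 0)
> save(imp1, file = "imp1.RData")

# Then, in a single R session on the analysis machine:
> load("imp1.RData")
> load("imp2.RData")
> a.par <- ameliabind(imp1, imp2)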

Screen Output
Screen output can be adjusted with the "print to screen" argument, p2s. At a value of 0, no screen printing will occur. This may be useful in large jobs or simulations where a very large number of imputation models may be required. The default value of 1 lists each bootstrap and displays the number of iterations required to reach convergence in that bootstrapped dataset. The value of 2 gives more thorough screen output, including, at each iteration, the number of parameters that have significantly changed since the last iteration. This may be useful when the EM chain length is very long, as it can provide an intuition for how many parameters still need to converge in the EM chain, and a sense of the time remaining. However, it is worth noting that the last several parameters can often take a significant fraction of the total number of iterations to converge. Setting p2s to 2 will also generate information on how the EM algorithm is behaving, such as a ! when the current estimated complete-data covariance matrix is not invertible and a * when the likelihood has not monotonically increased in that step. Having many of these two symbols in the screen output is an indication of a problematic imputation model.
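For example, to watch detailed progress on a single chain, one might run (a sketch):

> a.out.verbose <- amelia(freetrade, m = 1, ts = "year",
+                         cs = "country", p2s = 2)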

Imputation-improving Transformations
Social science data commonly include variables that fail to fit a multivariate normal distribution. Indeed, numerous models have been introduced specifically to deal with the problems they present. As it turns out, much evidence in the literature (discussed in King et al. 2001) indicates that the multivariate normal model used in Amelia usually works well for the imputation stage even when discrete or nonnormal variables are included and when the analysis stage involves these limited dependent variable models. Nevertheless, Amelia includes some limited capacity to deal directly with ordinal and nominal variables and with variables that require other transformations. In general, nominal and log-transformed variables should be declared to Amelia, whereas ordinal (including dichotomous) variables often need not be, as described below. (For harder cases, see Schafer (1997) for specialized MCMC-based imputation models for discrete variables.) Although these transformations are applied internally to better fit the data to the multivariate normal assumptions of the imputation model, all the imputations that are created will be returned in the original untransformed form of the data. If the user has already performed transformations on their data (such as taking a log or square root prior to feeding the data to amelia), these should not be declared, as that would result in the transformation being applied twice in the imputation model. The fully imputed data sets that are returned will always be in the form of the original data that is passed to the amelia routine.

Ordinal
In much statistical research, researchers treat independent ordinal (including dichotomous) variables as if they were really continuous. If the analysis model to be employed is of this type, then nothing extra is required of the imputation model. Users are advised to allow Amelia to impute non-integer values for any missing data, and to use these non-integer values in their analysis. Sometimes this makes sense, and sometimes this defies intuition. One particular imputation of 2.35 for a missing value on a seven-point scale carries the intuition that the respondent is between a 2 and a 3 and most probably would have responded 2 had the data been observed. This is easier to accept than an imputation of 0.79 for a dichotomous variable where a zero represents a male and a one represents a female respondent. However, in both cases the non-integer imputations carry more information about the underlying distribution than would be carried if we were to force the imputations to be integers. Thus, whenever the analysis model permits, missing ordinal observations should be allowed to take on continuously valued imputations.
In the freetrade data, one such ordinal variable is polity, which ranges from -10 (full autocracy) to 10 (full democracy). If we tabulate this variable from one of the imputed datasets,

> table(a.out$imputations[[3]]$polity)

we can see that there is one imputation between -4 and -3 and one imputation between 6 and 7. Again, the interpretation of these values is rather straightforward even if they are not strictly in the coding of the original Polity data. Often, however, analysis models require some variables to be strictly ordinal, as, for example, when the dependent variable will be used in a logistic or Poisson regression. Imputations for variables set as ordinal are created by taking the continuously valued imputation and using an appropriately scaled version of it as the probability of success in a binomial distribution. The draw from this binomial distribution is then translated back into one of the ordinal categories.
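To request strictly ordinal imputations for polity, declare it with the ords argument (a minimal sketch; the imputed values will then fall within the original integer codes):

> a.out.ord <- amelia(freetrade, m = 5, ts = "year", cs = "country",
+                     ords = "polity")
> table(a.out.ord$imputations[[3]]$polity)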

Nominal
Nominal variables must be treated quite differently than ordinal variables. Any multinomial variable in the data set (such as religion coded 1 for Catholic, 2 for Jewish, and 3 for Protestant) must be specified to Amelia. In our freetrade dataset, we have signed, which is 1 if a country signed an IMF agreement in that year and 0 if it did not. Of course, our first imputation did not limit the imputations to these two categories:

> table(a.out$imputations[[3]]$signed)

In order to fix this for a p-category multinomial variable, Amelia will determine p (as long as your data contain at least one value in each category) and substitute p − 1 binary variables to specify each possible category. These new p − 1 variables will be treated like the other variables in the multivariate normal imputation method chosen and will receive continuous imputations. These continuously valued imputations will then be appropriately scaled into probabilities for each of the p possible categories, and one of these categories will be drawn, whereupon the original p-category multinomial variable will be reconstructed and returned to the user. Thus all imputations will be appropriately multinomial.
For our data we can simply add signed to the noms argument:

> a.out2 <- amelia(freetrade, m = 5, ts = "year", cs = "country",
+                  noms = "signed")

Note that Amelia can only fit imputations into categories that exist in the original data. Thus, if there were a third category of signed, say 2, that corresponded to a different kind of IMF agreement but never occurred in the original data, Amelia could not match imputations to it.
Since Amelia properly treats a p-category multinomial variable as p − 1 variables, one should keep in mind how quickly parameters accumulate if many multinomial variables are used. If the square of the number of real and constructed variables is large relative to the number of observations, it is useful to use a ridge prior, as in section 5.6.1.

Natural Log
If one of your variables is heavily skewed or has outliers that may alter the imputation in an unwanted way, you can use a natural logarithm transformation of that variable in order to normalize its distribution. This transformed distribution helps Amelia to avoid imputing values that depend too heavily on outlying data points. Log transformations are common in expenditure and economic variables where we have strong beliefs that the marginal relationship between two variables decreases as we move across the range.
For instance, figure 2 shows that the tariff variable clearly has positive (or right) skew, while its natural log transformation has a roughly normal distribution.
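To declare this log transformation for tariff within the imputation model (a minimal sketch; the imputations are still returned on the original scale):

> a.out.log <- amelia(freetrade, m = 5, ts = "year", cs = "country",
+                     logs = "tariff")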

Square Root
Event count data is often heavily skewed and has nonlinear relationships with other variables. One common transformation to tailor the linear model to count data is to take the square roots of the counts. This is a transformation that can be set as an option in Amelia.

Logistic
Proportional data is sharply bounded between 0 and 1. A logistic transformation is one possible option in Amelia to make the distribution symmetric and relatively unbounded.
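Both of these transformations are declared in the same way as the log transformation, using the sqrts and lgstc arguments to amelia. A minimal sketch, assuming a hypothetical dataset mydata with an event-count variable counts and a proportion variable prop:

> a.out.tr <- amelia(mydata, sqrts = "counts", lgstc = "prop")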

Identification Variables
Datasets often contain identification variables, such as country names, respondent numbers, or other identification numbers, codes or abbreviations. Sometimes these are text and sometimes these are numeric. Often it is not appropriate to include these variables in the imputation model, but it is useful to have them remain in the imputed datasets. (There are, however, models that would include the ID variables in the imputation model, such as fixed-effects models for data with repeated observations of the same countries.) Identification variables which are not to be included in the imputation model can be identified with the argument idvars. These variables will not be used in the imputation model, but will be kept in the imputed datasets.
If year and country contained no information except labels, we could omit them from the imputation:

> amelia(freetrade, idvars = c("year", "country"))

Note that Amelia will return an error if your dataset contains a factor or character variable that is not marked as a nominal or identification variable. Thus, if we were to omit the factor country from the cs or idvars arguments, we would receive an error:

> a.out2 <- amelia(freetrade, idvars = c("year"))

Amelia Error Code: 38
The variable(s) country are "characters". You may have wanted to set this as an ID variable.

In order to conserve memory, it is wise to remove unnecessary variables from a data set before loading it into Amelia. The only variables you should include in your data when running Amelia are variables you will use in the analysis stage and those variables that will help in the imputation model. While it may be tempting to simply mark unneeded variables as IDs, this only serves to waste memory and slow down the imputation procedure.

Time Series, or Time Series Cross Sectional Data
Many variables that are recorded over time within a cross-sectional unit are observed to vary smoothly over time. In such cases, knowing the observed values of observations close in time to any missing value may greatly aid the imputation of that value. However, the exact pattern may vary over time within any cross-section. There may be periods of growth, stability, or decline, in each of which the observed values would be used in a different fashion to impute missing values. Also, these patterns may vary enormously across different cross-sections, or may exist in some and not others. Amelia can build a general model of patterns within variables across time by creating a sequence of polynomials of the time index. If, for example, tariffs vary smoothly over time, then we make the modeling assumption that there exists some polynomial that describes the economy in cross-sectional unit i at time t as

tariff_it = β_0 + β_1 t + β_2 t^2 + β_3 t^3 + ...

Thus, if we include enough higher-order terms of time, the pattern between observed values of the tariff rate can be estimated. Amelia will create polynomials of time up to the user-defined k-th order (k ≤ 3). We can implement this with the ts and polytime arguments. If we thought that a second-order polynomial would help predict the missing values, we could run

> a.out2 <- amelia(freetrade, ts = "year", cs = "country",
+                  polytime = 2)

With this input, Amelia will add covariates to the model that correspond to time and its polynomials. These covariates will help better predict the missing values. If cross-sectional units are specified, these polynomials can be interacted with the cross-section unit to allow the patterns over time to vary between cross-sectional units. Unless you strongly believe all units have the same patterns over time in all variables (including the same constant term), this is a reasonable setting. When k is set to 0, this interaction simply results in a model of fixed effects where every unit has a uniquely estimated constant term. Amelia does not smooth the observed data, and only uses this functional form, or one you choose, with all the other variables in the analysis and the uncertainty of the prediction, to impute the missing values.
The above code would predict the same trend in a variable for each country. It is clear, however, that each country will have a different time series for tariff rates, for instance. Some countries may start higher than others, and some countries may have dropped dramatically while others remained fairly constant over time. In order to capture this in the style above, we can set intercs to TRUE:

> a.out.time <- amelia(freetrade, ts = "year", cs = "country",
+                      polytime = 2, intercs = TRUE, p2s = 2)

Note that attempting to use polytime without the ts argument, or intercs without the cs argument, will result in an error.

Lags and Leads
An alternative way of handling time-series information is to include lags and leads of certain variables in the imputation model. Lags are variables that take the value of another variable in the previous time period, while leads take the value of another variable in the next time period. Many analysis models use lagged variables to deal with issues of endogeneity, so using leads may seem strange. It is important to remember, however, that imputation models are predictive, not causal. Thus, since both past and future values of a variable are likely correlated with the present value, both lags and leads should improve the model. If we wanted to include lags and leads of tariffs, for instance, we would simply pass this to the lags and leads arguments:

> a.out2 <- amelia(freetrade, ts = "year", cs = "country",
+                  lags = "tariff", leads = "tariff")

Including Prior Information
Amelia has a number of methods for setting priors within the imputation model. Two of these are commonly used and discussed below: ridge priors and observational priors.
5.6.1 Ridge Priors for High Missingness, Small n's, or Large Correlations

When the data to be analyzed contain a high degree of missingness or very strong correlations among the variables, or when the number of observations is only slightly greater than the number of parameters p(p + 3)/2 (where p is the number of variables), results from your analysis model will be more dependent on the choice of imputation model. This suggests more testing of alternative specifications under Amelia in these cases. This can happen when polynomials of time, interacted with the cross section, are included in the imputation model. In our running example, we used a polynomial of degree 2 interacted with the 9 countries, which adds 2 × 9 = 18 more variables to the imputation model. When these are added, the EM algorithm can become unstable, as indicated by the differing chain lengths for each imputation:

> a.out.time

Amelia output with 5 imputed datasets.
Return code: 1
Message: Normal EM convergence.
Chain Lengths:
--------------
Imputation 1: 316
Imputation 2: 246
Imputation 3: 122
Imputation 4: 96
Imputation 5: 164

In these circumstances, we recommend adding a ridge prior, which will help with numerical stability by shrinking the covariances among the variables toward zero without changing the means or variances. This can be done by including the empri argument. Including this prior as a positive number is roughly equivalent to adding empri artificial observations to the data set with the same means and variances as the existing data but with zero covariances. Thus, increasing the empri setting results in more shrinkage of the covariances, putting more a priori structure on the estimation problem: like many Bayesian methods, it reduces variance in return for an increase in bias that one hopes does not overwhelm the advantages in efficiency. In general, we suggest keeping the value of this prior relatively small and increasing it only when necessary. A value of 0.5 to 1 percent of the number of observations, n, is a reasonable starting point, and is often useful in large datasets to add some numerical stability. For example, in a dataset of two thousand observations, this would translate to a prior value of 10 or 20, respectively. A prior of up to 5 percent is moderate in most applications, and 10 percent is a reasonable upper bound.
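For instance, following the guideline of roughly 1 percent of n, we can rerun the time-settings model with a small empirical prior (a sketch consistent with the run whose output is shown below):

> a.out.time2 <- amelia(freetrade, ts = "year", cs = "country",
+                       polytime = 2, intercs = TRUE, p2s = 0,
+                       empri = 0.01 * nrow(freetrade))
> a.out.time2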
Amelia output with 5 imputed datasets.
Return code: 1
Message: Normal EM convergence.

Observation-level priors
Researchers often have additional prior information about missing data values based on previous research, academic consensus, or personal experience. Amelia can incorporate this information to produce vastly improved imputations. The Amelia algorithm allows users to include informative Bayesian priors about individual missing data cells instead of the more general model parameters, many of which have little direct meaning. The incorporation of priors follows basic Bayesian analysis, where the imputation turns out to be a weighted average of the model-based imputation and the prior mean, and the weights are functions of the relative strength of the data and the prior: when the model predicts very well, the imputation will down-weight the prior, and vice versa.
The priors about individual observations should describe the analyst's belief about the distribution of the missing data cell. This can take the form of either a mean and a standard deviation or a confidence interval. For instance, we might know that 1986 tariff rates in Thailand were around 40%, but we have some uncertainty as to the exact value. Our prior belief about the distribution of the missing data cell, then, centers on 40, with a standard deviation that reflects the amount of uncertainty we have about our prior belief.
To input priors you must build a priors matrix with either four or five columns. Each row of the matrix represents a prior on either one observation or one variable. In any row, the entry in the first column is the row of the observation and the entry in the second column is the column of the observation. In the four-column priors matrix, the third and fourth columns are the mean and standard deviation of the prior distribution of the missing value.
For instance, suppose that we had some expert prior information about tariff rates in Thailand. We know from the data that Thailand is missing tariff rates in many years,

> freetrade[freetrade$country == "Thailand", c("year", "country", "tariff")]

Suppose that we had expert information that tariff rates were roughly 40% in Thailand between 1986 and 1988, with about a 6% margin of error. This corresponds to a standard deviation of about 3. In order to include this information, we must form the priors matrix:

> pr <- matrix(c(158, 159, 160, 3, 3, 3, 40, 40, 40, 3, 3, 3),
+              nrow = 3, ncol = 4)

The first column of this matrix corresponds to the row numbers of Thailand in these three years, the second column refers to the column number of tariff in the data, and the last two columns refer to the actual prior. Once we have this matrix, we can pass it to amelia,

> a.out.pr <- amelia(freetrade, ts = "year", cs = "country",
+                    priors = pr)

In the five-column matrix, the last three columns describe a confidence range of the data. The columns are a lower bound, an upper bound, and a confidence level between 0 and 1, exclusive. Whichever format you choose, it must be consistent across the entire matrix. We could get roughly the same prior as above by utilizing this method. Our margin of error implies that we would want imputations between 34 and 46, so our matrix would be

> pr.2 <- matrix(c(158, 159, 160, 3, 3, 3, 34, 34, 34,
+                  46, 46, 46, 0.95, 0.95, 0.95), nrow = 3, ncol = 5)

These priors indicate that we are 95% confident that these missing values are in the range 34 to 46.

If a prior has the value 0 in the first column, this prior will be applied to all missing values in this variable, except for explicitly set priors. Thus, we could set a prior for the entire tariff variable of 20, but still keep the above specific priors, with the following code:

> pr.3 <- matrix(c(158, 159, 160, 0, 3, 3, 3, 3, 40, 40,
+                  40, 20, 3, 3, 3, 5), nrow = 4, ncol = 4)

In some cases, variables in the social sciences have known logical bounds. Proportions must be between 0 and 1 and duration data must be greater than 0, for instance. Many of these logical bounds can be handled by the proper transformation (see 5.3 for more details on the transformations handled by Amelia). In the rare case that imputations must satisfy certain logical bounds not handled by these transformations, Amelia can take draws from a truncated normal distribution in order to achieve imputations that satisfy the bounds. Note, however, that this procedure imposes extremely strong restrictions on the imputations and can lead to lower variances than the imputation model implies. In general, building a more predictive imputation model will lead to better imputations than imposing these bounds.

Amelia implements these bounds by rejection sampling. When drawing the imputations from their posterior, we repeatedly resample until we have a draw that satisfies all of the logical constraints. You can set an upper limit on the number of times to resample with the max.resample argument. If, after max.resample draws, the imputations are still outside the bounds, Amelia will set the imputation to the edge of the bounds. Thus, if the bounds were 0 and 100 and all of the draws were negative, Amelia would simply impute 0.
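As a sketch of the bounds syntax: the bounds argument takes a three-column matrix, with one row per bounded variable giving the column number, the lower bound, and the upper bound. To restrict imputations of tariff (column 3 of freetrade) to lie between 0 and 100, one might run:

> bds <- matrix(c(3, 0, 100), nrow = 1, ncol = 3)
> a.out.bds <- amelia(freetrade, ts = "year", cs = "country",
+                     bounds = bds, max.resample = 1000)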

Diagnostics
Amelia currently provides a number of diagnostic tools to inspect the imputations that are created.

Comparing Densities
One check on the plausibility of the imputation model is to compare the distribution of imputed values to the distribution of observed values. Obviously we cannot expect, a priori, that these distributions will be identical, as the missing values may differ systematically from the observed values; this is the whole reason to impute to begin with! Imputations with strange distributions or those that are far from the observed data may indicate that the imputation model needs at least some investigation and possibly some improvement.
The plot method works on output from amelia and, by default, shows for each variable a plot of the relative frequencies of the observed data with an overlay of the relative frequency of the imputed values.
> plot(a.out, which.vars = 3:6)

where the argument which.vars indicates which of the variables to plot (in this case, we are taking the 3rd through the 6th variables).
The imputed curve (in red) plots the density of the mean imputation over the m datasets. That is, for each cell that is missing in the variable, the diagnostic will find the mean of that cell across each of the m datasets and use that value for the density plot. The black distributions are those of the observed data. When variables are completely observed, their densities are plotted in blue. These graphs will allow you to inspect how the density of imputations compares to the density of observed data. Some discussion of these graphs can be found in Abayomi, Gelman and Levy (2008). Minimally, these graphs can be used to check that the mean imputation falls within known bounds, when such bounds exist for certain variables or settings.
We can also use the function compare.density directly to make these plots for an individual variable:

> compare.density(a.out, var = "signed")

Figure 6: The output of the plot method as applied to output from amelia. In the upper panels, the distribution of mean imputations (in red) is overlaid on the distribution of observed values (in black) for each variable. In the lower panels, there are no missing values, and the distribution of observed values is simply plotted (in blue). Note how the imputed tariff rates are very similar to the observed tariff rates, but the imputations of the Polity score are quite different. This is plausible if different types of regimes tend to be missing at different rates.


Overimpute

Overimputing is a technique we have developed to judge the fit of the imputation model. Because of the nature of the missing data mechanism, it is impossible to
tell whether the mean prediction of the imputation model is close to the unobserved value that it is trying to recover. By definition, this missing data does not exist to create this comparison, and if it existed we would no longer need the imputations or care about their accuracy. However, a natural question the applied researcher will often ask is how accurate these imputed values are.
Overimputing involves sequentially treating each of the observed values as if they had actually been missing. For each observed value in turn we then generate several hundred imputed values of that observed value, as if it had been missing. While m = 5 imputations are sufficient for most analysis models, this large number of imputations allows us to construct a confidence interval of what the imputed value would have been, had any of the observed data been missing. We can then graphically inspect whether our observed data tends to fall within the region where it would have been imputed had it been missing.
For example, we can run the overimputation diagnostic on our data with

> overimpute(a.out, var = "tariff")

Our overimputation diagnostic, shown in figure 7, runs this procedure through all of the observed values for a user-selected variable. We can graph the estimates of each observation against the true values of the observation. On this graph, the y = x line indicates the line of perfect agreement; that is, if the imputation model were a perfect predictor of the true value, all the imputations would fall on this line. For each observation, Amelia also plots 90% confidence intervals that allow the user to visually inspect the behavior of the imputation model. By checking how many of the confidence intervals cover the y = x line, we can tell how often the imputation model can confidently predict the true value of the observation.
Occasionally, the overimputation can display unintuitive results. For example, different observations may have different numbers of observed covariates. If covariates that are useful to the prediction are themselves missing, then the confidence interval for this observation will be much larger. In the extreme, there may be observations where the observed value we are trying to overimpute is the only observed value in that observation, and thus there is nothing left to impute that observation with when we pretend that it is missing, other than the mean and variance of that variable. In these cases, we should correctly expect the confidence interval to be very large.
An example of this graph is shown in figure 8. In this simulated bivariate dataset, one variable is overimputed and the results displayed. The second variable is either observed, in which case the confidence intervals are very small and the imputations (yellow) are very accurate, or the second variable is missing in which case this variable is being imputed simply from the mean and variance parameters, and the imputations (red) have a very large and encompassing spread. The circles represent the mean of all the imputations for that value. As the amount of missing information in a particular pattern of missingness increases, we expect the width of the confidence interval to increase. The color of the confidence interval reflects the percent of covariates observed in that pattern of missingness, as reflected in the legend at the bottom.

Overdispersed Starting Values
If the data given to Amelia have a poorly behaved likelihood, the EM algorithm can have problems finding a global maximum of the likelihood surface, and starting values can begin to affect the imputations. Because the EM algorithm is deterministic, the point in the parameter space where you start it can impact where it ends, though this is irrelevant when the likelihood has only one mode. However, if the starting values of an EM chain are close to a local maximum, the algorithm may find this maximum, unaware that there is a global maximum farther away. To make sure that our imputations do not depend on our starting values, a good test is to run the EM algorithm from multiple, dispersed starting values and check their convergence. In a well-behaved likelihood, we will see all of these chains converging to the same value, and we can reasonably conclude that this is the likely global maximum. On the other hand, we might see our EM chain converging to multiple locations. The algorithm may also wander around portions of the parameter space that are not fully identified, such as a ridge of equal likelihood, as would happen, for example, if the same variable were accidentally included in the imputation model twice.
Amelia includes a diagnostic to run the EM chain from multiple starting values that are overdispersed from the estimated maximum. The overdispersion diagnostic will display a graph of the paths of each chain. Since these chains move through a parameter space of extremely high dimension and cannot be graphically displayed, the diagnostic reduces the dimensionality of the EM paths by showing the paths relative to the largest principal components of the final mode(s) that are reached. Users can choose between graphing the movement over the two largest principal components, or, more simply, the largest dimension with time (iteration number) on the x-axis. The number of EM chains can also be adjusted. Once the diagnostic draws the graph, the user can visually inspect the results to check that all chains converge to the same point. For our original model, this is a simple call to disperse:

> disperse(a.out, dims = 1, m = 5)
> disperse(a.out, dims = 2, m = 5)

where m designates the number of starting points for the EM chains and dims is the number of dimensions of the principal components to show.
In one dimension, the diagnostic plots movement of the chain on the y-axis and time, in the form of the iteration number, on the x-axis.

Figure 9: A plot from the overdispersion diagnostic where all EM chains are converging to the same mode, regardless of starting value. On the left, the y-axis represents movement in the (very high dimensional) parameter space, and the x-axis represents the iteration number of the chain. On the right, we visualize the parameter space in two dimensions using the first two principal components of the end points of the EM chains. The iteration number is no longer represented on the y-axis, although the distance between iterations is marked by the distance between arrowheads on each chain.

For example, suppose that we created a new variable that is a perfect linear function of another variable in the data:

> freetrade2 <- freetrade
> freetrade2$tariff2 <- freetrade2$tariff * 2 + 3

If we tried to impute this dataset, Amelia could draw imputations without any apparent problems:

> a.out.bad <- amelia(freetrade2, ts = "year", cs = "country")

-- Imputation 1 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

-- Imputation 2 --
1 2 3 4 5 6 7 8 9 10 11 12

-- Imputation 3 --

While this is a special case of a problematic likelihood, situations very similar to this can go undetected without using the proper diagnostics. More generally, an unidentified imputation model will lead to non-unique maximum likelihood estimates (see King (1989) for a more detailed discussion of identification and likelihoods).

Time-series plots
As discussed above, information about time trends and fixed effects can help produce better imputations. One way to check the plausibility of our imputation model is to see how it predicts missing values in a time series. If the imputations for the Malaysian tariff rate were drastically higher in 1990 than the observed years of 1989 or 1991, we might worry that there is a problem in our imputation model. Checking these time series is easy to do with the tscsPlot command. Simply choose the variable (with the var argument) and the cross-section (with the cs argument) to plot the observed time series along with distributions of the imputed values for each missing time period. For instance, we can run

> tscsPlot(a.out.time, cs = "Malaysia", var = "tariff",
+          main = "Malaysia (with time settings)", ylim = c(-10, 60))

to get the time-series plot for Malaysia. Here, the black points are observed tariff rates for Malaysia from 1980 to 2000. The red points are the mean imputations for each of the missing values, along with their 95% confidence bands. We draw these bands by imputing each of the missing values 100 times to get the imputation distribution for that observation.
In this plot, we can see that the imputed 1990 tariff rate is quite in line with the values around it. Notice also that the values toward the beginning and end of the time series have higher imputation variance. This occurs because the fit of the polynomials of time in the imputation model has higher variance at the beginning and end of the time series. This is intuitive because these points have fewer neighbors from which to draw predictive power.
A word of caution is in order. As with comparing the histograms of imputed and observed values, there could be reasons that the missing values are systematically different from the observed time series. For instance, if there had been a major financial crisis in Malaysia in 1990 which caused the government to close off trade, then we would expect that the missing tariff rates should be quite different from the observed time series. If we have this information in our imputation model, we might expect to see out-of-line imputations in these time-series plots. If, on the other hand, we did not have this information, we might see "good" time-series plots that fail to point out this violation of the MAR assumption. Our imputation model would produce poor estimates of the missing values since it would be unaware that both the missingness and the true unobserved tariff rate depend on another variable. Hence, the tscsPlot is useful for finding obvious problems in the imputation model and comparing the efficiency of various imputation models, but it cannot speak to the untestable assumption of MAR.

Missingness maps
One useful tool for exploring the missingness in a dataset is a missingness map. This is a map that visualizes the dataset as a grid and colors the grid by missingness status. The columns of the grid are the variables and the rows are the observations, as in any spreadsheet program. This tool allows for a quick summary of the patterns of missingness in the data.
If we simply call the missmap function on our output from amelia,

> missmap(a.out)

we get a map of the missingness in the freetrade data. The missmap function arranges the columns so that the variables are in decreasing order of missingness from left to right. If the cs argument was set in the amelia function, the labels for the rows will indicate where each of the cross-sections begins. In this map, it is clear that the tariff rate is the most missing variable in the data, and it tends to be missing in blocks of a few observations. Gross international reserves (intresmi) and financial openness (fiveop), on the other hand, are missing mostly at the end of each cross-section. This suggests missingness by merging, when variables with different temporal coverages are merged to make one dataset. Sometimes this kind of missingness is an artifact of the date at which the data was collected. The missingness map is an important tool for understanding the patterns of missingness in the data and can often indicate potential ways to improve the imputation model or data collection process.

Analysis Models
Imputation is most often a data processing step, as opposed to a final model in and of itself. To this end, it is easy to pass output from amelia to other functions. The easiest and most integrated way to run an analysis model is to pass the output to the zelig function from the Zelig package. For example, in Milner and Kubota (2005), the dependent variable was tariff rates. We can replicate this analysis on the imputed data, as sketched below. Zelig is one way to run analysis models on imputed data, but certainly not the only way. The imputations list in the amelia output contains each of the imputed datasets. Thus, users could simply program a loop over the number of imputations, run the analysis model on each imputed dataset, and combine the results using the rules described in King et al. (2001) and Schafer (1997). Furthermore, users can easily export their imputations using the write.amelia function, as described in 5.2.1, and use statistical packages other than R for the analysis model.
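For instance, a minimal sketch of the Zelig route, assuming a version of Zelig whose zelig function accepts the list of imputed datasets directly:

> library(Zelig)
> z.out.imp <- zelig(tariff ~ polity + pop + gdp.pc + year + country,
+                    data = a.out$imputations, model = "ls")
> summary(z.out.imp)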

The amelia class
The output from the amelia function is an instance of the S3 class "amelia." Instances of the amelia class contain much more than simply the imputed datasets. The mu object of the class contains the posterior draws of the means of the complete data. The covMatrices contains the posterior draws of the covariance matrices of the complete data. Note that these correspond to the variables as they are sent to the EM algorithm. Namely, they refer to the variables after being transformed, centered and scaled.
The iterHist object is a list of m 3-column matrices. Each row of the matrices corresponds to an iteration of the EM algorithm. The first column indicates how many parameters had yet to converge at that iteration. The second column indicates if the EM algorithm made a step that decreased the number of converged parameters. The third column indicates whether the covariance matrix at this iteration was singular. Clearly, the last two columns are meant to indicate when the EM algorithm enters a problematic part of the parameter space.
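A brief sketch of inspecting these components at the command line:

> names(a.out)            # list all components of the amelia object
> a.out$imputations[[1]]  # the first completed dataset
> a.out$mu                # posterior draws of the complete-data means
> a.out$iterHist[[1]]     # EM iteration history for the first chain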

AmeliaView Menu Guide
Below is a guide to the AmeliaView menus, with references back to the user's guide. The same principles from the user's guide apply to AmeliaView. The only difference is how you interact with the program. Whether you use the GUI or the command line versions, the same underlying code is being called, and so you can read the command line-oriented versions of this manual even if you intend to use the GUI.

Loading AmeliaView
The easiest way to load AmeliaView is to open an R session and type the following two commands:

> library(Amelia)
> AmeliaView()

This will bring up the AmeliaView window on any platform.
On the Windows operating system, there is an alternative way to start AmeliaView from the Desktop. See section 4.1 for a guide on how to install this version. Once installed, there should be a Desktop icon for AmeliaView. Simply double-click this icon and the AmeliaView window should appear. If, for some reason, this approach does not work, simply open an R session and use the approach above.

6.2 Step 1 - Input

Note that when using a CSV file, Amelia assumes that your file has a header (that is, a row at the top of the data indicating the variable names).
2. Input Data File - Enter the location of your dataset.
3. Browse - Find files on the system.

4. Load Data - Loads the data in the "Input Data File" line. Once the file is loaded, you can load a different file, but you will lose any work on the currently loaded file.
5. Summarize Data -View plots and summary statistics for the individual variables. This button will bring up a dialog box with a list of variables. Clicking on each variable will display the summary statistics on the right. Below these statistics, there is a "Plot Variable" button, which will show a histogram of the variable. For data that are string or character based, AmeliaView will not show summary statistics or plot histograms.

6.3 Step 2 - Options

Figure 14: Detail for step 2 on the front page of AmeliaView.

1. Time Series Variable - Choose the variable that indexes time in the dataset. If there is no time series component in your data, set it to "(none)." You must set this option in order to access the Time Series Cross Sectional options dialog.
2. Cross Sectional Variable - Choose the variable that indexes the cross-section. You must set this in order to access the "Set Case Priors" in the "Priors" dialog.
3. Variables -Becomes available after you load the data. See 6.3.1 for more information.
4. TSCS -Becomes available after you set the Time Series variable. See 6.3.2 for more information.

5. Priors - Becomes available after you load the data. See 6.3.3 for more information.

Variables Dialog
1. Variable Transformations -Choose the transformation that best tailors the variable to the multivariate normal, if appropriate. See 5.3 on Transformations to see how each transformation is useful. You can also choose whether or not the variable is an identification (ID) variable. If so, it will be left out of the imputation model, but will remain in the imputed datasets. This is useful for variables that have no explanatory power like extra case identifiers.
2. Tolerance - Adjust the level of tolerance that Amelia uses to check convergence of the EM algorithm. In very large datasets, if your imputation chains run a long time without converging, increasing the tolerance will loosen the convergence criterion and end the chains after fewer iterations.

Time Series Cross Sectional Dialog
1. Polynomials of Time - This option, if activated, will have Amelia use trends of time as an additional condition for fitting the missing data. Higher-order polynomials allow more variation in the trend structure, but take more degrees of freedom to estimate.
2. Interact with Cross-Section - Interacting the time polynomials with the cross section allows the trend over time to vary across cases as well. Using a 0th-order polynomial and interacting with the cross section is equivalent to including fixed effects. For more information see 5.5 above.
3. Variable Listbox -Choose the variables whose lag or lead you would like to include in the imputation model.

4. Lag Settings - Choose to include lags and leads in the data set to handle the effects of time. See 5.5.1 above.

Priors Dialog
1. Empirical Prior -A prior that adds observations to your data in order to shrink the covariances. A useful place to start is around 0.5% of the total number of observations in the dataset (see 5.6.1).
2. Set Observational Priors -Set prior beliefs about ranges for individual missing observations. For more information about observational priors, see 5.6.2.

Observational Priors
1. Current Priors - A list of current priors in distributional form, with the variable and case name.
2. Variable - The variable associated with the prior you would like to specify. The list provided only shows the missing variables for the currently selected observation.
3. Minimum - The minimum value of the prior. The textbox will not accept letters or out-of-place punctuation.
4. Maximum - The maximum value of the prior. The textbox will not accept letters or out-of-place punctuation.

5. Confidence - The confidence level of the prior. This should be between 0 and 1, non-inclusive. This value represents how certain your priors are; it cannot be 1, even if you are absolutely certain of a given range. This is used to convert the range into an appropriate distributional prior.

6.4 Step 3 - Output
1. Output Data Format - Choose the format of the output data. If you do not want to save any output data sets (if you wanted, for instance, to simply look at diagnostics), set this option to "(no save)." Currently, you can save the output data as: Comma Separated Values (.CSV), Tab Delimited Text (.TXT), Stata (.DTA), R save object (.RData), or hold it in R memory. This last option will only work if you have called AmeliaView from an R session and want to return to the R command line to work with the output. It will have the name in memory given in "Name of Imputed Datasets".
2. Name of Imputed Datasets - Enter the prefix for the output data files. If you set this to "mydata", your output files will be mydata1.csv, mydata2.csv, and so on. Try to keep this name short, as some operating systems have a difficult time reading long filenames.
3. Number of Imputed Datasets -Set the number of imputations you would like. In most cases, 5 will be enough to make accurate predictions about the means and variances.
4. Seed -Sets the seed for the random number generator used by Amelia. Useful if you need to have the same output twice.
5. Run Amelia -Runs the Amelia procedure on the input data. A dialog will open marking the progress of Amelia. Once it is finished, it will tell you that you can close the dialog. If an error message appears, follow its instructions; this usually involves closing the dialog, resetting the options, and running the procedure again.
6. Diagnostics - Post-imputation diagnostics. The only currently available graph compares the densities of the observed data to the mean imputation across the m imputed datasets.
1. Compare Plots - This will display the relative densities of the observed (red) and imputed (black) data. The density of the imputed values is the average imputation across all of the imputed datasets.

2. Overimpute - This will run Amelia on the full data with one cell of the chosen variable artificially set to missing and then check the result of that imputation against the truth. The resulting plot shows average imputations against true values, along with 90% confidence intervals, plotted over a y = x line for visual inspection of the imputation model.
3. Number of overdispersions - When running the overdispersion diagnostic, you need to run the imputation algorithm from several overdispersed starting points in order to get a clear idea of how the chains are converging. Enter the number of starting points here.
4. Number of dimensions - The overdispersion diagnostic must reduce the dimensionality of the paths of the imputation algorithm to either one or two dimensions due to graphical constraints.

5. Overdisperse - Run the overdispersion diagnostic to visually inspect the convergence of the Amelia algorithm from multiple, randomly drawn starting values.

Arguments
x either a matrix, data.frame, or an object of class "amelia". The first two will call the default S3 method. The third is a convenient way to perform more imputations with the same parameters.
m the number of imputed datasets to create.
p2s an integer value: 0 for no screen output, 1 for normal screen printing of iteration numbers, or 2 for detailed screen output. See "Details" for specifics on output when p2s = 2.
frontend a logical value used internally for the GUI.
idvars a vector of column numbers or column names that indicates identification variables. These will be dropped from the analysis but copied into the imputed datasets.
startvals starting values: 0 for the parameter matrix from listwise deletion, 1 for an identity matrix.
tolerance the convergence threshold for the EM algorithm.
logs a vector of column numbers or column names that refer to variables that require log-linear transformation.
sqrts a vector of numbers or names indicating columns in the data that should be transformed by a square root function. Data in these columns cannot be less than zero.
lgstc a vector of numbers or names indicating columns in the data that should be transformed by a logistic function for proportional data. Data in this column must be between 0 and 1.
noms a vector of numbers or names indicating columns in the data that are nominal variables.
ords a vector of numbers or names indicating columns in the data that should be treated as ordinal variables.
incheck a logical indicating whether or not the inputs to the function should be checked before running amelia. This should only be set to FALSE if you are extremely confident that your settings are non-problematic and you are trying to save computational time.
collect a logical value indicating whether or not the garbage collection frequency should be increased during the imputation model. Only set this to TRUE if you are experiencing memory issues as it can significantly slow down the imputation process.
arglist an object of class "ameliaArgs" from a previous run of Amelia. Including this object will use the arguments from that run.
empri number indicating level of the empirical (or ridge) prior. This prior shrinks the covariances of the data, but keeps the means and variances the same for problems of high missingness, small N's or large correlations among the variables. Should be kept small, perhaps 0.5 to 1 percent of the rows of the data; a reasonable upper bound is around 10 percent of the rows of the data.
priors a four or five column matrix containing the priors for either individual missing observations or variable-wide missing values. See "Details" for more information.
autopri allows the EM chain to increase the empirical prior if the path strays into a nonpositive definite covariance matrix, up to a maximum empirical prior of the value of this argument times n, the number of observations. Must be between 0 and 1; setting it to zero turns off this feature.
... further arguments to be passed.
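For instance, here is a minimal sketch of a typical call on the freetrade data described in section 7.5, with ts and cs marking the time and cross-section variables:

> library(Amelia)
> data(freetrade)
> # impute 5 datasets, treating year/country as the time/unit indices
> a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country")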

Details
Multiple imputation is a method for analyzing incomplete multivariate data. This function will take an incomplete dataset in either data frame or matrix form and return m imputed datasets with no missing values. The algorithm first bootstraps a sample dataset with the same dimensions as the original data, estimates the sufficient statistics (with priors if specified) by EM, and then imputes the missing values of the sample. It repeats this process m times to produce the m complete datasets, where the observed values are the same and the unobserved values are drawn from their posterior distributions.
The function will start a "fresh" run of the algorithm if x is either an incomplete matrix or data.frame. In this method, all of the options will be user-defined or set to their defaults. If x is the output of a previous Amelia run (that is, an object of class "amelia"), then Amelia will run with the options used in that previous run. This is a convenient way to run more imputations of the same model.
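As a brief sketch of this convenience, assuming a.out is the output of the call above:

> # draws additional imputations using the options stored in a.out
> a.more <- amelia(a.out)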
You can provide Amelia with informational priors about the missing observations in your data. To specify priors, pass a four or five column matrix to the priors argument, with each row specifying a different prior as either: one.prior <- c(row, column, mean, standard.deviation) or one.prior <- c(row, column, minimum, maximum, confidence).
The first and second columns of the priors matrix should hold the row and column number of the prior being set. The remaining columns should hold either the mean and standard deviation of the prior, or a minimum, maximum, and confidence level for the prior. You must specify your priors either all as distributions or all as confidence ranges. Note that ranges are converted to distributions, so setting a confidence of 1 will generate an error.
Setting a prior for the missing values of an entire variable is done in the same manner as above, but with a 0 for the row instead of the row number. If priors are set for both the entire variable and an individual observation, the individual prior takes precedence.
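As an illustrative sketch (the row and column numbers here are hypothetical; adjust them to the cell you actually want to constrain):

> # distributional form: the missing cell at row 10, column 3
> # has prior mean 20 and standard deviation 5
> pr <- matrix(c(10, 3, 20, 5), nrow = 1, ncol = 4)
> a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country",
+                 priors = pr)
> # range form for the same cell: between 15 and 25 with
> # confidence 0.95 (confidence must be strictly less than 1)
> pr2 <- matrix(c(10, 3, 15, 25, 0.95), nrow = 1, ncol = 5)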
In addition to priors, Amelia allows for logical bounds on variables. The bounds argument should be a matrix with 3 columns, with each row referring to a logical bound on a variable. The first column should be the column number of the variable to be bounded, the second column should be the lower bound for that variable, and the third column should be the upper bound for that variable. As Amelia enacts these bounds by resampling, particularly poor bounds can cause the algorithm to resample indefinitely. Amelia will stop resampling after max.resample attempts and simply set the imputation to the relevant bound.
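A short sketch, assuming the variable in column 3 is logically restricted to the range 0 to 100:

> # each row of bds is c(column, lower, upper)
> bds <- matrix(c(3, 0, 100), nrow = 1, ncol = 3)
> a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country",
+                 bounds = bds, max.resample = 1000)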
If each imputation is taking a long time to converge, you can increase the empirical prior, empri. This value has the effect of smoothing out the likelihood surface so that the EM algorithm can more easily find the maximum. It should be kept as low as possible and only used if needed.
Amelia assumes the data is distributed multivariate normal. There are a number of variables that can break this assumption. Usually, though, a transformation can make any variable roughly continuous and unbounded. We have included a number of commonly needed transformations for data. Note that the data will not be transformed in the output datasets and the transformation is simply useful for climbing the likelihood.
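For example, a sketch using the freetrade variable names, where the strictly positive population and GDP variables are log-transformed, the IMF dummy is treated as nominal, and the Polity score as ordinal:

> a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country",
+                 logs = c("pop", "gdp.pc"), noms = "signed",
+                 ords = "polity")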
Please refer to the Amelia manual for more information on the function or the options.

Value
An instance of S3 class "amelia" with the following objects:
imputations a list of length m with an imputed dataset in each entry. The class (matrix or data.frame) of these entries will match x.
arguments an instance of the class "ameliaArgs" which holds the arguments used in the Amelia run.
Note that the theta, mu, and covMatrices objects refer to the data as seen by the EM algorithm and are thus centered, scaled, stacked, transformed, and rearranged. See the manual for details and how to access this information.
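For instance, a sketch of extracting the completed datasets and running the same analysis on each (the lm model here is purely illustrative, using freetrade variable names):

> imp1 <- a.out$imputations[[1]]    # the first completed dataset
> mods <- lapply(a.out$imputations,
+                function(d) lm(tariff ~ polity + pop, data = d))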
7.3 compare.density: Compare observed versus imputed densities
... further graphical parameters for the plot.

Details
This function first plots a density plot of the observed units for the variable var in col[2]. Then the function plots a density plot of the mean or modal imputations for the missing units in col[1]. If a variable is marked "ordinal" or "nominal" with the ords or noms options in amelia, then the modal imputation will be used. If legend is TRUE, then a legend is plotted as well.

7.4 disperse: Overdispersed starting values diagnostic

Arguments
dims the number of principal components of the parameters to display and assess convergence on (up to 2).
p2s an integer that controls printing to screen. 0 (default) indicates no printing, 1 indicates normal screen output and 2 indicates diagnostic output.
frontend a logical value used internally for the Amelia GUI.
... further graphical parameters for the plot.

Details
This function tracks the convergence of m EM chains which start from various overdispersed starting values. This plot should give some indication of the sensitivity of the EM algorithm to the choice of starting values in the imputation model in output. If all of the lines converge to the same point, then we can be confident that starting values are not affecting the EM algorithm.
As the parameter space of the imputation model is high-dimensional, this plot tracks how the first (and, if dims is 2, second) principal component(s) change over the iterations of the EM algorithm. Thus, the plot is a lower-dimensional summary of the convergence and is subject to all the drawbacks inherent in such summaries.
For dims==1, the function plots a horizontal line at the position where the first EM chain converges. Thus, we are checking that the other chains converge close to that horizontal line. For dims==2, the function draws a convex hull around the point of convergence for the first EM chain. The hull is scaled to be within the tolerance of the EM algorithm. Thus, we should check that the other chains end up in this hull.
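As a usage sketch, assuming a.out is an amelia output object:

> disperse(a.out, dims = 1, m = 5)  # horizontal-line version
> disperse(a.out, dims = 2, m = 5)  # convex-hull version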
7.5 freetrade: Trade Policy and Democracy in 9 Asian States

Description
Economic and political data on nine developing countries in Asia from 1980 to 1999. This dataset includes 10 variables: year, country, average tariff rates, Polity IV score, total population, gross domestic product per capita, gross international reserves, a dummy variable for whether the country had signed an IMF agreement in that year, a measure of financial openness, and a measure of US hegemony. These data were used in Milner and Kubota (2005).

Usage
freetrade

Format
A data frame with 10 variables and 171 observations.

Source
World Bank, World Trade Organization, Polity IV and others.

7.6 missmap: Missingness Map

Description
Plots a missingness map showing where missingness occurs in the dataset passed to amelia.
legend should a legend be drawn?
col a vector of length two where the first element specifies the color for missing cells and the second element specifies the color for observed cells.
main main title of the plot. Defaults to "Missingness Map".
x.cex expansion for the variable names on the x-axis.
y.cex expansion for the unit names on the y-axis.
y.labels a vector of row labels to print on the y-axis.
y.at a vector of the same length as y.labels with the row numbers associated with those labels.
csvar column number or name of the variable corresponding to the unit indicator. Only used when obj is not of class "amelia".
tsvar column number or name of the variable corresponding to the time indicator. Only used when obj is not of class "amelia".
Details
missmap draws a map of the missingness in a dataset using the image function. The columns are reordered to put the most missing variable farthest to the left. The rows are reordered to a unit-period order if the ts and cs arguments were passed to amelia. If not, the rows are not reordered.
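A minimal sketch, assuming a.out is an amelia output object (so the ts and cs information is used to order the rows):

> missmap(a.out, main = "Missingness in the freetrade data")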
The y.labels and y.at arguments can be used to associate labels with rows in the data to identify them in the plot. The y-axis is internally inverted so that the first row of the data is associated with the top-most row of the missingness map. The values of y.at should refer to the rows of the data, not to any point on the plotting region.

7.7 overimpute: Overimputation diagnostic plot

Arguments
ylab the label for the y-axis. The default is "Imputed Values."
main main title of the plot. The default is to smartly title the plot using the variable name.
frontend a logical value used internally for the Amelia GUI.
... further graphical parameters for the plot.

Details
This function temporarily treats each observed value in var as missing and imputes that value based on the imputation model of output. The dots are the mean imputations and the vertical lines are the 90% confidence intervals for imputations of each observed value. The diagonal line is the y = x line. If all of the imputations were perfect, then our points would all fall on the line. A good imputation model would have about 90% of the confidence intervals containing the truth; that is, about 90% of the vertical lines should cross the diagonal.
The color of the vertical lines displays the fraction of missing observations in the pattern of missingness for that observation. The legend codes this information. Obviously, the imputations will be much tighter if there are more observed covariates to use to impute that observation.
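A usage sketch, assuming tariff is a numeric variable in the imputation model:

> overimpute(a.out, var = "tariff")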

See Also
Other imputation diagnostics are compare.density, disperse, and tscsPlot.
7.8 plot.amelia: Summary plots for Amelia objects

Description
Plots diagnostic plots for the output from the amelia function.

Arguments
x an object of class "amelia"; typically output from the function amelia.
which.vars a vector indicating the variables to plot. The default is to plot all of the numeric variables that were actually imputed.
compare plot the density comparisons for each variable?
overimpute plot the overimputation for each variable?
ask prompt user before changing pages of a plot?
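As a sketch of typical usage (the variable names are assumptions from the freetrade example):

> plot(a.out)                                    # all imputed numeric variables
> plot(a.out, which.vars = c("tariff", "polity"))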

See Also
compare.density, overimpute

7.9 summary.amelia: Summary of an Amelia object

Description
Returns summary information from the Amelia run along with missingness information.
Usage
## S3 method for class 'amelia':
summary(object, ...)

Arguments
object an object of class "amelia". Typically, an output from the function amelia.

See Also
amelia, plot.amelia

7.10 tscsPlot: Plot observed and imputed time-series for a single cross-section

Description
Plots a time series for a given variable in a given cross-section and provides confidence intervals for the imputed values.
var the column number or variable name of the variable to plot.
cs the name of the cross-section to plot.
draws the number of imputations on which to base the confidence intervals.
conf the confidence level of the confidence intervals to plot for the imputed values.
misscol the color of the imputed values and their confidence intervals.
obscol the color of the points for observed units.
xlab, ylab, main, pch, ylim, xlim various graphical parameters.
... further graphical parameters for the plot.

Details
The cs argument should be a value from the variable set to the cs argument in the amelia function for this output. This function will not work if the ts and cs arguments were not set in the amelia function.
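For example, a sketch using the freetrade data, where country was passed as cs and Malaysia is one of its cross-sections:

> tscsPlot(a.out, cs = "Malaysia", var = "tariff")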
7.11 write.amelia: Write Amelia imputations to file

Description
Writes the imputed datasets to file from a run of amelia.
file.stem the leading part of the filename to save to. The imputation number and extension will be added to complete the filename. This can include a directory path.
extension the extension of the filename. This is simply what follows file.stem and the imputation number.
format one of the following output formats: csv, dta or table. See details.
... further arguments for the write functions.

Details
write.amelia writes each of the imputed datasets to a file using one of the following functions: write.csv, write.dta, or write.table. You can pass arguments to these functions from write.amelia.
If you were to set file.stem to "outdata" and the extension to ".csv", then the resulting filenames of the written files would be outdata1.csv, outdata2.csv, outdata3.csv, and so on.
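A sketch matching that example:

> write.amelia(obj = a.out, file.stem = "outdata",
+              extension = ".csv", format = "csv")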

See Also
write.csv, write.table, write.dta