# Efficient Multiple Imputation for Diverse Data in Python and R: MIDASpy and rMIDAS

Authors: Ranjit Lall and Thomas Robinson

We provide a single replication script (`Code/code.R`) that
substantively reproduces all results presented in the paper (including a
full imputation of the CCES data using MIDASpy).

This script takes approximately one hour to run.  Generated figures will be
saved in the subdirectory `Figures/Replication`.  To facilitate comparison,
`Figures` contains the figures presented in the paper.

Note: due to the complex nature of our full tests, this is only a
substantive replication (in line with JSS guidelines).  For a complete
replication, please run `Code/full_code.R`.  This file has a runtime
of 1.4 days, most of which is spent on the hyperparameter test.

All file paths in scripts are relative to the main replication folder.
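Because the scripts resolve paths relative to the replication folder, it can help to confirm the working directory before starting a long run. The following is a minimal sketch, not part of the replication materials: `EXPECTED_PATHS` and `missing_paths` are hypothetical names, and the list contains only paths mentioned in this README.

```python
from pathlib import Path

# Paths named in this README; all are relative to the main replication folder.
# (Hypothetical helper, not shipped with the replication materials.)
EXPECTED_PATHS = [
    "Code/code.R",
    "Code/full_code.R",
    "Code/py_example.py",
    "Data/midas-env.yml",
    "Figures",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Not in the replication folder? Missing:", ", ".join(missing))
    else:
        print("Working directory looks like the replication folder.")
```

An empty result suggests the current directory is the main replication folder. Any listed path indicates the scripts would likely fail to find their inputs.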

## IMPORTANT: Setting up the replication environment ##

* If you are using a Mac with an M1 or M2 chip, please see the Apple Silicon
  section below.

To aid replication, we include a YAML file (in `Data`) that initializes a
conda environment with the correct Python package dependencies.


### Manual conda setup

Please ensure you have conda installed on your machine.  Next, in a terminal
window, navigate to this replication folder.  Then, run the following at the
command line:

```
conda env create -f Data/midas-env.yml
```

#### NOTE: Setup for Apple Silicon (i.e., Macs with M1 or M2 chips)

rMIDAS and MIDASpy are compatible with Apple's new ARM64 architecture. 
However, we recommend using the miniforge installer rather than anaconda or
miniconda, as it offers better support for the ARM64 architecture.

Once you have installed miniforge, Apple Silicon users should navigate to
this replication folder and run the following at the command line:

```
conda env create -f Data/midas-env-arm64.yml
```
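The choice between the two environment files can be sketched in Python as follows. This is an illustration only: `env_file_for` is a hypothetical helper, and it assumes that only Apple Silicon Macs (reported by Python as system `Darwin`, machine `arm64`) need the ARM64 file. A Python interpreter running under Rosetta reports `x86_64`, so the result should be checked against the actual hardware.

```python
import platform


def env_file_for(system, machine):
    """Pick the conda environment file for a given OS and CPU architecture."""
    # Assumption: only Apple Silicon Macs use the ARM64 environment file.
    if system == "Darwin" and machine == "arm64":
        return "Data/midas-env-arm64.yml"
    return "Data/midas-env.yml"


if __name__ == "__main__":
    env_file = env_file_for(platform.system(), platform.machine())
    # Print the command to run from the replication folder.
    print(f"conda env create -f {env_file}")
```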

## Dependency details 

### Replication script

We ran the replication script with available memory limited to 8GB.
The script was also tested on a MacBook Pro with an Apple M1 Max chip
(using miniforge) and on an Ubuntu 22.04 Linux system.

### Full results

The paper results generated from `Code/full_code.R` were produced on
an Amazon AWS EC2 server using a c6a.8xlarge instance with 64GB RAM and
Ubuntu 22.04 Server operating system.

### Runtimes

- Code/code.R: 58 minutes (single replication script - recommended)

- Code/full_code.R: 1.4 days (full script for exact replication)

- Code/py_example.py: 12 minutes 33 seconds (just the MIDASpy example)

## Replication script discrepancies

As noted above, we provide a substantive replication script (`code.R`) due
to the lengthy runtime of the full replication script (`full_code.R`).  The
two scripts differ in the following ways:

### Section 5.1

* In `code.R`, we specify a subset of categorical and binary columns rather
  than using all such columns, as in `full_code.R`.  This reduces memory
  load.

### Section 5.2

* As in Section 5.1, we specify a subset of categorical and binary columns
  in `code.R` rather than using all such columns, as in `full_code.R`.

### Section 6.2

* In `code.R`, we do not run the hyperparameter or learning rate tests in
  `full_code.R` to avoid lengthy runtimes.  Instead, we leave the code for
  these tests as comments and read in the results of our original run to
  generate the figures.

### Section 6.3

* There are minor formatting differences caused by subsetting the data in
  `full_code.R`, which matches the existing subsetted data in `code.R`.
