Using GNU Make to Manage the Workflow of Data Analysis Projects
Main Article Content
Data analysis projects invariably involve a series of steps such as reading, cleaning, summarizing and plotting data, statistical analysis and reporting. To facilitate reproducible research, rather than employing a relatively ad-hoc point-and-click cut-and-paste approach, we typically break down these tasks into manageable chunks by employing separate files of statistical, programming or text processing syntax for each step including the final report. Real world data analysis often requires an iterative process because many of these steps may need to be repeated any number of times. Manually repeating these steps is problematic in that some necessary steps may be left out or some reported results may not be for the most recent data set or syntax. GNU Make may be used to automate the mundane task of regenerating output given dependencies between syntax and data files. In addition to facilitating the management of and documenting the workflow of a complex data analysis project, such automation can help minimize errors and make the project more reproducible. It is relatively simple to construct Makefiles for small data analysis projects. As projects increase in size, difficulties arise because GNU Make does not have inbuilt rules for statistical and related software. Without such rules, Makefiles can become unwieldy and error-prone. This article addresses these issues by providing GNU Make pattern rules for R, Sweave, rmarkdown, SAS, Stata, Perl and Python to streamline management of data analysis and reporting projects. Rules are used by adding a single line to project Makefiles. Additional flexibility is incorporated for modifying standard program options. An overall strategy is outlined for Makefile construction and illustrated via simple and complex examples.