Published by the Foundation for Open Access Statistics
Editors-in-chief: Bettina Grün, Torsten Hothorn, Rebecca Killick, Edzer Pebesma, Achim Zeileis
ISSN 1548-7660; CODEN JSSOBK
Authors: Peter Baker
Title: Using GNU Make to Manage the Workflow of Data Analysis Projects
Abstract: Data analysis projects invariably involve a series of steps such as reading, cleaning, summarizing and plotting data, statistical analysis and reporting. To facilitate reproducible research, rather than employing a relatively ad hoc point-and-click, cut-and-paste approach, we typically break down these tasks into manageable chunks by employing separate files of statistical, programming or text processing syntax for each step, including the final report. Real-world data analysis often requires an iterative process because many of these steps may need to be repeated any number of times. Manually repeating these steps is problematic in that some necessary steps may be left out or some reported results may not be for the most recent data set or syntax. GNU Make may be used to automate the mundane task of regenerating output given dependencies between syntax and data files. In addition to helping manage and document the workflow of a complex data analysis project, such automation can help minimize errors and make the project more reproducible. It is relatively simple to construct Makefiles for small data analysis projects. As projects increase in size, difficulties arise because GNU Make does not have inbuilt rules for statistical and related software. Without such rules, Makefiles can become unwieldy and error-prone. This article addresses these issues by providing GNU Make pattern rules for R, Sweave, rmarkdown, SAS, Stata, Perl and Python to streamline management of data analysis and reporting projects. Rules are used by adding a single line to project Makefiles. Additional flexibility is incorporated for modifying standard program options. An overall strategy is outlined for Makefile construction and illustrated via simple and complex examples.

Submitted: 2017-08-10. Published: 2020-08-31.
Supplements: Source code (14KB); Replication materials (29KB)

DOI: 10.18637/jss.v094.c01

This work is licensed under the following licenses:
Paper: Creative Commons Attribution 3.0 Unported License
Code: GNU General Public License (at least one of version 2 or version 3) or a GPL-compatible license.