Motivation: why do we need a reproducible research workflow?
With the replication crisis in social science now widely recognized, it may seem unnecessary to restate the motivation for building reproducible research workflows. Nonetheless, I will reiterate it here, both because these practices are important and because the time to adopt them has truly arrived.
Innovations in computational tools and access to big data have opened a new era for researchers in social science. In terms of theory, these innovations allow researchers to build richer, more flexible models that better represent the real world, and to test those models thanks to wider access to data. In terms of empirical research, researchers can now answer many interesting questions that were previously out of reach due to limitations in data availability.

But this did not come without a cost. Unfortunately, the sudden rise in computing power and the wider access to data were not fully accompanied by the development of best practices for reproducible research. It did not take long for researchers to realize that a large portion of academic research had serious reproducibility issues. Many journals had no standard policy for replication packages, and many researchers had a hard time replicating results from published papers. This was a serious problem, as it made the academic research process less transparent and less reliable. To address it, many journals and researchers started to emphasize the need for standards for reproducible research.
However, building a reproducible research workflow is not an easy task. It has a steep learning curve that takes time and effort to climb. This is where this blog post comes in. I will walk you through the process of building a reproducible research workflow, and share some tips and personal experiences gathered from my own journey.
Documentation
The first step of every reproducible research workflow is documentation. The essence of a reproducible research workflow is creating a set of records that others can use to reproduce the results of the research. It is basically a researcher talking to other researchers about the detailed steps needed to reproduce the results. Since we cannot physically be present in the other researcher's office, we need those records to speak for us.
Follow the standard
When in doubt, follow the standard. Fortunately, you don't need to reinvent the wheel for the documentation needed in a reproducible workflow. In fact, there are many official guidelines you can consult. For example, the data editors of the journals of the American Economic Association, the Royal Economic Society, the Review of Economic Studies, the Canadian Journal of Economics, and Economic Inquiry have created the Data and Code Availability Standard (DCAS). This standard gives researchers a consistent way to document the data and code availability of their research.
In my template, I follow the DCAS standard and have incorporated it into the `README.md` file. You can use the example in the README as a template for your own documentation.
Document everything
When in doubt, document everything. The marginal utility of documenting additional information is almost always non-negative for other researchers. So when you are unsure whether a piece of information is useful or not, just document it and decide later.
Data management
Have consistent project structure
Mens sana in corpore sano. The same goes for a reproducible research workflow: healthy reproducible habits come from having a clear, reproducible folder structure for your research. How can you easily automate a process in your workflow if you don't know where your raw data lives? You will be very error-prone if your source code is scattered everywhere. And if you have a hard time following what you did in your own workflow, other people will do much worse. Thus, having a consistent folder structure for your research is one of the first steps of a reproducible research workflow.
Fortunately, the answer is quite simple: set up a consistent project structure. A project structure means you organize your research as a project with a consistent set of folders. The crucial point is that everything necessary to run your research should be contained within the project: your source code, data, environment, and so on. This isolates the materials for your research from unrelated materials that could corrupt your workflow. Within the project's parent folder, you should also have a consistent set of folders that divide files by their function. For example, you might have a `src` folder for all your source code, an `input` folder for all your data, and so on. In fact, this is how my replication template is structured.
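As a minimal illustration (the folder names below are just one possible convention, not a prescription), you can create such a skeleton from the terminal:

```bash
# Create a project with folders separated by function:
# source code, raw inputs, and generated outputs
mkdir -p project/{src,input,output}
```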
Don’t directly modify the raw data
NEVER directly modify the raw data. First, it is very hard to remember how you modified it; if you forget, the reproducibility of every subsequent analysis can be severely impaired. Second, if it is hard for you to re-do the modification, it will be even harder for other replicators. Always use source code to create a new file from the raw data if you want to modify it. If you have no choice but to modify the raw data manually for some reason, save the result as a new file instead of overwriting the original.
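As a small sketch of this principle in R (the file names and the cleaning step here are hypothetical), every modification reads the raw file and writes its result somewhere else, leaving the raw file untouched:

```r
# Read the raw data, clean it, and save the result as a NEW file
raw <- read.csv("input/raw/survey_raw.csv")

# Example cleaning step: drop rows with missing income
clean <- raw[!is.na(raw$income), ]

# Write to a separate location; never overwrite the raw file
write.csv(clean, "output/survey_clean.csv", row.names = FALSE)
```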
Automate the data download process (if possible)
Humans are error-prone. Don't expect them to be very good at following your documentation; people will always make mistakes. Thus, it is best to automate a process in your workflow whenever you can, rather than writing instructions for people to follow by hand.
One example is the data download process. Instead of manually clicking and downloading the data from a website, you can automate this step with a bash script. First, locate the data download link on the website; you can get it by right-clicking the link and selecting "Copy link address". Then, you can write a bash script as follows:
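```bash
#!/bin/bash
wget -nc -P [file_path] [data_download_link]
```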
`wget` is a command-line tool that downloads files from the internet. The `-nc` option means "no clobber": the file will not be downloaded again if it already exists. The `-P` option sets the directory prefix: the file will be downloaded into the specified directory.
Also, a lot of bulk data sources nowadays provide an API for downloading the data from a script. If such an API is available, you can use it to write a script that downloads the data automatically.
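As an illustration (the URL, file name, and query parameters below are purely hypothetical; consult the provider's API documentation for the real ones), an API download can often be scripted with `curl`:

```bash
#!/bin/bash
# Hypothetical example: fetch a dataset from a provider's API,
# skipping the download if the file is already present
if [ ! -f input/panel_data.csv ]; then
  curl -L -o input/panel_data.csv \
    "https://example.org/api/v1/datasets/panel?format=csv"
fi
```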
We will later discuss it in the build automation section, but I have included this logic in the `Makefile`.
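As a rough sketch of what such a rule might look like (the target name below is illustrative and not necessarily what my template uses; note that recipe lines in a Makefile must be indented with a real tab):

```make
# Hypothetical rule: fetch the raw data into input/ only if it is not already there
data:
	wget -nc -P input [data_download_link]

.PHONY: data
```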
Coding
Set up an environment-independent environment
In general, having a reproducible workflow means your code can be run in different environments, ranging from different coding packages to different operating systems. But how can we achieve this? The best practice is to set up your research workflow so that the environment you used can be recreated on other people's computers. This is usually done with several tools: `Docker`, package dependency managers, and so on. We will discuss these in more detail in the upcoming sections.
Relative, relative, relative (path)!
Please please don’t have something like this in the first line of your source code:
setwd("C:\Users\jenny\path\that\only\I\have")
If you do, I swear I will come to your office and set your computer on fire.
You can clearly see that this violates the environment-independent environment principle we discussed earlier. There is just no chance that other people's computers have the same absolute path to the data as yours, so all your code will break if you hard-code it.
Then what should we do? The answer is simple: use relative paths. A relative path is a path relative to the current working directory. For example, if you are in the `C:\Users\jenny\path\that\only\I\have` directory and the data sits in its `data` subfolder, the relative path to the data is `data/data.csv`.
The way to implement this is very simple. In the case of `R`, you can use an R project together with the `here` package. The `here` package lets you write project-relative paths throughout your research project. All you need to do is have either an R project file or a `.here` file in your project root directory. Then you can use the `here()` function to build the path to the data.
Let’s suppose that your project has the following structure:
```
project/
├── data/
├── src/code.R
├── README.md
├── .here
```
Using the `here` package, you can write the path in your `code.R` file as follows:
```r
library(here)
data_path <- here("data", "data.csv")
```
In this code, `here("data", "data.csv")` will return something like `/home/usrs/project/data/data.csv` regardless of the current working directory. Thus, the `here` package takes care of the relative/absolute path issue without you having to worry about it.
If you are familiar with `R` and R projects, you might be wondering why you need to use the `here` package explicitly. After all, if you have an R project, you can just use relative paths in your code even without the `here()` function. There are several reasons. First, other researchers might want to run the code from the terminal; in that case, the R project file is not automatically detected, which can lead to path errors. Second, the `here()` function is more robust across operating systems. For example, Windows and Linux use different path separators, and `here()` takes care of this for you.
These kinds of project environments are common in many different programming languages, so you should have no problem applying the same idea if you work in another language.
Get used to shell scripting
Why use the shell (bash, zsh, the terminal, etc.)? There are several reasons.
- It is more efficient to use shell scripting to automate the process.
Suppose you need to create 100 files named `file_1.txt`, `file_2.txt`, `file_3.txt`, …, `file_100.txt`. You can write a shell script to do this as follows:
touch file_{1..100}.txt
You can also delete them all as follows:
rm file_*.txt
- Many tools used for reproducible research workflows are designed to be used in the terminal. For example, `Docker` is designed to be driven from the terminal, and if you are using `Docker`, it will be a lot easier to work with it there. The same goes for makefiles.
- Powerful tools that can help your research use the terminal as their interface. For example, HPC clusters usually run Linux, and most of their tools are designed to work in the terminal.
In my template, `bash` will be used a lot in the `Makefile`.
Use version control
Remember writing the "final version" of a document? `final-version.pdf`, `final-final-version.pdf`, `really-really-final-version.pdf`, and so on. This is inefficient because (1) it is hard to keep track of all the versions, (2) the file names are not descriptive enough to tell what changed between versions, (3) it is hard to revert to a previous version if you need to, and (4) you might accidentally overwrite a previous version if you are not careful.
The same problem applies to your source code. When you write code, you are constantly changing its content: adding to it, deleting parts, or going back and forth between versions. If you are not careful, this can lead to all sorts of errors. This is where version control systems come in. They let you keep track of every version of your code and revert to a previous one when you need to. In the coding world, `Git` is the industry standard for version control. If you don't know `Git` yet, please learn it ASAP. While learning `Git`, you should also learn about `GitHub`, as it is the most popular platform for hosting `Git` repositories.
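A minimal sketch of the day-to-day `Git` loop (the file name and commit message are hypothetical; any `Git` tutorial covers the details):

```bash
git init                            # turn the project folder into a Git repository
git add src/clean_data.R            # stage a file you changed
git commit -m "Clean raw survey"    # record a snapshot with a descriptive message
git log --oneline                   # browse the history of snapshots
```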
Use Docker to “ship” your computer
Use package dependencies managers
Use build automation tools
Use symlinks to avoid storing big data in the repository
Make your code invulnerable to “restart session”
Please please don’t have something like this in the first line of your source code:
rm(list = ls())
If you do, I swear I will again come to your office and set your computer on fire.
This is because it does not do what you intend. When you run this code, you are probably trying to clean up the environment and start from a fresh state. However, it only removes the objects in the global environment; the non-data portions of the session survive. For example, the packages you loaded are still attached.
Then what should we do? Instead of using the code above, just restart the whole session. This guarantees that your session is completely fresh, and it imposes good discipline, since your code should be able to run from a clean state.
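One concrete way to enforce this discipline (assuming your script lives at `src/code.R`, as in the example project above) is to run the script non-interactively, which always starts from a brand-new R session:

```bash
# Runs src/code.R in a fresh R session every time
Rscript src/code.R
```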
Comment (almost) everything
When in doubt, comment everything in your code. This should also be obvious: I am pretty sure you have had many experiences of going through someone else's code and getting frustrated by 1,000 lines of convoluted code with no explanations at all.
For example, imagine you have to decode this (I know this is a bit extreme…):