Motivation: why do we need a reproducible research workflow?
With the replication crisis in social science now widely recognized, it may seem unnecessary to restate the motivation for building reproducible research workflows. Nonetheless, I will reiterate it here, both because these practices are important and because the time to adopt them has truly arrived.
Innovations in computational tools and access to big data have opened a new era for researchers in social science. In terms of theory, these innovations allow researchers to build richer, more flexible models that better represent the real world, and to test those models thanks to wider access to data. In terms of empirical research, researchers can now answer many interesting questions that were previously out of reach due to limitations in data availability.

But this did not come without a cost. Unfortunately, the sudden rise in computing power and the wider access to data were not fully accompanied by the development of best practices for reproducible research. It did not take long for researchers to realize that a large portion of academic research had serious reproducibility issues. Many journals had no standard policy for replication packages, and many researchers had a hard time replicating results from published papers. This was a serious problem, as it made the academic research process less transparent and less reliable. To address it, many journals and researchers started to emphasize the need for standards for reproducible research.
However, building a reproducible research workflow is not an easy task. It has a steep learning curve that takes time and effort to climb. This is where this blog post comes in. I will walk you through the process of building a reproducible research workflow, and share some tips and personal experiences gathered from my own journey.
Documentation
The first step of every reproducible research workflow is documentation. The essence of a reproducible research workflow is creating a set of records that others can use to reproduce the results of the research. It is basically a researcher talking to other researchers about the detailed steps needed to reproduce the results. Since we cannot physically be present in the other researcher's office, we need those records to speak for us.
Follow the standard
When in doubt, follow the standard. Fortunately, you don't need to reinvent the wheel for the documentation needed in a reproducible workflow. In fact, there are many official guidelines you can consult. For example, the data editors of the journals of the American Economic Association, the Royal Economic Society, the Review of Economic Studies, the Canadian Journal of Economics, and Economic Inquiry have created the Data and Code Availability Standard (DCAS). This standard gives researchers a consistent way to document the data and code availability of their research.
In my template, I follow the DCAS standard and have incorporated it into the `README.md` file. You can use the example in the README as a template for your own documentation.
Document everything
When in doubt, document everything. The marginal utility of documenting additional information is almost always non-negative for other researchers. So when you are unsure whether a piece of information is useful or not, just document it and decide later.
Data management
Have consistent project structure
Mens sana in corpore sano. The same goes for a reproducible research workflow: healthy reproducible habits come from having a clear, reproducible folder structure for your research. How can you easily automate a process in your workflow if you don't know where your raw data lives? You will be very error-prone if your source code is scattered everywhere. And if you have a hard time following what you did in your own workflow, other people will do much worse. Thus, having a consistent folder structure for your research is one of the first steps of a reproducible research workflow.
Fortunately, the answer is quite simple: set up a consistent project structure. A project structure means you organize your research as a project with a consistent set of folders. The crucial point is that everything necessary to run your research should be contained within the project: your source code, data, environment, and so on. This isolates the materials for your research from unrelated materials that could corrupt your workflow. Within the project's parent folder, you should also have a consistent set of folders that divide files by their function. For example, you might have a `src` folder for all your source code, an `input` folder for all your data, and so on. In fact, this is how my replication template is structured.
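As a minimal illustration (the folder names below are just one possible convention, not a prescription), you can create such a skeleton from the terminal:

```bash
# Create a project with folders separated by function:
# source code, raw inputs, and generated outputs
mkdir -p project/{src,input,output}
```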
Don’t directly modify the raw data
NEVER directly modify the raw data. First, it is very hard to remember how you modified it; if you forget, the reproducibility of every subsequent analysis can be severely impaired. Second, if it is hard for you to re-do the modification, it will be even harder for other replicators. Always use source code to create a new file from the raw data if you want to modify it. If you have no choice but to modify the raw data manually for some reason, save the result as a new file instead of overwriting the original.
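As a small sketch of this principle in R (the file names and the cleaning step here are hypothetical), every modification reads the raw file and writes its result somewhere else, leaving the raw file untouched:

```r
# Read the raw data, clean it, and save the result as a NEW file
raw <- read.csv("input/raw/survey_raw.csv")

# Example cleaning step: drop rows with missing income
clean <- raw[!is.na(raw$income), ]

# Write to a separate location; never overwrite the raw file
write.csv(clean, "output/survey_clean.csv", row.names = FALSE)
```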
Automate the data download process (if possible)
Humans are error-prone. Don't expect them to be very good at following your documentation; people will always make mistakes. Thus, it is best to automate a process in your workflow whenever you can, rather than writing instructions for people to follow by hand.
One example is the data download process. Instead of manually clicking and downloading the data from a website, you can automate this step with a bash script. First, locate the data download link on the website; you can get it by right-clicking the link and selecting "Copy link address". Then, you can write a bash script as follows:
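```bash
#!/bin/bash
wget -nc -P [file_path] [data_download_link]
```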
`wget` is a command-line tool that downloads files from the internet. The `-nc` option means "no clobber": the file will not be downloaded again if it already exists. The `-P` option sets the directory prefix: the file will be downloaded into the specified directory.
Also, a lot of bulk data sources nowadays provide an API for downloading the data from a script. If such an API is available, you can use it to write a script that downloads the data automatically.
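As an illustration (the URL, file name, and query parameters below are purely hypothetical; consult the provider's API documentation for the real ones), an API download can often be scripted with `curl`:

```bash
#!/bin/bash
# Hypothetical example: fetch a dataset from a provider's API,
# skipping the download if the file is already present
if [ ! -f input/panel_data.csv ]; then
  curl -L -o input/panel_data.csv \
    "https://example.org/api/v1/datasets/panel?format=csv"
fi
```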
We will later discuss it in the build automation section, but I have included this logic in the `Makefile`.
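As a rough sketch of what such a rule might look like (the target name below is illustrative and not necessarily what my template uses; note that recipe lines in a Makefile must be indented with a real tab):

```make
# Hypothetical rule: fetch the raw data into input/ only if it is not already there
data:
	wget -nc -P input [data_download_link]

.PHONY: data
```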
Coding
Set up an environment-independent environment
In general, having a reproducible workflow means your code can be run in different environments, ranging from different coding packages to different operating systems. But how can we achieve this? The best practice is to set up your research workflow so that the environment you used can be recreated on other people's computers. This is usually done with several tools: `Docker`, package dependency managers, and so on. We will discuss these in more detail in the upcoming sections.
Relative, relative, relative (path)!
Please please don’t have something like this in the first line of your source code:
setwd("C:\Users\jenny\path\that\only\I\have")
If you do, I swear I will come to your office and set your computer on fire.
You can clearly see that this violates the environment-independent environment principle we discussed earlier. There is just no chance that other people's computers have the same absolute path to the data as yours, so all your code will break if you hard-code it.
Then what should we do? The answer is simple: use relative paths. A relative path is a path relative to the current working directory. For example, if you are in the `C:\Users\jenny\path\that\only\I\have` directory and the data sits in its `data` subfolder, the relative path to the data is `data/data.csv`.
The way to implement this is very simple. In the case of `R`, you can use an R project together with the `here` package. The `here` package lets you write project-relative paths throughout your research project. All you need to do is have either an R project file or a `.here` file in your project root directory. Then you can use the `here()` function to build the path to the data.
Let’s suppose that your project has the following structure:
```
project/
├── data/
├── src/code.R
├── README.md
├── .here
```
Using the `here` package, you can write the path in your `code.R` file as follows:
```r
library(here)
data_path <- here("data", "data.csv")
```
In this code, `here("data", "data.csv")` will return something like `/home/usrs/project/data/data.csv` regardless of the current working directory. Thus, the `here` package takes care of the relative/absolute path issue without you having to worry about it.
If you are familiar with `R` and R projects, you might be wondering why you need to use the `here` package explicitly. After all, if you have an R project, you can just use relative paths in your code even without the `here()` function. There are several reasons. First, other researchers might want to run the code from the terminal; in that case, the R project file is not automatically detected, which can lead to path errors. Second, the `here()` function is more robust across operating systems. For example, Windows and Linux use different path separators, and `here()` takes care of this for you.
These kinds of project environments are common in many different programming languages, so you should have no problem applying the same idea if you work in another language.
Get used to shell scripting
Why use the shell (bash, zsh, the terminal, etc.)? There are several reasons.
- It is more efficient to use shell scripting to automate the process.
Suppose you need to create 100 files named `file_1.txt`, `file_2.txt`, `file_3.txt`, …, `file_100.txt`. You can write a shell script to do this as follows:
touch file_{1..100}.txt
You can also delete them all as follows:
rm file_*.txt
- Many tools used for reproducible research workflows are designed to be used in the terminal. For example, `Docker` is designed to be driven from the terminal, and if you are using `Docker`, it will be a lot easier to work with it there. The same goes for makefiles.
- Powerful tools that can help your research use the terminal as their interface. For example, HPC clusters usually run Linux, and most of their tools are designed to work in the terminal.
In my template, `bash` will be used a lot in the `Makefile`.
Use version control
Remember writing the "final version" of a document? `final-version.pdf`, `final-final-version.pdf`, `really-really-final-version.pdf`, and so on. This is inefficient because (1) it is hard to keep track of all the versions, (2) the file names are not descriptive enough to tell what changed between versions, (3) it is hard to revert to a previous version if you need to, and (4) you might accidentally overwrite a previous version if you are not careful.
The same problem applies to your source code. When you write code, you are constantly changing its content: adding to it, deleting parts, or going back and forth between versions. If you are not careful, this can lead to all sorts of errors. This is where version control systems come in. They let you keep track of every version of your code and revert to a previous one when you need to. In the coding world, `Git` is the industry standard for version control. If you don't know `Git` yet, please learn it ASAP. While learning `Git`, you should also learn about `GitHub`, as it is the most popular platform for hosting `Git` repositories.
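A minimal sketch of the day-to-day `Git` loop (the file name and commit message are hypothetical; any `Git` tutorial covers the details):

```bash
git init                            # turn the project folder into a Git repository
git add src/clean_data.R            # stage a file you changed
git commit -m "Clean raw survey"    # record a snapshot with a descriptive message
git log --oneline                   # browse the history of snapshots
```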
Use Docker to “ship” your computer
Use package dependencies managers
Use build automation tools
Use symlinks to avoid storing big data in the repository
Make your code invulnerable to “restart session”
Please please don’t have something like this in the first line of your source code:
rm(list = ls())
If you do, I swear I will again come to your office and set your computer on fire.
This is because it does not do what you intend. When you run this code, you are probably trying to clean up the environment and start from a fresh state. However, it only removes the objects in the global environment; the non-data portions of the session survive. For example, the packages you loaded are still attached.
Then what should we do? Instead of using the code above, just restart the whole session. This guarantees that your session is completely fresh, and it imposes good discipline, since your code should be able to run from a clean state.
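One concrete way to enforce this discipline (assuming your script lives at `src/code.R`, as in the example project above) is to run the script non-interactively, which always starts from a brand-new R session:

```bash
# Runs src/code.R in a fresh R session every time
Rscript src/code.R
```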
Comment (almost) everything
When in doubt, comment everything in your code. This should also be obvious: I am pretty sure you have had many experiences of going through someone else's code and getting frustrated by 1,000 lines of convoluted code with no explanations at all.
For example, imagine you have to decode this (I know this is a bit extreme…):