Reproducible Research from Day 1

(derived from https://github.com/codedthinking/day1)

Instructors: Miklós Koren and Lars Vilhuber, data editors of the Review of Economic Studies and the American Economic Association

Summary: Journals require that you share your code and data in a replication package at the end of your research project. Following some best practices from day 1 not only helps you prepare this package later, but also makes you a more productive researcher. In this workshop, we start with an empty folder and finish with a mini-project about public procurement across various European countries. Together we discuss and document all the choices we make about data collection and analysis, in a way that can help future readers of our research.

Outline

Tools needed:

Stata (see Tools.md if you do not have a Stata license). It is feasible to do the whole exercise in R as well, but examples will be in Stata.
text editor (we suggest VS Code, but any will do)
browser
file browser (e.g., Windows Explorer, Finder, Nautilus, etc.)

Setup
1. Create an empty folder, like Documents/day1
2. Navigate to this folder in Stata
3. Open this folder in your text editor, if it can open folders
4. Google "CEPII GeoDist"
Explain need for automation and data documentation
1. No point-and-click downloads
2. No download before checking and documenting that we have the right to do so
3. Code: (Stata) copy "$URL" "(outputfile)", replace; (R) download.file(url="$URL", destfile="(outputfile)")
STOP - how do we keep track of that?
Create README.md before doing anything
Add a data availability statement (DAS) and data citation
1. Cite both dataset and working paper
2. Add data URL and time accessed (can you think of a way to automate this?)
3. Add a link to license (also: download and store the license)
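One possible answer to the automation question above, as a minimal sketch: let the download script itself append the URL and the access time to a small log file that the README can point to. The URL and the log file name below are hypothetical placeholders, not the workshop's actual values.

```stata
* Sketch: record where and when a file was downloaded.
* The URL and the log file name are hypothetical placeholders.
local url     "https://example.org/path/to/dataset.dta"
local logfile "data/raw/download_log.txt"

* mkdir creates one level at a time; capture ignores "already exists"
capture mkdir "data"
capture mkdir "data/raw"

* c(current_date) and c(current_time) are built-in Stata system values
file open  fh using "`logfile'", write text append
file write fh "`url' accessed `c(current_date)' `c(current_time)'" _n
file close fh
```

Because the log is opened with the append option, each run adds a new line, so the file keeps a history of every access.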
Create separate folders for each function: code/ and data/
Create a script 01_download_data.do
Run the script using relative paths only, and check that the data have been downloaded
1. Where should the downloaded data go? (data/raw, data/external?)
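The download step could look roughly like the sketch below. It assumes the script is run from the project root and that raw data go to data/raw; the URL and the file name are placeholders that must be replaced with the address and name found on the data provider's site.

```stata
* code/01_download_data.do
* Sketch: download a raw data file using relative paths only.
* Run from the project root (e.g., Documents/day1).
* The URL and file name are placeholders, NOT the actual data address.
global URL "https://example.org/path/to/rawdata.dta"
local  outputfile "data/raw/rawdata.dta"

* mkdir creates one level at a time; capture ignores "already exists"
capture mkdir "data"
capture mkdir "data/raw"

* Download without any pointing and clicking
copy "$URL" "`outputfile'", replace

* Stop with an error if the file did not arrive
confirm file "`outputfile'"
```

Keeping the URL in one place at the top of the script makes it easy to cite the same address in README.md.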
Repeat the same steps for Tenders Electronic Daily (TED), using the sample distributed by us
1. License, citations, etc. are already introduced, so it is easier to deal with the extra complication of a data extract
2. Do we publish this extract to be cited as distributors? I have a GitHub release, but no DOI.
Write a data filtering script for TED.
1. What do we filter on? Provide instructions.
Write a merging script. These two can go in the same script, 02_create_sample.do
1. Are country codes compatible?
2. Where does the newly merged data go? (data/interim?)
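The country-code question can be checked mechanically. The sketch below uses a toy crosswalk and toy tender records with made-up values (not the workshop data) to show the pattern: merge, inspect the unmatched codes, then save the result. All variable names and the output path data/interim/sample.dta are assumptions for illustration.

```stata
* Toy illustration with made-up values, not the workshop data.
* Variable names and output path are hypothetical.
clear
input str2 iso2 str3 iso3
"HU" "HUN"
"DE" "DEU"
"FR" "FRA"
end
tempfile crosswalk
save `crosswalk'

clear
input str2 iso2 double amount
"HU" 100
"DE" 250
"XX" 50
end

* Many tenders per country, one crosswalk row per country
merge m:1 iso2 using `crosswalk'

* Inspect which codes failed to match before dropping anything
tabulate _merge
list iso2 if _merge == 1

* Keep matched records only
keep if _merge == 3
drop _merge

capture mkdir "data"
capture mkdir "data/interim"
save "data/interim/sample.dta", replace
```

Listing the unmatched codes before the keep statement documents why observations were lost, which is useful material for the README.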
Write 03_define_variables.do
Write 04_analysis.do. Save result (figure?) in a separate folder (results/figures?)
1. Figure of log procurement amount against log distance
2. Style figure using plotplain (Stata) or theme_minimal() (R)
  - What do you need in Stata to do that?
3. Automate saving the figure: (Stata) graph export "results/figures/figure1.png", replace; (R) ggsave("results/figures/figure1.png")
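The figure steps above might be sketched as follows. The data are simulated placeholders so that the snippet runs on its own, and the variable names are assumptions. The plotplain scheme is not built into Stata; it comes with the user-written blindschemes package on SSC, which the sketch installs if the scheme file is not found.

```stata
* Placeholder data, simulated so the sketch runs on its own.
* Variable names are hypothetical.
clear
set seed 12345
set obs 200
generate distance = 100 + 5000 * runiform()
generate amount   = exp(10 - 0.5 * ln(distance) + rnormal())

generate ln_distance = ln(distance)
generate ln_amount   = ln(amount)

* plotplain ships with the user-written blindschemes package
capture findfile scheme-plotplain.scheme
if _rc != 0 {
    ssc install blindschemes
}

twoway scatter ln_amount ln_distance, scheme(plotplain) ///
    xtitle("Log distance") ytitle("Log procurement amount")

* Save the figure from code, never from the menu
capture mkdir "results"
capture mkdir "results/figures"
graph export "results/figures/figure1.png", replace
```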
Write a main script, main.do.
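A main script can be as short as the sketch below. It assumes the four scripts named in this outline sit in code/ and that Stata's working directory is the project root.

```stata
* main.do: run the whole project, in order, from the project root.
* Assumes the scripts live in code/, as set up earlier in the outline.
clear all
set more off

do "code/01_download_data.do"
do "code/02_create_sample.do"
do "code/03_define_variables.do"
do "code/04_analysis.do"
```

A version statement at the top, set to the Stata release actually used, is a common addition; it is left out here because the right number depends on your installation.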
Document these steps in README.md
ZIP and ready

Stretch goals

In both Stata and R versions, we used external packages. How to document them? Where to install them?
Publish the ZIP file on a repository (e.g., Zenodo Sandbox) or via a share (e.g., Dropbox, OSF, etc.)
Exchange information with somebody in the room.
Download their ZIP file, and try to reproduce their results.
1. Can you do so by changing only 1 line of code?
2. Can you do so without changing a single line of code?
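For the first stretch goal, one option (a sketch, not the only approach) is to install user-written packages into a folder inside the project, so that they travel with the ZIP file. The folder name code/ado is an assumption; blindschemes is used as the example because it provides the plotplain scheme.

```stata
* Sketch: keep user-written packages inside the project.
* The folder name code/ado is an assumption.
capture mkdir "code"
capture mkdir "code/ado"

* Redirect Stata's PLUS directory, where ssc and net install put files
sysdir set PLUS "`c(pwd)'/code/ado"

* blindschemes provides the plotplain scheme
ssc install blindschemes, replace

* List what is installed, to copy into README.md
ado dir
```

Note that ssc install always fetches the current version of a package. Installing once and shipping the code/ado folder with the replication package freezes the versions that were actually used.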

License

This presentation:

Licensed under
