(derived from https://github.com/codedthinking/day1)
Instructors: Miklós Koren and Lars Vilhuber, data editors of the Review of Economic Studies and the American Economic Association
Summary: Journals require that you share your code and data in a replication package at the end of your research project. Following some best practices from day 1 can not only help you prepare this package later, but also make you a more productive researcher. In this workshop, we start with an empty folder and finish with a mini-project about public procurement across various European countries. Together we discuss and document all the choices we make about data collection and analysis, in a way that can help future readers of our research.
- Tools needed:
- Stata (see `Tools.md` if you do not have a Stata license). It is feasible to do the whole exercise in R as well, but examples will be in Stata.
- text editor (we suggest VS Code, but any will do)
- browser
- file browser (e.g., Windows Explorer, Finder, Nautilus, etc.)
- Setup
- Create an empty folder, like `Documents/day1`
- Navigate to this folder in Stata (see the one-line example after this list)
- Open this folder in your text editor, if it can open folders
- Google "CEPII GeoDist"
- Explain need for automation and data documentation
- No point and click download
- No download before checking and documenting we have the right to do so
- Code (Stata): `copy "$URL" (outputfile), replace`
  (R): `download.file(url = "$URL", destfile = "(outputfile)")`
- STOP - how do we keep track of that?
- Create `README.md` before doing anything
- Add DAS (Data Availability Statement) and data citation
- Cite both dataset and working paper
- Add data URL and time accessed (can you think of a way to automate this?)
- Add a link to license (also: download and store the license)
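One way to automate the "time accessed" note, as a sketch: it assumes the download URL is stored in the global macro `URL`, and uses Stata's built-in `file` commands and clock values:

```stata
* Append the data URL and the access date/time to README.md.
* c(current_date) and c(current_time) are Stata's built-in clock values.
file open readme using "README.md", write append
file write readme "Data downloaded from $URL on `c(current_date)' at `c(current_time)'." _n
file close readme
```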
- Create separate folders for each function: `code/` and `data/`
- Create a script `01_download_data.do`
- Run script with relative path only, check data has been downloaded
- Where should the downloaded data go? (`data/raw`, `data/external`?)
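A minimal sketch of what `01_download_data.do` might contain, assuming the data goes into `data/raw/`. The URL below is a placeholder, not the real link; take the actual download link (and the license!) from the CEPII GeoDist page:

```stata
* 01_download_data.do -- download the CEPII GeoDist distance data.
* The URL is a placeholder; copy the actual link from the CEPII GeoDist page.
global URL "http://www.cepii.fr/path/to/dist_cepii.dta"
copy "$URL" "data/raw/dist_cepii.dta", replace
```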
- Repeat the same steps for Tenders Electronic Daily (TED), using a sample distributed by us
- licenses, citations, etc. are already introduced, so it is easier to deal with the extra complication of a data extract
- Do we publish this extract to be cited as distributors? I have a GitHub release, but no DOI.
- Write a data filtering script for TED.
- What do we filter on? Provide instructions.
- Write a merging script. The filtering and merging steps can go in the same script, `02_create_sample.do` (a sketch follows after this list)
- Are country codes compatible?
- Where does the newly merged data go? (`data/interim`?)
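A hedged sketch of `02_create_sample.do`. The TED variable names (`amount_eur`, `buyer_country`, `supplier_country`) are assumptions about the extract; check them against the actual data. CEPII's `dist_cepii.dta` identifies country pairs by `iso_o` and `iso_d`:

```stata
* 02_create_sample.do -- filter the TED extract and merge on distances.
* TED variable names below are assumptions; verify against the extract.
use "data/raw/ted_sample.dta", clear

* Filter: e.g., drop notices with no contract value (adapt per instructions).
keep if !missing(amount_eur)

* Harmonize country codes: dist_cepii uses 3-letter ISO codes,
* while TED may use 2-letter codes -- convert before merging.
rename buyer_country iso_o
rename supplier_country iso_d

merge m:1 iso_o iso_d using "data/raw/dist_cepii.dta", keep(match) nogenerate
save "data/interim/sample.dta", replace
```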
- Write `03_define_variables.do`
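A sketch of `03_define_variables.do`, carrying over the assumed variable names from the previous step (`dist` comes from `dist_cepii.dta`):

```stata
* 03_define_variables.do -- construct the analysis variables.
use "data/interim/sample.dta", clear
generate ln_amount = ln(amount_eur)   // log procurement amount
generate ln_dist   = ln(dist)         // log distance, from dist_cepii
save "data/interim/analysis_sample.dta", replace
```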
- Write `04_analysis.do`. Save the result (figure?) in a separate folder (`results/figures`?)
- Figure of log procurement amount against log distance
- Style the figure using `plotplain` (Stata) or `theme_minimal()` (R)
- What do you need in Stata to do that?
- Automate saving the figure: (Stata) `graph export "results/figures/figure1.png", replace`
  (R): `ggsave("results/figures/figure1.png")`
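Putting these pieces together, `04_analysis.do` could look like this. A sketch, assuming the variables defined in `03_define_variables.do` and the `plotplain` scheme already installed (see the package note below):

```stata
* 04_analysis.do -- plot log procurement amount against log distance.
use "data/interim/analysis_sample.dta", clear
set scheme plotplain               // provided by the blindschemes package
scatter ln_amount ln_dist
graph export "results/figures/figure1.png", replace
```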
- Write a main script, `main.do`, that runs all steps in order (a sketch follows below)
- Document these steps in `README.md`
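A minimal sketch of `main.do`; run it from the project root so all relative paths resolve:

```stata
* main.do -- run the full pipeline in order, from the project root.
do "code/01_download_data.do"
do "code/02_create_sample.do"
do "code/03_define_variables.do"
do "code/04_analysis.do"
```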
- ZIP and ready
- In both Stata and R versions, we used external packages. How to document them? Where to install them?
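One convention (an assumption, not the only option) is to install packages into the project itself, so the replication package is self-contained, and to list them with their versions in `README.md`:

```stata
* At the top of main.do: redirect installations into the project, then install.
* ssc install places packages in the PLUS directory by default.
sysdir set PLUS "code/ado"
ssc install blindschemes, replace   // provides the plotplain scheme
```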
- Publish the ZIP file on a repository (e.g., Zenodo Sandbox) or via a share (e.g., Dropbox, OSF, etc.)
- Exchange information with somebody in the room.
- Download their ZIP file, and try to reproduce their results.
- Can you do so by changing only 1 line of code?
- Can you do so without changing a single line of code?