From c1269c7231d06e39ad191d21ed41b8933b1397d2 Mon Sep 17 00:00:00 2001 From: Lars Vilhuber Date: Mon, 1 Jul 2024 01:34:03 -0400 Subject: [PATCH] Last minute updates relatively complete. --- presentation/01-run_it_again.md | 14 +- presentation/02-00-intermezzo.md | 2 +- presentation/02-hands_off_running.md | 20 +- .../03-automatically_saving_figures.md | 12 +- presentation/04-creating_log_files.md | 10 +- presentation/12-environments-in-stata.md | 286 +++++++---- presentation/70-other_methods.md | 23 + presentation/99-confidential_data.md | 14 +- presentation/index.Rmd | 13 +- presentation/index.html | 485 +++++++++++------- presentation/notes.md | 2 +- 11 files changed, 568 insertions(+), 313 deletions(-) create mode 100644 presentation/70-other_methods.md diff --git a/presentation/01-run_it_again.md b/presentation/01-run_it_again.md index 454dd41..0611e5a 100644 --- a/presentation/01-run_it_again.md +++ b/presentation/01-run_it_again.md @@ -18,26 +18,28 @@ While the code, once set to run, can do so on its own, *you* might need to spend --- -![](images/Red-Warning-PNG-Clipart.png) +![](images/Red-Warning-PNG-Clipart.png){.center width="350" height="300"} -*This should be a warning sign:* if it takes you a long time to get it to run, or to manually reproduce the results, it might take others even longer.[^warning-sign] +*This should be a warning sign:* + +If it takes you a long time to get it to run, or to manually reproduce the results, it **might take others even longer**.[^warning-sign] [^warning-sign]: Source: [Red Warning PNG Clipart](https://www.pngall.com/warning-sign-png/download/69408), CC-BY. --- -Furthermore, it may suggest that you haven't been able to re-run your own code very often, which can be correlated with fragility or even lack of reproducibility. +Furthermore, it may suggest that you **haven't been able to re-run** your own code very often, which can indicate **fragility** or even **lack of reproducibility**.
-## Takeaways +## Takeaways {.smaller} ::: {.incremental} -- ✅ your code runs without problem, after all the debugging. +- [x] your code runs without problem, after all the debugging. - [ ] your code runs without manual intervention, and with low effort - [ ] it actually produces all the outputs -- [ ]your code generates a log file that you can inspect, and that you could share with others. +- [ ] your code generates a log file that you can inspect, and that you could share with others. - [ ] it will run on somebody else's computer ::: \ No newline at end of file diff --git a/presentation/02-00-intermezzo.md b/presentation/02-00-intermezzo.md index f70220c..dd3d8b4 100644 --- a/presentation/02-00-intermezzo.md +++ b/presentation/02-00-intermezzo.md @@ -9,7 +9,7 @@ Automation and robustness checks, as well as efficiency. Generating a log file means that you can inspect it, and you can share it with others. Also helps in debugging, for you and others. -## Will it run on somebody else's computer? +## Will it run on somebody else's computer? {.smaller} Running it again does not help: diff --git a/presentation/02-hands_off_running.md b/presentation/02-hands_off_running.md index 8bd52bd..a04a411 100644 --- a/presentation/02-hands_off_running.md +++ b/presentation/02-hands_off_running.md @@ -4,15 +4,23 @@ Did it take you a long time to run everything again? ![⏳](https://c.tenor.com/4qs0klfg8nMAAAAC/tenor.gif) -# Let's ramp it up a bit. +## Let's ramp it up a bit. -Your code must run, beginning to end, top to bottom, without error, and without any user intervention. This should in principle (re)create all figures, tables, and numbers you include in your paper. +- Your code must run, **beginning to end**, top to bottom, without error, and **without any user intervention**. + +- This should in principle (re)create all **figures**, **tables**, and **in-text numbers** you include in your paper. 
::: {.notes} We have seen users who appear to highlight code and to run it interactively, in pieces, using the program file as a kind of notepad. This is not reproducible, and should be avoided. It is fine for debugging. ::: +## Seem trivial? + +> Out of **8280** replication packages in ~20 top econ journals, only **2594** (**31.33%**) had a main/controller script.[^resultsmain] + +[^resultsmain]: Results computed on Nov 26, 2023 based on a scan of replication packages conducted by Sebastian Kranz. 2023. "Economic Articles with Data". https://ejd.econ.mathematik.uni-ulm.de/, searching for the words `main`, `master`, `makefile`, `dockerfile`, `apptainer`, `singularity` in any of the program files in those replication packages. Code not yet integrated into this presentation. + ## TL;DR - Create a "main" file that runs all the other files in the correct order. @@ -21,7 +29,7 @@ We have seen users who appear to highlight code and to run it interactively, in ## Creating a main or master script -In order to be able to enable "hands-off running", the main script is key. I will show here a few simple examples for single-software replication packages. We will discuss more complex examples in one of the next chapters. +In order to be able to enable "hands-off running", the **main (controller) script is key**. I will show here a few simple examples for single-software replication packages. We will discuss more complex examples in one of the next chapters. ## Examples @@ -210,11 +218,11 @@ Do not use `Rscript`, as it will not generate enough output! For examples for **Julia, Python, MATLAB,** and **multi-software scripts**, see the [full text](https://larsvilhuber.github.io/self-checking-reproducibility/02-hands_off_running.html). -## Takeaways +## Takeaways {.smaller} -- ✅ your code runs without problem, after all the debugging. -- ✅your code runs without manual intervention, and with low effort +- [x] your code runs without problem, after all the debugging. 
+- [x] your code runs without manual intervention, and with low effort - [ ] it actually produces all the outputs - [ ] your code generates a log file that you can inspect, and that you could share with others. - [ ] it will run on somebody else's computer diff --git a/presentation/03-automatically_saving_figures.md b/presentation/03-automatically_saving_figures.md index dcd91d0..17837dc 100644 --- a/presentation/03-automatically_saving_figures.md +++ b/presentation/03-automatically_saving_figures.md @@ -79,15 +79,11 @@ Learn how to save tables in robust, reproducible ways. Do not try to copy-paste `xtable`, `stargazer`, others. -## Takeaways +## Takeaways {.smaller} -::: {.incremental} - -- ✅ your code runs without problem, after all the debugging. -- ✅your code runs without manual intervention, and with low effort -- ✅it actually produces all the outputs +- [x] your code runs without problem, after all the debugging. +- [x] your code runs without manual intervention, and with low effort +- [x] it actually produces all the outputs - [ ] your code generates a log file that you can inspect, and that you could share with others. - [ ] it will run on somebody else's computer - -::: \ No newline at end of file diff --git a/presentation/04-creating_log_files.md b/presentation/04-creating_log_files.md index 49cae6f..9cd4d84 100644 --- a/presentation/04-creating_log_files.md +++ b/presentation/04-creating_log_files.md @@ -197,11 +197,11 @@ which will create a log file with everything that would normally appear on the c -## Takeaways +## Takeaways {.smaller} -- ✅ your code runs without problem, after all the debugging. -- ✅your code runs without manual intervention, and with low effort -- ✅it actually produces all the outputs -- ✅your code generates a log file that you can inspect, and that you could share with others. +- [x] your code runs without problem, after all the debugging. 
+- [x] your code runs without manual intervention, and with low effort +- [x] it actually produces all the outputs +- [x] your code generates a log file that you can inspect, and that you could share with others. - [ ] it will run on somebody else's computer diff --git a/presentation/12-environments-in-stata.md b/presentation/12-environments-in-stata.md index 271b91b..9a5f295 100644 --- a/presentation/12-environments-in-stata.md +++ b/presentation/12-environments-in-stata.md @@ -5,19 +5,30 @@ - Creating virtual environments in Stata is feasible - Doing so stabilizes the code, and makes it more transportable -## Search paths in Stata +## Search paths in Stata {.smaller} -In Stata, we typically do not talk about environments, but the same basic structure applies: Stata searches along a set order for its commands. Some commands are built into the executable (the software that is opened when you click on the Stata icon), but most other internal, and all external commands, are found in a search path. This is typically the `ado` directory in the Stata installation directory, and one will find replication packages that contain instructions to copy files into that directory. Once we've shown how environments work in Stata, this will become a lot simpler! +In Stata, we typically do not talk about environments, but the same basic structure applies: Stata searches along a set order for its commands. -### The `sysdir` directories +## Search paths in Stata {.smaller} +Some commands are built into the executable (the software that is opened when you click on the Stata icon), but most other internal, and all external commands, are found in a search path. 
+ + +## The `sysdir` directories {.smaller} + +:::: {.columns} + +::: {.column width=40%} The default set of directories which can be searched, from a freshly installed Stata, can be queried with the `sysdir` command, and will look something like this: ```stata sysdir ``` +::: -``` +::: {.column width=60%} + +```{.stata} STATA: C:\Program Files\Stata18\ BASE: C:\Program Files\Stata18\ado\base\ SITE: C:\Program Files\Stata18\ado\site\ @@ -25,8 +36,15 @@ sysdir PERSONAL: C:\Users\lv39\ado\personal\ OLDPLACE: c:\ado\ ``` +::: + +:::: + +## The `adopath` search order {.smaller} -### The `adopath` search order +:::: {.columns} + +::: {.column width=40%} The search paths where Stata looks for commands is queried by `adopath`, and looks similar, but now has an order assigned to each entry: @@ -34,7 +52,11 @@ The search paths where Stata looks for commands is queried by `adopath`, and loo adopath ``` -``` +::: + +::: {.column width=60%} + +```{.stata} [1] (BASE) "C:\Program Files\Stata18\ado\base/" [2] (SITE) "C:\Program Files\Stata18\ado\site/" [3] "." @@ -43,30 +65,78 @@ adopath [6] (OLDPLACE) "c:\ado/" ``` -To look for a command, say `reghdfe`, Stata will look in the first directory, then the second, and so on, until it finds it. If it does not find it, it will return an error. We can query the location of `reghdfe` explicitly with `which`: +::: + +:::: + +## The path at work {.smaller} + +:::: {.columns} + +::: {.column width=40%} + + +To look for a command, Stata will look in the first directory, then the second, and so on, until it finds it. If it does not find it, it will return an error. ```stata which reghdfe ``` +::: + +::: {.column width=60%} ``` command reghdfe not found as either built-in or ado-file r(111); ``` -### Where are packages installed? +::: + +:::: + +## Where are packages installed? [^net-ref]: [`net install` reference](https://www.stata.com/manuals/rnet.pdf).
Strictly speaking, the location where ado packages are installed can be changed via the `net set ado` command, but this is rarely done in practice, and we won't do it here. +:::: {.columns} + +::: {.column width=40%} -When we install a package, using one of the various package installation commands (`net install`, `ssc install`)[^net-ref], only one of the (`sysdir`) paths is relevant: `PLUS`. So if we install `reghdfe` with `ssc install reghdfe`, it will be installed in the `PLUS` directory, and we can see that with `which`: +When we install a package (`net install`, `ssc install`)[^net-ref], only one of the (`sysdir`) paths is relevant: `PLUS`. + +::: + +::: {.column width=60%} + +```{.stata code-line-numbers="5"} + [1] (BASE) "C:\Program Files\Stata18\ado\base/" + [2] (SITE) "C:\Program Files\Stata18\ado\site/" + [3] "." + [4] (PERSONAL) "C:\Users\lv39\ado\personal/" + [5] (PLUS) "C:\Users\lv39\ado\plus/" + [6] (OLDPLACE) "c:\ado/" +``` + +::: + +:::: + +## Installing packages {.smaller} + +:::: {.columns} + +::: {.column width=40%} ```stata ssc install reghdfe which reghdfe ``` -``` +::: + +::: {.column width=60%} + +```{.stata code-line-numbers="3|7"} . ssc install reghdfe checking reghdfe consistency and verifying not already installed... installing into C:\Users\lv39\ado\plus\... @@ -76,51 +146,79 @@ installation complete. C:\Users\lv39\ado\plus\r\reghdfe.ado *! version 6.12.3 08aug2023 ``` - -:::{.important} -It is important here to recognize that it is the value of the special `sysdir` directory `PLUS` that determines where Stata installs commands, but the separate list of `adopath` locations where it looks for commands. It is possible to install a command in a location that Stata does not look for commands! ::: -## Using environments in Stata +:::: -But the `(PLUS)` directory can be manipulated, and that creates the opportunity to create an "environment". 
+## Using environments in Stata {auto-animate=true} +:::: {.columns} +::: {.column width=40%} -```stata +But the `(PLUS)` directory can be manipulated -* Set the root directory +::: -global rootdir : pwd +::: {.column width=60%} -* Define a location where we will hold all packages in THIS project (the "environment") +```{.stata code-line-numbers="4|13-14"} +* Set the root directory +global rootdir : pwd +* Define a location where we will hold all packages in THIS project (the "environment") global adodir "$rootdir/ado" - * make sure it exists, if not create it. - cap mkdir "$adodir" - * Now let's simplify the adopath * - remove the OLDPLACE and PERSONAL paths * - NEVER REMOVE THE SYSTEM-WIDE PATHS - bad things will happen! - adopath - OLDPLACE adopath - PERSONAL - * modify the PLUS path to point to our new location, and move it up in the order - sysdir set PLUS "$adodir" adopath ++ PLUS - * verify the path - adopath ``` -which should show something like this: +::: + +:::: +## Using environments in Stata {.smaller auto-animate=true transition="none"} + +:::: {.columns} + + +::: {.column width=40%} + + +```{.stata code-line-numbers="13-14"} +* Set the root directory +global rootdir : pwd +* Define a location where we will hold all packages in THIS project (the "environment") +global adodir "$rootdir/ado" +* make sure it exists, if not create it. +cap mkdir "$adodir" +* Now let's simplify the adopath +* - remove the OLDPLACE and PERSONAL paths +* - NEVER REMOVE THE SYSTEM-WIDE PATHS - bad things will happen! +adopath - OLDPLACE +adopath - PERSONAL +* modify the PLUS path to point to our new location, and move it up in the order +sysdir set PLUS "$adodir" +adopath ++ PLUS +* verify the path +adopath ``` + +::: + +::: {.column width=60%} + + +```{.stata code-line-numbers="2"} . adopath [1] (PLUS) "C:\Users\lv39\Documents/PROJECT123/ado/" [2] (BASE) "C:\Program Files\Stata18\ado\base/" @@ -128,124 +226,116 @@ which should show something like this: [4] "." 
``` +::: +:::: + +## Using environments in Stata {auto-animate=true} + +:::: {.columns} + +::: {.column width=40%} + Let's verify again where the `reghdfe` package is: ```stata which reghdfe ``` +::: -``` +::: {.column width=60%} + +```{.stata code-line-numbers="2"} . which reghdfe command reghdfe not found as either built-in or ado-file r(111); ``` -So it is no longer found. Why? Because we have removed the previous location (the old `PLUS` path) from the search sequence. It's as if it didn't exist. +::: +:::: +## Using environments in Stata {.smaller auto-animate=true transition="none"} -## Installing packages when an environment is active +So it is no longer found. Why? Because we have removed the previous location (the old `PLUS` path) from the search sequence. It's as if it didn't exist. +:::: {.columns} -When we now install `reghdfe` again: +::: {.column width=50%} -``` -. ssc install reghdfe -checking reghdfe consistency and verifying not already installed... -installing into C:\Users\lv39\Documents\PROJECT123\ado\plus\... -installation complete. +Previously: +```{.stata code-line-numbers="2"} . which reghdfe -C:\Users\lv39\Documents\PROJECT123\ado\plus\r\reghdfe.ado +C:\Users\lv39\ado\plus\r\reghdfe.ado *! version 6.12.3 08aug2023 ``` +::: -We now see it in the **project-specific** directory, which we can distribute with the whole project (more on that [later](reproducing-environments)). +::: {.column width=50%} -## Installing precise versions of packages +```{.stata code-line-numbers="2|3|4|5"} +. adopath + [1] (PLUS) "C:\Users\lv39\Documents/PROJECT123/ado/" + [2] (BASE) "C:\Program Files\Stata18\ado\base/" + [3] (SITE) "C:\Program Files\Stata18\ado\site/" + [4] "." +``` +::: -Let's imagine we need an older version of `reghdfe`. In general, it is **not** possible in Stata to install an older version of a package in a straightforward fashion. 
You *may* have success with the [Wayback Machine archive of SSC](https://web.archive.org/web/20141226200440/http://fmwww.bc.edu/RePEc/bocode/), which in some cases goes back to 2000, by carefully reconstructing the necessary files. +:::: -Here, we will leverage the **SSC Snapshot** maintained by Lars Vilhuber on Github ([https://github.com/labordynamicsinstitute/ssc-mirror/](https://github.com/labordynamicsinstitute/ssc-mirror/)), which has been capturing snapshots of SSC since [late 2021](https://github.com/labordynamicsinstitute/ssc-mirror/releases/tag/2021-12-21) (details are for a different tutorial): +## Installing packages when an environment is active {.smaller} -```stata -* define the date -global sscdate "2021-12-21" -net install reghdfe, from(https://raw.githubusercontent.com/labordynamicsinstitute/ssc-mirror/${sscdate}/fmwww.bc.edu/repec/bocode/r) -``` -which gives us +When we now install `reghdfe` again: -``` -. net install reghdfe, from(https://raw.githubusercontent.com/labordynamicsinsti -> tute/ssc-mirror/${sscdate}/fmwww.bc.edu/repec/bocode/r) +```{.stata code-line-numbers="3|7"} +. ssc install reghdfe checking reghdfe consistency and verifying not already installed... -installing into C:\Users\lv39\Documents/ado\... +installing into C:\Users\lv39\Documents\PROJECT123\ado\plus\... installation complete. . which reghdfe -C:\Users\lv39\Documents/ado\r\reghdfe.ado -*! version 5.7.3 13nov2019 -``` - -So we now have TWO different version of `reghdfe` installed: - -- Version 5.7.3 from Nov 2019 is installed at `C:\Users\lv39\Documents/ado\r\reghdfe.ado` -- Version 6.12.3 from Aug 2023 is installed at `C:\Users\lv39\ado\plus\r\reghdfe.ado` - -:::{admonition} Stata can get confused about how to write paths... -:class: dropdown - -Stata on Windows can understand two types of path syntax: the "Windows" syntax, with backslashes `\`, and the "Unix" syntax, with forward slashes '/'. 
It will usually report paths in the "Windows" syntax, but these will not work, if coded as such, on non-Windows platforms, which do not understand the backslash as a path separator. We have used platform-agnostic paths above, using forward slashes. This then generates the "weird" mixed notation: - -``` -C:\Users\lv39\Documents/ado\r\reghdfe.ado +C:\Users\lv39\Documents\PROJECT123\ado\plus\r\reghdfe.ado +*! version 6.12.3 08aug2023 ``` -Other software (e.g, R), will consistently use the forward slash, even on Windows, when paths are coded internally. +We now see it in the **project-specific** directory, which we can distribute with the whole project. +## Installing precise versions of Stata packages {.smaller} -Only the former is used by the "environment" we just configured! Which is a good thing, since there are a few functional differences between these two packages. But for the life of *this* project, that functionality can now be relied upon - as long as we take care to use the same "environment" for all code run within this project. This can be achieved by using the `main.do` defined in [one of the previous sections](hands-off-running): +Let's imagine we need an older version of `reghdfe`. -```stata -* main.do -* This is a simple example of a main file in Stata -* It runs all the other files in the correct order +- In general, it is **not** possible in Stata to install an older version of a package in a straightforward fashion. +- You *may* have success with the [Wayback Machine archive of SSC](https://web.archive.org/web/20141226200440/http://fmwww.bc.edu/RePEc/bocode/). -* Set the root directory +## Package repositories -global rootdir : pwd +Most package repositories are versioned: -* Define a location where we will hold all packages in THIS project (the "environment") +- R: CRAN, Bioconductor +- Python: PyPI +- Julia: "General" default Julia package registry. -global adodir "$rootdir/ado" +Stata does not (as of 2024). 
**But** see [the full site](https://larsvilhuber.github.io/self-checking-reproducibility/12-environments-in-stata.html#installing-precise-versions-of-packages) for one approach. -* Enable project environment +## Takeaways {auto-animate=true .smaller} -cap mkdir "$adodir" -adopath - OLDPLACE -adopath - PERSONAL -sysdir set PLUS "$adodir" -adopath ++ PLUS -adopath - -* Run the data preparation file -do $rootdir/01_data_prep.do - -// etc. etc. -``` +From the earlier desiderata of *environments*: -:::{.notes} +- ✅ **Isolated**: Installing a new or updated package for one project won't break your other projects, and vice versa. +- ✅ **Portable**: Easily transport your projects from one computer to another, *even across different platforms*. +- ❌ **Reproducible**: Records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go. -While we used interactive commands to install the various packages here, that was only for illustrative purposes. **Always** script the installation of packages in a `setup.do` file. We will address how and when to run that file in the [next section](reproducing-environments). -## Takeaways +## Takeaways {.smaller} -From the earlier desiderata of *environments*: -- ✅ **Isolated**: Installing a new or updated package for one project won't break your other projects, and vice versa. -- ✅ **Portable**: Easily transport your projects from one computer to another, *even across different platforms*. -- ❌ **Reproducible**: Records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go. +- [x] your code runs without problem, after all the debugging. +- [x] your code runs without manual intervention, and with low effort +- [x] it actually produces all the outputs +- [x] your code generates a log file that you can inspect, and that you could share with others. 
+- [x] ❓ it will run on somebody else's computer diff --git a/presentation/70-other_methods.md b/presentation/70-other_methods.md new file mode 100644 index 0000000..d86418d --- /dev/null +++ b/presentation/70-other_methods.md @@ -0,0 +1,23 @@ +# Other methods + +## Non-technical means + +- Use a new computer +- Have an undergraduate student run it +- Ask your office neighbor to run it + +## More technical means {auto-animate=true} + +- [Virtual Machines](https://larsvilhuber.github.io/self-checking-reproducibility/72-virtual_machines.html) +- [Use of containers](https://larsvilhuber.github.io/self-checking-reproducibility/80-docker.html) + +## Use of containers {auto-animate=true} + +- **Containers** are a way to simulate a "computer within a computer", which can be used to run code in an isolated environment. +- They are relatively lightweight, and are *starting* to be used as part of replication packages in economics (but **only 0.13% of 8280 packages**...) + +## Use of containers {auto-animate=true} + +- They do not work in all situations, and require some **more advanced technical skills** (typically Linux, in addition to the statistical software). +- Using containers to test for reproducibility is easier, and should be considered as part of a toolkit. +- Several **online services** make such testing (and development) easy. \ No newline at end of file diff --git a/presentation/99-confidential_data.md b/presentation/99-confidential_data.md index 472d1f2..6cbbfbf 100644 --- a/presentation/99-confidential_data.md +++ b/presentation/99-confidential_data.md @@ -4,11 +4,17 @@ Do you know your rights? +## TL;DR + +- be able to separate the confidential data from other (to be made public) components +- all code must be available +- do not publish what you are not allowed to! 
+ ## Permissions These will be noted in the **data use agreement (DUA), license, or non-disclosure agreement (NDA)** that you **signed or clicked through** to obtain access to the data from the data provider. -- Careful: scraped or downloaded data that did not have an explicit license! +> Careful: scraped or downloaded data that did not have an explicit license! ## Keep in mind @@ -17,7 +23,7 @@ Just because - you (and the entire world) can **download the data** - does **NOT** give you the (automatic) right to **re-publish** the data. -## Permissions +## Permissions {.smaller} - Do NOT **transfer or publish** data that you have no rights to transfer. Always carefully read your data use agreement (DUA), license, or non-disclosure agreement (NDA) that you signed. - Do NOT upload restricted-access data to the journal's platform. @@ -97,7 +103,7 @@ data/ :::: -## Strategy {auto-animate=true} +## Strategy {auto-animate=true .smaller} When the replication package relies on confidential data that cannot be shared, or is shared under different conditions, you should @@ -151,7 +157,7 @@ data/ :::{.column width=50%} -Prepare a **non-confidential replication package** that contains all code, and any data that is not subject to publication controls +Prepare a **non-confidential replication package** that contains **all code**, and **any data** that is not subject to publication controls ::: diff --git a/presentation/index.Rmd b/presentation/index.Rmd index cb68d98..2f88347 100644 --- a/presentation/index.Rmd +++ b/presentation/index.Rmd @@ -1,5 +1,5 @@ --- -title: "Checking reproducibility on Day T-1" +title: "Checking reproducibility on Day T" author: - "Lars Vilhuber" date: June 30 2024 @@ -7,6 +7,14 @@ bibliography: "paper.bib" --- +![](https://teleseries.com.br/wp-content/uploads/2013/11/tumblr_lr9o5nVNb41qknbv6.gif) + +--- + +## Day T {auto-animate=true .smaller} + +## Day T**-1** {auto-animate=true .smaller} + ```{r, child=c('00-index-mod.md')} ``` @@ -63,6 +71,9 @@ 
Unpredictable things happening to your computing environment: ```{r, child=c('12-environments-in-stata.md')} ``` +```{r, child=c('70-other_methods.md')} +``` + ```{r, child=c('99-confidential_data.md')} ``` diff --git a/presentation/index.html b/presentation/index.html index 1b1f6d3..f34e419 100644 --- a/presentation/index.html +++ b/presentation/index.html @@ -12,7 +12,7 @@ - Checking reproducibility on Day T-1 + Checking reproducibility on Day T @@ -393,7 +393,7 @@
-

Checking reproducibility on Day T-1

+

Checking reproducibility on Day T

@@ -405,6 +405,16 @@

Checking reproducibility on Day T-1

2024-06-30

+
+ + +
+
+

Day T

+
+
+

Day T-1

+

Introduction

@@ -482,21 +492,22 @@

Making the code run takes YOU a very long time

-

-

This should be a warning sign: if it takes you a long time to get it to run, or to manually reproduce the results, it might take others even longer.1

+

+

This should be a warning sign:

+

If it takes you a long time to get it to run, or to manually reproduce the results, it might take others even longer.1

-

Furthermore, it may suggest that you haven’t been able to re-run your own code very often, which can be correlated with fragility or even lack of reproducibility.

+

Furthermore, it may suggest that you haven’t been able to re-run your own code very often, which can indicate fragility or even lack of reproducibility.

-
+

Takeaways

-
    -
  • ✅ your code runs without problem, after all the debugging.
  • +
      +
    • -
    • [ ]your code generates a log file that you can inspect, and that you could share with others.
    • +
@@ -514,7 +525,7 @@

Does your code run without manual intervention?

Can you provide evidence that you ran it?

Generating a log file means that you can inspect it, and you can share it with others. Also helps in debugging, for you and others.

-
+

Will it run on somebody else’s computer?

Running it again does not help:

@@ -527,16 +538,18 @@

Will it run on somebody else’s computer?

+

Hands-off running: Creating a controller script

Did it take you a long time to run everything again?

- -
-
-

Let’s ramp it up a bit.

-

Your code must run, beginning to end, top to bottom, without error, and without any user intervention. This should in principle (re)create all figures, tables, and numbers you include in your paper.

+
+

Let’s ramp it up a bit.

+
    +
  • Your code must run, beginning to end, top to bottom, without error, and without any user intervention.

  • +
  • This should in principle (re)create all figures, tables, and in-text numbers you include in your paper.

  • +
+
+

Seem trivial?

+
+

Out of 8280 replication packages in ~20 top econ journals, only 2594 (31.33%) had a main/controller script.2

+
+

TL;DR

    @@ -561,7 +580,7 @@

    TL;DR

Creating a main or master script

-

In order to be able to enable “hands-off running”, the main script is key. I will show here a few simple examples for single-software replication packages. We will discuss more complex examples in one of the next chapters.

+

In order to be able to enable “hands-off running”, the main (controller) script is key. I will show here a few simple examples for single-software replication packages. We will discuss more complex examples in one of the next chapters.

Examples

@@ -698,11 +717,11 @@

Notes for R

Other examples

For examples for Julia, Python, MATLAB, and multi-software scripts, see the full text.

-
+

Takeaways

-
    -
  • ✅ your code runs without problem, after all the debugging.
  • -
  • ✅your code runs without manual intervention, and with low effort
  • +
      +
    • +
    • @@ -802,17 +821,15 @@

      Stata

      R

      xtable, stargazer, others.

-
+

Takeaways

-
-
    -
  • ✅ your code runs without problem, after all the debugging.
  • -
  • ✅your code runs without manual intervention, and with low effort
  • -
  • ✅it actually produces all the outputs
  • -
  • -
  • +
      +
    • +
    • +
    • +
    • +
    -
@@ -988,13 +1005,13 @@

Creating log files automatically

-
+

Takeaways

-
    -
  • ✅ your code runs without problem, after all the debugging.
  • -
  • ✅your code runs without manual intervention, and with low effort
  • -
  • ✅it actually produces all the outputs
  • -
  • ✅your code generates a log file that you can inspect, and that you could share with others.
  • +
      +
    • +
    • +
    • +
@@ -1035,7 +1052,7 @@

Hold on…

Understanding search paths

-

Generically, all “environments” simply modify where the specific software searches (the “search path”) for its components, and in particular any supplementary components (packages, libraries, etc.).2

+

Generically, all “environments” simply modify where the specific software searches (the “search path”) for its components, and in particular any supplementary components (packages, libraries, etc.).3

Search paths

@@ -1109,167 +1126,260 @@

TL;DR

  • Doing so stabilizes the code, and makes it more transportable
  • -
    +
    +

    Search paths in Stata

    +

    In Stata, we typically do not talk about environments, but the same basic structure applies: Stata searches along a set order for its commands.

    +
    +

    Search paths in Stata

    -

    In Stata, we typically do not talk about environments, but the same basic structure applies: Stata searches along a set order for its commands. Some commands are built into the executable (the software that is opened when you click on the Stata icon), but most other internal, and all external commands, are found in a search path. This is typically the ado directory in the Stata installation directory, and one will find replication packages that contain instructions to copy files into that directory. Once we’ve shown how environments work in Stata, this will become a lot simpler!

    -

    The sysdir directories

    +

    Some commands are built into the executable (the software that is opened when you click on the Stata icon), but most other internal, and all external commands, are found in a search path.

    +
    +
    +

    The sysdir directories

    +
    +

    The default set of directories which can be searched, from a freshly installed Stata, can be queried with the sysdir command, and will look something like this:

    sysdir
    -
       STATA:  C:\Program Files\Stata18\
    -    BASE:  C:\Program Files\Stata18\ado\base\
    -    SITE:  C:\Program Files\Stata18\ado\site\
    -    PLUS:  C:\Users\lv39\ado\plus\
    -PERSONAL:  C:\Users\lv39\ado\personal\
    -OLDPLACE:  c:\ado\
    -

    The adopath search order

    +
    +
       STATA:  C:\Program Files\Stata18\
    +    BASE:  C:\Program Files\Stata18\ado\base\
    +    SITE:  C:\Program Files\Stata18\ado\site\
    +    PLUS:  C:\Users\lv39\ado\plus\
    +PERSONAL:  C:\Users\lv39\ado\personal\
    +OLDPLACE:  c:\ado\
    +
    +
    +
    +
    +

    The adopath search order

    +
    +

    The search path along which Stata looks for commands is queried by adopath, and looks similar, but now has an order assigned to each entry:

    adopath
    -
      [1]  (BASE)      "C:\Program Files\Stata18\ado\base/"
    -  [2]  (SITE)      "C:\Program Files\Stata18\ado\site/"
    -  [3]              "."
    -  [4]  (PERSONAL)  "C:\Users\lv39\ado\personal/"
    -  [5]  (PLUS)      "C:\Users\lv39\ado\plus/"
    -  [6]  (OLDPLACE)  "c:\ado/"
    -

    To look for a command, say reghdfe, Stata will look in the first directory, then the second, and so on, until it finds it. If it does not find it, it will return an error. We can query the location of reghdfe explicitly with which:

    +
    +
      [1]  (BASE)      "C:\Program Files\Stata18\ado\base/"
    +  [2]  (SITE)      "C:\Program Files\Stata18\ado\site/"
    +  [3]              "."
    +  [4]  (PERSONAL)  "C:\Users\lv39\ado\personal/"
    +  [5]  (PLUS)      "C:\Users\lv39\ado\plus/"
    +  [6]  (OLDPLACE)  "c:\ado/"
    +
    +
    +
    +
    +

    The path at work

    +
    +
    +

    To look for a command, Stata will look in the first directory, then the second, and so on, until it finds it. If it does not find it, it will return an error.

    which reghdfe
    +
    command reghdfe not found as either built-in or ado-file
     r(111);
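The same lookup can be checked from a do-file without stopping on the error. The following is a minimal sketch, assuming only the built-in `capture` and `which` commands; the messages are illustrative:

```stata
* Check whether a command can be found along the current adopath.
* "capture" suppresses the error and stores the return code in _rc (0 = found).
capture which reghdfe
if _rc == 0 {
    display "reghdfe was found on the adopath"
}
else {
    display "reghdfe was not found (return code " _rc ")"
}
```

A non-zero return code here corresponds to the `r(111)` error shown above.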
    -

    Where are packages installed?

    -

    When we install a package, using one of the various package installation commands (net install, ssc install)3, only one of the (sysdir) paths is relevant: PLUS. So if we install reghdfe with ssc install reghdfe, it will be installed in the PLUS directory, and we can see that with which:

    -
    ssc install reghdfe
    -which reghdfe
    -
    . ssc install reghdfe
    -checking reghdfe consistency and verifying not already installed...
    -installing into C:\Users\lv39\ado\plus\...
    -installation complete.
    -
    -. which reghdfe
    -C:\Users\lv39\ado\plus\r\reghdfe.ado
    -*! version 6.12.3 08aug2023
    -
    -

    It is important here to recognize that the value of the special sysdir directory PLUS determines where Stata installs commands, while the separate list of adopath locations determines where it looks for commands. It is possible to install a command in a location where Stata does not look for commands!

    +
    +
    +
    +
    +

    Where are packages installed?

    +
    +
    +

    When we install a package (net install, ssc install)4, only one of the (sysdir) paths is relevant: PLUS.

    +
    +
      [1]  (BASE)      "C:\Program Files\Stata18\ado\base/"
    +  [2]  (SITE)      "C:\Program Files\Stata18\ado\site/"
    +  [3]              "."
    +  [4]  (PERSONAL)  "C:\Users\lv39\ado\personal/"
    +  [5]  (PLUS)      "C:\Users\lv39\ado\plus/"
    +  [6]  (OLDPLACE)  "c:\ado/"
    +
    -
    -

    Using environments in Stata

    -

    But the (PLUS) directory can be manipulated, and that creates the opportunity to create an “environment”.

    -
    
    -* Set the root directory
    -
    -global rootdir : pwd
    +
    +

    Installing packages

    +
    +
    +
    ssc install reghdfe
    +which reghdfe
    +
    +
    . ssc install reghdfe
    +checking reghdfe consistency and verifying not already installed...
    +installing into C:\Users\lv39\ado\plus\...
    +installation complete.
     
    -* Define a location where we will hold all packages in THIS project (the "environment")
    -
    -global adodir "$rootdir/ado"
    -
    -* make sure it exists, if not create it.
    -
    -cap mkdir "$adodir"
    -
    -* Now let's simplify the adopath
    -* - remove the OLDPLACE and PERSONAL paths
    -* - NEVER REMOVE THE SYSTEM-WIDE PATHS - bad things will happen!
    -
    -adopath - OLDPLACE
    -adopath - PERSONAL
    -
    -* modify the PLUS path to point to our new location, and move it up in the order
    -
    -sysdir set PLUS "$adodir"
    -adopath ++ PLUS
    -
    -* verify the path
    -
    -adopath
    -

    which should show something like this:

    -
    . adopath
    -  [1]  (PLUS)      "C:\Users\lv39\Documents/PROJECT123/ado/"
    -  [2]  (BASE)      "C:\Program Files\Stata18\ado\base/"
    -  [3]  (SITE)      "C:\Program Files\Stata18\ado\site/"
    -  [4]              "."
    +. which reghdfe
    +C:\Users\lv39\ado\plus\r\reghdfe.ado
    +*! version 6.12.3 08aug2023
    + + +
    +
    +

    Using environments in Stata

    +
    +
    +

    But the (PLUS) directory can be manipulated

    +
    +
    * Set the root directory
    +global rootdir : pwd
    +* Define a location where we will hold all packages in THIS project (the "environment")
    +global adodir "$rootdir/ado"
    +* make sure it exists, if not create it.
    +cap mkdir "$adodir"
    +* Now let's simplify the adopath
    +* - remove the OLDPLACE and PERSONAL paths
    +* - NEVER REMOVE THE SYSTEM-WIDE PATHS - bad things will happen!
    +adopath - OLDPLACE
    +adopath - PERSONAL
    +* modify the PLUS path to point to our new location, and move it up in the order
    +sysdir set PLUS "$adodir"
    +adopath ++ PLUS
    +* verify the path
    +adopath
    +
    +
    +
    +
    +

    Using environments in Stata

    +
    +
    +
    * Set the root directory
    +global rootdir : pwd
    +* Define a location where we will hold all packages in THIS project (the "environment")
    +global adodir "$rootdir/ado"
    +* make sure it exists, if not create it.
    +cap mkdir "$adodir"
    +* Now let's simplify the adopath
    +* - remove the OLDPLACE and PERSONAL paths
    +* - NEVER REMOVE THE SYSTEM-WIDE PATHS - bad things will happen!
    +adopath - OLDPLACE
    +adopath - PERSONAL
    +* modify the PLUS path to point to our new location, and move it up in the order
    +sysdir set PLUS "$adodir"
    +adopath ++ PLUS
    +* verify the path
    +adopath
    +
    +
    . adopath
    +  [1]  (PLUS)      "C:\Users\lv39\Documents/PROJECT123/ado/"
    +  [2]  (BASE)      "C:\Program Files\Stata18\ado\base/"
    +  [3]  (SITE)      "C:\Program Files\Stata18\ado\site/"
    +  [4]              "."
    +
    +
    +
    +
    +

    Using environments in Stata

    +
    +

    Let’s verify again where the reghdfe package is:

    -
    which reghdfe
    -
    . which reghdfe
    -command reghdfe not found as either built-in or ado-file
    -r(111);
    +
    which reghdfe
    +
    +
    . which reghdfe
    +command reghdfe not found as either built-in or ado-file
    +r(111);
    +
    +
    +
    +
    +

    Using environments in Stata

    So it is no longer found. Why? Because we have removed the previous location (the old PLUS path) from the search sequence. It’s as if it didn’t exist.

    +
    +
    +

    Previously:

    +
    . which reghdfe
    +C:\Users\lv39\ado\plus\r\reghdfe.ado
    +*! version 6.12.3 08aug2023
    +
    +
    . adopath
    +  [1]  (PLUS)      "C:\Users\lv39\Documents/PROJECT123/ado/"
    +  [2]  (BASE)      "C:\Program Files\Stata18\ado\base/"
    +  [3]  (SITE)      "C:\Program Files\Stata18\ado\site/"
    +  [4]              "."
    +
    +
    -
    +

    Installing packages when an environment is active

    When we now install reghdfe again:

    -
    . ssc install reghdfe
    -checking reghdfe consistency and verifying not already installed...
    -installing into C:\Users\lv39\Documents\PROJECT123\ado\plus\...
    -installation complete.
    -
    -. which reghdfe
    -C:\Users\lv39\Documents\PROJECT123\ado\plus\r\reghdfe.ado
    -*! version 6.12.3 08aug2023
    -

    We now see it in the project-specific directory, which we can distribute with the whole project (more on that later).

    -
    -
    -

    Installing precise versions of packages

    -

    Let’s imagine we need an older version of reghdfe. In general, it is not possible in Stata to install an older version of a package in a straightforward fashion. You may have success with the Wayback Machine archive of SSC, which in some cases goes back to 2000, by carefully reconstructing the necessary files.

    -

    Here, we will leverage the SSC Snapshot maintained by Lars Vilhuber on Github (https://github.com/labordynamicsinstitute/ssc-mirror/), which has been capturing snapshots of SSC since late 2021 (details are for a different tutorial):

    -
    * define the date
    -global sscdate "2021-12-21"
    -net install reghdfe, from(https://raw.githubusercontent.com/labordynamicsinstitute/ssc-mirror/${sscdate}/fmwww.bc.edu/repec/bocode/r)
    -

    which gives us

    -
    . net install reghdfe, from(https://raw.githubusercontent.com/labordynamicsinsti
    -> tute/ssc-mirror/${sscdate}/fmwww.bc.edu/repec/bocode/r)
    -checking reghdfe consistency and verifying not already installed...
    -installing into C:\Users\lv39\Documents/ado\...
    -installation complete.
    -
    -. which reghdfe
    -C:\Users\lv39\Documents/ado\r\reghdfe.ado
    -*! version 5.7.3 13nov2019
    -

    So we now have TWO different versions of reghdfe installed:

    +
    . ssc install reghdfe
    +checking reghdfe consistency and verifying not already installed...
    +installing into C:\Users\lv39\Documents\PROJECT123\ado\plus\...
    +installation complete.
    +
    +. which reghdfe
    +C:\Users\lv39\Documents\PROJECT123\ado\plus\r\reghdfe.ado
    +*! version 6.12.3 08aug2023
    +

    We now see it in the project-specific directory, which we can distribute with the whole project.

    +
    +
    +

    Installing precise versions of Stata packages

    +

    Let’s imagine we need an older version of reghdfe.

      -
    • Version 5.7.3 from Nov 2019 is installed at C:\Users\lv39\Documents/ado\r\reghdfe.ado
    • -
    • Version 6.12.3 from Aug 2023 is installed at C:\Users\lv39\ado\plus\r\reghdfe.ado
    • +
    • In general, it is not possible in Stata to install an older version of a package in a straightforward fashion.
    • +
    • You may have success with the Wayback Machine archive of SSC.
    -

    :::{admonition} Stata can get confused about how to write paths… :class: dropdown

    -

    Stata on Windows can understand two types of path syntax: the “Windows” syntax, with backslashes \, and the “Unix” syntax, with forward slashes ‘/’. It will usually report paths in the “Windows” syntax, but these will not work, if coded as such, on non-Windows platforms, which do not understand the backslash as a path separator. We have used platform-agnostic paths above, using forward slashes. This then generates the “weird” mixed notation:

    -
    C:\Users\lv39\Documents/ado\r\reghdfe.ado
    -

    Other software (e.g., R) will consistently use the forward slash, even on Windows, when paths are coded internally.

    -

    Only the former is used by the “environment” we just configured! Which is a good thing, since there are a few functional differences between these two packages. But for the life of this project, that functionality can now be relied upon - as long as we take care to use the same “environment” for all code run within this project. This can be achieved by using the main.do defined in one of the previous sections:

    -
    * main.do
    -* This is a simple example of a main file in Stata
    -* It runs all the other files in the correct order
    -
    -* Set the root directory
    -
    -global rootdir : pwd
    -
    -* Define a location where we will hold all packages in THIS project (the "environment")
    -
    -global adodir "$rootdir/ado"
    -
    -* Enable project environment
    -
    -cap mkdir "$adodir"
    -adopath - OLDPLACE
    -adopath - PERSONAL
    -sysdir set PLUS "$adodir"
    -adopath ++ PLUS
    -adopath
    -
    -* Run the data preparation file
    -do $rootdir/01_data_prep.do
    -
    -// etc. etc.
    -

    :::{.notes}

    -

    While we used interactive commands to install the various packages here, that was only for illustrative purposes. Always script the installation of packages in a setup.do file. We will address how and when to run that file in the next section.
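A scripted installation might look like the following. This is a hypothetical sketch of a `setup.do`, not the file used in the project: the package list is illustrative, and it assumes the project environment has already been activated (`sysdir set PLUS "$adodir"`), so that packages land in the project directory.

```stata
* setup.do (hypothetical sketch)
* Assumes the project environment is already active, so that
* ssc install writes into the project-specific PLUS directory.

* Illustrative list of packages this project needs
local packages reghdfe ftools

foreach pkg of local packages {
    * Only install if the command is not already found on the adopath
    capture which `pkg'
    if _rc != 0 {
        ssc install `pkg'
    }
}
```

The `which` check only works for packages that provide a command of the same name; for other packages, a different test (or an unconditional install) would be needed.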

    -
    -
    -

    Takeaways

    +
    +
    +

    Package repositories

    +

    Most package repositories are versioned:

    +
      +
    • R: CRAN, Bioconductor
    • +
    • Python: PyPI
    • +
    • Julia: “General” default Julia package registry.
    • +
    +

    Stata's package repositories are not (as of 2024). But see the full site for one approach.

    +
    +
    +

    Takeaways

    From the earlier desiderata of environments:

    • Isolated: Installing a new or updated package for one project won’t break your other projects, and vice versa.
    • Portable: Easily transport your projects from one computer to another, even across different platforms.
    • Reproducible: Records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go.
    +
    +
    +

    Takeaways

    +
      +
    • +
    • +
    • +
    • +
    • +
    +
    +
    +
    +

    Other methods

    + +
    +
    +

    Non-technical means

    +
      +
    • Use a new computer
    • +
    • Have an undergraduate student run it
    • +
    • Ask your office neighbor to run it
    • +
    +
    +
    +

    More technical means

    + +
    +
    +

    Use of containers

    +
      +
    • Containers are a way to simulate a “computer within a computer”, which can be used to run code in an isolated environment.
    • +
    • They are relatively lightweight, and are starting to be used as part of replication packages in economics (but only 0.13% of 8280 packages…)
    • +
    +
    +
    +

    Use of containers

    +
      +
    • They do not work in all situations, and require some more advanced technical skills (typically Linux, in addition to the statistical software).
    • +
    • Using containers to test for reproducibility is easier, and should be considered as part of a toolkit.
    • +
    • Several online services make such testing (and development) easy.
    • +
    @@ -1280,12 +1390,20 @@

    Last but not least

    Confidential data

    Do you know your rights?

    +
    +

    TL;DR

    +
      +
    • be able to separate the confidential data from other (to be made public) components
    • +
    • all code must be available
    • +
    • do not publish what you are not allowed to!
    • +
    +

    Permissions

    These will be noted in the data use agreement (DUA), license, or non-disclosure agreement (NDA) that you signed or clicked through to obtain access to the data from the data provider.

    -
      -
    • Careful: scraped or downloaded data that did not have an explicit license!
    • -
    +
    +

    Careful: scraped or downloaded data that did not have an explicit license!

    +

    Keep in mind

    @@ -1295,7 +1413,7 @@

    Keep in mind

  • does NOT give you the (automatic) right to re-publish the data.
  • -
    +

    Permissions

    • Do NOT transfer or publish data that you have no rights to transfer. Always carefully read your data use agreement (DUA), license, or non-disclosure agreement (NDA) that you signed.
    • @@ -1356,13 +1474,13 @@

      Organize your project so you can exclude conf

    -
    +

    Strategy

    When the replication package relies on confidential data that cannot be shared, or is shared under different conditions, you should

    • preserve (archive) the confidential replication package
        -
      • If the data cannot be removed from a secure enclave, they should nevertheless be archived wherever the confidential data are kept4
      • +
      • If the data cannot be removed from a secure enclave, they should nevertheless be archived wherever the confidential data are kept5
      • If the data can be shared, but are subject to access restrictions, follow this guide on creating a separate data deposit and, when creating the restricted deposit at ICPSR, follow these instructions on how to do so.
    @@ -1394,7 +1512,7 @@

    Strategy

    Strategy

    -

    Prepare a non-confidential replication package that contains all code, and any data that is not subject to publication controls

    +

    Prepare a non-confidential replication package that contains all code, and any data that is not subject to publication controls

    README.pdf
     code/
    @@ -1492,9 +1610,10 @@ 

    References

    1. Source: Red Warning PNG Clipart, CC-BY.

    2. -
    3. Formally, this is true for operating systems as well, and in some cases, the operating system and the programming language interact (for instance, in Python).

    4. -
    5. net install reference. Strictly speaking, the location where ado packages are installed can be changed via the net set ado command, but this is rarely done in practice, and we won’t do it here.

    6. -
    7. See this FAQ.

    8. +
    9. Results computed on Nov 26, 2023 based on a scan of replication packages conducted by Sebastian Kranz. 2023. “Economic Articles with Data”. https://ejd.econ.mathematik.uni-ulm.de/, searching for the words main, master, makefile, dockerfile, apptainer, singularity in any of the program files in those replication packages. Code not yet integrated into this presentation.

    10. +
    11. Formally, this is true for operating systems as well, and in some cases, the operating system and the programming language interact (for instance, in Python).

    12. +
    13. net install reference. Strictly speaking, the location where ado packages are installed can be changed via the net set ado command, but this is rarely done in practice, and we won’t do it here.

    14. +
    15. See this FAQ.

    diff --git a/presentation/notes.md b/presentation/notes.md index fdd5744..8a12a8a 100644 --- a/presentation/notes.md +++ b/presentation/notes.md @@ -30,4 +30,4 @@ sed 's/{note}/{.notes}/' > presentation/$arg ``` -🎲✅❌ \ No newline at end of file +🎲❓✅ ❌ \ No newline at end of file