diff --git a/presentation/01-run_it_again.md b/presentation/01-run_it_again.md index 454dd41..0611e5a 100644 --- a/presentation/01-run_it_again.md +++ b/presentation/01-run_it_again.md @@ -18,26 +18,28 @@ While the code, once set to run, can do so on its own, *you* might need to spend --- -![](images/Red-Warning-PNG-Clipart.png) +![](images/Red-Warning-PNG-Clipart.png){.center width="350" height="300"} -*This should be a warning sign:* if it takes you a long time to get it to run, or to manually reproduce the results, it might take others even longer.[^warning-sign] +*This should be a warning sign:* + +If it takes you a long time to get it to run, or to manually reproduce the results, it **might take others even longer**.[^warning-sign] [^warning-sign]: Source: [Red Warning PNG Clipart](https://www.pngall.com/warning-sign-png/download/69408), CC-BY. --- -Furthermore, it may suggest that you haven't been able to re-run your own code very often, which can be correlated with fragility or even lack of reproducibility. +Furthermore, it may suggest that you **haven't been able to re-run** your own code very often, which can indicate **fragility** or even **lack of reproducibility**. -## Takeaways +## Takeaways {.smaller} ::: {.incremental} -- ✅ your code runs without problem, after all the debugging. +- [x] your code runs without problem, after all the debugging. - [ ] your code runs without manual intervention, and with low effort - [ ] it actually produces all the outputs -- [ ]your code generates a log file that you can inspect, and that you could share with others. +- [ ] your code generates a log file that you can inspect, and that you could share with others. 
- [ ] it will run on somebody else's computer ::: \ No newline at end of file diff --git a/presentation/02-00-intermezzo.md b/presentation/02-00-intermezzo.md index f70220c..dd3d8b4 100644 --- a/presentation/02-00-intermezzo.md +++ b/presentation/02-00-intermezzo.md @@ -9,7 +9,7 @@ Automation and robustness checks, as well as efficiency. Generating a log file means that you can inspect it, and you can share it with others. Also helps in debugging, for you and others. -## Will it run on somebody else's computer? +## Will it run on somebody else's computer? {.smaller} Running it again does not help: diff --git a/presentation/02-hands_off_running.md b/presentation/02-hands_off_running.md index 8bd52bd..a04a411 100644 --- a/presentation/02-hands_off_running.md +++ b/presentation/02-hands_off_running.md @@ -4,15 +4,23 @@ Did it take you a long time to run everything again? ![⏳](https://c.tenor.com/4qs0klfg8nMAAAAC/tenor.gif) -# Let's ramp it up a bit. +## Let's ramp it up a bit. -Your code must run, beginning to end, top to bottom, without error, and without any user intervention. This should in principle (re)create all figures, tables, and numbers you include in your paper. +- Your code must run, **beginning to end**, top to bottom, without error, and **without any user intervention**. + +- This should in principle (re)create all **figures**, **tables**, and **in-text numbers** you include in your paper. ::: {.notes} We have seen users who appear to highlight code and to run it interactively, in pieces, using the program file as a kind of notepad. This is not reproducible, and should be avoided. It is fine for debugging. ::: +## Seem trivial? + +> Out of **8280** replication packages in ~20 top econ journals, only **2594** (**31.33%**) had a main/controller script.[^resultsmain] + +[^resultsmain]: Results computed on Nov 26, 2023 based on a scan of replication packages conducted by Sebastian Kranz. 2023. "Economic Articles with Data". 
https://ejd.econ.mathematik.uni-ulm.de/, searching for the words `main`, `master`, `makefile`, `dockerfile`, `apptainer`, `singularity` in any of the program files in those replication packages. Code not yet integrated into this presentation. + ## TL;DR - Create a "main" file that runs all the other files in the correct order. @@ -21,7 +29,7 @@ We have seen users who appear to highlight code and to run it interactively, in ## Creating a main or master script -In order to be able to enable "hands-off running", the main script is key. I will show here a few simple examples for single-software replication packages. We will discuss more complex examples in one of the next chapters. +In order to be able to enable "hands-off running", the **main (controller) script is key**. I will show here a few simple examples for single-software replication packages. We will discuss more complex examples in one of the next chapters. ## Examples @@ -210,11 +218,11 @@ Do not use `Rscript`, as it will not generate enough output! For examples for **Julia, Python, MATLAB,** and **multi-software scripts**, see the [full text](https://larsvilhuber.github.io/self-checking-reproducibility/02-hands_off_running.html). -## Takeaways +## Takeaways {.smaller} -- ✅ your code runs without problem, after all the debugging. -- ✅your code runs without manual intervention, and with low effort +- [x] your code runs without problem, after all the debugging. +- [x] your code runs without manual intervention, and with low effort - [ ] it actually produces all the outputs - [ ] your code generates a log file that you can inspect, and that you could share with others. 
- [ ] it will run on somebody else's computer diff --git a/presentation/03-automatically_saving_figures.md b/presentation/03-automatically_saving_figures.md index dcd91d0..17837dc 100644 --- a/presentation/03-automatically_saving_figures.md +++ b/presentation/03-automatically_saving_figures.md @@ -79,15 +79,11 @@ Learn how to save tables in robust, reproducible ways. Do not try to copy-paste `xtable`, `stargazer`, others. -## Takeaways +## Takeaways {.smaller} -::: {.incremental} - -- ✅ your code runs without problem, after all the debugging. -- ✅your code runs without manual intervention, and with low effort -- ✅it actually produces all the outputs +- [x] your code runs without problem, after all the debugging. +- [x] your code runs without manual intervention, and with low effort +- [x] it actually produces all the outputs - [ ] your code generates a log file that you can inspect, and that you could share with others. - [ ] it will run on somebody else's computer - -::: \ No newline at end of file diff --git a/presentation/04-creating_log_files.md b/presentation/04-creating_log_files.md index 49cae6f..9cd4d84 100644 --- a/presentation/04-creating_log_files.md +++ b/presentation/04-creating_log_files.md @@ -197,11 +197,11 @@ which will create a log file with everything that would normally appear on the c -## Takeaways +## Takeaways {.smaller} -- ✅ your code runs without problem, after all the debugging. -- ✅your code runs without manual intervention, and with low effort -- ✅it actually produces all the outputs -- ✅your code generates a log file that you can inspect, and that you could share with others. +- [x] your code runs without problem, after all the debugging. +- [x] your code runs without manual intervention, and with low effort +- [x] it actually produces all the outputs +- [x] your code generates a log file that you can inspect, and that you could share with others. 
- [ ] it will run on somebody else's computer diff --git a/presentation/12-environments-in-stata.md b/presentation/12-environments-in-stata.md index 271b91b..9a5f295 100644 --- a/presentation/12-environments-in-stata.md +++ b/presentation/12-environments-in-stata.md @@ -5,19 +5,30 @@ - Creating virtual environments in Stata is feasible - Doing so stabilizes the code, and makes it more transportable -## Search paths in Stata +## Search paths in Stata {.smaller} -In Stata, we typically do not talk about environments, but the same basic structure applies: Stata searches along a set order for its commands. Some commands are built into the executable (the software that is opened when you click on the Stata icon), but most other internal, and all external commands, are found in a search path. This is typically the `ado` directory in the Stata installation directory, and one will find replication packages that contain instructions to copy files into that directory. Once we've shown how environments work in Stata, this will become a lot simpler! +In Stata, we typically do not talk about environments, but the same basic structure applies: Stata searches along a set order for its commands. -### The `sysdir` directories +## Search paths in Stata {.smaller} +Some commands are built into the executable (the software that is opened when you click on the Stata icon), but most other internal, and all external commands, are found in a search path. 
+ + +## The `sysdir` directories {.smaller} + +:::: {.columns} + +::: {.column width=40%} The default set of directories which can be searched, from a freshly installed Stata, can be queried with the `sysdir` command, and will look something like this: ```stata sysdir ``` +::: -``` +::: {.column width=60%} + +```{.stata} STATA: C:\Program Files\Stata18\ BASE: C:\Program Files\Stata18\ado\base\ SITE: C:\Program Files\Stata18\ado\site\ @@ -25,8 +36,15 @@ sysdir PERSONAL: C:\Users\lv39\ado\personal\ OLDPLACE: c:\ado\ ``` +::: + +:::: + +## The `adopath` search order {.smaller} -### The `adopath` search order +:::: {.columns} + +::: {.column width=40%} The search paths where Stata looks for commands is queried by `adopath`, and looks similar, but now has an order assigned to each entry: @@ -34,7 +52,11 @@ The search paths where Stata looks for commands is queried by `adopath`, and loo adopath ``` -``` +::: + +::: {.column width=60%} + +```{.stata} [1] (BASE) "C:\Program Files\Stata18\ado\base/" [2] (SITE) "C:\Program Files\Stata18\ado\site/" [3] "." @@ -43,30 +65,78 @@ adopath [6] (OLDPLACE) "c:\ado/" ``` -To look for a command, say `reghdfe`, Stata will look in the first directory, then the second, and so on, until it finds it. If it does not find it, it will return an error. We can query the location of `reghdfe` explicitly with `which`: +::: + +:::: + +## The path at work {.smaller} + +:::: {.columns} + +::: {.column width=40%} + + +To look for a command, Stata will look in the first directory, then the second, and so on, until it finds it. If it does not find it, it will return an error. ```stata which reghdfe ``` +::: + +::: {.column width=60%} ``` command reghdfe not found as either built-in or ado-file r(111); ``` -### Where are packages installed? +::: + +:::: + +## Where are packages installed? [^net-ref]: [`net install` reference](https://www.stata.com/manuals/rnet.pdf). 
Strictly speaking, the location where ado packages are installed can be changed via the `net set ado` command, but this is rarely done in practice, and we won't do it here. +:::: {.columns} + +::: {.column width=40%} -When we install a package, using one of the various package installation commands (`net install`, `ssc install`)[^net-ref], only one of the (`sysdir`) paths is relevant: `PLUS`. So if we install `reghdfe` with `ssc install reghdfe`, it will be installed in the `PLUS` directory, and we can see that with `which`: +When we install a package (`net install`, `ssc install`)[^net-ref], only one of the (`sysdir`) paths is relevant: `PLUS`. + +::: + +::: {.column width=60%} + +```{.stata code-line-numbers="5"} + [1] (BASE) "C:\Program Files\Stata18\ado\base/" + [2] (SITE) "C:\Program Files\Stata18\ado\site/" + [3] "." + [4] (PERSONAL) "C:\Users\lv39\ado\personal/" + [5] (PLUS) "C:\Users\lv39\ado\plus/" + [6] (OLDPLACE) "c:\ado/" +``` + +::: + +:::: + +## Installing packages {.smaller} + +:::: {.columns} + +::: {.column width=40%} ```stata ssc install reghdfe which reghdfe ``` -``` +::: + +::: {.column width=60%} + +```{.stata code-line-numbers="3|7"} . ssc install reghdfe checking reghdfe consistency and verifying not already installed... installing into C:\Users\lv39\ado\plus\... @@ -76,51 +146,79 @@ installation complete. C:\Users\lv39\ado\plus\r\reghdfe.ado *! version 6.12.3 08aug2023 ``` - -:::{.important} -It is important here to recognize that it is the value of the special `sysdir` directory `PLUS` that determines where Stata installs commands, but the separate list of `adopath` locations where it looks for commands. It is possible to install a command in a location that Stata does not look for commands! ::: -## Using environments in Stata +:::: -But the `(PLUS)` directory can be manipulated, and that creates the opportunity to create an "environment". 
+## Using environments in Stata {auto-animate=true} +:::: {.columns} +::: {.column width=40%} -```stata +But the `(PLUS)` directory can be manipulated -* Set the root directory +::: -global rootdir : pwd +::: {.column width=60%} -* Define a location where we will hold all packages in THIS project (the "environment") +```{.stata code-line-numbers="4|13-14"} +* Set the root directory +global rootdir : pwd +* Define a location where we will hold all packages in THIS project (the "environment") global adodir "$rootdir/ado" - * make sure it exists, if not create it. - cap mkdir "$adodir" - * Now let's simplify the adopath * - remove the OLDPLACE and PERSONAL paths * - NEVER REMOVE THE SYSTEM-WIDE PATHS - bad things will happen! - adopath - OLDPLACE adopath - PERSONAL - * modify the PLUS path to point to our new location, and move it up in the order - sysdir set PLUS "$adodir" adopath ++ PLUS - * verify the path - adopath ``` -which should show something like this: +::: + +:::: +## Using environments in Stata {.smaller auto-animate=true transition="none"} + +:::: {.columns} + + +::: {.column width=40%} + + +```{.stata code-line-numbers="13-14"} +* Set the root directory +global rootdir : pwd +* Define a location where we will hold all packages in THIS project (the "environment") +global adodir "$rootdir/ado" +* make sure it exists, if not create it. +cap mkdir "$adodir" +* Now let's simplify the adopath +* - remove the OLDPLACE and PERSONAL paths +* - NEVER REMOVE THE SYSTEM-WIDE PATHS - bad things will happen! +adopath - OLDPLACE +adopath - PERSONAL +* modify the PLUS path to point to our new location, and move it up in the order +sysdir set PLUS "$adodir" +adopath ++ PLUS +* verify the path +adopath ``` + +::: + +::: {.column width=60%} + + +```{.stata code-line-numbers="2"} . adopath [1] (PLUS) "C:\Users\lv39\Documents/PROJECT123/ado/" [2] (BASE) "C:\Program Files\Stata18\ado\base/" @@ -128,124 +226,116 @@ which should show something like this: [4] "." 
``` +::: +:::: + +## Using environments in Stata {auto-animate=true} + +:::: {.columns} + +::: {.column width=40%} + Let's verify again where the `reghdfe` package is: ```stata which reghdfe ``` +::: -``` +::: {.column width=60%} + +```{.stata code-line-numbers="2"} . which reghdfe command reghdfe not found as either built-in or ado-file r(111); ``` -So it is no longer found. Why? Because we have removed the previous location (the old `PLUS` path) from the search sequence. It's as if it didn't exist. +::: +:::: +## Using environments in Stata {.smaller auto-animate=true transition="none"} -## Installing packages when an environment is active +So it is no longer found. Why? Because we have removed the previous location (the old `PLUS` path) from the search sequence. It's as if it didn't exist. +:::: {.columns} -When we now install `reghdfe` again: +::: {.column width=50%} -``` -. ssc install reghdfe -checking reghdfe consistency and verifying not already installed... -installing into C:\Users\lv39\Documents\PROJECT123\ado\plus\... -installation complete. +Previously: +```{.stata code-line-numbers="2"} . which reghdfe -C:\Users\lv39\Documents\PROJECT123\ado\plus\r\reghdfe.ado +C:\Users\lv39\ado\plus\r\reghdfe.ado *! version 6.12.3 08aug2023 ``` +::: -We now see it in the **project-specific** directory, which we can distribute with the whole project (more on that [later](reproducing-environments)). +::: {.column width=50%} -## Installing precise versions of packages +```{.stata code-line-numbers="2|3|4|5"} +. adopath + [1] (PLUS) "C:\Users\lv39\Documents/PROJECT123/ado/" + [2] (BASE) "C:\Program Files\Stata18\ado\base/" + [3] (SITE) "C:\Program Files\Stata18\ado\site/" + [4] "." +``` +::: -Let's imagine we need an older version of `reghdfe`. In general, it is **not** possible in Stata to install an older version of a package in a straightforward fashion. 
You *may* have success with the [Wayback Machine archive of SSC](https://web.archive.org/web/20141226200440/http://fmwww.bc.edu/RePEc/bocode/), which in some cases goes back to 2000, by carefully reconstructing the necessary files. +:::: -Here, we will leverage the **SSC Snapshot** maintained by Lars Vilhuber on Github ([https://github.com/labordynamicsinstitute/ssc-mirror/](https://github.com/labordynamicsinstitute/ssc-mirror/)), which has been capturing snapshots of SSC since [late 2021](https://github.com/labordynamicsinstitute/ssc-mirror/releases/tag/2021-12-21) (details are for a different tutorial): +## Installing packages when an environment is active {.smaller} -```stata -* define the date -global sscdate "2021-12-21" -net install reghdfe, from(https://raw.githubusercontent.com/labordynamicsinstitute/ssc-mirror/${sscdate}/fmwww.bc.edu/repec/bocode/r) -``` -which gives us +When we now install `reghdfe` again: -``` -. net install reghdfe, from(https://raw.githubusercontent.com/labordynamicsinsti -> tute/ssc-mirror/${sscdate}/fmwww.bc.edu/repec/bocode/r) +```{.stata code-line-numbers="3|7"} +. ssc install reghdfe checking reghdfe consistency and verifying not already installed... -installing into C:\Users\lv39\Documents/ado\... +installing into C:\Users\lv39\Documents\PROJECT123\ado\plus\... installation complete. . which reghdfe -C:\Users\lv39\Documents/ado\r\reghdfe.ado -*! version 5.7.3 13nov2019 -``` - -So we now have TWO different version of `reghdfe` installed: - -- Version 5.7.3 from Nov 2019 is installed at `C:\Users\lv39\Documents/ado\r\reghdfe.ado` -- Version 6.12.3 from Aug 2023 is installed at `C:\Users\lv39\ado\plus\r\reghdfe.ado` - -:::{admonition} Stata can get confused about how to write paths... -:class: dropdown - -Stata on Windows can understand two types of path syntax: the "Windows" syntax, with backslashes `\`, and the "Unix" syntax, with forward slashes '/'. 
It will usually report paths in the "Windows" syntax, but these will not work, if coded as such, on non-Windows platforms, which do not understand the backslash as a path separator. We have used platform-agnostic paths above, using forward slashes. This then generates the "weird" mixed notation: - -``` -C:\Users\lv39\Documents/ado\r\reghdfe.ado +C:\Users\lv39\Documents\PROJECT123\ado\plus\r\reghdfe.ado +*! version 6.12.3 08aug2023 ``` -Other software (e.g, R), will consistently use the forward slash, even on Windows, when paths are coded internally. +We now see it in the **project-specific** directory, which we can distribute with the whole project. +## Installing precise versions of Stata packages {.smaller} -Only the former is used by the "environment" we just configured! Which is a good thing, since there are a few functional differences between these two packages. But for the life of *this* project, that functionality can now be relied upon - as long as we take care to use the same "environment" for all code run within this project. This can be achieved by using the `main.do` defined in [one of the previous sections](hands-off-running): +Let's imagine we need an older version of `reghdfe`. -```stata -* main.do -* This is a simple example of a main file in Stata -* It runs all the other files in the correct order +- In general, it is **not** possible in Stata to install an older version of a package in a straightforward fashion. +- You *may* have success with the [Wayback Machine archive of SSC](https://web.archive.org/web/20141226200440/http://fmwww.bc.edu/RePEc/bocode/). -* Set the root directory +## Package repositories -global rootdir : pwd +Most package repositories are versioned: -* Define a location where we will hold all packages in THIS project (the "environment") +- R: CRAN, Bioconductor +- Python: PyPI +- Julia: "General" default Julia package registry. -global adodir "$rootdir/ado" +Stata does not (as of 2024). 
**But** see [the full site](https://larsvilhuber.github.io/self-checking-reproducibility/12-environments-in-stata.html#installing-precise-versions-of-packages) for one approach. -* Enable project environment +## Takeaways {auto-animate=true .smaller} -cap mkdir "$adodir" -adopath - OLDPLACE -adopath - PERSONAL -sysdir set PLUS "$adodir" -adopath ++ PLUS -adopath - -* Run the data preparation file -do $rootdir/01_data_prep.do - -// etc. etc. -``` +From the earlier desiderata of *environments*: -:::{.notes} +- ✅ **Isolated**: Installing a new or updated package for one project won't break your other projects, and vice versa. +- ✅ **Portable**: Easily transport your projects from one computer to another, *even across different platforms*. +- ❌ **Reproducible**: Records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go. -While we used interactive commands to install the various packages here, that was only for illustrative purposes. **Always** script the installation of packages in a `setup.do` file. We will address how and when to run that file in the [next section](reproducing-environments). -## Takeaways +## Takeaways {.smaller} -From the earlier desiderata of *environments*: -- ✅ **Isolated**: Installing a new or updated package for one project won't break your other projects, and vice versa. -- ✅ **Portable**: Easily transport your projects from one computer to another, *even across different platforms*. -- ❌ **Reproducible**: Records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go. +- [x] your code runs without problem, after all the debugging. +- [x] your code runs without manual intervention, and with low effort +- [x] it actually produces all the outputs +- [x] your code generates a log file that you can inspect, and that you could share with others. 
+- [x] ❓ it will run on somebody else's computer diff --git a/presentation/70-other_methods.md b/presentation/70-other_methods.md new file mode 100644 index 0000000..d86418d --- /dev/null +++ b/presentation/70-other_methods.md @@ -0,0 +1,23 @@ +# Other methods + +## Non-technical means + +- Use a new computer +- Have an undergraduate student run it +- Ask your office neighbor to run it + +## More technical means {auto-animate=true} + +- [Virtual Machines](https://larsvilhuber.github.io/self-checking-reproducibility/72-virtual_machines.html) +- [Use of containers](https://larsvilhuber.github.io/self-checking-reproducibility/80-docker.html) + +## Use of containers {auto-animate=true} + +- **Containers** are a way to simulate a "computer within a computer", which can be used to run code in an isolated environment. +- They are relatively lightweight, and are *starting* to be used as part of replication packages in economics (but **only 0.13% of 8280 packages**...) + +## Use of containers {auto-animate=true} + +- They do not work in all situations, and require some **more advanced technical skills** (typically Linux, in addition to the statistical software). +- Using containers to test for reproducibility is easier, and should be considered as part of a toolkit. +- Several **online services** make such testing (and development) easy. \ No newline at end of file diff --git a/presentation/99-confidential_data.md b/presentation/99-confidential_data.md index 472d1f2..6cbbfbf 100644 --- a/presentation/99-confidential_data.md +++ b/presentation/99-confidential_data.md @@ -4,11 +4,17 @@ Do you know your rights? +## TL;DR + +- be able to separate the confidential data from other (to be made public) components +- all code must be available +- do not publish what you are not allowed to! 
+ ## Permissions These will be noted in the **data use agreement (DUA), license, or non-disclosure agreement (NDA)** that you **signed or clicked through** to obtain access to the data from the data provider. -- Careful: scraped or downloaded data that did not have an explicit license! +> Careful: scraped or downloaded data that did not have an explicit license! ## Keep in mind @@ -17,7 +23,7 @@ Just because - you (and the entire world) can **download the data** - does **NOT** give you the (automatic) right to **re-publish** the data. -## Permissions +## Permissions {.smaller} - Do NOT **transfer or publish** data that you have no rights to transfer. Always carefully read your data use agreement (DUA), license, or non-disclosure agreement (NDA) that you signed. - Do NOT upload restricted-access data to the journal's platform. @@ -97,7 +103,7 @@ data/ :::: -## Strategy {auto-animate=true} +## Strategy {auto-animate=true .smaller} When the replication package relies on confidential data that cannot be shared, or is shared under different conditions, you should @@ -151,7 +157,7 @@ data/ :::{.column width=50%} -Prepare a **non-confidential replication package** that contains all code, and any data that is not subject to publication controls +Prepare a **non-confidential replication package** that contains **all code**, and **any data** that is not subject to publication controls ::: diff --git a/presentation/index.Rmd b/presentation/index.Rmd index cb68d98..2f88347 100644 --- a/presentation/index.Rmd +++ b/presentation/index.Rmd @@ -1,5 +1,5 @@ --- -title: "Checking reproducibility on Day T-1" +title: "Checking reproducibility on Day T" author: - "Lars Vilhuber" date: June 30 2024 @@ -7,6 +7,14 @@ bibliography: "paper.bib" --- +![](https://teleseries.com.br/wp-content/uploads/2013/11/tumblr_lr9o5nVNb41qknbv6.gif) + +--- + +## Day T {auto-animate=true .smaller} + +## Day T**-1** {auto-animate=true .smaller} + ```{r, child=c('00-index-mod.md')} ``` @@ -63,6 +71,9 @@ 
Unpredictable things happening to your computing environment: ```{r, child=c('12-environments-in-stata.md')} ``` +```{r, child=c('70-other_methods.md')} +``` + ```{r, child=c('99-confidential_data.md')} ``` diff --git a/presentation/index.html b/presentation/index.html index 1b1f6d3..f34e419 100644 --- a/presentation/index.html +++ b/presentation/index.html @@ -12,7 +12,7 @@ -