From 6a2485db033f62fe73d076178b9459d6b754d50e Mon Sep 17 00:00:00 2001
From: Lars Vilhuber
Date: Sun, 30 Jun 2024 14:34:59 -0400
Subject: [PATCH] Added instructions on providing ado files

---
 20-reproducing-environments.md        |  31 +------
 21-reproducing-environments-python.md |  34 ++++++++
 22-reproducing-environments-stata.md  | 115 ++++++++++++++++++++++++++
 _toc.yml                              |   3 +
 4 files changed, 153 insertions(+), 30 deletions(-)
 create mode 100644 21-reproducing-environments-python.md
 create mode 100644 22-reproducing-environments-stata.md

diff --git a/20-reproducing-environments.md b/20-reproducing-environments.md
index 82190c3..db7c7ea 100644
--- a/20-reproducing-environments.md
+++ b/20-reproducing-environments.md
@@ -1,39 +1,10 @@
 (reproducing-environments)=
 # Reproducing and documenting environments
 
-There is a difference between documenting environments after the fact, and creating environments.
+There is a difference between documenting environments after the fact, and creating environments. We describe two methods of allowing others to reproduce environments.
 
 ## TL;DR
 
 - Provide a documentation of what your environment looks like when you run it
 - Provide instructions on how to create the minimal environment needed to run your code
 
-## The issue
-
-```bash
-pip freeze
-```
-
-will output all the packages installed in your environment. These will include the packages you explicitly installed, but also the packages that were installed as dependencies. Some of those dependencies may be specific to your operating system or environment. In some cases, they contain packages that you needed to develop the code, but that are not needed to run it.
-
-```bash
-pip freeze > requirements.txt
-```
-
-will output all the packages installed in your environment in a file called `requirements.txt`. This file can be used to recreate the environment. Obviously, because of the above issue, it will likely contain too many packages.
-
-```bash
-pip install -r requirements.txt
-```
-
-will install all the packages listed in `requirements.txt`. If you run this on your computer, in a different environment, this will duplicate your environment, which is fine. But it probably will not work on somebody else's Mac, or Linux, system, and may not even work on somebody else's Windows computer.
-
-## The solution
-
-The solution is to create a minimal environment, and document it. This is done in two steps:
-
-1. Identify the packages that are needed to run your code. There are packages that may help you with this, but in principle, you want to include everything you explicitly `import` in your code, and nothing else. This is the minimal environment.
-2. Prune the `requirements.txt` file to only include the packages that are needed to run your code. This will be the file you provide to replicators to recreate your necessary environment, and let the package installers solve all the other dependencies.
-
-The resulting `requirements.txt` file will contain "pinned" versions of the packages you have, so it will be very precise. Possibly overly precise.
-
diff --git a/21-reproducing-environments-python.md b/21-reproducing-environments-python.md
new file mode 100644
index 0000000..615085b
--- /dev/null
+++ b/21-reproducing-environments-python.md
@@ -0,0 +1,34 @@
+(reproducing-environments-python)=
+# Reproducing and documenting environments in Python
+
+Python allows for pinpointing exact versions of packages in the *PyPI* repository. This is done by creating a `requirements.txt` file that lists all the packages that are needed to run your code. In principle, this file can be used by others to recreate the environment you used. The problem is that it might contain too many packages, some of which are not relevant, even if you carefully constructed the environment, because it will contain dependencies that are specific to your platform (OS or version of Python).
+
+## The issue
+
+```bash
+pip freeze
+```
+
+will output all the packages installed in your environment. These will include the packages you explicitly installed, but also the packages that were installed as dependencies. Some of those dependencies may be specific to your operating system or environment. In some cases, they contain packages that you needed to develop the code, but that are not needed to run it.
+
+```bash
+pip freeze > requirements.txt
+```
+
+will output all the packages installed in your environment in a file called `requirements.txt`. This file can be used to recreate the environment. Obviously, because of the above issue, it will likely contain too many packages.
+
+```bash
+pip install -r requirements.txt
+```
+
+will install all the packages listed in `requirements.txt`. If you run this on your computer, in a different environment, this will duplicate your environment, which is fine. But it probably will not work on somebody else's Mac, or Linux, system, and may not even work on somebody else's Windows computer.
+
+## The solution
+
+The solution is to create a minimal environment, and document it. This is done in two steps:
+
+1. Identify the packages that are needed to run your code. There are packages that may help you with this, but in principle, you want to include everything you explicitly `import` in your code, and nothing else. This is the minimal environment.
+2. Prune the `requirements.txt` file to only include the packages that are needed to run your code. This will be the file you provide to replicators to recreate your necessary environment, and let the package installers solve all the other dependencies.
+
+The resulting `requirements.txt` file will contain "pinned" versions of the packages you have, so it will be very precise. Possibly overly precise.
+
diff --git a/22-reproducing-environments-stata.md b/22-reproducing-environments-stata.md
new file mode 100644
index 0000000..bf5eaae
--- /dev/null
+++ b/22-reproducing-environments-stata.md
@@ -0,0 +1,115 @@
+(reproducing-environments-stata)=
+# Reproducing and documenting environments in Stata
+
+Stata poses additional challenges, since there are no robust mechanisms to point to specific versions of packages in the standard package repositories.
+
+- Stata Journal provides clearly versioned packages that go with the relevant publication, but are not necessarily updated.
+- SSC packages are not versioned, and will change over time.
+- Github-hosted Stata packages may or may not be correctly versioned.
+
+
+
+
+## TL;DR
+
+- Provide an install program that you used to install packages, and that documents to others how you installed them.
+- Provide a directory with the installed Stata packages as part of the replication package.
+
+## Bad solution
+
+One solution we often find in replication packages is that authors force installation of the latest package:
+
+```stata
+cap noi net uninstall package
+ssc install package, replace
+```
+
+This is fragile, because it will install the latest version of the package, which may not be the version you used, and may fail for the replicator when the version you used worked for you.
+
+## Solution
+
+### Construct a Stata environment
+
+As described in [Environments in Stata](stata-environments), construct a Stata environment, which installs packages into a specific directory, say, `installed-ado`.
+
+### Create a setup script
+
+Create a setup script that installs all the packages you used. Ideally, this is executed only when needed, and will not overwrite existing packages. Note the absence of the `replace` option. From now on, never manually install packages, always use the setup script.
+
+```stata
+* setup.do
+* Installed on 2024-06-21
+ssc install package1
+* Installed on 2024-07-01
+net install package2, from("https://example.com/package2")
+```
+
+```stata
+* main.do
+* Set the root directory
+global rootdir : pwd
+* other stuff as previously outlined
+* Set install flag
+global install 0
+* ...
+* Run the setup program only if the flag is set
+if $install == 1 {
+    do $rootdir/setup.do
+}
+```
+
+### Provide `setup.do` and the `installed-ado` directory as part of your replication package.
+
+The `setup.do` file documents how you installed, and with the noted dates, when you installed it. However, the replicator will not actually need to run it, since you also provide the `installed-ado` directory.
+
+### Instructions to the replicator
+
+Your README might now say:
+
+> The replication package depends on the following Stata packages:
+>
+> - `package1` (installed on 2024-06-21)
+> - `package2` (installed on 2024-07-01)
+>
+> The packages are included in the `installed-ado` directory. The `main.do` automatically sets the `adopath` to include this directory.
+> The `setup.do` documents how these were installed, and can be used to re-install, if so desired (not suggested).
+> To re-install, delete the contents of the `installed-ado` directory, and set the global `install` to 1 in `main.do`.
+
+
+
+## Extra-good solution
+
+One way to ensure that, even when the installed packages are lost, re-installation provides the same packages as before is to attempt something like the "version pinning" of Python, R, Julia, etc. This only works
+
+- for certain packages
+- for a certain time period.
+
+### Github-hosted packages
+
+If Github-hosted packages have specifically tagged versions, a correct `net install` command can be constructed.
+
+```stata
+local github "https://raw.githubusercontent.com"
+local multeversion "1.1.0"
+net install multe, from(`github'/gphk-metrics/stata-multe/`multeversion'/)
+```
+
+**Downside**
+
+If the author decides to remove the package from Github (which is not a trusted archive), this still fails.
+
+### SSC packages after Jan 1, 2022
+
+For SSC packages, a mirror of the SSC archive has been maintained by *Lars Vilhuber* at [github.com/labordynamicsinstitute/ssc-mirror/](https://github.com/labordynamicsinstitute/ssc-mirror/), allowing for installation of packages "as of" a specific date.
+
+```stata
+local github "https://raw.githubusercontent.com"
+local sscurl "fmwww.bc.edu/repec/bocode"
+local sscdate "2022-01-01"
+net install a2reg, from(`github'/labordynamicsinstitute/ssc-mirror/`sscdate'/`sscurl'/a)
+```
+
+**Downside**
+
+This only works for SSC hosted packages.
+
diff --git a/_toc.yml b/_toc.yml
index 4b29c45..eee0b9b 100644
--- a/_toc.yml
+++ b/_toc.yml
@@ -18,6 +18,9 @@ parts:
     - file: 13-environments-in-other
     - file: 19-environments-takeaway
     - file: 20-reproducing-environments
+      sections:
+        - file: 21-reproducing-environments-python
+        - file: 22-reproducing-environments-stata
   - caption: More complex ways to test replication packages
     chapters:
     - file: 70-new-computer
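
The two-step pruning described in `21-reproducing-environments-python.md` (find what the code explicitly imports, then keep only those pinned lines from `pip freeze`) could be sketched roughly as follows. This is a minimal illustration and not part of the patch: the helper names `imported_modules` and `prune_requirements`, the sample script, and the sample `pip freeze` output are all hypothetical. It also assumes that the import name equals the PyPI distribution name, which is not always true (for example, `sklearn` is distributed as `scikit-learn`), so the result would still need a manual check.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names that `source` imports explicitly."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as `from . import x`
            names.add(node.module.split(".")[0])
    return names


def prune_requirements(frozen: str, keep: set[str]) -> list[str]:
    """Keep only the pinned lines of `pip freeze` output whose package is in `keep`."""
    wanted = {name.lower() for name in keep}
    kept = []
    for line in frozen.splitlines():
        package = line.split("==")[0].strip().lower()
        if package in wanted:
            kept.append(line.strip())
    return kept


# Hypothetical analysis script and hypothetical `pip freeze` output.
code = "import pandas as pd\nfrom numpy.linalg import norm\nimport os\n"
frozen = (
    "numpy==1.26.4\n"
    "pandas==2.2.2\n"
    "python-dateutil==2.9.0\n"
    "pytz==2024.1\n"
    "six==1.16.0\n"
)

requirements = prune_requirements(frozen, imported_modules(code))
print("\n".join(requirements))
```

Standard-library modules such as `os` drop out on their own because they never appear in the `pip freeze` output, while the dependencies of `pandas` are left for the package installer to resolve on the replicator's platform.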