Added instructions on providing ado files
larsvilhuber committed Jun 30, 2024
1 parent 9155e73 commit 6a2485d
Showing 4 changed files with 153 additions and 30 deletions.
31 changes: 1 addition & 30 deletions 20-reproducing-environments.md
@@ -1,39 +1,10 @@
(reproducing-environments)=
# Reproducing and documenting environments

There is a difference between documenting environments after the fact, and creating environments.
There is a difference between documenting environments after the fact, and creating environments. We describe two methods of allowing others to reproduce environments.

## TL;DR

- Provide a documentation of what your environment looks like when you run it
- Provide instructions on how to create the minimal environment needed to run your code

## The issue

```bash
pip freeze
```

will output all the packages installed in your environment. These will include the packages you explicitly installed, but also the packages that were installed as dependencies. Some of those dependencies may be specific to your operating system or environment. In some cases, they contain packages that you needed to develop the code, but that are not needed to run it.

```bash
pip freeze > requirements.txt
```

will output all the packages installed in your environment in a file called `requirements.txt`. This file can be used to recreate the environment. Obviously, because of the above issue, it will likely contain too many packages.

```bash
pip install -r requirements.txt
```

will install all the packages listed in `requirements.txt`. If you run this on your computer, in a different environment, this will duplicate your environment, which is fine. But it probably will not work on somebody else's Mac or Linux system, and may not even work on somebody else's Windows computer.

## The solution

The solution is to create a minimal environment, and document it. This is done in two steps:

1. Identify the packages that are needed to run your code. There are packages that may help you with this, but in principle, you want to include everything you explicitly `import` in your code, and nothing else. This is the minimal environment.
2. Prune the `requirements.txt` file to only include the packages that are needed to run your code. This will be the file you provide to replicators to recreate your necessary environment, and let the package installers solve all the other dependencies.

The resulting `requirements.txt` file will contain "pinned" versions of the packages you have, so it will be very precise. Possibly overly precise.

34 changes: 34 additions & 0 deletions 21-reproducing-environments-python.md
@@ -0,0 +1,34 @@
(reproducing-environments-python)=
# Reproducing and documenting environments in Python

Python allows you to pin exact versions of packages from the *PyPI* repository by creating a `requirements.txt` file that lists all the packages needed to run your code. In principle, others can use this file to recreate the environment you used. The problem is that it may contain too many packages, some of which are not relevant: even if you carefully constructed the environment, it will contain dependencies that are specific to your platform (OS or version of Python).

## The issue

```bash
pip freeze
```

will output all the packages installed in your environment. These will include the packages you explicitly installed, but also the packages that were installed as dependencies. Some of those dependencies may be specific to your operating system or environment. In some cases, they contain packages that you needed to develop the code, but that are not needed to run it.

```bash
pip freeze > requirements.txt
```

will output all the packages installed in your environment in a file called `requirements.txt`. This file can be used to recreate the environment. Obviously, because of the above issue, it will likely contain too many packages.

```bash
pip install -r requirements.txt
```

will install all the packages listed in `requirements.txt`. If you run this on your computer, in a different environment, this will duplicate your environment, which is fine. But it probably will not work on somebody else's Mac or Linux system, and may not even work on somebody else's Windows computer.

## The solution

The solution is to create a minimal environment, and document it. This is done in two steps:

1. Identify the packages that are needed to run your code. There are packages that may help you with this, but in principle, you want to include everything you explicitly `import` in your code, and nothing else. This is the minimal environment.
2. Prune the `requirements.txt` file to only include the packages that are needed to run your code. This will be the file you provide to replicators to recreate your necessary environment, and let the package installers solve all the other dependencies.

The resulting `requirements.txt` file will contain "pinned" versions of the packages you have, so it will be very precise. Possibly overly precise.
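
The two steps above can be sketched in Python. This is a minimal illustration, not a complete tool: the function names (`imported_modules`, `prune_requirements`) and the example package versions are made up for this sketch. It also assumes that the name you `import` matches the name of the package on PyPI, which is not always true (for example, `sklearn` is installed as `scikit-learn`), so the result should be checked by hand.

```python
from __future__ import annotations

import ast
import re


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by Python source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from . import x"
            names.add(node.module.split(".")[0])
    return names


def normalize(name: str) -> str:
    """Normalize a package name so that 'My_Pkg' and 'my-pkg' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def prune_requirements(frozen: str, needed: set[str]) -> list[str]:
    """Keep only the pinned 'name==version' lines whose name is in `needed`."""
    wanted = {normalize(n) for n in needed}
    kept = []
    for line in frozen.splitlines():
        if "==" not in line:
            continue  # skip comments, blank lines, and unpinned entries
        name = line.split("==")[0].strip()
        if normalize(name) in wanted:
            kept.append(line.strip())
    return kept


# Illustrative inputs: the code of a project, and the output of `pip freeze`.
example_code = "import pandas as pd\nfrom numpy import linalg\nimport os\n"
example_freeze = "numpy==1.26.4\npandas==2.2.2\npytz==2024.1\nsix==1.16.0\n"

minimal = prune_requirements(example_freeze, imported_modules(example_code))
print("\n".join(minimal))
```

Standard-library modules such as `os` drop out on their own, because they never appear in the `pip freeze` output. Dependencies such as `pytz` are removed from the file, and are left for the package installer to resolve.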

115 changes: 115 additions & 0 deletions 22-reproducing-environments-stata.md
@@ -0,0 +1,115 @@
(reproducing-environments-stata)=
# Reproducing and documenting environments in Stata

Stata poses additional challenges, since there are no robust mechanisms to point to specific versions of packages in the standard package repositories.

- Stata Journal provides clearly versioned packages that go with the relevant publication, but are not necessarily updated.
- SSC packages are not versioned, and will change over time.
- GitHub-hosted Stata packages may or may not be correctly versioned.




## TL;DR

- Provide the install program that you used to install packages, which documents to others how you installed them.
- Provide a directory with the installed Stata packages as part of the replication package.

## Bad solution

One solution we often find in replication packages is that authors force installation of the latest package:

```stata
cap noi net uninstall package
ssc install package, replace
```

This is fragile: it installs the latest version of the package, which may not be the version you used, and which may fail for the replicator even though the version you used worked for you.

## Solution

### Construct a Stata environment

As described in [Environments in Stata](stata-environments), construct a Stata environment, which installs packages into a specific directory, say, `installed-ado`.

### Create a setup script

Create a setup script that installs all the packages you used. Ideally, this is executed only when needed, and will not overwrite existing packages. Note the absence of the `replace` option. From now on, never install packages manually; always use the setup script.

```stata
* setup.do
* Installed on 2024-06-21
ssc install package1
* Installed on 2024-07-01
net install package2, from("https://example.com/package2")
```

```stata
* main.do
* Set the root directory
global rootdir : pwd
* other stuff as previously outlined
* Set install flag
global install 0
* ...
* Run the setup program only if the flag is set
if $install == 1 {
    * Quote the path so that directories with spaces work
    do "$rootdir/setup.do"
}
```

### Provide `setup.do` and the `installed-ado` directory as part of your replication package

The `setup.do` file documents how you installed the packages and, through the noted dates, when you installed them. However, the replicator will not actually need to run it, since you also provide the `installed-ado` directory.

### Instructions to the replicator

Your README might now say:

> The replication package depends on the following Stata packages:
>
> - `package1` (installed on 2024-06-21)
> - `package2` (installed on 2024-07-01)
>
> The packages are included in the `installed-ado` directory. The `main.do` automatically sets the `adopath` to include this directory.
> The `setup.do` documents how these were installed, and can be used to re-install, if so desired (not suggested).
> To re-install, delete the contents of the `installed-ado` directory, and set the global `install` to 1 in `main.do`.
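
The package list in such a README can be kept in sync with `setup.do` by generating it. The following Python sketch is only an illustration: it assumes that `setup.do` follows the convention shown earlier, with a `* Installed on YYYY-MM-DD` comment directly before each `ssc install` or `net install` line, and the function name `readme_entries` is made up for this example.

```python
import re

# Contents of setup.do, as in the example in this section.
SETUP_DO = """\
* setup.do
* Installed on 2024-06-21
ssc install package1
* Installed on 2024-07-01
net install package2, from("https://example.com/package2")
"""


def readme_entries(setup_text):
    """Turn dated install commands into Markdown bullet points."""
    entries = []
    date = None
    for line in setup_text.splitlines():
        line = line.strip()
        # Remember the date from the most recent "* Installed on" comment
        found = re.match(r"\*\s*Installed on (\d{4}-\d{2}-\d{2})", line)
        if found:
            date = found.group(1)
            continue
        # Match "ssc install name" or "net install name, from(...)"
        found = re.match(r"(?:ssc|net)\s+install\s+([^\s,]+)", line)
        if found:
            when = f" (installed on {date})" if date else ""
            entries.append(f"- `{found.group(1)}`{when}")
            date = None  # a date applies only to the next install command
    return entries


print("\n".join(readme_entries(SETUP_DO)))
```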


## Extra-good solution

Even when the installed packages are lost, re-installation can provide the same packages as before if you attempt something like the "version pinning" of Python, R, Julia, etc. This only works

- for certain packages
- for a certain time period.

### GitHub-hosted packages

If GitHub-hosted packages have specifically tagged versions, a correct `net install` command can be constructed.

```stata
local github "https://raw.githubusercontent.com"
local multeversion "1.1.0"
net install multe, from(`github'/gphk-metrics/stata-multe/`multeversion'/)
```

**Downside**

If the author decides to remove the package from GitHub (which is not a trusted archive), this still fails.

### SSC packages after Jan 1, 2022

For SSC packages, a mirror of the SSC archive has been maintained by *Lars Vilhuber* at [github.com/labordynamicsinstitute/ssc-mirror/](https://github.com/labordynamicsinstitute/ssc-mirror/), allowing for installation of packages "as of" a specific date.

```stata
local github "https://raw.githubusercontent.com"
local sscurl "fmwww.bc.edu/repec/bocode"
local sscdate "2022-01-01"
net install a2reg, from(`github'/labordynamicsinstitute/ssc-mirror/`sscdate'/`sscurl'/a)
```

**Downside**

This only works for SSC-hosted packages.
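
The URL used in the SSC example above follows a fixed pattern: the mirror repository, the date, the SSC path, and a final directory named after the first character of the package name. As a rough sketch of that pattern, the Python helper below builds the `net install` line for a given package and date. The function names are made up for this illustration, and the sketch assumes that the chosen date actually exists in the mirror, which should be verified before relying on it.

```python
from datetime import date

GITHUB = "https://raw.githubusercontent.com"
MIRROR = "labordynamicsinstitute/ssc-mirror"
SSCURL = "fmwww.bc.edu/repec/bocode"


def ssc_mirror_url(package: str, as_of: str) -> str:
    """Build the mirror URL for an SSC package "as of" an ISO date."""
    date.fromisoformat(as_of)  # raises ValueError if not YYYY-MM-DD
    letter = package[0].lower()  # SSC groups packages by first character
    return f"{GITHUB}/{MIRROR}/{as_of}/{SSCURL}/{letter}"


def net_install_command(package: str, as_of: str) -> str:
    """Return the Stata command that installs the package from the mirror."""
    return f'net install {package}, from("{ssc_mirror_url(package, as_of)}")'


print(net_install_command("a2reg", "2022-01-01"))
```

The printed line can be pasted into `setup.do`, next to a comment that records the date.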

3 changes: 3 additions & 0 deletions _toc.yml
@@ -18,6 +18,9 @@ parts:
    - file: 13-environments-in-other
    - file: 19-environments-takeaway
    - file: 20-reproducing-environments
      sections:
      - file: 21-reproducing-environments-python
      - file: 22-reproducing-environments-stata
  - caption: More complex ways to test replication packages
    chapters:
    - file: 70-new-computer
